Machine Learning and Data Science in the medical domain deliver many promises, specifically in the diagnosis sector. This applies to *categorizing diseases, relating the disease to the cause*, etc. Machine learning can verify some “impossible-to-understand” phenomena due to medication, which is not yet conventional as per the medical community. It plays an important role in overall medical diagnosis and treatment. Previously unknown patterns can now be observed and analyzed to update the treatment process.
In this blog, a famous approach has been used to predict breast cancer among women. We will be answering the following questions in detail\
To have a proper overview, machine learning applications in the medical domain can range from disease diagnosis to advanced image processing techniques that aid the previously only made by pathologists and microbiologists. Five major areas in the medical domain in which ML is contributing significantly are:
Cancer classification is one such area where ML can deliver a robust predictive model to identify the cancer possibility based on given observations. Let's quickly define the problem statement and move towards the actual implementation.
Breast cancer is the most common malignancy (Malignant tumor) among women, accounting for women's second chief cause of cancer death. Breast Cancer occurs due to abnormal growth of cells in the breast tissue, commonly referred to as a Tumor. A tumor does not mean cancer. Tumors can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammogram, ultrasound, and biopsy are commonly used to diagnose breast cancer.
The FNA test is a quick and simple process of removing certain fluid from the portion where swelling or soreness is involved. When tested, this fluid can be used to form a discretely labeled dataset that can be used to develop a machine learning model for breast cancer classification. This data uses certain features with ground truth results from the FNA test that can be used to check malignant cells and can lead to breast cancer in a patient.
ML techniques such as Artificial Neural Networks, Gradient Boost Method, SVM, etc., help to collaborate with the clinical data and can be used to predict the case with a great deal of accuracy. This article will guide you through some basic implementation steps to use SVM for predicting cancer based on certain observations.
The dataset we will be using for this purpose is the
load_breast_cancer dataset from the sklearn library. The breast cancer dataset is a classic and straightforward binary classification dataset. It can be imported using
Let us get to know our data a little more.
The dataset has a dimension 569x32 with each instance a label ‘M’ or ‘B,’ where M=malignant, B=benign.
The above-shown attributes are the features that are to be used to predict cancer.
Note: The first feature, ‘Unnamed: 0’, is an index and can be excluded from the final features.
In this dataset, there is a highly non-linear relation between the features, and hence a robust classifier is needed to make any prediction based on it. You can use RadViz ( a non-linear multi-dimensional visualization library) to visualize the dataset of every feature.
Radviz maps the features to a unit circle, and each instance in the dataset can be seen as either a ‘red’ or ‘green’ as per its label. The above visualization clearly shows the high correlation between the dataset instances, making it necessary for a strong classifier to solve this problem.
This step involves several activities such as:
from sklearn.preprocessing import LabelEncoderand once it is imported, an instance of the label encoder can be created, and the target attribute column (diagnosis) can be fitted.
train_test_splitmodule from the
sklearn.model_selectionlibrary to divide the dataset into training and testing datasets. The splitting can be done in a 75:25 ratio.
from sklearn.svm import SVCand create an instance of the model.
from sklearn.preprocessing import StandardScaler()to import the normalization function. When data is fed to this scaler, it will orient the data and a zero mean and unit standard deviation.
from sklearn.pipeline import make_pipeline.This pipeline will have the StandardScaler with the SVC
make_pipeline(StandardScaler(),SVC()). This pipeline will take the components incorporated in sequential order and process the input accordingly. The figure below shows that the training dataset is fed to the pipeline. Once the dataset is standardized, it will be fed to the Support Vector Classifier (SVC) to solve a classification problem.
.fitto fit the training data on the classifier pipeline.
The above figure shows the model pipeline to demonstrate the flow. The model used is SVC, which has a lot of tunable parameters, like
The above parameters are of high importance. Other parameters such as the ‘gamma’ value can be set to auto for the model to take care of itself.
Accuracy Score: This score is simply the percentage of correct prediction in the test set. For the above-given configuration, the accuracy is close to 95.8%.
Confusion MatrixThe confusion matrix can be imported from the metrics module of the sklearn library. The test set can be used to compare the predicted output and the ground truth.
ROC is the plot between False Positive and True Positive in the plot. It can be imported using.
from sklearn.metrics import roc_curve.
Early diagnosis of cancer has become an imperative step to save a life. With this in mind, many top MNCs have invested a significant amount of time and money. Let’s have a look at some of these MNCs' work.
Google has taken up initiatives to predict cancer's different forms using ML approaches. They have primarily focused on lung cancer, which is the predominant cause of death, even more than breast cancer. Their algorithm has outperformed radiologists in identifying cancerous cases from CT scan diagnosis images.
The two have collaborated. IBM provides its cognitive computing research capabilities all-inclusive of artificial intelligence, computer vision, and natural language processing, to enable cancerous tissue diagnosis to complement and enhance human expertise in the clinical domain. Mayo Clinic provides the facility of clinical trial data.
The technology helps extract the information more quickly (and painlessly) than any doctor would be able to, identifying patients who best match Mayo Clinic’s clinical trial criteria. This involves genomic analysis, matching patients to appropriate clinical trials, and generate evidence to support standard standard-of-care treatment commendations. Watson Health is an example of such cognitive services offered by IBM.
Machine Learning is acting as a lifesaver. It is helping doctors identify the disease quickly and also provides possible treatments as well. Breast cancer is a common disease among women that takes more than 4Million lives every year. This number constitutes 14% of the overall death caused by cancer among women. Machine learning algorithms like SVM, ANNs are perfectly capable of detecting the possibility of breast cancer in patients, which could have taken a long time for doctors to identify. Bigger companies like Google and IBM have invested in the field of cancer identification. Machine learning begineers can easily implement their own cancer classifier after reading this blog.
Get well-designed application and interview centirc content on ds-algorithms, machine learning, system design and oops. Content will be delivered weekly.