Machine Learning In Medical Science: Cancer Classification Model Using SVM

Machine learning and data science hold considerable promise in the medical domain, particularly in diagnosis, for example in *categorizing diseases and relating a disease to its cause*. Machine learning can also help verify hard-to-explain effects of medication that the medical community has not yet accepted as conventional. It plays an important role in overall medical diagnosis and treatment: previously unknown patterns can now be observed and analyzed to update the treatment process.

Key takeaways from this article

In this blog, we use a well-known approach to predict breast cancer in women. We will answer the following questions in detail:

  1. What are the different domains in medical science where machine learning can help?
  2. What are the methods that are used to predict the Malignant tumor (Cancer cells)?
  3. What are the steps involved in the Support Vector Machine based classifier implementation?
  4. How can we evaluate our model using the confusion matrix and ROC curve?
  5. Which are the major companies that are contributing to this area?

For a proper overview, machine learning applications in the medical domain range from disease diagnosis to advanced image processing techniques that aid diagnoses previously made only by pathologists and microbiologists. Five major areas in the medical domain in which ML is contributing significantly are:

  1. Disease identification (e.g., Cancer positive or negative)
  2. Drug discovery (e.g., Covid Vaccine RNA Patterns)
  3. Smart health records (e.g., tracking heart rate and flagging anomalies)
  4. Clinical decision making (e.g., predicting the perfect medicine)
  5. Medical imaging Oncology (e.g., Predicting cancer using diagnosis images)

ML in Health Care

Cancer classification is one such area where ML can deliver a robust predictive model to identify the cancer possibility based on given observations. Let's quickly define the problem statement and move towards the actual implementation.

Problem Statement

Breast cancer is the most common malignancy (malignant tumor) among women and the second leading cause of cancer death among women. Breast cancer occurs due to abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not necessarily mean cancer: tumors can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammogram, ultrasound, and biopsy are commonly used to diagnose breast cancer.

The FNA (fine-needle aspiration) test is a quick and simple procedure that removes a small amount of fluid from the area where swelling or soreness is present. The test results can be used to form a discretely labeled dataset for developing a machine learning model for breast cancer classification. The data pairs certain features with ground-truth results from the FNA test, so a model can learn to flag malignant cells that indicate breast cancer in a patient.
ML techniques such as Artificial Neural Networks, gradient boosting, and SVM can work with this clinical data to predict cases with high accuracy. This article will guide you through some basic implementation steps to use SVM for predicting cancer based on certain observations.

Steps to implement

Download and load dataset

The dataset we will be using for this purpose is the load_breast_cancer dataset from the sklearn library. The breast cancer dataset is a classic and straightforward binary classification dataset. It can be imported using sklearn.datasets.load_breast_cancer.

Let us get to know our data a little more.
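The dataset has dimensions 569x32, with each instance labeled ‘M’ or ‘B’, where M = malignant and B = benign. (The copy returned by sklearn.datasets.load_breast_cancer holds the same 569 instances as 30 numeric features, with integer targets: 0 = malignant, 1 = benign.)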
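
As a minimal sketch of the loading step, the snippet below reads the copy of the data bundled with scikit-learn and maps its integer targets back to the ‘M’/‘B’ labels used in this article. The variable names and the "diagnosis" column name are assumptions chosen to match the text, not something the library provides.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the copy of the breast cancer data bundled with scikit-learn.
dataset = load_breast_cancer()

# Features as a DataFrame; scikit-learn ships 30 numeric feature columns.
features = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# scikit-learn encodes the target as integers: 0 = malignant, 1 = benign.
# Map them to 'M'/'B' labels (the "diagnosis" name is an assumption).
diagnosis = pd.Series(dataset.target, name="diagnosis").map({0: "M", 1: "B"})

print(features.shape)            # number of rows and feature columns
print(diagnosis.value_counts())  # class balance
```

The class counts are worth checking before modelling, since an imbalanced target affects how the accuracy score should be read later.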

Raw dataset

The above-shown attributes are the features that are to be used to predict cancer. 
Note: The first feature, ‘Unnamed: 0’, is an index and can be excluded from the final features.

In this dataset, the relationship between the features is highly non-linear, so a robust classifier is needed to make predictions from it. You can use RadViz (a multi-dimensional visualization technique, available for example as pandas.plotting.radviz) to view all the features of the dataset in a single plot.

Dataset overview RadViz Visualization

RadViz places the features as anchors on a unit circle, and each instance in the dataset appears as a ‘red’ or ‘green’ point according to its label. The above visualization shows that the two classes overlap heavily, which is why a strong classifier is needed to solve this problem.
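
A plot of this kind can be sketched with the radviz function from pandas.plotting. This is only one possible way to produce it: the choice of pandas, the restriction to the ten "mean ..." features, and the label names are assumptions made here for legibility, not the article's original plotting code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from pandas.plotting import radviz
from sklearn.datasets import load_breast_cancer

# as_frame=True returns a DataFrame with 30 features plus a "target" column.
frame = load_breast_cancer(as_frame=True).frame.copy()

# Replace the integer target with readable labels (0 = malignant, 1 = benign).
frame["diagnosis"] = frame["target"].map({0: "malignant", 1: "benign"})
frame = frame.drop(columns="target")

# Assumption: plot only the ten "mean ..." features so the anchors stay legible.
mean_columns = [c for c in frame.columns if c.startswith("mean ")]

fig, ax = plt.subplots(figsize=(7, 7))
# Colours are assigned to classes in order of first appearance in the data.
radviz(frame[mean_columns + ["diagnosis"]], "diagnosis",
       color=["red", "green"], ax=ax)
ax.set_title("RadViz of the mean features")
```

Call fig.savefig(...) or plt.show() afterwards to view the figure.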


The pre-processing step involves several activities, such as:

  • Assigning numerical values to categorical data (target labels): We can use a label encoder to define the target values in this task. The label encoder can be imported using the command from sklearn.preprocessing import LabelEncoder. Once it is imported, an instance of the label encoder can be created and fitted on the target attribute column (diagnosis).
  • Standardizing every feature: In this step, the data is standardized. The standardizer built here can later be incorporated into the pipeline that we will build.
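
The label-encoding activity can be sketched as follows. The five-element array is a toy stand-in for the diagnosis column, used here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'diagnosis' column (values are illustrative only).
diagnosis = np.array(["M", "B", "B", "M", "B"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(diagnosis)

# LabelEncoder sorts the classes, so 'B' maps to 0 and 'M' maps to 1.
print(encoder.classes_)
print(encoded)

# inverse_transform recovers the original string labels.
print(encoder.inverse_transform(encoded))
```

Because the classes are sorted alphabetically, malignant ends up as the positive class (1) here, which is the opposite of the integer coding in scikit-learn's bundled copy of the dataset.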

Model Formation

  • Obtaining the training and testing data sample: Using the above dataset, we can call the train_test_split function from the sklearn.model_selection module to divide the dataset into training and testing datasets. The splitting can be done in a 75:25 ratio.
  • Creating an instance of the model: Import the Support Vector Classifier (SVC) from the svm module of the sklearn library using from sklearn.svm import SVC and create an instance of the model.
    SVMs are one of the most popular classification algorithms; through kernels, they transform non-linear data so that a linear model can be fitted to it (Cortes and Vapnik 1995).
    Kernelized support vector machines are a strong method for mapping a highly non-linear dataset into a space where it is close to linearly separable, so that new instances can be classified. Use from sklearn.preprocessing import StandardScaler to import the standardization class. When data is fed to this scaler, it rescales each feature to zero mean and unit standard deviation.

Standardization formulae
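
Standardization replaces each feature value x with z = (x - mean) / std, computed per feature. The sketch below checks that formula against StandardScaler on the bundled dataset; the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# z = (x - mean) / std, applied column by column.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std().
mean = X.mean(axis=0)
std = X.std(axis=0)
X_manual = (X - mean) / std

X_scaled = StandardScaler().fit_transform(X)

# The manual formula and the scaler should agree up to rounding error.
print(np.allclose(X_manual, X_scaled))
```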


  • Making an overall pipeline for the model training: A pipeline groups all the components and executes them sequentially. The helper can be imported from the scikit-learn library with from sklearn.pipeline import make_pipeline. Our pipeline chains the StandardScaler with the SVC: make_pipeline(StandardScaler(), SVC()). It runs the components in the order given and processes the input accordingly. The figure below shows the training dataset being fed to the pipeline: once the dataset is standardized, it is passed to the Support Vector Classifier (SVC) to solve the classification problem.
  • Use the training set that was prepared earlier and call .fit to fit the classifier pipeline on the training data.
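
Put together, the model-formation steps above might look like the sketch below. The random_state value and the stratify argument are assumptions added for reproducibility and balanced splits; the article only fixes the 75:25 ratio.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# 75:25 split; random_state and stratify are assumptions, not from the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# The scaler is fitted on the training data only, then the SVC is trained
# on the standardized features.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

# Mean accuracy on the held-out test set.
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means the test data is standardized with statistics learned from the training data, which avoids leaking test information into training.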

Overall pipeline of the proposed methodology

The above figure shows the flow of the model pipeline. The model used is SVC, which has many tunable parameters, such as:

  • Regularization parameter, C: It defaults to 1.0. It must be a positive float, and the strength of the regularization is inversely proportional to C.
  • Kernel: The kernel transforms the data into a different form so that the classifier can separate the classes more easily. The most commonly preferred kernel is ‘rbf’ because of its ability to account for non-linearity.

Both parameters above are of high importance. Other parameters, such as ‘gamma’, can be set to ‘auto’ (or left at scikit-learn's default, ‘scale’) so that the model picks a value itself.
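
To see how these parameters are passed, the sketch below builds the pipeline with an explicit C, kernel, and gamma, and compares a few values of C using 5-fold cross-validation. The candidate values and the use of cross_val_score are assumptions for illustration, not a tuning procedure taken from the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

results = {}
for C in (0.1, 1.0, 10.0):  # illustrative candidate values
    # Smaller C means stronger regularization; larger C fits the data harder.
    model = make_pipeline(
        StandardScaler(), SVC(C=C, kernel="rbf", gamma="auto")
    )
    scores = cross_val_score(model, X, y, cv=5)
    results[C] = scores.mean()
    print(f"C={C}: mean CV accuracy = {results[C]:.3f}")
```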

Performance Evaluation 

Accuracy Score: This score is simply the percentage of correct predictions on the test set. For the above-given configuration, the accuracy is close to 95.8%.

Confusion Matrix: The confusion_matrix function can be imported from the metrics module of the sklearn library. It compares the predicted output for the test set with the ground truth.
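
Both the accuracy score and the confusion matrix can be computed as in the sketch below. The split settings (random_state, stratify) are assumptions, so the numbers it prints may differ from the figure quoted above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42  # assumed settings
)

model = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")

# Rows are true classes, columns are predicted classes,
# ordered as [0 (malignant), 1 (benign)] in scikit-learn's coding.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total equals the accuracy score.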

Confusion Matrix representation

ROC Curve: The ROC curve plots the True Positive Rate against the False Positive Rate as the decision threshold varies. The function that computes it can be imported using from sklearn.metrics import roc_curve.
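
A minimal sketch of computing the ROC curve for the SVC pipeline follows. It uses the pipeline's decision_function as the score, which is one reasonable choice for an SVC without probability estimates; the split settings are again assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42  # assumed settings
)

model = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

# Signed distance to the separating boundary; larger means "more class 1".
# In scikit-learn's coding of this dataset, class 1 is benign.
scores = model.decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"ROC AUC: {auc:.3f}")
```

Plotting tpr against fpr (for example with matplotlib) gives the ROC curve itself; the AUC summarizes it in a single number.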

ROC plot of the above SVM method

Company Case Studies

Early diagnosis of cancer is an imperative step in saving lives. With this in mind, many top MNCs have invested significant time and money in it. Let’s look at some of their work.

DeepMind by Google

DeepMind of Google

Credit: DigitalHealth

Google has taken up initiatives to predict different forms of cancer using ML approaches. They have primarily focused on lung cancer, which causes more cancer deaths than any other form, including breast cancer. Their algorithm has outperformed radiologists in identifying cancerous cases from CT scan images.

IBM Watson & Mayo Clinic

WatsonHealth Image

Credit: Watson

The two have collaborated. IBM provides its cognitive computing research capabilities, including artificial intelligence, computer vision, and natural language processing, to enable cancerous tissue diagnosis that complements and enhances human expertise in the clinical domain. Mayo Clinic provides the clinical trial data.
The technology helps extract information more quickly (and painlessly) than any doctor would be able to, identifying patients who best match Mayo Clinic’s clinical trial criteria. This involves genomic analysis, matching patients to appropriate clinical trials, and generating evidence to support standard-of-care treatment recommendations. Watson Health is an example of such cognitive services offered by IBM.

Possible Interview Questions

  1. What are classification problems?
  2. Why SVM? What other algorithms can be tried in place of SVM?
  3. What features were used in the final feature set?
  4. Can we convert this to a multiclass classification problem?
  5. What more pre-processing of final data can be done?


Machine learning is acting as a lifesaver. It helps doctors identify diseases quickly and can also suggest possible treatments. Breast cancer is a common disease among women; the World Health Organization estimated about 685,000 deaths from it worldwide in 2020, roughly 15% of all cancer deaths among women. Machine learning algorithms like SVMs and ANNs are capable of detecting the possibility of breast cancer in patients, which could otherwise take doctors a long time to identify. Large companies like Google and IBM have invested in the field of cancer identification. Machine learning beginners can implement their own cancer classifier after reading this blog.

Enjoy Learning! Enjoy Thinking!
