
Cancer Subtype Classifier using Gene Expression


Pritesh Kamde · May 06, 2025 · 2 mins read


Cancer Subtype Classifier – Demo using Gene Expression and ML

🧩 Problem Statement

In cancer genomics, identifying the cancer subtype of a tumor sample is crucial for:

Accurate diagnosis
Treatment planning
Precision medicine research

However, RNA-seq data is high-dimensional (here, 20,531 genes measured on only 801 samples), which makes subtype classification challenging, especially across multiple cancer types.

❓ Why Are We Solving This?

Automatically classifying tumor samples based on gene expression can:

Help researchers label or validate samples
Support oncologists with second-opinion diagnostics
Serve as a preprocessing layer for further mutation or survival analysis

Approach

We built a multi-class machine learning classifier that:

Takes a tumor sample’s RNA-seq gene expression profile (20,531 genes in the full dataset; the demo uses a trimmed gene matrix)
Predicts the primary cancer type (called the subtype throughout this post)
Visualizes performance, top genes, and live predictions

Dataset

We used the TCGA-PANCAN-HiSeq-801x20531 dataset:

Samples: 801 tumor samples
Genes: 20,531 (RNA-seq, Illumina HiSeq platform)
Source: The Cancer Genome Atlas (TCGA)
Format: CSV matrix (samples × genes), with the labels in a separate CSV file
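
As a rough illustration of the format above, the sketch below reads a samples × genes matrix and its labels with pandas and aligns them by sample ID. The column layout (an unnamed first column of sample IDs and a label column called Class) and the helper name load_expression are assumptions made for illustration, not details confirmed by this post. The tiny in-memory CSVs stand in for the real files.

```python
import io

import pandas as pd


def load_expression(data_csv, labels_csv):
    """Read a samples x genes matrix and its labels, aligned by sample ID.

    Assumes (hypothetically) that the first column of each file holds the
    sample ID and that the labels file has exactly one label column.
    """
    X = pd.read_csv(data_csv, index_col=0)
    y = pd.read_csv(labels_csv, index_col=0).squeeze("columns")

    # Refuse to continue if the two files describe different samples.
    if set(X.index) != set(y.index):
        raise ValueError("expression matrix and labels cover different samples")

    # Reorder the labels so row i of X matches entry i of y.
    y = y.loc[X.index]
    return X, y


# Tiny stand-in files: 3 samples, 3 genes, labels deliberately out of order.
data_csv = io.StringIO(
    ",gene_0,gene_1,gene_2\n"
    "sample_0,0.0,2.5,9.1\n"
    "sample_1,1.2,0.0,8.7\n"
    "sample_2,0.4,3.3,0.0\n"
)
labels_csv = io.StringIO(
    ",Class\n"
    "sample_1,BRCA\n"
    "sample_0,LUAD\n"
    "sample_2,BRCA\n"
)

X, y = load_expression(data_csv, labels_csv)
print(X.shape)                       # rows are samples, columns are genes
print(y.value_counts().to_dict())    # number of samples per cancer type
```

Aligning on the sample ID rather than on row order is a defensive choice: it keeps a reordered labels file from silently mislabeling samples.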

Cancer types in this pan-cancer set:

BRCA – Breast invasive carcinoma
KIRC – Kidney renal clear cell carcinoma
COAD – Colon adenocarcinoma
LUAD – Lung adenocarcinoma
PRAD – Prostate adenocarcinoma

Tech Stack

Component	Tool
Machine Learning Model	XGBoost Classifier
Label Encoding	scikit-learn LabelEncoder
Pipeline	scikit-learn Pipeline
Dashboard UI	Streamlit
Data Processing and Visualization	pandas, seaborn, matplotlib
Model Persistence	joblib

Working

1. Load data_small.csv (trimmed gene matrix) and labels.csv
2. Encode cancer subtypes as numeric labels
3. Train an XGBoost classifier in a scikit-learn pipeline
4. Save the trained pipeline (classifier_pipeline.joblib)
5. Load the model in a Streamlit dashboard:
   - View performance metrics
   - See top gene importances
   - Select a sample and get the predicted cancer type
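
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: synthetic data stands in for data_small.csv and labels.csv, the hyperparameters and the step name clf are arbitrary, and saving the label encoder alongside the pipeline in one dictionary is an assumption (the post only mentions classifier_pipeline.joblib). If xgboost is not installed, the sketch falls back to scikit-learn's GradientBoostingClassifier purely so that it still runs. The Streamlit UI is omitted; the last lines show only the single-sample prediction a dashboard would call.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

try:
    from xgboost import XGBClassifier

    def make_classifier():
        return XGBClassifier(n_estimators=50, max_depth=3, random_state=0)

except ImportError:
    # Stand-in only: a different gradient-boosting implementation.
    from sklearn.ensemble import GradientBoostingClassifier

    def make_classifier():
        return GradientBoostingClassifier(
            n_estimators=50, max_depth=3, random_state=0
        )

# Step 1 (simulated): synthetic expression matrix, 3 classes x 60 samples.
rng = np.random.default_rng(0)
class_names = ["BRCA", "COAD", "LUAD"]
n_per_class, n_genes = 60, 30
blocks, labels = [], []
for i, name in enumerate(class_names):
    block = rng.normal(size=(n_per_class, n_genes))
    block[:, i] += 4.0  # one artificially informative "gene" per class
    blocks.append(block)
    labels += [name] * n_per_class
X = np.vstack(blocks)
y = np.array(labels)

# Step 2: encode string labels as integers 0..n_classes-1.
encoder = LabelEncoder()
y_enc = encoder.fit_transform(y)

# Hold out a stratified test set so every class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_enc, test_size=0.25, stratify=y_enc, random_state=0
)

# Step 3: train the classifier inside a scikit-learn Pipeline.
pipeline = Pipeline([("clf", make_classifier())])
pipeline.fit(X_train, y_train)
accuracy = accuracy_score(y_test, pipeline.predict(X_test))

# Step 4: persist the pipeline (plus the encoder) and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classifier_pipeline.joblib"
    joblib.dump({"pipeline": pipeline, "label_encoder": encoder}, path)
    bundle = joblib.load(path)

# Step 5 (prediction part only): classify one selected sample.
sample = X_test[:1]
pred = bundle["pipeline"].predict(sample)
predicted_label = bundle["label_encoder"].inverse_transform(pred)[0]
print(f"test accuracy: {accuracy:.2f}, predicted type: {predicted_label}")
```

The label encoder sits outside the pipeline because a scikit-learn Pipeline transforms only the features, not the target. Keeping the encoder next to the saved model lets the dashboard map numeric predictions back to cancer type names.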

Output

✅ Accuracy: ~85% on test data (varies by split)
🔬 Top genes shown by feature importance
🧠 Interactive Streamlit web app for exploration
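
One way to produce a top-genes list like the one mentioned above is to rank the per-feature importances of the fitted model. The sketch below is plain Python with made-up importance values; the helper name top_genes and the gene names are hypothetical. With a fitted scikit-learn pipeline, the importances would typically come from the final estimator's feature_importances_ attribute, which XGBoost's scikit-learn wrapper exposes.

```python
def top_genes(importances, gene_names, k=10):
    """Return the k (gene, importance) pairs with the largest importance."""
    if len(importances) != len(gene_names):
        raise ValueError("need exactly one importance value per gene")
    ranked = sorted(
        zip(gene_names, importances),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Made-up values for illustration; real ones would come from the model.
gene_names = ["gene_0", "gene_1", "gene_2", "gene_3", "gene_4"]
importances = [0.05, 0.40, 0.00, 0.30, 0.25]

top = top_genes(importances, gene_names, k=3)
for gene, score in top:
    print(f"{gene}: {score:.2f}")
```

Importance scores from tree ensembles describe what the model relied on, not biological causality, so a ranked list like this is best read as a starting point for follow-up analysis.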

Cancer Subtype Classifier Demo


