Cancer Subtype Classifier using Gene Expression

Pritesh Kamde · May 06, 2025 · 2 mins read

Cancer Subtype Classifier – Demo using Gene Expression and ML

🧩 Problem Statement

In cancer genomics, identifying the cancer subtype of a tumor sample is crucial for:

  • Accurate diagnosis
  • Treatment planning
  • Precision medicine research

However, RNA-seq data is high-dimensional (here, 20,531 genes measured on only 801 samples), which makes classification challenging, especially across multiple cancer types.


❓ Why Are We Solving This?

Automatically classifying tumor samples based on gene expression can:

  • Help researchers label or validate samples
  • Support oncologists with second-opinion diagnostics
  • Serve as a preprocessing layer for further mutation or survival analysis

Approach

We built a multi-class machine learning classifier that:

  1. Takes a tumor sample’s RNA-seq gene expression (20,531 genes)
  2. Predicts the primary cancer type (referred to here as the subtype)
  3. Visualizes performance, top genes, and live predictions

Dataset

We used the TCGA-PANCAN-HiSeq-801x20531 dataset:

  • Samples: 801 tumor samples
  • Genes: 20,531 (RNA-Seq HiSeq platform)
  • Source: The Cancer Genome Atlas (TCGA)
  • Format: CSV matrix (samples × genes) with labels

This pan-cancer set covers five cancer types:

  • BRCA – Breast invasive carcinoma
  • COAD – Colon adenocarcinoma
  • LUAD – Lung adenocarcinoma
  • KIRC – Kidney renal clear cell carcinoma
  • PRAD – Prostate adenocarcinoma
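
As a rough illustration of the CSV layout described above, the following sketch loads a samples × genes matrix together with its labels and aligns them by sample ID. It assumes the first column of each file holds the sample ID and that the label column is named "Class"; both are assumptions, as are the function name load_expression and the tiny made-up data used in place of the real files.

```python
import io

import pandas as pd


def load_expression(data_csv, labels_csv, label_column="Class"):
    """Load a samples x genes matrix and its labels, aligned by sample ID."""
    # Assumption: the first column of each file holds the sample ID.
    X = pd.read_csv(data_csv, index_col=0)
    labels = pd.read_csv(labels_csv, index_col=0)[label_column]
    # Keep only samples present in both files, in the same order.
    common = X.index.intersection(labels.index)
    return X.loc[common], labels.loc[common]


# Tiny made-up example standing in for the real data and label files.
data_csv = io.StringIO(",gene_0,gene_1\nsample_0,1.5,0.0\nsample_1,2.0,3.1\n")
labels_csv = io.StringIO(",Class\nsample_0,BRCA\nsample_1,LUAD\n")

X, y = load_expression(data_csv, labels_csv)
print(X.shape)           # (number of samples, number of genes)
print(y.value_counts())  # samples per cancer type
```

With the real files, the same function could be called with file paths instead of in-memory buffers, and the class counts give a quick check for label imbalance before training.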

Tech Stack

  • Machine Learning Model: XGBoost Classifier
  • Label Encoding: scikit-learn LabelEncoder
  • Pipeline: scikit-learn Pipeline
  • Dashboard UI: Streamlit
  • Data Processing: pandas, seaborn, matplotlib
  • Model Persistence: joblib
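
To make the label-encoding component concrete, here is a minimal sketch of how scikit-learn's LabelEncoder maps cancer type codes to integers and back. The label list is made up for illustration. Integer labels are generally what a gradient-boosted classifier such as XGBoost expects as its target.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for five samples.
labels = ["BRCA", "LUAD", "KIRC", "BRCA", "COAD"]

encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Classes are stored in sorted order: BRCA=0, COAD=1, KIRC=2, LUAD=3.
print(list(encoder.classes_))
print(list(y))

# inverse_transform turns numeric predictions back into readable names.
decoded = encoder.inverse_transform(y)
print(list(decoded))
```

Keeping the fitted encoder alongside the model matters, because the dashboard needs it to translate a numeric prediction back into a cancer type name.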

Working

  1. Load data_small.csv (trimmed gene matrix) and labels.csv
  2. Encode cancer subtypes as numeric labels
  3. Train an XGBoost classifier in a Scikit-learn pipeline
  4. Save the trained pipeline (classifier_pipeline.joblib)
  5. Load the model in a Streamlit dashboard:
    • View performance metrics
    • See top gene importances
    • Select a sample and get predicted cancer type
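
The training and persistence steps above can be sketched roughly as follows. This is not the post's actual code: it uses synthetic data instead of data_small.csv, the hyperparameters are arbitrary, the pipeline has a single step because the original preprocessing steps are not stated, and the file name label_encoder.joblib is hypothetical. If xgboost is not installed, the sketch falls back to scikit-learn's GradientBoostingClassifier purely so that it still runs.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

try:
    from xgboost import XGBClassifier as BoostedTrees
except ImportError:
    # Fallback so the sketch runs without xgboost; NOT the model used in the post.
    from sklearn.ensemble import GradientBoostingClassifier as BoostedTrees

# Synthetic stand-in for the expression matrix: 60 samples x 20 genes, 3 classes.
rng = np.random.default_rng(0)
names = np.repeat(["BRCA", "KIRC", "LUAD"], 20)
X = rng.normal(size=(60, 20))
X[:20, 0] += 5.0    # gene 0 is elevated in the BRCA samples
X[20:40, 1] += 5.0  # gene 1 is elevated in the KIRC samples
X[40:, 2] += 5.0    # gene 2 is elevated in the LUAD samples

# Step 2: encode cancer types as numeric labels.
encoder = LabelEncoder()
y = encoder.fit_transform(names)

# Step 3: train the classifier inside a scikit-learn pipeline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
pipeline = Pipeline([("clf", BoostedTrees(n_estimators=50, max_depth=3))])
pipeline.fit(X_train, y_train)

# Step 4: save the trained pipeline (and, as an assumption, the encoder).
out_dir = tempfile.mkdtemp()
pipeline_path = os.path.join(out_dir, "classifier_pipeline.joblib")
encoder_path = os.path.join(out_dir, "label_encoder.joblib")  # hypothetical name
joblib.dump(pipeline, pipeline_path)
joblib.dump(encoder, encoder_path)

# Step 5 (dashboard side): reload both and predict one selected sample.
loaded_pipeline = joblib.load(pipeline_path)
loaded_encoder = joblib.load(encoder_path)
pred = loaded_pipeline.predict(X_test[:1])
predicted_name = loaded_encoder.inverse_transform(pred)[0]
print(predicted_name)
```

In a Streamlit app, the reload-and-predict part would typically sit behind a sample selector, with the predicted name shown next to the true label.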

Output

  • ✅ Accuracy: ~85% on test data (varies by split)
  • 🔬 Top genes shown by feature importance
  • 🧠 Interactive Streamlit web app for exploration
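
For readers who want to reproduce the accuracy and top-gene outputs, one possible approach is sketched below. It assumes the classifier is the pipeline step named "clf" and exposes feature_importances_, as tree-based models in XGBoost and scikit-learn do. The helper top_genes, the synthetic data, and the gene names are all illustrative, and the fallback classifier is only there so the sketch runs without xgboost.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

try:
    from xgboost import XGBClassifier as BoostedTrees
except ImportError:
    # Fallback so the sketch runs without xgboost; NOT the model used in the post.
    from sklearn.ensemble import GradientBoostingClassifier as BoostedTrees


def top_genes(pipeline, gene_names, k=10):
    """Return the k genes with the highest feature importance."""
    importances = pipeline.named_steps["clf"].feature_importances_
    ranked = pd.Series(importances, index=gene_names).sort_values(ascending=False)
    return ranked.head(k)


# Synthetic data: 90 samples x 30 genes, 3 classes, two informative genes.
rng = np.random.default_rng(0)
genes = [f"gene_{i}" for i in range(30)]
X = pd.DataFrame(rng.normal(size=(90, 30)), columns=genes)
y = np.repeat([0, 1, 2], 30)
X.loc[y == 1, "gene_3"] += 6.0  # gene_3 separates class 1
X.loc[y == 2, "gene_7"] += 6.0  # gene_7 separates class 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
pipeline = Pipeline([("clf", BoostedTrees(n_estimators=50, max_depth=3))])
pipeline.fit(X_train, y_train)

# Held-out accuracy; on real data this varies with the train/test split.
accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")

# Genes ranked by importance; the two informative genes should rank highly.
top = top_genes(pipeline, genes, k=5)
print(top)
```

Feature importance from a single split is only a rough guide to which genes matter, so rankings are best compared across several random splits before drawing conclusions.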

[Figure: Cancer Subtype Classifier Demo]
