---
license: mit
tags:
- rna-seq
- bulk-rna
- cancer
- transcriptomics
- graph-neural-network
- transformer
- performer
- gcn
- pytorch
model_size: 48M
pipeline_tag: feature-extraction
library_name: pytorch
---

# 🧬 CancerTranscriptome-Mini-48M
*A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq*
|
**CancerTranscriptome-Mini-48M** is a lightweight derivative of **BulkFormer**, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq. It combines **GCN-based gene graph propagation**, **Rotary Expression Embeddings (REE)**, **local bin-wise Performer attention**, and **global Performer attention** in a single unified encoder.

This model is a proof of concept intended for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes.
|
---

# 🔬 Origin & References

### **Primary Reference (BulkFormer)**
Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui.
**“A large-scale foundation model for bulk transcriptomes.”**
bioRxiv (2025).
doi: https://doi.org/10.1101/2025.06.11.659222

### **This Model (CancerTranscriptome-Mini-48M)**
A compact re-implementation of the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency.
Source code: https://github.com/alwalt/BioFM
|
---

# 📊 Data Source

All training samples originate from the public **ARCHS4 Human RNA-seq v2.5** repository:

**ARCHS4 Reference:**
Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al.
**“Massive mining of publicly available RNA-seq data from human and mouse.”**
*Nature Communications* 9, 1366 (2018).
Dataset: https://maayanlab.cloud/archs4/
|
### **Filtering Procedure**
- Loaded all human bulk RNA-seq metadata from the ARCHS4 v2.5 HDF5 file
- Selected samples matching:
  `cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma`
- Removed samples lacking clear disease annotations
- Used the ARCHS4 log-TPM matrices (gene × sample)
- Final dataset: ~76k cancer samples, 19,357 genes
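The keyword selection step above can be sketched as a simple regular-expression predicate over each sample's free-text disease annotation. The function name and the example annotations below are illustrative, not the exact pipeline code; in practice the predicate is applied to the per-sample metadata strings read from the ARCHS4 v2.5 HDF5 file (e.g. via `h5py`).

```python
import re

# Pattern mirrors the disease terms listed in the filtering procedure.
CANCER_PATTERN = re.compile(
    r"\b(cancer|tumor|carcinoma|leukemia|lymphoma|melanoma|glioma)\b",
    flags=re.IGNORECASE,
)

def is_cancer_sample(description: str) -> bool:
    """Return True if a sample's free-text annotation mentions a cancer term."""
    return bool(CANCER_PATTERN.search(description))

# Toy stand-ins for ARCHS4 metadata strings:
annotations = [
    "acute myeloid leukemia, bone marrow",
    "healthy liver tissue",
    "metastatic melanoma cell line",
]
kept = [a for a in annotations if is_cancer_sample(a)]
print(kept)  # the two cancer annotations are retained
```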
|
No private, clinical, controlled-access, or proprietary data were used.

---
|
# 🧠 Model Architecture (Summary)

CancerTranscriptome-Mini-48M includes:

### **1. Gene Identity Embeddings**
- Precomputed **ESM2 embeddings** for each protein-coding gene
- Projected into the model dimension (320)
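The projection step is just a linear map from the ESM2 embedding width into the 320-dimensional model space. The 1280-dimensional input below assumes the 650M-parameter ESM2 variant and a random stand-in table; both are illustrative assumptions, not confirmed details of this checkpoint.

```python
import torch
import torch.nn as nn

ESM2_DIM, MODEL_DIM, N_GENES = 1280, 320, 19357  # 1280-d is an assumption

gene_emb = torch.randn(N_GENES, ESM2_DIM)  # stand-in for the real ESM2 table
proj = nn.Linear(ESM2_DIM, MODEL_DIM)      # learned projection into model space

gene_tokens = proj(gene_emb)               # one identity vector per gene
print(gene_tokens.shape)                   # torch.Size([19357, 320])
```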
|
### **2. Rotary Expression Embeddings (REE)**
- Deterministic sinusoidal continuous-value embedding
- Masked positions zeroed (mask token = −10)
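A minimal sketch of a sinusoidal continuous-value embedding with the −10 mask sentinel zeroed out. The exact REE formulation follows BulkFormer; the frequency schedule below is a standard sinusoidal scheme chosen for illustration.

```python
import torch

MASK_VALUE = -10.0  # sentinel marking masked expression values

def rotary_expression_embedding(x: torch.Tensor, dim: int = 320) -> torch.Tensor:
    """Embed continuous expression values sinusoidally (illustrative sketch).

    x: (batch, genes) log-TPM values; returns (batch, genes, dim).
    """
    half = dim // 2
    # Geometric frequency ladder, as in standard sinusoidal embeddings.
    freqs = torch.exp(-torch.arange(half) * (torch.log(torch.tensor(1e4)) / half))
    angles = x.unsqueeze(-1) * freqs                     # (batch, genes, half)
    emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    emb[x == MASK_VALUE] = 0.0                           # zero masked positions
    return emb

x = torch.tensor([[1.5, MASK_VALUE, 0.0]])
emb = rotary_expression_embedding(x)
print(emb.shape)              # torch.Size([1, 3, 320])
print(emb[0, 1].abs().sum())  # tensor(0.) -- masked gene contributes nothing
```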
|
### **3. Graph Neural Network Layer**
- **GCNConv** (Kipf & Welling) applied to a curated gene-gene graph
- Injects biological prior knowledge
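The propagation rule GCNConv implements is out = D^(-1/2) (A + I) D^(-1/2) X W. The model uses torch_geometric's `GCNConv`; the dependency-free PyTorch sketch below applies the same rule to a toy gene graph and is illustrative only.

```python
import torch

def gcn_propagate(x, edge_index, weight):
    """One GCN layer (Kipf & Welling) in plain PyTorch: D^-1/2 (A+I) D^-1/2 X W."""
    n = x.size(0)
    adj = torch.zeros(n, n)
    adj[edge_index[0], edge_index[1]] = 1.0  # dense adjacency from edge list
    adj = adj + torch.eye(n)                 # add self-loops
    deg_inv_sqrt = adj.sum(1).rsqrt()        # D^-1/2 on the node degrees
    adj = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
    return adj @ x @ weight

# Toy gene graph: 4 genes, 2 undirected edges given as directed pairs.
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 0, 3, 2]])
x = torch.randn(4, 320)
w = torch.randn(320, 320)
out = gcn_propagate(x, edge_index, w)
print(out.shape)  # torch.Size([4, 320])
```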
|
### **4. Expression Binning**
- Learnable importance scores sort genes
- Genes divided into 10 bins
- Each bin receives its own **local Performer** attention
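The binning step can be sketched as: a learnable score per gene orders the genes, and the sorted sequence is split into 10 equal bins for the local attention modules. Padding with a −1 sentinel so that 19,357 genes divide evenly is an assumption about the implementation, not a confirmed detail.

```python
import torch

N_GENES, N_BINS = 19357, 10

importance = torch.nn.Parameter(torch.randn(N_GENES))  # learnable scores
order = torch.argsort(importance, descending=True)     # most important first

# Pad so the gene count divides evenly into bins (sentinel -1 marks padding).
pad = (-N_GENES) % N_BINS
padded = torch.cat([order, order.new_full((pad,), -1)])
bins = padded.view(N_BINS, -1)  # each row feeds one local Performer
print(bins.shape)               # torch.Size([10, 1936])
```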
|
### **5. Global Performer Attention**
- 2 stacked Performer layers attending across all genes
|
### **6. Prediction Head**
- MLP → scalar value per gene
- Used for masked-expression reconstruction
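The training objective can be sketched as masked-value regression: replace a subset of expression values with the −10 sentinel, predict a scalar per gene, and score only the masked positions. The 15% mask rate and the MSE loss below are assumptions for illustration, not confirmed hyperparameters.

```python
import torch
import torch.nn.functional as F

MASK_VALUE = -10.0

target = torch.rand(2, 19357) * 10           # stand-in for log-TPM values
mask = torch.rand_like(target) < 0.15        # mask ~15% of genes (assumed rate)
inputs = target.masked_fill(mask, MASK_VALUE)

pred = torch.randn_like(target)              # stand-in for model(inputs)
loss = F.mse_loss(pred[mask], target[mask])  # loss only on masked positions
print(f"masked-MSE: {loss.item():.3f}")
```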
|
Total parameters: **48,336,162 (~48M)**

---
|
# 🎯 Intended Use

This model produces **context-aware gene embeddings** for downstream cancer transcriptomic tasks:

- Tumor subtype prediction
- Drug response modeling
- Immune infiltration scoring
- Survival / risk modeling
- Gene expression imputation
- Dimensionality reduction
- Transfer learning to TCGA, CCLE, DepMap, and GEO tumor datasets

---
|
# 🚀 How to Use

Download and run:
```python
import torch
from model import BulkFormer  # from this repo
import safetensors.torch as st

# Load model + weights
model = BulkFormer(
    dim=320,
    graph=torch.load("edge_index.pt"),       # provide your gene-gene graph
    gene_emb=torch.load("esm2_gene_emb.pt"),
    gene_length=19357,
    bin_head=8,
    full_head=4,
    bins=10,
    gb_repeat=1,
    p_repeat=2,
)

state = st.load_file("model.safetensors")
model.load_state_dict(state)
model.eval()

# Example input: 19,357-gene log-TPM vector
x = torch.randn(1, 19357)

with torch.no_grad():
    out = model(x)

print(out.shape)  # [1, 19357]
```