---
license: mit
tags:
- rna-seq
- bulk-rna
- cancer
- transcriptomics
- graph-neural-network
- transformer
- performer
- gcn
- pytorch
model_size: 48M
pipeline_tag: feature-extraction
library_name: pytorch
---

# 🧬 CancerTranscriptome-Mini-48M
*A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq*
|
**CancerTranscriptome-Mini-48M** is a lightweight derivative of **BulkFormer**, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq. It combines **GCN-based gene graph propagation**, **Rotary Expression Embeddings (REE)**, **local bin-wise Performer attention**, and **global Performer attention** in a single unified encoder.

This model is a proof of concept intended for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes.
|
---

# 🔬 Origin & References

### **Primary Reference (BulkFormer)**
Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui.
**“A large-scale foundation model for bulk transcriptomes.”**
bioRxiv (2025).
doi: https://doi.org/10.1101/2025.06.11.659222

### **This Model (CancerTranscriptome-Mini-48M)**
A compact re-implementation of the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency.
Source code: https://github.com/alwalt/BioFM
|
---

# 📊 Data Source

All training samples originate from the public **ARCHS4 Human RNA-seq v2.5** repository:

**ARCHS4 Reference:**
Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al.
**“Massive mining of publicly available RNA-seq data from human and mouse.”**
*Nature Communications* 9, 1366 (2018).
Dataset: https://maayanlab.cloud/archs4/
|
### **Filtering Procedure**
- Loaded all human bulk RNA-seq metadata from the ARCHS4 v2.5 HDF5 file
- Selected samples matching:
  `cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma`
- Removed samples lacking clear disease annotations
- Used the ARCHS4 log-TPM matrices (gene × sample)
- Final dataset: ~76k cancer samples, 19,357 genes
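The keyword selection step above can be sketched as a simple regular-expression predicate over each sample's free-text disease annotation. The function name and the example annotations below are illustrative, not the exact pipeline code; in practice the predicate is applied to the per-sample metadata strings read from the ARCHS4 v2.5 HDF5 file (e.g. via `h5py`).

```python
import re

# Pattern mirrors the disease terms listed in the filtering procedure.
CANCER_PATTERN = re.compile(
    r"\b(cancer|tumor|carcinoma|leukemia|lymphoma|melanoma|glioma)\b",
    flags=re.IGNORECASE,
)

def is_cancer_sample(description: str) -> bool:
    """Return True if a sample's free-text annotation mentions a cancer term."""
    return bool(CANCER_PATTERN.search(description))

# Toy stand-ins for ARCHS4 metadata strings:
annotations = [
    "acute myeloid leukemia, bone marrow",
    "healthy liver tissue",
    "metastatic melanoma cell line",
]
kept = [a for a in annotations if is_cancer_sample(a)]
print(kept)  # the two cancer annotations are retained
```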
|
No private, clinical, controlled-access, or proprietary data were used.

---
|
# 🧠 Model Architecture (Summary)

CancerTranscriptome-Mini-48M includes:

### **1. Gene Identity Embeddings**
- Precomputed **ESM2 embeddings** for each protein-coding gene
- Projected into the model dimension (320)
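The projection step is just a linear map from the ESM2 embedding width into the 320-dimensional model space. The 1280-dimensional input below assumes the 650M-parameter ESM2 variant and a random stand-in table; both are illustrative assumptions, not confirmed details of this checkpoint.

```python
import torch
import torch.nn as nn

ESM2_DIM, MODEL_DIM, N_GENES = 1280, 320, 19357  # 1280-d is an assumption

gene_emb = torch.randn(N_GENES, ESM2_DIM)  # stand-in for the real ESM2 table
proj = nn.Linear(ESM2_DIM, MODEL_DIM)      # learned projection into model space

gene_tokens = proj(gene_emb)               # one identity vector per gene
print(gene_tokens.shape)                   # torch.Size([19357, 320])
```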
|
### **2. Rotary Expression Embeddings (REE)**
- Deterministic sinusoidal continuous-value embedding
- Masked positions zeroed (mask token = −10)
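A minimal sketch of a sinusoidal continuous-value embedding with the −10 mask sentinel zeroed out. The exact REE formulation follows BulkFormer; the frequency schedule below is a standard sinusoidal scheme chosen for illustration.

```python
import torch

MASK_VALUE = -10.0  # sentinel marking masked expression values

def rotary_expression_embedding(x: torch.Tensor, dim: int = 320) -> torch.Tensor:
    """Embed continuous expression values sinusoidally (illustrative sketch).

    x: (batch, genes) log-TPM values; returns (batch, genes, dim).
    """
    half = dim // 2
    # Geometric frequency ladder, as in standard sinusoidal embeddings.
    freqs = torch.exp(-torch.arange(half) * (torch.log(torch.tensor(1e4)) / half))
    angles = x.unsqueeze(-1) * freqs                     # (batch, genes, half)
    emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    emb[x == MASK_VALUE] = 0.0                           # zero masked positions
    return emb

x = torch.tensor([[1.5, MASK_VALUE, 0.0]])
emb = rotary_expression_embedding(x)
print(emb.shape)              # torch.Size([1, 3, 320])
print(emb[0, 1].abs().sum())  # tensor(0.) -- masked gene contributes nothing
```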
|
### **3. Graph Neural Network Layer**
- **GCNConv** (Kipf & Welling) applied to a curated gene-gene graph
- Injects biological prior knowledge
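The propagation rule GCNConv implements is out = D^(-1/2) (A + I) D^(-1/2) X W. The model uses torch_geometric's `GCNConv`; the dependency-free PyTorch sketch below applies the same rule to a toy gene graph and is illustrative only.

```python
import torch

def gcn_propagate(x, edge_index, weight):
    """One GCN layer (Kipf & Welling) in plain PyTorch: D^-1/2 (A+I) D^-1/2 X W."""
    n = x.size(0)
    adj = torch.zeros(n, n)
    adj[edge_index[0], edge_index[1]] = 1.0  # dense adjacency from edge list
    adj = adj + torch.eye(n)                 # add self-loops
    deg_inv_sqrt = adj.sum(1).rsqrt()        # D^-1/2 on the node degrees
    adj = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
    return adj @ x @ weight

# Toy gene graph: 4 genes, 2 undirected edges given as directed pairs.
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 0, 3, 2]])
x = torch.randn(4, 320)
w = torch.randn(320, 320)
out = gcn_propagate(x, edge_index, w)
print(out.shape)  # torch.Size([4, 320])
```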
|
### **4. Expression Binning**
- Learnable importance scores sort genes
- Genes divided into 10 bins
- Each bin receives its own **local Performer** attention
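The binning step can be sketched as: a learnable score per gene orders the genes, and the sorted sequence is split into 10 equal bins for the local attention modules. Padding with a −1 sentinel so that 19,357 genes divide evenly is an assumption about the implementation, not a confirmed detail.

```python
import torch

N_GENES, N_BINS = 19357, 10

importance = torch.nn.Parameter(torch.randn(N_GENES))  # learnable scores
order = torch.argsort(importance, descending=True)     # most important first

# Pad so the gene count divides evenly into bins (sentinel -1 marks padding).
pad = (-N_GENES) % N_BINS
padded = torch.cat([order, order.new_full((pad,), -1)])
bins = padded.view(N_BINS, -1)  # each row feeds one local Performer
print(bins.shape)               # torch.Size([10, 1936])
```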
|
### **5. Global Performer Attention**
- 2 stacked Performer layers attending across all genes
|
### **6. Prediction Head**
- MLP → scalar value per gene
- Used for masked-expression reconstruction
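The training objective can be sketched as masked-value regression: replace a subset of expression values with the −10 sentinel, predict a scalar per gene, and score only the masked positions. The 15% mask rate and the MSE loss below are assumptions for illustration, not confirmed hyperparameters.

```python
import torch
import torch.nn.functional as F

MASK_VALUE = -10.0

target = torch.rand(2, 19357) * 10           # stand-in for log-TPM values
mask = torch.rand_like(target) < 0.15        # mask ~15% of genes (assumed rate)
inputs = target.masked_fill(mask, MASK_VALUE)

pred = torch.randn_like(target)              # stand-in for model(inputs)
loss = F.mse_loss(pred[mask], target[mask])  # loss only on masked positions
print(f"masked-MSE: {loss.item():.3f}")
```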
|
Total parameters: **48,336,162 (~48M)**

---
|
# 🎯 Intended Use

This model produces **context-aware gene embeddings** for downstream cancer transcriptomic tasks:

- Tumor subtype prediction
- Drug response modeling
- Immune infiltration scoring
- Survival / risk modeling
- Gene expression imputation
- Dimensionality reduction
- Transfer learning to TCGA, CCLE, DepMap, and GEO tumor datasets

---
|
# 🚀 How to Use

Download and run:
```python
import torch
from model import BulkFormer  # from this repo
import safetensors.torch as st

# Load model + weights
model = BulkFormer(
    dim=320,
    graph=torch.load("edge_index.pt"),       # provide your gene-gene graph
    gene_emb=torch.load("esm2_gene_emb.pt"),
    gene_length=19357,
    bin_head=8,
    full_head=4,
    bins=10,
    gb_repeat=1,
    p_repeat=2,
)

state = st.load_file("model.safetensors")
model.load_state_dict(state)
model.eval()

# Example input: 19,357-gene log-TPM vector
x = torch.randn(1, 19357)

with torch.no_grad():
    out = model(x)

print(out.shape)  # [1, 19357]
```