algaGPT

Causal language model for binary classification of microalgal vs. contaminant protein sequences, built on nanoGPT. algaGPT distinguishes algal proteins from bacterial, archaeal, and fungal contaminants in metagenomic assemblies without requiring sequence homology.

Developed by Kourosh Salehi-Ashtiani, New York University Abu Dhabi.

Model Description

| Property | Value |
|---|---|
| Architecture | nanoGPT (GPT-2, 12 layers, 12 heads, 768 embedding) |
| Task | Binary classification via next-token prediction |
| Mode | TI-inclusive (full-length amino acid sequences) |
| Training data | ~58.6M protein sequences (1:1 algal:contaminant) |
| Algal sources | 166 microalgal genomes across 10 phyla |
| Contaminant sources | Bacterial, archaeal, and fungal sequences from NCBI nr |

Performance

| Metric | Score |
|---|---|
| Recall | >99% |
| Speed vs. BLASTp | ~10,701x faster |

Usage

Input: An amino acid sequence string terminated by the > character, after which the model predicts the tag.

Output: A classification tag (algal or contaminant), generated by next-token prediction.

See Nelson et al. (2025) for the full inference pipeline.

Batch inference script: https://github.com/SarahD4/dark-whiteGPLM/blob/main/batch_inference.py
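
The input and output conventions described in this section can be illustrated with a small, model-free sketch. It only shows how a prompt might be assembled and how a generated string might be mapped back to a label; it is not the authors' inference code. The helper names `format_prompt` and `parse_tag`, the set of accepted residue letters, and the literal tag spellings are all assumptions made for illustration, so check the released checkpoint and the referenced pipeline for the real vocabulary.

```python
from typing import Optional

# From the model card: the input sequence ends with ">".
PROMPT_TERMINATOR = ">"

# Assumption: placeholder tag spellings taken from the card's wording
# ("algal/contaminant"); the checkpoint's actual tag tokens may differ.
TAGS = ("algal", "contaminant")

# Assumption: the 20 standard residues plus common ambiguity codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZUO")


def format_prompt(sequence: str) -> str:
    """Strip whitespace, uppercase, validate, and append the terminator."""
    residues = "".join(sequence.split()).upper()
    if not residues:
        raise ValueError("empty sequence")
    unexpected = set(residues) - VALID_RESIDUES
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return residues + PROMPT_TERMINATOR


def parse_tag(generated: str) -> Optional[str]:
    """Return the tag that follows the first terminator, or None."""
    _, sep, tail = generated.partition(PROMPT_TERMINATOR)
    if not sep:
        return None  # no terminator, so no tag could have been generated
    tail = tail.strip().lower()
    for tag in TAGS:
        if tail.startswith(tag):
            return tag
    return None
```

With these helpers, `format_prompt("mkt lla")` produces the string that would be fed to the model, and `parse_tag` is applied to whatever text the model returns for that prompt. Returning `None` for unrecognized output keeps malformed generations from being silently counted as either class.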

Application in ELF-NET

algaGPT served as the primary proteome extraction tool in the ELF-NET study, purifying algal protein sequences from 2,044 TARA Oceans metagenome assemblies to yield 221.9 million sequences for downstream domain-environment coupling analysis.
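
As a rough illustration of what "purifying" an assembly means in practice, the sketch below keeps only the FASTA records that a classifier labels as algal. It is a minimal outline under stated assumptions, not the ELF-NET pipeline: `read_fasta` and `keep_algal` are hypothetical helper names, and `classify` stands in for any callable that wraps the model and returns the string "algal" or "contaminant" for a sequence.

```python
from typing import Callable, Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_algal(
    records: Iterable[Record],
    classify: Callable[[str], str],  # hypothetical wrapper around the model
) -> Iterator[Record]:
    """Keep only records whose predicted label is 'algal'."""
    for header, sequence in records:
        if classify(sequence) == "algal":
            yield header, sequence
```

Because both functions are generators, records are processed one at a time, which matters when an assembly is too large to hold in memory. For real workloads the batch inference script linked in this card is the better starting point.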

Related Resources

| Resource | Link |
|---|---|
| ELF-NET analysis pipeline (371 scripts, 15 modules) | github.com/olympus-terminal/ELF-NET |
| Bidirectional XGBoost models | TARA-XGBoost-Bidirectional |
| VICReg joint embedding model | TARA-WorldModel-VICReg |
| Dark-whiteGPLM checkpoints | SarahDaakour/dark-whiteGPLM |
| Dark-whiteGPLM training data | SarahDaakour/dark-whiteGPLM-data |
| Dark-whiteGPLM code | github.com/SarahD4/dark-whiteGPLM |
| algaGPT-purified sequences (Data S2) | Zenodo 10.5281/zenodo.18728837 |

Authors

David R. Nelson, Ashish Kumar Jaiswal, Noha Samir Ismail, Alexandra Mystikou, Kourosh Salehi-Ashtiani

New York University Abu Dhabi

Citation

algaGPT was introduced in:

@article{nelson2025la4sr,
  title   = {Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras},
  author  = {Nelson, David R. and Jaiswal, Ashish Kumar and Ismail, Noha Samir and Mystikou, Alexandra and Salehi-Ashtiani, Kourosh},
  journal = {Patterns},
  volume  = {6},
  pages   = {101373},
  year    = {2025},
  doi     = {10.1016/j.patter.2025.101373}
}

algaGPT was applied at scale in:

@article{nelson2026elfnet,
  title   = {Coupling of oceanographic state to the dark proteome: a foundation for genome-informed marine productivity modeling},
  author  = {Nelson, David Roy and Plouviez, Maxence and Daakour, Sarah and Jaiswal, Ashish and Fu, Weiqi and Amin, Shady A. and Salehi-Ashtiani, Kourosh},
  journal = {Forthcoming},
  year    = {2026}
}

Contact

Kourosh Salehi-Ashtiani -- [email protected]
