algaGPT
algaGPT is a causal language model, built on nanoGPT, for binary classification of microalgal vs. contaminant protein sequences. It distinguishes algal proteins from bacterial, archaeal, and fungal contaminants in metagenomic assemblies without relying on sequence homology searches.
Developed by Kourosh Salehi-Ashtiani, New York University Abu Dhabi.
Model Description
| Property | Value |
|---|---|
| Architecture | nanoGPT (GPT-2, 12 layers, 12 heads, 768 embedding) |
| Task | Binary classification via next-token prediction |
| Mode | TI-inclusive (full-length amino acid sequences) |
| Training data | ~58.6M protein sequences (1:1 algal:contaminant) |
| Algal sources | 166 microalgal genomes across 10 phyla |
| Contaminant sources | Bacterial, archaeal, and fungal sequences from NCBI nr |
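The architecture row matches the GPT-2 small layout. As a rough sketch, the stated hyperparameters can be written as a small config object. The class `AlgaGPTConfig` and the function `approx_block_params` are hypothetical names used for illustration and are not taken from the algaGPT code. Vocabulary size and context length are not stated in this card, so they are left out. The estimate uses the common approximation of 12 x n_layer x n_embd^2 weights for the transformer blocks (attention plus a 4x-expansion MLP, ignoring biases, layer norms, and embeddings).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlgaGPTConfig:
    """Hypothetical container for the hyperparameters listed in the table."""
    n_layer: int = 12   # transformer blocks
    n_head: int = 12    # attention heads per block
    n_embd: int = 768   # embedding (model) width


def approx_block_params(cfg: AlgaGPTConfig) -> int:
    """Approximate weight count of the transformer blocks only."""
    attn = 4 * cfg.n_embd ** 2  # query, key, value, and output projections
    mlp = 8 * cfg.n_embd ** 2   # two linear layers with 4x hidden expansion
    return cfg.n_layer * (attn + mlp)


cfg = AlgaGPTConfig()
head_dim = cfg.n_embd // cfg.n_head
print(head_dim, approx_block_params(cfg))  # 64 84934656
```

Under these assumptions the blocks hold roughly 85 million weights, with 64 dimensions per attention head. Token and position embeddings add to this total depending on the vocabulary and context length actually used.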
Performance
| Metric | Score |
|---|---|
| Recall | >99% |
| Speed vs. BLASTp | ~10,701x faster |
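Recall is the fraction of true positives that the classifier recovers, TP / (TP + FN). The sketch below shows that calculation, assuming the algal class is treated as the positive class (the table does not state this). The function name `recall` and the toy labels are illustrative only and are not data from the study.

```python
def recall(y_true, y_pred, positive="algal"):
    """Return TP / (TP + FN) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive examples in y_true")
    return tp / (tp + fn)


# Toy labels for illustration only.
y_true = ["algal", "algal", "contaminant", "algal"]
y_pred = ["algal", "contaminant", "contaminant", "algal"]
print(recall(y_true, y_pred))  # 2 of 3 algal sequences recovered
```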
Usage
Input: amino acid sequence string terminated with the `>` character
Output: classification tag (algal or contaminant) produced by next-token prediction
See Nelson et al. (2025) for the full inference pipeline.
Batch inference script: https://github.com/SarahD4/dark-whiteGPLM/blob/main/batch_inference.py
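The input and output conventions described in this section can be sketched as two small helpers. This is a minimal illustration and not the authors' pipeline: the function names `build_prompt` and `parse_tag` are hypothetical, the literal tag strings `algal` and `contaminant` are assumed from the description above and may differ from the tokens the model actually emits, and no model call is shown.

```python
def build_prompt(fasta_record: str) -> str:
    """Turn a FASTA record (or bare sequence) into 'SEQUENCE>'."""
    lines = [ln.strip() for ln in fasta_record.splitlines()]
    # Drop FASTA header lines and blank lines, then join the sequence lines.
    seq = "".join(ln for ln in lines if ln and not ln.startswith(">"))
    if not seq:
        raise ValueError("empty sequence")
    return seq + ">"  # the trailing '>' marks the end of the input


def parse_tag(generated: str, tags=("algal", "contaminant")):
    """Return the assumed tag that the generated text starts with, else None."""
    text = generated.strip().lower()
    for tag in tags:
        if text.startswith(tag):
            return tag
    return None


# Hypothetical record for illustration only.
prompt = build_prompt(">seq1 example\nMKTAYIAK\nQRQISF\n")
print(prompt)                    # MKTAYIAKQRQISF>
print(parse_tag(" algal\n"))     # algal
```

Returning `None` for unrecognized output keeps malformed generations separate from either class, so they can be counted or re-run instead of being silently mislabeled.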
Application in ELF-NET
algaGPT served as the primary proteome extraction tool in the ELF-NET study, purifying algal protein sequences from 2,044 TARA Oceans metagenome assemblies to yield 221.9 million sequences for downstream domain-environment coupling analysis.
Related Resources
| Resource | Link |
|---|---|
| ELF-NET analysis pipeline (371 scripts, 15 modules) | github.com/olympus-terminal/ELF-NET |
| Bidirectional XGBoost models | TARA-XGBoost-Bidirectional |
| VICReg joint embedding model | TARA-WorldModel-VICReg |
| Dark-whiteGPLM checkpoints | SarahDaakour/dark-whiteGPLM |
| Dark-whiteGPLM training data | SarahDaakour/dark-whiteGPLM-data |
| Dark-whiteGPLM code | github.com/SarahD4/dark-whiteGPLM |
| algaGPT-purified sequences (Data S2) | Zenodo 10.5281/zenodo.18728837 |
Authors
David R. Nelson, Ashish Kumar Jaiswal, Noha Samir Ismail, Alexandra Mystikou, Kourosh Salehi-Ashtiani
New York University Abu Dhabi
Citation
algaGPT was introduced in:
@article{nelson2025la4sr,
title = {Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras},
author = {Nelson, David R. and Jaiswal, Ashish Kumar and Ismail, Noha Samir and Mystikou, Alexandra and Salehi-Ashtiani, Kourosh},
journal = {Patterns},
volume = {6},
pages = {101373},
year = {2025},
doi = {10.1016/j.patter.2025.101373}
}
algaGPT was applied at scale in:
@article{nelson2026elfnet,
title = {Coupling of oceanographic state to the dark proteome: a foundation for genome-informed marine productivity modeling},
author = {Nelson, David Roy and Plouviez, Maxence and Daakour, Sarah and Jaiswal, Ashish and Fu, Weiqi and Amin, Shady A. and Salehi-Ashtiani, Kourosh},
journal = {Forthcoming},
year = {2026}
}
Contact
Kourosh Salehi-Ashtiani -- [email protected]