TARA-WorldModel-VICReg

Joint environment-proteome embedding model using VICReg (Variance-Invariance-Covariance Regularization) self-supervised learning, applied to the TARA Oceans metagenomic dataset. This model aligns environmental and Pfam protein domain representations in a shared 32-dimensional latent space.

This model is an exploratory approach, deposited for transparency and reproducibility. The ELF-NET study retained the XGBoost bidirectional framework (TARA-XGBoost-Bidirectional) as its primary modeling approach.

Architecture

Environment branch: Input(env_dim) -> Linear(hidden) -> ReLU -> Dropout(0.3) -> Linear(32)
Pfam branch:        Input(pfam_dim) -> Linear(hidden) -> ReLU -> Dropout(0.3) -> Linear(32)
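
The two branches above can be sketched in PyTorch as follows. This is a minimal illustration, not the code in world_model.py: the class name `TwoBranchEncoder`, the helper `make_branch`, the hidden width of 64, and the environment input size of 40 are all assumptions chosen for the example, and the prediction head is omitted.

```python
import torch
from torch import nn


def make_branch(in_dim: int, hidden: int, latent: int = 32, p_drop: float = 0.3) -> nn.Sequential:
    """One encoder branch: Linear -> ReLU -> Dropout -> Linear."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.ReLU(),
        nn.Dropout(p_drop),
        nn.Linear(hidden, latent),
    )


class TwoBranchEncoder(nn.Module):
    """Hypothetical stand-in for the model class; names and sizes are assumed."""

    def __init__(self, env_dim: int, pfam_dim: int, hidden: int = 64, latent: int = 32):
        super().__init__()
        self.env_branch = make_branch(env_dim, hidden, latent)
        self.pfam_branch = make_branch(pfam_dim, hidden, latent)

    def forward(self, env: torch.Tensor, pfam: torch.Tensor):
        # Each modality is projected independently into the shared latent space.
        return self.env_branch(env), self.pfam_branch(pfam)


# env_dim=40 and hidden=64 are illustrative values, not taken from the model card.
model = TwoBranchEncoder(env_dim=40, pfam_dim=32)
model.eval()  # disables dropout for inference
with torch.no_grad():
    z_env, z_pfam = model(torch.randn(8, 40), torch.randn(8, 32))
print(z_env.shape, z_pfam.shape)
```

Both outputs have a trailing dimension of 32, so the two embeddings can be compared directly by an alignment loss.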
| Property | Value |
|---|---|
| Latent dimension | 32 |
| Parameters | ~53K--64K (varies with Pfam input dimensionality) |
| VICReg loss weights | variance = 25.0, invariance = 25.0, covariance = 1.0 |
| Prediction head alpha | 1.0 |
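
For readers unfamiliar with the three VICReg terms, the sketch below follows the formulation in Bardes et al. (2022), with the loss weights listed above as defaults. It is not the repository's vicreg_loss.py; the function names, the `eps` value, and the random inputs are assumptions for illustration, and the prediction-head term is not included.

```python
import torch
import torch.nn.functional as F


def off_diagonal(m: torch.Tensor) -> torch.Tensor:
    """Return the off-diagonal entries of a square matrix as a flat tensor."""
    d = m.shape[0]
    return m[~torch.eye(d, dtype=torch.bool, device=m.device)]


def vicreg_loss(z_a, z_b, w_var=25.0, w_inv=25.0, w_cov=1.0, eps=1e-4):
    n, d = z_a.shape
    # Invariance: paired embeddings of the same sample should match.
    inv = F.mse_loss(z_a, z_b)
    # Center each embedding dimension over the batch.
    z_a = z_a - z_a.mean(dim=0)
    z_b = z_b - z_b.mean(dim=0)
    # Variance: hinge pushing the per-dimension std towards at least 1.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = 0.5 * (F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean())
    # Covariance: penalize correlations between different dimensions.
    cov_a = (z_a.T @ z_a) / (n - 1)
    cov_b = (z_b.T @ z_b) / (n - 1)
    cov = off_diagonal(cov_a).pow(2).sum() / d + off_diagonal(cov_b).pow(2).sum() / d
    return w_var * var + w_inv * inv + w_cov * cov


torch.manual_seed(0)
z_env = torch.randn(64, 32)                  # synthetic environment embeddings
z_pfam = z_env + 0.1 * torch.randn(64, 32)   # synthetic, nearly aligned Pfam embeddings
loss = vicreg_loss(z_env, z_pfam)
print(float(loss))
```

The variance and covariance terms are what prevent the trivial solution in which both branches emit a constant vector: that solution has zero invariance loss but is heavily penalized by the variance hinge.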

Training Data

| Property | Value |
|---|---|
| Source | 1,151 samples with complete productivity data (Chl-a, POC, NFLH) from 1,810 total TARA Oceans samples |
| Environmental features | Google Earth Engine oceanographic variables |
| Pfam features | CLR-transformed domain abundances reduced via PCA to 20, 32, or 64 dimensions |
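
The Pfam preprocessing described above (centered log-ratio transform followed by PCA) can be sketched with NumPy. This is an illustration on synthetic counts, not the project's pipeline: the pseudocount of 1.0, the function names, and the toy matrix size are assumptions. In a cross-validated setting the PCA projection would normally be fit on training folds only.

```python
import numpy as np


def clr(counts, pseudocount=1.0):
    """Centered log-ratio: log abundances minus the per-sample mean log abundance."""
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)


def pca_reduce(x, n_components):
    """Project rows of x onto the top principal components (via SVD)."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
toy_counts = rng.poisson(5.0, size=(100, 200))  # synthetic: 100 samples x 200 domains
pfam_clr = clr(toy_counts)
pfam_32 = pca_reduce(pfam_clr, 32)              # 32 is one of the listed dimensions
print(pfam_32.shape)
```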

Performance

6-Fold Leave-One-Basin-Out (LOBO) CV

| Target | Joint Model R² | Env-Only Baseline R² | Cohen's d | p-value |
|---|---|---|---|---|
| POC | 0.532 | 0.422 | 0.026 | 0.38 |
| Chl-a | 0.516 | 0.561 | -- | -- |
| NFLH | 0.560 | 0.700 | -- | -- |
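
The leave-one-basin-out protocol behind this table can be sketched as below. The example uses synthetic data with six integer-labelled groups standing in for ocean basins, and an ordinary least-squares fit standing in for the actual models; the helper names are assumptions, and nothing here reproduces the reported numbers.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination on one held-out group."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def leave_one_group_out(groups):
    """Yield (held-out group, train indices, test indices) for each group."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        held_out = groups == g
        yield g, np.where(~held_out)[0], np.where(held_out)[0]


rng = np.random.default_rng(0)
basins = np.repeat(np.arange(6), 50)   # six synthetic "basins", 50 samples each
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.3, 0.2]) + 0.1 * rng.normal(size=300)

scores = {}
for basin, train_idx, test_idx in leave_one_group_out(basins):
    # Stand-in model: least squares with an intercept, fit on the other basins.
    A_train = np.c_[X[train_idx], np.ones(len(train_idx))]
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
    pred = np.c_[X[test_idx], np.ones(len(test_idx))] @ coef
    scores[int(basin)] = r2_score(y[test_idx], pred)

print(scores)
```

Because every sample from the held-out basin is excluded from training, each score measures transfer to an unseen region rather than interpolation within one.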

9-Fold Spatial Block CV (matching primary XGBoost design)

| Pfam dim | XGB Baseline R² | VICReg R² | Delta R² |
|---|---|---|---|
| pfam20 | 0.417 | -2.045 | -2.462 |
| pfam32 | 0.417 | -4.217 | -4.634 |
| pfam64 | 0.417 | -1.262 | -1.679 |

The negative R² under spatial CV (that is, predictions worse than a constant equal to the held-out mean) reflects the MLP architecture's sensitivity to distribution shift on spatially distinctive held-out basins (Mediterranean, mid-Pacific), a known limitation of shallow neural networks on small tabular datasets (N ~ 1,100). This is an architecture confound, not evidence against the Pfam alignment signal itself.
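
As a small worked illustration of how R² can fall below zero, the toy numbers below (invented for this example, not taken from the study) compare a constant prediction at the held-out mean with a constant prediction that is offset from the held-out range, as can happen when a model extrapolates to a region unlike its training data.

```python
import numpy as np


def r2(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, evaluated on the held-out samples."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_heldout = np.array([1.0, 2.0, 3.0, 4.0])      # toy held-out targets, mean 2.5
mean_pred = np.full(4, y_heldout.mean())        # predicting the held-out mean
shifted_pred = np.full(4, 6.0)                  # biased prediction outside the range

print(r2(y_heldout, mean_pred))     # 0.0, since SS_res equals SS_tot
print(r2(y_heldout, shifted_pred))  # 1 - 54/5, about -9.8
```

A systematic offset alone is enough to drive R² strongly negative, even when the predictions are not otherwise erratic.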

Repository Contents

| Directory | Contents |
|---|---|
| checkpoints/ | 24 model checkpoints (4 hyperparameter configurations x 6 ocean basin folds) |
| scripts/ | Core training code (train_world_model.py, vicreg_loss.py, world_model.py) |
| results/ | Per-fold metrics, training curves, hyperparameter sweep results, permutation tests |
| config/ | Best hyperparameter configuration |

Usage

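import torch

# Load one fold's checkpoint onto the CPU. weights_only=False unpickles
# arbitrary Python objects, so only load checkpoint files you trust.
checkpoint = torch.load(
    "checkpoints/20260127_111754/world_model_fold_Arctic_20260127_111754.pt",
    map_location="cpu",
    weights_only=False,
)
# The trained weights are stored under the "model_state_dict" key.
state_dict = checkpoint["model_state_dict"]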
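
Since the parameter count varies with the Pfam input dimensionality, it can help to inspect tensor shapes before rebuilding the model. The helper below is a hypothetical convenience function, not part of the repository; it is demonstrated on a toy network because the actual parameter names in the checkpoint are not documented here.

```python
import torch
from torch import nn


def describe_state_dict(state_dict):
    """Map each parameter name to its tensor shape."""
    return {name: tuple(tensor.shape) for name, tensor in state_dict.items()}


# Toy stand-in with assumed sizes; replace toy.state_dict() with the
# state_dict loaded from a checkpoint to inspect the real model.
toy = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 32))
shapes = describe_state_dict(toy.state_dict())
for name, shape in shapes.items():
    print(name, shape)
```

For a `Linear` layer the weight shape is (out_features, in_features), so the second dimension of each branch's first weight matrix gives the expected input size.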

Related Resources

| Resource | Link |
|---|---|
| ELF-NET analysis pipeline (371 scripts, 15 modules) | github.com/olympus-terminal/ELF-NET |
| Bidirectional XGBoost models (primary approach) | TARA-XGBoost-Bidirectional |
| algaGPT protein classifier | GreenGenomicsLab/algaGPT |
| Dark-whiteGPLM checkpoints | SarahDaakour/dark-whiteGPLM |

References

  • Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022.

Authors

David R. Nelson, Kourosh Salehi-Ashtiani

New York University Abu Dhabi

Citation

@article{nelson2026elfnet,
  title   = {Coupling of oceanographic state to the dark proteome: a foundation for genome-informed marine productivity modeling},
  author  = {Nelson, David Roy and Plouviez, Maxence and Daakour, Sarah and Jaiswal, Ashish and Fu, Weiqi and Amin, Shady A. and Salehi-Ashtiani, Kourosh},
  journal = {Forthcoming},
  year    = {2026}
}

Contact

Kourosh Salehi-Ashtiani -- [email protected]
