TARA-WorldModel-VICReg

Joint environment-proteome embedding model using VICReg (Variance-Invariance-Covariance Regularization) self-supervised learning, applied to the TARA Oceans metagenomic dataset. This model aligns environmental and Pfam protein domain representations in a shared 32-dimensional latent space.

This model represents an exploratory methodological approach deposited for transparency and reproducibility. The XGBoost bidirectional framework (TARA-XGBoost-Bidirectional) was retained as the primary modeling approach in the ELF-NET study.

Architecture

Environment branch: Input(env_dim) -> Linear(hidden) -> ReLU -> Dropout(0.3) -> Linear(32)
Pfam branch:        Input(pfam_dim) -> Linear(hidden) -> ReLU -> Dropout(0.3) -> Linear(32)
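
The two branches can be sketched in PyTorch as follows. This is a minimal illustration, not the code in `scripts/world_model.py`: the class name `TwoBranchEncoder`, the helper `make_branch`, the hidden width of 128, and the example input dimensions are all assumptions chosen for the demo.

```python
import torch
from torch import nn


def make_branch(in_dim: int, hidden: int, latent: int = 32, p_drop: float = 0.3) -> nn.Sequential:
    """One encoder branch: Linear -> ReLU -> Dropout -> Linear."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.ReLU(),
        nn.Dropout(p_drop),
        nn.Linear(hidden, latent),
    )


class TwoBranchEncoder(nn.Module):
    """Hypothetical stand-in: maps both inputs into one shared latent space."""

    def __init__(self, env_dim: int, pfam_dim: int, hidden: int = 128, latent: int = 32):
        super().__init__()
        self.env_branch = make_branch(env_dim, hidden, latent)
        self.pfam_branch = make_branch(pfam_dim, hidden, latent)

    def forward(self, env: torch.Tensor, pfam: torch.Tensor):
        return self.env_branch(env), self.pfam_branch(pfam)


# Placeholder dimensions, not the values used in the study.
model = TwoBranchEncoder(env_dim=40, pfam_dim=32)
z_env, z_pfam = model(torch.randn(8, 40), torch.randn(8, 32))
n_params = sum(p.numel() for p in model.parameters())
```

Both outputs have shape `(batch, 32)`, so they can be compared directly by a loss defined on the latent space. The parameter count depends on `env_dim`, `pfam_dim`, and `hidden`, which is why it changes with the Pfam input dimensionality.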

Property	Value
Latent dimension	32
Parameters	~53K--64K (varies with Pfam input dimensionality)
VICReg loss weights	variance = 25.0, invariance = 25.0, covariance = 1.0
Prediction head alpha	1.0
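
The loss weights in the table can be read against the following sketch, which follows the formulation in Bardes et al. (2022) and may differ in detail from `scripts/vicreg_loss.py`. The function name, argument names, and `eps` value are assumptions. The prediction-head term (alpha = 1.0) is omitted because its definition is not given here.

```python
import torch
import torch.nn.functional as F


def vicreg_loss(z_a, z_b, w_var=25.0, w_inv=25.0, w_cov=1.0, eps=1e-4):
    """Weighted sum of variance, invariance, and covariance terms."""
    n, d = z_a.shape

    # Invariance: mean squared distance between paired embeddings.
    inv = F.mse_loss(z_a, z_b)

    def variance_term(z):
        # Hinge on the per-dimension standard deviation (target of 1).
        std = torch.sqrt(z.var(dim=0) + eps)
        return F.relu(1.0 - std).mean()

    def covariance_term(z):
        # Penalize squared off-diagonal entries of the covariance matrix.
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    var = variance_term(z_a) + variance_term(z_b)
    cov = covariance_term(z_a) + covariance_term(z_b)
    return w_var * var + w_inv * inv + w_cov * cov


# Fully collapsed embeddings: only the variance term is non-zero.
z = torch.zeros(16, 32)
collapsed = vicreg_loss(z, z)
```

In this model, `z_a` and `z_b` would be the environment and Pfam embeddings of the same sample, rather than two augmented views of one input as in the original paper.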

Training Data

Property	Value
Source	1,151 samples with complete productivity data (Chl-a, POC, NFLH) from 1,810 total TARA Oceans samples
Environmental features	Google Earth Engine oceanographic variables
Pfam features	CLR-transformed domain abundances reduced via PCA to 20, 32, or 64 dimensions
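
As an illustration of the Pfam preprocessing, the sketch below applies a centered log-ratio (CLR) transform and then a PCA projection. It runs on synthetic counts. The pseudocount of 1.0, the function names, and the SVD-based PCA are assumptions, not a description of the authors' pipeline.

```python
import numpy as np


def clr_transform(counts, pseudocount=1.0):
    """Centered log-ratio: log abundance minus the per-sample mean log abundance."""
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)


def pca_reduce(x, n_components):
    """Project onto the top principal components via SVD of column-centered data."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(100, 200))  # synthetic samples x domains
clr = clr_transform(counts)
pfam20 = pca_reduce(clr, 20)  # analogous to the 20-dimensional setting
```

Under cross-validation, the PCA projection would normally be fitted on the training folds only and then applied to the held-out fold, to avoid leaking information across the split.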

Performance

6-Fold Leave-One-Basin-Out (LOBO) CV

Target	Joint Model R²	Env-Only Baseline R²	Cohen's d	p-value
POC	0.532	0.422	0.026	0.38
Chl-a	0.516	0.561	--	--
NFLH	0.560	0.700	--	--
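
Leave-one-basin-out CV holds out every sample from one basin per fold and trains on the rest. A minimal sketch of the split logic follows; the function name and the basin labels in the example are placeholders, not the study's actual basin assignment.

```python
import numpy as np


def leave_one_group_out(groups):
    """Yield (held_out_label, train_idx, test_idx) once per distinct group label."""
    groups = np.asarray(groups)
    for label in np.unique(groups):
        test_mask = groups == label
        yield label, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Placeholder labels: one entry per sample.
basins = ["Arctic", "Arctic", "Indian", "Indian", "Pacific", "Pacific", "Pacific"]
folds = list(leave_one_group_out(basins))
```

With six distinct basin labels, this yields the six folds of the LOBO design. scikit-learn's `LeaveOneGroupOut` provides the same splitting behavior.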

9-Fold Spatial Block CV (matching primary XGBoost design)

Pfam dim	XGB Baseline R²	VICReg R²	Delta R²
pfam20	0.417	-2.045	-2.462
pfam32	0.417	-4.217	-4.634
pfam64	0.417	-1.262	-1.679

The negative R² values under spatial CV (predictions worse than the held-out mean) reflect the MLP architecture's sensitivity to distribution shift on spatially distinctive held-out basins (Mediterranean, mid-Pacific), a known limitation of shallow neural networks on small tabular datasets (N ≈ 1,100). This is an architecture confound, not evidence against the Pfam alignment signal itself.
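
For reference, R² is computed as 1 - SS_res / SS_tot, so it has no lower bound. The toy example below uses made-up numbers to show how a systematic offset on a held-out set drives it below zero.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
r2_mean = r_squared(y_true, [2.0, 2.0, 2.0])     # predicting the mean gives 0.0
r2_shifted = r_squared(y_true, [4.0, 5.0, 6.0])  # constant offset of +3 gives -12.5
```

The shifted predictions track the ordering of the targets perfectly and still score far below zero, which is the kind of failure a distribution shift between training and held-out regions can produce.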

Repository Contents

Directory	Contents
`checkpoints/`	24 model checkpoints (4 hyperparameter configurations x 6 ocean basin folds)
`scripts/`	Core training code (`train_world_model.py`, `vicreg_loss.py`, `world_model.py`)
`results/`	Per-fold metrics, training curves, hyperparameter sweep results, permutation tests
`config/`	Best hyperparameter configuration

Usage

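import torch

# Load one fold's checkpoint onto the CPU. weights_only=False lets torch.load
# unpickle arbitrary Python objects, so use it only with checkpoints you trust.
checkpoint = torch.load(
    "checkpoints/20260127_111754/world_model_fold_Arctic_20260127_111754.pt",
    map_location="cpu",
    weights_only=False,
)
# The model weights are stored under the "model_state_dict" key.
state_dict = checkpoint["model_state_dict"]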
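
To rebuild a model around the loaded weights, it helps to read the layer sizes from the state dict first. The helpers below are a hypothetical sketch: they work on any mapping whose values expose `.shape`. The demo uses NumPy arrays with made-up key names so that it runs without the checkpoint file.

```python
import numpy as np


def summarize_state_dict(state_dict):
    """Map each parameter name to its shape as a plain tuple."""
    return {name: tuple(t.shape) for name, t in state_dict.items()}


def linear_layer_dims(state_dict):
    """List (name, in_features, out_features) for every 2-D weight.

    torch.nn.Linear stores its weight with shape (out_features, in_features).
    """
    return [
        (name, shape[1], shape[0])
        for name, shape in summarize_state_dict(state_dict).items()
        if len(shape) == 2
    ]


# Stand-in for a loaded state dict; key names and sizes are hypothetical.
demo = {
    "env_branch.0.weight": np.zeros((128, 40)),
    "env_branch.0.bias": np.zeros(128),
    "env_branch.3.weight": np.zeros((32, 128)),
}
dims = linear_layer_dims(demo)
```

Calling `linear_layer_dims(state_dict)` on the real checkpoint should reveal the input, hidden, and latent sizes needed to instantiate the model class from `scripts/world_model.py` before calling `load_state_dict`.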

Related Resources

Resource	Link
ELF-NET analysis pipeline (371 scripts, 15 modules)	github.com/olympus-terminal/ELF-NET
Bidirectional XGBoost models (primary approach)	TARA-XGBoost-Bidirectional
algaGPT protein classifier	GreenGenomicsLab/algaGPT
Dark-whiteGPLM checkpoints	SarahDaakour/dark-whiteGPLM

References

Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022.

Authors

David R. Nelson, Kourosh Salehi-Ashtiani

New York University Abu Dhabi

Citation

@article{nelson2026elfnet,
  title   = {Coupling of oceanographic state to the dark proteome: a foundation for genome-informed marine productivity modeling},
  author  = {Nelson, David Roy and Plouviez, Maxence and Daakour, Sarah and Jaiswal, Ashish and Fu, Weiqi and Amin, Shady A. and Salehi-Ashtiani, Kourosh},
  journal = {Forthcoming},
  year    = {2026}
}

Contact

Kourosh Salehi-Ashtiani -- [email protected]
