# ModernBERT-base-MoME-v0

This is a specialized variant of **ModernBERT-base** designed for *Mixture of Multichain Experts* (MoME) routing, i.e., determining which chain expert (e.g., Aptos, Ripple, Polkadot, Crust) should handle an incoming transaction or query. It retains the core architectural and performance benefits of ModernBERT while adding custom training on chain-classification data.

---

## Table of Contents

1. [Model Summary](#model-summary)
2. [Usage](#usage)
3. [Evaluation](#evaluation)
4. [Limitations](#limitations)
5. [Training](#training)
6. [License](#license)
7. [Citation](#citation)

---

## Model Summary

**ModernBERT-base-MoME-v0** is an encoder-only (BERT-style) model derived from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base). The original ModernBERT was trained on a large corpus of text and code (2 trillion tokens) and supports context lengths of up to 8,192 tokens. Key architectural features include:

- **Rotary Positional Embeddings (RoPE)** for long-context support
- **Local-Global Alternating Attention** for efficient attention over extended sequences
- **Unpadding + Flash Attention** for fast inference

**ModernBERT-base-MoME-v0** extends these capabilities with a fine-tuned head specialized in routing transactions or queries to the correct “chain expert” in a Mixture of Multichain Experts (MoME) system. Trained on specialized chain-classification data (e.g., Polkadot, Aptos, Ripple, Crust), it can determine which chain is relevant for a given transaction payload.

---

## Usage

You can load **ModernBERT-base-MoME-v0** with [Hugging Face Transformers](https://github.com/huggingface/transformers). Usage is largely identical to standard BERT, with two key notes:

1. **Long-Context Support**
   Thanks to the RoPE-based architecture, the model accepts input sequences of up to 8,192 tokens without degraded performance.
2. **Routing Head**
   After the core BERT encoding, a classification head (or specialized projection layer) predicts the most likely chain or domain.

### Quickstart

```bash
pip install -U "transformers>=4.48.0"
pip install flash-attn  # optional, but recommended if supported by your GPU
```

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "momeaicrypto/ModernBERT-base-MoME-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Sample transaction or query
text = "Transaction: {\"action\": \"transfer\", \"chain\": \"polkadot\", ...}"

inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# The logits indicate which chain this transaction most likely belongs to.
print("Logits:", outputs.logits)
predicted_label = outputs.logits.argmax(dim=-1).item()
print("Predicted chain ID:", predicted_label)
```
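
To turn the predicted index into a chain name, you can consult the label mapping stored in the model config. The snippet below is a minimal sketch that continues the quickstart above; it assumes the checkpoint ships an `id2label` mapping with chain names, which this card does not guarantee.

```python
# Continues the quickstart above. Assumes model.config.id2label maps class
# indices to chain names (e.g., {0: "polkadot", 1: "aptos", ...}); if the
# checkpoint only has generic labels ("LABEL_0", ...), define your own mapping.
id2label = model.config.id2label
print("Predicted chain:", id2label.get(predicted_label, "unknown"))
```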

**Note**: To adapt the model to a different classification scheme (e.g., additional chains), you can fine-tune it with standard BERT sequence-classification recipes, for example as sketched below.
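
The following is a minimal fine-tuning sketch using the standard `Trainer` API. The toy dataset, label IDs, and hyperparameters are placeholders (not the recipe used for this checkpoint); replace them with your own chain-labeled data.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "momeaicrypto/ModernBERT-base-MoME-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Toy chain-labeled examples; in practice, load your own dataset.
data = Dataset.from_dict({
    "text": ['{"action": "transfer", "chain": "polkadot", "amount": 10}',
             '{"action": "store_file", "chain": "crust", "size": "1GB"}'],
    "label": [0, 1],  # hypothetical IDs: 0 = polkadot, 1 = crust
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=8192)

tokenized = data.map(tokenize, batched=True)

# num_labels must match your new label set; ignore_mismatched_sizes lets the
# classification head be re-initialized when the label count changes.
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, ignore_mismatched_sizes=True
)

args = TrainingArguments(
    output_dir="modernbert-mome-finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  processing_class=tokenizer)
trainer.train()
```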

---

## Evaluation

The base ModernBERT architecture has been shown to match or outperform other leading encoder-only models across GLUE, BEIR, MLDR, CodeSearchNet, and StackQA. For **ModernBERT-base-MoME-v0**, we specifically evaluate:

- **Chain Classification Accuracy**: Accuracy on a specialized dataset of transactions labeled with their respective chains (Polkadot, Aptos, Ripple, Crust, etc.); see the sketch after this list.
- **Inference Efficiency on Long Inputs**: Verifying that local-global alternating attention and Flash Attention sustain high throughput, even for large transaction payloads or logs (up to 8,192 tokens).
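
As a concrete illustration of the chain-classification accuracy measurement, here is a minimal sketch; the evaluation examples and gold chain IDs below are hypothetical placeholders for a real labeled test set.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "momeaicrypto/ModernBERT-base-MoME-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Hypothetical (text, gold chain ID) pairs; use your own labeled test set.
eval_set = [
    ('{"action": "transfer", "chain": "polkadot", "amount": 10}', 0),
    ('{"action": "store_file", "chain": "crust", "size": "1GB"}', 1),
]

correct = 0
for text, gold in eval_set:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        pred = model(**inputs).logits.argmax(dim=-1).item()
    correct += int(pred == gold)

print(f"Chain classification accuracy: {correct / len(eval_set):.2%}")
```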

See the parent ModernBERT evaluation results for a broad performance context:

| Model          | IR (DPR) BEIR | IR (ColBERT) BEIR | NLU (GLUE) | Code (CSN) |
|----------------|---------------|-------------------|------------|------------|
| BERT           | 38.9          | 49.0              | 84.7       | 41.2       |
| RoBERTa        | 37.7          | 48.7              | 86.4       | 44.3       |
| **ModernBERT** | 41.6          | 51.3              | 88.4       | 56.4       |

*ModernBERT-base-MoME-v0* maintains the same strong backbone while adding chain-routing capabilities.

---

## Limitations

1. **Domain-Specific Training**: The model handles chain routing well, but performance may degrade on data outside its pre-training or fine-tuning domain (e.g., medical or legal text).
2. **Biases**: As with any large language model, biases in the underlying training data can surface in classification outcomes.
3. **Context Length**: The model accepts sequences of up to 8,192 tokens, but very long sequences can be slow on some GPU hardware if Flash Attention is not installed (see the loading sketch after this list).
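
If Flash Attention is available, you can request it explicitly at load time. A minimal sketch, assuming a CUDA GPU with bf16 support and the `flash-attn` package installed:

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "momeaicrypto/ModernBERT-base-MoME-v0",
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    torch_dtype=torch.bfloat16,               # Flash Attention needs fp16/bf16
).to("cuda")
```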

---

## Training

- **Base Model**: ModernBERT-base (149M parameters, 22 layers).
- **Fine-Tuning**: Additional training on ~1k chain-labeled transactions covering Polkadot, Aptos, Ripple, Crust, etc.
- **Long Context**: Trained with RoPE and local-global alternating attention for efficient use of extended context.
- **Optimizer**: StableAdamW with trapezoidal LR scheduling, consistent with the original ModernBERT training setup.

---

## License

This model inherits the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0) from ModernBERT.

---

## Citation

If you use **ModernBERT-base-MoME-v0** in your work, please cite the original ModernBERT:

```bibtex
@misc{modernbert,
  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
  year={2024},
  eprint={2412.13663},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.13663},
}
```

Additional references for the **MoME** (Mixture of Multichain Experts) concept should be included if relevant.