---
license: apache-2.0
tags:
- pytorch
- transformer
- mamba
- hybrid
- matryoshka
- nanochat
- adaptive-compute
pipeline_tag: text-generation
---

# Adamba: Adaptive Mamba

> **Ad**aptive **Mamba**: elastic compute with dynamic Matryoshka scaling

**Project repository: [unixsysdev/adamba](https://github.com/unixsysdev/adamba)**

|
## Available Checkpoints

| Variant | Parameters | Dim | Features | Status | Download |
|---------|------------|-----|----------|--------|----------|
| phase1_6b_base | 6.4B | 2048 | mamba_integration | ✅ | [Download](./checkpoints/phase1_6b_base.pt) |
| phase2_6b_matryoshka | 6.4B | 2048 | matryoshka, early_exit | ⏳ | - |
| phase3_9b_matryoshka | 9.3B | 2560 | matryoshka, early_exit | ⏳ | - |
| phase3_20b_matryoshka | 20B | 4096 | matryoshka, early_exit | ⏳ | - |
| sft_20b | 20B | 4096 | matryoshka, early_exit, sft | ⏳ | - |
| rl_20b | 20B | 4096 | matryoshka, early_exit, rl_agent | ⏳ | - |

## Architecture Overview

Adamba combines three efficiency techniques:

| Technique | Implementation | Purpose |
|-----------|----------------|---------|
| **Matryoshka (MRL)** | Width: 128 → 4096 per layer | Elastic compute |
| **Early Exit** | ConfidenceGate per layer | Skip when confident |
| **Static SSM** | Mamba at full dim | Stable memory backbone |

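To make the Matryoshka idea concrete, here is a minimal PyTorch sketch of width slicing: one full-size weight matrix is trained, and at inference only the leading sub-matrix up to the requested width is used. The class name and layout are illustrative assumptions, not the project's actual implementation.

```python
import torch
import torch.nn as nn

class MatryoshkaLinear(nn.Module):
    """Linear layer whose weight can be sliced to a smaller active width.

    Illustrative sketch: the full (max_dim x max_dim) weight is trained once;
    at inference only the leading `dim` rows/columns are used, so compute
    scales with the chosen width instead of the full model width.
    """

    def __init__(self, max_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_dim, max_dim) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_dim))

    def forward(self, x: torch.Tensor, dim: int) -> torch.Tensor:
        # x is (batch, dim) with dim <= max_dim; slice the leading sub-matrix.
        w = self.weight[:dim, :dim]
        b = self.bias[:dim]
        return x @ w.t() + b

layer = MatryoshkaLinear(max_dim=4096)
x_small = torch.randn(2, 256)
y = layer(x_small, dim=256)  # runs at width 256 instead of 4096
```

A per-layer dimension predictor (the `LayerDimPredictor` in the diagram below) would choose `dim` for each layer from the prompt, trading quality for speed.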
```
┌──────────────────────────────────────────────────┐
│ PROMPT → LayerDimPredictor → [dim per layer]     │
│                                                  │
│ Attention + MLP: dynamic (Matryoshka sliced)     │
│ Mamba: static (full dim)                         │
│                                                  │
│ Gate > 0.95 → EXIT EARLY                         │
│ Gate < 0.50 → EXPAND remaining layers            │
└──────────────────────────────────────────────────┘
```
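The gating control flow above can be sketched in a few lines of PyTorch. This is a hypothetical illustration of the thresholds from the diagram, not Adamba's actual `ConfidenceGate`; the module and function names are assumptions.

```python
import torch
import torch.nn as nn

class ConfidenceGate(nn.Module):
    """Per-layer scalar confidence in [0, 1] (illustrative sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Mean-pool over the sequence, then squash to a confidence score.
        return torch.sigmoid(self.proj(h.mean(dim=1))).mean()

def run_with_early_exit(blocks, gates, h, exit_thresh=0.95, expand_thresh=0.50):
    """Apply blocks in order, exiting once a gate is confident enough.

    Mirrors the diagram: gate > 0.95 exits early; gate < 0.50 flags the
    remaining layers to expand width (here we only record that signal).
    """
    expand = False
    for i, (block, gate) in enumerate(zip(blocks, gates)):
        h = block(h)
        conf = gate(h).item()
        if conf > exit_thresh:
            return h, i + 1, expand  # exit early, report layers used
        if conf < expand_thresh:
            expand = True            # downstream layers should widen
    return h, len(blocks), expand

dim = 64
blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(4)])
gates = nn.ModuleList([ConfidenceGate(dim) for _ in range(4)])
h = torch.randn(2, 8, dim)
out, layers_used, expand = run_with_early_exit(blocks, gates, h)
```

In the real model the static Mamba blocks would run at full width regardless, so only the attention and MLP paths participate in the elastic scaling.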
| | |
## Training Pipeline

```
nanochat-d32 (1.9B)
    ↓ Surgery (add 32 Mamba layers)
Phase 1: 6.4B (dim=2048) → Mamba integration
    ↓ Enable Matryoshka
Phase 2: 6.4B (dim=2048) → full training
    ↓ Progressive expand
Phase 3: 9.3B → 20B (dim=4096)
    ↓ Fine-tuning
SFT: Instruction tuning
RL: Agent capabilities
```
| | |
## Model Details

- **Base**: [karpathy/nanochat-d32](https://huggingface.co/karpathy/nanochat-d32)
- **Architecture**: 64 blocks (32 attention + 32 Mamba, interleaved)
- **Vocabulary**: 65,536 tokens
- **Matryoshka Dims**: [128, 256, 512, 1024, 2048, 4096]
| | |
## Usage

```python
# Coming soon - inference code
# See: https://github.com/unixsysdev/adamba
```
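Until the official inference code lands, the released checkpoint can presumably be inspected with plain PyTorch. The loader below assumes the `.pt` files are ordinary `torch.save` state dicts (possibly nested under a `"model"` key); the actual on-disk layout is an assumption.

```python
import torch

def load_checkpoint(path: str) -> dict:
    """Load a checkpoint saved with torch.save (layout is an assumption).

    Some training scripts nest the weights under a "model" key;
    this handles both the nested and flat cases.
    """
    state = torch.load(path, map_location="cpu")
    if isinstance(state, dict) and "model" in state:
        return state["model"]
    return state
```

For example, `load_checkpoint("checkpoints/phase1_6b_base.pt")` would return the tensor dict for the phase-1 model, which you could then inspect key by key.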
| | |
## Links

- **GitHub**: [unixsysdev/adamba](https://github.com/unixsysdev/adamba)
- **Training**: [WandB](https://wandb.ai/dalletest123/nano-fractal)

## License

Apache 2.0