smol-llama 🦙 (360M): from-scratch LLaMA-style pretraining

This repository contains smol-llama, a ~360M parameter LLaMA-architecture causal LM trained from scratch for next-token prediction.
This project was primarily an educational + engineering effort to reproduce a SmolLM-like training setup at a smaller scale.

TL;DR: Wanted to see if it's possible to actually pretrain an LLM from scratch end-to-end: data pipeline → tokenizer → training → checkpoint → Hub.

Model Description

smol-llama is a compact implementation of the LLaMA architecture, featuring modern techniques like Grouped Query Attention (GQA), RoPE embeddings, and SwiGLU activations. It was trained on the weights-and-wires/fineweb-6b dataset.

NOTE: This is an early checkpoint / research-y base model. Expect imperfect generations.


Model Architecture

Component            Value
Parameters           360M
Hidden Dimension     960
Layers               32
Attention Heads      15 (query) / 5 (KV)
Head Dimension       64
Context Length       2048
Vocabulary Size      49,152
Architecture         LLaMA-style decoder-only
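
For reference, these dimensions map onto a Hugging Face LlamaConfig roughly as sketched below. The FFN intermediate size and RMSNorm epsilon are not listed in the table, so the values shown for them are illustrative assumptions, not the model's actual configuration.

from transformers import LlamaConfig

config = LlamaConfig(
  vocab_size=49_152,
  hidden_size=960,
  num_hidden_layers=32,
  num_attention_heads=15,
  num_key_value_heads=5,
  max_position_embeddings=2048,
  intermediate_size=2560,  # assumption: FFN width is not listed above
  hidden_act="silu",       # SwiGLU-style gated FFN
  rms_norm_eps=1e-5,       # assumption
)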

Key Features:

  • Grouped Query Attention (GQA): 3:1 query-to-KV-head ratio for efficient inference (see the sketch after this list)
  • RoPE: Rotary Position Embeddings for better length generalization
  • RMSNorm: Root Mean Square Layer Normalization
  • SwiGLU: Gated linear unit activation in FFN
  • Flash Attention 2: Memory-efficient attention computation
  • Gradient Checkpointing: trades recompute for activation memory, enabling larger batch sizes
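
As a rough illustration of the GQA layout above, the sketch below shares each KV head across three query heads (15 query / 5 KV). This is illustrative only and does not reproduce the repo's utils/model.py; RoPE and KV caching are omitted, and the class and argument names are invented for the example.

import torch
import torch.nn.functional as F
from torch import nn

class GQASelfAttention(nn.Module):
  def __init__(self, dim=960, n_heads=15, n_kv_heads=5, head_dim=64):
    super().__init__()
    self.n_heads, self.n_kv_heads, self.head_dim = n_heads, n_kv_heads, head_dim
    self.wq = nn.Linear(dim, n_heads * head_dim, bias=False)
    self.wk = nn.Linear(dim, n_kv_heads * head_dim, bias=False)
    self.wv = nn.Linear(dim, n_kv_heads * head_dim, bias=False)
    self.wo = nn.Linear(n_heads * head_dim, dim, bias=False)

  def forward(self, x):
    bsz, seqlen, _ = x.shape
    q = self.wq(x).view(bsz, seqlen, self.n_heads, self.head_dim).transpose(1, 2)
    k = self.wk(x).view(bsz, seqlen, self.n_kv_heads, self.head_dim).transpose(1, 2)
    v = self.wv(x).view(bsz, seqlen, self.n_kv_heads, self.head_dim).transpose(1, 2)
    # Each group of 3 query heads attends with the same KV head (15 / 5 = 3)
    k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
    v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # RoPE omitted for brevity
    return self.wo(out.transpose(1, 2).reshape(bsz, seqlen, -1))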

Training Details

Dataset

Trained on weights-and-wires/fineweb-6b, a curated subset of the FineWeb dataset containing ~6 billion high-quality web tokens.
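
A quick way to peek at the corpus is to stream it with the datasets library, as in the sketch below; the "text" column name is an assumption about the dataset schema.

from datasets import load_dataset

# Stream the corpus instead of downloading ~6B tokens up front
ds = load_dataset("weights-and-wires/fineweb-6b", split="train", streaming=True)

for example in ds.take(3):
  print(example["text"][:200])  # "text" column assumed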

Training Hyperparameters

Hyperparameter           Value
Optimizer                AdamW (fused)
Learning Rate            3e-4 (peak)
LR Schedule              Cosine with linear warmup
Warmup Steps             900
Total Steps              5,725 (~1 epoch)
Micro-Batch Size         64 sequences
Gradient Accumulation    8 steps
Effective Batch Size     512 sequences
Context Length           2048 tokens
Tokens per Step          ~1M
Total Tokens             ~6B
Precision                bfloat16
Gradient Clipping        1.0
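
The schedule above (linear warmup to the peak LR, then cosine decay) can be written as a small helper like the sketch below. This is illustrative and not the repo's utils/lr_schedule.py; the minimum LR is an assumption since it is not listed in the table.

import math

PEAK_LR = 3e-4
WARMUP_STEPS = 900
TOTAL_STEPS = 5_725
MIN_LR = PEAK_LR * 0.1  # assumption: the final LR floor is not stated above

def lr_at(step):
  # Linear warmup from 0 to the peak learning rate
  if step < WARMUP_STEPS:
    return PEAK_LR * (step + 1) / WARMUP_STEPS
  # Cosine decay from the peak to the minimum over the remaining steps
  progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
  return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(TOTAL_STEPS))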

Infrastructure

Resource           Specification
GPU                1× NVIDIA H100 (80 GB PCIe)
Training Time      ~22 hours
Throughput         ~75,000 tokens/sec
Cloud Provider     RunPod
Cost               ~$53 total
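
As a sanity check, the wall-clock time and per-token cost follow directly from the numbers above:

total_tokens = 6e9
tokens_per_sec = 75_000
total_cost_usd = 53

hours = total_tokens / tokens_per_sec / 3600
print(f"~{hours:.1f} hours of training")                                    # ~22.2 hours
print(f"~${total_cost_usd / (total_tokens / 1e9):.2f} per billion tokens")  # ~$8.83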

Training Loss

The model was trained for one full epoch over the dataset with checkpoints saved every 200 steps. Final training loss: ~2.8 (see training checkpoints for intermediate metrics).

Quick Start

Installation

uv add torch transformers accelerate

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "weights-and-wires/smol-llama"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
  model_name,
  torch_dtype=torch.bfloat16,
  device_map="auto"
)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Remove token_type_ids if present (not used by LLaMA models)
if 'token_type_ids' in inputs:
  del inputs['token_type_ids']

outputs = model.generate(
  **inputs,
  max_new_tokens=100,
  temperature=0.7,
  top_p=0.9,
  do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Advanced Generation

# More controlled generation
outputs = model.generate(
  **inputs,
  max_new_tokens=200,
  temperature=0.8,
  top_k=50,
  top_p=0.95,
  repetition_penalty=1.1,
  do_sample=True,
  pad_token_id=tokenizer.eos_token_id,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Batch Generation

prompts = [
  "Once upon a time",
  "The key to success is",
  "In the year 2050,",
]

# Left-pad for batch generation with a decoder-only model
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(
  **inputs,
  max_new_tokens=50,
  temperature=0.7,
  do_sample=True,
  pad_token_id=tokenizer.eos_token_id,
)

for i, output in enumerate(outputs):
  print(f"\nPrompt {i+1}: {prompts[i]}")
  print(f"Generated: {tokenizer.decode(output, skip_special_tokens=True)}")

Loading from Custom Checkpoint Format

If you want to load the original training checkpoints:

import torch
from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("weights-and-wires/smol-llama")

# Load custom checkpoint
checkpoint_path = "training_checkpoints/checkpoint_step_5000.pt"
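# On PyTorch >= 2.6, torch.load defaults to weights_only=True; pass weights_only=False if loading the full training checkpoint fails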
ckpt = torch.load(checkpoint_path, map_location="cuda")

# Create model from scratch (you'll need the model definition)
from utils.model import Llama, ModelArgs
model = Llama(ModelArgs()).cuda().to(torch.bfloat16)

# Handle torch.compile prefix if present
state_dict = {k.replace("_orig_mod.", ""): v for k, v in ckpt['model'].items()}
model.load_state_dict(state_dict)
model.eval()

# Generate
def generate(prompt, max_tokens=50):
  input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
  
  with torch.no_grad():
    for _ in range(max_tokens):
      logits, _ = model(input_ids[:, -2048:])
      next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
      input_ids = torch.cat([input_ids, next_token], dim=1)
      if next_token.item() == tokenizer.eos_token_id:
        break
  
  return tokenizer.decode(input_ids[0])

print(generate("The meaning of life is"))

Training Checkpoints

Intermediate training checkpoints are available in the training_checkpoints/ folder:

Checkpoint                  Steps    Tokens Seen    Loss
checkpoint_step_200.pt      200      ~200M          -
checkpoint_step_400.pt      400      ~400M          -
...                         ...      ...            -
checkpoint_step_4800.pt     4,800    ~4.8B          -
checkpoint_step_5000.pt     5,000    ~5B            -

These checkpoints include full training state (model, optimizer, step, loss) and can be used to resume training or analyze training dynamics.
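
For example, resuming training from one of these checkpoints looks roughly like the sketch below. The 'model' key matches the loading example above; the 'optimizer', 'step', and 'loss' keys are assumptions based on the description of the saved state.

import torch
from utils.model import Llama, ModelArgs

ckpt = torch.load("training_checkpoints/checkpoint_step_5000.pt", map_location="cuda")

model = Llama(ModelArgs()).cuda().to(torch.bfloat16)
model.load_state_dict({k.replace("_orig_mod.", ""): v for k, v in ckpt["model"].items()})

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)
optimizer.load_state_dict(ckpt["optimizer"])  # assumed key name
start_step = ckpt["step"] + 1                 # assumed key name
print(f"Resuming from step {start_step}, last recorded loss {ckpt['loss']:.3f}")  # assumed key name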

Limitations

This is a small model trained on a limited dataset (~6B tokens) for demonstration purposes. As such, it has several limitations:

  • Limited Knowledge: The model has only seen 6B tokens, compared to 100B+ for larger models
  • Generalization: May not perform well on out-of-distribution tasks
  • Factual Accuracy: Should not be relied upon for factual information
  • Biases: Inherits biases present in the web-scraped training data
  • No Instruction Tuning: This is a base model without instruction following or chat capabilities
  • No Safety Alignment: Has not undergone safety training or RLHF

Intended Use

This model is intended for:

  • Research and experimentation with small language models
  • Educational purposes and learning about LLM pre-training
  • Fine-tuning on downstream tasks (see the sketch after these lists)
  • Exploring efficient training techniques
  • Prototyping and proof-of-concept projects

This model is NOT intended for:

  • Production deployments without further fine-tuning
  • Safety-critical applications
  • Generating factual information without verification
  • Applications requiring instruction following (use an instruction-tuned variant)
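
For the fine-tuning use case above, a minimal starting point with the transformers Trainer might look like the sketch below; the toy dataset and hyperparameters are placeholders, not recommendations.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "weights-and-wires/smol-llama"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder corpus; swap in your downstream task data
texts = ["Example document one.", "Example document two."]
train_ds = Dataset.from_dict({"text": texts}).map(
  lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
  remove_columns=["text"],
)

trainer = Trainer(
  model=model,
  args=TrainingArguments(output_dir="smol-llama-ft", per_device_train_batch_size=2,
                         num_train_epochs=1, learning_rate=5e-5, bf16=True),
  train_dataset=train_ds,
  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()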

Training Code

The complete pre-training code is available in the model repository. Key components:

# Clone the repository
git clone https://github.com/weights-and-wires/smol-llama
cd smol-llama

# Install dependencies
uv sync

# Run training (requires GPU)
uv run pretrain.py

See the repository files for complete implementation details including:

  • Custom LLaMA architecture (utils/model.py)
  • Rotary embeddings (utils/rotary.py)
  • Data loading utilities (utils/data.py)
  • Checkpoint management (utils/checkpoint.py)
  • Learning rate scheduling (utils/lr_schedule.py)

Citation

If you use this model in your research, please cite:

@misc{smol-llama-2026,
  author = {Kashif, Ananya},
  title = {smol-llama: A 360M Parameter LLaMA Model Trained From Scratch},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/weights-and-wires/smol-llama}
}

Also consider citing the FineWeb dataset:

@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}

License

This model is released under the MIT License. See the LICENSE file for details.
