LLaMA 2.7B Distilled from 7B LoRA
This is a 2.7B parameter LLaMA model distilled from a 7B LoRA-tuned teacher model using the MiniLLM knowledge distillation framework.
Model Details
- Model Size: 2.7B parameters
- Teacher Model: LLaMA-7B with LoRA fine-tuning on Dolly
- Training Method: Knowledge Distillation (MiniLLM)
- Dataset: Databricks Dolly 15k
- Framework: MiniLLM
Key Features
- Compact Size: 60% smaller than the 7B teacher model
- Efficient Inference: Faster generation with a reduced memory footprint (see the loading sketch after this list)
- Knowledge Preservation: Retains much of the teacher's instruction-following performance through distillation
- Instruction Following: Fine-tuned for instruction-following tasks
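If memory is the main constraint, the model can also be loaded with 8-bit quantization via bitsandbytes. This is an optional deployment choice, not part of the released configuration; the repository name is the same placeholder used in the Usage section below, and bitsandbytes must be installed separately.

# Optional: 8-bit loading to reduce GPU memory further (requires bitsandbytes).
# Repository name is a placeholder, as in the Usage section below.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/llama-2.7b-distilled-from-7b-lora",
    quantization_config=bnb_config,
    device_map="auto",
)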
Training Configuration
- Distillation Method: MiniLLM with sequence-level KL divergence
- Policy Exploration: Temperature 0.2
- Response Scoring: Length-normalized, ratio 0.5
- Max Length: 512 tokens
- Batch Size: 4
- Learning Rate: 5e-06
- Gradient Accumulation: 4
- Training Steps: 3000
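For reference, the hyperparameters listed above can be gathered into a single configuration object, as in the sketch below. The dictionary keys are illustrative placeholders and do not correspond to the actual MiniLLM CLI flags or config fields.

# Illustrative summary of the hyperparameters listed above; key names are
# placeholders, not actual MiniLLM CLI flags or config fields.
distill_config = {
    "teacher_model": "llama-7b-lora-dolly",   # LoRA-tuned LLaMA-7B teacher (placeholder name)
    "student_model": "llama-2.7b",            # distilled 2.7B student (placeholder name)
    "kd_objective": "sequence_level_kl",      # MiniLLM distillation loss
    "exploration_temperature": 0.2,           # policy exploration temperature
    "length_norm_ratio": 0.5,                 # response scoring ratio
    "max_length": 512,
    "batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5e-6,
    "training_steps": 3000,
}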
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/llama-2.7b-distilled-from-7b-lora",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/llama-2.7b-distilled-from-7b-lora")

# Generate response
prompt = "Instruction: Explain what is machine learning.\n\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_length=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
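The prompt above follows a simple Instruction/Response template. A small helper like the one below (illustrative only; it is not shipped with the model or tokenizer) keeps that format consistent across prompts.

def build_prompt(instruction: str) -> str:
    # Illustrative helper, not part of the released model: wraps an instruction
    # in the same Instruction/Response template used in the example above.
    return f"Instruction: {instruction}\n\nResponse:"

prompt = build_prompt("Summarize the main idea behind knowledge distillation.")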
Performance Benefits
- Size: ~2.7B parameters (vs 7B teacher)
- Speed: ~2.5x faster inference
- Memory: ~60% less GPU memory required (a back-of-envelope estimate follows this list)
- Quality: Maintains strong instruction-following capabilities
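As a rough weight-only check on the memory figure (fp16, 2 bytes per parameter; activations and the KV cache are excluded, so real usage is higher for both models):

# Weight-only fp16 memory estimate: 2 bytes per parameter.
student_params = 2.7e9
teacher_params = 7.0e9
bytes_per_param = 2  # float16

student_gb = student_params * bytes_per_param / 1e9  # ~5.4 GB
teacher_gb = teacher_params * bytes_per_param / 1e9  # ~14.0 GB
savings = 1 - student_gb / teacher_gb                # ~0.61, i.e. roughly 60% less
print(f"student ~{student_gb:.1f} GB, teacher ~{teacher_gb:.1f} GB, ~{savings:.0%} less")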
Limitations
- May have reduced capabilities compared to the full 7B teacher model
- Trained primarily on English instruction-following tasks
- May generate biased or incorrect responses
- Should be used responsibly with appropriate safety measures
Training Details
This model was trained using the MiniLLM framework with:
- Distillation Loss: Sequence-level KL divergence (a simplified sketch follows this list)
- Policy Exploration: 4 samples per prompt
- Response Scoring: Length-normalized scoring with ratio 0.5
- Optimizer: AdamW
- Learning Rate Schedule: Cosine decay
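For intuition, the core of the objective is a KL term between the student and teacher distributions. The sketch below is a simplified token-level approximation that assumes access to both models' logits; the actual MiniLLM objective optimizes a sequence-level KL with policy-gradient estimates over student-generated samples, which this snippet does not reproduce.

import torch
import torch.nn.functional as F

def simplified_kl_term(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    # Simplified token-level KL(student || teacher), summed over the sequence and
    # averaged over the batch. Illustrative only: the full MiniLLM objective is a
    # sequence-level KL optimized with policy gradients on sampled responses.
    # Both inputs: [batch, seq_len, vocab_size].
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    kl_per_token = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return kl_per_token.sum(dim=-1).mean()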
Citation
If you use this model, please cite the MiniLLM paper:
@inproceedings{minillm,
  title={MiniLLM: Knowledge Distillation of Large Language Models},
  author={Gu, Yuxian and Dong, Li and Wei, Furu and Huang, Minlie},
  booktitle={Proceedings of ICLR},
  year={2024}
}
License
This model is released under the Apache 2.0 license. Note that LLaMA models are also subject to Meta's usage terms.