Code Completion with StarCoder
A Python implementation of an AI-powered code completion system using the StarCoder base model with LoRA fine-tuning. This project provides a lightweight yet powerful code completion capability that can be trained on custom datasets.
Features
- Fill-in-the-Middle (FIM) Capability: Handles both prefix-suffix code completion and middle-context completion
- LoRA Fine-tuning: Efficient parameter-efficient fine-tuning using Low-Rank Adaptation
- Modular Architecture: Clean separation between settings, model components, and training logic
- Customizable Training: Easily adjust hyperparameters through the settings file
- Apple Silicon Support: Optimized for running on Apple MPS devices (see the device-selection sketch below). Note: since Hugging Face does not currently support the MLX backend out of the box, training on Apple Silicon is slower than on a CUDA device.
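On Apple Silicon, PyTorch exposes the GPU through the MPS backend. A minimal device-selection snippet (standard PyTorch API, not code from this repository) looks like:

```python
import torch

# Pick the best available device: CUDA GPU, then Apple MPS, then CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")   # Apple Silicon GPU via Metal Performance Shaders
else:
    device = torch.device("cpu")

print(f"Running on: {device}")
```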
Requirements
- Python 3.8+
- PyTorch 2.0+
- Transformers 4.30+
- PEFT (Parameter-Efficient Fine-Tuning)
- Datasets
- Accelerate
- BitsAndBytes (for quantization)
Installation
# Clone the repository
git clone https://github.com/deep-learner-ConfigurableAI/code-completion.git
cd code-completion
# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install torch transformers datasets accelerate bitsandbytes peft tqdm
Project Structure
code-completion/
├── LICENSE
├── README.md
└── src/
    ├── main.py      # Entry point for the application
    ├── settings.py  # Configuration settings
    ├── model.py     # Core model implementation
    └── runner.py    # Training and inference logic
Configuration
All model and training configurations are centralized in src/settings.py. Key parameters include:
- Model checkpoint (MODEL)
- Training dataset (DATASET)
- Sequence length (SEQ_LENGTH)
- Training parameters (batch size, learning rate, etc.)
- LoRA configuration (rank, alpha, target modules)
- FIM transformation settings
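A rough sketch of what src/settings.py might look like; the names and default values below are illustrative assumptions, not the file's actual contents:

```python
# src/settings.py -- illustrative sketch only; the repository's actual names
# and defaults may differ.

MODEL = "bigcode/starcoderbase-1b"      # base checkpoint to fine-tune
DATASET = "your-username/your-dataset"  # Hugging Face dataset id (placeholder)
SEQ_LENGTH = 2048                       # length of packed training sequences

# Training parameters
BATCH_SIZE = 8
LEARNING_RATE = 5e-4
MAX_STEPS = 1000

# LoRA configuration
LORA_R = 8                              # rank of the low-rank update matrices
LORA_ALPHA = 32                         # scaling factor applied to the LoRA update
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = ["c_proj", "c_attn", "q_attn"]

# Fill-in-the-Middle transformation settings
FIM_RATE = 0.5                          # fraction of samples converted to FIM format
FIM_SPM_RATE = 0.5                      # of those, fraction using SPM token ordering
```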
Usage
Training
To train the model on your dataset:
- Update the settings.py file with your desired configuration
- Uncomment the train_model() line in main.py
- Run the following command:
cd src
python main.py
Snapshot of a training run: TrainOutput(global_step=1000, training_loss=0.7857105331420898, metrics={'train_runtime': 626.5932, 'train_samples_per_second': 12.767, 'train_steps_per_second': 1.596, 'train_tokens_per_second': 25841.328, 'total_flos': 9.961198190592e+16, 'train_loss': 0.7857105331420898, 'epoch': 0.8176614881439084})
Inference
To use the model for code completion:
- Ensure you have a trained model or use the provided checkpoint
- Uncomment the code_completion_demo() line in main.py
- Run:
cd src
python main.py
Custom Inference
You can also use the model programmatically:
from model import load_model_tokenizer, get_code_completion
from runner import load_model_for_inference
# Load model
model, tokenizer = load_model_for_inference()
# Example code completion
prefix = "def calculate_total(items):"
suffix = " return total"
completed_code = get_code_completion(model, tokenizer, prefix, suffix)
print(completed_code)
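Under the hood, a helper like get_code_completion presumably assembles a Fill-in-the-Middle prompt and calls model.generate. The sketch below is an assumption about that implementation (the complete_middle name and its details are hypothetical), shown only to make the flow concrete:

```python
import torch

def complete_middle(model, tokenizer, prefix, suffix, max_new_tokens=64):
    # Assemble a StarCoder-style FIM prompt; the model generates the missing middle.
    prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens so only the generated middle is returned.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(complete_middle(model, tokenizer, "def calculate_total(items):", "    return total"))
```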
How It Works
Fill-in-the-Middle (FIM): The model is trained to predict missing code between two context pieces (a prefix and a suffix).
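A simplified illustration of the FIM transformation applied to a training example. The sentinel tokens are the ones used by StarCoder-family tokenizers; the helper itself is a sketch, not the project's actual preprocessing code:

```python
def to_fim_example(prefix: str, middle: str, suffix: str) -> str:
    # PSM (prefix-suffix-middle) ordering: the model sees the prefix and suffix
    # as context and learns to produce the middle after the <fim_middle> token.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

print(to_fim_example(
    prefix="def calculate_total(items):\n    total = 0\n",
    middle="    for item in items:\n        total += item.price\n",
    suffix="    return total\n",
))
```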
LoRA Fine-tuning: Instead of fine-tuning all parameters, we use LoRA to efficiently adapt the pre-trained StarCoder model.
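A minimal sketch of attaching LoRA adapters with PEFT; the rank, alpha, and target module names below are illustrative assumptions rather than this project's exact settings:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-1b")

lora_config = LoraConfig(
    r=8,                               # rank of the low-rank update matrices
    lora_alpha=32,                     # scaling applied to the LoRA update
    lora_dropout=0.05,
    target_modules=["c_proj", "c_attn", "q_attn"],  # attention projections in StarCoder
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()     # only a small fraction of weights are trainable
```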
Dataset Processing: The training process formats the dataset into fixed-length chunks with FIM transformations applied.
Constant Length Dataset: For efficient training, examples are packed into fixed-length sequences.
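The sketch below illustrates the general idea of packing tokenized examples into fixed-length chunks; it is not the project's actual dataset class, and the pack_examples helper is hypothetical:

```python
from transformers import AutoTokenizer

def pack_examples(texts, tokenizer, seq_length=2048):
    """Concatenate tokenized texts and yield chunks of exactly seq_length token ids."""
    buffer = []
    for text in texts:
        buffer.extend(tokenizer(text)["input_ids"])
        buffer.append(tokenizer.eos_token_id)   # separate documents with an EOS token
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]           # one fixed-length training example
            buffer = buffer[seq_length:]        # carry the remainder into the next chunk

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase-1b")
chunks = list(pack_examples(["def add(a, b):\n    return a + b\n"] * 64, tokenizer, seq_length=128))
print(len(chunks), "chunks of", len(chunks[0]), "tokens each")
```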
Performance
The model's performance depends on:
- The quality and size of the training dataset
- The hyperparameters used (especially LoRA rank and learning rate)
- The number of training steps
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- BigCode Project for the StarCoder base model
- Hugging Face for their excellent Transformers library
- PEFT for the efficient fine-tuning implementation
Model: verma75preetam/peft-starcoder-lora-apple (base model: bigcode/starcoderbase-1b)