|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Salesforce/codet5-small |
|
|
tags: |
|
|
- cpp |
|
|
- complete |
|
|
--- |
|
|
|
|
|
|
|
|
# Codelander
|
|
|
|
|
--- |
|
|
|
|
|
## Overview
|
|
|
|
|
This specialized **CodeT5** model has been fine-tuned for **C++ code completion** tasks. |
|
|
It excels at understanding **C++ syntax** and **common programming patterns** to provide intelligent code suggestions as you type. |
|
|
|
|
|
--- |
|
|
|
|
|
## Key Features
|
|
|
|
|
- Context-aware completions for C++ functions, classes, and control structures


- Handles complex C++ syntax including **templates, STL, and modern C++ features**


- Trained on **competitive programming solutions** from high-quality Codeforces submissions


- Low latency suitable for **real-time editor integration**
|
|
|
|
|
--- |
|
|
|
|
|
## Model Performance
|
|
|
|
|
| Metric              | Value  |
|---------------------|--------|
| Training Loss       | 1.2475 |
| Validation Loss     | 1.0016 |
| Training Epochs     | 3      |
| Training Steps      | 14010  |
| Samples per second  | 6.275  |
|
|
|
|
|
--- |
|
|
|
|
|
## Installation & Usage
|
|
|
|
|
### Direct Integration with HuggingFace Transformers
|
|
|
|
|
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("outlander23/codelander")
tokenizer = AutoTokenizer.from_pretrained("outlander23/codelander")

# Generate completion
def get_completion(code_prefix, max_new_tokens=100):
    inputs = tokenizer(f"complete C++ code: {code_prefix}", return_tensors="pt")
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
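For a quick check, `get_completion` can be called directly. The prefix below is only illustrative, and because sampling is enabled the output will differ between runs; for editor-style latency, a smaller `max_new_tokens` or greedy decoding (`do_sample=False`) keeps responses fast.

```python
# Illustrative prefix (hypothetical); sampled output varies from run to run
prefix = "vector<long long> prefix_sums(const vector<int>& a) {"
print(get_completion(prefix, max_new_tokens=60))
```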
|
|
|
|
|
--- |
|
|
|
|
|
## Model Architecture
|
|
|
|
|
- Base Model: **Salesforce/codet5-small**


- Parameters: **~60M**


- Context Window: **512 tokens** (see the truncation sketch after this list)
|
|
- Fine-tuning: **Seq2Seq training on C++ code snippets** |
|
|
- Training Time: ~ **5 hours** |
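The 512-token window above bounds the encoder input, so very long prefixes need to be shortened before generation. Below is a minimal sketch that keeps the most recent tokens of the prefix; the helper name and the tail-keeping policy are assumptions for illustration, not part of the released pipeline.

```python
def truncate_prefix(code_prefix, tokenizer, max_length=512):
    # Tokenize with the same task prefix used at inference time
    ids = tokenizer(f"complete C++ code: {code_prefix}").input_ids
    if len(ids) <= max_length:
        return code_prefix
    # Keep the tail of the sequence: code nearest the cursor matters most
    return tokenizer.decode(ids[-max_length:], skip_special_tokens=True)
```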
|
|
|
|
|
--- |
|
|
|
|
|
## Training Data
|
|
|
|
|
- Dataset: **open-r1/codeforces-submissions** |
|
|
- Selection: **Accepted C++ solutions only** |
|
|
- Size: **50,000+ code samples** |
|
|
- Processing: **Prefix-suffix pairs with random splits** |
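The random-split pairing above is only summarized here. The snippet below is a hedged sketch of one way such (prefix, target) pairs could be produced from an accepted solution; the `make_pair` helper and its split policy are assumptions, not the exact preprocessing code.

```python
import random

def make_pair(solution_code, min_prefix=32, max_target=256):
    # Pick a random split point, keeping at least `min_prefix` characters of context
    split = random.randint(min_prefix, max(min_prefix + 1, len(solution_code) - 1))
    prefix = solution_code[:split]
    target = solution_code[split:split + max_target]
    return f"complete C++ code: {prefix}", target
```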
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations
|
|
|
|
|
- May generate syntactically correct but semantically incorrect code


- Limited knowledge of **domain-specific libraries** not present in training data


- May occasionally produce **incomplete code fragments**
|
|
|
|
|
--- |
|
|
|
|
|
## Example Completions
|
|
|
|
|
### Example 1: Factorial Function
|
|
|
|
|
**Input:** |
|
|
```cpp
int factorial(int n) {
    if (n <= 1) {
        return 1;
    } else {
```
|
|
|
|
|
**Completion:** |
|
|
```cpp
        return n * factorial(n - 1);
    }
}
```
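This style of completion can be reproduced with the `get_completion` helper from the usage section; since sampling is enabled, the generated text may differ from the completion shown above.

```python
prefix = """int factorial(int n) {
    if (n <= 1) {
        return 1;
    } else {"""
print(get_completion(prefix, max_new_tokens=40))
```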
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
|
|
## Training Details
|
|
|
|
|
- Training completed on: **2025-08-28 12:51:09 UTC** |
|
|
- Training epochs: **3/3** |
|
|
- Total steps: **14010** |
|
|
- Training loss: **1.2475** |
|
|
|
|
|
### Epoch Performance
|
|
|
|
|
| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.2638        | 1.1004          |
| 2     | 1.1551        | 1.0250          |
| 3     | 1.1081        | 1.0016          |
|
|
|
|
|
--- |
|
|
|
|
|
## Compatibility
|
|
|
|
|
- Compatible with **Transformers 4.30.0+**


- Optimized for **Python 3.8+**


- Supports both **CPU and GPU inference**
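A minimal device-selection sketch, assuming the PyTorch backend (the default for this model in Transformers) and reusing the model and tokenizer loaded in the usage section:

```python
import torch

# Prefer GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

inputs = tokenizer("complete C++ code: int main() {", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```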
|
|
|
|
|
--- |
|
|
|
|
|
## Credits
|
|
|
|
|
Made with ❤️ by **outlander23**
|
|
|
|
|
> "Good code is its own best documentation." β *Steve McConnell* |
|
|
|
|
|
--- |