# GraphCodeBERT Vulnerability Classifier

A multi-label code vulnerability detection model that identifies 31 vulnerability classes (30 CWEs + "safe") mapped to the OWASP Top 10 2021 categories. Fine-tuned from CodeBERTa-small-v1 on 175K+ labeled code samples.
## Quick Start
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "ayshajavd/graphcodebert-vuln-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Example input: a function with a SQL injection flaw (CWE-89)
code = """
import sqlite3

def get_user(username):
    query = f"SELECT * FROM users WHERE username = '{username}'"
    conn = sqlite3.connect('db.sqlite')
    return conn.execute(query).fetchone()
"""

inputs = tokenizer(code, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Multi-label: apply sigmoid per class, not softmax
probs = torch.sigmoid(logits).squeeze()

TARGET_CWES = ["safe", "CWE-20", "CWE-22", "CWE-78", "CWE-79", "CWE-89", "CWE-94",
               "CWE-119", "CWE-125", "CWE-190", "CWE-200", "CWE-264", "CWE-269", "CWE-276",
               "CWE-284", "CWE-287", "CWE-310", "CWE-327", "CWE-330", "CWE-352", "CWE-362",
               "CWE-399", "CWE-401", "CWE-416", "CWE-434", "CWE-476", "CWE-502", "CWE-601",
               "CWE-787", "CWE-798", "CWE-918"]

threshold = 0.5
for cwe, prob in zip(TARGET_CWES, probs):
    if prob > threshold:
        print(f"{cwe}: {prob:.3f}")
```
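To rank the flagged classes by confidence rather than print them in label order, a small helper like the following can be used (`top_predictions` is a hypothetical convenience function for illustration, not part of the model's API):

```python
def top_predictions(probs, labels, threshold=0.5):
    """Return (label, probability) pairs above `threshold`, highest first."""
    pairs = [(label, float(p)) for label, p in zip(labels, probs)]
    return sorted((p for p in pairs if p[1] >= threshold),
                  key=lambda p: p[1], reverse=True)

# Dummy probabilities, not real model output:
labels = ["safe", "CWE-89", "CWE-79"]
print(top_predictions([0.10, 0.93, 0.61], labels, threshold=0.5))
# → [('CWE-89', 0.93), ('CWE-79', 0.61)]
```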
## Model Details

| Property | Value |
|---|---|
| Architecture | `RobertaForSequenceClassification` (6 layers, 768 hidden, 83.5M params) |
| Base Model | CodeBERTa-small-v1 |
| Task | Multi-label classification (BCEWithLogitsLoss with class weights) |
| Labels | 31 (30 CWE categories + "safe") |
| Max Sequence Length | 512 tokens |
| Recommended Threshold | 0.5 (balanced precision/recall) or 0.3 (high recall, security-first) |
## Supported Languages
Python, JavaScript, Java, C, C++, PHP, Go
The model was trained on a diverse multi-language dataset. Performance is strongest on C/C++ (largest training subset from BigVul) and Python/JavaScript (from the multi-language datasets).
## Evaluation Results (Test Set — 5,000 samples)

### Threshold Comparison

| Threshold | Macro F1 | Micro F1 | Weighted F1 | Macro Precision | Macro Recall |
|---|---|---|---|---|---|
| 0.2 | 0.066 | 0.301 | 0.859 | 0.048 | 0.562 |
| 0.3 | 0.081 | 0.458 | 0.865 | 0.057 | 0.502 |
| 0.4 | 0.101 | 0.626 | 0.870 | 0.070 | 0.439 |
| 0.5 | 0.125 | 0.739 | 0.870 | 0.088 | 0.366 |
### Per-Class Performance (threshold=0.3)

#### OWASP A01:2021 — Broken Access Control

| CWE | Name | Support | Precision | Recall | F1 |
|---|---|---|---|---|---|
| CWE-22 | Path Traversal | 2 | 0.000 | 0.000 | 0.000 |
| CWE-200 | Information Exposure | 30 | 0.063 | 0.800 | 0.117 |
| CWE-264 | Permissions/Privileges | 23 | 0.025 | 0.696 | 0.049 |
| CWE-269 | Improper Privilege Mgmt | 1 | 0.000 | 0.000 | 0.000 |
| CWE-276 | Incorrect Permissions | 0 | — | — | — |
| CWE-284 | Access Control | 5 | 0.000 | 0.000 | 0.000 |
| CWE-352 | CSRF | 1 | 0.000 | 0.000 | 0.000 |
| CWE-601 | Open Redirect | 0 | — | — | — |
#### OWASP A02:2021 — Cryptographic Failures

| CWE | Name | Support | Precision | Recall | F1 |
|---|---|---|---|---|---|
| CWE-310 | Cryptographic Issues | 5 | 0.000 | 0.000 | 0.000 |
| CWE-327 | Broken Crypto Algorithm | 1 | 0.000 | 0.000 | 0.000 |
| CWE-330 | Insufficient Randomness | 1 | 0.000 | 0.000 | 0.000 |
#### OWASP A03:2021 — Injection

| CWE | Name | Support | Precision | Recall | F1 |
|---|---|---|---|---|---|
| CWE-20 | Input Validation | 69 | 0.023 | 0.957 | 0.046 |
| CWE-78 | Command Injection | 1 | 0.011 | 1.000 | 0.021 |
| CWE-79 | XSS | 16 | 0.084 | 0.750 | 0.151 |
| CWE-89 | SQL Injection | 15 | 0.096 | 1.000 | 0.174 |
| CWE-94 | Code Injection | 27 | 0.123 | 1.000 | 0.220 |
| CWE-119 | Buffer Overflow | 118 | 0.088 | 0.898 | 0.160 |
| CWE-125 | Out-of-bounds Read | 35 | 0.048 | 0.829 | 0.091 |
| CWE-190 | Integer Overflow | 14 | 0.033 | 1.000 | 0.064 |
| CWE-401 | Memory Leak | 2 | 0.022 | 1.000 | 0.044 |
| CWE-416 | Use After Free | 20 | 0.048 | 0.400 | 0.086 |
| CWE-476 | NULL Pointer Deref | 30 | 0.032 | 0.867 | 0.061 |
| CWE-787 | Out-of-bounds Write | 46 | 0.052 | 0.891 | 0.099 |
#### OWASP A04:2021 — Insecure Design

| CWE | Name | Support | Precision | Recall | F1 |
|---|---|---|---|---|---|
| CWE-362 | Race Condition | 11 | 0.035 | 0.636 | 0.065 |
| CWE-399 | Resource Management | 21 | 0.008 | 0.857 | 0.015 |
| CWE-434 | File Upload | 0 | — | — | — |
#### OWASP A07–A10

| CWE | Name | Support | Precision | Recall | F1 |
|---|---|---|---|---|---|
| CWE-287 | Authentication | 0 | — | — | — |
| CWE-798 | Hardcoded Credentials | 0 | — | — | — |
| CWE-502 | Deserialization | 10 | 0.056 | 1.000 | 0.106 |
| CWE-918 | SSRF | 0 | — | — | — |
### Key Metric: Safe Code Detection

| Class | Support | Precision | Recall | F1 |
|---|---|---|---|---|
| safe | 4,496 | 0.927 | 0.975 | 0.950 |
## Model Strengths
- Excellent recall on many vulnerability classes (0.75–1.0 for SQL injection, buffer overflow, XSS, code injection, etc.)
- Strong safe code detection (F1=0.95) — reliably identifies secure code
- High sensitivity — at threshold 0.3, catches most real vulnerabilities (macro recall=0.50)
## Model Limitations
- Low precision on rare classes — many false positives, especially on CWEs with few training examples
- Precision can be improved by using threshold=0.5 (macro F1 improves to 0.125 but recall drops)
- Classes with 0 test support cannot be evaluated
**Design choice:** For security applications, this model prioritizes recall (catching real vulnerabilities) over precision (reducing false positives), since missing a real vulnerability (a false negative) is costlier than flagging safe code (a false positive).
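The recall-vs-precision tradeoff behind this choice can be illustrated with a toy threshold sweep over synthetic scores for a single class (plain-Python metrics for illustration; not the model's evaluation code):

```python
def precision_recall(y_true, y_prob, threshold):
    """Binary precision/recall for one class at a given decision threshold."""
    y_pred = [p >= threshold for p in y_prob]
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# Synthetic scores: 3 true positives, 5 true negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.35, 0.45, 0.2, 0.1, 0.4, 0.05]
for t in (0.3, 0.5):
    p, r = precision_recall(y_true, y_prob, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold from 0.5 to 0.3 recovers the borderline true positive at the cost of extra false alarms, which is exactly the pattern in the threshold-comparison table above.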
## Training Data
The model was trained on the code-security-vulnerability-dataset (175,419 samples), combining:
- **BigVul** — 265K C/C++ functions mined from real CVE-fixing commits
- **CWE-enriched BigVul/PrimeVul** — balanced CWE-labeled subset
- **Code Vulnerability Labeled** — multi-language (Python, JS, Java, PHP, Go)
- **CyberNative DPO** — vulnerable/secure code pairs
### Training Configuration

| Parameter | Value |
|---|---|
| Epochs | 2 |
| Batch Size | 8 |
| Learning Rate | 5e-5 |
| Scheduler | Cosine with warmup (50 steps) |
| Loss | BCEWithLogitsLoss (class-weighted, pos_weight clipped to 30x) |
| Training Subset | 20K balanced samples |
| Optimizer | AdamW (fused) |
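The class-weighting scheme above can be sketched as follows: each label's `pos_weight` is the negative-to-positive ratio, clipped at 30x so extremely rare CWEs do not dominate the loss. The counts and the `pos_weights` helper are illustrative assumptions, not the released training code:

```python
def pos_weights(pos_counts, n_samples, clip=30.0):
    """Per-label pos_weight = N_neg / N_pos, clipped so rare classes don't explode."""
    return [min((n_samples - p) / max(p, 1), clip) for p in pos_counts]

counts = [18000, 600, 40, 3]      # hypothetical positives per label
weights = pos_weights(counts, 20000)
print(weights)                    # the three rare classes hit the 30x cap
# In training, these would be passed as
# torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(weights))
```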
## Limitations
- **Class imbalance:** Many rare CWE types have very few training examples, leading to high false positive rates
- **Sequence length:** Limited to 512 tokens — long functions may be truncated
- **Language bias:** Strongest on C/C++ due to BigVul's dominance. Go and PHP performance may be lower
- **Single-function analysis:** Analyzes individual functions, not cross-function or cross-file vulnerabilities
- **Not a replacement:** Should complement manual review and established SAST tools (Semgrep, CodeQL, etc.)
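One way to mitigate the 512-token truncation is to score a long function in overlapping windows of token ids and take the per-label maximum across windows. A minimal sketch, assuming the ids come from the tokenizer in the Quick Start (`sliding_windows` and the window/stride values are illustrative, not part of the released code):

```python
def sliding_windows(token_ids, window=512, stride=384):
    """Split a token-id sequence into overlapping windows covering every token."""
    starts = list(range(0, max(len(token_ids) - window, 0) + 1, stride))
    if starts[-1] + window < len(token_ids):
        # Add one final window aligned to the end so no tail tokens are dropped
        starts.append(len(token_ids) - window)
    return [token_ids[s:s + window] for s in starts]

chunks = sliding_windows(list(range(1000)))
print([(c[0], c[-1]) for c in chunks])   # window boundaries cover 0..999
```

Each chunk would then be run through the model separately, and a label counts as detected if any window exceeds the threshold.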
## Interactive Demo
Try the model in our Code Security Analyzer Space — paste any code and get a full security report with OWASP mapping, severity scores, attack chain analysis, and suggested fixes.
## Citation
```bibtex
@misc{graphcodebert-vuln-classifier,
  title={GraphCodeBERT Vulnerability Classifier: Multi-label CWE Detection Mapped to OWASP Top 10},
  author={ayshajavd},
  year={2025},
  url={https://huggingface.co/ayshajavd/graphcodebert-vuln-classifier}
}
```