jacpacd/vuln-detector-codebert


	---
	license: mit
	language:
	- code
	library_name: transformers
	tags:
	- text-classification
	- code-classification
	- vulnerability-detection
	- automatic-vulnerability-detection
	- secure-coding
	---

	# Vulnerability Detector for C Code (SARD)

	This model is a fine-tuned version of `microsoft/codebert-base` designed to detect vulnerabilities in C source code functions.

	## Model Description

	This is a binary text-classification model that takes a C function as input and classifies it as either Vulnerable (`LABEL_1`) or Safe (`LABEL_0`).

	The model was fine-tuned on the [NIST SARD (Software Assurance Reference Dataset)](https://samate.nist.gov/SARD/), focusing on common C weaknesses such as memory leaks, buffer overflows, and other CWEs covered by the Juliet Test Suite. Because SARD test cases are clean and consistently structured, the model reached very high accuracy on the validation set.
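
The `text-classification` pipeline in `transformers` returns each prediction as a dictionary with `label` and `score` keys. The sketch below shows one way to turn that raw output into a readable verdict. The helper name `interpret_prediction`, the verdict strings, and the threshold are illustrative assumptions, not part of the model.

```python
# Label meanings as stated in this model card.
LABEL_NAMES = {"LABEL_0": "safe", "LABEL_1": "vulnerable"}


def interpret_prediction(prediction: dict, threshold: float = 0.5) -> str:
    """Map one pipeline prediction to 'vulnerable', 'safe', or 'uncertain'.

    `threshold` is a hypothetical confidence cutoff chosen for illustration.
    """
    verdict = LABEL_NAMES.get(prediction["label"], "unknown")
    # Below the cutoff, report the prediction as uncertain instead of trusting it.
    if prediction["score"] < threshold:
        return "uncertain"
    return verdict


# Hand-written dictionaries in the pipeline's output shape, not real model output.
print(interpret_prediction({"label": "LABEL_1", "score": 0.97}))
print(interpret_prediction({"label": "LABEL_0", "score": 0.55}, threshold=0.9))
```

Raising `threshold` trades coverage for caution: low-confidence predictions are then flagged for manual review instead of being reported as safe or vulnerable.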

	## Intended Uses & Limitations

	This model is intended as a proof-of-concept tool to assist developers in identifying potentially vulnerable code patterns during the development lifecycle.

	Limitations:
	* The model is highly specialized for the types of vulnerabilities found in the SARD dataset. Its performance on real-world, messy, or obfuscated code may be lower.
	* It should be used as an assistive tool, not as a replacement for comprehensive security audits or other static analysis tools.
	* The model classifies entire functions and may not pinpoint the exact line of code responsible for the vulnerability.
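
Because the model classifies whole functions, a source file has to be split into functions before classification. The following is a rough sketch of one way to do that by brace matching. The helper `split_top_level_blocks` is hypothetical, and it assumes braces are balanced and never appear inside string literals or comments; a real workflow would more likely use a proper C parser.

```python
from typing import List


def split_top_level_blocks(source: str) -> List[str]:
    """Naively split C source into chunks that each end at a top-level '}'.

    Caveats of this sketch: each chunk also contains whatever text precedes
    the block (includes, globals), and struct or enum bodies are returned
    as chunks too.
    """
    blocks = []
    depth = 0   # current brace nesting level
    start = 0   # index where the current top-level chunk begins
    for i, ch in enumerate(source):
        if ch == "{":
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                # A top-level block just closed; emit everything since `start`.
                chunk = source[start:i + 1].strip()
                if chunk:
                    blocks.append(chunk)
                start = i + 1
    return blocks


example = "void a()\n{\n    int x;\n}\n\nvoid b()\n{\n    if (1) { }\n}\n"
for block in split_top_level_blocks(example):
    print(block.splitlines()[0])
```

Each returned chunk can then be passed to the classifier on its own, which gives function-level rather than line-level localization.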

	## How to Use

	The model can be loaded with the `pipeline` API from the `transformers` library.

	```python
	from transformers import pipeline

	# Load the classifier pipeline
	classifier = pipeline("text-classification", model="jacpacd/vuln-detector-codebert-c-sard")

	# Example of a vulnerable C function (Memory Leak)
	vulnerable_code = """
	void CWE401_Memory_Leak__strdup_char_01_bad()
	{
	    char * data;
	    data = NULL;
	    {
	        char myString[] = "myString";
	        /* POTENTIAL FLAW: Allocate memory from the heap */
	        data = strdup(myString);
	        printLine(data);
	    }
	    /* POTENTIAL FLAW: No deallocation of memory */
	    ;
	}
	"""

	# Example of a safe C function
	safe_code = """
	void CWE401_Memory_Leak__strdup_char_01_goodB2G()
	{
	    char * data;
	    data = NULL;
	    {
	        char myString[] = "myString";
	        data = strdup(myString);
	        printLine(data);
	    }
	    /* FIX: Deallocate memory */
	    free(data);
	}
	"""

	results_vuln = classifier(vulnerable_code)
	results_safe = classifier(safe_code)

	print(f"Vulnerable Code Prediction: {results_vuln[0]}")
	# Expected output: {'label': 'LABEL_1', 'score': 0.99...}

	print(f"Safe Code Prediction: {results_safe[0]}")
	# Expected output: {'label': 'LABEL_0', 'score': 0.99...}