---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
pipeline_tag: feature-extraction
tags:
- smart-contract
- web3
- software-engineering
- embedding
- codebert
- solidity
- code-understanding
library_name: transformers
datasets:
- web3se/smart-contract-intent-vul-dataset
---
# SmartBERT V2 CodeBERT

## Overview
SmartBERT V2 CodeBERT is a domain-adapted pre-trained model built on top of CodeBERT-base-mlm.
It is designed to learn high-quality semantic representations of smart contract code, particularly at the function level.
The model is further pre-trained on a large corpus of smart contracts using the Masked Language Modeling (MLM) objective.
This domain-adaptive pretraining enables the model to better capture semantic patterns, structure, and intent within smart contract functions compared to general-purpose code models.
SmartBERT V2 can be used for tasks such as:
- Smart contract intent detection
- Code similarity analysis
- Vulnerability analysis
- Smart contract classification
- Code embedding and retrieval
SmartBERT V2 is a pre-trained model specifically developed for SmartIntent V2. It was trained on 16,000 smart contracts, with no overlap with the SmartIntent V2 evaluation dataset to avoid data leakage.
For production use or general smart contract representation tasks, we recommend SmartBERT V3: https://huggingface.co/web3se/SmartBERT-v3

## Training Data
SmartBERT V2 was trained on a corpus of approximately 16,000 smart contracts, primarily written in Solidity and collected from public blockchain repositories.
To better model smart contract behavior, contracts were processed at the function level, enabling the model to learn fine-grained semantic representations of smart contract functions.
For benchmarking purposes in the SmartIntent V2 study, the pretraining corpus was intentionally limited to this 16,000-contract dataset.
The evaluation dataset (4,000 smart contracts) was strictly held out and not included in the pretraining data, ensuring that downstream evaluations remain unbiased and free from data leakage.

## Preprocessing
During preprocessing, all newline (`\n`) and tab (`\t`) characters in the function code were normalized by replacing them with a single space.
This ensures a consistent input format for the tokenizer and avoids unnecessary token fragmentation.
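This normalization step can be sketched as follows; the helper name is illustrative, not taken from the actual training code:

```python
import re

def normalize_function_code(code: str) -> str:
    """Replace each newline and tab in a function body with a single space."""
    return re.sub(r"[\n\t]", " ", code)

src = "function transfer(\n\taddress to,\n\tuint256 amount\n) external;"
print(normalize_function_code(src))
```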

## Base Model
SmartBERT V2 is initialized from:
- Base Model: CodeBERT-base-mlm
CodeBERT is a transformer-based model trained on source code and natural language pairs.
SmartBERT V2 further adapts this model to the smart contract domain through continued pretraining.

## Training Objective
The model is trained using the Masked Language Modeling (MLM) objective, following the same training paradigm as the original CodeBERT model.
During training:
- A subset of tokens is randomly masked.
- The model learns to predict the masked tokens based on surrounding context.
- This encourages the model to learn deeper structural and semantic representations of smart contract code.
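The masking step can be illustrated with a small, self-contained sketch over whitespace-split tokens. The real pipeline operates on the tokenizer's subword vocabulary (typically via HuggingFace's MLM data collator); `mask_tokens` here is a hypothetical helper for illustration only:

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token
MASK_PROB = 0.15       # standard BERT/CodeBERT masking rate

def mask_tokens(tokens, mask_prob=MASK_PROB, seed=1):
    """Randomly replace a subset of tokens with <mask>; return the masked
    sequence and a map from masked positions to the original tokens."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels[i] = tok  # the model must recover this token
        else:
            masked.append(tok)
    return masked, labels

tokens = "function totalSupply ( ) external view returns ( uint256 ) ;".split()
masked, labels = mask_tokens(tokens)
print(masked)
```

Training then minimizes the cross-entropy between the model's predictions at the masked positions and the original tokens in `labels`.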

## Training Setup
Training was conducted using the HuggingFace Transformers framework with the following configuration:
- Hardware: 2 × Nvidia A100 (80GB)
- Training Duration: ~10 hours
- Training Dataset: 16,000 smart contracts
- Evaluation Dataset: 4,000 smart contracts
Example training configuration:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10000,
    save_total_limit=2,
    evaluation_strategy="steps",
    eval_steps=10000,
    resume_from_checkpoint=checkpoint,
)
```

## Evaluation
The model was evaluated on a held-out dataset of approximately 4,000 smart contracts to monitor training stability and generalization during pretraining.
SmartBERT V2 is primarily intended as a representation learning model, providing high-quality embeddings for downstream smart contract analysis tasks.

## How to Use
You can load SmartBERT V2 using the HuggingFace Transformers library.
```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("web3se/SmartBERT-v2")
model = RobertaModel.from_pretrained("web3se/SmartBERT-v2")

code = "function totalSupply() external view returns (uint256);"

inputs = tokenizer(
    code,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

with torch.no_grad():
    outputs = model(**inputs)

# Option 1: CLS embedding
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Option 2: Mean pooling (recommended for code representation)
mean_embedding = outputs.last_hidden_state.mean(dim=1)
```
Mean pooling is often recommended when using the model for code representation or similarity tasks.
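For similarity or retrieval tasks, the pooled embeddings can then be compared with cosine similarity. A minimal, framework-free sketch, using toy vectors in place of real model outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (as plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; in practice these would be mean-pooled model outputs.
emb_a = [0.2, 0.1, 0.4]
emb_b = [0.2, 0.1, 0.4]
print(round(cosine_similarity(emb_a, emb_b), 4))  # identical vectors -> 1.0
```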

## GitHub Repository
To train, fine-tune, or deploy SmartBERT for Web API services, please refer to our GitHub repository:
https://github.com/web3se-lab/SmartBERT

## Citation
If you use SmartBERT in your research, please cite:
```bibtex
@article{huang2025smart,
  title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
  author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
  journal={arXiv preprint arXiv:2508.20086},
  year={2025}
}
```
