---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
pipeline_tag: feature-extraction
tags:
- smart-contract
- web3
- software-engineering
- embedding
- codebert
- solidity
- code-understanding
library_name: transformers
datasets:
- web3se/smart-contract-intent-vul-dataset
---
# SmartBERT V2 CodeBERT

## Overview
SmartBERT V2 CodeBERT is a domain-adapted pre-trained model built on top of CodeBERT-base-mlm.
It is designed to learn high-quality semantic representations of smart contract code, particularly at the function level.
The model is further pre-trained on a large corpus of smart contracts using the Masked Language Modeling (MLM) objective.
This domain-adaptive pretraining enables the model to better capture semantic patterns, structure, and intent within smart contract functions compared to general-purpose code models.
SmartBERT V2 can be used for tasks such as:
- Smart contract intent detection
- Code similarity analysis
- Vulnerability analysis
- Smart contract classification
- Code embedding and retrieval
SmartBERT V2 is a pre-trained model specifically developed for SmartIntent V2. It was trained on 16,000 smart contracts, with no overlap with the SmartIntent V2 evaluation dataset to avoid data leakage.
For production use or general smart contract representation tasks, we recommend SmartBERT V3: https://huggingface.co/web3se/SmartBERT-v3

## Training Data
SmartBERT V2 was trained on a corpus of approximately 16,000 smart contracts, primarily written in Solidity and collected from public blockchain repositories.
To better model smart contract behavior, contracts were processed at the function level, enabling the model to learn fine-grained semantic representations of smart contract functions.
For benchmarking purposes in the SmartIntent V2 study, the pretraining corpus was intentionally limited to this 16,000-contract dataset.
The evaluation dataset (4,000 smart contracts) was strictly held out and not included in the pretraining data, ensuring that downstream evaluations remain unbiased and free from data leakage.

## Preprocessing
During preprocessing, all newline (`\n`) and tab (`\t`) characters in the function code were normalized by replacing them with a single space.
This ensures a consistent input format for the tokenizer and avoids unnecessary token fragmentation.
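This normalization step can be sketched as follows; the helper name is illustrative, not taken from the actual training code:

```python
import re

def normalize_function_code(code: str) -> str:
    """Replace each newline and tab in a function body with a single space."""
    return re.sub(r"[\n\t]", " ", code)

src = "function transfer(\n\taddress to,\n\tuint256 amount\n) external;"
print(normalize_function_code(src))
```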

## Base Model
SmartBERT V2 is initialized from:
- Base Model: CodeBERT-base-mlm
CodeBERT is a transformer-based model trained on source code and natural language pairs.
SmartBERT V2 further adapts this model to the smart contract domain through continued pretraining.

## Training Objective
The model is trained using the Masked Language Modeling (MLM) objective, following the same training paradigm as the original CodeBERT model.
During training:
- A subset of tokens is randomly masked.
- The model learns to predict the masked tokens based on surrounding context.
- This encourages the model to learn deeper structural and semantic representations of smart contract code.
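The masking step can be illustrated with a small, self-contained sketch over whitespace-split tokens. The real pipeline operates on the tokenizer's subword vocabulary (typically via HuggingFace's MLM data collator); `mask_tokens` here is a hypothetical helper for illustration only:

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token
MASK_PROB = 0.15       # standard BERT/CodeBERT masking rate

def mask_tokens(tokens, mask_prob=MASK_PROB, seed=1):
    """Randomly replace a subset of tokens with <mask>; return the masked
    sequence and a map from masked positions to the original tokens."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels[i] = tok  # the model must recover this token
        else:
            masked.append(tok)
    return masked, labels

tokens = "function totalSupply ( ) external view returns ( uint256 ) ;".split()
masked, labels = mask_tokens(tokens)
print(masked)
```

Training then minimizes the cross-entropy between the model's predictions at the masked positions and the original tokens in `labels`.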

## Training Setup
Training was conducted using the HuggingFace Transformers framework with the following configuration:
- Hardware: 2 × Nvidia A100 (80GB)
- Training Duration: ~10 hours
- Training Dataset: 16,000 smart contracts
- Evaluation Dataset: 4,000 smart contracts
Example training configuration:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10000,
    save_total_limit=2,
    evaluation_strategy="steps",
    eval_steps=10000,
    resume_from_checkpoint=checkpoint,
)
```

## Evaluation
The model was evaluated on a held-out dataset of approximately 4,000 smart contracts to monitor training stability and generalization during pretraining.
SmartBERT V2 is primarily intended as a representation learning model, providing high-quality embeddings for downstream smart contract analysis tasks.

## How to Use
You can load SmartBERT V2 using the HuggingFace Transformers library.
```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("web3se/SmartBERT-v2")
model = RobertaModel.from_pretrained("web3se/SmartBERT-v2")

code = "function totalSupply() external view returns (uint256);"

inputs = tokenizer(
    code,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

with torch.no_grad():
    outputs = model(**inputs)

# Option 1: CLS embedding
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Option 2: Mean pooling (recommended for code representation)
mean_embedding = outputs.last_hidden_state.mean(dim=1)
```
Mean pooling is often recommended when using the model for code representation or similarity tasks.
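For similarity or retrieval tasks, the pooled embeddings can then be compared with cosine similarity. A minimal, framework-free sketch, using toy vectors in place of real model outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (as plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; in practice these would be mean-pooled model outputs.
emb_a = [0.2, 0.1, 0.4]
emb_b = [0.2, 0.1, 0.4]
print(round(cosine_similarity(emb_a, emb_b), 4))  # identical vectors -> 1.0
```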

## GitHub Repository
To train, fine-tune, or deploy SmartBERT for Web API services, please refer to our GitHub repository:
https://github.com/web3se-lab/SmartBERT

## Citation
If you use SmartBERT in your research, please cite:
```bibtex
@article{huang2025smart,
  title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
  author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
  journal={arXiv preprint arXiv:2508.20086},
  year={2025}
}
```
