GenerRNA: A Generative Language Model for de novo RNA Design
GenerRNA is a generative pre-trained language model for de novo RNA sequence design. It is a Transformer (decoder-only, GPT-style) model that learns the "language" of RNA from millions of natural sequences and can generate novel, realistic RNA sequences without any structural input, functional label, or sequence alignment. To our knowledge, GenerRNA is the first application of a generative language model to RNA generation.
With GenerRNA you can:
- Generate RNA in a zero-shot manner to explore the RNA sequence space, or
- Fine-tune on your own dataset to generate RNAs belonging to a particular family or possessing specific characteristics (e.g., high binding affinity to a target protein).
Developed by Preferred Networks, Inc. and The University of Tokyo. Introduced in PLOS ONE (2024): GenerRNA: A generative pre-trained language model for de novo RNA design.
Table of Contents
- Model Summary
- Key Features
- Model Details
- Intended Use & Use Cases
- Requirements
- Quickstart
- Training & Fine-tuning
- Repository Structure
- Training Data
- Limitations
- FAQ
- Citation
- License
Model Summary
GenerRNA is a Transformer decoder-only (GPT-style) language model trained on RNA nucleotide sequences. By treating RNA as a sequence of tokens, it learns statistical and structural regularities of RNA directly from data and can then sample entirely new sequences. GenerRNA was pre-trained on ~16 million RNA sequences (16.09M), encompassing ~17.4 billion nucleotides. Generated RNAs are novel (distinct from training sequences) yet fold into stable secondary structures, and the model can be fine-tuned to design functional RNAs such as protein binders, all without requiring prior structural knowledge.
Key Features
- De novo RNA generation: create novel RNA sequences from scratch; no structure, label, or alignment required.
- Zero-shot or fine-tuned: explore RNA space out of the box, or specialize the model for a target family or function.
- Structurally plausible outputs: generated sequences fold into stable secondary structures (low minimum free energy).
- Transformer / GPT architecture: a familiar, scalable decoder-only design (~350M parameters).
- Two checkpoints provided: an updated long-context model and the original historical model.
- Open & reproducible: MIT-licensed code, tokenizer, checkpoints, and the data behind the paper's figures.
Model Details
| Field | Value |
|---|---|
| Model type | Generative language model (decoder-only Transformer, GPT-style) |
| Domain | RNA / nucleotide sequences |
| Parameters | 350M (24 transformer layers, model dimension 1280) |
| Context window | 1024 tokens (~4000 nucleotides) |
| Tokenizer | Byte-Pair Encoding (BPE), vocabulary size 1024 |
| Checkpoints | model_updated.pt (recommended; longer context, deduplicated data) · original split model in experiment_data/historical_version/ |
| Framework | PyTorch (≥ 2.0) |
| License | MIT |
| Paper | PLOS ONE 19(10):e0310814 (2024) · doi:10.1371/journal.pone.0310814 |
| Developed by | Preferred Networks, Inc. & The University of Tokyo |
Intended Use & Use Cases
GenerRNA is intended for research in RNA biology, synthetic biology, and RNA-based therapeutics / drug discovery. Typical use cases include:
- Exploring the diversity of the RNA sequence space.
- Generating candidate RNAs from a target family by fine-tuning on family-specific data.
- Designing RNAs with desired functional properties, such as aptamers/binders with high affinity to a target protein (demonstrated for the RNA-binding proteins ELAVL1 and SRSF1 in the paper).
- Serving as a pre-trained backbone for downstream RNA modeling and design tasks.
Requirements
A CUDA-capable GPU with at least 8 GB of VRAM is required, along with the following Python packages:
torch>=2.0
numpy
transformers==4.33.0.dev0
datasets==2.14.4
tqdm
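To check the GPU requirement before running anything, a quick probe along these lines may help. It is a minimal sketch, not part of the repository: the helper names `has_enough_vram` and `check_cuda` are hypothetical, and it assumes that "8 GB" can be read as 8 × 10^9 bytes of total (not free) device memory on the first CUDA device.

```python
def has_enough_vram(total_bytes: int, min_gb: float = 8.0) -> bool:
    """Compare a device's total memory in bytes against a minimum in GB (10**9 bytes)."""
    return total_bytes >= min_gb * 10**9


def check_cuda(min_gb: float = 8.0) -> str:
    """Return a short human-readable report on the first CUDA device."""
    try:
        import torch  # imported lazily so has_enough_vram works without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible to PyTorch"
    props = torch.cuda.get_device_properties(0)
    verdict = "meets" if has_enough_vram(props.total_memory, min_gb) else "is below"
    total_gb = props.total_memory / 10**9
    return f"{props.name}: {total_gb:.1f} GB, {verdict} the {min_gb:g} GB minimum"
```

Calling `print(check_cuda())` reports the device name and whether it clears the threshold. Memory already in use by other processes is not accounted for.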
Quickstart
Clone the repository (it ships with the recommended checkpoint model_updated.pt and its tokenizer/):
git clone https://huggingface.co/pfnet/GenerRNA
cd GenerRNA
De novo generation (zero-shot)
python sampling.py \
--out_path {output_file_path} \
--max_new_tokens 256 \
--ckpt_path model_updated.pt \
--tokenizer_path tokenizer
Want to use the original (historical) model instead? It is stored as split files. Recombine it and use its dedicated tokenizer:
cat experiment_data/historical_version/model.pt.part-* > model.pt
python sampling.py \
--out_path {output_file_path} \
--max_new_tokens 256 \
--ckpt_path model.pt \
--tokenizer_path experiment_data/historical_version/tokenizer_bpe_1024
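On systems without `cat` (for example, Windows without a Unix shell), the same recombination can be sketched in Python. This is a minimal sketch, not a script shipped with the repository: the function name `join_parts` is hypothetical, and it assumes the part files sort lexicographically into the correct order, as suffixes like `part-aa`, `part-ab` do.

```python
import glob
import shutil


def join_parts(pattern, out_path):
    """Concatenate all files matching `pattern` into `out_path`, in sorted order."""
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream the bytes so a large part is never fully loaded into memory.
                shutil.copyfileobj(src, out)
    return parts


# Example call, run from the repository root:
# join_parts("experiment_data/historical_version/model.pt.part-*", "model.pt")
```

The output path should not itself match the glob pattern, otherwise a second run would append the previous result to the new file.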
Training & Fine-tuning
1. Tokenize your sequences (one sequence per line, no header):
python tokenization.py \
--data_dir {path_to_directory_containing_sequence_data} \
--file_name {file_name_of_sequence_data} \
--tokenizer_path tokenizer \
--out_dir {directory_to_save_tokenized_data} \
--block_size 256
2. Create a config based on configs/example_pretraining.py (training from scratch) or configs/example_finetuning.py (fine-tuning).
3. Train / fine-tune:
python train.py --config {path_to_your_config_file}
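Step 1 expects a plain-text file with one sequence per line and no header. If your data is in FASTA format, a conversion along these lines may help. It is a minimal sketch: the names `fasta_to_lines` and `convert_file` are hypothetical and not part of the repository. The sketch deliberately leaves case and alphabet (T vs. U) untouched, since whether those need normalizing depends on how the tokenizer was trained.

```python
from typing import Iterable, Iterator


def fasta_to_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield one sequence string per FASTA record, dropping '>' header lines."""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if chunks:  # a new header closes the previous record
                yield "".join(chunks)
                chunks = []
        else:
            chunks.append(line)  # sequence may be wrapped over several lines
    if chunks:
        yield "".join(chunks)


def convert_file(fasta_path: str, out_path: str) -> int:
    """Write one sequence per line to `out_path`; return the number of sequences."""
    count = 0
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for seq in fasta_to_lines(src):
            dst.write(seq + "\n")
            count += 1
    return count
```

The resulting file can then be passed to tokenization.py via `--data_dir` and `--file_name`.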
Train your own tokenizer (optional)
python train_BPE.py \
--txt_file_path {path_to_training_file_one_sequence_per_line} \
--vocab_size 50256 \
--new_tokenizer_path {directory_to_save_trained_tokenizer}
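For intuition about what a BPE tokenizer learns from nucleotide data, here is a toy version of the merge loop in pure Python. It is an illustration of the general BPE idea only and should not be read as how train_BPE.py is implemented; the function names and the tiny corpus are made up for the example. Each round counts adjacent token pairs and fuses the most frequent pair into a single new token.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges from a list of sequences (toy BPE)."""
    corpus = [list(seq) for seq in sequences]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(sorted(pairs), key=pairs.get)  # ties broken alphabetically
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges, corpus


merges, corpus = learn_bpe(["ACACAC", "ACGU"], num_merges=2)
print(merges)  # [('A', 'C'), ('AC', 'AC')]
print(corpus)  # [['ACAC', 'AC'], ['AC', 'G', 'U']]
```

In this toy setting the vocabulary is the four base nucleotides plus one token per merge, which is why a larger `--vocab_size` generally yields longer tokens and shorter tokenized sequences.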
Repository Structure
.
├── LICENSE
├── README.md
├── CITATION.cff            # machine-readable citation metadata
├── model.py                # model architecture (decoder-only Transformer)
├── sampling.py             # generate sequences from a trained model
├── tokenization.py         # tokenize sequence data for training
├── train.py                # pre-training / fine-tuning entry point
├── train_BPE.py            # train a new BPE tokenizer
├── model_updated.pt        # recommended checkpoint (longer context, deduplicated data)
├── tokenizer/              # BPE tokenizer for model_updated.pt
├── configs/
│   ├── example_pretraining.py
│   └── example_finetuning.py
└── experiment_data/
    ├── *.csv                   # data underlying the paper's figures
    ├── pretraining_data.sh     # how the pre-training corpus was built (RNAcentral + MMseqs2)
    └── historical_version/     # original model (split into parts) + its tokenizer
        ├── model.pt.part-a{a,b,c,d}
        └── tokenizer_bpe_1024/
Training Data
GenerRNA was pre-trained on RNA sequences from RNAcentral (release 22, which aggregates 51 expert databases). Starting from 34.39 million raw sequences, deduplication with MMseqs2 at 80% sequence identity yielded a pre-training corpus of ~16 million sequences (16.09M), encompassing ~17.4 billion nucleotides. GenerRNA has a context window of 1024 tokens (~4000 nucleotides). The pre-processing pipeline is in experiment_data/pretraining_data.sh, and the data underlying the paper's figures is provided in experiment_data/. See the paper for full dataset details.
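A few derived quantities follow from the figures above and can be handy when sizing experiments. The arithmetic below uses only the numbers stated in this section; because those inputs are rounded, the results are approximate.

```python
raw_sequences = 34.39e6       # sequences before deduplication
kept_sequences = 16.09e6      # sequences after MMseqs2 deduplication
total_nucleotides = 17.4e9    # nucleotides in the pre-training corpus
context_tokens = 1024         # context window in tokens
context_nucleotides = 4000    # approximate context window in nucleotides

retained_fraction = kept_sequences / raw_sequences
mean_length = total_nucleotides / kept_sequences
nucleotides_per_token = context_nucleotides / context_tokens

print(f"retained after deduplication: {retained_fraction:.1%}")   # 46.8%
print(f"mean sequence length: {mean_length:.0f} nt")              # 1081 nt
print(f"nucleotides per token: {nucleotides_per_token:.1f}")      # 3.9
```

The mean length is well below the ~4000-nucleotide context window, although individual sequences may still exceed it.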
Limitations
- GenerRNA models RNA sequence; it does not explicitly predict tertiary structure or function. Validate candidates with downstream structure/function tools and wet-lab experiments.
- A CUDA GPU is required for generation and training as provided.
- Zero-shot outputs reflect the natural distribution of the training data; targeting a specific family or property generally requires fine-tuning.
- Generated sequences are computational hypotheses and should be experimentally validated before any real-world application.
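Before handing candidates to structure-prediction tools or the wet lab, a cheap first-pass filter can remove obviously unusable outputs. The sketch below is not part of the repository and rests on stated assumptions: generated sequences are available as plain strings, the expected alphabet is A, C, G, U (pass a different `alphabet` if your data uses T), and the helper names and length thresholds are hypothetical.

```python
def is_rna(seq, alphabet="ACGU"):
    """True if `seq` is non-empty and uses only characters from `alphabet`."""
    return bool(seq) and set(seq) <= set(alphabet)


def gc_content(seq):
    """Fraction of G and C characters in a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def filter_candidates(seqs, training_set=frozenset(), min_len=20, max_len=4000):
    """Keep unique, well-formed sequences that are not exact copies of training data."""
    seen, kept = set(), []
    for seq in (s.strip().upper() for s in seqs):
        if not is_rna(seq) or not (min_len <= len(seq) <= max_len):
            continue  # wrong alphabet or outside the length window
        if seq in seen or seq in training_set:
            continue  # duplicate output or exact training-set match
        seen.add(seq)
        kept.append(seq)
    return kept
```

Exact-match comparison is only a crude novelty check; detecting near-duplicates of training sequences requires a similarity search, and none of these checks say anything about structure or function.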
FAQ
What is GenerRNA? GenerRNA is a generative, pre-trained language model (a decoder-only Transformer) that designs novel RNA sequences de novo, without requiring structural information, functional labels, or sequence alignments.
How is GenerRNA different from other RNA models? Most RNA models are discriminative: they predict structure or properties from a given sequence. GenerRNA is generative: it samples entirely new sequences. To our knowledge, it is the first application of a generative language model to RNA generation.
Do I need RNA structure or alignments as input? No. GenerRNA generates sequences directly from its learned distribution; no structure or alignment is needed.
Can I generate RNAs from a specific family or with a specific function? Yes. Fine-tune GenerRNA on a family- or function-specific dataset. The paper demonstrates designing RNAs with high binding affinity to the proteins ELAVL1 and SRSF1.
Which checkpoint should I use? Use model_updated.pt (longer context, trained on deduplicated data). The original split model is kept in experiment_data/historical_version/ for reproducibility.
Is GenerRNA free to use? Yes. The code and weights are released under the MIT License. Please cite the paper if you use GenerRNA in your work.
How do I cite GenerRNA? See Citation below.
Citation
If you use GenerRNA, its checkpoints, or this repository in your research, please cite:
@article{zhao2024generrna,
title = {GenerRNA: A generative pre-trained language model for de novo RNA design},
author = {Zhao, Yichong and Oono, Kenta and Takizawa, Hiroki and Kotera, Masaaki},
journal = {PLOS ONE},
volume = {19},
number = {10},
pages = {e0310814},
year = {2024},
doi = {10.1371/journal.pone.0310814},
publisher = {Public Library of Science}
}
Plain text: Zhao Y, Oono K, Takizawa H, Kotera M (2024) GenerRNA: A generative pre-trained language model for de novo RNA design. PLOS ONE 19(10): e0310814. https://doi.org/10.1371/journal.pone.0310814
- Paper (PLOS ONE): https://doi.org/10.1371/journal.pone.0310814
- Preprint (bioRxiv): https://doi.org/10.1101/2024.02.01.578496
- Model: https://huggingface.co/pfnet/GenerRNA
- Code (GitHub): https://github.com/ekkkkki/GenerRNA
- Project page: https://ekkkkki.github.io/GenerRNA/
License
The source code is licensed under the MIT License; see LICENSE. © 2024 Yichong Zhao, Masaaki Kotera, Kenta Oono, Hiroki Takizawa.