GenerRNA: A Generative Language Model for de novo RNA Design

Paper (PLOS ONE) · Preprint (bioRxiv) · License: MIT · Model on Hugging Face

GenerRNA is a generative pre-trained language model for de novo RNA sequence design. It is a Transformer (decoder-only, GPT-style) model that learns the "language" of RNA from millions of natural sequences and can generate novel, realistic RNA sequences without any structural input, functional label, or sequence alignment. To our knowledge, GenerRNA is the first application of a generative language model to RNA generation.

With GenerRNA you can:

  • Generate RNA in a zero-shot manner to explore the RNA sequence space, or
  • Fine-tune on your own dataset to generate RNAs belonging to a particular family or possessing specific characteristics (e.g., high binding affinity to a target protein).

Developed by Preferred Networks, Inc. and The University of Tokyo. Introduced in PLOS ONE (2024): GenerRNA: A generative pre-trained language model for de novo RNA design.


Model Summary

GenerRNA is a Transformer decoder-only (GPT-style) language model trained on RNA nucleotide sequences. By treating RNA as a sequence of tokens, it learns statistical and structural regularities of RNA directly from data and can then sample entirely new sequences. GenerRNA was pre-trained on ~16 million RNA sequences (16.09M), encompassing ~17.4 billion nucleotides. Generated RNAs are novel (distinct from training sequences) yet fold into stable secondary structures, and the model can be fine-tuned to design functional RNAs such as protein binders, all without requiring prior structural knowledge.

Key Features

  • 🧬 De novo RNA generation: create novel RNA sequences from scratch; no structure, label, or alignment required.
  • 🎯 Zero-shot or fine-tuned: explore RNA space out of the box, or specialize the model for a target family or function.
  • 🔬 Structurally plausible outputs: generated sequences fold into stable secondary structures (low minimum free energy).
  • 🧩 Transformer / GPT architecture: a familiar, scalable decoder-only design (~350M parameters).
  • ⚡ Two checkpoints provided: an updated long-context model and the original historical model.
  • 📖 Open & reproducible: MIT-licensed code, tokenizer, checkpoints, and the data behind the paper's figures.

Model Details

Model type: Generative language model (decoder-only Transformer, GPT-style)
Domain: RNA / nucleotide sequences
Parameters: 350M (24 transformer layers, model dimension 1280)
Context window: 1024 tokens (~4000 nucleotides)
Tokenizer: Byte-Pair Encoding (BPE), vocabulary size 1024
Checkpoints: model_updated.pt (recommended; longer context, deduplicated data) · original split model in experiment_data/historical_version/
Framework: PyTorch (≥ 2.0)
License: MIT
Paper: PLOS ONE 19(10):e0310814 (2024) · doi:10.1371/journal.pone.0310814
Developed by: Preferred Networks, Inc. & The University of Tokyo

Intended Use & Use Cases

GenerRNA is intended for research in RNA biology, synthetic biology, and RNA-based therapeutics / drug discovery. Typical use cases include:

  • Exploring the diversity of the RNA sequence space.
  • Generating candidate RNAs from a target family by fine-tuning on family-specific data.
  • Designing RNAs with desired functional properties, such as aptamers/binders with high affinity to a target protein (demonstrated for the RNA-binding proteins ELAVL1 and SRSF1 in the paper).
  • Serving as a pre-trained backbone for downstream RNA modeling and design tasks.

Requirements

A CUDA environment with a minimum of 8 GB VRAM is required.

torch>=2.0
numpy
transformers==4.33.0.dev0
datasets==2.14.4
tqdm
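
To see which of the packages listed above are already present in an environment, a small standard-library check along these lines may help. It is only a sketch: it reports installed versions next to the pins and does not enforce them, and the names `REQUIRED`, `installed_versions`, and `format_report` are illustrative, not part of the repository.

```python
from importlib import metadata

# Pins copied from the requirements list above; None means "any version".
REQUIRED = {
    "torch": ">=2.0",
    "numpy": None,
    "transformers": "==4.33.0.dev0",
    "datasets": "==2.14.4",
    "tqdm": None,
}


def installed_versions(packages):
    """Return {package: installed version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def format_report(required, found):
    """Build one human-readable line per package (no version comparison)."""
    lines = []
    for name, pin in required.items():
        wanted = pin if pin is not None else "any version"
        version = found.get(name)
        status = f"installed {version}" if version is not None else "MISSING"
        lines.append(f"{name}: {status} (required: {wanted})")
    return lines


# Print the report for the current environment.
for line in format_report(REQUIRED, installed_versions(REQUIRED)):
    print(line)
```

The check says nothing about the GPU; the 8 GB VRAM requirement still has to be verified separately.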

Quickstart

Clone the repository (it ships with the recommended checkpoint model_updated.pt and its tokenizer/):

git clone https://huggingface.co/pfnet/GenerRNA
cd GenerRNA

De novo generation (zero-shot)

python sampling.py \
    --out_path {output_file_path} \
    --max_new_tokens 256 \
    --ckpt_path model_updated.pt \
    --tokenizer_path tokenizer
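
When generation is driven from a Python script rather than the shell, the same command can be assembled and launched with `subprocess`. The sketch below uses only the four flags shown above; `build_sampling_command` and `run_sampling` are hypothetical helper names, and the script is assumed to be run from the repository root.

```python
import subprocess
import sys


def build_sampling_command(out_path, ckpt_path="model_updated.pt",
                           tokenizer_path="tokenizer", max_new_tokens=256):
    """Return the argument list for sampling.py, using the flags from the Quickstart."""
    return [
        sys.executable, "sampling.py",
        "--out_path", str(out_path),
        "--max_new_tokens", str(max_new_tokens),
        "--ckpt_path", str(ckpt_path),
        "--tokenizer_path", str(tokenizer_path),
    ]


def run_sampling(out_path, **kwargs):
    """Run sampling.py in the current directory; raises CalledProcessError on failure."""
    cmd = build_sampling_command(out_path, **kwargs)
    subprocess.run(cmd, check=True)
    return cmd
```

For example, `run_sampling("generated.txt", max_new_tokens=128)` would launch one zero-shot run with a shorter generation length.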

Want to use the original (historical) model instead? It is stored as split files. Recombine it and use its dedicated tokenizer:

cat experiment_data/historical_version/model.pt.part-* > model.pt
python sampling.py \
    --out_path {output_file_path} \
    --max_new_tokens 256 \
    --ckpt_path model.pt \
    --tokenizer_path experiment_data/historical_version/tokenizer_bpe_1024
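
On systems without `cat` (for example a plain Windows shell), the split checkpoint can be recombined in Python instead. This is a minimal sketch that concatenates the parts in sorted filename order, which matches the order the shell glob would produce for the `part-aa` ... `part-ad` files; `recombine_parts` is an illustrative name, not a repository function.

```python
import shutil
from pathlib import Path


def recombine_parts(parts_dir, out_path, pattern="model.pt.part-*"):
    """Concatenate split files (sorted by name) into a single output file."""
    parts = sorted(Path(parts_dir).glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files matching {pattern!r} in {parts_dir}")
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large checkpoint parts are never held in memory.
                shutil.copyfileobj(src, out)
    return parts
```

Calling `recombine_parts("experiment_data/historical_version", "model.pt")` would then produce the same `model.pt` as the `cat` command above.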

Training & Fine-tuning

1. Tokenize your sequences (one sequence per line, no header):

python tokenization.py \
    --data_dir {path_to_directory_containing_sequence_data} \
    --file_name {file_name_of_sequence_data} \
    --tokenizer_path tokenizer \
    --out_dir {directory_to_save_tokenized_data} \
    --block_size 256
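
Sequence collections are often distributed as FASTA, whereas the tokenization step expects one sequence per line with no header. A conversion could look like the sketch below; `fasta_to_lines` is a hypothetical helper, and it deliberately leaves the nucleotide alphabet untouched, since whether any further normalization is needed depends on your data.

```python
def fasta_to_lines(fasta_text):
    """Convert FASTA text into a list of sequences, one string per record.

    Header lines (starting with '>') are dropped and wrapped sequence
    lines belonging to the same record are joined.
    """
    sequences, current = [], []
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: flush the previous one, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences
```

Writing `"\n".join(fasta_to_lines(text))` to a file gives input in the layout described in step 1.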

2. Create a config based on configs/example_pretraining.py (training from scratch) or configs/example_finetuning.py (fine-tuning).

3. Train / fine-tune:

python train.py --config {path_to_your_config_file}

Train your own tokenizer (optional)

python train_BPE.py \
    --txt_file_path {path_to_training_file_one_sequence_per_line} \
    --vocab_size 50256 \
    --new_tokenizer_path {directory_to_save_trained_tokenizer}
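
For intuition about what a BPE tokenizer learns on nucleotide data, the toy implementation below repeatedly merges the most frequent adjacent token pair. It is an illustration of the general byte-pair merging idea only, not the code in train_BPE.py; the function names and the tie-breaking rule are choices made for this sketch.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with the merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges; return (merges, tokenized corpus)."""
    corpus = [list(seq) for seq in sequences]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))[0]
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges, corpus


merges, corpus = learn_bpe(["ACGUACGU"], num_merges=2)
print(merges)  # [('G', 'U'), ('C', 'GU')]
print(corpus)  # [['A', 'CGU', 'A', 'CGU']]
```

Each merge adds one multi-nucleotide token to the vocabulary, which is how a BPE vocabulary can cover several nucleotides per token on average.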

Repository Structure

.
├── LICENSE
├── README.md
├── CITATION.cff               # machine-readable citation metadata
├── model.py                   # model architecture (decoder-only Transformer)
├── sampling.py                # generate sequences from a trained model
├── tokenization.py            # tokenize sequence data for training
├── train.py                   # pre-training / fine-tuning entry point
├── train_BPE.py               # train a new BPE tokenizer
├── model_updated.pt           # recommended checkpoint (longer context, deduplicated data)
├── tokenizer/                 # BPE tokenizer for model_updated.pt
├── configs/
│   ├── example_pretraining.py
│   └── example_finetuning.py
└── experiment_data/
    ├── *.csv                  # data underlying the paper's figures
    ├── pretraining_data.sh    # how the pre-training corpus was built (RNAcentral + MMseqs2)
    └── historical_version/    # original model (split into parts) + its tokenizer
        ├── model.pt.part-a{a,b,c,d}
        └── tokenizer_bpe_1024/

Training Data

GenerRNA was pre-trained on RNA sequences from RNAcentral (release 22, which aggregates 51 expert databases). Starting from 34.39 million raw sequences, deduplication with MMseqs2 at 80% sequence identity yielded a pre-training corpus of ~16 million sequences (16.09M), encompassing ~17.4 billion nucleotides. GenerRNA has a context window of 1024 tokens (~4000 nucleotides). The pre-processing pipeline is in experiment_data/pretraining_data.sh, and the data underlying the paper's figures is provided in experiment_data/. See the paper for full dataset details.
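
A few derived quantities follow directly from the figures quoted above. The short calculation below is only back-of-the-envelope arithmetic on those rounded numbers, so the results are approximate; the variable names are illustrative.

```python
# Figures quoted in this section (all approximate).
raw_sequences = 34.39e6        # sequences before deduplication
kept_sequences = 16.09e6       # sequences after MMseqs2 deduplication
total_nucleotides = 17.4e9     # nucleotides in the pre-training corpus
context_tokens = 1024          # context window in tokens
context_nucleotides = 4000     # approximate context window in nucleotides

retained_fraction = kept_sequences / raw_sequences
mean_length = total_nucleotides / kept_sequences
nt_per_token = context_nucleotides / context_tokens

print(f"retained after deduplication: {retained_fraction:.1%}")  # 46.8%
print(f"mean sequence length: {mean_length:.0f} nt")             # 1081 nt
print(f"nucleotides per token: {nt_per_token:.1f}")              # 3.9
```

So deduplication keeps a little under half of the raw sequences, the average retained sequence is roughly 1,100 nucleotides long, and one token covers about four nucleotides on average.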

Limitations

  • GenerRNA models RNA sequence; it does not explicitly predict tertiary structure or function. Validate candidates with downstream structure/function tools and wet-lab experiments.
  • A CUDA GPU is required for generation and training as provided.
  • Zero-shot outputs reflect the natural distribution of the training data; targeting a specific family or property generally requires fine-tuning.
  • Generated sequences are computational hypotheses and should be experimentally validated before any real-world application.
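
Before handing candidates to structure or function tools, a cheap first pass can discard malformed, duplicate, or already-known sequences. The sketch below makes two assumptions that are not stated in this README: that generated output can be read as one sequence per line, and that sequences use the `ACGU` alphabet (pass a different `alphabet` if yours differ). `filter_candidates` is a hypothetical helper, and exact-match comparison is a much weaker novelty test than a sequence-similarity search.

```python
def filter_candidates(generated, known=(), alphabet="ACGU", min_length=1):
    """Keep unique, well-formed sequences that are not exact copies of known ones.

    generated:  iterable of candidate sequences (assumed one per line of output)
    known:      iterable of reference sequences to exclude (exact match only)
    alphabet:   allowed characters (assumption: RNA alphabet ACGU)
    min_length: discard sequences shorter than this
    """
    known_set = {s.strip().upper() for s in known}
    allowed = set(alphabet)
    seen, kept = set(), []
    for seq in generated:
        s = seq.strip().upper()
        if len(s) < min_length or not set(s) <= allowed:
            continue  # too short or contains characters outside the alphabet
        if s in seen or s in known_set:
            continue  # duplicate within the batch, or already known
        seen.add(s)
        kept.append(s)
    return kept
```

Sequences that survive this filter are still only hypotheses; the validation steps listed above apply unchanged.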

FAQ

What is GenerRNA? GenerRNA is a generative, pre-trained language model (a decoder-only Transformer) that designs novel RNA sequences de novo, without requiring structural information, functional labels, or sequence alignments.

How is GenerRNA different from other RNA models? Most RNA models are discriminative: they predict structure or properties from a given sequence. GenerRNA is generative: it samples entirely new sequences. To our knowledge, it is the first application of a generative language model to RNA generation.

Do I need RNA structure or alignments as input? No. GenerRNA generates sequences directly from its learned distribution; no structure or alignment is needed.

Can I generate RNAs from a specific family or with a specific function? Yes. Fine-tune GenerRNA on a family- or function-specific dataset. The paper demonstrates designing RNAs with high binding affinity to the proteins ELAVL1 and SRSF1.

Which checkpoint should I use? Use model_updated.pt (longer context, trained on deduplicated data). The original split model is kept in experiment_data/historical_version/ for reproducibility.

Is GenerRNA free to use? Yes. The code and weights are released under the MIT License. Please cite the paper if you use GenerRNA in your work.

How do I cite GenerRNA? See Citation below.

Citation

If you use GenerRNA, its checkpoints, or this repository in your research, please cite:

@article{zhao2024generrna,
  title     = {GenerRNA: A generative pre-trained language model for de novo RNA design},
  author    = {Zhao, Yichong and Oono, Kenta and Takizawa, Hiroki and Kotera, Masaaki},
  journal   = {PLOS ONE},
  volume    = {19},
  number    = {10},
  pages     = {e0310814},
  year      = {2024},
  doi       = {10.1371/journal.pone.0310814},
  publisher = {Public Library of Science}
}

Plain text: Zhao Y, Oono K, Takizawa H, Kotera M (2024) GenerRNA: A generative pre-trained language model for de novo RNA design. PLOS ONE 19(10): e0310814. https://doi.org/10.1371/journal.pone.0310814

License

The source code is licensed under the MIT License; see LICENSE. © 2024 Yichong Zhao, Masaaki Kotera, Kenta Oono, Hiroki Takizawa.
