
Model Overview

PlantBiMoE is a DNA language model trained on the reference genomes of 42 representative plant species. More specifically, PlantBiMoE combines a BiMamba backbone with a SparseMoE architecture and is pre-trained with a masked language modeling objective, leveraging widely available genomic data from these 42 plant species to learn general representations of nucleotide sequences. PlantBiMoE contains 64M active parameters and has a context window of 32,768 tokens. It uses a single-base tokenizer to convert genomic nucleotide sequences into tokens.

How to use


import torch
from plantbimoe.modeling_plantbimoe import PlantbimoeForMaskedLM
from plantbimoe.configuration_plantbimoe import PlantbimoeConfig
from plantbimoe.tokenization_plantbimoe import PlantbimoeTokenizer

MAX_LENGTH = 32768
device = "cuda:0"

config = PlantbimoeConfig.from_pretrained("plant-llms/Plantbimoe")
model = PlantbimoeForMaskedLM.from_pretrained("plant-llms/Plantbimoe", config=config).to(device)
tokenizer = PlantbimoeTokenizer(model_max_length=MAX_LENGTH, padding_side="right")


sequences = ["CCTGAACCCTAACGGCTATGA", "CGTACTACGA"]
# Pad to a common length so the batch can be stacked into a single tensor
tokenized_sequences = tokenizer(sequences, truncation=True, padding=True)["input_ids"]
input_ids = torch.LongTensor(tokenized_sequences).to(device)

# Per-token representations from the model's hidden states
embd = model(input_ids=input_ids, output_hidden_states=True)["hidden_states"][0]
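
For sequence-level embeddings, the per-token hidden states can be mean-pooled over the non-padding positions. The following is a minimal sketch, not part of the official usage example: it assumes the tokenizer returns an attention_mask (as standard Hugging Face tokenizers do) and takes the last entry of hidden_states as the final layer, following the usual Transformers convention.

# Optional: pool per-token hidden states into one embedding per sequence.
tokenized = tokenizer(sequences, truncation=True, padding=True)
input_ids = torch.LongTensor(tokenized["input_ids"]).to(device)
attention_mask = torch.LongTensor(tokenized["attention_mask"]).to(device)

with torch.no_grad():
    hidden = model(input_ids=input_ids, output_hidden_states=True)["hidden_states"][-1]

mask = attention_mask.unsqueeze(-1).float()            # (batch, seq_len, 1)
sequence_embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(sequence_embeddings.shape)                       # (batch, hidden_size)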

Pre-training

Data

Our pre-training dataset was built from plant reference genomes in the NCBI database. It consists of approximately 25.40 billion base pairs across 42 different species.

Processing

  • The reference genomes of all plants were cut into segments with a fixed length of 32,768 bp. To provide implicit data augmentation, an overlap of 64 to 128 bp was retained between neighbouring segments, with the sliding step size randomly sampled within this range.
  • All non-standard bases in the segments (anything other than A, T, C, G, and N) were uniformly replaced with N.
  • Sequences with more than 2% N bases were filtered out, forming the initial dataset.
  • 30% of the sequences in the initial dataset were randomly selected, reverse complemented (RC), and added back to form the final dataset (a rough sketch of this pipeline is given after the list).
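
The segmentation, cleaning, filtering, and reverse-complement steps above can be sketched roughly as follows. The function names, I/O, and random-number handling are illustrative assumptions, not the released preprocessing code.

import random

COMPLEMENT = str.maketrans("ATCGN", "TAGCN")

def segment_genome(genome: str, seg_len: int = 32_768,
                   overlap_min: int = 64, overlap_max: int = 128):
    """Cut a genome into fixed-length segments, keeping a random 64-128 bp
    overlap between neighbouring segments (sliding step sampled per segment)."""
    segments, start = [], 0
    while start + seg_len <= len(genome):
        segments.append(genome[start:start + seg_len])
        start += seg_len - random.randint(overlap_min, overlap_max)
    return segments

def clean(segment: str) -> str:
    """Map any base other than A/T/C/G/N to N."""
    return "".join(b if b in "ATCGN" else "N" for b in segment.upper())

def keep(segment: str, max_n_frac: float = 0.02) -> bool:
    """Drop segments whose N content exceeds 2%."""
    return segment.count("N") / len(segment) <= max_n_frac

def reverse_complement(segment: str) -> str:
    return segment.translate(COMPLEMENT)[::-1]

def build_dataset(genomes):
    initial = [s for g in genomes for s in map(clean, segment_genome(g)) if keep(s)]
    # Augment: reverse-complement a random 30% of the initial dataset and add it back.
    augmented = [reverse_complement(s) for s in random.sample(initial, int(0.3 * len(initial)))]
    return initial + augmented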

Tokenization example

nucleotide sequence: CGTACTACGAN
tokens: <CLS> <C> <G> <T> <A> <C> <T> <A> <C> <G> <A> <N> <SEP>
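
The same mapping can be reproduced with the tokenizer from the usage example above. This is a minimal sketch and assumes the tokenizer exposes the standard convert_ids_to_tokens method inherited from the Hugging Face tokenizer base class.

from plantbimoe.tokenization_plantbimoe import PlantbimoeTokenizer

tokenizer = PlantbimoeTokenizer(model_max_length=32768, padding_side="right")

ids = tokenizer("CGTACTACGAN")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# Expected single-base tokens, matching the example above:
# <CLS> <C> <G> <T> <A> <C> <T> <A> <C> <G> <A> <N> <SEP>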

Masked Language Modeling

The pre-training strategy of PlantBiMoE utilizes Masked Language Modeling. During pre-training, 15% of the tokens in the input sequence are randomly masked, and the model's task is to predict the original tokens at these masked positions. The masking mechanism follows the BERT strategy: 80% of the selected tokens are replaced with the <MASK> token, 10% are replaced with random tokens, and the remaining 10% are left unchanged. This strategy enhances the model's capacity to capture global contextual dependencies and to model sequence structures, thereby improving its performance across various downstream tasks.
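
In practice this 80/10/10 scheme can be applied with Hugging Face's DataCollatorForLanguageModeling (mlm_probability=0.15), or implemented directly as in the tensor-level sketch below. The sketch is illustrative and is not the released training code.

import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                special_token_ids: set, mlm_probability: float = 0.15):
    """BERT-style masking: of the 15% selected positions, 80% become <MASK>,
    10% become a random token, and 10% are left unchanged."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select 15% of positions, never touching special tokens (<CLS>, <SEP>, padding).
    prob = torch.full(input_ids.shape, mlm_probability)
    special = torch.tensor([[tok in special_token_ids for tok in seq] for seq in input_ids.tolist()])
    prob.masked_fill_(special, 0.0)
    selected = torch.bernoulli(prob).bool()
    labels[~selected] = -100                     # loss is computed only on selected positions

    # 80% of selected positions -> <MASK>
    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[replaced] = mask_token_id

    # 10% of selected positions -> random token (half of the remaining 20%)
    random_tok = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~replaced
    input_ids[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]

    # Remaining 10% of selected positions are left unchanged.
    return input_ids, labels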

Pre-training Details

The pre-training of PlantBiMoE was distributed across a computing node with 8 Nvidia A800-80G GPUs, where the batch size for each GPU was set to 4. With 8-step gradient accumulation, the effective batch size became 256. The AdamW optimizer was used, with β1 set to 0.95, β2 to 0.9, and a weight decay of 0.1. The total number of training steps was equivalent to 10 epochs. During the initial 2% of the training steps, the learning rate increased linearly from 0 to 0.008, followed by a cosine decay to 0.004. Mixed precision training with bf16 was adopted to improve training efficiency and reduce memory overhead, resulting in a total pre-training time of approximately 166 hours.
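
A minimal sketch of how these hyperparameters map onto PyTorch's AdamW and a warmup-plus-cosine schedule is given below. The use of LambdaLR, the total_steps placeholder, and the model variable (from the usage section above) are assumptions; this is an illustrative reconstruction, not the released training script.

import math
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters reported above; total_steps is a placeholder.
peak_lr, final_lr = 0.008, 0.004
warmup_frac = 0.02
total_steps = 100_000

optimizer = AdamW(model.parameters(), lr=peak_lr, betas=(0.95, 0.9), weight_decay=0.1)

warmup_steps = int(warmup_frac * total_steps)
min_ratio = final_lr / peak_lr  # cosine decays from peak_lr down to final_lr, not to zero

def lr_lambda(step: int) -> float:
    if step < warmup_steps:                                    # linear warmup from 0 to peak_lr
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))        # 1 -> 0
    return min_ratio + (1.0 - min_ratio) * cosine              # peak_lr -> final_lr

scheduler = LambdaLR(optimizer, lr_lambda)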

BibTeX entry and citation info

If you use this model, please cite the following paper:

@article{lin2025plantbimoe,
  title={PlantBiMoE: A Bidirectional Foundation Model with SparseMoE for Plant Genomes},
  author={Lin, Kepeng and Zhang, Qizhe and Wang, Rui and Hu, Xuehai and Xu, Wei},
  journal={arXiv preprint arXiv:2512.07113},
  year={2025},
  url={https://arxiv.org/pdf/2512.07113}
}