NucEL

Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations


NucEL is the first ELECTRA-style genomic foundation model. A small generator proposes single-nucleotide substitutions at the masked positions of a DNA sequence, and a larger ModernBERT discriminator is trained to detect which positions were replaced. This gives dense supervision across all tokens instead of only the ~15% covered by masked language modeling (MLM). At 93 M parameters, NucEL outperforms MLM-based models of similar size and matches or exceeds models up to about 27× larger (e.g. NT-Multi-2.5B) on the GUE, Genomic Benchmarks (GB), and Nucleotide Transformer (NT) benchmarks.
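As a minimal sketch of the replaced-token-detection setup described above, the following pure-Python toy masks a fraction of positions, refills them with a random stand-in for the generator, and labels every position as original or replaced. The helper names (`corrupt`, `binary_cross_entropy`, `total_loss`), the 15% mask rate, the random generator, and the constant discriminator probabilities are illustrative assumptions, not the NucEL training code. Only the weight of 50 on the discriminator loss comes from the objective listed in the model summary.

```python
import math
import random

NUCLEOTIDES = "ACGT"
DISC_WEIGHT = 50.0  # discriminator weight from the objective in the model summary


def corrupt(sequence, mask_rate=0.15, seed=0):
    """Mask a fraction of positions and let a toy generator refill them.

    Returns the corrupted sequence, the masked positions, and one label per
    position for the discriminator (1 = replaced, 0 = original).
    The mask rate is an assumption made for illustration.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(sequence)))
    masked = sorted(rng.sample(range(len(sequence)), n_mask))
    corrupted = list(sequence)
    for i in masked:
        # Stand-in for sampling from a trained generator's distribution.
        corrupted[i] = rng.choice(NUCLEOTIDES)
    # A sampled base that equals the original counts as "original".
    labels = [int(c != o) for c, o in zip(corrupted, sequence)]
    return "".join(corrupted), masked, labels


def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy over every position, not only masked ones."""
    eps = 1e-12
    total = sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, labels)
    )
    return -total / len(labels)


def total_loss(generator_loss, discriminator_loss, weight=DISC_WEIGHT):
    """L_total = L_generator + weight * L_discriminator."""
    return generator_loss + weight * discriminator_loss


original = "ATCGATCGATCGATCGATCG"
corrupted, masked, labels = corrupt(original)

# Toy discriminator that predicts P(replaced) = 0.1 at every position.
disc_loss = binary_cross_entropy([0.1] * len(labels), labels)
# Toy generator loss: a uniform guess over the four bases.
gen_loss = math.log(4)

print(corrupted, masked, labels)
print(f"total loss: {total_loss(gen_loss, disc_loss):.3f}")
```

The discriminator receives a label at all 20 positions, while an MLM loss would only be computed at the masked ones.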

Model summary

| Property | Value |
| --- | --- |
| Architecture | ELECTRA RTD with ModernBERT backbone (discriminator only at inference) |
| Layers / hidden / heads | 22 / 512 / 16 |
| Intermediate size | 2048 |
| Attention | Hybrid: local window 128 + global every 3rd layer (FlashAttention-2) |
| Tokenization | Single-nucleotide (k = 1), vocabulary size 27 (4 nt + 7 special + 16 reserved) |
| Parameters | 93 M |
| Pre-training data | Human genome GRCh38 / hg38 (1224 bp windows, 100 bp overlap, 1024 bp segments) |
| Objective | L_total = L_generator + 50.0 · L_discriminator (replaced-token detection) |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.999), lr = 1e-4, cosine schedule, 1000-step warmup |
| Compute | 8× NVIDIA A100, FP16, 50 epochs, global batch size 192 |
| Max sequence length | Pre-trained at 1024 bp; ModernBERT config supports up to 8192 |
| License | Apache-2.0 |
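The layer sizes in the summary can be sanity-checked against the stated parameter count with a back-of-the-envelope estimate. The sketch below assumes the usual ModernBERT layer layout (bias-free linear layers, a fused QKV projection, and a gated MLP whose input projection is twice the intermediate size) and ignores the small normalization weights. The function name is hypothetical and the result is an approximation, not an official count.

```python
def modernbert_param_estimate(layers=22, hidden=512, intermediate=2048, vocab=27):
    """Rough encoder parameter count from the model-summary dimensions.

    Assumes bias-free linear layers and a gated (GeGLU-style) MLP;
    normalization weights are left out because they are comparatively tiny.
    """
    # Fused query/key/value projection plus the attention output projection.
    attention = hidden * 3 * hidden + hidden * hidden
    # Gated MLP: input projection to 2 * intermediate, output projection back.
    mlp = hidden * 2 * intermediate + intermediate * hidden
    # Token embedding table for the single-nucleotide vocabulary.
    embeddings = vocab * hidden
    return layers * (attention + mlp) + embeddings


total = modernbert_param_estimate()
print(f"{total / 1e6:.1f} M parameters")  # prints "92.3 M parameters"
```

Under these assumptions the estimate lands within about 1% of the 93 M figure, and almost all of it sits in the transformer layers: with a 27-token vocabulary the embedding table contributes under 14 k parameters.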

Quickstart

Install a transformers version that honors NucEL's custom tokenizer via auto_map:

pip install "transformers>=4.48,<5.4" torch

Then load the tokenizer and model and embed a short sequence:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
model     = AutoModel.from_pretrained("FreakingPotato/NucEL").eval()

sequence = "ATCGATCGATCGATCG"
inputs   = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state          # (1, L, 512)
print(f"Sequence embeddings shape: {embeddings.shape}")
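Because pre-training used 1024 bp segments, one option for longer inputs is to split them into windows before tokenizing. The helper below is a hypothetical sketch, not part of the NucEL package. The default window of 1024 bases mirrors the pre-training length, and whether special tokens count against that length is not stated here, so treat `max_len` as a tunable assumption.

```python
def chunk_sequence(sequence, max_len=1024, overlap=0):
    """Split a DNA string into windows of at most max_len bases.

    Consecutive windows share `overlap` bases; the last window may be shorter.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(sequence), step):
        chunks.append(sequence[start:start + max_len])
        # Stop once a window has reached the end of the sequence.
        if start + max_len >= len(sequence):
            break
    return chunks


long_sequence = "ACGT" * 625  # 2500 bp toy input
chunks = chunk_sequence(long_sequence)
print([len(chunk) for chunk in chunks])
```

Each chunk can then be passed to the tokenizer and model exactly as in the quickstart, and the per-chunk embeddings concatenated or pooled as the downstream task requires.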

Using transformers 5.4 or newer

Starting with transformers 5.4.0 (released 2026-03-27), modernbert-based checkpoints are on a MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS blocklist, which makes AutoTokenizer.from_pretrained ignore their auto_map field. This checkpoint is affected. Model loading (AutoModel.from_pretrained) still works; only tokenizer auto-discovery is blocked. On transformers 5.4+, download the tokenizer files from the Hub and import the tokenizer class directly instead:

import sys
import torch
from huggingface_hub import snapshot_download
from transformers import AutoModel

local = snapshot_download(
    "FreakingPotato/NucEL",
    allow_patterns=[
        "tokenizer.py",
        "tokenizer_config.json",
        "vocab.json",
        "special_tokens_map.json",
    ],
)
sys.path.insert(0, local)
from tokenizer import NucEL_Tokenizer

tokenizer = NucEL_Tokenizer.from_pretrained(local)
model     = AutoModel.from_pretrained("FreakingPotato/NucEL").eval()

sequence = "ATCGATCGATCGATCG"
inputs   = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(f"Sequence embeddings shape: {outputs.last_hidden_state.shape}")
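A script that has to run on both sides of the 5.4 boundary could pick the loading path from the installed version. The function below is a hypothetical helper written for illustration. It only parses the major and minor components of a version string and encodes the 5.4 cutoff described above.

```python
import re


def needs_direct_tokenizer_load(transformers_version):
    """Return True for transformers 5.4 or newer, where auto_map is ignored."""
    match = re.match(r"(\d+)\.(\d+)", transformers_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {transformers_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison handles cases such as 5.10 being newer than 5.4.
    return (major, minor) >= (5, 4)


for version in ("4.48.0", "5.3.2", "5.4.0", "5.10.1"):
    print(version, needs_direct_tokenizer_load(version))
```

In practice the argument would be `transformers.__version__`. When the helper returns True, use the snapshot_download route; otherwise AutoTokenizer with trust_remote_code=True is enough.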

Benchmark results

GUE (paper Table 1; MCC, with F1 for CVC)

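| Model | TF-H | PD | CPD | SSP | TF-M | EMP | CVC | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DNABERT-2 (117 M) | 70.10 | 84.21 | 70.52 | 84.99 | 67.99 | 55.98 | 71.02 | 72.11 |
| NT-2500M-multi | 63.32 | 88.14 | 71.62 | 89.36 | 67.01 | 58.06 | 73.04 | 72.94 |
| NucEL (93 M) | 67.64 | 87.10 | 75.13 | 90.30 | 70.62 | 65.01 | 70.29 | 75.16 |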
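The Avg column is the unweighted mean of the seven task scores, which can be checked directly from the rows above. This is a small verification sketch. The dictionary simply transcribes the table, and the variable names are illustrative.

```python
# Per-task GUE scores (TF-H, PD, CPD, SSP, TF-M, EMP, CVC) and the reported Avg.
gue_scores = {
    "DNABERT-2 (117 M)": ([70.10, 84.21, 70.52, 84.99, 67.99, 55.98, 71.02], 72.11),
    "NT-2500M-multi": ([63.32, 88.14, 71.62, 89.36, 67.01, 58.06, 73.04], 72.94),
    "NucEL (93 M)": ([67.64, 87.10, 75.13, 90.30, 70.62, 65.01, 70.29], 75.16),
}

recomputed = {}
for model, (scores, reported) in gue_scores.items():
    # Unweighted mean across the seven task groups.
    recomputed[model] = sum(scores) / len(scores)
    print(f"{model}: recomputed {recomputed[model]:.2f}, reported {reported:.2f}")
```

The recomputed means agree with the reported averages to within 0.01. A gap of that size is what you would expect if the paper averaged unrounded per-task scores.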

Genomic Benchmarks (paper Table 2, accuracy; 5-seed mean)

| Task | HyenaDNA | Caduceus-Ph | NT2-100M | NucEL |
| --- | --- | --- | --- | --- |
| Coding vs Intergenic | 0.904 | 0.915 | 0.950 | 0.941 |
| Human vs Worm | 0.964 | 0.972 | 0.972 | 0.975 |
| Human Enhancers Cohn | 0.729 | 0.747 | 0.736 | 0.735 |
| Human Enhancers Ensembl | 0.849 | 0.893 | 0.935 | 0.940 |
| Human Regulatory | 0.869 | 0.872 | 0.935 | 0.941 |
| Human OCR Ensembl | 0.783 | 0.828 | 0.776 | 0.794 |
| Human NonTATA Promoters | 0.944 | 0.946 | 0.920 | 0.973 |
| Average | 0.863 | 0.882 | 0.890 | 0.899 |

Nucleotide Transformer benchmark (paper Table 3, 18 tasks, MCC; 10-seed mean)

NucEL averages 0.664 across the 18 NT tasks, matching NT2-Multi-500M and slightly exceeding NT-Multi-2.5B (0.661), which has about 27× more parameters. See the paper for the full per-task breakdown.

Intended use and limitations

  • Intended. Embedding extraction and fine-tuning for human-genome downstream tasks (regulatory element classification, transcription-factor binding, splice-site / promoter prediction, histone-mark profiles). The model has also been shown to transfer zero-shot to other species (mouse, yeast, viruses).
  • Limitations. Pre-trained on the human reference genome only; coverage of non-coding repeats, microbial sequences, and highly diverged clades may be limited. Ambiguous bases (N) are mapped to [UNK].
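Since ambiguous bases end up as [UNK], it can be worth measuring how much of an input falls outside A/C/G/T before embedding it. The function below is a hypothetical pre-check, not part of the NucEL tokenizer. It assumes that IUPAC codes other than N are handled the same way as N and that upper-casing the input first is acceptable; neither is confirmed here.

```python
def unknown_fraction(sequence):
    """Fraction of characters that are not A, C, G, or T (case-insensitive).

    Assumption: every non-ACGT character would be mapped to [UNK].
    """
    if not sequence:
        return 0.0
    canonical = set("ACGT")
    unknown = sum(base not in canonical for base in sequence.upper())
    return unknown / len(sequence)


for seq in ("ATCGATCGATCG", "ATCGNNNNATCG", "NNNNNNNN"):
    print(seq, f"{unknown_fraction(seq):.2f}")
```

Windows with a high unknown fraction carry little sequence signal, so a pipeline might skip or flag them; where to set that threshold depends on the task.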

Citation

@article{ding2025nucel,
  title   = {NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training
             for Efficient and Interpretable Representations},
  author  = {Ding, Ke and Parker, Brian and Wen, Jiayu},
  journal = {bioRxiv},
  year    = {2025},
  doi     = {10.1101/2025.08.17.670700},
  url     = {https://www.biorxiv.org/content/10.1101/2025.08.17.670700v1}
}

License

Apache-2.0.
