NucEL

Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations

NucEL is the first ELECTRA-style genomic foundation model. A small generator proposes single-nucleotide substitutions at the masked positions of a DNA sequence, and a larger ModernBERT discriminator is trained to detect which positions were replaced. This gives dense supervision across all tokens instead of the ~15% covered by masked language modeling. At 93 M parameters, NucEL on average outperforms MLM-based models of similar size and matches or exceeds models up to 27× larger (e.g. NT-Multi-2.5B) on the GUE, Genomic Benchmarks (GB), and Nucleotide Transformer (NT) benchmarks.
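To make the replaced-token detection setup concrete, the sketch below corrupts a toy sequence and builds the per-position labels a discriminator would be trained on. It is a minimal illustration rather than NucEL's training code: a uniform random sampler stands in for the learned generator, and the names `corrupt_for_rtd` and `mask_rate` are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"

def corrupt_for_rtd(sequence, mask_rate=0.15, seed=0):
    """Return (corrupted_sequence, labels) for replaced-token detection.

    labels[i] is 1 where the token differs from the original, else 0.
    ASSUMPTION: a uniform random sampler stands in for the learned generator.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(mask_rate * len(sequence)))
    masked_positions = rng.sample(range(len(sequence)), n_masked)

    corrupted = list(sequence)
    for pos in masked_positions:
        # The stand-in "generator" proposes a nucleotide for each masked position.
        corrupted[pos] = rng.choice(NUCLEOTIDES)

    # As in ELECTRA, a proposal that equals the original counts as "original".
    # Every position receives a label, whether it was masked or not.
    labels = [int(new != old) for new, old in zip(corrupted, sequence)]
    return "".join(corrupted), labels

original = "ATCGATCGATCGATCGATCG"
corrupted, labels = corrupt_for_rtd(original)
print(original)
print(corrupted)
print(labels)
```

Only a few positions are altered, but the label list covers the whole sequence, which is the source of the dense training signal described above.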

Model summary


Architecture	ELECTRA RTD with ModernBERT backbone (discriminator only at inference)
Layers / hidden / heads	22 / 512 / 16
Intermediate size	2048
Attention	Hybrid: local window 128 + global every 3rd layer (FlashAttention-2)
Tokenization	Single-nucleotide (k = 1), vocabulary size 27 (4 nt + 7 special + 16 reserved)
Parameters	93 M
Pre-training data	Human genome GRCh38 / hg38 (1224 bp windows, 100 bp overlap, 1024 bp segments)
Objective	`L_total = L_generator + 50.0 · L_discriminator` (replaced-token detection)
Optimizer	AdamW (β₁=0.9, β₂=0.999), lr = 1e-4, cosine schedule, 1000-step warmup
Compute	8× NVIDIA A100, FP16, 50 epochs, global batch size 192
Max sequence length	Pre-trained at 1024 bp; ModernBERT config supports up to 8192
License	Apache-2.0
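The Objective row combines two terms with a fixed weight. The sketch below shows one way such a combined loss can be computed, assuming the usual ELECTRA arrangement: generator cross-entropy over masked positions only, and discriminator binary cross-entropy over all positions. The function names and toy probabilities are invented for illustration and are not taken from the NucEL code.

```python
import math

DISCRIMINATOR_WEIGHT = 50.0  # weight on L_discriminator in the Objective row

def generator_loss(true_token_probs):
    """Mean cross-entropy over masked positions.

    true_token_probs[i] is the probability the generator assigned to the
    original nucleotide at the i-th masked position.
    """
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

def discriminator_loss(replaced_probs, labels):
    """Mean binary cross-entropy over ALL positions.

    replaced_probs[i] is the predicted probability that token i was replaced;
    labels[i] is 1 if it really was replaced, else 0.
    """
    total = 0.0
    for p, y in zip(replaced_probs, labels):
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def total_loss(true_token_probs, replaced_probs, labels):
    """L_total = L_generator + 50.0 * L_discriminator."""
    return (generator_loss(true_token_probs)
            + DISCRIMINATOR_WEIGHT * discriminator_loss(replaced_probs, labels))

# Toy values, invented purely for illustration.
gen_probs = [0.5, 0.25]
disc_probs = [0.1, 0.9, 0.2, 0.1]
disc_labels = [0, 1, 0, 0]
print(round(total_loss(gen_probs, disc_probs, disc_labels), 4))
```

The large weight compensates for the per-token binary loss being much smaller in magnitude than the generator's multi-class loss.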

Quickstart

Install a transformers version that honors NucEL's custom tokenizer via auto_map:

pip install "transformers>=4.48,<5.4" torch

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
model     = AutoModel.from_pretrained("FreakingPotato/NucEL").eval()

sequence = "ATCGATCGATCGATCG"
inputs   = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state          # (1, L, 512)
print(f"Sequence embeddings shape: {embeddings.shape}")
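Many downstream uses need one vector per sequence rather than one per nucleotide. The sketch below shows masked mean pooling, a common choice that this card does not prescribe. The helper name `mean_pool` is hypothetical, the tensors are random stand-ins for real model outputs, and the pooling does not treat special tokens differently from nucleotides.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over non-padding positions.

    last_hidden_state: (batch, length, hidden) float tensor
    attention_mask:    (batch, length) tensor of 0/1
    returns:           (batch, hidden) float tensor
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

# Random stand-in tensors using the hidden size of 512 from the Quickstart.
hidden = torch.randn(2, 6, 512)
mask = torch.tensor([[1, 1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 0, 0]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)
```

With the Quickstart objects this would be called as `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`, assuming the tokenizer returns an `attention_mask` entry.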

Using transformers 5.4 or newer

Starting with transformers 5.4.0 (released 2026-03-27), modernbert-based checkpoints are on a MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS blocklist, which makes AutoTokenizer.from_pretrained ignore the auto_map field of every modernbert checkpoint, including this one. Only tokenizer auto-discovery is blocked; the model itself (AutoModel.from_pretrained) is unaffected. On transformers 5.4+, download the tokenizer files from the Hub and import the tokenizer class directly instead:

import sys
import torch
from huggingface_hub import snapshot_download
from transformers import AutoModel

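# Fetch only the tokenizer files from the Hub repository
local = snapshot_download(
    "FreakingPotato/NucEL",
    allow_patterns=[
        "tokenizer.py",
        "tokenizer_config.json",
        "vocab.json",
        "special_tokens_map.json",
    ],
)
# Make the downloaded tokenizer.py importable and load the class directly,
# bypassing AutoTokenizer's auto_map lookup
sys.path.insert(0, local)
from tokenizer import NucEL_Tokenizer

tokenizer = NucEL_Tokenizer.from_pretrained(local)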
model     = AutoModel.from_pretrained("FreakingPotato/NucEL").eval()

sequence = "ATCGATCGATCGATCG"
inputs   = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(f"Sequence embeddings shape: {outputs.last_hidden_state.shape}")
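A script that has to run on both sides of the 5.4 boundary can choose its loading path from the installed version. The helper below is a small sketch of that check; the name `needs_direct_tokenizer_load` is hypothetical, and it compares only the major and minor numbers of the version string.

```python
import re

def needs_direct_tokenizer_load(transformers_version):
    """Return True for transformers 5.4 or newer.

    Only the leading major.minor numbers are compared, so suffixes such as
    ".dev0" or "rc1" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", transformers_version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {transformers_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (5, 4)

for version in ["4.48.0", "5.3.2", "5.4.0", "5.10.1"]:
    print(version, needs_direct_tokenizer_load(version))
# 4.48.0 False
# 5.3.2 False
# 5.4.0 True
# 5.10.1 True
```

In practice the argument would be `transformers.__version__`, with the direct-import path used when the function returns True and the Quickstart path otherwise.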

Benchmark results

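GUE (paper Table 1; MCC for all tasks except CVC, which reports F1; scores scaled by 100)
Columns: TF-H = transcription factor binding (human), PD = promoter detection, CPD = core promoter detection, SSP = splice site prediction, TF-M = transcription factor binding (mouse), EMP = epigenetic marks prediction, CVC = Covid variant classification.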

Model	TF-H	PD	CPD	SSP	TF-M	EMP	CVC	Avg
DNABERT-2 (117 M)	70.10	84.21	70.52	84.99	67.99	55.98	71.02	72.11
NT-2500M-multi	63.32	88.14	71.62	89.36	67.01	58.06	73.04	72.94
NucEL (93 M)	67.64	87.10	75.13	90.30	70.62	65.01	70.29	75.16

Genomic Benchmarks (paper Table 2, accuracy; 5-seed mean)

Task	HyenaDNA	Caduceus-Ph	NT2-100M	NucEL
Coding vs Intergenic	0.904	0.915	0.950	0.941
Human vs Worm	0.964	0.972	0.972	0.975
Human Enhancers Cohn	0.729	0.747	0.736	0.735
Human Enhancers Ensembl	0.849	0.893	0.935	0.940
Human Regulatory	0.869	0.872	0.935	0.941
Human OCR Ensembl	0.783	0.828	0.776	0.794
Human NonTATA Promoters	0.944	0.946	0.920	0.973
Average	0.863	0.882	0.890	0.899

Nucleotide Transformer benchmark (paper Table 3, 18 tasks, MCC; 10-seed mean)

NucEL averages 0.664 across the 18 NT tasks, matching NT2-Multi-500M and slightly exceeding NT-Multi-2.5B (0.661), which has about 27× more parameters. See the paper for the full per-task breakdown.
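The parameter ratio quoted for NT-Multi-2.5B can be checked with a line of arithmetic. The counts below are the nominal sizes stated in this card (93 M) and implied by the model name (2.5 B), so the result is approximate.

```python
# Nominal parameter counts taken from this card and the model name.
NUCEL_PARAMS = 93e6
NT_MULTI_2_5B_PARAMS = 2.5e9

ratio = NT_MULTI_2_5B_PARAMS / NUCEL_PARAMS
print(f"NT-Multi-2.5B is about {ratio:.1f}x the size of NucEL")
# NT-Multi-2.5B is about 26.9x the size of NucEL
```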

Intended use and limitations

Intended. Embedding extraction and fine-tuning for human-genome downstream tasks (regulatory element classification, transcription-factor binding, splice-site and promoter prediction, histone-mark profiles). NucEL has also been shown to transfer zero-shot to other species (mouse, yeast, viruses).
Limitations. Pre-trained on the human reference genome only; coverage of non-coding repeats, microbial sequences, and highly diverged clades may be limited. Ambiguous bases (N) are mapped to [UNK].
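Because ambiguous bases end up as [UNK], it can help to measure how much of an input is ambiguous before embedding it. The sketch below counts characters outside A/C/G/T; the helper name `summarize_ambiguous_bases` is hypothetical. This card only states the behavior for N, so treating other IUPAC codes (such as R) the same way is an assumption.

```python
from collections import Counter

CANONICAL_BASES = set("ACGT")

def summarize_ambiguous_bases(sequence):
    """Count characters outside A/C/G/T and report their share of the sequence.

    ASSUMPTION: the sequence is upper-cased here before checking; whether the
    NucEL tokenizer itself handles lower-case input is not stated in this card.
    """
    counts = Counter(sequence.upper())
    ambiguous = {base: n for base, n in counts.items() if base not in CANONICAL_BASES}
    total = sum(counts.values())
    fraction = sum(ambiguous.values()) / total if total else 0.0
    return ambiguous, fraction

ambiguous, fraction = summarize_ambiguous_bases("ATCGNNATCGRA")
print(ambiguous, f"{fraction:.2%}")
# {'N': 2, 'R': 1} 25.00%
```

A caller could use the returned fraction to skip or flag windows that are mostly ambiguous, with the threshold chosen per task.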

Citation

@article{ding2025nucel,
  title   = {NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training
             for Efficient and Interpretable Representations},
  author  = {Ding, Ke and Parker, Brian and Wen, Jiayu},
  journal = {bioRxiv},
  year    = {2025},
  doi     = {10.1101/2025.08.17.670700},
  url     = {https://www.biorxiv.org/content/10.1101/2025.08.17.670700v1}
}

License

Apache-2.0.

Paper for FreakingPotato/NucEL

NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations

Paper • 2508.13191 • Published Aug 15, 2025
