NucEL
Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations
NucEL is the first ELECTRA-style genomic foundation model. A small generator proposes single-nucleotide substitutions to a masked DNA sequence, and a larger ModernBERT discriminator is trained to detect which positions were replaced, giving dense supervision across all tokens instead of the ~15% covered by masked language modeling. At 93 M parameters, NucEL outperforms MLM-based models of similar size and matches or exceeds models up to 27× larger (e.g. NT-Multi-2.5B) on the GUE, GB, and NT benchmarks.
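The replaced-token detection setup described above can be illustrated with a toy example. This is a minimal sketch, not the NucEL training code: the generator is replaced by uniform random sampling, and the function name `corrupt_and_label`, the `mask_rate` default, and the seed are assumptions made for illustration.

```python
import random

NUCLEOTIDES = "ACGT"


def corrupt_and_label(sequence, mask_rate=0.15, seed=0):
    """Return (corrupted_sequence, labels) for replaced-token detection.

    labels[i] is 1 where the corrupted base differs from the original, else 0.
    """
    rng = random.Random(seed)
    corrupted = list(sequence)
    for i in range(len(sequence)):
        if rng.random() < mask_rate:
            # Stand-in for the generator (assumption): sample a base uniformly.
            corrupted[i] = rng.choice(NUCLEOTIDES)
    # Every position receives a label. A sampled base that happens to equal
    # the original is labeled "original" (0), following ELECTRA.
    labels = [int(c != o) for c, o in zip(corrupted, sequence)]
    return "".join(corrupted), labels


original = "ATCGATCGATCGATCGATCG"
corrupted, labels = corrupt_and_label(original, mask_rate=0.3, seed=1)
print(original)
print(corrupted)
print(labels)
```

The point of the sketch is that the label list is as long as the sequence, so the discriminator receives a training signal at every position rather than only at the masked ones.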
Model summary
| Property | Value |
|---|---|
| Architecture | ELECTRA RTD with ModernBERT backbone (discriminator only at inference) |
| Layers / hidden / heads | 22 / 512 / 16 |
| Intermediate size | 2048 |
| Attention | Hybrid: local window 128 + global every 3rd layer (FlashAttention-2) |
| Tokenization | Single-nucleotide (k = 1), vocabulary size 27 (4 nt + 7 special + 16 reserved) |
| Parameters | 93 M |
| Pre-training data | Human genome GRCh38 / hg38 (1224 bp windows, 100 bp overlap, 1024 bp segments) |
| Objective | L_total = L_generator + 50.0 · L_discriminator (replaced-token detection) |
| Optimizer | AdamW (β₁=0.9, β₂=0.999), lr = 1e-4, cosine schedule, 1000-step warmup |
| Compute | 8× NVIDIA A100, FP16, 50 epochs, global batch size 192 |
| Max sequence length | Pre-trained at 1024 bp; ModernBERT config supports up to 8192 |
| License | Apache-2.0 |
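Written out, the objective row of the table has the usual ELECTRA form. The expansion below is a sketch that assumes the standard ELECTRA definitions (cross-entropy over masked positions for the generator, per-position binary cross-entropy for the discriminator). The symbols $m$ (set of masked positions), $\tilde{x}$ (corrupted sequence), $p_G$, $D$, and $\lambda$ are notation introduced here, not taken from the paper.

```latex
\begin{align}
\mathcal{L}_{\text{total}} &= \mathcal{L}_{\text{generator}} + \lambda\,\mathcal{L}_{\text{discriminator}},
  \qquad \lambda = 50 \\
\mathcal{L}_{\text{generator}} &= -\sum_{i \in m} \log p_G\!\left(x_i \mid x^{\text{masked}}\right) \\
\mathcal{L}_{\text{discriminator}} &= -\sum_{i=1}^{L} \Big[
  \mathbb{1}(\tilde{x}_i = x_i)\,\log D(\tilde{x}, i)
  + \mathbb{1}(\tilde{x}_i \neq x_i)\,\log\!\big(1 - D(\tilde{x}, i)\big) \Big]
\end{align}
```

Here $D(\tilde{x}, i)$ is the discriminator's predicted probability that position $i$ still holds its original base. The discriminator sum runs over all $L$ positions, while the generator sum covers only the masked ones.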
Quickstart
Install a transformers version that honors NucEL's custom tokenizer via auto_map:

```bash
pip install "transformers>=4.48,<5.4" torch
```

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
model = AutoModel.from_pretrained("FreakingPotato/NucEL").eval()

sequence = "ATCGATCGATCGATCG"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state  # (1, L, 512)
print(f"Sequence embeddings shape: {embeddings.shape}")
```
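For sequence-level tasks, the per-token embeddings from the Quickstart are often reduced to one vector per sequence. The following is a minimal sketch of masked mean pooling, one common choice; the model card does not prescribe a pooling method, and the helper name `mean_pool` is hypothetical. Random tensors stand in for real model outputs so the example runs without downloading the checkpoint.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions.

    last_hidden_state: (batch, length, hidden)
    attention_mask:    (batch, length), 1 for real tokens and 0 for padding
    returns:           (batch, hidden)
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Stand-in tensors with the hidden size listed in the model summary (512).
hidden = torch.randn(2, 6, 512)
mask = torch.tensor([[1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)
```

With the Quickstart objects this would be called as `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`, assuming the tokenizer returns an `attention_mask` entry. Any special tokens the tokenizer adds are included in the average.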
Using transformers 5.4 or newer
Starting with transformers 5.4.0 (released 2026-03-27), modernbert-based
checkpoints were added to a MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS
blocklist that causes AutoTokenizer.from_pretrained to ignore the
auto_map field of every modernbert checkpoint, including this one. The
model itself (AutoModel.from_pretrained) is unaffected; only tokenizer
auto-discovery is blocked. To work on transformers 5.4+, load the
tokenizer class directly from the Hub instead:
```python
import sys

import torch
from huggingface_hub import snapshot_download
from transformers import AutoModel

# Download only the tokenizer files from the Hub.
local = snapshot_download(
    "FreakingPotato/NucEL",
    allow_patterns=[
        "tokenizer.py",
        "tokenizer_config.json",
        "vocab.json",
        "special_tokens_map.json",
    ],
)

# Make tokenizer.py importable, then load the tokenizer class directly.
sys.path.insert(0, local)
from tokenizer import NucEL_Tokenizer

tokenizer = NucEL_Tokenizer.from_pretrained(local)
model = AutoModel.from_pretrained("FreakingPotato/NucEL").eval()

sequence = "ATCGATCGATCGATCG"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(f"Sequence embeddings shape: {outputs.last_hidden_state.shape}")
```
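Code that has to run on both sides of the 5.4 boundary can choose between the two loading paths at runtime. The helper below is a small sketch of that decision; the function name `needs_direct_tokenizer_import` is hypothetical, and the 5.4 threshold is taken from the description above. It parses only the leading `major.minor` part so that development or pre-release version strings are still handled.

```python
import re


def needs_direct_tokenizer_import(transformers_version: str) -> bool:
    """Return True if the version is 5.4 or newer, where auto_map is ignored."""
    match = re.match(r"(\d+)\.(\d+)", transformers_version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {transformers_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison avoids the string-comparison pitfall ("5.10" < "5.4").
    return (major, minor) >= (5, 4)


for version in ("4.48.0", "5.3.2", "5.4.0", "5.10.1"):
    print(version, needs_direct_tokenizer_import(version))
```

In practice the argument would be `transformers.__version__`: use the `AutoTokenizer` path from the Quickstart when the helper returns `False`, and the direct `NucEL_Tokenizer` import otherwise.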
Benchmark results
GUE (paper Table 1, MCC; F1 for CVC; bold = best, italic = second-best)
| Model | TF-H | PD | CPD | SSP | TF-M | EMP | CVC | Avg |
|---|---|---|---|---|---|---|---|---|
| DNABERT-2 (117 M) | 70.10 | 84.21 | 70.52 | 84.99 | 67.99 | 55.98 | 71.02 | 72.11 |
| NT-2500M-multi | 63.32 | 88.14 | 71.62 | 89.36 | 67.01 | 58.06 | 73.04 | 72.94 |
| NucEL (93 M) | 67.64 | 87.10 | 75.13 | 90.30 | 70.62 | 65.01 | 70.29 | 75.16 |
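For readers unfamiliar with the metric, the Matthews correlation coefficient (MCC) can be computed from a confusion matrix as sketched below. This is a plain-Python illustration of the binary case only, not the benchmark's evaluation code; the function name is hypothetical, and multi-class GUE tasks would need the generalized form. The GUE values in the table appear to be reported on a 0 to 100 scale, whereas this function returns a value in [-1, 1].

```python
import math


def matthews_corrcoef_binary(y_true, y_pred):
    """Binary MCC from 0/1 labels; returns 0.0 when the denominator is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(matthews_corrcoef_binary(y_true, y_pred))
```

Unlike accuracy, MCC uses all four confusion-matrix cells, which makes it more informative on class-imbalanced tasks.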
Genomic Benchmarks (paper Table 2, accuracy; 5-seed mean)
| Task | HyenaDNA | Caduceus-Ph | NT2-100M | NucEL |
|---|---|---|---|---|
| Coding vs Intergenic | 0.904 | 0.915 | 0.950 | 0.941 |
| Human vs Worm | 0.964 | 0.972 | 0.972 | 0.975 |
| Human Enhancers Cohn | 0.729 | 0.747 | 0.736 | 0.735 |
| Human Enhancers Ensembl | 0.849 | 0.893 | 0.935 | 0.940 |
| Human Regulatory | 0.869 | 0.872 | 0.935 | 0.941 |
| Human OCR Ensembl | 0.783 | 0.828 | 0.776 | 0.794 |
| Human NonTATA Promoters | 0.944 | 0.946 | 0.920 | 0.973 |
| Average | 0.863 | 0.882 | 0.890 | 0.899 |
Nucleotide Transformer benchmark (paper Table 3, 18 tasks, MCC; 10-seed mean)
NucEL averages 0.664 across the 18 NT tasks, matching NT2-Multi-500M and slightly exceeding NT-Multi-2.5B (0.661) while using 27× fewer parameters. See the paper for the full per-task breakdown.
Intended use and limitations
- Intended. Embedding extraction and fine-tuning for human-genome downstream tasks (regulatory element classification, transcription-factor binding, splice-site / promoter prediction, histone-mark profiles). Also shown to transfer zero-shot to other species (mouse, yeast, viruses).
- Limitations. Pre-trained on the human reference genome only; coverage of non-coding repeats, microbial sequences, and highly diverged clades may be limited. Ambiguous bases (N) are mapped to [UNK].
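Because ambiguous bases collapse to a single unknown token, it can help to check how much of an input is ambiguous before embedding it. The helper below is a minimal sketch; the name `ambiguous_fraction` is hypothetical, and the case-insensitive comparison is an assumption, since the model card does not say how the tokenizer treats lowercase (soft-masked) bases.

```python
def ambiguous_fraction(sequence: str) -> float:
    """Fraction of characters that are not A, C, G, or T (case-insensitive)."""
    if not sequence:
        return 0.0
    canonical = set("ACGT")
    # Upper-casing is an assumption made here, not documented tokenizer behavior.
    ambiguous = sum(1 for base in sequence.upper() if base not in canonical)
    return ambiguous / len(sequence)


print(ambiguous_fraction("ATCGNNNNATCG"))
```

A caller might skip or flag windows whose fraction exceeds a chosen threshold, since embeddings of long runs of N carry little sequence information.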
Citation
@article{ding2025nucel,
title = {NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training
for Efficient and Interpretable Representations},
author = {Ding, Ke and Parker, Brian and Wen, Jiayu},
journal = {bioRxiv},
year = {2025},
doi = {10.1101/2025.08.17.670700},
url = {https://www.biorxiv.org/content/10.1101/2025.08.17.670700v1}
}
License
Apache-2.0.
Links
- Code (pretraining + fine-tuning + visualization): https://github.com/FreakingPotato/NucEL
- Paper (arXiv): https://arxiv.org/abs/2508.13191
- Paper (bioRxiv): https://www.biorxiv.org/content/10.1101/2025.08.17.670700v1