# YatNMN-Softplus d=12 Chinchilla (261M) for PyTorch / HuggingFace Transformers
A 261M-parameter nanochat-architecture GPT with the YatNMN-Softplus MLP (per-neuron softplus-positive bias, learnable ε, learnable α). Trained in JAX/Flax on TPU v6e-8 to a Chinchilla-optimal token budget on C4, then ported to PyTorch for easy inference via the HuggingFace transformers API.
This is the best-performing 261M model in the ablation series:
| MLP variant | Final smooth loss | vs GELU |
|---|---|---|
| YatNMN-Softplus (per-neuron) | 2.98 | −0.13 |
| YatNMN-Softplus + scalar_bias | 3.06 | −0.05 |
| GELU | 3.11 | baseline |
Weights are bit-exact with the Flax checkpoint (mlnomad/yatnmn-softplus-d12-chinchilla-261M); parity validated at max |Δ logits| = 1.5e-5 on CPU/fp32.
## Quick start

```bash
pip install torch transformers safetensors
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mlnomad/yatnmn-softplus-d12-chinchilla-261M-pytorch",
    trust_remote_code=True,
    dtype=torch.float32,
).eval()

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
prompt = "The meaning of life is"
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model.generate(
        ids, max_new_tokens=50,
        do_sample=True, temperature=0.8, top_p=0.9,
        use_cache=True, pad_token_id=tokenizer.eos_token_id or 0,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
Greedy completion samples:

- "The meaning of life is" → the same as life. The meaning of life is the same as life…
- "Once upon a time," → the world was a place where people could live and work. The world was a place where people could…
## YatNMN-Softplus MLP

Each MLP block uses the YatNMN nonlinearity from `nmn>=0.2.29`:

    y = α · (x · W + softplus(b))² / (‖x − W‖² + softplus(ε))

with a per-neuron bias b of shape (4·n_embd,) = (3072,), a scalar learnable epsilon of shape (1,), and a scalar learnable α of shape (1,). Both the bias and epsilon are passed through softplus to keep them strictly positive. The MLP then applies c_proj (Linear → 768) on top of YatNMN's output.
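For reference, a minimal PyTorch sketch of this nonlinearity, reading the distance term per output neuron (‖x − wᵢ‖² for each weight row wᵢ). This is a hypothetical re-implementation of the formula above; the names and initialisation in the repo's `yatnmn_gpt.py` may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class YatNMNSoftplus(nn.Module):
    """Sketch of the YatNMN-Softplus nonlinearity (hypothetical names)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(d_out))   # per-neuron bias b
        self.epsilon = nn.Parameter(torch.zeros(1))    # scalar learnable epsilon
        self.alpha = nn.Parameter(torch.ones(1))       # scalar learnable alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dot = x @ self.weight.t()                      # x · W, shape (..., d_out)
        # ||x - w_i||^2 expanded as ||x||^2 - 2 x·w_i + ||w_i||^2,
        # avoiding materialising the (..., d_out, d_in) difference tensor
        x_sq = (x * x).sum(-1, keepdim=True)
        w_sq = (self.weight * self.weight).sum(-1)
        dist_sq = x_sq - 2 * dot + w_sq
        num = (dot + F.softplus(self.bias)) ** 2
        return self.alpha * num / (dist_sq + F.softplus(self.epsilon))
```

Softplus of the zero-initialised ε keeps the denominator bounded away from zero (softplus(0) ≈ 0.693), so no extra clamping is needed.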
## Model details

| | |
|---|---|
| Parameters | 261,133,226 |
| Architecture | Nanochat-style GPT with YatNMN-Softplus MLP (ported from JAX/Flax NNX) |
| Config | d=12, n_embd=768, n_head=12, n_kv_head=12, seq_len=1024, tied embeddings, SSSL sliding window |
| Training data | allenai/c4 (English split), 5.22B tokens (Chinchilla 20×) |
| Tokenizer | mistralai/Mistral-7B-v0.1 (vocab 32,768) |
| Optimizer | plain AdamW, peak LR 0.03, warmup-cosine schedule |
| Hardware | TPU v6e-8 (TRC), europe-west4-a |
| Final loss (smooth) | 2.98 |
## Architecture features

The full nanochat stack, faithfully ported to PyTorch:

- YatNMN-Softplus MLP (per-neuron bias, softplus-positive, learnable α and ε)
- RoPE (base 100,000), split-half layout
- MHA (n_head = n_kv_head = 12; the code supports GQA via n_kv_head < n_head, but all d=12 models use full MHA)
- QK-norm with 1.2× scaling (after RoPE)
- Parameterless RMSNorm (no learnable gain) post-embedding and per block
- Sliding-window attention with "SSSL" pattern
- Tied embeddings (`lm_head = wte.T`)
- Value embeddings on alternating layers (ResFormer-style)
- Per-layer learnable residual scalars (`resid_lambdas`, `x0_lambdas`)
- Smear: a learnable gate on the first 24 dims of the token embedding mixes in the previous token
- Backout: the mid-layer residual is subtracted from late layers
- Logit soft-cap: `15 · tanh(logits / 15)`
- No biases in any Linear
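The logit soft-cap is a one-liner; a generic sketch of the stated formula (not the repo's code):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # Smoothly bounds logits to [-cap, cap]: near-identity for |logits| << cap,
    # saturating for large values, with a nonzero gradient everywhere.
    return cap * torch.tanh(logits / cap)
```

Unlike a hard clamp, the tanh form keeps gradients flowing through extreme logits during training.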
## KV cache

The YatGPTForCausalLM class implements a smear-aware KV cache for fast autoregressive generation. KV-cache parity vs the full forward pass is validated at max |Δ| < 3e-5. Pass `use_cache=True` (the default for `.generate()`).
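This kind of parity check can be reproduced for any HF causal LM by decoding one token at a time against a single full forward pass. A generic sketch, not the repo's validation script (`kv_cache_parity` is a hypothetical helper):

```python
import torch

@torch.no_grad()
def kv_cache_parity(model, ids: torch.Tensor, atol: float = 3e-5) -> bool:
    """Compare incremental decoding (one token at a time, reusing the KV
    cache) against a single full forward pass over the same ids."""
    full = model(ids, use_cache=False).logits
    past, steps = None, []
    for t in range(ids.shape[1]):
        out = model(ids[:, t : t + 1], past_key_values=past, use_cache=True)
        past = out.past_key_values
        steps.append(out.logits)
    inc = torch.cat(steps, dim=1)
    return (full - inc).abs().max().item() <= atol
```

In fp32 the two paths should agree to within a few ulps of accumulated rounding; a tolerance around 3e-5 is comfortable for a model of this size.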
## Files in this repo

```
.
├── config.json                  # HF config with auto_map → the classes below
├── generation_config.json
├── model.safetensors            # ~1.04 GB, fp32 weights + persistent RoPE buffers
├── yatnmn_gpt.py                # pure PyTorch Yat_GPT module + YatNMN layer
├── torch_gpt.py                 # shared building blocks (RMSNorm, RoPE, attention)
├── configuration_yatnmn_gpt.py  # PretrainedConfig subclass
├── modeling_yatnmn_gpt.py       # PreTrainedModel + GenerationMixin wrapper with KV cache
└── README.md
```
## Related

- mlnomad/yatnmn-softplus-d12-chinchilla-261M: original JAX/Flax Orbax checkpoint (model + AdamW optimizer state, resumable)
- mlnomad/gelu-d12-chinchilla-261M-pytorch: GELU baseline at identical compute, smooth loss 3.11
- flaxchat: JAX/Flax training harness
- nmn: the YatNMN layer (used at training time; not required for inference here, since the nonlinearity is reimplemented in pure PyTorch)
## Wikitext-103 evaluation
| Metric | Value |
|---|---|
| Wikitext-103 test loss | 3.693 |
| Wikitext-103 test PPL | 40.15 |
Evaluated on ~330K tokens from the wikitext-103 test set. The model was trained on C4 only, so this is a zero-shot transfer metric.
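An evaluation like this amounts to averaging next-token cross-entropy over non-overlapping windows of the test stream. A generic sketch; the exact chunking behind the numbers above may differ slightly:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_loss(model, token_ids: torch.Tensor, seq_len: int = 1024) -> float:
    """Mean next-token cross-entropy over non-overlapping windows of a 1-D
    token stream; perplexity is exp of the returned loss."""
    total_nll, total_tok = 0.0, 0
    for start in range(0, token_ids.numel() - 1, seq_len):
        # seq_len inputs plus one extra token so every position has a target
        chunk = token_ids[start : start + seq_len + 1].unsqueeze(0)
        if chunk.shape[1] < 2:
            break
        logits = model(chunk[:, :-1]).logits
        nll = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                              chunk[:, 1:].reshape(-1), reduction="sum")
        total_nll += nll.item()
        total_tok += chunk.shape[1] - 1
    return total_nll / total_tok
```

Stepping by `seq_len` while reading `seq_len + 1` tokens means each token is predicted exactly once, which is what a clean perplexity number requires.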
## License
Apache 2.0.