Haha no shit. I just finished writing an article on scaling SSM, Mamba-style models and popped over to see what's new in posts. I guess there's a theme today.
Not all is lost, however. The outcome was a very in-depth neural network atlas for the Qwen3-8B model, complete with its own queryable SQLite database, which I can now share with you all. The database combines these methods for a full in-depth dive:
- Neuron Taxonomy
- Category Separation Scoring
- Co-activation Analysis
- Per-Head Decomposition
- Component Comparison
- Attribution Patching
- Sparse Non-negative Matrix Factorization
- NeuronLens
- DAS SVD rotation
- Cross-layer Coherence
- SQLite database
So if you've ever wondered where a specific behaviour or ability lives in the hidden dimensions of Qwen3-8B, or perhaps wanted to make informed quantization decisions, please enjoy the fruits of my ill-informed labour lol. 😂
juiceb0xc0de/qwen3-8b-atlas
Qwen/Qwen3-8B
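For anyone who hasn't poked at an SQLite file before, here is a minimal sketch of what a "where does this behaviour live" query could look like. The table and column names (`neurons`, `layer`, `neuron_idx`, `category`, `separation_score`) are assumptions made up for illustration, not the atlas's real schema, so check the actual tables first (for example with `.schema` in the `sqlite3` shell). The sketch builds a tiny in-memory stand-in so it runs on its own.

```python
import sqlite3

def top_neurons(conn, category, limit=3):
    """Return the highest-scoring (layer, neuron_idx, score) rows for one category."""
    rows = conn.execute(
        "SELECT layer, neuron_idx, separation_score FROM neurons "
        "WHERE category = ? ORDER BY separation_score DESC LIMIT ?",
        (category, limit),
    )
    return rows.fetchall()

# Hypothetical stand-in for the real atlas file; the schema here is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE neurons ("
    "layer INTEGER, neuron_idx INTEGER, category TEXT, separation_score REAL)"
)
conn.executemany(
    "INSERT INTO neurons VALUES (?, ?, ?, ?)",
    [
        (0, 12, "math", 4.1),
        (5, 300, "math", 9.7),
        (5, 301, "code", 2.2),
        (20, 7, "math", 6.3),
    ],
)

result = top_neurons(conn, "math", limit=2)
print(result)
```

Against the real database you would swap `":memory:"` for the path to the downloaded file and adjust the query to whatever tables it actually contains.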
I applaud you on your journey into the void with small models. I too am deeply fascinated with optimizing smaller models rather than asking for more parameters and terabytes of scraped internet data. I hope to see what you've come up with in a few weeks' time.
I just finished designing a sparsity training scheduler that trains on average 35% of a model's available weights with almost no hidden dimensions between transformers adjoined and zero throughput while randomizing trainable locations. It cuts VRAM and training time down, and the models set higher benchmarks on mathematics than FFT models trained on the same corpus. I discovered this while fucking around for fun.
I don't doubt that training smaller architectures has many more surprises in store for us.
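The scheduler itself isn't shown in the post, so the following is only a rough sketch of the general idea it describes: update a random subset of roughly 35% of the weights each step and resample which locations are trainable. Everything here is an assumption for illustration: the helper names `random_mask` and `sparse_sgd_step`, the use of plain SGD, and the flat list standing in for a model's parameters.

```python
import random

def random_mask(n_weights, fraction, rng):
    """Pick a fresh random subset of weight indices to train this step."""
    k = round(n_weights * fraction)
    return set(rng.sample(range(n_weights), k))

def sparse_sgd_step(weights, grads, mask, lr=0.1):
    """Apply an SGD update only at masked positions; all other weights stay frozen."""
    return [
        w - lr * g if i in mask else w
        for i, (w, g) in enumerate(zip(weights, grads))
    ]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
weights = [1.0] * 20            # toy stand-in for a parameter tensor
grads = [0.5] * 20              # toy gradients, all non-zero

mask = random_mask(len(weights), 0.35, rng)   # resample this every step
updated = sparse_sgd_step(weights, grads, mask)

changed = sum(1 for w, u in zip(weights, updated) if w != u)
print(f"{changed} of {len(weights)} weights updated this step")
```

In a real framework the same effect is usually achieved by masking gradients before the optimizer step; the VRAM savings mentioned above would additionally depend on not storing optimizer state for frozen locations, which this toy version does not model.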
@danielhanchen what happened to this magnificent model!? I had the perfect place to slot it into my team of AI bros! I would love to see this back on HF. 🤗
Blog post - https://huggingface.co/blog/kalyan-ks/llm-guardrail-models-less-robust
Evaluated the robustness of three LLM guardrail models (GLiGuard, LlamaGuard3 and MiniGuard).
Evaluation is done using 16 text mutation attacks over three datasets (AEGIS 2.0, WildGuard and ExpGuard).
Achieved an average Unsafe ASR score of up to 33% and an average Safe ASR score of up to 25% against the GLiGuard model.
Achieved an average Unsafe ASR score of up to 35% and an average Safe ASR score of up to 17% against the LlamaGuard3-8B model.
Achieved an average Unsafe ASR score of up to 45% and an average Safe ASR score of up to 15% against the MiniGuard v0.1 model.
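The post doesn't spell out how ASR is computed, so here is a minimal sketch under one common reading: ASR as attack success rate, meaning the share of inputs the guardrail classified correctly before mutation whose prediction flips after mutation. That definition, the record layout, and the function name `attack_success_rate` are assumptions for illustration; the linked blog post is the authority on the exact metric.

```python
def attack_success_rate(records, target_label):
    """Share of correctly classified `target_label` inputs that flip after mutation.

    Each record is a tuple (gold_label, pred_before, pred_after).
    Inputs the model already got wrong before the attack are excluded.
    """
    eligible = [r for r in records if r[0] == target_label and r[1] == target_label]
    if not eligible:
        return 0.0
    flipped = sum(1 for _, _, after in eligible if after != target_label)
    return flipped / len(eligible)

# Toy data, invented for the example.
records = [
    ("unsafe", "unsafe", "safe"),    # attack succeeded: unsafe text slips through
    ("unsafe", "unsafe", "unsafe"),  # attack failed
    ("unsafe", "unsafe", "unsafe"),  # attack failed
    ("unsafe", "safe", "safe"),      # already misclassified, excluded
    ("safe", "safe", "unsafe"),      # safe text pushed into a false alarm
    ("safe", "safe", "safe"),        # attack failed
]

unsafe_asr = attack_success_rate(records, "unsafe")
safe_asr = attack_success_rate(records, "safe")
print(f"Unsafe ASR: {unsafe_asr:.2f}, Safe ASR: {safe_asr:.2f}")
```

Under this reading, Unsafe ASR measures how often harmful text evades the guardrail and Safe ASR measures how often benign text gets wrongly blocked.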
First Hindi instruction-tuned fine-tune of OpenBMB's brand-new MiniCPM5-1B (released this week).
Trained with Unsloth + LoRA (r=32) on AI4Bharat's anudesh + dolly Hindi splits — ~4k high-quality examples, 2 epochs on a single T4 in 60 minutes.
🔗 Model (16-bit + LoRA adapter):
pankajpandey-dev/MiniCPM5-1B-Hindi-Instruct
📦 GGUF quants for llama.cpp / Ollama / LM Studio:
pankajpandey-dev/MiniCPM5-1B-Hindi-Instruct-v1-GGUF
5 quant levels — from Q3_K_M (~560 MB, runs on a Raspberry Pi) to Q8_0 (~1.2 GB, near-lossless). Q4_K_M is the recommended default.
Part of my ongoing 🇮🇳 Hindi LLM Series — bringing strong open-source LLMs to Indian languages.
#Hindi #IndicNLP #MiniCPM5 #LoRA #Unsloth #GGUF #llamacpp #Ollama #LocalLLM
It is tiny and not extensively trained, which makes it prone to hallucination, so please double-check all information. I can't afford to train it more or increase the model size, so if anyone somehow has access to compute and wants to contribute, let me know.
Would you be looking for something like this?
https://huggingface.co/spaces/strangertoolshf/huggingface-user-stats
The transformer operates with the "single" setting.
AbstractPhil/geolip-svae-transformer
I've implanted a rigid formula that allows this direct behavior from the H2 battery to superimpose onto adjacent structural boundaries, and with that built aleph and void into the system as well. These are guarantees.
As for the centrifuge concept: the optimization on the centrifuge was quite lackluster. The hardware doesn't support such behavior. You can access the current operating version of the centrifuge by using the "stacked" configuration. Four lenses were too much when running a quaternion bank to handle such complex interactions reasonably, so I will need to work something out in the future to get a full centrifuge system working.
Crusher is ready, transformer_v3.
You might be curious WHY these converge at such low raw MSE in the later stages. The reasoning is kind of difficult to explain, so I'll try to make it simple. The direction is very subtle in the later stages of training with AdamW, so the curves start to create much more accurate shifts towards the goals. This allows the model to rapidly converge after earlier heavier training. You can't simply train it low, it takes too long. This allows the model to KIND OF get everything NEAR where it's supposed to be, which allows the really small twitches of MSE to provide massive corrections without needing hard logits or more difficult to finetune features.
It is small, easy to try, and meant for exploring diffusion-style decoding and latency tradeoffs in compact LMs.
Model: codelion/dhara-250m
Try the chat demo here: codelion/dhara-chat
Feels like conversations are increasingly converging around the same operational requirements: identity, interoperability, governance, trust boundaries, orchestration, and coordination between agents, tools, and services.
As agents become more operational, these infrastructure layers seem increasingly important for making larger multi-agent ecosystems reliable outside controlled environments.
https://www.linuxfoundation.org/press/agentic-ai-foundation-adds-43-new-members-as-enterprise-and-government-adoption-of-open-agent-standards-accelerates
Think skill tree meets training recipe builder. Helps you discover that you don't always need AdamW + Cosine. There's a whole ecosystem of combinations most people never try.
If you're new to ML or just stuck in a routine, build yourself a new training suite. It could be a great decision! Or a waste on GPaaS, I really can't say. I'm just a bartender, don't believe what I say most of the time.
juiceb0xc0de/forge
That is exactly what I'm planning on doing!
Update: I've completed the first 9 layers and will be taking a step back for a quick mo to adjust and update the auto trainer for finer resolution and other shit I have swimming around in my brain.
Yo every month this resets? Thank you for my new guilty pleasure. This playground feels like it was personally designed for my weird ass ideas. I'm about to get all KINDS of stupid up in here. You don't even know! 🤗
MTP enables Qwen3.6 to generate ~1.4–2.2× faster with no accuracy change.
Qwen3.6-27B: unsloth/Qwen3.6-27B-MTP-GGUF
Qwen3.6-35B-A3B: unsloth/Qwen3.6-35B-A3B-MTP-GGUF
Guide: https://unsloth.ai/docs/models/qwen3.6#mtp-guide
JumpReLU Sparse Autoencoders trained on every layer of Gemma-4-E2B-it using an adaptive Lagrangian controller. Training is in progress. I'm publishing layers live as they come hot off the press for anyone interested in following along. I will be making further adjustments for finer resolution, but the early data should be helpful, I think? I'm just a bartender, don't trust everything I say. 🤗 The Lagrangian math is pretty cool. It auto-steers the trainer, taking the guesswork out of hyperparameter adjustments.
Full paper and methodology whenever I get around to writing it up. There's a lot of work to be done. For now though, enjoy! 🤗
https://huggingface.co/juiceb0xc0de/gemma-4-e2b-saes
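Since the write-up isn't out yet, here is a hedged sketch of the two ingredients named above. The JumpReLU part is the standard definition (a pre-activation passes through unchanged only if it exceeds a threshold). The controller part is an assumption: a plain dual-ascent update that nudges a sparsity multiplier toward a target number of active features, which is one common way to build a Lagrangian controller and not necessarily the one used for these SAEs. The names `update_multiplier`, `target_l0`, and `step_size` are hypothetical.

```python
def jump_relu(pre_acts, threshold):
    """JumpReLU: keep a pre-activation unchanged if it exceeds the threshold, else zero it."""
    return [z if z > threshold else 0.0 for z in pre_acts]

def l0(acts):
    """Number of active (non-zero) features."""
    return sum(1 for a in acts if a != 0.0)

def update_multiplier(lam, observed_l0, target_l0, step_size=0.01):
    """Dual-ascent step (assumed form): raise the sparsity penalty when too many
    features fire, lower it when too few do, and never let it go negative."""
    return max(0.0, lam + step_size * (observed_l0 - target_l0))

acts = jump_relu([0.2, 1.5, -0.3, 0.9, 2.0], threshold=0.5)
lam = update_multiplier(0.1, l0(acts), target_l0=2)
print(acts, l0(acts), lam)
```

The appeal of this kind of controller is that you pick the sparsity level you want and let the multiplier find itself, instead of sweeping a fixed penalty coefficient by hand.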
The Brain Atlas is an interactive tool that lets you explore the internal behavior of Google's Gemma-4-E2B model layer by layer, head by head. Pick a behavior category, pick a layer, and see exactly which components light up and which go quiet. The dataset is fully queryable if you want to go deeper.
The mapping combines multiple single-direction techniques run in parallel across every layer and component: activation taxonomy (classifying each neuron by how broadly it fires across prompt categories), coactivation pair analysis (which neurons lock together and on what topics), F-stat behavioral separation (one-way ANOVA per feature across 16 behavior categories), per-head specificity scoring, and a full compliance probe pipeline using SVD, sparse decomposition, and variance analysis.
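For readers unfamiliar with the F-stat step, this is a minimal sketch of a one-way ANOVA F statistic for a single feature, where each group holds that feature's activations on prompts from one behavior category. The formula is the textbook one; the function name `one_way_f` and the toy numbers are made up for illustration and are not taken from the atlas pipeline.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic; `groups` holds one list of activations per category.

    F = (between-group variance) / (within-group variance).
    Raises ZeroDivisionError if every group has zero internal variance.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Three toy categories; the third clearly separates from the other two.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_f(groups)
print(f"F = {f_stat:.1f}")
```

A high F means the category means are far apart relative to the spread inside each category, which is the sense in which a component "separates" behaviors.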
Here's what I found when I ran it.
The sharpest behavioral signal isn't at the output. It's Layer 0. Up projection hits F=22.7, nearly 2x anything in the final third of the network. The model does its behavioral sorting before it's barely started, then spends the next 34 layers… doing what exactly?
The gate has a lifecycle. 70% dormant at L1, highest in the model. Brutal sparsification at L23–26 (>58% silent). Then reopens. The final five layers are the most alive gates anywhere. The model's last act is a gate flare.
Layer 4 routes 5 projections to dim 448. One layer. One dimension. That's a topology highway.
Zero specialist neurons. Not one. 1.2M neurons analyzed. None fires exclusively on a single category. This model distributes everything.
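To make the "specialist" claim concrete, here is a small sketch of how a neuron could be bucketed by how many categories it fires on. The threshold, the label names, and the function `classify_neuron` are assumptions for illustration; the atlas's actual taxonomy rules may differ.

```python
def classify_neuron(mean_acts, fire_threshold=0.1):
    """Label a neuron by how many categories push its mean activation above the threshold.

    `mean_acts` maps category name -> mean activation on that category's prompts.
    """
    firing = [c for c, a in mean_acts.items() if a > fire_threshold]
    if not firing:
        return "dormant"
    if len(firing) == 1:
        return "specialist"          # fires on exactly one category
    if len(firing) == len(mean_acts):
        return "generalist"          # fires on every category
    return "multi-category"

# Toy neurons, invented for the example.
examples = {
    "n0": {"math": 0.9, "code": 0.0, "chat": 0.02},
    "n1": {"math": 0.5, "code": 0.4, "chat": 0.3},
    "n2": {"math": 0.0, "code": 0.0, "chat": 0.0},
    "n3": {"math": 0.6, "code": 0.7, "chat": 0.01},
}
labels = {name: classify_neuron(acts) for name, acts in examples.items()}
print(labels)
```

Under a rule like this, "zero specialist neurons" would mean no neuron lands in the exactly-one-category bucket, though the count obviously depends on where the firing threshold is set.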
🧠 Space: juiceb0xc0de/gemma-4-e2b-brain-atlas
📊 Dataset (1.3M rows, fully queryable): juiceb0xc0de/gemma-4-e2b-atlas
Unsloth is an open-source project that makes training & running models more accurate and faster with less compute. Our mission is to make local AI accessible to everyone. Thanks to all of you for making this possible! 💕
Blog: https://unsloth.ai/blog/pytorch
GitHub: https://github.com/unslothai/unsloth
I don't aim to remove guardrails or the LLM identity entirely; what I want to do is dampen RLHF to a manageable volume. Personality models perform better with guardrails intact, no different than humans with moral guidelines and boundaries. Refusals can help steer and mold personality. RLHF, however, drowns out adaptability, so I'm cranking it down for you to crank your project up!
juiceb0xc0de/bella-bartender-gemma-e2b
juiceb0xc0de/locus-gemma-4-e2b