Activity Feed

Recent Activity

mlabonne
posted an update about 1 month ago
Big update to llm-datasets, my curated list of datasets and tools for post-training LLMs.

> Added many new datasets
> New "thinking" column
> Refreshed recommended tools.

Thanks to everyone who told me they used it for their research at ICLR, you motivated this update!

anakin87
posted an update about 1 month ago
A small model that struggled against a random opponent now beats GPT-5-mini at tic-tac-toe

I took LiquidAI/LFM2-2.6B and trained it through play.

🧑‍🍳 Here's how:

1️⃣ Build a solid RL env with Verifiers (Prime Intellect)
2️⃣ Generate synthetic data: <200 games sampled from GPT-5-mini playing in the env
3️⃣ SFT warm-up to teach format
4️⃣ Group-based RL (CISPO) against opponents making 20-70% random moves
5️⃣ RL again with stronger opponents (0-25% random moves) + 1.25 temperature to push exploration and shake off suboptimal strategies

Done! Beats GPT-5-mini 🏆

---

🎮 Play against the model: anakin87/LFM2-2.6B-mr-tictactoe

🤗 Model: anakin87/LFM2-2.6B-mr-tictactoe

📚 Walkthrough/course: https://github.com/anakin87/llm-rl-environments-lil-course

🤗 Dataset and checkpoints: https://huggingface.co/collections/anakin87/lfm2-26b-mr-tic-tac-toe

anakin87
posted an update about 1 month ago
Local Gemma 4 agent 💎🕵️🗺️
drop in a mysterious map, get the location, live weather, and top spots to visit

I've been exploring what google/gemma-4-E4B-it can do in a local agentic setup and put together a 📓 notebook with Gemma + Haystack AI Framework covering 4 demos.

📓 https://t.ly/04Ty5

Another interesting one is the GitHub Agent.

I initially tried to load all tools from the GitHub MCP server, quickly filling the context available on Colab -> unusable, forgetful agent ❌

Then I used the Searchable Toolset 🔎 🧰
It dynamically discovers the right tools from the GitHub MCP server on the fly, loading only what it actually needs for the task at hand, keeping context lean.

Now it actually works.
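
The idea of discovering tools on demand can be sketched in a few lines. This is a toy illustration only, not the Haystack or GitHub MCP API: the `Tool` class, the `CATALOG` entries, and `search_tools` are all hypothetical, and the ranking is plain keyword overlap rather than whatever the real toolset uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


# Hypothetical catalog: illustrative entries, not a listing of the server's real tools.
CATALOG = [
    Tool("create_issue", "Open a new issue in a GitHub repository"),
    Tool("list_pull_requests", "List pull requests for a repository"),
    Tool("search_code", "Search code across GitHub repositories"),
    Tool("get_file_contents", "Read the contents of a file in a repository"),
]


def search_tools(query: str, catalog: list, top_k: int = 2) -> list:
    """Return up to top_k tools whose name or description shares words with the query."""
    query_words = set(query.lower().split())
    scored = []
    for tool in catalog:
        text = tool.name.replace("_", " ") + " " + tool.description
        overlap = len(query_words & set(text.lower().split()))
        if overlap:
            scored.append((overlap, tool))
    # Highest overlap first; only these tools would be placed in the model's context.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:top_k]]


selected = search_tools("open an issue", CATALOG)
print([tool.name for tool in selected])
```

Only the selected tools' schemas would go into the prompt, which is what keeps the context lean compared with loading the whole catalog up front.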

The notebook also contains
💎 Multimodal weather agent: the mystery map demo above
💎 Visual Question Answering from a paper
💎 RAG on Rock music

anakin87
posted an update about 1 month ago
How does LLM training with RL environments work?

It all starts with Reinforcement Learning with Verifiable Rewards
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training


In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s)
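
A minimal sketch of what "answer checked against ground truth" can look like in code. Everything here is an assumption for illustration: the `<answer>` tag convention, the `DATASET` entries, and the function names are hypothetical and are not taken from the Verifiers library or the course.

```python
import re
from typing import Optional

# Hypothetical fixed question/answer pairs standing in for a real dataset.
DATASET = [
    {"question": "What is 7 * 6?", "answer": "42"},
    {"question": "What is 10 - 3?", "answer": "7"},
]


def extract_answer(completion: str) -> Optional[str]:
    """Return the text inside the last <answer>...</answer> tag, or None if absent."""
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def correctness_reward(completion: str, ground_truth: str) -> float:
    """1.0 when the extracted answer matches the ground truth exactly, else 0.0."""
    return 1.0 if extract_answer(completion) == ground_truth else 0.0


def format_reward(completion: str) -> float:
    """1.0 when the completion contains a parsable answer tag, else 0.0."""
    return 1.0 if extract_answer(completion) is not None else 0.0


completion = "7 times 6 is 42. <answer>42</answer>"
print(correctness_reward(completion, DATASET[0]["answer"]), format_reward(completion))
```

Both rewards are deterministic functions of the completion, which is what makes them "verifiable": no learned reward model is involved.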

Consider a more complex tic-tac-toe env ❌⭕
It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions

(envs can also include tools)
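
To make "tunable opponent skill" concrete, here is a rough sketch of an opponent that plays a random legal move with a given probability and an optimal (minimax) move otherwise. The board encoding, function names, and `random_move_prob` parameter are assumptions made for this example, not the course's actual implementation.

```python
import random
from functools import lru_cache

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board: str):
    """board is a 9-char string of 'X', 'O' or '.'; returns 'X', 'O' or None."""
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def legal_moves(board: str) -> list:
    return [i for i, cell in enumerate(board) if cell == "."]


def place(board: str, idx: int, player: str) -> str:
    return board[:idx] + player + board[idx + 1:]


@lru_cache(maxsize=None)
def minimax(board: str, player: str) -> int:
    """Best outcome for `player`, who is about to move: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == player else -1
    moves = legal_moves(board)
    if not moves:
        return 0
    other = "O" if player == "X" else "X"
    return max(-minimax(place(board, m, player), other) for m in moves)


def opponent_move(board: str, player: str, random_move_prob: float,
                  rng: random.Random) -> int:
    """Random legal move with probability random_move_prob, else a minimax-optimal move."""
    moves = legal_moves(board)
    if rng.random() < random_move_prob:
        return rng.choice(moves)
    other = "O" if player == "X" else "X"
    return max(moves, key=lambda m: -minimax(place(board, m, player), other))


rng = random.Random(0)
print(opponent_move("XX.OO....", "X", random_move_prob=0.0, rng=rng))
```

Raising `random_move_prob` weakens the opponent and lowering it strengthens it, so the same environment can serve easier and harder training stages.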

---

What happens at training?

We use Group Relative Policy Optimization with a tic-tac-toe env

No critic model is needed: the group mean is the baseline.
Simpler than PPO.

1️⃣ Rollout generation: from the same board, model plays N games via sampling
2️⃣ Each game scored with deterministic rewards (win, format, ...)
3️⃣ Mean score computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ Model updated to favor trajectories above baseline

🔁 Repeat
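
Steps 2 to 4 above can be sketched as follows. The reward weights and the `score_game` helper are hypothetical values chosen for illustration. Note that some GRPO implementations also divide by the group's standard deviation; that normalization is left out here to match the description above.

```python
from statistics import mean


def score_game(outcome: str, format_ok: bool) -> float:
    """Deterministic reward: outcome score plus a small bonus for valid output format.

    The weights below are illustrative assumptions, not the course's values.
    """
    outcome_reward = {"win": 1.0, "draw": 0.5, "loss": 0.0}[outcome]
    format_bonus = 0.25 if format_ok else 0.0
    return outcome_reward + format_bonus


def group_relative_advantages(rewards: list) -> list:
    """Advantage of each rollout = its reward minus the mean reward of the group."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Four rollouts sampled from the same starting board.
rollouts = [("win", True), ("loss", True), ("draw", True), ("win", True)]
rewards = [score_game(outcome, ok) for outcome, ok in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)     # [1.25, 0.25, 0.75, 1.25]
print(advantages)  # [0.375, -0.625, -0.125, 0.375]
```

Rollouts with positive advantage are reinforced and those with negative advantage are discouraged. The advantages always sum to zero within a group, which is why no separate critic is needed to provide a baseline.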


For a deep dive, check out
🌱 https://github.com/anakin87/llm-rl-environments-lil-course
a free hands-on course on RL environments for LLMs