1️⃣ Build a solid RL env with Verifiers (Prime Intellect)
2️⃣ Generate synthetic data: <200 games sampled from GPT-5-mini playing in the env
3️⃣ SFT warm-up to teach format
4️⃣ Group-based RL (CISPO) against opponents making 20-70% random moves
5️⃣ RL again with stronger opponents (0-25% random moves) + 1.25 temperature to push exploration and shake off suboptimal strategies
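To see why a sampling temperature of 1.25 pushes exploration, here is a minimal sketch of temperature-scaled softmax. The logits are made-up numbers for three candidate moves, not values from the actual training run.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate moves (illustrative only).
logits = [2.0, 1.0, 0.0]

p_default = softmax_with_temperature(logits, 1.0)
p_hot = softmax_with_temperature(logits, 1.25)

print("T=1.00:", [round(p, 3) for p in p_default])
print("T=1.25:", [round(p, 3) for p in p_hot])
```

A temperature above 1 flattens the distribution: the top move loses some probability mass and the less likely moves gain some, so the model tries more alternatives during rollouts.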
Local Gemma 4 agent 🕵️🗺️: drop in a mysterious map, get the location, live weather, and top spots to visit
I've been exploring what google/gemma-4-E4B-it can do in a local agentic setup and put together a notebook with Gemma + Haystack AI Framework covering 4 demos.
Another interesting one is the GitHub Agent.
I initially tried to load all tools from the GitHub MCP server, which quickly filled the context available on Colab -> unusable, forgetful agent ❌
Then I used the Searchable Toolset 🧰 It dynamically discovers the right tools from the GitHub MCP server on the fly, loading only what it actually needs for the task at hand, keeping context lean.
Now it actually works.
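The idea of discovering tools on demand can be sketched in plain Python. This is not the Haystack API: `search_tools`, the keyword-overlap scoring, and the tool names in `catalog` are all hypothetical, chosen only to show how a small subset of tools gets selected instead of loading everything into context.

```python
def search_tools(catalog, query, top_k=2):
    """Return the names of the top_k tools whose name/description share words with the query."""
    query_words = set(query.lower().split())
    scored = []
    for name, description in catalog.items():
        words = set((name.replace("_", " ") + " " + description).lower().split())
        overlap = len(query_words & words)
        if overlap:  # skip tools with no words in common
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]

# Hypothetical tool catalog (names and descriptions are made up).
catalog = {
    "list_issues": "list open issues in a repository",
    "create_pull_request": "open a new pull request",
    "get_file_contents": "read a file from a repository",
}

selected = search_tools(catalog, "list the open issues")
print(selected)
```

Only the selected tools would then be exposed to the model for that step. A real implementation would likely use a stronger retrieval method than word overlap.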
The notebook also contains:
- Multimodal weather agent: the mystery map demo above
- Visual Question Answering from a paper
- RAG on Rock music
It all starts with Reinforcement Learning with Verifiable Rewards:
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training
In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s)
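A verifiable reward can be as small as a string check. The sketch below assumes the model wraps its final answer in `<answer>...</answer>` tags; that tag format and the function names are assumptions for illustration, not something the setup above prescribes.

```python
import re

def extract_answer(completion):
    """Pull the text between <answer> tags, or None if the tags are missing."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return match.group(1).strip() if match else None

def format_reward(completion):
    """1.0 if the completion follows the assumed answer format, else 0.0."""
    return 1.0 if extract_answer(completion) is not None else 0.0

def correctness_reward(completion, ground_truth):
    """1.0 if the extracted answer matches the ground truth exactly, else 0.0."""
    answer = extract_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

completion = "<think>2 + 2 is 4</think><answer> 4 </answer>"
print(format_reward(completion), correctness_reward(completion, "4"))
```

Because both checks are deterministic, the same completion always gets the same score, which is what makes the reward "verifiable".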
Consider a more complex tic-tac-toe env ❌⭕ It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions
(envs can also include tools)
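Here is a rough, framework-free sketch of such an env, not the Verifiers implementation. The opponent plays a random move with probability `random_move_prob` and otherwise uses a simple win-or-block heuristic (an assumption; a real env might use minimax). All function names are hypothetical.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == " "]

def heuristic_move(board, player):
    """Win if possible, else block the other player, else take the first free cell."""
    other = "O" if player == "X" else "X"
    for mark in (player, other):
        for move in legal_moves(board):
            trial = board[:move] + [mark] + board[move + 1:]
            if winner(trial) == mark:
                return move
    return legal_moves(board)[0]

def opponent_move(board, player, random_move_prob, rng):
    """Tunable skill: higher random_move_prob means a weaker opponent."""
    if rng.random() < random_move_prob:
        return rng.choice(legal_moves(board))
    return heuristic_move(board, player)

def play_episode(policy, random_move_prob, seed=0):
    """Multi-turn loop: the policy plays X, the opponent plays O."""
    rng = random.Random(seed)
    board = [" "] * 9
    while True:
        move = policy(board)
        if move not in legal_moves(board):
            return "invalid"
        board[move] = "X"
        if winner(board):
            return "win"
        if not legal_moves(board):
            return "draw"
        board[opponent_move(board, "O", random_move_prob, rng)] = "O"
        if winner(board):
            return "loss"

def first_free_policy(board):
    """Stand-in for the model: always picks the first empty cell."""
    return legal_moves(board)[0]

outcome = play_episode(first_free_policy, random_move_prob=0.5, seed=42)
print(outcome)
```

In training, `policy` would be the model choosing a move from a text rendering of the board, and the outcome would feed the reward functions.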
---
What happens at training?
We use Group Relative Policy Optimization with a tic-tac-toe env
No critic model needed: the group is the baseline. Simpler than PPO.
1️⃣ Rollout generation: from the same board, model plays N games via sampling
2️⃣ Each game scored with deterministic rewards (win, format, ...)
3️⃣ Mean score computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ Model updated to favor trajectories above baseline
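Steps 3 and 4 fit in a few lines. This is a minimal sketch with made-up reward values; note that many GRPO implementations also divide by the group's standard deviation, which is left out here to match the steps as described.

```python
from statistics import mean

def group_relative_advantages(rewards):
    """Advantage of each rollout = its reward minus the group mean (the baseline)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Hypothetical scores for N=4 games played from the same board.
rewards = [1.0, 0.0, 0.5, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Rollouts with a positive advantage get reinforced, those with a negative one get pushed down, and the advantages within a group always sum to zero.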