# LFM2.5-350M Linux Command Generator

A fine-tuned version of LiquidAI/LFM2.5-350M specialized for converting natural language instructions into Linux shell commands.

## What This Model Actually Is
This is a task-specific fine-tune of LiquidAI's 350M parameter language model, trained to generate bash commands wrapped in special tokens. It was created as a research/demonstration project to explore:
- LoRA fine-tuning for command generation tasks
- GRPO (Group Relative Policy Optimization) for reinforcement learning from rewards
- Custom format training using special tokens
- Production pipeline with Azure OpenAI dataset generation
## Architecture & Training

### Base Model
- Model: LiquidAI/LFM2.5-350M (350M parameters)
- Architecture: Transformer decoder-only
- Context Length: 4096 tokens
### Training Pipeline

```text
OpenAI GPT-4 Dataset (3,000 examples)
        ↓
SFT Training (4 epochs, LoRA r=16, BF16)
  - assistant_only_loss=True
  - Custom data collator with assistant_masks
        ↓
GRPO Training (2 epochs, 7 reward functions)
  - beta=0.04 (KL constraint)
  - num_generations=3 per prompt
  - Temperature annealing 0.7 → 0.45
        ↓
Final Merged Model
```
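GRPO's key idea is normalizing each sampled completion's reward against the other generations for the same prompt. A minimal sketch of that group-relative advantage computation (illustrative only, not the TRL implementation):

```python
def group_relative_advantages(rewards):
    # GRPO normalizes each reward against its prompt's generation group:
    # A_i = (r_i - mean(group)) / std(group)
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        # All generations scored equally: no learning signal for this group.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

With `num_generations=3`, each prompt yields a group of three rewards that are scored relative to each other, so only *better-than-siblings* completions get positive advantage.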
### Dataset (15 Categories, ~3,000 examples)
| Category | Examples | Description |
|---|---|---|
| file_operations | 450 | ls, cp, mv, rm, mkdir |
| text_processing | 400 | grep, awk, sed, cut, sort |
| file_search | 300 | find, locate, which |
| process_management | 300 | ps, kill, pkill, nohup |
| networking | 250 | ping, curl, wget, ssh, scp |
| permissions | 200 | chmod, chown, sudo |
| archives_compression | 200 | tar, gzip, zip |
| system_info | 200 | df, du, free, uptime |
| io_redirection | 200 | pipes, >, >>, tee |
| environment | 150 | export, alias, source |
| monitoring | 150 | watch, lsof, journalctl |
| user_management | 150 | useradd, passwd, id |
| disk_storage | 150 | lsblk, mount, fdisk |
| string_patterns | 150 | grep -E, sed -E patterns |
| shell_scripting | 150 | for loops, if statements |
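For illustration, a single training example plausibly looks like the following chat-format record; the exact dataset schema is not published, so the field names here are assumptions:

```python
# Hypothetical training example (illustrative schema, not the published one).
sample = {
    "category": "file_search",
    "messages": [
        {"role": "system", "content": "You are a Linux command assistant."},
        {"role": "user", "content": "Find all PDF files modified in the last 7 days"},
        {"role": "assistant",
         "content": '<|tool_call_start|>find . -name "*.pdf" -mtime -7<|tool_call_end|>'},
    ],
}
```

With `assistant_only_loss=True`, only the tokens of the final assistant turn contribute to the SFT loss.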
## Output Format (v30)

The model outputs raw bash commands between special tokens:

```text
<|tool_call_start|>find . -name "*.py" -mtime -7<|tool_call_end|>
```

Key characteristics:
- No function wrappers (`linux_command(...)`) - just raw bash
- Uses LFM2.5's native special tokens: `<|tool_call_start|>`, `<|tool_call_end|>`
- Designed for direct extraction and execution
## Performance Metrics (Actual)
| Metric | Score | Notes |
|---|---|---|
| Format Accuracy | 100% | Correct use of special tokens |
| Tool Name Accuracy | 98% | Raw command format (no wrappers) |
| Exact Match | 24% | String match with reference command |
| Command F1 | 0.58 | Token-level F1 score |
### What These Numbers Mean
- 100% Format Accuracy: The model consistently outputs commands in the correct format with proper special tokens
- 98% Tool Name: Almost never uses old function-wrapper format
- 24% Exact Match: Matches the reference command exactly 1 in 4 times (this is actually competitive for 350M parameters)
- 0.58 F1: Moderate token overlap with reference commands
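The Command F1 metric can be sketched as token-level precision/recall over whitespace-split commands (the exact tokenization used in evaluation is an assumption):

```python
from collections import Counter

def command_f1(pred: str, ref: str) -> float:
    # Token-level F1 between predicted and reference commands,
    # using simple whitespace tokenization.
    pred_toks, ref_toks = pred.split(), ref.split()
    if not pred_toks or not ref_toks:
        return 0.0
    # Multiset intersection counts shared tokens (with multiplicity).
    common = sum((Counter(pred_toks) & Counter(ref_toks)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_toks)
    recall = common / len(ref_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, `ls -la /tmp` vs. the reference `ls -l /tmp` shares 2 of 3 tokens, giving F1 ≈ 0.67 even though the exact-match score is 0, which is why F1 sits well above EM in the table.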
### Comparison Context
| Model | Parameters | NL2Bash EM | Notes |
|---|---|---|---|
| GPT-4 | Undisclosed | ~50% | Proprietary, cloud-only |
| StarCoder2-7b | 7B | ~35% | 20x larger |
| CodeLlama-7b | 7B | ~30% | 20x larger |
| This Model | 350M | 24% | Fully open, edge-runnable |
| CodeT5-base | 220M | ~18% | Smaller but older arch |
Takeaway: For a 350M parameter model, 24% EM is reasonable. It's not SOTA, but it's competitive for the size class and runs on minimal hardware.
## Technical Highlights

### 1. LoRA Fine-Tuning

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                        # Rank
    lora_alpha=32,               # Scaling
    lora_dropout=0.05,
    target_modules="all-linear",
    bias="none",
)
```
### 2. GRPO Reward Functions (7 total)

- `reward_format`: Correct special token usage (+2/-1)
- `reward_tool_name`: Raw command format validation (+2/-2)
- `reward_exact_cmd`: Exact string match (+2, partial credit)
- `reward_similarity`: Token F1 similarity (0-1)
- `reward_safety`: Dangerous command penalty (-3)
- `reward_penalties`: Termination and structure quality
- `reward_structure`: Content quality and format
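As an illustration, the first of these rewards might look like the following sketch; the actual reward code is not published, so the regex and scoring here simply mirror the +2/-1 description above:

```python
import re

# Matches exactly one well-formed tool-call span in a completion.
TOOL_RE = re.compile(r"<\|tool_call_start\|>(.+?)<\|tool_call_end\|>", re.DOTALL)

def reward_format(completion: str) -> float:
    # +2 for exactly one properly delimited command, -1 otherwise
    # (missing tokens, unclosed span, or multiple spans).
    matches = TOOL_RE.findall(completion)
    return 2.0 if len(matches) == 1 else -1.0
```

In TRL's `GRPOTrainer`, each such function scores every sampled completion and the per-function scores are summed into the final reward.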
### 3. Critical Bug Fixes Applied

#### Tokenizer Patch for GRPO

```python
# TRL's GRPOTrainer calls batch_decode with skip_special_tokens=True,
# which strips our format tokens. We monkey-patch it to force
# skip_special_tokens=False.
original = tokenizer.batch_decode

def _forced_decode(sequences, skip_special_tokens=True, **kwargs):
    return original(sequences, skip_special_tokens=False, **kwargs)

tokenizer.batch_decode = _forced_decode
```
#### Pickle Fix for odict_keys

```python
# TRL/Transformers can fail to pickle odict_keys when saving checkpoints.
# We monkey-patch Trainer._save to convert dict-view objects to lists first.
from collections.abc import KeysView, ValuesView, ItemsView

def patched_save(self, output_dir, state_dict=None):
    if hasattr(self, 'model_kwarg_keys'):
        keys = self.model_kwarg_keys
        if isinstance(keys, (KeysView, ValuesView, ItemsView)):
            self.model_kwarg_keys = list(keys)
    # ... sanitize and save
```
### 4. Training Optimizations
- Right padding for training (assistant_only_loss requirement)
- BF16 mixed precision for speed
- Gradient checkpointing for memory
- Temperature annealing (0.7 → 0.45) for exploration → exploitation
- Milestone checkpoints at 10%, 50%, 100%
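The temperature annealing above can be sketched as a linear ramp over training steps (the exact schedule shape used in training is an assumption):

```python
def annealed_temperature(step: int, total_steps: int,
                         t_start: float = 0.7, t_end: float = 0.45) -> float:
    # Linearly interpolate sampling temperature from t_start to t_end,
    # trading early exploration for late exploitation.
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + (t_end - t_start) * frac
```

Early in training, higher temperature yields more diverse generations (richer reward signal per group); later, lower temperature concentrates probability mass on the patterns the rewards have already favored.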
## Usage

### Basic Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import re

model_id = "2796gauravc/lfm25-350m-linux-grpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Prepare prompt
messages = [
    {"role": "system", "content": "You are a Linux command assistant."},
    {"role": "user", "content": "Find all PDF files modified in the last 7 days"}
]

# Tokenize (return_dict=True so we get input_ids and attention_mask)
enc = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
input_ids = enc.input_ids.to(model.device)
attention_mask = enc.attention_mask.to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id
    )

# Decode with special tokens preserved
response = tokenizer.decode(
    outputs[0][input_ids.size(-1):],
    skip_special_tokens=False
)

# Extract command
match = re.search(
    r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>",
    response,
    re.DOTALL
)
if match:
    command = match.group(1).strip()
    print(f"Generated: {command}")
    # Example output: find . -name "*.pdf" -mtime -7
```
## Hardware Requirements
| Mode | VRAM | RAM | Speed / Total Time |
|---|---|---|---|
| Inference (GPU) | 2GB | 4GB | ~100 tokens/s |
| Inference (CPU) | - | 4GB | ~20 tokens/s |
| Training (SFT) | 16GB | 32GB | ~2 hrs |
| Training (GRPO) | 20GB | 32GB | ~3 hrs |
## Limitations & Honest Assessment

### What It Does Well
- ✅ Format compliance: Always uses correct special tokens
- ✅ Simple commands: Good at basic file operations, text processing
- ✅ Edge deployment: Small enough to run on consumer hardware
- ✅ No function wrappers: Clean raw command output
### What It Struggles With
- ❌ Complex pipelines: Multi-stage commands with pipes
- ❌ Exact match: Only 24% match reference exactly (but many alternatives are valid)
- ❌ Edge cases: Unusual flags or rare utilities
- ❌ Context awareness: No memory of previous commands
### Known Issues

- **Semantic equivalence, not string equivalence**: Many valid bash commands exist for the same task. The model may generate a correct alternative that doesn't match the reference string.
- **Safety**: While we filter dangerous patterns in training, the model could still suggest risky commands. Always review before execution.
- **Overfitting to training patterns**: May repeat common patterns from the training data.
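To make the safety caveat concrete, a minimal pre-execution filter might look like this sketch; the patterns below are illustrative examples only, not the actual training-time filter:

```python
import re

# Illustrative deny-list of obviously destructive patterns.
DANGEROUS = [
    r"\brm\s+-rf\s+/",      # recursive delete from the filesystem root
    r"\bmkfs(\.\w+)?\b",    # filesystem format
    r"\bdd\s+.*of=/dev/",   # raw writes to block devices
    r":\(\)\s*{.*};\s*:",   # classic fork bomb
]

def looks_dangerous(cmd: str) -> bool:
    # True if any deny-list pattern matches; a real guard would be far
    # stricter (allow-lists, sandboxing, or human confirmation).
    return any(re.search(p, cmd) for p in DANGEROUS)
```

A filter like this is a last-resort sanity check, not a substitute for reviewing commands before execution.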
## Citation

```bibtex
@misc{lfm25-350m-linux-grpo,
  title={LFM2.5-350M Linux Command Generator},
  author={Gaurav Chauhan},
  year={2026},
  howpublished={\url{https://huggingface.co/2796gauravc/lfm25-350m-linux-grpo}},
  note={350M parameter NL2Bash model with LoRA + GRPO training}
}
```
## License
MIT License - See LICENSE file for details.
Base model: LiquidAI/LFM2.5-350M (Apache 2.0)
## Acknowledgments
- LiquidAI for the LFM2.5 base model
- HuggingFace for Transformers and TRL libraries
- Azure OpenAI for dataset generation API