LFM2.5-350M Linux Command Generator

A fine-tuned version of LiquidAI/LFM2.5-350M specialized for converting natural language instructions into Linux shell commands.

What This Model Actually Is

This is a task-specific fine-tune of LiquidAI's 350M parameter language model, trained to generate bash commands wrapped in special tokens. It was created as a research/demonstration project to explore:

  1. LoRA fine-tuning for command generation tasks
  2. GRPO (Group Relative Policy Optimization) for reinforcement learning from rewards
  3. Custom format training using special tokens
  4. Production pipeline with Azure OpenAI dataset generation

Architecture & Training

Base Model

  • Model: LiquidAI/LFM2.5-350M (350M parameters)
  • Architecture: Transformer decoder-only
  • Context Length: 4096 tokens

Training Pipeline

```text
OpenAI GPT-4 Dataset (3,000 examples)
    ↓
SFT Training (4 epochs, LoRA r=16, BF16)
    - assistant_only_loss=True
    - Custom data collator with assistant_masks
    ↓
GRPO Training (2 epochs, 7 reward functions)
    - beta=0.04 (KL constraint)
    - num_generations=3 per prompt
    - Temperature annealing 0.7 → 0.45
    ↓
Final Merged Model
```

Dataset (15 Categories, ~3,000 examples)

| Category | Examples | Description |
|----------|----------|-------------|
| file_operations | 450 | ls, cp, mv, rm, mkdir |
| text_processing | 400 | grep, awk, sed, cut, sort |
| file_search | 300 | find, locate, which |
| process_management | 300 | ps, kill, pkill, nohup |
| networking | 250 | ping, curl, wget, ssh, scp |
| permissions | 200 | chmod, chown, sudo |
| archives_compression | 200 | tar, gzip, zip |
| system_info | 200 | df, du, free, uptime |
| io_redirection | 200 | pipes, >, >>, tee |
| environment | 150 | export, alias, source |
| monitoring | 150 | watch, lsof, journalctl |
| user_management | 150 | useradd, passwd, id |
| disk_storage | 150 | lsblk, mount, fdisk |
| string_patterns | 150 | grep -E, sed -E patterns |
| shell_scripting | 150 | for loops, if statements |

Output Format (v30)

The model outputs raw bash commands between special tokens:

```
<|tool_call_start|>find . -name "*.py" -mtime -7<|tool_call_end|>
```

Key characteristics:

  • No function wrappers (linux_command(...)) - just raw bash
  • Uses LFM2.5's native special tokens: <|tool_call_start|>, <|tool_call_end|>
  • Designed for direct extraction and execution
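Extraction from this format reduces to one regex over the decoded output. A minimal helper sketch (the function name `extract_command` is our own, not part of the model's tooling):

```python
import re
from typing import Optional

# Non-greedy match between LFM2.5's native tool-call tokens; DOTALL in case
# the generated command spans multiple lines.
TOOL_CALL_RE = re.compile(r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL)

def extract_command(model_output: str) -> Optional[str]:
    # Pull the raw bash command out of the special-token wrapper.
    m = TOOL_CALL_RE.search(model_output)
    return m.group(1).strip() if m else None
```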

Performance Metrics (Actual)

| Metric | Score | Notes |
|--------|-------|-------|
| Format Accuracy | 100% | Correct use of special tokens |
| Tool Name Accuracy | 98% | Raw command format (no wrappers) |
| Exact Match | 24% | String match with reference command |
| Command F1 | 0.58 | Token-level F1 score |

What These Numbers Mean

  • 100% Format Accuracy: The model consistently outputs commands in the correct format with proper special tokens
  • 98% Tool Name: Almost never uses old function-wrapper format
  • 24% Exact Match: Matches the reference command exactly 1 in 4 times (this is actually competitive for 350M parameters)
  • 0.58 F1: Moderate token overlap with reference commands
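For concreteness, token-level command F1 can be computed as a multiset overlap of whitespace-split tokens. This is a sketch of one common definition; the evaluation script's exact tokenization is an assumption:

```python
from collections import Counter

def command_f1(prediction: str, reference: str) -> float:
    # Token-level F1: multiset overlap of whitespace-split tokens.
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a prediction that differs from the reference in a single token out of four (e.g. `find .` vs `find /`) still scores 0.75, which is why F1 sits well above exact match.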

Comparison Context

| Model | Parameters | NL2Bash EM | Notes |
|-------|------------|------------|-------|
| GPT-4 | ~? | ~50% | Proprietary, cloud-only |
| StarCoder2-7b | 7B | ~35% | 20x larger |
| CodeLlama-7b | 7B | ~30% | 20x larger |
| This Model | 350M | 24% | Fully open, edge-runnable |
| CodeT5-base | 220M | ~18% | Smaller but older arch |

Takeaway: For a 350M parameter model, 24% EM is reasonable. It's not SOTA, but it's competitive for the size class and runs on minimal hardware.

Technical Highlights

1. LoRA Fine-Tuning

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # Rank
    lora_alpha=32,           # Scaling
    lora_dropout=0.05,
    target_modules="all-linear",
    bias="none",
)
```

2. GRPO Reward Functions (7 total)

  • reward_format: Correct special token usage (+2/-1)
  • reward_tool_name: Raw command format validation (+2/-2)
  • reward_exact_cmd: Exact string match (+2, partial credit)
  • reward_similarity: Token F1 similarity (0-1)
  • reward_safety: Dangerous command penalty (-3)
  • reward_penalties: Termination and structure quality
  • reward_structure: Content quality and format
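A minimal sketch of the first of these, `reward_format`, assuming TRL's GRPO convention of reward functions that take a batch of completions and return one float per completion (the exact implementation and scoring details here are illustrative, not the authors' code):

```python
import re

# Reward exactly one well-formed tool-call span per completion:
# +2.0 if present, -1.0 otherwise (the +2/-1 values come from the list above).
TOOL_CALL_RE = re.compile(r"<\|tool_call_start\|>(.+?)<\|tool_call_end\|>", re.DOTALL)

def reward_format(completions, **kwargs):
    return [2.0 if len(TOOL_CALL_RE.findall(text)) == 1 else -1.0
            for text in completions]
```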

3. Critical Bug Fixes Applied

Tokenizer Patch for GRPO

```python
# TRL's GRPOTrainer calls tokenizer.batch_decode with skip_special_tokens=True,
# which strips our format tokens. We monkey-patch to force it to False.
original = tokenizer.batch_decode

def _forced_decode(sequences, skip_special_tokens=True, **kwargs):
    return original(sequences, skip_special_tokens=False, **kwargs)

tokenizer.batch_decode = _forced_decode
```

Pickle Fix for odict_keys

```python
from collections.abc import ItemsView, KeysView, ValuesView

# TRL/Transformers has issues pickling odict_keys when saving checkpoints.
# We monkey-patch Trainer._save to convert dict views to lists before saving.
def patched_save(self, output_dir, state_dict=None):
    keys = getattr(self, 'model_kwarg_keys', None)
    if isinstance(keys, (KeysView, ValuesView, ItemsView)):
        self.model_kwarg_keys = list(keys)
    # ... sanitize and save
```

4. Training Optimizations

  • Right padding for training (assistant_only_loss requirement)
  • BF16 mixed precision for speed
  • Gradient checkpointing for memory
  • Temperature annealing (0.7 → 0.45) for exploration → exploitation
  • Milestone checkpoints at 10%, 50%, 100%
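The temperature anneal from the list above can be expressed as a simple schedule. A linear interpolation is assumed here; the actual schedule shape used in training is not documented:

```python
def annealed_temperature(step: int, total_steps: int,
                         t_start: float = 0.7, t_end: float = 0.45) -> float:
    # Linearly interpolate from t_start (exploration) down to t_end
    # (exploitation) over the course of training.
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + (t_end - t_start) * frac
```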

Usage

Basic Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import re

model_id = "2796gauravc/lfm25-350m-linux-grpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Prepare prompt
messages = [
    {"role": "system", "content": "You are a Linux command assistant."},
    {"role": "user", "content": "Find all PDF files modified in the last 7 days"}
]

# Tokenize (return_dict=True so we get input_ids and attention_mask)
enc = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
input_ids = enc.input_ids.to(model.device)
attention_mask = enc.attention_mask.to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id
    )

# Decode with special tokens preserved
response = tokenizer.decode(
    outputs[0][input_ids.size(-1):],
    skip_special_tokens=False
)

# Extract command
match = re.search(
    r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>",
    response,
    re.DOTALL
)
if match:
    command = match.group(1).strip()
    print(f"Generated: {command}")
    # e.g.: find . -name "*.pdf" -mtime -7
```

Hardware Requirements

| Mode | VRAM | RAM | Speed |
|------|------|-----|-------|
| Inference (GPU) | 2GB | 4GB | ~100 tokens/s |
| Inference (CPU) | - | 4GB | ~20 tokens/s |
| Training (SFT) | 16GB | 32GB | ~2 hrs |
| Training (GRPO) | 20GB | 32GB | ~3 hrs |

Limitations & Honest Assessment

What It Does Well

  1. Format compliance: Always uses correct special tokens
  2. Simple commands: Good at basic file operations, text processing
  3. Edge deployment: Small enough to run on consumer hardware
  4. No function wrappers: Clean raw command output

What It Struggles With

  1. Complex pipelines: Multi-stage commands with pipes
  2. Exact match: Only 24% match reference exactly (but many alternatives are valid)
  3. Edge cases: Unusual flags or rare utilities
  4. Context awareness: No memory of previous commands

Known Issues

  1. Semantic equivalence not string equivalence: Many valid bash commands exist for the same task. The model may generate a correct alternative that doesn't match the reference string.
  2. Safety: While we filter dangerous patterns in training, the model could still suggest risky commands. Always review before execution.
  3. Overfitting to training patterns: May repeat common patterns from the training data.
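Given issue 2, a conservative pre-execution check is cheap insurance. The pattern list below is purely illustrative (it is not the training-time filter) and is nowhere near exhaustive; it does not replace human review:

```python
import re

# Illustrative dangerous-pattern list only; NOT the model's training filter.
DANGEROUS_PATTERNS = [
    r"\brm\s+(-[a-z]*r[a-z]*f|-[a-z]*f[a-z]*r)\b",  # rm -rf / rm -fr variants
    r"\bmkfs\b",                                     # filesystem creation
    r"\bdd\s+if=",                                   # raw disk reads/writes
    r">\s*/dev/sd[a-z]",                             # redirecting onto a disk
    r":\(\)\s*\{\s*:\|:&\s*\};:",                    # classic fork bomb
]

def looks_dangerous(command: str) -> bool:
    # Flag a generated command for manual review before execution.
    return any(re.search(p, command) for p in DANGEROUS_PATTERNS)
```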

Citation

```bibtex
@misc{lfm25-350m-linux-grpo,
  title={LFM2.5-350M Linux Command Generator},
  author={Gaurav Chauhan},
  year={2026},
  howpublished={\url{https://huggingface.co/2796gauravc/lfm25-350m-linux-grpo}},
  note={350M parameter NL2Bash model with LoRA + GRPO training}
}
```

License

MIT License - See LICENSE file for details.

Base model: LiquidAI/LFM2.5-350M (Apache 2.0)

Acknowledgments

  • LiquidAI for the LFM2.5 base model
  • HuggingFace for Transformers and TRL libraries
  • Azure OpenAI for dataset generation API