Instructions to use trl-internal-testing/tiny-Gemma4ForConditionalGeneration with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use trl-internal-testing/tiny-Gemma4ForConditionalGeneration with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="trl-internal-testing/tiny-Gemma4ForConditionalGeneration")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("trl-internal-testing/tiny-Gemma4ForConditionalGeneration")
model = AutoModelForImageTextToText.from_pretrained("trl-internal-testing/tiny-Gemma4ForConditionalGeneration")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use trl-internal-testing/tiny-Gemma4ForConditionalGeneration with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "trl-internal-testing/tiny-Gemma4ForConditionalGeneration"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "trl-internal-testing/tiny-Gemma4ForConditionalGeneration",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
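Once the server is running, any OpenAI-compatible client can talk to it. A minimal Python sketch using the openai package; the base URL and the placeholder API key below assume the default settings of the vllm serve command above:
# Query the local vLLM server through its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="trl-internal-testing/tiny-Gemma4ForConditionalGeneration",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    },
                },
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)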
- SGLang
How to use trl-internal-testing/tiny-Gemma4ForConditionalGeneration with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "trl-internal-testing/tiny-Gemma4ForConditionalGeneration" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "trl-internal-testing/tiny-Gemma4ForConditionalGeneration",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
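The same request can also be made from Python; a small sketch using the requests library against the server started above (port 30000 as in the launch command, payload fields mirroring the curl call):
# Query the local SGLang server through its OpenAI-compatible API
import requests

payload = {
    "model": "trl-internal-testing/tiny-Gemma4ForConditionalGeneration",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    },
                },
            ],
        }
    ],
}
resp = requests.post("http://localhost:30000/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])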
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "trl-internal-testing/tiny-Gemma4ForConditionalGeneration" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "trl-internal-testing/tiny-Gemma4ForConditionalGeneration",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
- Docker Model Runner
How to use trl-internal-testing/tiny-Gemma4ForConditionalGeneration with Docker Model Runner:
docker model run hf.co/trl-internal-testing/tiny-Gemma4ForConditionalGeneration
Update chat_template.jinja
Making the Gemma4 chat template prefix-preserving
Problem
The new template inlines tool responses into the model turn via a forward-scan. This means appending a role:tool message changes tokens before the turn-end marker, breaking prefix-preservation:
# Without tool response → turn closes immediately:
...model\n<|tool_call>call:func{}<tool_call|><turn|>\n
# With tool response → new content inserted before <turn|>:
...model\n<|tool_call>call:func{}<tool_call|><|tool_response>...<tool_response|>...
Fix (3 changes)
1. Handle role:tool messages in the main loop — render them as standalone <|tool_response> blocks after the model turn closes, instead of inlining them:
{#- Loop through messages -#}
{%- for message in loop_messages -%}
- {%- if message['role'] != 'tool' -%}
+ {%- if message['role'] == 'tool' -%}
+ {#- Render tool responses as standalone blocks (outside model turn) for prefix-preservation -#}
+ {%- set tool_name = message.get('name') | default('unknown') -%}
+ {%- set tool_body = message.get('content') -%}
+ {%- if tool_body is string -%}
+ {{- format_tool_response_block(tool_name, tool_body) -}}
+ {%- elif tool_body is sequence and tool_body is not string -%}
+ {%- set ns_txt = namespace(s='') -%}
+ {%- for part in tool_body -%}
+ {%- if part.get('type') == 'text' -%}
+ {%- set ns_txt.s = ns_txt.s + (part.get('text') | default('')) -%}
+ {%- endif -%}
+ {%- endfor -%}
+ {{- format_tool_response_block(tool_name, ns_txt.s) -}}
+ {%- else -%}
+ {{- format_tool_response_block(tool_name, tool_body) -}}
+ {%- endif -%}
+ {%- else -%}
2. Include tool messages in the previous-message scan — so an assistant message after a tool opens a new <|turn>model instead of continuing the previous model turn:
{%- if loop.index0 > 0 -%}
{%- for j in range(loop.index0 - 1, -1, -1) -%}
{%- if not prev_nt.found -%}
- {%- if loop_messages[j]['role'] != 'tool' -%}
- {%- set prev_nt.role = loop_messages[j]['role'] -%}
- {%- set prev_nt.found = true -%}
- {%- endif -%}
+ {%- set prev_nt.role = loop_messages[j]['role'] -%}
+ {%- set prev_nt.found = true -%}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
3. Remove the forward-scan that inlined tool responses into the model turn:
{%- if message.get('tool_responses') -%}
{#- Legacy: tool_responses embedded on the assistant message (Google/Gemma native) -#}
...
- {%- elif message.get('tool_calls') -%}
- {#- OpenAI Chat Completions: forward-scan consecutive role:tool messages -#}
- {%- set ns_tool_scan = namespace(stopped=false) -%}
- {%- for k in range(loop.index0 + 1, loop_messages | length) -%}
- ... (35 lines removed)
- {%- endfor -%}
{%- endif -%}
Result
# Before (inlined, not prefix-preserving):
<|turn>model
<|tool_call>call:multiply{a:3,b:4}<tool_call|><|tool_response>response:multiply{value:<|"|>12<|"|>}<tool_response|><|turn>model
<|channel>thought
<channel|>
# After (standalone, prefix-preserving):
<|turn>model
<|tool_call>call:multiply{a:3,b:4}<tool_call|><turn|>
<|tool_response>response:multiply{value:<|"|>12<|"|>}<tool_response|><|turn>model
<|channel>thought
<channel|>
The model turn now closes with <turn|> before the tool response, so appending tool messages only adds tokens after the existing output.
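One way to sanity-check this property is to render the conversation with and without the appended tool message and assert that the shorter rendering is a prefix of the longer one. A minimal sketch, assuming the updated chat_template.jinja sits next to the script and that the message fields used here (name, tool_calls, string content) match what the template expects:
# Sketch: verify prefix-preservation of the updated template
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("trl-internal-testing/tiny-Gemma4ForConditionalGeneration")
template = open("chat_template.jinja").read()  # the updated template file

chat = [
    {"role": "user", "content": "What is 3 * 4?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [{"type": "function", "function": {"name": "multiply", "arguments": {"a": 3, "b": 4}}}],
    },
]
tool_msg = {"role": "tool", "name": "multiply", "content": "12"}

before = tok.apply_chat_template(chat, chat_template=template, tokenize=False)
after = tok.apply_chat_template(chat + [tool_msg], chat_template=template, tokenize=False)

# With the fix, appending the tool message only adds tokens after the
# existing output, so the earlier rendering is a strict prefix.
assert after.startswith(before)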