Fine-Tuning FunctionGemma 270M for Tool Calling
TL;DR - Key Takeaways

  • Google's FunctionGemma 270M is a tiny Gemma 3 model designed for function calling — but the base version produces almost zero valid tool calls on unseen schemas
  • LoRA fine-tuning on ~13K general function-calling examples (Salesforce xLAM-60k + MadeAgents irrelevance data) took 25 minutes on an H100 GPU via vast.ai
  • Benchmarked with lm-evaluation-harness: +29% tool selection accuracy, +39% first tool accuracy, +20% parameter accuracy
  • End-to-end through a multi-protocol tool agent: 14% → 57% tool selection on a 7-query evaluation
  • The fine-tuned model is published on HuggingFace and the agent code is on GitHub
  • At 270M parameters, the model is strong on simple tool schemas but struggles with 14+ complex tools — a 3B+ model would be the next step

The Problem: Small Models Can't Call Tools

Function calling — where a language model decides which API or tool to invoke and with what arguments — is one of the most practical capabilities for AI agents. Models like GPT-4, Claude, and Gemini handle it natively. But what about models small enough to run on a phone, a Raspberry Pi, or an edge device?

Google released FunctionGemma, a 270M parameter variant of Gemma 3, specifically designed for function calling. It uses a unique control-token format:

<start_of_turn>user
You are a model that can do function calling with the following functions

{"name": "get_weather", "parameters": {"city": "string"}}

What's the weather in Tokyo?<end_of_turn>
<start_of_turn>model
<start_function_call>call:get_weather{city:<escape>Tokyo<escape>}<end_function_call><end_of_turn>

The idea is compelling — a model small enough to run anywhere, structured enough to call tools reliably. But there's a catch: the base model barely works on unseen tool schemas. In my testing, it produced zero valid function calls for common queries like "What's the weather?" or "Send an email."

So I decided to fine-tune it.


Training Pipeline

The full pipeline from raw data to published model looks like this:

[Diagram: training pipeline from raw data to published model]

Datasets

I combined two public datasets from HuggingFace, filtered to general-purpose function calling:

| Dataset | Source | Sampled | Purpose |
| --- | --- | --- | --- |
| Salesforce/xlam-function-calling-60k | HuggingFace | 10,000 | General function calling across diverse tools |
| MadeAgents/xlam-irrelevance-7.5k | HuggingFace | 3,000 | Negative examples: queries where no tool is applicable |

The combined dataset totals ~13,000 examples spanning categories like ToolBench, xLAM-60k, OpenFunctions, and irrelevance/refusal samples.

Format Conversion

The datasets ship in xLAM-2's ChatML format (JSON arrays of tool calls). FunctionGemma expects a completely different format with Gemma 3 turn markers and special control tokens. A conversion script transforms each example:

xLAM-2 input:

[{"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "celsius"}}]

FunctionGemma output:

<start_function_call>call:get_weather{city:<escape>Tokyo<escape>,unit:<escape>celsius<escape>}<end_function_call>

The conversion handles edge cases: multi-tool calls (multiple <start_function_call> blocks), no-tool responses (plain text), and nested argument values. Examples that can't be cleanly converted are dropped.
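A minimal sketch of such a converter (the function name and structure are my own; the actual script handles more edge cases than shown here):

```python
import json

ESC = "<escape>"

def to_functiongemma(calls_json: str):
    """Convert an xLAM-2 JSON array of tool calls into the
    FunctionGemma control-token format. Returns None when the
    example cannot be cleanly converted (and would be dropped)."""
    try:
        calls = json.loads(calls_json)
    except json.JSONDecodeError:
        return None
    blocks = []
    for call in calls:
        parts = []
        for key, val in call.get("arguments", {}).items():
            # Nested values (dicts/lists) are serialized as JSON text;
            # scalars are wrapped in <escape> markers as-is.
            if isinstance(val, (dict, list)):
                val = json.dumps(val)
            parts.append(f"{key}:{ESC}{val}{ESC}")
        blocks.append(
            f"<start_function_call>call:{call['name']}"
            "{" + ",".join(parts) + "}<end_function_call>"
        )
    return "".join(blocks)
```

Multi-tool examples simply produce multiple concatenated `<start_function_call>` blocks.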


Training Configuration

Base Model

unsloth/functiongemma-270m-it — Gemma 3 architecture, 270M parameters, instruction-tuned. This is the smallest model in the Gemma family with function-calling support.

LoRA Configuration

Rather than full fine-tuning (which would require retraining all 270M parameters), I used LoRA (Low-Rank Adaptation) to train a lightweight adapter that modifies the model's attention and feed-forward layers:

| Parameter | Value |
| --- | --- |
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | ~2.4M (< 1% of total) |
| Task type | CAUSAL_LM |

Infrastructure

| Resource | Detail |
| --- | --- |
| GPU | NVIDIA H100 SXM 80GB |
| Provider | vast.ai (cloud GPU rental) |
| Training time | 25 minutes |
| Epochs | 3 |
| Batch size | 8 (effective 16 with gradient accumulation 2) |
| Learning rate | 2e-4 |
| Max sequence length | 1,024 tokens |
| Framework | HuggingFace TRL (SFTTrainer) + PEFT |
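The hyperparameters above map onto a TRL/PEFT setup roughly like this. This is a configuration sketch, not the project's actual training script: the dataset variable and output path are placeholders, and some argument names (e.g. the sequence-length field) vary between TRL versions.

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

train_cfg = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    learning_rate=2e-4,
    max_seq_length=1024,             # named max_length in newer TRL releases
    output_dir="./checkpoints",      # placeholder path
)

trainer = SFTTrainer(
    model="unsloth/functiongemma-270m-it",
    args=train_cfg,
    train_dataset=dataset,           # placeholder: the ~13K converted examples
    peft_config=lora_cfg,
)
trainer.train()
```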

Training Metrics

| Metric | Value |
| --- | --- |
| Final train loss | 0.6503 |
| Final eval loss | 0.6921 |
| Token accuracy | 85.7% |
| Throughput | 24 samples/sec |
| Best checkpoint | Step 2,000 (of 2,196) |

Benchmark Results

I benchmarked both the base and fine-tuned models using two complementary approaches.

lm-evaluation-harness (Standardized Benchmark)

Using EleutherAI's lm-evaluation-harness with custom task definitions for function calling:

[Chart: benchmark comparison, base vs. fine-tuned]

The fine-tuned model nearly doubles tool selection accuracy and achieves 88% first-tool accuracy — meaning when it picks a tool, it picks the right one almost 9 out of 10 times.

End-to-End Through the Tool Agent

The real test: running both models through the actual tool agent pipeline with real tool schemas and natural language queries.

| Query | Base (270M) | Fine-tuned v1 |
| --- | --- | --- |
| "What's the weather in Tokyo?" | No output | get_weather(city="Tokyo") |
| "Search for latest news about AI" | No output | search_web(query="artificial intelligence") |
| "Send email to john@example.com..." | No output | send_email(to="john@example.com", subject="Meeting", body="See you at 3pm") |
| "What is 234 * 567 + 89?" | No output | search_web (wrong; expected calculate) |
| "Remind me to call dentist at 9am" | No output | send_email (wrong; expected set_reminder) |
| "Weather in Paris in fahrenheit?" | No output | get_weather(city="Paris") |
| "Tell me a joke" | No output (correct) | search_web (wrong; should decline) |

| Metric | Base | Fine-tuned | Delta |
| --- | --- | --- | --- |
| Tool Selection Accuracy | 1/7 (14%) | 4/7 (57%) | +43 pts |
| Avg inference time | 2.21s | 2.72s | +0.51s |

The base model produced zero valid tool calls. It never generated the <start_function_call> tokens at all. The fine-tuned model correctly selects tools for weather, search, and email queries, with properly extracted arguments (city names, email addresses, subjects).


Tool Agent Architecture

The fine-tuned model runs inside a multi-protocol tool agent server that exposes four connectivity options:

[Diagram: tool agent architecture]

Protocol Support

| Protocol | Endpoint | Use Case |
| --- | --- | --- |
| REST API | /tools, /route, /execute | Standard HTTP integration |
| WebSocket | /ws (JSON-RPC 2.0) | Streaming, real-time clients |
| MCP | /mcp | Model Context Protocol: tool discovery for AI agents |
| A2A | /a2a, /.well-known/agent-card.json | Google's Agent-to-Agent protocol |
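For the WebSocket endpoint, clients speak JSON-RPC 2.0. A sketch of what a request envelope might look like; the method and parameter names here are my assumptions, not the agent's documented API (only the `jsonrpc`/`id`/`method`/`params` envelope is fixed by the spec):

```python
import json

def make_route_request(query: str, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 request asking the agent to route a
    natural-language query to a tool. 'route' and 'query' are
    illustrative names."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "route",
        "params": {"query": query},
    }
    return json.dumps(payload)

msg = make_route_request("What's the weather in Tokyo?")
```

The serialized `msg` would then be sent over the `/ws` connection with any WebSocket client.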

Loading the Fine-tuned Model

The TransformersBackend automatically detects LoRA adapters by looking for adapter_config.json in the model directory. It reads the base model path from the config, loads it, and merges the adapter:

import json
from pathlib import Path

from peft import PeftModel
from transformers import AutoModelForCausalLM

model_dir = Path(model_path)
adapter_cfg = model_dir / "adapter_config.json"
if adapter_cfg.exists():
    cfg = json.loads(adapter_cfg.read_text())
    base_model = cfg["base_model_name_or_path"]
    # Load base model
    model = AutoModelForCausalLM.from_pretrained(base_model)
    # Merge LoRA adapter
    model = PeftModel.from_pretrained(model, str(model_dir))
    model = model.merge_and_unload()

It also detects FunctionGemma models and switches to the legacy prompt format with Gemma 3 turn markers instead of the standard ChatML template:

def _build_prompt(self, user_message, tools):
    tools_text = format_tools(tools)
    return (
        f"<start_of_turn>user\n"
        f"You are a model that can do function calling "
        f"with the following functions\n\n{tools_text}\n\n"
        f"{user_message}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )
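The `format_tools` helper isn't shown above. A plausible minimal version (my assumption, not the project's actual code) emits one JSON object per line, matching the prompt example at the top of the post:

```python
import json

def format_tools(tools: list) -> str:
    # One JSON schema per line, e.g.
    # {"name": "get_weather", "parameters": {"city": "string"}}
    return "\n".join(json.dumps(tool) for tool in tools)
```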

The FunctionCall.parse() method handles both the legacy FunctionGemma control tokens and standard JSON arrays, so the same router works regardless of which backend is active:

# Legacy format
<start_function_call>call:get_weather{city:<escape>Tokyo<escape>}<end_function_call>

# JSON format (xLAM-2, Gemini, etc.)
[{"name": "get_weather", "arguments": {"city": "Tokyo"}}]
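A sketch of how such a dual-format parser can work. The regexes and function name are illustrative, not the project's actual `FunctionCall.parse()` implementation:

```python
import json
import re

# Matches <start_function_call>call:name{...}<end_function_call>
CALL_RE = re.compile(
    r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>",
    re.DOTALL,
)
# Matches key:<escape>value<escape> pairs inside the braces
ARG_RE = re.compile(r"(\w+):<escape>(.*?)<escape>")

def parse_calls(text: str) -> list:
    """Parse legacy FunctionGemma control tokens, falling back to
    a plain JSON array of {"name", "arguments"} objects."""
    matches = CALL_RE.findall(text)
    if matches:
        return [
            {"name": name,
             "arguments": dict(ARG_RE.findall(body))}
            for name, body in matches
        ]
    try:
        data = json.loads(text)
        if isinstance(data, list):
            return data
    except json.JSONDecodeError:
        pass
    return []  # no tool call found
```

Because both branches normalize to the same `{"name", "arguments"}` shape, the router downstream never needs to know which backend produced the output.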

Running It

# With the fine-tuned model
TOOL_AGENT_BACKEND=transformers \
TOOL_AGENT_MODEL=./models/finetuned \
python -m agent.server

# With Gemini API (for comparison)
TOOL_AGENT_BACKEND=gemini \
GEMINI_API_KEY="..." \
python -m agent.server

Testing with Firecrawl MCP

To stress-test the model, I connected it to Firecrawl's MCP server, which exposes 12 web scraping tools (scrape, crawl, search, extract, browser sessions, etc.). Combined with the built-in HTTP tools, the model had to choose from 14 tools total.

The 270M model struggled here. When asked to "Scrape https://example.com," it selected firecrawl_browser_create instead of firecrawl_scrape and passed wrong argument types. For search and crawl queries, it gave up entirely.

This isn't surprising: the model has never seen these specific tool schemas during training, and at 270M parameters, it doesn't have enough capacity to generalize from "I know how to call get_weather" to "I can figure out which of 14 complex tools with nested parameters to use."

The same queries work perfectly when routed through Gemini 2.5 Flash Lite via the same agent — confirming the architecture is solid and only the model size is the bottleneck.


What I Learned

Fine-tuning works, even at 270M parameters. The base model produced zero valid tool calls. After 25 minutes of LoRA training on an H100, it correctly selects tools 57% of the time on unseen schemas. The lm-evaluation-harness benchmarks show even stronger results: +29% tool selection, +39% first-tool accuracy.

The FunctionGemma format is viable. The control-token approach (<start_function_call>call:fn{key:<escape>val<escape>}) is clean, parseable, and distinct enough from natural text that the model rarely produces false positives. The FunctionCall.parse() handler processes both legacy and JSON formats transparently.

270M is too small for complex registries. With 5 simple tools, the model does well. With 14 tools (including Firecrawl's complex schemas with nested objects and optional parameters), it falls apart. A 3B+ model fine-tuned on the same data would likely handle it.

LoRA adds negligible overhead. The adapter merge happens once at load time (+0.5s). Inference speed is identical to the base model after merging — there's no runtime cost.

Argument extraction is the model's strength. When it picks the right tool, it extracts arguments accurately: city names from weather queries, email addresses from send requests, search terms from search queries. The training data clearly teaches argument mapping well.


Next Steps

  • Scale up: Fine-tune a 3B or 8B Gemma model on the same dataset to handle complex tool registries
  • GGUF export: Convert the adapter to GGUF format for Ollama deployment (currently only supported via Transformers backend)
  • Domain-specific training: Add n8n workflow tools and custom API schemas to the training data
  • Multi-turn support: Extend training to handle tool results and follow-up calls

Written by Sumit Agrawal

Software Engineer & Technical Writer specializing in full-stack development, cloud architecture, and AI integration.
