Fine-Tuning FunctionGemma 270M for Tool Calling
TL;DR - Key Takeaways

  • Google's FunctionGemma 270M is a tiny Gemma 3 model designed for function calling — but the base version produces almost zero valid tool calls on unseen schemas
  • LoRA fine-tuning on ~13K general function-calling examples (Salesforce xLAM-60k + MadeAgents irrelevance data) took 25 minutes on an H100 GPU via vast.ai
  • Benchmarked with lm-evaluation-harness: +29% tool selection accuracy, +39% first tool accuracy, +20% parameter accuracy
  • End-to-end through a multi-protocol tool agent: 14% → 57% tool selection on a 7-query evaluation
  • The fine-tuned model is published on HuggingFace and the agent code is on GitHub
  • At 270M parameters, the model is strong on simple tool schemas but struggles with 14+ complex tools — a 3B+ model would be the next step

The Problem: Small Models Can't Call Tools

Function calling — where a language model decides which API or tool to invoke and with what arguments — is one of the most practical capabilities for AI agents. Models like GPT-4, Claude, and Gemini handle it natively. But what about models small enough to run on a phone, a Raspberry Pi, or an edge device?

Google released FunctionGemma, a 270M parameter variant of Gemma 3, specifically designed for function calling. It uses a unique control-token format:

<start_of_turn>user
You are a model that can do function calling with the following functions

{"name": "get_weather", "parameters": {"city": "string"}}

What's the weather in Tokyo?<end_of_turn>
<start_of_turn>model
<start_function_call>call:get_weather{city:<escape>Tokyo<escape>}<end_function_call><end_of_turn>

The idea is compelling — a model small enough to run anywhere, structured enough to call tools reliably. But there's a catch: the base model barely works on unseen tool schemas. In my testing, it produced zero valid function calls for common queries like "What's the weather?" or "Send an email."

So I decided to fine-tune it.


Training Pipeline

The full pipeline from raw data to published model looks like this:

[Diagram: training pipeline from raw data to published model]

Datasets

I combined two public datasets from HuggingFace, filtered to general-purpose function calling:

| Dataset | Source | Sampled | Purpose |
| --- | --- | --- | --- |
| Salesforce/xlam-function-calling-60k | HuggingFace | 10,000 | General function calling across diverse tools |
| MadeAgents/xlam-irrelevance-7.5k | HuggingFace | 3,000 | Negative examples: queries where no tool is applicable |

The combined dataset totals ~13,000 examples spanning categories like ToolBench, xLAM-60k, OpenFunctions, and irrelevance/refusal samples.

Format Conversion

The datasets ship in xLAM-2's ChatML format (JSON arrays of tool calls). FunctionGemma expects a completely different format with Gemma 3 turn markers and special control tokens. A conversion script transforms each example:

xLAM-2 input:

[{"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "celsius"}}]

FunctionGemma output:

<start_function_call>call:get_weather{city:<escape>Tokyo<escape>,unit:<escape>celsius<escape>}<end_function_call>

The conversion handles edge cases: multi-tool calls (multiple <start_function_call> blocks), no-tool responses (plain text), and nested argument values. Examples that can't be cleanly converted are dropped.
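A minimal sketch of such a converter (the function name and structure are my own; the actual script handles more edge cases than shown here):

```python
import json

ESC = "<escape>"

def to_functiongemma(calls_json: str):
    """Convert an xLAM-2 JSON array of tool calls into the
    FunctionGemma control-token format. Returns None when the
    example cannot be cleanly converted (and would be dropped)."""
    try:
        calls = json.loads(calls_json)
    except json.JSONDecodeError:
        return None
    blocks = []
    for call in calls:
        parts = []
        for key, val in call.get("arguments", {}).items():
            # Nested values (dicts/lists) are serialized as JSON text;
            # scalars are wrapped in <escape> markers as-is.
            if isinstance(val, (dict, list)):
                val = json.dumps(val)
            parts.append(f"{key}:{ESC}{val}{ESC}")
        blocks.append(
            f"<start_function_call>call:{call['name']}"
            "{" + ",".join(parts) + "}<end_function_call>"
        )
    return "".join(blocks)
```

Multi-tool examples simply produce multiple concatenated `<start_function_call>` blocks.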


Training Configuration

Base Model

unsloth/functiongemma-270m-it — Gemma 3 architecture, 270M parameters, instruction-tuned. This is the smallest model in the Gemma family with function-calling support.

LoRA Configuration

Rather than full fine-tuning (which would require retraining all 270M parameters), I used LoRA (Low-Rank Adaptation) to train a lightweight adapter that modifies the model's attention and feed-forward layers:

| Parameter | Value |
| --- | --- |
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | ~2.4M (< 1% of total) |
| Task type | CAUSAL_LM |

Infrastructure

| Resource | Detail |
| --- | --- |
| GPU | NVIDIA H100 SXM 80GB |
| Provider | vast.ai (cloud GPU rental) |
| Training time | 25 minutes |
| Epochs | 3 |
| Batch size | 8 (effective 16 with gradient accumulation 2) |
| Learning rate | 2e-4 |
| Max sequence length | 1,024 tokens |
| Framework | HuggingFace TRL (SFTTrainer) + PEFT |
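The hyperparameters above map onto a TRL/PEFT setup roughly like this. This is a configuration sketch, not the project's actual training script: the dataset variable and output path are placeholders, and some argument names (e.g. the sequence-length field) vary between TRL versions.

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

train_cfg = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    learning_rate=2e-4,
    max_seq_length=1024,             # named max_length in newer TRL releases
    output_dir="./checkpoints",      # placeholder path
)

trainer = SFTTrainer(
    model="unsloth/functiongemma-270m-it",
    args=train_cfg,
    train_dataset=dataset,           # placeholder: the ~13K converted examples
    peft_config=lora_cfg,
)
trainer.train()
```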

Training Metrics

| Metric | Value |
| --- | --- |
| Final train loss | 0.6503 |
| Final eval loss | 0.6921 |
| Token accuracy | 85.7% |
| Throughput | 24 samples/sec |
| Best checkpoint | Step 2,000 (of 2,196) |

Benchmark Results

I benchmarked both the base and fine-tuned models using two complementary approaches.

lm-evaluation-harness (Standardized Benchmark)

Using EleutherAI's lm-evaluation-harness with custom task definitions for function calling:

[Chart: benchmark comparison, base vs. fine-tuned]

The fine-tuned model nearly doubles tool selection accuracy and achieves 88% first-tool accuracy — meaning when it picks a tool, it picks the right one almost 9 out of 10 times.

End-to-End Through the Tool Agent

The real test: running both models through the actual tool agent pipeline with real tool schemas and natural language queries.

| Query | Base (270M) | Fine-tuned v1 |
| --- | --- | --- |
| "What's the weather in Tokyo?" | No output | get_weather(city="Tokyo") |
| "Search for latest news about AI" | No output | search_web(query="artificial intelligence") |
| "Send email to john@example.com..." | No output | send_email(to="john@example.com", subject="Meeting", body="See you at 3pm") |
| "What is 234 * 567 + 89?" | No output | search_web (wrong; expected calculate) |
| "Remind me to call dentist at 9am" | No output | send_email (wrong; expected set_reminder) |
| "Weather in Paris in fahrenheit?" | No output | get_weather(city="Paris") |
| "Tell me a joke" | No output (correct) | search_web (wrong; should decline) |

| Metric | Base | Fine-tuned | Delta |
| --- | --- | --- | --- |
| Tool Selection Accuracy | 1/7 (14%) | 4/7 (57%) | +43 pts |
| Avg inference time | 2.21s | 2.72s | +0.51s |

The base model produced zero valid tool calls. It never generated the <start_function_call> tokens at all. The fine-tuned model correctly selects tools for weather, search, and email queries, with properly extracted arguments (city names, email addresses, subjects).


Tool Agent Architecture

The fine-tuned model runs inside a multi-protocol tool agent server that exposes four connectivity options:

[Diagram: tool agent architecture]

Protocol Support

| Protocol | Endpoint | Use Case |
| --- | --- | --- |
| REST API | /tools, /route, /execute | Standard HTTP integration |
| WebSocket | /ws (JSON-RPC 2.0) | Streaming, real-time clients |
| MCP | /mcp | Model Context Protocol: tool discovery for AI agents |
| A2A | /a2a, /.well-known/agent-card.json | Google's Agent-to-Agent protocol |
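For the WebSocket endpoint, clients speak JSON-RPC 2.0. A sketch of what a request envelope might look like; the method and parameter names here are my assumptions, not the agent's documented API (only the `jsonrpc`/`id`/`method`/`params` envelope is fixed by the spec):

```python
import json

def make_route_request(query: str, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 request asking the agent to route a
    natural-language query to a tool. 'route' and 'query' are
    illustrative names."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "route",
        "params": {"query": query},
    }
    return json.dumps(payload)

msg = make_route_request("What's the weather in Tokyo?")
```

The serialized `msg` would then be sent over the `/ws` connection with any WebSocket client.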

Loading the Fine-tuned Model

The TransformersBackend automatically detects LoRA adapters by looking for adapter_config.json in the model directory. It reads the base model path from the config, loads it, and merges the adapter:

import json
from pathlib import Path

from peft import PeftModel
from transformers import AutoModelForCausalLM

model_dir = Path(model_path)
adapter_cfg = model_dir / "adapter_config.json"
if adapter_cfg.exists():
    cfg = json.loads(adapter_cfg.read_text())
    base_model = cfg["base_model_name_or_path"]
    # Load base model
    model = AutoModelForCausalLM.from_pretrained(base_model)
    # Merge LoRA adapter
    model = PeftModel.from_pretrained(model, str(model_dir))
    model = model.merge_and_unload()

It also detects FunctionGemma models and switches to the legacy prompt format with Gemma 3 turn markers instead of the standard ChatML template:

def _build_prompt(self, user_message, tools):
    tools_text = format_tools(tools)
    return (
        f"<start_of_turn>user\n"
        f"You are a model that can do function calling "
        f"with the following functions\n\n{tools_text}\n\n"
        f"{user_message}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )
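The `format_tools` helper isn't shown above. A plausible minimal version (my assumption, not the project's actual code) emits one JSON object per line, matching the prompt example at the top of the post:

```python
import json

def format_tools(tools: list) -> str:
    # One JSON schema per line, e.g.
    # {"name": "get_weather", "parameters": {"city": "string"}}
    return "\n".join(json.dumps(tool) for tool in tools)
```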

The FunctionCall.parse() method handles both the legacy FunctionGemma control tokens and standard JSON arrays, so the same router works regardless of which backend is active:

# Legacy format
<start_function_call>call:get_weather{city:<escape>Tokyo<escape>}<end_function_call>

# JSON format (xLAM-2, Gemini, etc.)
[{"name": "get_weather", "arguments": {"city": "Tokyo"}}]
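A sketch of how such a dual-format parser can work. The regexes and function name are illustrative, not the project's actual `FunctionCall.parse()` implementation:

```python
import json
import re

# Matches <start_function_call>call:name{...}<end_function_call>
CALL_RE = re.compile(
    r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>",
    re.DOTALL,
)
# Matches key:<escape>value<escape> pairs inside the braces
ARG_RE = re.compile(r"(\w+):<escape>(.*?)<escape>")

def parse_calls(text: str) -> list:
    """Parse legacy FunctionGemma control tokens, falling back to
    a plain JSON array of {"name", "arguments"} objects."""
    matches = CALL_RE.findall(text)
    if matches:
        return [
            {"name": name,
             "arguments": dict(ARG_RE.findall(body))}
            for name, body in matches
        ]
    try:
        data = json.loads(text)
        if isinstance(data, list):
            return data
    except json.JSONDecodeError:
        pass
    return []  # no tool call found
```

Because both branches normalize to the same `{"name", "arguments"}` shape, the router downstream never needs to know which backend produced the output.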

Running It

# With the fine-tuned model
TOOL_AGENT_BACKEND=transformers \
TOOL_AGENT_MODEL=./models/finetuned \
python -m agent.server

# With Gemini API (for comparison)
TOOL_AGENT_BACKEND=gemini \
GEMINI_API_KEY="..." \
python -m agent.server

Testing with Firecrawl MCP

To stress-test the model, I connected it to Firecrawl's MCP server, which exposes 12 web scraping tools (scrape, crawl, search, extract, browser sessions, etc.). Combined with the built-in HTTP tools, the model had to choose from 14 tools total.

The 270M model struggled here. When asked to "Scrape https://example.com," it selected firecrawl_browser_create instead of firecrawl_scrape and passed wrong argument types. For search and crawl queries, it gave up entirely.

This isn't surprising: the model has never seen these specific tool schemas during training, and at 270M parameters, it doesn't have enough capacity to generalize from "I know how to call get_weather" to "I can figure out which of 14 complex tools with nested parameters to use."

The same queries work perfectly when routed through Gemini 2.5 Flash Lite via the same agent — confirming the architecture is solid and only the model size is the bottleneck.


What I Learned

Fine-tuning works, even at 270M parameters. The base model produced zero valid tool calls. After 25 minutes of LoRA training on an H100, it correctly selects tools 57% of the time on unseen schemas. The lm-evaluation-harness benchmarks show even stronger results: +29% tool selection, +39% first-tool accuracy.

The FunctionGemma format is viable. The control-token approach (<start_function_call>call:fn{key:<escape>val<escape>}) is clean, parseable, and distinct enough from natural text that the model rarely produces false positives. The FunctionCall.parse() handler processes both legacy and JSON formats transparently.

270M is too small for complex registries. With 5 simple tools, the model does well. With 14 tools (including Firecrawl's complex schemas with nested objects and optional parameters), it falls apart. A 3B+ model fine-tuned on the same data would likely handle it.

LoRA adds negligible overhead. The adapter merge happens once at load time (+0.5s). Inference speed is identical to the base model after merging — there's no runtime cost.

Argument extraction is the model's strength. When it picks the right tool, it extracts arguments accurately: city names from weather queries, email addresses from send requests, search terms from search queries. The training data clearly teaches argument mapping well.


Next Steps

  • Scale up: Fine-tune a 3B or 8B Gemma model on the same dataset to handle complex tool registries
  • GGUF export: Convert the adapter to GGUF format for Ollama deployment (currently only supported via Transformers backend)
  • Domain-specific training: Add n8n workflow tools and custom API schemas to the training data
  • Multi-turn support: Extend training to handle tool results and follow-up calls

Written by Sumit Agrawal

Software Engineer & Technical Writer specializing in full-stack development, cloud architecture, and AI integration.
