---
title: "Fine-Tuning FunctionGemma 270M for Tool Calling"
date: 2026-03-10T12:00:00.000Z
description: "How I fine-tuned Google's 270M parameter FunctionGemma model with LoRA on an H100 in 25 minutes, improving tool selection accuracy by 29% and deploying it through a multi-protocol agent server."
tags: ["ai", "fine-tuning", "function-calling", "llm", "lora", "gemma", "tool-agent", "huggingface", "machine-learning"]
tokens: 2274
content-signal: search=yes, ai-input=yes, ai-train=no
---


![Fine-Tuning FunctionGemma 270M for Tool Calling](/images/posts/finetuning-functiongemma-270m-tool-calling/hero.png)

## TL;DR - Key Takeaways

- Google's **FunctionGemma 270M** is a tiny Gemma 3 model designed for function calling — but the base version produces almost zero valid tool calls on unseen schemas
- **LoRA fine-tuning** on ~13K general function-calling examples (Salesforce xLAM-60k + MadeAgents irrelevance data) took **25 minutes on an H100 GPU** via vast.ai
- Benchmarked with `lm-evaluation-harness`: **+29% tool selection accuracy**, **+39% first tool accuracy**, **+20% parameter accuracy**
- End-to-end through a multi-protocol tool agent: **14% → 57% tool selection** on a 7-query evaluation
- The fine-tuned model is published on [HuggingFace](https://huggingface.co/sumitagrawal/functiongemma-270m-tool-agent) and the agent code is on [GitHub](https://github.com/tech-sumit/tool-agent)
- At 270M parameters, the model is strong on simple tool schemas but struggles with 14+ complex tools — a 3B+ model would be the next step

---

## The Problem: Small Models Can't Call Tools

Function calling — where a language model decides which API or tool to invoke and with what arguments — is one of the most practical capabilities for AI agents. Models like GPT-4, Claude, and Gemini handle it natively. But what about models small enough to run on a phone, a Raspberry Pi, or an edge device?

Google released **FunctionGemma**, a 270M parameter variant of Gemma 3, specifically designed for function calling. It uses a unique control-token format:

```
<start_of_turn>user
You are a model that can do function calling with the following functions

{"name": "get_weather", "parameters": {"city": "string"}}

What's the weather in Tokyo?<end_of_turn>
<start_of_turn>model
<start_function_call>call:get_weather{city:<escape>Tokyo<escape>}<end_function_call><end_of_turn>
```

The idea is compelling — a model small enough to run anywhere, structured enough to call tools reliably. But there's a catch: **the base model barely works on unseen tool schemas**. In my testing, it produced zero valid function calls for common queries like "What's the weather?" or "Send an email."

So I decided to fine-tune it.

---

## Training Pipeline

The full pipeline from raw data to published model looks like this:

![Training Pipeline](/images/posts/finetuning-functiongemma-270m-tool-calling/training-pipeline.png)

### Datasets

I combined two public datasets from HuggingFace, filtered to general-purpose function calling:

| Dataset | Source | Sampled | Purpose |
|---------|--------|---------|---------|
| Salesforce/xlam-function-calling-60k | [HuggingFace](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | 10,000 | General function calling across diverse tools |
| MadeAgents/xlam-irrelevance-7.5k | [HuggingFace](https://huggingface.co/datasets/MadeAgents/xlam-irrelevance-7.5k) | 3,000 | Negative examples — queries where no tool is applicable |

The combined dataset totals **~13,000 examples** spanning categories like ToolBench, xLAM-60k, OpenFunctions, and irrelevance/refusal samples.

### Format Conversion

The datasets ship in xLAM-2's ChatML format (JSON arrays of tool calls). FunctionGemma expects a completely different format with Gemma 3 turn markers and special control tokens. A conversion script transforms each example:

**xLAM-2 input:**

```json
[{"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "celsius"}}]
```

**FunctionGemma output:**

```
<start_function_call>call:get_weather{city:<escape>Tokyo<escape>,unit:<escape>celsius<escape>}<end_function_call>
```

The conversion handles edge cases: multi-tool calls (multiple `<start_function_call>` blocks), no-tool responses (plain text), and nested argument values. Examples that can't be cleanly converted are dropped.

---

## Training Configuration

### Base Model

**[unsloth/functiongemma-270m-it](https://huggingface.co/unsloth/functiongemma-270m-it)** — Gemma 3 architecture, 270M parameters, instruction-tuned. This is the smallest model in the Gemma family with function-calling support.

### LoRA Configuration

Rather than full fine-tuning (which would require retraining all 270M parameters), I used **[LoRA](https://arxiv.org/abs/2106.09685)** (Low-Rank Adaptation) to train a lightweight adapter that modifies the model's attention and feed-forward layers:

| Parameter | Value |
|-----------|-------|
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Trainable parameters | ~2.4M (< 1% of total) |
| Task type | CAUSAL_LM |

### Infrastructure

| Resource | Detail |
|----------|--------|
| GPU | NVIDIA H100 SXM 80GB |
| Provider | [vast.ai](https://vast.ai) (cloud GPU rental) |
| Training time | **25 minutes** |
| Epochs | 3 |
| Batch size | 8 (effective 16 with gradient accumulation 2) |
| Learning rate | 2e-4 |
| Max sequence length | 1,024 tokens |
| Framework | HuggingFace [TRL](https://huggingface.co/docs/trl) (SFTTrainer) + [PEFT](https://huggingface.co/docs/peft) |

### Training Metrics

| Metric | Value |
|--------|-------|
| Final train loss | 0.6503 |
| Final eval loss | 0.6921 |
| Token accuracy | 85.7% |
| Throughput | 24 samples/sec |
| Best checkpoint | Step 2,000 (of 2,196) |

---

## Benchmark Results

I benchmarked both the base and fine-tuned models using two complementary approaches.

### lm-evaluation-harness (Standardized Benchmark)

Using EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with custom task definitions for function calling:

![Benchmark Comparison](/images/posts/finetuning-functiongemma-270m-tool-calling/benchmark-comparison.png)

The fine-tuned model nearly doubles tool selection accuracy and achieves **88% first-tool accuracy** — meaning when it picks a tool, it picks the right one almost 9 out of 10 times.

### End-to-End Through the Tool Agent

The real test: running both models through the actual tool agent pipeline with real tool schemas and natural language queries.

| Query | Base (270M) | Fine-tuned v1 |
|-------|------------|---------------|
| "What's the weather in Tokyo?" | No output | `get_weather(city="Tokyo")` |
| "Search for latest news about AI" | No output | `search_web(query="artificial intelligence")` |
| "Send email to john@example.com..." | No output | `send_email(to="john@example.com", subject="Meeting", body="See you at 3pm")` |
| "What is 234 * 567 + 89?" | No output | `search_web` (wrong — expected `calculate`) |
| "Remind me to call dentist at 9am" | No output | `send_email` (wrong — expected `set_reminder`) |
| "Weather in Paris in fahrenheit?" | No output | `get_weather(city="Paris")` |
| "Tell me a joke" | No output (correct) | `search_web` (wrong — should decline) |

| Metric | Base | Fine-tuned | Delta |
|--------|------|-----------|-------|
| Tool Selection Accuracy | 1/7 (14%) | 4/7 (57%) | **+43%** |
| Avg inference time | 2.21s | 2.72s | +0.51s |

The base model produced **zero valid tool calls**. It never generated the `<start_function_call>` tokens at all. The fine-tuned model correctly selects tools for weather, search, and email queries, with properly extracted arguments (city names, email addresses, subjects).

---

## Tool Agent Architecture

The fine-tuned model runs inside a multi-protocol tool agent server that exposes four connectivity options:

![Tool Agent Architecture](/images/posts/finetuning-functiongemma-270m-tool-calling/architecture.png)

### Protocol Support

| Protocol | Endpoint | Use Case |
|----------|----------|----------|
| REST API | `/tools`, `/route`, `/execute` | Standard HTTP integration |
| WebSocket | `/ws` (JSON-RPC 2.0) | Streaming, real-time clients |
| MCP | `/mcp` | Model Context Protocol — tool discovery for AI agents |
| A2A | `/a2a`, `/.well-known/agent-card.json` | Google's Agent-to-Agent protocol |

### Loading the Fine-tuned Model

The `TransformersBackend` automatically detects LoRA adapters by looking for `adapter_config.json` in the model directory. It reads the base model path from the config, loads it, and merges the adapter:

```python
model_dir = Path(model_path)
adapter_cfg = model_dir / "adapter_config.json"
if adapter_cfg.exists():
    cfg = json.loads(adapter_cfg.read_text())
    base_model = cfg["base_model_name_or_path"]
    # Load base model
    model = AutoModelForCausalLM.from_pretrained(base_model)
    # Merge LoRA adapter
    model = PeftModel.from_pretrained(model, str(model_dir))
    model = model.merge_and_unload()
```

It also detects FunctionGemma models and switches to the legacy prompt format with Gemma 3 turn markers instead of the standard ChatML template:

```python
def _build_prompt(self, user_message, tools):
    tools_text = format_tools(tools)
    return (
        f"<start_of_turn>user\n"
        f"You are a model that can do function calling "
        f"with the following functions\n\n{tools_text}\n\n"
        f"{user_message}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )
```

The `FunctionCall.parse()` method handles both the legacy FunctionGemma control tokens and standard JSON arrays, so the same router works regardless of which backend is active:

```python
# Legacy format
<start_function_call>call:get_weather{city:<escape>Tokyo<escape>}<end_function_call>

# JSON format (xLAM-2, Gemini, etc.)
[{"name": "get_weather", "arguments": {"city": "Tokyo"}}]
```

### Running It

```bash
# With the fine-tuned model
TOOL_AGENT_BACKEND=transformers \
TOOL_AGENT_MODEL=./models/finetuned \
python -m agent.server

# With Gemini API (for comparison)
TOOL_AGENT_BACKEND=gemini \
GEMINI_API_KEY="..." \
python -m agent.server
```

---

## Testing with Firecrawl MCP

To stress-test the model, I connected it to [Firecrawl](https://firecrawl.dev)'s MCP server, which exposes 12 web scraping tools (scrape, crawl, search, extract, browser sessions, etc.). Combined with the built-in HTTP tools, the model had to choose from 14 tools total.

The 270M model struggled here. When asked to "Scrape https://example.com," it selected `firecrawl_browser_create` instead of `firecrawl_scrape` and passed wrong argument types. For search and crawl queries, it gave up entirely.

This isn't surprising: the model has never seen these specific tool schemas during training, and at 270M parameters, it doesn't have enough capacity to generalize from "I know how to call `get_weather`" to "I can figure out which of 14 complex tools with nested parameters to use."

The same queries work perfectly when routed through Gemini 2.5 Flash Lite via the same agent — confirming the architecture is solid and only the model size is the bottleneck.

---

## What I Learned

**Fine-tuning works, even at 270M parameters.** The base model produced zero valid tool calls. After 25 minutes of LoRA training on an H100, it correctly selects tools 57% of the time on unseen schemas. The lm-evaluation-harness benchmarks show even stronger results: +29% tool selection, +39% first-tool accuracy.

**The FunctionGemma format is viable.** The control-token approach (`<start_function_call>call:fn{key:<escape>val<escape>}`) is clean, parseable, and distinct enough from natural text that the model rarely produces false positives. The `FunctionCall.parse()` handler processes both legacy and JSON formats transparently.

**270M is too small for complex registries.** With 5 simple tools, the model does well. With 14 tools (including Firecrawl's complex schemas with nested objects and optional parameters), it falls apart. A 3B+ model fine-tuned on the same data would likely handle it.

**LoRA adds negligible overhead.** The adapter merge happens once at load time (+0.5s). Inference speed is identical to the base model after merging — there's no runtime cost.

**Argument extraction is the model's strength.** When it picks the right tool, it extracts arguments accurately: city names from weather queries, email addresses from send requests, search terms from search queries. The training data clearly teaches argument mapping well.

---

## Next Steps

- **Scale up**: Fine-tune a 3B or 8B Gemma model on the same dataset to handle complex tool registries
- **GGUF export**: Convert the adapter to GGUF format for Ollama deployment (currently only supported via Transformers backend)
- **Domain-specific training**: Add n8n workflow tools and custom API schemas to the training data
- **Multi-turn support**: Extend training to handle tool results and follow-up calls

---

## Links

### Model & Code

- **Fine-tuned model (HuggingFace)**: [sumitagrawal/functiongemma-270m-tool-agent](https://huggingface.co/sumitagrawal/functiongemma-270m-tool-agent)
- **Tool Agent source code (GitHub)**: [tech-sumit/tool-agent](https://github.com/tech-sumit/tool-agent)
- **Base model**: [unsloth/functiongemma-270m-it](https://huggingface.co/unsloth/functiongemma-270m-it)

### Training Data

- [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) — general function calling
- [MadeAgents/xlam-irrelevance-7.5k](https://huggingface.co/datasets/MadeAgents/xlam-irrelevance-7.5k) — negative/refusal examples

### Frameworks & Tools

- [HuggingFace TRL](https://huggingface.co/docs/trl) — SFTTrainer for supervised fine-tuning
- [PEFT](https://huggingface.co/docs/peft) — LoRA adapter training
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) — standardized benchmarking
- [vast.ai](https://vast.ai) — cloud GPU rental (H100)
- [FastAPI](https://fastapi.tiangolo.com) — tool agent server framework
- [Firecrawl](https://firecrawl.dev) — MCP-based web scraping tools
