--- title: "Fine-Tuning FunctionGemma 270M for Tool Calling" date: 2026-03-10T12:00:00.000Z description: "How I fine-tuned Google's 270M parameter FunctionGemma model with LoRA on an H100 in 25 minutes, improving tool selection accuracy by 29% and deploying it through a multi-protocol agent server." tags: ["ai", "fine-tuning", "function-calling", "llm", "lora", "gemma", "tool-agent", "huggingface", "machine-learning"] tokens: 2274 content-signal: search=yes, ai-input=yes, ai-train=no --- ![Fine-Tuning FunctionGemma 270M for Tool Calling](/images/posts/finetuning-functiongemma-270m-tool-calling/hero.png) ## TL;DR - Key Takeaways - Google's **FunctionGemma 270M** is a tiny Gemma 3 model designed for function calling — but the base version produces almost zero valid tool calls on unseen schemas - **LoRA fine-tuning** on ~13K general function-calling examples (Salesforce xLAM-60k + MadeAgents irrelevance data) took **25 minutes on an H100 GPU** via vast.ai - Benchmarked with `lm-evaluation-harness`: **+29% tool selection accuracy**, **+39% first tool accuracy**, **+20% parameter accuracy** - End-to-end through a multi-protocol tool agent: **14% → 57% tool selection** on a 7-query evaluation - The fine-tuned model is published on [HuggingFace](https://huggingface.co/sumitagrawal/functiongemma-270m-tool-agent) and the agent code is on [GitHub](https://github.com/tech-sumit/tool-agent) - At 270M parameters, the model is strong on simple tool schemas but struggles with 14+ complex tools — a 3B+ model would be the next step --- ## The Problem: Small Models Can't Call Tools Function calling — where a language model decides which API or tool to invoke and with what arguments — is one of the most practical capabilities for AI agents. Models like GPT-4, Claude, and Gemini handle it natively. But what about models small enough to run on a phone, a Raspberry Pi, or an edge device? Google released **FunctionGemma**, a 270M parameter variant of Gemma 3, specifically designed for function calling. It uses a unique control-token format: ``` user You are a model that can do function calling with the following functions {"name": "get_weather", "parameters": {"city": "string"}} What's the weather in Tokyo? model call:get_weather{city:Tokyo} ``` The idea is compelling — a model small enough to run anywhere, structured enough to call tools reliably. But there's a catch: **the base model barely works on unseen tool schemas**. In my testing, it produced zero valid function calls for common queries like "What's the weather?" or "Send an email." So I decided to fine-tune it. --- ## Training Pipeline The full pipeline from raw data to published model looks like this: ![Training Pipeline](/images/posts/finetuning-functiongemma-270m-tool-calling/training-pipeline.png) ### Datasets I combined two public datasets from HuggingFace, filtered to general-purpose function calling: | Dataset | Source | Sampled | Purpose | |---------|--------|---------|---------| | Salesforce/xlam-function-calling-60k | [HuggingFace](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | 10,000 | General function calling across diverse tools | | MadeAgents/xlam-irrelevance-7.5k | [HuggingFace](https://huggingface.co/datasets/MadeAgents/xlam-irrelevance-7.5k) | 3,000 | Negative examples — queries where no tool is applicable | The combined dataset totals **~13,000 examples** spanning categories like ToolBench, xLAM-60k, OpenFunctions, and irrelevance/refusal samples. ### Format Conversion The datasets ship in xLAM-2's ChatML format (JSON arrays of tool calls). FunctionGemma expects a completely different format with Gemma 3 turn markers and special control tokens. A conversion script transforms each example: **xLAM-2 input:** ```json [{"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "celsius"}}] ``` **FunctionGemma output:** ``` call:get_weather{city:Tokyo,unit:celsius} ``` The conversion handles edge cases: multi-tool calls (multiple `` blocks), no-tool responses (plain text), and nested argument values. Examples that can't be cleanly converted are dropped. --- ## Training Configuration ### Base Model **[unsloth/functiongemma-270m-it](https://huggingface.co/unsloth/functiongemma-270m-it)** — Gemma 3 architecture, 270M parameters, instruction-tuned. This is the smallest model in the Gemma family with function-calling support. ### LoRA Configuration Rather than full fine-tuning (which would require retraining all 270M parameters), I used **[LoRA](https://arxiv.org/abs/2106.09685)** (Low-Rank Adaptation) to train a lightweight adapter that modifies the model's attention and feed-forward layers: | Parameter | Value | |-----------|-------| | Rank (r) | 16 | | Alpha | 32 | | Dropout | 0.05 | | Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` | | Trainable parameters | ~2.4M (< 1% of total) | | Task type | CAUSAL_LM | ### Infrastructure | Resource | Detail | |----------|--------| | GPU | NVIDIA H100 SXM 80GB | | Provider | [vast.ai](https://vast.ai) (cloud GPU rental) | | Training time | **25 minutes** | | Epochs | 3 | | Batch size | 8 (effective 16 with gradient accumulation 2) | | Learning rate | 2e-4 | | Max sequence length | 1,024 tokens | | Framework | HuggingFace [TRL](https://huggingface.co/docs/trl) (SFTTrainer) + [PEFT](https://huggingface.co/docs/peft) | ### Training Metrics | Metric | Value | |--------|-------| | Final train loss | 0.6503 | | Final eval loss | 0.6921 | | Token accuracy | 85.7% | | Throughput | 24 samples/sec | | Best checkpoint | Step 2,000 (of 2,196) | --- ## Benchmark Results I benchmarked both the base and fine-tuned models using two complementary approaches. ### lm-evaluation-harness (Standardized Benchmark) Using EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with custom task definitions for function calling: ![Benchmark Comparison](/images/posts/finetuning-functiongemma-270m-tool-calling/benchmark-comparison.png) The fine-tuned model nearly doubles tool selection accuracy and achieves **88% first-tool accuracy** — meaning when it picks a tool, it picks the right one almost 9 out of 10 times. ### End-to-End Through the Tool Agent The real test: running both models through the actual tool agent pipeline with real tool schemas and natural language queries. | Query | Base (270M) | Fine-tuned v1 | |-------|------------|---------------| | "What's the weather in Tokyo?" | No output | `get_weather(city="Tokyo")` | | "Search for latest news about AI" | No output | `search_web(query="artificial intelligence")` | | "Send email to john@example.com..." | No output | `send_email(to="john@example.com", subject="Meeting", body="See you at 3pm")` | | "What is 234 * 567 + 89?" | No output | `search_web` (wrong — expected `calculate`) | | "Remind me to call dentist at 9am" | No output | `send_email` (wrong — expected `set_reminder`) | | "Weather in Paris in fahrenheit?" | No output | `get_weather(city="Paris")` | | "Tell me a joke" | No output (correct) | `search_web` (wrong — should decline) | | Metric | Base | Fine-tuned | Delta | |--------|------|-----------|-------| | Tool Selection Accuracy | 1/7 (14%) | 4/7 (57%) | **+43%** | | Avg inference time | 2.21s | 2.72s | +0.51s | The base model produced **zero valid tool calls**. It never generated the `` tokens at all. The fine-tuned model correctly selects tools for weather, search, and email queries, with properly extracted arguments (city names, email addresses, subjects). --- ## Tool Agent Architecture The fine-tuned model runs inside a multi-protocol tool agent server that exposes four connectivity options: ![Tool Agent Architecture](/images/posts/finetuning-functiongemma-270m-tool-calling/architecture.png) ### Protocol Support | Protocol | Endpoint | Use Case | |----------|----------|----------| | REST API | `/tools`, `/route`, `/execute` | Standard HTTP integration | | WebSocket | `/ws` (JSON-RPC 2.0) | Streaming, real-time clients | | MCP | `/mcp` | Model Context Protocol — tool discovery for AI agents | | A2A | `/a2a`, `/.well-known/agent-card.json` | Google's Agent-to-Agent protocol | ### Loading the Fine-tuned Model The `TransformersBackend` automatically detects LoRA adapters by looking for `adapter_config.json` in the model directory. It reads the base model path from the config, loads it, and merges the adapter: ```python model_dir = Path(model_path) adapter_cfg = model_dir / "adapter_config.json" if adapter_cfg.exists(): cfg = json.loads(adapter_cfg.read_text()) base_model = cfg["base_model_name_or_path"] # Load base model model = AutoModelForCausalLM.from_pretrained(base_model) # Merge LoRA adapter model = PeftModel.from_pretrained(model, str(model_dir)) model = model.merge_and_unload() ``` It also detects FunctionGemma models and switches to the legacy prompt format with Gemma 3 turn markers instead of the standard ChatML template: ```python def _build_prompt(self, user_message, tools): tools_text = format_tools(tools) return ( f"user\n" f"You are a model that can do function calling " f"with the following functions\n\n{tools_text}\n\n" f"{user_message}\n" f"model\n" ) ``` The `FunctionCall.parse()` method handles both the legacy FunctionGemma control tokens and standard JSON arrays, so the same router works regardless of which backend is active: ```python # Legacy format call:get_weather{city:Tokyo} # JSON format (xLAM-2, Gemini, etc.) [{"name": "get_weather", "arguments": {"city": "Tokyo"}}] ``` ### Running It ```bash # With the fine-tuned model TOOL_AGENT_BACKEND=transformers \ TOOL_AGENT_MODEL=./models/finetuned \ python -m agent.server # With Gemini API (for comparison) TOOL_AGENT_BACKEND=gemini \ GEMINI_API_KEY="..." \ python -m agent.server ``` --- ## Testing with Firecrawl MCP To stress-test the model, I connected it to [Firecrawl](https://firecrawl.dev)'s MCP server, which exposes 12 web scraping tools (scrape, crawl, search, extract, browser sessions, etc.). Combined with the built-in HTTP tools, the model had to choose from 14 tools total. The 270M model struggled here. When asked to "Scrape https://example.com," it selected `firecrawl_browser_create` instead of `firecrawl_scrape` and passed wrong argument types. For search and crawl queries, it gave up entirely. This isn't surprising: the model has never seen these specific tool schemas during training, and at 270M parameters, it doesn't have enough capacity to generalize from "I know how to call `get_weather`" to "I can figure out which of 14 complex tools with nested parameters to use." The same queries work perfectly when routed through Gemini 2.5 Flash Lite via the same agent — confirming the architecture is solid and only the model size is the bottleneck. --- ## What I Learned **Fine-tuning works, even at 270M parameters.** The base model produced zero valid tool calls. After 25 minutes of LoRA training on an H100, it correctly selects tools 57% of the time on unseen schemas. The lm-evaluation-harness benchmarks show even stronger results: +29% tool selection, +39% first-tool accuracy. **The FunctionGemma format is viable.** The control-token approach (`call:fn{key:val}`) is clean, parseable, and distinct enough from natural text that the model rarely produces false positives. The `FunctionCall.parse()` handler processes both legacy and JSON formats transparently. **270M is too small for complex registries.** With 5 simple tools, the model does well. With 14 tools (including Firecrawl's complex schemas with nested objects and optional parameters), it falls apart. A 3B+ model fine-tuned on the same data would likely handle it. **LoRA adds negligible overhead.** The adapter merge happens once at load time (+0.5s). Inference speed is identical to the base model after merging — there's no runtime cost. **Argument extraction is the model's strength.** When it picks the right tool, it extracts arguments accurately: city names from weather queries, email addresses from send requests, search terms from search queries. The training data clearly teaches argument mapping well. --- ## Next Steps - **Scale up**: Fine-tune a 3B or 8B Gemma model on the same dataset to handle complex tool registries - **GGUF export**: Convert the adapter to GGUF format for Ollama deployment (currently only supported via Transformers backend) - **Domain-specific training**: Add n8n workflow tools and custom API schemas to the training data - **Multi-turn support**: Extend training to handle tool results and follow-up calls --- ## Links ### Model & Code - **Fine-tuned model (HuggingFace)**: [sumitagrawal/functiongemma-270m-tool-agent](https://huggingface.co/sumitagrawal/functiongemma-270m-tool-agent) - **Tool Agent source code (GitHub)**: [tech-sumit/tool-agent](https://github.com/tech-sumit/tool-agent) - **Base model**: [unsloth/functiongemma-270m-it](https://huggingface.co/unsloth/functiongemma-270m-it) ### Training Data - [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) — general function calling - [MadeAgents/xlam-irrelevance-7.5k](https://huggingface.co/datasets/MadeAgents/xlam-irrelevance-7.5k) — negative/refusal examples ### Frameworks & Tools - [HuggingFace TRL](https://huggingface.co/docs/trl) — SFTTrainer for supervised fine-tuning - [PEFT](https://huggingface.co/docs/peft) — LoRA adapter training - [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) — standardized benchmarking - [vast.ai](https://vast.ai) — cloud GPU rental (H100) - [FastAPI](https://fastapi.tiangolo.com) — tool agent server framework - [Firecrawl](https://firecrawl.dev) — MCP-based web scraping tools