---
title: "Fine-Tuning π0 on RoboTwin: Cutting Open-Loop MAE 30% in Phase B2"
date: 2026-05-07T01:30:00.000Z
description: "Two 30k-step π0 checkpoints, one open-loop benchmark on the LeRobot stack, and a 30% MAE drop. The Phase B2 fine-tune of pi0_base on RoboTwin-unified data, evaluated across 25 held-out episodes — and now published to the Hugging Face Hub."
tags: ["robotics", "pi0", "lerobot", "robotwin", "imitation-learning", "fine-tuning", "vision-language-actions", "huggingface", "aloha", "machine-learning"]
tokens: 1716
content-signal: search=yes, ai-input=yes, ai-train=no
---


![π0 RoboTwin Phase B2 — 30k checkpoint headline stats](/images/posts/pi0-robotwin-phaseB2-30k-eval/hero.png)

## TL;DR — Key Takeaways

- **Phase B2 ≠ Phase B.** Same `lerobot/pi0_base` start, same RoboTwin-unified data, same 30k steps — different recipe. Phase B2 cut **aggregate open-loop MAE by 30%** (0.220 → 0.154) and roughly **doubled SR @ MAE 0.20** (15% → 31%) on held-out episodes.
- **Eval, not vibes.** Both runs use the LeRobot open-loop harness: re-create the policy with `PI0Policy.from_pretrained(...)`, replay held-out RoboTwin trajectories, score MSE/MAE per joint plus step success rates at thresholds.
- **Right arm got the most help.** Per-joint MAE drops the most on joints 7–13 (right arm + gripper), where Phase B was clearly weaker.
- **Action plots show what the metrics hide.** A pred-vs-gt overlay for episode `22042` shows the policy tracking gross trajectories well, with a mid-episode discontinuity that pulls average MAE up.
- **Weights are public.** The Phase B2 30k checkpoint lives at [`sumitagrawal/pi0-robotwin-phaseB2-30k`](https://huggingface.co/sumitagrawal/pi0-robotwin-phaseB2-30k) — `PI0Policy.from_pretrained(...)` and you’re running.

---

## Why Open-Loop Eval (and Why It’s Honest)

π0 is a **vision-language-action** policy: cameras + state + a task string in, action chunks out. The cleanest way to measure “did this fine-tune learn something?” without a sim wrapper is **open-loop replay**:

1. Take a held-out episode from the same dataset family.
2. At each step, give the policy the **real** observation.
3. Have it predict the next action chunk.
4. Score the chunk against the **recorded** ground-truth actions.

It doesn’t close the loop on physics — that needs a sim or robot. But it isolates the **policy’s** behavior from compounding sim error and gives you per-step, per-joint numbers you can actually compare across runs.
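The four steps above reduce to a small replay-and-score loop. A minimal sketch in plain Python, where `predict_chunk` is a hypothetical stand-in for the policy call and actions are plain lists of per-joint floats (the real harness works on tensors through the LeRobot APIs):

```python
def mae(pred, gt):
    """Mean absolute error between two equal-length action vectors."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def score_episode(observations, gt_actions, predict_chunk, chunk_len=10):
    """Open-loop replay: at each chunk boundary, feed the REAL recorded
    observation, predict an action chunk, and score it against the
    recorded ground-truth actions. Returns the per-step MAE series."""
    per_step = []
    step = 0
    while step < len(gt_actions):
        chunk = predict_chunk(observations[step])  # policy sees the real obs
        for k, action in enumerate(chunk):
            if step + k >= len(gt_actions):
                break
            per_step.append(mae(action, gt_actions[step + k]))
        step += chunk_len
    return per_step
```

The key property: the policy never sees its own predictions, so errors do not compound the way they would in a closed-loop rollout.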

The two runs in this post:

| Run | Checkpoint | Episodes | Aggregate MAE | Aggregate MSE | Runtime |
|------|------------|----------|----------------|----------------|---------|
| Phase B | `pi0-phaseB-20260503-205352 / 030000` | 5 | 0.2202 | 0.1960 | 9 min |
| **Phase B2** | `pi0-phaseB2-20260505-151435 / 030000` | **25** | **0.1543** | **0.1283** | 42 min |

Same base model. Same step count. Phase B was the smaller pilot; Phase B2 used the refined recipe and a wider eval slice.

---

## The Headline Numbers

![Phase B vs Phase B2 — MAE and success-rate comparison](/images/posts/pi0-robotwin-phaseB2-30k-eval/phaseB-vs-phaseB2-headline.png)

Two complementary views: **MAE** (lower is better) and **success rate at MAE thresholds** (higher is better). The success rate at threshold `t` is the fraction of steps where the per-step MAE is `≤ t` — a forgiving but useful proxy for “did the policy at least stay in the neighborhood?”.

| Metric | Phase B | Phase B2 | Δ |
|---|---|---|---|
| Aggregate MAE | 0.2202 | 0.1543 | **−30.0 %** |
| Aggregate MSE | 0.1960 | 0.1283 | −34.5 % |
| SR @ MAE 0.05 | 0.1 % | 1.0 % | +0.9 pp |
| SR @ MAE 0.10 | 3.4 % | 9.4 % | +6.0 pp |
| SR @ MAE 0.20 | 15.1 % | **30.9 %** | **+15.8 pp** |
| SR @ MAE 0.50 | 43.2 % | 61.1 % | +17.9 pp |

The 0.20 threshold is the most operational: it’s the band where the policy is still “close enough” for replay in a sim controller, before chunk re-planning kicks in.
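The threshold metric as defined above is a one-liner over the per-step MAE series (hypothetical helper name; the harness's own implementation may differ):

```python
def success_rate(per_step_mae, threshold):
    """Fraction of steps whose MAE is at or below the threshold."""
    hits = sum(1 for m in per_step_mae if m <= threshold)
    return hits / len(per_step_mae)

# Toy 5-step episode: 3 of 5 steps land within the 0.20 band
steps = [0.04, 0.12, 0.18, 0.25, 0.60]
print(success_rate(steps, 0.20))  # → 0.6
```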

---

## Where the Improvement Lives: Per-Joint MAE

ALOHA-style 14-DoF actions: `j0–j6` are the **left arm + gripper**, `j7–j13` are the **right arm + gripper**. The two runs’ per-joint MAE side by side:

![Per-joint MAE — Phase B vs Phase B2](/images/posts/pi0-robotwin-phaseB2-30k-eval/per-joint-mae.png)

- Phase B already did fine on `j4` (a near-static gripper-axis joint).
- The biggest absolute drops show up on `j8`, `j9`, `j12` — right-arm joints that involve **wider, multi-second motions**. Phase B was wobbly there; B2 tightened the tracking.
- Left arm (`j0–j6`) improved more modestly; it was already in a good place.

This is exactly what you want a refinement run to look like: focused improvement on the weakest joints, no regressions elsewhere.
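Per-joint numbers come from averaging absolute error along the time axis instead of across joints. A minimal sketch with lists of lists in place of tensors (not the harness's actual code):

```python
def per_joint_mae(pred, gt):
    """pred, gt: [T][J] action arrays (T timesteps, J joints).
    Returns a length-J list: MAE over time, separately per joint."""
    T, J = len(gt), len(gt[0])
    return [
        sum(abs(pred[t][j] - gt[t][j]) for t in range(T)) / T
        for j in range(J)
    ]
```

Slicing the same error tensor two ways (per step for the SR thresholds, per joint for this plot) is what lets one eval run answer both "how close?" and "close where?".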

---

## How Consistent Is It? Per-Episode MAE on Phase B2

Five episodes is an anecdote. **25** is the start of a distribution.

![Per-episode MAE — Phase B2, 25 held-out episodes](/images/posts/pi0-robotwin-phaseB2-30k-eval/per-episode-mae-phaseB2.png)

Bars below the dashed mean line are coloured teal (better than average); bars above it are rose. Most episodes cluster around 0.10–0.20 MAE, with a few outliers near 0.30 (longer or harder-to-track tasks).

That long tail is informative — it tells you which episode types to **target next** (more data, weighted sampling, or a curriculum bump on those task IDs).
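One concrete way to "target next" is to turn the per-episode MAE tail directly into sampling weights for the next training phase. A sketch of the idea (not the training repo's actual sampler):

```python
def sampling_weights(episode_mae):
    """Weight episodes proportionally to their open-loop MAE, normalized
    to sum to 1, so higher-error episodes are drawn more often."""
    total = sum(episode_mae)
    return [m / total for m in episode_mae]

# Two easy episodes at 0.10 and one outlier at 0.30:
# the outlier gets 3x the draw probability.
print(sampling_weights([0.10, 0.10, 0.30]))  # → [0.2, 0.2, 0.6]
```

In practice you would likely temper the weights (e.g. clip or take a root) so easy episodes don't vanish from the mix entirely.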

---

## What Open-Loop Looks Like (Episode 22042)

This is the per-step prediction overlay for one of the harder episodes:

![Predicted vs ground-truth actions — episode 22042](/images/posts/pi0-robotwin-phaseB2-30k-eval/actions_pred_vs_gt-episode22042.png)

A few things stand out:

- **Most joints** (`j8`–`j13`, much of `j0`) show the predicted (red) curve hugging the ground-truth (blue) shape over the full 151-step window.
- **Around step 50**, several joints show a sharp discontinuity: the predicted chunk transitions don't line up perfectly with a ground-truth contact event. That single window accounts for a meaningful share of the episode-level MAE.
- **The static joints** (`j4`, `j6`) sit near zero with low-amplitude jitter — expected.

That mid-episode jump is a familiar failure mode for open-loop eval: chunked action policies are **planning-ahead** models, and a 1-step shift in when a contact is predicted can produce a tall, brief MAE spike that dominates the episode score.
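The effect is easy to reproduce numerically: shift a step-function "contact" by a single timestep and the per-step error is zero everywhere except one tall spike, which alone lifts the episode mean. A toy illustration:

```python
T = 100
contact = 50
gt   = [0.0 if t < contact else 1.0 for t in range(T)]      # contact at step 50
pred = [0.0 if t < contact + 1 else 1.0 for t in range(T)]  # predicted 1 step late

errors = [abs(p - g) for p, g in zip(pred, gt)]
# A single step of |error| = 1.0 lifts the episode mean MAE to 0.01
print(max(errors), sum(errors) / T)  # → 1.0 0.01
```

With several such contact events per episode, timing jitter alone can dominate the score even when the trajectory shape is tracked well.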

A short rollout from the same episode (10 fps, 720p):

![Open-loop rollout — episode 22042](/images/posts/pi0-robotwin-phaseB2-30k-eval/rollout-episode22042.gif)

---

## The Pipeline

![Phase B2 pipeline — base → fine-tune → eval → publish](/images/posts/pi0-robotwin-phaseB2-30k-eval/pipeline.png)

1. **Base.** [`lerobot/pi0_base`](https://huggingface.co/lerobot/pi0_base).
2. **Data.** RoboTwin-unified — the LeRobot-format aggregated RoboTwin set (`lerobot/robotwin_unified`).
3. **Train.** RunPod **H100** pod, weights/checkpoints staged on **Cloudflare R2** so spot interruptions don’t cost progress. 30k steps of standard π0 fine-tuning.
4. **Eval.** A small open-loop harness using `LeRobotDataset` + `PI0Policy.from_pretrained(...)` + `make_pre_post_processors(...)`, run on the `030000` checkpoint.
5. **Publish.** Stage `pretrained_model/` (weights + tokenizer/preprocessor shards) + a model card → `huggingface_hub.upload_folder` → public Hub repo.

The training compute and data prep are the patient parts. The eval-and-publish loop is fast — minutes once you have a checkpoint you trust.

---

## Using the Model

```python
from lerobot.policies.pi0.modeling_pi0 import PI0Policy

policy = PI0Policy.from_pretrained("sumitagrawal/pi0-robotwin-phaseB2-30k")
policy.eval()
# Build observation batch from your robot (or LeRobotDataset row):
#   observation.images.*  — multi-camera tensors
#   observation.state     — robot state vector
#   task                  — language instruction string
# Use make_pre_post_processors(policy_cfg=policy.config, pretrained_path=<repo>)
# to wire input/output normalization correctly.
```

For an end-to-end working example (load → build batch from a dataset row → first action chunk), the same training repo ships an `inference_pi0_from_hub.py` you can adapt.

---

## What I’d Do Next

- **Sim-in-the-loop.** Open-loop is a screening tool; closed-loop in a RoboTwin sim is where these MAE numbers translate to real **task success**.
- **Targeted data on weak episodes.** The 25-episode distribution surfaces a few stubborn outliers — those are the next batch to over-sample (or to inspect for label issues).
- **Chunk-aware loss tweaks.** The mid-episode discontinuity in episode `22042` is classic chunked-policy behavior; a small loss term on chunk-boundary smoothness might help.
- **Bigger eval, public artifacts.** 50–100 episodes is the next bar, with the metrics + plots committed alongside the model card so anyone can reproduce.

---

## Honest Caveats

- This is **open-loop**. Real task success requires sim or hardware.
- The two runs differ in eval breadth (5 ep vs 25 ep). The B2 distribution is the truer picture; B is a pilot baseline.
- The dataset and base model carry their own licenses/limits — see `lerobot/pi0_base` and `lerobot/robotwin_unified` on the Hub.
- “Same 30k steps” doesn’t mean “same total compute” — recipe differences (LR schedule, batch composition, augmentation) account for the lift.

---

## Links

- **Model:** [`sumitagrawal/pi0-robotwin-phaseB2-30k`](https://huggingface.co/sumitagrawal/pi0-robotwin-phaseB2-30k) on Hugging Face
- **Base model:** [`lerobot/pi0_base`](https://huggingface.co/lerobot/pi0_base)
- **LeRobot:** [github.com/huggingface/lerobot](https://github.com/huggingface/lerobot)
- **RoboTwin-unified:** [`lerobot/robotwin_unified`](https://huggingface.co/datasets/lerobot/robotwin_unified)

If you spin this up on your own robot or sim and the numbers look different — please share. Open-loop is a starting line, not a finish line.
