
Benchmarks Said Frontier. Developers Said "Dumb."

Gemini 3.5 Flash topped MCP Atlas, Toolathlon and CharXiv on day one. By the next morning a developer on Google's own forum had documented the model looping for 776 steps. The gap between the benchmark and the work is not a bug.

Gemini 3.5 Flash hit general availability at Google I/O on 19 May 2026. By the next morning, a developer on Google's own AI Developers Forum had posted a forensic dump titled "Antigravity 2.0..... lackluster? Gemini 3.5 Flash seems.... dumb". The thread is still up.

The developer's first real task: verify a proposed AI architecture using a custom workflow.

What Flash did: 776 steps. 194 model responses. 160 file reads. 4 actual commands. 2 file writes.

An 80:1 read-to-write ratio. Four complete restart cycles re-reading the same LaTeX files. Every time context compacted, the model lost track of work it had already done and started over. Then started over. Then started over again.

To work out what had happened, the developer switched to Opus 4.6 and asked it to read the logs. The diagnosis came from the competitor.

That's the launch story.

The case for Flash is real on paper

Pull up the model card and the press deck is doing real work. Gemini 3.5 Flash tops MCP Atlas at 83.6%, Toolathlon at 56.5%, CharXiv Reasoning at 84.2%, and posts 1656 Elo on GDPval-AA. It beats Gemini 3.1 Pro on most of them. On several it goes toe to toe with Claude Opus 4.7 and GPT-5.5. Sundar Pichai called it "frontier intelligence with action."

The New Stack ran with it. Seeking Alpha ran "surpasses GPT-5.5 in agentic benchmarks." The chart, the bars, the green ticks. All of it real.

Then a developer pressed Enter on a task and watched it loop for an hour.

What benchmarks measure, and what they don't

The thing I keep telling people about benchmarks: they are real, and they are not what you think they are.

MCP Atlas measures whether a model, inside a controlled harness, on tasks shaped to fit a specific tool-use pattern, produces roughly the right token sequence to complete the task. That's a useful capability. It is genuinely improving year over year. I am not disputing the score.

Real engineering work tests whether the agent:

  • Recovers from context compaction without amnesia
  • Notices that it already wrote sympy_verification.py on Cycle 1 before re-reading every input file on Cycle 4
  • Writes intermediate state to disk so the loop survives a context flush
  • Decomposes a task into subagents when subagent tools are available
  • Detects when it's stuck and breaks the pattern

None of these are in MCP Atlas. None of these are in Toolathlon. They can't be, because the benchmark scaffold runs short, controlled, isolated tasks and the harness does the bookkeeping that breaks in production. The forum poster's own diagnosis is razor sharp on this: "That's not a model intelligence problem. That's a platform engineering gap."

The benchmark didn't lie. It just measured something that wasn't your job.
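Of the behaviours in that list, loop detection is the easiest to bolt on from outside the model. Here is a minimal sketch, assuming your harness can see each tool call as a (kind, target) pair. The LoopDetector class, the "read" and "write" event kinds, the threshold, and the file name are all hypothetical, not anything Antigravity exposes:

```python
from collections import Counter


class LoopDetector:
    """Flags an agent that keeps re-reading the same file without writing.

    Hypothetical helper: the event kinds and the default threshold are
    assumptions made for illustration.
    """

    def __init__(self, max_rereads: int = 3):
        self.max_rereads = max_rereads
        self.reads_since_write = Counter()

    def observe(self, kind: str, target: str) -> bool:
        """Record one tool call; return True if the agent looks stuck."""
        if kind == "write":
            # A write counts as progress, so reset the re-read tally.
            self.reads_since_write.clear()
            return False
        if kind == "read":
            self.reads_since_write[target] += 1
            return self.reads_since_write[target] > self.max_rereads
        # Other tool calls neither count as progress nor as a re-read.
        return False


detector = LoopDetector(max_rereads=3)
trace = [("read", "paper.tex")] * 4  # hypothetical trace: same file, four times
flags = [detector.observe(kind, target) for kind, target in trace]
```

With a threshold of three, the fourth consecutive read of the same file is the first one flagged. What the harness does with the flag (inject a reminder, force a checkpoint, stop the run) is a separate design decision.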

The subagent thing is the tell

This is where it gets sharp. The I/O 2026 keynote built the Antigravity 2.0 narrative around multi-agent orchestration: spin up subagents, delegate work, run in parallel. That was the demo. That was the wow.

In the developer's session, Flash had invoke_subagent, define_subagent, and manage_subagents in its toolset. It used none of them. It looped on its own context, reading files it had already read, planning work it had already planned.

A working subagent system would have spawned a research subagent to ingest the LaTeX files once, written a summary to disk, and kept the orchestrator context clean. Flash had the tools. It didn't have the strategic reasoning to use them.

That gap isn't on the benchmark sheet. Of course it isn't. MCP Atlas doesn't ask "did the model voluntarily decompose this task." It asks "did the model produce the expected output." Those are different questions, and only one of them is what you're paying for.
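The ingest-once pattern does not need a subagent framework to illustrate. Here is a minimal sketch, assuming summaries are keyed by a hash of the file contents and kept in a JSON file on disk. SummaryLedger, the ledger path, and the summarize callback are hypothetical stand-ins for whatever a research subagent would actually do:

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


class SummaryLedger:
    """Keeps one summary per file on disk so a context flush does not
    force the expensive ingestion to run again. Hypothetical helper."""

    def __init__(self, ledger_path: Path):
        self.ledger_path = ledger_path
        # Reload earlier work if a ledger already exists on disk.
        if ledger_path.exists():
            self.entries = json.loads(ledger_path.read_text())
        else:
            self.entries = {}

    def summary_for(self, source: Path, summarize: Callable[[str], str]) -> str:
        # Hashing happens in the harness, outside the model's context.
        text = source.read_text()
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.entries:
            # First sight of this content: summarize once and persist.
            self.entries[key] = summarize(text)
            self.ledger_path.write_text(json.dumps(self.entries))
        return self.entries[key]
```

A fresh SummaryLedger pointed at the same path after a compaction finds the stored summary and never calls summarize again for unchanged files. Only the short summary needs to go back into the orchestrator's context.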

Who the benchmark is actually for

Here's the part nobody says out loud. Benchmark wins are not for you.

They're for the slide that lands in front of a CTO who needs to justify a vendor switch. They're for the procurement team comparing rows in a spreadsheet. They're for the analyst who needs a number for a report. They're for the press release that runs in The New Stack at 9:01 AM on launch day.

The model then gets used by a developer on row 5 of an Antigravity session, on a real task, with their own files, under context pressure. That developer is downstream of a sales motion, not the audience for it.

This isn't a Google problem. Anthropic does it. OpenAI does it. The whole industry runs on benchmark-as-marketing, with model-as-product as the second-order consequence. The misalignment isn't a bug they'll fix in 3.5 Flash 2. It's the go-to-market.

If you find that uncomfortable, good. It has the same shape as the parrot problem and the prompt-versus-spec gap. The marketed thing and the working thing are not the same thing, and the gap is where the money is made.

The price twist

While we're here. Gemini 3.5 Flash costs 3x as much as Gemini 3 Flash Preview and 6x as much as Flash-Lite, as Simon Willison flagged on day one: $1.50 per million input tokens, $9 per million output. That's pushing up against Gemini 3.1 Pro pricing.

GPT-5.5 was 2x GPT-5.4. Opus 4.7 is roughly 1.46x Opus 4.6.

All three labs are probing what you'll pay, and you're paying it. The bet is that benchmark wins justify the bump. The bet only works if you don't notice that those wins evaporate the moment context compaction enters the chat.
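To see what that bet costs you, price a run and then divide by how often the run actually finishes. This is a rough sketch using the per-token prices quoted above. The token counts and completion rate in the usage note are made-up inputs, and dividing by completion rate assumes failed runs are simply retried and succeed independently:

```python
INPUT_PRICE_PER_M = 1.50   # USD per million input tokens, as quoted above
OUTPUT_PRICE_PER_M = 9.00  # USD per million output tokens, as quoted above


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run at the quoted Gemini 3.5 Flash prices."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


def cost_per_completed_task(cost_per_run: float, completion_rate: float) -> float:
    """Expected spend per finished task under the retry assumption."""
    if not 0 < completion_rate <= 1:
        raise ValueError("completion_rate must be in (0, 1]")
    return cost_per_run / completion_rate
```

For a hypothetical run that burns 2,000,000 input tokens and 100,000 output tokens, run_cost gives $3.90. If only half of such runs finish, the expected cost per completed task is $7.80. Plug in your own token counts and your own measured completion rate; the quoted price is only the first factor.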

What to actually do

Read benchmarks like spec sheets, not consumer reviews. A spec sheet tells you what the thing was designed to do in a controlled environment. It doesn't tell you whether it'll survive a Tuesday.

Then run your own evals. Not synthetic ones. Real ones. On your codebase, with your tools, on tasks that last longer than the demo. Measure what actually matters: completion rate on multi-hour tasks, recovery from compaction, loop detection, useful subagent delegation. Score the same task across three models, including a cheap one. Watch your assumption about which is "best" fall apart.

You'll end up with a different shortlist than the one in the press deck. That's the point.
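A scorecard for that kind of eval can be very small. Here is a minimal sketch, assuming you log one record per run with the fields below. The Run dataclass and its field names are hypothetical, and the metrics are only the ones named above, not an established benchmark:

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str
    completed: bool   # did the task actually finish?
    file_reads: int
    file_writes: int
    restarts: int     # times the agent started the task over
    delegated: bool   # did it use a subagent tool at all?


def scorecard(runs: list[Run]) -> dict[str, dict[str, float]]:
    """Aggregate per-model metrics from recorded runs of the same task."""
    by_model: dict[str, list[Run]] = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)

    card: dict[str, dict[str, float]] = {}
    for model, rs in by_model.items():
        n = len(rs)
        total_writes = sum(r.file_writes for r in rs)
        card[model] = {
            "completion_rate": sum(r.completed for r in rs) / n,
            "mean_restarts": sum(r.restarts for r in rs) / n,
            # max(1, ...) avoids dividing by zero when nothing was written.
            "reads_per_write": sum(r.file_reads for r in rs) / max(1, total_writes),
            "delegation_rate": sum(r.delegated for r in rs) / n,
        }
    return card
```

Feed it the same task run several times on each of your three models and compare the rows. A high reads_per_write combined with a non-zero mean_restarts is the signature of the loop described at the top of this piece.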

Pick the headline

A developer's first real task on Gemini 3.5 Flash crashed Google's flagship into a four-cycle loop. They had to use a competitor's model to write the post-mortem. The thread is still up on Google's own forum.

The press release said: surpasses GPT-5.5 in agentic benchmarks.

You pick which one is the launch story.