Same quality, a quarter of the cost: Should DeepSeek Flash be your model of choice?

Discover if DeepSeek Flash is your cost-effective AI model choice, offering comparable quality at a fraction of the price. Explore our detailed analysis.

Rob Willoughby, Simon Maple

9 Jun 2026 · 8 min read

$0.0236 is how much DeepSeek V4 Flash costs to run a complete agentic task, skill included, on the Fireworks price sheet. Claude Haiku 4.5 costs $0.10 for the same task. Sonnet 4.6 costs $0.30.

On quality, Flash scores 82.3 in our evals and Haiku scores 82.9. With skills applied, the evals point to the two being comparable, but Haiku costs roughly four times as much.

We ran 19 model configurations through the same benchmark harness. The tasks were real agentic tasks; we measured total token counts and priced each run at the rates the provider actually charges. To be honest, the value story we expected to find was "cheap models are a trap." What we found instead was more interesting, and particularly useful if you're running agents at any kind of scale.

First, the Pro comparison

DeepSeek V4 ships two tiers: Pro and Flash. In our eval runs, Pro costs $0.183/task and Flash costs $0.0236/task. That's a 7.7× price gap within the same model family.

The extra spend buys only three points: on the eval results, Pro scores 85.3 and Flash scores 82.3. At scale, that per-task difference of about $0.16 adds up to roughly $19,000/year extra at 10,000 tasks/month and roughly $190,000/year extra at 100,000 tasks/month. A three-point gap may not even be noticeable in output quality.
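The scaling arithmetic above can be reproduced with a few lines. This is a minimal sketch using the two per-task costs quoted in this section; the function name and the assumption of a constant monthly volume are ours, purely for illustration.

```python
# Per-task costs in USD, taken from the eval runs quoted above.
FLASH_COST_PER_TASK = 0.0236
PRO_COST_PER_TASK = 0.183


def extra_annual_cost(
    tasks_per_month: int,
    cheap: float = FLASH_COST_PER_TASK,
    expensive: float = PRO_COST_PER_TASK,
) -> float:
    """Extra yearly spend from choosing the pricier model at a constant volume."""
    return (expensive - cheap) * tasks_per_month * 12


if __name__ == "__main__":
    for volume in (10_000, 100_000):
        print(f"{volume:>7,} tasks/month: ${extra_annual_cost(volume):,.0f}/year extra")
```

Swapping in your own per-task costs and volumes gives the same comparison for any pair of models.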

Points-per-dollar

Dividing eval score by cost per task gives points per dollar, a single ratio of quality to cost. It is a useful ranking, so long as the overall quality of the model already satisfies your needs.

| Model | Score (w/ skill) | $/task | pts/$ |
|---|---|---|---|
| DeepSeek V4 Flash | 82.3 | $0.024 | 3,482 |
| Haiku 4.5 | 82.9 | $0.097 | 829 |
| DeepSeek V4 Pro | 85.3 | $0.183 | 467 |
| GLM 5.1 | 90.4 | $0.200 | 451 |
| Sonnet 4.6 | 90.8 | $0.296 | 303 |
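As a rough sketch of how this ranking can be computed, the snippet below divides score by cost per task and applies a quality floor before ranking. The per-task costs are the rounded figures quoted in this article, so the ratios it produces will differ slightly from the published pts/$ column, which was presumably computed from unrounded costs. The `Result` type and function names are illustrative assumptions, not part of any benchmark tooling.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    score: float          # eval score with skill applied
    cost_per_task: float  # USD, rounded figures quoted in the article


RESULTS = [
    Result("DeepSeek V4 Flash", 82.3, 0.0236),
    Result("Haiku 4.5", 82.9, 0.097),
    Result("DeepSeek V4 Pro", 85.3, 0.183),
    Result("GLM 5.1", 90.4, 0.200),
    Result("Sonnet 4.6", 90.8, 0.296),
]


def points_per_dollar(result: Result) -> float:
    """Quality-to-cost ratio: eval points bought per dollar of task spend."""
    return result.score / result.cost_per_task


def rank_by_value(results: list[Result], quality_floor: float = 80.0) -> list[Result]:
    """Drop models below the quality floor, then sort by points per dollar."""
    eligible = [r for r in results if r.score >= quality_floor]
    return sorted(eligible, key=points_per_dollar, reverse=True)


if __name__ == "__main__":
    for r in rank_by_value(RESULTS):
        print(f"{r.model:<20} {r.score:>5.1f} pts  {points_per_dollar(r):>8,.0f} pts/$")
```

Applying the floor first matters: without it, a very cheap model with an unusable score would top the list.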

The number your cost model is probably missing

Cost-per-token is the number everyone quotes, and it is often mistaken for the most important factor in the decision. It's also the number that will quietly blow your budget if you're not watching turns per solve as well.

Flash averages around 20 turns per task, which is pretty manageable. But the worst-case runs in our dataset hit roughly 10× that. This isn't unusual for models in this class, but in dollar terms it means a single task can cost as much as 10 average tasks. Multiply that across thousands of concurrent agent runs and you may start to have a budget problem that didn't show up in your per-token estimate.

The reason most teams don't catch this is that agent frameworks surface token counts by default. Turn count, the variable that actually drives fat-tail cost explosions, often needs to be logged explicitly.

Instrument your agents for turns, not just tokens. Know your median and your 95th percentile. Set your timeout policies against the 95th, not the median, or you're either killing valid runs or absorbing surprise bills.
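A minimal sketch of the turn statistics described above, assuming you have already logged one turn count per completed run. The sample data is invented purely to illustrate a fat tail, and the nearest-rank percentile is one reasonable choice among several, not the method used in the benchmark.

```python
import statistics


def turn_stats(turn_counts: list[int]) -> dict[str, float]:
    """Median, nearest-rank 95th percentile, and max of logged turn counts."""
    if not turn_counts:
        raise ValueError("need at least one logged run")
    ordered = sorted(turn_counts)
    # Nearest-rank p95: ceil(0.95 * n), done in integer arithmetic to avoid
    # floating-point surprises at exact boundaries.
    rank = max(1, -(-95 * len(ordered) // 100))
    return {
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


def exceeds_turn_cap(turns_so_far: int, cap: int) -> bool:
    """True once a running agent has gone past the configured turn cap."""
    return turns_so_far > cap


if __name__ == "__main__":
    # Hypothetical turn counts for 20 runs, with one fat-tail outlier.
    logged = [12, 15, 18, 18, 19, 20, 20, 20, 21, 21,
              22, 22, 23, 24, 25, 26, 28, 31, 45, 200]
    stats = turn_stats(logged)
    print(stats)
    print("kill run at 60 turns with cap = p95?", exceeds_turn_cap(60, stats["p95"]))
```

In this made-up sample the median sits near the typical run while the p95 and max expose the tail, which is the gap a cap set against the median would miss.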

The skill is doing half the work

One thing worth being very direct about here is that Flash's 82.3 score is a skill-augmented score. Without a skill, Flash scores 64.1. The skill adds +18.2 points.

That lift is real, but very conditional on the skill being precise, well-scoped, and actually relevant to the task. A vague skill will drag you back down closer to the 64.1 baseline, whereas a sharp one gets you 82.3.

This matters more than most model evaluations acknowledge: the model you test in a playground usually runs without a skill or relevant context, so you are measuring raw capability only.

Going further: find cheaper models and test them yourself

The analysis above shows the cheapest hosted options we measured. But there are two obvious next steps if you want to push it further, and both are more accessible than you might think.

Every model in this benchmark that isn't a GPT, Anthropic, or Gemini model has publicly available weights. That includes DeepSeek V4 Flash and GLM 5.1, so you can run them yourself. When you do, the marginal token cost drops to near zero. You're paying for compute (GPU rental or owned infra), not per-call pricing.

The maths of self-hosting only makes sense above a certain volume threshold, because the ops overhead and GPU costs aren't free. But if you're running tens of thousands of agentic tasks per month, the crossover point is lower than you'd expect.
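To make the crossover idea concrete, here is a rough break-even sketch. The hosted per-task costs come from this article; the $2,000/month fixed cost for GPUs and ops is a placeholder we made up, and the default of zero marginal cost per self-hosted task is a simplifying assumption. Substitute your own figures.

```python
import math


def breakeven_tasks_per_month(
    fixed_monthly_cost: float,
    hosted_cost_per_task: float,
    self_hosted_marginal_cost_per_task: float = 0.0,
) -> int:
    """Smallest monthly task volume at which self-hosting is no more expensive."""
    saving_per_task = hosted_cost_per_task - self_hosted_marginal_cost_per_task
    if saving_per_task <= 0:
        raise ValueError("self-hosting never breaks even at these per-task costs")
    return math.ceil(fixed_monthly_cost / saving_per_task)


if __name__ == "__main__":
    # Hypothetical fixed cost (GPU rental plus ops time), not a measured figure.
    fixed = 2_000.0
    for name, hosted in (("DeepSeek V4 Flash", 0.0236), ("DeepSeek V4 Pro", 0.183)):
        volume = breakeven_tasks_per_month(fixed, hosted)
        print(f"{name}: break-even at about {volume:,} tasks/month")
```

Note that the cheaper the hosted price you are replacing, the higher the volume needed before self-hosting pays off.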

The skill in this benchmark is doing +18.2 points of work. The question worth asking is: where did that skill come from, and how do you know it's any good?

The Tessl registry is a good place to start: it lets you look at the quality, impact, and security posture of a skill. Before you write a skill from scratch, check whether one already exists and has eval data behind it.

Evaluate your skills properly. You can run two types of evaluation: reviews (automated quality assessment of whether your skill is well-structured) and task evals (end-to-end runs that measure whether the skill actually improves agent performance on real tasks). The task eval output is exactly the kind of "with skill / without skill" delta that the Flash benchmark is built on.

Use skill quality as a model selection input. The 18-point lift Flash gets from a well-scoped skill isn't a fixed number; it depends on the skill and the tasks. A skill that has been evaluated by Tessl with a high task eval score gives you confidence that the lift is real and reproducible. A skill that's never been evaluated is a variable you can't account for in your cost modelling.

Your own workload, not someone else's benchmark. The task eval system lets you define scenarios from your actual codebase and run them. That's the self-evaluation framework described above.
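The "with skill / without skill" delta discussed above is simple to compute once you have per-task scores from both runs. The sketch below is independent of any particular eval tool: the function name, the dictionary layout, and the sample scores are all illustrative assumptions, not Tessl's API or data from this benchmark.

```python
from statistics import mean


def skill_lift(
    with_skill: dict[str, float],
    without_skill: dict[str, float],
) -> dict[str, float]:
    """Mean score with and without the skill, over tasks present in both runs."""
    shared_tasks = with_skill.keys() & without_skill.keys()
    if not shared_tasks:
        raise ValueError("no tasks in common between the two runs")
    baseline = mean(without_skill[t] for t in shared_tasks)
    augmented = mean(with_skill[t] for t in shared_tasks)
    return {
        "baseline": baseline,
        "with_skill": augmented,
        "lift": augmented - baseline,
    }


if __name__ == "__main__":
    # Hypothetical per-task scores keyed by task id.
    without = {"refactor-auth": 60.0, "add-endpoint": 70.0, "fix-flaky-test": 55.0}
    with_ = {"refactor-auth": 82.0, "add-endpoint": 88.0, "fix-flaky-test": 75.0}
    print(skill_lift(with_, without))
```

Restricting the comparison to tasks that appear in both runs keeps the delta from being skewed by a task that only one configuration attempted.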

The takeaways, flat out

  • DeepSeek V4 Flash at $0.0236/task is the value pick. Haiku costs 4× more for 0.6 points. Pro costs 7.7× more for 3 points.
  • Set a quality floor before you rank by cost. pts/$ flatters cheap-and-weak models. Above 80 points, it's a real signal.
  • Instrument for turns, not just tokens. Your 95th percentile turn count is the budget variable nobody's logging.
  • The skill is doing half the work. A bad skill collapses your score back to baseline. Evaluate your skills — with task evals, not vibes.
  • You can run this yourself. 20-30 tasks, turn logging, a spreadsheet, and Tessl's eval system.
  • Self-hosting open-weight models is a real option. The weights are public, and the ops trade-off is real. Run your own evals on those models to see whether they can be substituted in.

The tier name told you Flash was cheap; the data says it's also good. Now you have the tools to find out whether that holds for what you're building.

Rob Willoughby

Member of Technical Staff, AI Research Lead at Tessl

Simon Maple

Simon Maple is Tessl’s Founding Developer Advocate, a Java Champion, and former DevRel leader at Snyk, ZeroTurnaround, and IBM.
