Quick Take
What this page helps answer
A source-first tracker of benchmark claims made by Asian AI companies and labs, focused on official release surfaces and how to interpret them.
Who, How, Why
- Who: Asian Intelligence Editorial Team
- How: Prepared from cited public sources and reviewed against the site’s editorial standards.
- Why: To give readers sourced context on AI policy, company strategy, and technology development in Asia.
Asian AI Benchmark Claims Tracker
A provenance-first tracker for benchmark claims made on official release surfaces as of March 29, 2026.
How To Use This Page
This tracker is not trying to decide which model is "best." It is trying to show where benchmark claims actually come from, which benchmarks are being foregrounded, and what kind of release surface made the claim. That matters because the same company can cite one set of benchmarks in a GitHub repo, another on a product page, and a third in a media interview.
Verified Claim Surfaces
| Model | Country | Official source | Benchmarks explicitly surfaced | What readers should notice |
|---|---|---|---|---|
| DeepSeek-V3.2-Exp | China | DeepSeek GitHub repository | Humanity's Last Exam 19.8, AIME 2025 89.3, MMLU-Pro 85.0, GPQA-Diamond 79.9, LiveCodeBench 74.1 | The repo is explicit and numeric. This is a clean example of a Chinese model team using public benchmark tables as part of product positioning. |
| Kimi K2 Instruct | China | MoonshotAI Kimi K2 repository | Humanity's Last Exam text-only 5.7, GPQA-Diamond 75.1, plus extensive AIME, coding, and SWE-bench tables | Moonshot emphasizes breadth. The value here is not one score; it is the way the repo mixes reasoning, coding, and agentic evaluation surfaces. |
| Solar Pro 2 | South Korea | Upstage launch page | Ko-MMLU, Hae-Rae, Ko-IFEval, Ko-Arena-Hard-Auto, MMLU, MMLU-Pro, HumanEval, Math500, AIME, SWE-Bench Agentless | Upstage's claim surface is more product-page oriented. It foregrounds Korean leadership and practical reasoning strength rather than only one universal frontier score. |
| Qwen3 | China | Qwen3 repository | Official repo publishes benchmark tables and a technical report across multiple variants and reasoning modes | Qwen is a reminder that model-family tracking matters. Variant confusion is one of the easiest ways benchmark comparisons become misleading. |
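The rows above can also be flattened into structured records for auditing or comparison. The sketch below is a minimal, hypothetical representation: the `Claim` dataclass and its field names are illustrative, not part of any published export format, and the scores are simply copied from the table.

```python
from dataclasses import dataclass, field

# Hypothetical record shape for auditing claims; not a published export schema.
@dataclass
class Claim:
    model: str                       # model or variant named in the claim
    country: str                     # where the releasing lab is based
    surface: str                     # e.g. "GitHub repository" or "launch page"
    benchmarks: dict[str, float] = field(default_factory=dict)  # benchmark -> reported score

# Two rows from the table above, restated as records (scores as published by the labs).
claims = [
    Claim("DeepSeek-V3.2-Exp", "China", "DeepSeek GitHub repository",
          {"Humanity's Last Exam": 19.8, "AIME 2025": 89.3, "MMLU-Pro": 85.0,
           "GPQA-Diamond": 79.9, "LiveCodeBench": 74.1}),
    Claim("Kimi K2 Instruct", "China", "MoonshotAI Kimi K2 repository",
          {"Humanity's Last Exam (text-only)": 5.7, "GPQA-Diamond": 75.1}),
]

# Example audit question: which claims surface GPQA-Diamond, and at what reported score?
for claim in claims:
    if "GPQA-Diamond" in claim.benchmarks:
        print(claim.model, claim.surface, claim.benchmarks["GPQA-Diamond"])
```

Keeping the release surface as a field alongside the scores preserves the point of the tracker: the number only means something in the context of where it was published.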
What Makes Asian Benchmark Claims Distinctive
Three patterns stand out. First, Chinese model teams frequently publish benchmark tables directly inside GitHub repos, which makes the claim surface relatively easy to audit. Second, South Korean release pages tend to emphasize local-language strength and usable enterprise capabilities alongside general benchmarks. Third, the most informative claims are often the ones that reveal the company's real product strategy: multilingual strength, tool use, coding, agentic work, or evaluation breadth.
How To Read These Claims Correctly
- Check whether the benchmark table lives on an official repo, an official product page, or only a press interview.
- Check whether the claim is attached to a specific variant, such as instruct, thinking, or experimental mode.
- Check whether the release is emphasizing one benchmark because it flatters the model's real commercial strength.
- Take local-language and deployment benchmarks seriously in Asia, because they often matter more than a single leaderboard headline; a sketch of how this checklist can be applied mechanically follows below.
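One way to apply the checklist is to encode it as simple flags over claim records. The sketch below is an editorial illustration: the `needs_scrutiny` helper and its heuristics are assumptions layered on the checklist above, not a methodology published by any of the labs.

```python
# Hypothetical heuristics mirroring the reading checklist above; editorial
# assumptions, not a published methodology.
OFFICIAL_SURFACES = ("repository", "launch page", "product page")

def needs_scrutiny(surface: str, variant: str | None, benchmarks: dict[str, float]) -> list[str]:
    """Return human-readable flags for a single benchmark claim."""
    flags = []
    # Claims made only in interviews or press coverage are harder to audit.
    if not any(s in surface.lower() for s in OFFICIAL_SURFACES):
        flags.append("claim not tied to an official repo or product page")
    # Variant confusion (instruct vs. thinking vs. experimental) misleads comparisons.
    if variant is None:
        flags.append("no specific model variant attached to the claim")
    # A single foregrounded benchmark may simply flatter the product strategy.
    if len(benchmarks) == 1:
        flags.append("only one benchmark surfaced; check for cherry-picking")
    return flags

# A claim quoted only in an interview, with no variant and one headline score:
print(needs_scrutiny("media interview", None, {"AIME 2025": 89.3}))
```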