Research

NepNewsCluster: Benchmarking 13 LLMs on Bilingual Nepali News Consolidation

Bishal Sapkota · May 01, 2026

We're publishing NepNewsCluster v0.2 — to our knowledge, the first multi-document, bilingual, rubric-graded news consolidation benchmark for Nepali. It evaluates thirteen frontier and open-weight large language models on a task pulled directly from the production summariser that powers K cha khabar, our cross-publisher Nepali news intelligence platform.

Why we built it

Nepali has roughly 32 million speakers and a substantial diaspora across Australia, the Gulf, and South Asia. The existing benchmark suites — NLUE, IndicGenBench, Belebele, Global-MMLU, FLORES-200 — concentrate on classification, multiple-choice reading comprehension, or single-sentence translation. None of them measure the thing that matters most for a production news pipeline: bilingual generation under multi-document disagreement, with factuality graded across two scripts simultaneously.

K cha khabar ingests RSS feeds from thirty-plus Nepali outlets every fifteen minutes, clusters articles by underlying story, and renders each cluster as a bilingual brief. The economics of running that at production scale on a low-resource language depend entirely on which model performs the task well at low cost. Without a benchmark, model choice is anecdote. So we built one.
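
Concretely, the loop looks something like the sketch below. The helper names, the `embed()` function, and the similarity threshold are our illustration for this post, not the production implementation.

```python
# Illustrative sketch of the cluster-then-brief loop; not K cha khabar's
# production code. embed() and the 0.82 threshold are hypothetical.
SIMILARITY_THRESHOLD = 0.82  # illustrative cutoff for "same story"

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def assign_to_cluster(article: dict, clusters: list[dict], embed) -> dict:
    """Greedy single-pass assignment: join the most similar existing
    cluster above the threshold, otherwise open a new one."""
    vec = embed(article["title"] + " " + article["excerpt"])
    best, best_sim = None, 0.0
    for cluster in clusters:
        sim = cosine(vec, cluster["centroid"])
        if sim > best_sim:
            best, best_sim = cluster, sim
    if best is not None and best_sim >= SIMILARITY_THRESHOLD:
        best["articles"].append(article)
        return best
    new = {"centroid": vec, "articles": [article]}
    clusters.append(new)
    return new
```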

What models had to do

Given between three and fifteen short articles from different Nepali outlets covering the same story, the model had to produce all of the following in a single completion (a minimal validation sketch follows the list):

  • An English headline of twelve words or fewer
  • A Nepali (Devanagari) headline of fifteen words or fewer — actual Nepali, not transliteration
  • A 3–4 sentence English summary covering who/what/where/when/why
  • A 3–4 sentence Nepali summary conveying the same facts in idiomatic Devanagari prose
  • A typed list of named entities (person, org, place, party, event, policy)
  • A URL slug derived from the English headline
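
Here is a minimal sketch of the output contract implied by that list, written as a validator. The JSON field names are our illustration; the production schema and system prompt are not published here, and summary-length checks are omitted for brevity.

```python
# Validator sketch for the task's output contract. Field names are
# illustrative, not the production schema.
import json
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")
ENTITY_TYPES = {"person", "org", "place", "party", "event", "policy"}
SLUG = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def validate_brief(raw: str) -> list[str]:
    """Return the list of contract violations for one model completion."""
    try:
        brief = json.loads(raw)
    except json.JSONDecodeError:
        return ["completion is not valid JSON"]
    if not isinstance(brief, dict):
        return ["top-level JSON value is not an object"]
    errors = []
    if len(brief.get("headline_en", "").split()) > 12:
        errors.append("English headline over 12 words")
    headline_ne = brief.get("headline_ne", "")
    if len(headline_ne.split()) > 15:
        errors.append("Nepali headline over 15 words")
    if not DEVANAGARI.search(headline_ne):
        errors.append("Nepali headline has no Devanagari (transliteration?)")
    for ent in brief.get("entities", []):
        if ent.get("type") not in ENTITY_TYPES:
            errors.append(f"unknown entity type: {ent.get('type')!r}")
    if not SLUG.fullmatch(brief.get("slug", "")):
        errors.append("slug is not lowercase-hyphenated ASCII")
    return errors
```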

The task is hard for three reasons that compound: outlets disagree on arrest counts, casualty numbers, and political affiliations; Nepali and English can fail independently (a clean Nepali summary paired with a mistranslated English version is a failure mode invisible to single-language graders); and Nepali outlets publish dates in Bikram Sambat, so models routinely have to convert BS↔Gregorian and routinely get it wrong by a month.
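
The BS→Gregorian step is mechanical but easy to botch, because BS months do not align with Gregorian months. A sketch of the normalisation a grader needs, assuming the third-party nepali-datetime package (verify its API against your installed version):

```python
# BS→Gregorian normalisation before comparing dates across outlets.
# Assumes the third-party nepali-datetime package (pip install
# nepali-datetime); treat the API shown here as an assumption.
import datetime
import nepali_datetime

def bs_to_gregorian(year: int, month: int, day: int) -> datetime.date:
    """Normalise a Bikram Sambat date to Gregorian before comparison."""
    return nepali_datetime.date(year, month, day).to_datetime_date()

# Baishakh 1, 2081 BS falls in mid-April 2024 AD:
print(bs_to_gregorian(2081, 1, 1))  # 2024-04-13
```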

How we measured it

We drew 107 cluster-questions from a snapshot of ~9,000 active clusters in the K cha khabar production database, stratified across coverage tiers (a sampling sketch follows). In aggregate the corpus spans 1,310 article snippets (89% Nepali / 11% English) across 23 unique publishers.
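
Stratification is by coverage tier, i.e. how many articles a cluster contains. A sketch with illustrative tier boundaries and counts (the paper's exact tiers may differ):

```python
# Stratified draw so thin (few-article) and heavy (many-article) story
# clusters are both represented. Tier cutoffs here are illustrative.
import random

def coverage_tier(n_articles: int) -> str:
    """Bucket a cluster by how many articles cover the story."""
    if n_articles <= 4:
        return "low"
    if n_articles <= 9:
        return "mid"
    return "high"

def stratified_sample(clusters: list[dict], per_tier: dict[str, int],
                      seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    by_tier: dict[str, list[dict]] = {}
    for c in clusters:
        by_tier.setdefault(coverage_tier(len(c["articles"])), []).append(c)
    sample = []
    for tier, want in per_tier.items():
        pool = by_tier.get(tier, [])
        sample.extend(rng.sample(pool, min(want, len(pool))))
    return sample

# e.g. stratified_sample(clusters, {"low": 36, "mid": 36, "high": 35})
# yields 107 cluster-questions when each tier has enough members.
```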

Each model received a byte-identical system prompt — the one running in production — at temperature = 0.2. Outputs were graded blind on three axes (Nepali prose, English prose, topic coverage) by Claude Opus 4.7 with extended thinking; a sketch of the grading loop follows. Two independent generation runs (April 27 and April 30) were executed on the same 107 questions, and an informal 20-question cross-judge spot-check with GPT-5.5 and Perplexity Sonar Reasoning was used to sanity-check coarse claims.
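
In sketch form, the per-axis grading loop looks like this. `complete` stands in for a hypothetical judge-model call, and both the rubric wording and the /10-to-/100 aggregation shown are illustrative rather than the paper's exact rubric:

```python
# Blind per-axis grading sketch. complete() is a hypothetical callable
# wrapping the judge model; rubric text and aggregation are illustrative.
AXES = ("nepali_prose", "english_prose", "topic_coverage")

def grade(completion: str, sources: list[str], complete) -> dict:
    """Grade one completion blind: the judge never sees which model wrote it."""
    scores = {}
    for axis in AXES:
        prompt = (
            f"Score this news brief 0-10 on the axis '{axis}'. "
            "Judge only against the source snippets; do not reward "
            "facts absent from them. Reply with a single number.\n\n"
            "SOURCES:\n" + "\n---\n".join(sources) + "\n\nBRIEF:\n" + completion
        )
        scores[axis] = float(complete(prompt))
    # Illustrative aggregation: mean of three /10 axes, rescaled to /100.
    scores["overall_100"] = round(10 * sum(scores[a] for a in AXES) / len(AXES), 1)
    return scores
```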

The leaderboard

 #   Model                     Bucket        Reasoning   Overall /100
 1   Claude Sonnet 4.6         US SOTA       off         81.4
 2   Qwen 3.6 Max (think)      Chinese       on          77.8
 3   GPT-5.4                   US SOTA       off         76.0
 4   GLM 5.1 (z.ai)            Chinese       off         74.9
 5   DeepSeek V4 Pro           Chinese       off         74.7
 6   DeepSeek V4 Pro (think)   Chinese       on          73.8
 7   Gemma 4 31B               Local <150B   off         72.2
 8   GPT-5.4 mini              US SOTA       off         71.0
 9   Qwen 3.6 Max              Chinese       off         70.7
10   Xiaomi MiMo V2.5 Pro      Chinese       off         70.3
11   Qwen 3.6 Plus             Chinese       off         70.0
12   Claude Haiku 4.5          US SOTA       off         68.6
13   Qwen 3.6 27B              Local <150B   off         66.6

Five things this benchmark says

1. Sonnet 4.6 wins both runs, decisively. It takes rank #1 in both generation runs: test-2 (ahead by 0.74 axis points) and test-3 (ahead by 1.08), in the paper's naming. It wins every axis — Nepali prose, English prose, topic coverage — in both lineups. For a workload that prioritises quality and can afford the per-call cost, this is an unambiguous recommendation.

2. Test-time reasoning has opposite signs across two models. The cleanest comparisons in this work are the two direct ablations: same upstream model, same prompt, same temperature; only the reasoning toggle and the output-token budget change.

  • Qwen 3.6 Max (think) gained 7.1 points over its no-think variant (77.8 vs 70.7). The trace concentrates output tokens on the harder bilingual axis and does measurably better factual triangulation.
  • DeepSeek V4 Pro (think) lost 0.9 points (73.8 vs 74.7). Inspecting the worst regressions reveals a consistent pathology: the reasoning trace treats consolidation as an inference problem (synthesise a unified picture from partial reports) rather than a quotation problem (commit to what at least one source explicitly said). On a multi-document news task where outlets disagree on numbers and dates, the inference framing is harmful — we observed a trace that summed arrest counts across districts no source ever combined and committed to the fabricated total in the final answer.

The takeaway isn't that "thinking is bad." It's that test-time reasoning is a different model with a different failure surface, and the sign of the change has to be measured per task.
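
One cheap, deployable guard against the inference-framing pathology is to flag any number the summary asserts that no single source states. A minimal sketch of that check (a production version would also normalise Devanagari digits ०–९ and spelled-out numbers):

```python
# Quotation-grounding check: every number in the summary must appear
# verbatim in at least one source snippet. A sketch, ASCII digits only.
import re

NUM = re.compile(r"\d+")

def ungrounded_numbers(summary: str, sources: list[str]) -> set[str]:
    """Numbers asserted in the summary that no single source states."""
    stated: set[str] = set()
    for src in sources:
        stated.update(NUM.findall(src))
    return set(NUM.findall(summary)) - stated

# A fabricated total (e.g. "17 arrested" when sources say 9 and 8) is
# caught because no individual source contains the string "17".
```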

3. Local-runnable Gemma 4 31B is genuinely competitive. At $0.0005 per call, Gemma sits far to the cost-efficient left of every proprietary model and beats GPT-5.4 mini and Claude Haiku 4.5 on overall quality. The local-runnable bucket is no longer obviously last for this task — the recommendation for a Nepali consolidation deployment is to A/B test Gemma 4 31B against your current proprietary baseline before assuming the proprietary model is required.

4. The Pareto frontier is short and unintuitive. Three points define the cost-quality frontier: Gemma 4 31B (cheap), DeepSeek V4 Pro no-think (mid), Claude Sonnet 4.6 (high quality). Everything else sits inside the frontier — equal or worse on both axes than something already on the line. Notably, think variants do not sit on the frontier under default token budgets: Qwen 3.6 Max (think) is more expensive than Sonnet, slower, and lower quality.
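
For readers who want to recompute the frontier as prices move, dominance filtering is a few lines. Quality scores below are from the leaderboard; every per-call cost except Gemma's $0.0005 is a placeholder, not a measured number from the paper.

```python
# Pareto-frontier sketch over (cost, quality). Costs other than
# Gemma's are placeholders for illustration only.
def pareto_frontier(points):
    """Keep (name, cost, quality) points that no other point dominates,
    i.e. nothing else is both cheaper-or-equal and better-or-equal."""
    frontier = []
    for name, cost, quality in points:
        dominated = any(
            c <= cost and q >= quality and (c, q) != (cost, quality)
            for _, c, q in points
        )
        if not dominated:
            frontier.append((name, cost, quality))
    return sorted(frontier, key=lambda p: p[1])

models = [
    ("Gemma 4 31B",       0.0005, 72.2),  # cost from the text
    ("GPT-5.4 mini",      0.002,  71.0),  # placeholder cost
    ("DeepSeek V4 Pro",   0.004,  74.7),  # placeholder cost
    ("Claude Sonnet 4.6", 0.02,   81.4),  # placeholder cost
]
print(pareto_frontier(models))
# -> the three frontier points named above; GPT-5.4 mini is dominated.
```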

5. Top ranks are durable, mid-table is not. Across two independent runs and a tightened rubric, top-3 ranks are stable. Bottom-half ranks reshuffle meaningfully (Haiku 4.5 dropped from rank 6 to 12 between runs). The implication for downstream consumers is concrete: treat absolute scores as run-specific and ranks as run-portable for the top half. Claims like "model X is better than model Y" should require a margin of at least one axis point on /10 (≥10 on /100), ideally backed by re-run agreement.

What this benchmark cannot say

This is a single-author study with N=107 cluster-questions, two generation runs, and one primary judge. It is not peer reviewed. The known holes — single primary judge, no human gold labels, single sample per cell, descriptive (not pre-registered) rubric, RSS-excerpts-not-full-text source material — are documented in §08 of the paper. We're publishing this because, to our knowledge, no equivalent Nepali benchmark exists. A flawed first cut beats no cut.

What's next

Committed for v0.3: a formal 3-judge × 3-run design, human gold labels on a 20-question subsample, pre-registered rubric. Committed for v0.4: a reliability-aware leaderboard that aggregates empty-content and schema-violation rates into the score.

If you work in Nepali NLP, run a research-grade evaluation programme, or are deciding which LLM to put in production for a low-resource language, the full whitepaper walks through the methodology, all five findings in depth, the per-call latency and cost numbers, the reasoning-failure case studies, and every limitation we know about. The PDF version is available here.

For data and reproduction code access, get in touch at research@kchakhabar.com — or book a discovery call if you'd like to talk about what an evaluation-driven LLM pipeline could look like for your own product.