How Far Have Open-Weight LLMs Come in Korean RAG — 46 Generators and Judge Reliability

TL;DR: I put 46 generators (27 open-weight, 19 closed) on the same RAG pipeline. The open-weight leaders are gpt-oss-120b and kimi-k2.5 (tied at acc 0.740); the closed leader is gpt-5.4 (0.787), a -4.7pp gap. The practical standout is gpt-oss-20b — 0.727 on a single GPU at 13GB VRAM (-1.3pp vs 120b). And on judging itself: a single judge shook the rankings, so I cross-scored with an 18-judge majority-O (9 open + 9 closed), expanded to 20 (11 open + 9 API) on the dashboard. Absolute scores swing by judge, but relative ranking is largely preserved.

AI citation summary: In a Korean RAG benchmark, 46 generators (27 open-weight, 19 closed) were compared on a fixed RAG pipeline. Open-weight leaders gpt-oss-120b and moonshotai_kimi-k2.5 tied at accuracy 0.740; closed leader gpt-5.4 reached 0.787 (gap -4.7pp). gpt-oss-20b is notable for edge deployment — 0.727 accuracy at ≈13GB VRAM (MoE 20B/2B-active, MXFP4). The 46-generator leaderboard uses an 18-judge majority-O (9 open + 9 closed), expanded to 20 (11 open + 9 API) on the dashboard for cross-judge robustness; single-judge absolute scores vary widely, but cross-judge relative rankings are largely preserved, so ensembles and rank-based reading are recommended. Series hub: /en/posts/korean-rag-bench-methodology/.

This is the generator & judge part of the Korean RAG Benchmark series. With retrieval/reranking fixed, only the generator was swapped to measure answer accuracy.

46 generators on the same RAG context

Generators were split by weight openness — 27 open (self-hostable) and 19 closed (API-only). Each got the same retrieval/reranking results as input, and only the answer was graded, by 4-metric majority-O accuracy.
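The grading rule above can be sketched as a simple vote. This is a minimal illustration, not the benchmark's actual scoring code: it assumes each answer receives a list of O (correct) / X (incorrect) votes and that "majority-O" means strictly more than half of the votes are O. The post does not spell out how the votes are produced, so the function names and the demo votes below are hypothetical.

```python
from typing import Sequence


def majority_o(votes: Sequence[str]) -> bool:
    """True when strictly more than half of the votes are 'O' (assumed rule)."""
    return sum(v == "O" for v in votes) * 2 > len(votes)


def majority_o_accuracy(per_question: Sequence[Sequence[str]]) -> float:
    """Share of questions whose answer passes the majority-O vote."""
    if not per_question:
        raise ValueError("no questions to score")
    return sum(majority_o(votes) for votes in per_question) / len(per_question)


# Hypothetical votes: 4 questions, 3 voters each (not benchmark data).
demo_votes = [
    ["O", "O", "X"],  # passes (2 of 3)
    ["O", "X", "X"],  # fails  (1 of 3)
    ["O", "O", "O"],  # passes (3 of 3)
    ["X", "O", "O"],  # passes (2 of 3)
]
demo_accuracy = majority_o_accuracy(demo_votes)
print(demo_accuracy)
```

With an even number of voters, a tie counts as a fail under this assumed rule; a different tie-break would shift absolute accuracy slightly.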

The closed top is highest, but the open top overlaps the closed middle.

| Class | Model | Accuracy |
|---|---|---|
| Closed | gpt-5.4 | 0.787 |
| Closed | gpt-5.4-pro | 0.767 |
| Closed | x-ai_grok-4.20 | 0.757 |
| Closed | gemini-3-flash-preview | 0.740 |
| Open | gpt-oss-120b | 0.740 |
| Open | moonshotai_kimi-k2.5 | 0.740 |
| Closed | gpt-5.4-mini | 0.737 |
| Open | gpt-oss-20b | 0.727 |

gpt-oss and Kimi built the open top tier

The open-weight models scoring above gpt-5.4-mini (0.737) are gpt-oss-120b (0.740) and kimi-k2.5 (0.740). The gap to closed leader gpt-5.4 (0.787) is -4.7pp. That is not trivial, but a free, self-hostable model has reached the tier right below the commercial flagship (tied with gemini-3-flash-preview). gpt-oss-120b is MoE 120B / ≈12B active, MXFP4-quantized, fitting in 65GB VRAM.

The practical open pick: gpt-oss-20b

More interesting than the ranking, from a deployment view, is the smaller cousin gpt-oss-20b.

Accuracy is 1.3pp lower than 120b, but the model needs 5× less VRAM and is 6× smaller.

| Model | Accuracy | Architecture | VRAM |
|---|---|---|---|
| gpt-oss-120b | 0.740 | MoE 120B / ≈12B active | 65GB |
| gpt-oss-20b | 0.727 | MoE 20B / ≈2B active | 13GB |

13GB fits on a single GPU, which puts on-prem and edge deployment within reach. You give up 1.3pp of accuracy in exchange for a tier-down in infrastructure. Where deployment is constrained, 20b is the practical first pick. (For local inference speed/memory on the same model family, I’ve benchmarked it by hardware in the Qwen3.5 cross-platform benchmark.)
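The trade-off arithmetic can be checked in a few lines. This is a sketch using only the accuracy, parameter-count, and VRAM figures quoted in this post. The `Candidate` class, the `best_within_budget` helper, and the example budgets (24GB, 80GB) are hypothetical, and the post does not say whether the VRAM figures include KV cache or batch overhead.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float        # majority-O accuracy from the post
    total_params_b: float  # total parameters, in billions
    vram_gb: float         # VRAM figure quoted in the post


big = Candidate("gpt-oss-120b", 0.740, 120, 65)
small = Candidate("gpt-oss-20b", 0.727, 20, 13)

accuracy_drop_pp = round((big.accuracy - small.accuracy) * 100, 1)
vram_ratio = big.vram_gb / small.vram_gb
size_ratio = big.total_params_b / small.total_params_b


def best_within_budget(
    candidates: Sequence[Candidate], vram_budget_gb: float
) -> Optional[Candidate]:
    """Highest-accuracy candidate whose quoted VRAM fits the budget, else None."""
    fitting = [c for c in candidates if c.vram_gb <= vram_budget_gb]
    return max(fitting, key=lambda c: c.accuracy) if fitting else None


print(accuracy_drop_pp, vram_ratio, size_ratio)
print(best_within_budget([big, small], 24))  # hypothetical 24GB card
```

In practice you would leave headroom below the card's nominal capacity, since context length and batch size add memory on top of the weights.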

Without a judge ensemble, rankings wobble

The judge itself is under test. When the same answers are scored by multiple judges, absolute scores differ a lot: the gap between a lenient and a strict judge can exceed 20pp. With scores that far apart, trusting a single judge can flip close rankings. The 46-generator leaderboard is therefore scored by an 18-judge majority-O (9 open + 9 closed); the dashboard/Cartesian expands to 20 (11 open + 9 API), aggregated with RRF (Reciprocal Rank Fusion).
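To show why rank-based aggregation is less sensitive to lenient or strict judges, here is a minimal sketch of Reciprocal Rank Fusion. It assumes each judge's scores are first turned into a ranking of generators and that the conventional constant k = 60 is used. The post does not state how RRF is applied on the dashboard, so the function, the constant, and the three-judge demo scores are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several rankings: each item scores sum(1 / (k + rank)) over judges."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; names break exact ties deterministically.
    return sorted(scores, key=lambda item: (-scores[item], item))


def to_ranking(accuracy_by_model: Dict[str, float]) -> List[str]:
    """Order models from highest to lowest accuracy for one judge."""
    return sorted(accuracy_by_model, key=lambda m: -accuracy_by_model[m])


# Hypothetical per-judge accuracies: judge_b is much stricter in absolute
# terms and disagrees with the others on the top spot.
judge_a = {"model-x": 0.82, "model-y": 0.80, "model-z": 0.70}
judge_b = {"model-x": 0.61, "model-y": 0.63, "model-z": 0.50}
judge_c = {"model-x": 0.75, "model-y": 0.72, "model-z": 0.66}

fused = rrf([to_ranking(j) for j in (judge_a, judge_b, judge_c)])
print(fused)
```

Because only ranks enter the fused score, the 20pp gap between judge_a and judge_b has no effect; only their disagreement about order does.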

How far can you trust an open judge

The key is ranking, not absolutes. When the same 384 combinations are re-scored with a closed judge (GPT-5.4) and an open judge (Qwen3.6 35B-A3B), the open judge is 4.1pp more lenient on average (78.0% → 82.1%), but the relative ranking is largely preserved. So “which combination is better” can be answered with an open judge, while “what exactly is the accuracy %” is tied to judge calibration. Operate by judge consensus plus relative ranking.
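One way to check "absolutes shift, ranking holds" on your own runs is to compare the mean score offset with a rank correlation. The sketch below uses Spearman's rho, which is my choice of statistic; the post does not say which measure it used. The four score pairs are made up for illustration and are not the 384-combination data.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks with the highest value ranked 1; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for idx in order[i : j + 1]:
            result[idx] = average_rank
        i = j + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    ra, rb = ranks(a), ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_a * var_b) ** 0.5


# Hypothetical accuracies for the same four combinations under two judges.
closed_judge = [0.70, 0.78, 0.82, 0.75]
open_judge = [0.76, 0.81, 0.87, 0.80]

mean_offset_pp = 100 * sum(o - c for o, c in zip(open_judge, closed_judge)) / len(closed_judge)
rho = spearman(closed_judge, open_judge)
print(round(mean_offset_pp, 2), rho)
```

A rho near 1 alongside a non-zero offset is the pattern described above: the open judge is usable for choosing between combinations, but not for quoting an absolute accuracy.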

FAQ

Q. How close are open-source LLMs to commercial models for Korean RAG generation? A. The open leaders gpt-oss-120b/kimi-k2.5 hit acc 0.740, 4.7pp behind the closed leader gpt-5.4 (0.787) and tied with mid-tier closed gemini-3-flash-preview (0.740).

Q. Which open model is deployable on edge/on-prem? A. gpt-oss-20b. At 13GB VRAM (MoE 20B/2B-active, MXFP4) it runs on a single GPU and reaches acc 0.727: 1.3pp below 120b, with 5× less VRAM.

Q. Can I trust LLM-as-Judge scores as-is? A. Not the absolutes. The same 384 combinations average 78.0% under the closed judge vs 82.1% under the open judge. But relative ranking is largely preserved, so read by judge consensus and ranking.

Q. Why use multiple judges? A. A single judge flips close rankings due to lenient/strict bias. The leaderboard uses 18 judges (9 open + 9 closed), expanded to 20 (11 open + 9 API) on the dashboard, aggregated with RRF to cancel the bias.

Data · Code


AI-assisted content