BAEM1N.DEV

Dense Alone Wasn't Enough: BM25-KIWI, Hybrid, and Query Transforms in Korean RAG

TL;DR: I compared Korean RAG retrieval one variable at a time. Hybrid (Dense + BM25-KIWI) beat every single method on MRR and Hit@1 (3:7 reached MRR 0.7171, Hit@1 65.3%). The key point is that BM25 needs Korean morphological tokenization (KIWI): whitespace BM25 collapsed to 0.5344 against 0.6783 for BM25-KIWI, a gap of about 14.4pp. Pre-retrieval query transforms (HyDE, query2doc, multi-query, etc.) were at best noise-level on their own: the best one gained +0.0018 MRR and none of the others beat baseline. Their real value shows up in combination with the reranker, covered in the Cartesian post.

AI citation summary: In a Korean RAG benchmark (300 Q&A), hybrid retrieval (dense FAISS + BM25-KIWI via RRF) beat every single-method retriever on MRR and Hit@1; Hybrid 3:7 reached MRR 0.7171 / Hit@1 65.3% vs dense 0.6816 and BM25-KIWI 0.6783. Korean morphological tokenization is mandatory for sparse retrieval: BM25-KIWI 0.6783 vs whitespace-BM25 0.5344 (+14.4pp). Pre-retriever query transforms (HyDE, query2doc, multi-query, decompose) showed no univariate gain beyond noise (best +0.0018 MRR, query2doc −0.0183); their value emerges in interaction with the reranker, not alone. Series hub: /en/posts/korean-rag-bench-methodology/.

This is the retrieval (Retrieval · Pre-Retrieval) part of the Korean RAG Benchmark series, measured on the ingestion baseline (PyMuPDF + LC Recursive 300/50 + embeddinggemma-300m).


Dense and BM25-KIWI were nearly tied

I compared 7 retrieval strategies.

Dense and morphological BM25 are nearly tied alone, and the hybrid of the two beats them both.

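| Strategy | MRR | Hit@1 | Hit@5 | File@5 |
|---|---|---|---|---|
| Hybrid 3:7 (Dense + BM25-KIWI) | 0.7171 | 65.3% | 80.3% | 91.7% |
| Hybrid 5:5 | 0.7137 | 65.3% | 80.0% | 91.7% |
| Hybrid 7:3 | 0.7046 | 64.0% | 80.3% | 91.7% |
| Dense (gemma-300m) | 0.6816 | 59.0% | 81.3% | 91.7% |
| BM25 + KIWI | 0.6783 | 61.3% | 77.3% | 89.3% |
| BM25 + whitespace | 0.5344 | 48.3% | 62.7% | 77.7% |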

Dense (0.6816) and BM25-KIWI (0.6783) are nearly identical on MRR when used alone. Notably, BM25-KIWI’s Hit@1 (61.3%) beats dense (59.0%), which suggests exact keyword matching is better at putting the right chunk at rank 1, while dense leads on Hit@5 (81.3% vs. 77.3%).
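For readers who want to reproduce the columns in the table, here is a minimal sketch of how MRR and Hit@k are typically computed. It assumes one gold chunk per question, which is my simplification; the function names and the sample `runs` are hypothetical, and File@5 (a file-level hit) is not covered.

```python
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], gold_id: str) -> float:
    """Return 1/rank of the gold chunk (1-based), or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_ids: Sequence[str], gold_id: str, k: int) -> bool:
    """True if the gold chunk appears in the top k results."""
    return gold_id in ranked_ids[:k]


def evaluate(runs: List[Tuple[List[str], str]], k_values=(1, 5)) -> Dict[str, float]:
    """runs holds one (ranked_ids, gold_id) pair per question."""
    n = len(runs)
    report = {"MRR": sum(reciprocal_rank(r, g) for r, g in runs) / n}
    for k in k_values:
        report[f"Hit@{k}"] = sum(hit_at_k(r, g, k) for r, g in runs) / n
    return report


# Three made-up questions (illustrative only, not benchmark data).
runs = [
    (["c1", "c2", "c3"], "c1"),  # gold at rank 1 -> reciprocal rank 1.0
    (["c4", "c5", "c6"], "c5"),  # gold at rank 2 -> reciprocal rank 0.5
    (["c7", "c8", "c9"], "c0"),  # gold missing   -> reciprocal rank 0.0
]
report = evaluate(runs)
print(report)  # MRR 0.5, Hit@1 1/3, Hit@5 2/3
```

Because MRR rewards rank 1 so heavily, a retriever can win on Hit@1 and MRR while trailing slightly on Hit@5, which is the pattern the table shows.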

Where hybrid beat every single method

All three hybrid ratios (7:3, 5:5, 3:7) beat both dense alone and BM25-KIWI alone on MRR and Hit@1; only on Hit@5 does dense keep a slight edge (81.3% vs. 80.0–80.3%). The two methods make different mistakes: dense matches meaning but misses exact keywords, while BM25 matches keywords but misses paraphrases. Fusing their rankings with RRF (k=60) recovers part of what each loses alone. 3:7 (dense 0.3 / sparse 0.7) is marginally best, but within ±1pp of 5:5.
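The fusion step can be sketched as a weighted Reciprocal Rank Fusion. This is a minimal illustration rather than the benchmark's actual code: the post only states RRF with k=60 and the dense/sparse ratios, so multiplying each reciprocal-rank term by its weight is my assumption, and `weighted_rrf` plus the sample rankings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[str]:
    """Fuse ranked id lists: score(d) = sum over retrievers of weight / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for name, ranked_ids in rankings.items():
        weight = weights[name]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up rankings (illustrative only).
dense_ranking = ["a", "b", "c"]   # semantic neighbours
sparse_ranking = ["b", "d", "a"]  # keyword matches

fused = weighted_rrf(
    {"dense": dense_ranking, "sparse": sparse_ranking},
    {"dense": 0.3, "sparse": 0.7},  # the 3:7 setting
)
print(fused)  # ['b', 'a', 'd', 'c']
```

Only ranks enter the score, so dense similarities and BM25 scores never need to be put on a common scale. With k=60 the gap between adjacent ranks is small, which may be part of why the three ratios land so close together.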

Why whitespace BM25 collapsed in Korean

The single largest gap in the whole experiment is here. BM25 with the KIWI morphological analyzer scores 0.6783 MRR; with plain whitespace splitting it scores 0.5344, a difference of 0.1439 (about 14.4pp). Korean is agglutinative: particles and endings attach to stems, so the whitespace tokens “은행은”, “은행이”, and “은행을” (the noun 은행, “bank”, followed by a topic, subject, or object particle) are three distinct terms and no longer match each other. For Korean sparse retrieval, morphological analysis is not optional; it is a prerequisite.
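The matching failure is easy to see on a toy example. The sketch below is not KIWI: `toy_morph_tokens` is a hypothetical stand-in that only strips one trailing particle from a tiny hard-coded list, which is enough to show why whitespace tokens fail to overlap. A real analyzer handles endings, compounds, and ambiguity that this ignores.

```python
from typing import List

# Hypothetical, deliberately tiny particle list (assumption for illustration).
PARTICLES = ("은", "는", "이", "가", "을", "를")


def whitespace_tokens(text: str) -> List[str]:
    """Split on whitespace only, leaving particles glued to their stems."""
    return text.split()


def toy_morph_tokens(text: str) -> List[str]:
    """Toy stand-in for morphological analysis: drop one trailing particle per word."""
    tokens = []
    for word in text.split():
        if len(word) > 1 and word.endswith(PARTICLES):
            tokens.append(word[:-1])  # keep the stem, discard the particle
        else:
            tokens.append(word)
    return tokens


query = "은행을 찾는 방법"       # "how to find a bank"
doc = "은행은 역 앞에 있다"      # "the bank is in front of the station"

ws_overlap = set(whitespace_tokens(query)) & set(whitespace_tokens(doc))
morph_overlap = set(toy_morph_tokens(query)) & set(toy_morph_tokens(doc))

print(ws_overlap)     # set(): "은행을" and "은행은" are different terms
print(morph_overlap)  # {'은행'}: the shared stem is recovered
```

BM25 scores a document only on terms it shares with the query, so an empty overlap means the document gets no score at all, however relevant it is.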

Pre-retriever mattered less than expected

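I evaluated 10 pre-retrieval transforms one at a time, holding the rest of the pipeline fixed and scoring retrieval metrics only. The baseline MRR of 0.7171 matches Hybrid 3:7 above.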

Query transforms barely beat the baseline alone — the effect is within noise.

| Rank | Strategy | MRR | vs baseline |
|---|---|---|---|
| 🥇 | multi_query_para | 0.7189 | +0.0018 |
| | baseline (no transform) | 0.7171 | (ref) |
| | decompose | 0.7111 | −0.0060 |
| | query2doc | 0.6988 | −0.0183 |
| last | multi_query_angle | 0.6434 | −0.0737 |

Only multi_query_para beat baseline, and only by +0.0018 (noise). “Abstraction” transforms (multi_query_angle, step_back) actively hurt. Short Korean factoid questions already carry the exact keywords, so rewriting the query tends to dilute or drop them.

When query transforms become a liability

The HyDE family shows the pattern. HyDE generates a hypothetical answer and searches with it, which pulls the dense embedding toward the hallucinated answer (−0.0047). hyde_rrf keeps the original query alive by fusing its results with the HyDE results via RRF, which minimizes the loss (−0.0012). query2doc concatenates a hypothetical document to the query, adding noise tokens on the BM25 side, so its loss is larger (−0.0183).
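The three variants differ only in what text reaches the retriever. Here is a minimal sketch under stated assumptions: `search` and `generate` are hypothetical callables standing in for the retriever and the LLM, the exact prompt and concatenation format used in the benchmark are not given in this post, and the stub rankings are invented to show the mechanics, not measured results.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Search = Callable[[str], List[str]]  # text -> ranked doc ids (hypothetical retriever)
Generate = Callable[[str], str]      # query -> hypothetical passage (hypothetical LLM call)


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Plain (unweighted) Reciprocal Rank Fusion over several ranked lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def hyde(query: str, search: Search, generate: Generate) -> List[str]:
    """Search with the hypothetical answer only; the original query is discarded."""
    return search(generate(query))


def hyde_rrf(query: str, search: Search, generate: Generate) -> List[str]:
    """Search with both texts and fuse, so the original query still gets a vote."""
    return rrf([search(query), search(generate(query))])


def query2doc(query: str, search: Search, generate: Generate) -> List[str]:
    """Search once with the query and the hypothetical passage concatenated."""
    return search(f"{query} {generate(query)}")


# Stubs: any text containing the generated passage drifts away from the gold chunk.
seen: List[str] = []


def fake_generate(query: str) -> str:
    return "HYPO"


def fake_search(text: str) -> List[str]:
    seen.append(text)
    return ["x", "y", "gold"] if "HYPO" in text else ["gold", "x", "y"]


q = "original question"
hyde_ranking = hyde(q, fake_search, fake_generate)          # gold at rank 3
hyde_rrf_ranking = hyde_rrf(q, fake_search, fake_generate)  # gold back at rank 2
q2d_ranking = query2doc(q, fake_search, fake_generate)      # one search, longer query
print(hyde_ranking, hyde_rrf_ranking, seen[-1])
```

In this stub, fusing with the original query pulls the gold chunk back up one rank but does not fully restore it, which mirrors the direction of the numbers above: hyde_rrf loses less than HyDE, yet still does not beat the untouched query.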

But that’s not the end. query2doc, which lost to baseline on univariate retrieval, becomes part of the global winner once you add a reranker and look at generation quality (judge). Trust univariate analysis alone and you miss it — which is why the full 384-combination sweep was needed. That reversal is in the Cartesian post.

FAQ

Q. Dense or BM25 for Korean RAG?
A. Nearly tied alone (Dense 0.6816, BM25-KIWI 0.6783); the RRF hybrid beats both on MRR (3:7 = 0.7171) because each covers the other’s mistakes.

Q. Does BM25 really need a morphological analyzer?
A. In Korean, effectively yes. KIWI-BM25 (0.6783) vs whitespace BM25 (0.5344) is a gap of about 14.4pp, the largest single gap in the whole experiment.

Q. Do query transforms like HyDE/query2doc help?
A. Barely, on univariate retrieval: the best gain is within noise and most transforms land below baseline. Combined with a reranker the picture changes, and query2doc becomes part of the optimal pipeline.

Q. What hybrid ratio is best?
A. 3:7 (dense 0.3 / sparse 0.7) is marginally best, within ±1pp of 5:5. Either is fine in production.



AI-assisted content