TL;DR: I compared Korean RAG retrieval one variable at a time. Hybrid (Dense + BM25-KIWI) beat every single method on MRR (Hybrid 3:7: MRR 0.7171, Hit@1 65.3%). The key point is that BM25 needs Korean morphological tokenization (KIWI): whitespace BM25 collapsed to 0.5344 MRR against 0.6783 for BM25-KIWI, a gap of 0.1439 (about 14.4 points). Pre-retrieval query transforms (HyDE, query2doc, multi-query, etc.) were mostly noise-level on their own (within about ±1pp of baseline, with the abstraction-style transforms clearly worse). Their real value shows up in combination with the reranker, covered in the Cartesian post.

AI citation summary: In a Korean RAG benchmark (300 Q&A), hybrid retrieval (dense FAISS + BM25-KIWI via RRF) beat every single-method retriever; Hybrid 3:7 reached MRR 0.7171 / Hit@1 65.3% vs dense 0.6816 and BM25-KIWI 0.6783. Korean morphological tokenization is mandatory for sparse retrieval: BM25-KIWI 0.6783 vs whitespace-BM25 0.5344 (+14.4pp). Pre-retriever query transforms (HyDE, query2doc, multi-query, decompose) showed only noise-level univariate gains (±1pp around baseline); their value emerges in interaction with the reranker, not alone. Series hub: /en/posts/korean-rag-bench-methodology/.

This is the retrieval (Retrieval · Pre-Retrieval) part of the Korean RAG Benchmark series, measured on the ingestion baseline (PyMuPDF + LC Recursive 300/50 + embeddinggemma-300m).

Table of contents

Dense and BM25-KIWI were nearly tied
Where hybrid beat every single method
Why whitespace BM25 collapsed in Korean
Pre-retriever mattered less than expected
When query transforms become a liability
FAQ
Data · Code

Dense and BM25-KIWI were nearly tied

I compared 7 retrieval strategies.

Dense and morphological BM25 are nearly tied alone, and the hybrid of the two beats them both.

Strategy	MRR	Hit@1	Hit@5	File@5
Hybrid 3:7 (Dense + BM25-KIWI)	0.7171	65.3%	80.3%	91.7%
Hybrid 5:5	0.7137	65.3%	80.0%	91.7%
Hybrid 7:3	0.7046	64.0%	80.3%	91.7%
Dense (gemma-300m)	0.6816	59.0%	81.3%	91.7%
BM25 + KIWI	0.6783	61.3%	77.3%	89.3%
BM25 + whitespace	0.5344	48.3%	62.7%	77.7%

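Dense (0.6816) and BM25-KIWI (0.6783) are near-identical on MRR when used alone. They differ in where they win: BM25-KIWI has the higher Hit@1 (61.3% vs 59.0%), which suggests exact keyword matching is stronger for picking the top-1, while dense has the higher Hit@5 (81.3% vs 77.3%).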
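For readers who want to reproduce the columns in the table, here is a minimal sketch of how MRR and Hit@k are typically computed. It assumes one gold chunk per question and no rank cutoff for MRR; the post does not spell out either detail, and File@5 (presumably a match at file level rather than chunk level) is not covered. The function names and the toy data are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[Hashable], gold_id: Hashable) -> float:
    """Return 1/rank of the gold chunk, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_ids: Sequence[Hashable], gold_id: Hashable, k: int) -> bool:
    """True if the gold chunk appears in the top-k results."""
    return gold_id in ranked_ids[:k]


def evaluate(
    runs: List[Tuple[Sequence[Hashable], Hashable]],
    k_values: Tuple[int, ...] = (1, 5),
) -> Dict[str, float]:
    """Average the per-question metrics. runs holds (ranked_ids, gold_id) pairs."""
    n = len(runs)
    report = {"MRR": sum(reciprocal_rank(r, g) for r, g in runs) / n}
    for k in k_values:
        report[f"Hit@{k}"] = sum(hit_at_k(r, g, k) for r, g in runs) / n
    return report


# Toy data (assumption): three questions, not the benchmark's 300.
toy_runs = [
    (["c1", "c2", "c3"], "c1"),  # gold at rank 1
    (["c4", "c5", "c6"], "c5"),  # gold at rank 2
    (["c7", "c8", "c9"], "c0"),  # gold not retrieved
]
print(evaluate(toy_runs))
```

Because MRR rewards the exact rank while Hit@5 only asks whether the gold chunk is somewhere in the top five, a retriever can lead on one and trail on the other, which is what the dense and BM25-KIWI rows show.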

Where hybrid beat every single method

All three hybrid ratios (7:3 · 5:5 · 3:7) beat both dense-alone and BM25-KIWI-alone on MRR and Hit@1, because the two methods catch different mistakes: dense matches meaning but misses keywords, BM25 matches keywords but misses paraphrases. Fusing the rankings with RRF (k=60) cuts the loss. The one exception is Hit@5, where dense alone (81.3%) stays slightly ahead of every hybrid (80.0% to 80.3%). 3:7 (dense 0.3 / sparse 0.7) is marginally best on MRR, but within ±1pp of 5:5 (0.7171 vs 0.7137).
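The fusion step can be sketched as weighted Reciprocal Rank Fusion. The post gives k=60 and the 0.3 / 0.7 split, but not how the weights enter the formula; the sketch below assumes each retriever's term w / (k + rank) is simply scaled by its weight, which is one common choice and not necessarily what the benchmark code does. The function name and the toy rankings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[str]:
    """Fuse ranked lists: score(d) = sum over retrievers of w / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights[name]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy rankings (assumption): "c" is top for sparse but only third for dense.
toy_rankings = {
    "dense": ["a", "b", "c"],
    "sparse": ["c", "a", "d"],
}
fused = weighted_rrf(toy_rankings, {"dense": 0.3, "sparse": 0.7})
print(fused)
```

RRF only looks at ranks, so the dense cosine scores and the BM25 scores never need to be put on a common scale. A document that both retrievers rank reasonably well ("a" here) stays near the top even when neither ranks it first.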

Why whitespace BM25 collapsed in Korean

The single largest gap in the whole experiment is here. BM25 with the KIWI morphological analyzer scores 0.6783 MRR; with plain whitespace splitting, 0.5344. That is a gap of 0.1439 MRR (about 14.4 points). Korean is agglutinative: particles and endings attach to stems, so under whitespace splitting “은행은”, “은행이”, and “은행을” (the same noun 은행, “bank”, with different particles) become three distinct tokens and matching breaks. For Korean sparse retrieval, morphological analysis is not optional; it is a prerequisite.
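The mismatch is easy to see on a toy example. The sketch below is not KIWI: it swaps the real morphological analyzer for a deliberately naive particle stripper (a hypothetical `toy_stem_tokens` with a six-particle list) so that it runs with the standard library alone. The sentences are made up for illustration.

```python
from typing import Callable, List, Set

# Toy particle list (assumption). A real analyzer such as KIWI covers far more
# cases and will not over-strip words that merely end in one of these syllables.
TOY_PARTICLES = ("은", "는", "이", "가", "을", "를")


def whitespace_tokens(text: str) -> List[str]:
    """Split on whitespace only, leaving particles attached to their stems."""
    return text.split()


def toy_stem_tokens(text: str) -> List[str]:
    """Strip at most one trailing particle per token (naive stand-in for KIWI)."""
    tokens = []
    for tok in text.split():
        for particle in TOY_PARTICLES:
            if len(tok) > len(particle) and tok.endswith(particle):
                tok = tok[: -len(particle)]
                break
        tokens.append(tok)
    return tokens


def overlap(tokenize: Callable[[str], List[str]], query: str, doc: str) -> Set[str]:
    """Terms shared by query and document: the only terms BM25 can score."""
    return set(tokenize(query)) & set(tokenize(doc))


docs = ["은행은 금리를 올렸다", "은행이 대출을 줄였다"]
query = "은행을 찾는다"

for doc in docs:
    print(overlap(whitespace_tokens, query, doc), overlap(toy_stem_tokens, query, doc))
```

With whitespace tokens the query shares no term with either document, so BM25 has nothing to score. Once the particle is separated from the stem, 은행 matches in both.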

Pre-retriever mattered less than expected

I evaluated 10 pre-retrieval transforms univariately (retrieval metrics).

On their own, query transforms barely move the baseline: the only gain is within noise.

Rank	Strategy	MRR	vs baseline
🥇	multi_query_para	0.7189	+0.0018
—	baseline (no transform)	0.7171	(ref)
…	decompose	0.7111	−0.0060
…	query2doc	0.6988	−0.0183
last	multi_query_angle	0.6434	−0.0737

Only multi_query_para beat baseline, and only by +0.0018 (noise). “Abstraction” transforms (multi_query_angle, step_back) actively hurt. Short Korean factoid questions already carry exact keywords, so touching the query tends to lose.

When query transforms become a liability

The HyDE family shows the pattern. HyDE generates a hypothetical answer and searches with it, which pulls the dense embedding toward the hallucinated answer (−0.0047 MRR vs baseline); hyde_rrf keeps the original query alive via RRF, minimizing the loss (−0.0012). query2doc concatenates a hypothetical document to the query, which adds noise tokens on the BM25 side, so its loss is larger (−0.0183).
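The difference between plain HyDE and hyde_rrf can be sketched as follows. This is an illustration under assumptions, not the benchmark's implementation: it assumes hyde_rrf fuses exactly two rankings (original query and hypothetical answer) with unweighted RRF at k=60, and the generator and retriever are hypothetical stubs standing in for an LLM call and a real index.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Retriever = Callable[[str], List[str]]
Generator = Callable[[str], str]


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Unweighted Reciprocal Rank Fusion over several ranked lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def hyde(query: str, generate: Generator, retrieve: Retriever) -> List[str]:
    """Plain HyDE: search with the hypothetical answer only."""
    return retrieve(generate(query))


def hyde_rrf(query: str, generate: Generator, retrieve: Retriever, k: int = 60) -> List[str]:
    """Search with both the original query and the hypothetical answer, then fuse."""
    return rrf([retrieve(query), retrieve(generate(query))], k=k)


# Hypothetical stubs: the hallucinated answer drags the gold chunk down to rank 3.
FAKE_INDEX = {
    "original question": ["gold", "x", "y"],
    "hypothetical answer": ["z", "y", "gold"],
}


def fake_generate(query: str) -> str:
    return "hypothetical answer"


def fake_retrieve(text: str) -> List[str]:
    return FAKE_INDEX[text]


print(hyde("original question", fake_generate, fake_retrieve))
print(hyde_rrf("original question", fake_generate, fake_retrieve))
```

In this toy case plain HyDE returns the gold chunk third, while the fused ranking puts it back on top because the original query still contributes its own vote.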

But that’s not the end. query2doc, which lost to baseline on univariate retrieval, becomes part of the global winner once you add a reranker and score generation quality with the LLM judge. Trust univariate analysis alone and you miss it, which is why the full 384-combination sweep was needed. That reversal is in the Cartesian post.

FAQ

Q. Dense or BM25 for Korean RAG? A. Nearly tied alone (Dense 0.6816, BM25-KIWI 0.6783); the RRF hybrid beats both (3:7 = 0.7171) because they complement each other’s mistakes.

Q. Does BM25 really need a morphological analyzer? A. In Korean, effectively yes. KIWI-BM25 (0.6783) vs whitespace BM25 (0.5344) is a gap of 0.1439 MRR (about 14.4 points), the largest single gap in the whole experiment.

Q. Do query transforms like HyDE/query2doc help? A. Barely, on univariate retrieval: the best transform gained only +0.0018 MRR and most landed below baseline. Combined with a reranker the picture changes, and query2doc becomes part of the optimal pipeline.

Q. What hybrid ratio is best? A. 3:7 (dense 0.3 / sparse 0.7) is marginally best, within ±1pp of 5:5. Either is fine in production.

Data · Code

Interactive dashboard: https://rag.baeum.ai.kr
Code · per-stage reports: https://github.com/BAEM1N/RAG-Evaluation
Result dataset: https://huggingface.co/datasets/BAEM1N/Korean-RAG-LLM-Judge-Benchmark
Series hub: Korean RAG Benchmark — Methodology
Previous: Ingestion — Loader · Chunker · Embedding
Next: Why a 0.6B Korean reranker beats a 4B SOTA

Dense Alone Wasn't Enough: BM25-KIWI, Hybrid, and Query Transforms in Korean RAG
