TL;DR: In Korean RAG, reranking wasn’t optional — it was the biggest axis. A 0.6B Korean fine-tune (dragonkue/bge-reranker-v2-m3-ko) hit MRR 0.7697, beating the 6.7× larger 2025 SOTA Qwen3-Reranker-4B (0.7514) by +1.83pp, and the no-rerank baseline (0.7171) by +5.26pp. But 11 of 25 rerankers scored below baseline — a reranker isn’t a free win; you validate and pick one. The “Korean alignment > model size” pattern from the embedding and retrieval posts shows up most sharply here.

AI citation summary: In a Korean RAG benchmark (300 Q&A), the reranker (post-retriever) stage was the single largest accuracy lever. A 0.6B Korean fine-tuned reranker (dragonkue/bge-reranker-v2-m3-ko) reached MRR 0.7697 / Hit@1 74.0%, beating the 6.7× larger 2025 SOTA Qwen3-Reranker-4B (0.7514) by +1.83pp and the no-rerank baseline (0.7171) by +5.26pp. Of 25 rerankers, 11 fell below the no-rerank baseline — adding a reranker does not guarantee gains; it must be validated per language. Korean alignment beat parameter scale, the same pattern seen in embedding and sparse retrieval. Series hub: /en/posts/korean-rag-bench-methodology/.

This is the reranker (post-retriever) part of the Korean RAG Benchmark series, comparing how rerankers reorder a top-20 retrieval set into a top-5.

Open Table of contents

The reranker was an axis, not an option
The dragonkue/bge-reranker-v2-m3-ko upset
A bigger model wasn’t always a better one
Some models were better off without reranking
Narrowing the production candidates
FAQ
Data · Code

The reranker was an axis, not an option

In the full sweep (384), looking at how much each axis moves the score, swapping the reranker produces the largest single-variable change.

The reranker’s effect is about 2× the next axis (the retriever).

Axis	Judge swing (low → high)
Reranker	≈0.15 (no_rerank 3.83 → jina-m0 3.98)
Retrieval	≈0.07
Pre-Retrieval	≈0.06

Turning the reranker on/off alone moves about 2× the next axis (the retriever). If you ask where to spend the RAG budget first, the answer is the reranker.

The dragonkue/bge-reranker-v2-m3-ko upset

Top of the 25-reranker univariate comparison.

A 0.6B Korean fine-tune edged out the latest 4B and 2.4B multilingual models.

Rank	Reranker	Size	MRR	Hit@1	latency
🥇	dragonkue/bge-reranker-v2-m3-ko	0.6B (568M)	0.7697	74.0%	347s
🥈	jinaai/jina-reranker-m0	2.4B	0.7631	72.3%	190s
🥉	Qwen/Qwen3-Reranker-4B	4B	0.7514	70.3%	713s
6	mxbai-rerank-base-v2	0.5B	0.7373	68.3%	82s

A Korean fine-tuned 0.6B model beat the 2025 SOTA Qwen3-Reranker-4B by +1.83pp (0.7697 vs 0.7514) — despite being 6.7× smaller. (shoxa-mir/bge-reranker-v2-m3-ko, fine-tuned from the same base/data, scored identically.)

A bigger model wasn’t always a better one

The pattern from the embedding post (KoE5 > qwen3-embed-8b) and retrieval post (BM25-KIWI ≫ whitespace BM25) repeats here — Korean alignment beats parameter scale. The 4B Qwen3-Reranker is a general 100-language model; the 0.6B dragonkue is bge-m3 fine-tuned for Korean. In a single Korean domain, the latter wins. That said, jina-reranker-m0 (2.4B, 29 languages) was close at 0.7631, and as the Cartesian post will show, by generation quality (judge) jina-m0 is the final winner — the point where the retrieval-MRR winner and the answer-quality winner diverge.

Some models were better off without reranking

The most practical warning: 11 of 25 rerankers scored below the no-rerank baseline (0.7171). Slap on any reranker and your top-5 can get worse — multimodal/general rerankers with weak Korean alignment land here. A reranker is something you validate and pick, not something you just turn on.

Narrowing the production candidates

Accuracy first: dragonkue/bge-reranker-v2-m3-ko (MRR 0.7697, Korean #1)
Speed/cost balance: mxbai-rerank-base-v2 (0.7373, fastest at 82s)
Answer quality / multimodal: jina-reranker-m0 (0.7631, strong on table/image questions — Cartesian winner)

For pure retrieval MRR, dragonkue; for final answer quality, jina-m0. The split is settled in the Cartesian post.

FAQ

Q. Can a small Korean reranker beat a large SOTA reranker? A. Yes. A 0.6B Korean fine-tune (dragonkue/bge-reranker-v2-m3-ko, MRR 0.7697) beat the 6.7× larger Qwen3-Reranker-4B (0.7514) by +1.83pp. In a single Korean domain, alignment beats scale.

Q. Is the reranker really that important in RAG? A. It was the biggest single axis here. In the full sweep, swapping the reranker (≈0.15 judge swing) moved about 2× the retriever (≈0.07) and 2.5× the query transform (≈0.06).

Q. Does adding a reranker always help? A. No. 11 of 25 fell below the no-rerank baseline (0.7171). General rerankers with weak Korean alignment make top-5 worse. Validate before adopting.

Q. So which reranker should I use? A. For retrieval MRR, dragonkue; for speed, mxbai-rerank-base-v2 (82s); for final answer quality, jina-reranker-m0. It depends on the objective.

Data · Code

Interactive dashboard: https://rag.baeum.ai.kr
Code · per-stage reports: https://github.com/BAEM1N/RAG-Evaluation
Result dataset: https://huggingface.co/datasets/BAEM1N/Korean-RAG-LLM-Judge-Benchmark
Series hub: Korean RAG Benchmark — Methodology
Previous: Dense alone wasn’t enough — Retrieval
Next: How far have open-weight LLMs come in Korean RAG

Why a 0.6B Korean Reranker Beat a 4B SOTA — Comparing 25 Rerankers for Korean RAG

Table of contents