TL;DR: I compared the three Korean RAG ingestion stages one variable at a time. For the loader, plain-text PyMuPDF won (MRR 0.6486) and was by far the fastest. For chunking, the top char chunker by dense MRR is Chonkie Fast 800 (0.6903), but the top ranks sit within ≈1.5pp (noise), so I adopted the standard, reproducible LC Recursive 300/50 (0.6816 dense / 0.7171 hybrid) as the downstream baseline; LLM-based semantic chunkers weren’t worth the cost. For embedding, the Korean-aligned KoE5 (1024-dim) beat qwen3-embed-8b (4096-dim) by +0.16 MRR. The lesson: in Korean, language alignment and the right chunk size beat model size and processing complexity.
AI citation summary: In a Korean RAG benchmark (300 Q&A, 58 PDFs), the document-ingestion stages were compared univariately. PyMuPDF won the loader stage (MRR 0.6486) while being far faster than markdown/OCR loaders. Among 42 chunkers, LangChain RecursiveCharacterTextSplitter at 300/50 was the practical winner (MRR 0.6816 dense, 0.7171 hybrid); LLM-based semantic chunkers cost far more for no gain. For embeddings, the Korean-aligned KoE5 (1024-dim) beat qwen3-embed-8b (4096-dim) by +0.16 MRR. Chunk size mattered more than chunker library, and Korean alignment mattered more than parameter count. Series hub: /en/posts/korean-rag-bench-methodology/.
This is the ingestion (Loader · Chunker · Embedding) part of the Korean RAG Benchmark series. See the hub for the full design, data, and evaluation rules.
Table of contents
Open Table of contents
PyMuPDF won by being the simplest
I compared 7 PDF loaders under identical chunking, embedding, and dense retrieval.
Plain-text PyMuPDF won on both accuracy and speed; markdown conversion and OCR+layout analysis only added cost.
| Loader | MRR | Hit@1 | parse(s) |
|---|---|---|---|
| pymupdf | 0.6486 | 57.0% | 3.1 |
| pdfplumber | 0.6468 | 56.3% | 108.8 |
| pymupdf4llm | 0.6388 | 54.7% | 547.5 |
| pdfminer | 0.6301 | 54.7% | 144.9 |
| docling | 0.6241 | 54.7% | 1,162.5 |
| pypdf | 0.6203 | 53.3% | 32.9 |
| opendataloader | 0.5993 | 50.0% | 169.3 |
Markdown conversion (pymupdf4llm) and OCR+layout (docling) push parse time into the hundreds-to-thousands of seconds, yet score lower. The 1st-to-7th gap is only ≈5pp — Korean plain-text extraction accuracy is leveled across loaders, so pick the simplest and fastest.
Chunk size mattered more than the chunker library
I expanded chunkers to 42 across library × size grids. In the char-based group (dense baseline), the striking thing was that the same 256-token chunk collapses depending on tokenizer.
Chunk size and tokenizer choice mattered far more than the chunker library.
| Chunker | size | MRR |
|---|---|---|
| Chonkie Fast | 800 | 0.6903 |
| LC Recursive | 300/50 | 0.6816 |
| LC Token (cl100k) | 256 | 0.6798 |
| Chonkie Token (gpt2) | 256 | 0.4193 ❌ |
The same 256 tokens score 0.6798 on cl100k but crash to 0.4193 on gpt2, which shreds Korean into bytes. Each chunker also had a different sweet spot — 300/50 for LC/Chonkie Recursive·Sentence, 800 for Chonkie Fast.
The top dense score is Chonkie Fast 800 (0.6903); LC Recursive 300/50 ranks 5th (0.6816), but ranks 1–9 sit within ≈1.5pp (noise). So rather than chase the single top score, I adopted LC Recursive 300/50 for standardization and reproducibility (its hybrid re-measurement, 0.7171, is the reference for later stages).
Why semantic/LLM chunkers underdelivered
I evaluated 10 “expensive” chunkers (embedding/LLM calls) separately on hybrid retrieval.
Even the priciest LLM-based chunker (Slumber) couldn’t beat plain char splitting.
| Chunker | MRR | parse |
|---|---|---|
| LC Recursive 300/50 (hybrid re-measured) | 0.7171 | 5s |
| Chonkie Slumber (gpt-5.4, LLM-based) | 0.7112 | 5,608s |
Slumber isn’t first — it’s -0.59pp below LC Recursive while adding 5,600s of parsing and ≈$2 in LLM cost. On this dataset, semantic/LLM chunking wasn’t worth it, so every later stage fixed LC Recursive 300/50.
Korean embedding: alignment over size
The 27-model embedding leaderboard was clear.
A small Korean-specialized model beats an 8B English model — dimension and parameters aren’t decisive; language alignment is.
| Rank | Model | dim | MRR |
|---|---|---|---|
| 🥇 | koe5 | 1024 | 0.6871 |
| 🥈 | gemma-embed-300m | 768 | 0.6650 |
| … | qwen3-embed-4b | 4096 | 0.5850 |
| … | qwen3-embed-8b | 4096 | 0.5271 |
koe5 (1024d) beat qwen3-embed-8b (4096d) by +0.16 MRR. Korean-specialized embeddings (koe5, snowflake-arctic-ko, kure-v1, pixie-rune-v1) filled the top 8. KoE5 is the main recommendation, but this benchmark fixed the smaller, lower-latency embeddinggemma-300m (0.6650) for later stages.
The baseline fixed by ingestion
This sets the starting point for the next stages.
PyMuPDFLoader → RecursiveCharacterTextSplitter(300, 50) → embeddinggemma-300m
Retrieval method, query transforms, and reranking are all compared on top of this baseline. The retrieval part covers the univariate effects of Dense, BM25-KIWI, and Hybrid next.
FAQ
Q. Which PDF loader should I use for Korean RAG? A. PyMuPDF. Plain-text extraction won at MRR 0.6486 and parsed in ≈3s. Markdown and OCR+layout loaders take hundreds of times longer for lower accuracy.
Q. Chunk size or chunker library — which matters more? A. Chunk size. The same 256 tokens dropped from 0.68 to 0.42 by tokenizer (cl100k vs gpt2). LC/Chonkie Recursive families peaked at 300/50.
Q. Are semantic/LLM chunkers worth it? A. Not on this Korean factoid dataset. The priciest LLM chunker (Slumber) was -0.59pp below plain char splitting and added 5,600s of parsing.
Q. Is a bigger embedding model always better? A. Not in Korean. The Korean-aligned KoE5 (1024d) beat qwen3-embed-8b (4096d) by +0.16 MRR — alignment over dimension/parameters.
Data · Code
- Interactive dashboard: https://rag.baeum.ai.kr
- Code · per-stage reports: https://github.com/BAEM1N/RAG-Evaluation
- Result dataset: https://huggingface.co/datasets/BAEM1N/Korean-RAG-LLM-Judge-Benchmark
- Series hub: Korean RAG Benchmark — Methodology
- Next: Dense alone wasn’t enough — Retrieval