BAEM1N.DEV
A space for sharing hands-on experiments in AI, data engineering, and
automation. Topics include RAG pipelines, AI agents, LLMOps, and practical
benchmarks.
Written by Bae Gimin (baem1n), Founder of DDOK.AI and an AI/Data
Educator & Engineer. Benchmark posts prioritize reproducible
methodology, public source code, raw CSV/results, and clear failure
conditions. Contact: gm.bae@ddok.ai.
Trust & disclosure: when the author maintains a related open-source
project, that relationship is disclosed in the article. Experimental
claims are tied to code, raw data, or measurement notes whenever
possible.
Subscribe via RSS for updates.
Featured
Korean RAG Benchmark Conclusion: Look at the Pipeline Before Upgrading the Model
Synthesis of a Korean RAG benchmark: the same GPT-5.4 with a tuned pipeline hits accuracy 0.827, +6.0pp over a 10× costlier model. A 0.6B Korean reranker beats a 4B SOTA by +1.83pp. The reranker is the dominant axis. Covers the 7 findings and the recommended production pipeline.
Korean RAG Benchmark: Why I Took the Whole Pipeline Apart with 300 Questions
Methodology of a Korean RAG benchmark that decomposes the pipeline into 6 stages and runs a full 384-combination Cartesian sweep. 300 Q&A × 58 PDFs × 5 domains, 46 generators (27 open + 19 closed), 4-metric LLM-as-Judge, ≈1.2M LLM calls.
RunPod Referral Link: Get $5-$500 in GPU Credits
Sign up with my RunPod referral link and get a $5-$500 credit bonus when you add $10 for the first time.
Vultr Referral Link: Get $300 in Credits
Sign up with my Vultr referral link and get $300 in credits to try VPS, Cloud GPU, Kubernetes, and Object Storage.
Qwen3.5 Local Inference Benchmark Results: 4 Machines × 5 Engines
Generation and prefill throughput for Qwen3.5 (9B, 27B, 35B-A3B MoE, 122B-A10B MoE) on M5 Max, RTX 3090×2, DGX Spark GB10, and Ryzen AI MAX 395+, measured with llama.cpp, MLX, Ollama, vLLM, and Lemonade.
Recent Posts
Stacking Univariate Winners Didn't Give the Optimum — A 384-Combination Korean RAG Sweep
Scoring all 384 Pre×Retrieval×Reranker combinations for Korean RAG: query2doc, only 4th by univariate end-to-end judge score, becomes the global winner (judge 4.067, acc 0.827) once paired with jina-reranker-m0. The MRR winner ≠ the answer-quality winner. Interaction between stages is why the full sweep was needed.
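To make the interaction effect concrete, here is a minimal sketch with made-up numbers. The option names and scores are hypothetical placeholders, not the benchmark's data, and only two axes are shown instead of the post's three (Pre × Retrieval × Reranker). It compares stacking the per-axis winners against scoring every combination.

```python
from itertools import product

# Hypothetical options per stage (the real sweep covers 384 combinations).
PRE = ["none", "transform_a"]
RERANKERS = ["reranker_x", "reranker_y"]

# Made-up scores for illustration only: transform_a helps only with reranker_y.
SCORES = {
    ("none", "reranker_x"): 0.70,
    ("none", "reranker_y"): 0.68,
    ("transform_a", "reranker_x"): 0.66,
    ("transform_a", "reranker_y"): 0.75,
}


def univariate_stack(baseline=("none", "reranker_x")):
    """Vary one axis at a time from the baseline, then stack the per-axis winners."""
    best_pre = max(PRE, key=lambda p: SCORES[(p, baseline[1])])
    best_rr = max(RERANKERS, key=lambda r: SCORES[(baseline[0], r)])
    return (best_pre, best_rr)


def full_sweep():
    """Score every combination (Cartesian product) and return the best one."""
    return max(product(PRE, RERANKERS), key=lambda combo: SCORES[combo])


stacked = univariate_stack()
best = full_sweep()
print("stacked univariate winners:", stacked, SCORES[stacked])
print("full-sweep winner:", best, SCORES[best])
```

In this toy table the stacked choice stays at the baseline (0.70), while the full sweep finds a pairing (0.75) that neither univariate comparison would surface.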
How Far Have Open-Weight LLMs Come in Korean RAG — 46 Generators and Judge Reliability
Comparing 46 Korean RAG generators (27 open + 19 closed): gpt-oss-120b and kimi-k2.5 tie for the open-weight lead (acc 0.740), and gpt-oss-20b reaches 0.727 at 13GB VRAM. The closed leader gpt-5.4 (0.787) is 4.7pp ahead. A single LLM-as-Judge shook the rankings.
Why a 0.6B Korean Reranker Beat a 4B SOTA — Comparing 25 Rerankers for Korean RAG
Univariate comparison of 25 rerankers for Korean RAG: a 0.6B Korean fine-tune (dragonkue/bge-reranker-v2-m3-ko) hits MRR 0.7697, beating the 6.7× larger 2025 SOTA Qwen3-Reranker-4B (0.7514) by +1.83pp. The reranker was the single biggest axis.
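For readers unfamiliar with the metric, the sketch below shows the textbook definition of mean reciprocal rank (MRR). It is an assumption that the benchmark uses this plain form; any rank cutoff or tie handling the posts apply is not stated here, and the document IDs are invented.

```python
def mean_reciprocal_rank(ranked_ids_per_query, relevant_ids_per_query):
    """Average of 1/rank of the first relevant document per query (0 if none is found)."""
    if not ranked_ids_per_query:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids_per_query)


# Hypothetical rankings for three queries.
ranked = [["d3", "d1", "d2"], ["d5", "d4"], ["d9"]]
relevant = [{"d1"}, {"d5"}, {"d0"}]
print(mean_reciprocal_rank(ranked, relevant))  # (1/2 + 1/1 + 0) / 3 = 0.5
```

Under this definition, the reported gap is simple subtraction: 0.7697 − 0.7514 = 0.0183, i.e. 1.83 percentage points.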
Dense Alone Wasn't Enough: BM25-KIWI, Hybrid, and Query Transforms in Korean RAG
Univariate retrieval comparison for Korean RAG: Hybrid 3:7 (Dense + BM25-KIWI) hits MRR 0.7171, beating every single-method retriever. BM25 needs morphology (KIWI): +14.4pp over whitespace. Pre-retrieval query transforms were noise-level on their own.
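One common way to build such a hybrid is a weighted sum of normalized scores, sketched below. This is an illustration under assumptions, not the post's implementation: the 0.3/0.7 split assumes "3:7" means dense:BM25 in that order, min-max normalization is my choice, and the scores are invented. KIWI morphological tokenization is not shown; the BM25 scores are taken as given.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1]; a constant or empty map becomes all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(dense_scores, bm25_scores, w_dense=0.3, w_bm25=0.7):
    """Fuse two retrievers by weighted sum of normalized scores; missing docs score 0."""
    dense_n, bm25_n = min_max(dense_scores), min_max(bm25_scores)
    docs = set(dense_n) | set(bm25_n)
    fused = {
        d: w_dense * dense_n.get(d, 0.0) + w_bm25 * bm25_n.get(d, 0.0)
        for d in docs
    }
    # Sort by fused score descending, breaking ties by doc id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical scores from each retriever for one query.
dense = {"a": 0.9, "b": 0.5, "c": 0.1}
bm25 = {"b": 12.0, "c": 3.0, "d": 8.0}
print(hybrid_rank(dense, bm25))
```

Rank-based fusion (for example reciprocal rank fusion) is an alternative when the two score scales are hard to normalize.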
Korean RAG Ingestion: Simpler Choices Won in Loader, Chunker, Embedding
Univariate comparison over 300 Korean Q&A: PyMuPDF wins the loader at MRR 0.6486; the top char chunker by dense MRR is Chonkie Fast 800 (0.6903), but within ≈1.5pp noise, so the standard LC Recursive 300/50 (0.6816 dense / 0.7171 hybrid) was adopted downstream; KoE5 beats an 8B English model by +0.16 MRR. Korean alignment mattered more than processing complexity.
Local LLM Inference Benchmark: Experimental Design Across 4 Hardware Platforms and 5 Engines
Methodology, experimental design, and gotchas for a cross-platform benchmark measuring Qwen3.5 on M5 Max, RTX 3090×2, DGX Spark, and Ryzen AI MAX 395+.
Qwen3.5 Cross-Platform Benchmark: 4 Hardware Targets × 5 Engines Compared
Apples-to-apples Qwen3.5 numbers across Mac M5 Max, RTX 3090×2, DGX Spark, and Ryzen AI MAX 395+. Cold prefill, cache disabled, randomized run order.
Building a GraphRAG Pipeline — From Vector Search to Graph Expansion
Solve multi-hop questions that plain vector RAG can't answer. Vectorize graph nodes with from_existing_graph in one line, and auto-convert natural language to Cypher with CypherQAChain.
Mastering Vector Search with langchain-age — Hybrid Search, MMR, and Metadata Filtering
Why Hybrid Search matters for pgvector, when to use each strategy, and real recall benchmarks. Includes HNSW vs IVFFlat selection criteria and MongoDB-style metadata filtering.
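As background on MMR (Maximal Marginal Relevance), the sketch below implements the generic greedy algorithm in plain Python. It does not use or reproduce the langchain-age API; the function names, the `lambda_mult` parameter name, and the vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k doc indices, trading query relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical 2-D embeddings: docs 0 and 1 are near-duplicates, doc 2 is different.
query = [1.0, 0.3]
docs = [[1.0, 0.2], [1.0, 0.25], [0.3, 1.0]]
print(mmr(query, docs, k=2, lambda_mult=0.5))  # diversity-aware selection
print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With `lambda_mult=1.0` the redundancy term vanishes and the result is a plain top-k by similarity; lower values push the second pick away from near-duplicates.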
Full AI Agent Stack on One PostgreSQL — LangGraph + langchain-age
Can you replace Neo4j+Redis+Pinecone with just PostgreSQL for an AI Agent? A real architecture that unifies graph, vectors, checkpoints, and long-term memory in one database.