Posts

All the articles I've posted.

Building PhoenixCallbackHandler: Wrapping OpenInference Tracing as a LangChain Callback

15 Jun, 2026

A package design for a Phoenix LangChain CallbackHandler that reuses OpenInference's tracer instead of reimplementing run-to-span conversion.
Tracing LangChain with Arize Phoenix: auto_instrument vs CallbackHandler

15 Jun, 2026

A deep dive into Phoenix's official register(auto_instrument=True) path, how OpenInference instruments LangChain, and when a callback handler API is useful.
Getting Started with Arize Phoenix: Open-Source LLMOps for Tracing, Evaluation, and Debugging

15 Jun, 2026

A practical introduction to Arize Phoenix, how it compares with LangSmith and Langfuse, and why OpenTelemetry/OpenInference matter for LLM observability.
Korean RAG Benchmark Conclusion: Look at the Pipeline Before Upgrading the Model

1 Jun, 2026

Synthesis of a Korean RAG benchmark — the same GPT-5.4 with a tuned pipeline hits accuracy 0.827, +6.0pp over a 10× costlier model. A 0.6B Korean reranker beats a 4B SOTA by +1.83pp. Reranker is the dominant axis. The 7 findings and the recommended production pipeline.
Stacking Univariate Winners Didn't Give the Optimum — A 384-Combination Korean RAG Sweep

1 Jun, 2026

Scoring all 384 Pre×Retrieval×Reranker combinations for Korean RAG — query2doc, only 4th by univariate e2e judge, becomes the global winner (judge 4.067/acc 0.827) once paired with jina-reranker-m0. The MRR winner ≠ the answer-quality winner. Interaction is why the full sweep was needed.
How Far Have Open-Weight LLMs Come in Korean RAG — 46 Generators and Judge Reliability

1 Jun, 2026

Comparing 46 Korean RAG generators (27 open + 19 closed): gpt-oss-120b and kimi-k2.5 tie for the open-weight lead (acc 0.740), and gpt-oss-20b reaches 0.727 at 13GB VRAM. The closed leader gpt-5.4 (0.787) is 4.7pp ahead of the open-weight lead. A single LLM-as-Judge shook the rankings.
Why a 0.6B Korean Reranker Beat a 4B SOTA — Comparing 25 Rerankers for Korean RAG

1 Jun, 2026

Univariate comparison of 25 rerankers for Korean RAG — a 0.6B Korean fine-tune (dragonkue/bge-reranker-v2-m3-ko) hits MRR 0.7697, beating the 6.7× larger 2025 SOTA Qwen3-Reranker-4B (0.7514) by +1.83pp. The reranker was the single biggest axis.
Dense Alone Wasn't Enough: BM25-KIWI, Hybrid, and Query Transforms in Korean RAG

1 Jun, 2026

Univariate retrieval comparison for Korean RAG — Hybrid 3:7 (Dense + BM25-KIWI) hits MRR 0.7171, beating every single-method retriever. BM25 needs morphology (KIWI): +14.4pp over whitespace. Pre-retrieval query transforms were noise-level on their own.
Korean RAG Ingestion: Simpler Choices Won in Loader, Chunker, Embedding

1 Jun, 2026

Univariate comparison over 300 Korean Q&A — PyMuPDF wins the loader at MRR 0.6486; the top char chunker by dense MRR is Chonkie Fast 800 (0.6903), but within ≈1.5pp noise, so the standard LC Recursive 300/50 (0.6816 dense / 0.7171 hybrid) was adopted downstream; KoE5 beats an 8B English model by +0.16 MRR. Korean alignment mattered more than processing complexity.
Korean RAG Benchmark: Why I Took the Whole Pipeline Apart with 300 Questions

1 Jun, 2026

Methodology of a Korean RAG benchmark that decomposes the pipeline into 6 stages and runs a full 384-combination Cartesian sweep. 300 Q&A × 58 PDF × 5 domains, 46 generators (27 open + 19 closed), 4-metric LLM-as-Judge, ≈1.2M LLM calls.
RunPod Referral Link: Get $5-$500 in GPU Credits

29 Apr, 2026

Sign up with my RunPod referral link and get a $5-$500 credit bonus when you add $10 for the first time.
Vultr Referral Link: Get $300 in Credits

29 Apr, 2026

Sign up with my Vultr referral link and get $300 in credits to try VPS, Cloud GPU, Kubernetes, and Object Storage.
Local LLM Inference Benchmark: Experimental Design Across 4 Hardware Platforms and 5 Engines

Updated: 16 Apr, 2026

Methodology, experimental design, and gotchas for a cross-platform benchmark measuring Qwen3.5 on M5 Max, RTX 3090×2, DGX Spark, and Ryzen AI MAX 395+.
Qwen3.5 Cross-Platform Benchmark: 4 Hardware Targets × 5 Engines Compared

Updated: 16 Apr, 2026

Apples-to-apples Qwen3.5 numbers across Mac M5 Max, RTX 3090×2, DGX Spark, and Ryzen AI MAX 395+. Cold prefill, cache disabled, randomized run order.
Qwen3.5 Local Inference Benchmark Results: 4 Machines × 5 Engines

Updated: 16 Apr, 2026

Generation and prefill throughput for Qwen3.5 (9B, 27B, 35B-A3B MoE, 122B-A10B MoE) on M5 Max, RTX 3090×2, DGX Spark GB10, and Ryzen AI MAX 395+, measured with llama.cpp, MLX, Ollama, vLLM, and Lemonade.
