How-to, by Mahmoud Zalt
How to ship a production RAG pipeline for PDFs end to end: parsing, chunking, embeddings, retrieval, and Q&A, without losing your weekend.
A production RAG pipeline for PDFs is not a single LLM call. It is a small system with five layers, each of which can fail in a different way. The first layer parses the PDF into clean text plus structure (titles, tables, lists, page numbers). The second layer chunks that text into retrievable pieces sized for your embedding model. The third layer turns each chunk into a vector and stores it next to its metadata. The fourth layer retrieves the right chunks at query time, usually with a hybrid of vector search and keyword search plus a reranker. The fifth layer hands those chunks to an LLM with a prompt that forces grounded citations and refuses to answer when retrieval came back empty. Skip any layer and you ship a chatbot that hallucinates with confidence on the documents it just read.
Most teams lose the RAG game in the first 50 lines of code, because they reach for pdfminer or PyPDF2 and call it a day. Those libraries give you a text blob with the layout flattened: tables become word salad, multi-column pages interleave, and headers and footers leak into every chunk. The result is embeddings that look fine on a unit test and fall apart on the first real customer PDF. The honest options in 2026 are layout-aware parsers: Unstructured.io is the workhorse open-source choice with element-level metadata, LlamaParse from LlamaIndex handles complex tables and forms more cleanly than anything else I have tested, and Azure Document Intelligence or AWS Textract are the bigger guns for scanned PDFs that need real OCR. Pick one, normalize its output to a consistent schema (text, type, page, bounding box), and write the parsed result to disk so you never parse the same PDF twice.
Keep headings, lists, and tables as typed elements, not flat text.
Tag every element with page number and source so citations are honest later.
Detect scanned pages and route to Textract, Tesseract, or Document Intelligence.
Remove repeated headers, footers, and page numbers before chunking, not after.
Store the parsed JSON in object storage so reindexing never reparses.
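The checklist above can be sketched in a few dozen lines. This is a minimal sketch, not any parser's real API: the `Element` schema, the `parse_fn` callback (a stand-in for whichever layout-aware parser you picked), the 60 percent repetition ratio, and the 80-character cutoff are all assumptions chosen for illustration. It strips lines that repeat on most pages and caches the parsed result under a hash of the PDF bytes. A local directory stands in for object storage.

```python
import hashlib
import json
import re
from collections import Counter
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class Element:
    text: str
    type: str                          # e.g. "title", "paragraph", "table", "list_item"
    page: int
    source: str
    bbox: Optional[List[float]] = None  # bounding box, if the parser provides one


def _repeat_key(element):
    # Replace digits so "Page 3" and "Page 4" count as the same line.
    return re.sub(r"\d+", "#", element.text.strip().lower())


def strip_repeated_lines(elements, min_pages=3, ratio=0.6, max_len=80):
    """Drop short text that shows up on most pages (likely headers or footers)."""
    pages = {e.page for e in elements}
    if len(pages) < min_pages:
        return list(elements)
    pages_seen = Counter()
    for page in pages:
        for key in {_repeat_key(e) for e in elements if e.page == page}:
            pages_seen[key] += 1
    repeated = {k for k, n in pages_seen.items() if n / len(pages) >= ratio}
    return [
        e for e in elements
        if not (_repeat_key(e) in repeated and len(e.text) < max_len)
    ]


def load_or_parse(pdf_bytes, source, cache_dir, parse_fn):
    """Return cached elements if this exact PDF was parsed before, else parse and cache."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    path = Path(cache_dir) / f"{digest}.json"
    if path.exists():
        return [Element(**d) for d in json.loads(path.read_text(encoding="utf-8"))]
    elements = strip_repeated_lines(parse_fn(pdf_bytes, source))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps([asdict(e) for e in elements]), encoding="utf-8")
    return elements
```

Keying the cache on the file contents rather than the filename means a renamed PDF is not reparsed and an updated PDF under the same name is.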
Chunking is where most pipelines silently degrade. A naive fixed-size split (1000 characters, no overlap) will cut sentences in half, split tables from their headers, and put a question in one chunk and its answer in the next. The pattern that works in production is semantic chunking with structural respect: split on heading boundaries first, then on paragraph boundaries, then on sentence boundaries, and fall back to a hard size cap only when a single unit is still too large. Aim for chunks between 300 and 800 tokens, with 10 to 20 percent overlap to preserve context across boundaries. Keep tables as a single chunk if they fit, or chunk by row with the header repeated. Always attach metadata (source filename, page, section heading, chunk index) so the retriever can filter and the LLM can cite. The text splitters that ship with LangChain (RecursiveCharacterTextSplitter) and LlamaIndex (SentenceSplitter) are a fine starting point. Tune chunk size on a held-out evaluation set, not on vibes.
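To make the splitting order concrete, here is a minimal sketch of packing one section into chunks. It assumes the caller has already split the document on headings, and it approximates tokens with whitespace-separated words, so swap in your embedding model's tokenizer for real counts. The names `chunk_section`, `hard_split`, and `count_tokens` are illustrative and do not come from any library.

```python
import re


def count_tokens(text):
    # Assumption: whitespace words stand in for tokens.
    return len(text.split())


def split_sentences(text):
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def hard_split(sentence, max_tokens):
    # Last resort: cut an oversized sentence into fixed-size word windows.
    words = sentence.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def chunk_section(heading, paragraphs, max_tokens=500, overlap_tokens=100):
    """Pack the sentences of one section into chunks of at most max_tokens."""
    units = []
    for paragraph in paragraphs:
        for sentence in split_sentences(paragraph):
            units.extend(hard_split(sentence, max_tokens))

    chunks, current, size = [], [], 0
    for unit in units:
        n = count_tokens(unit)
        if current and size + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward as overlap, as long as the
            # next unit still fits under the cap.
            carried, carried_size = [], 0
            for prev in reversed(current):
                c = count_tokens(prev)
                if carried_size + c > overlap_tokens or carried_size + c + n > max_tokens:
                    break
                carried.insert(0, prev)
                carried_size += c
            current, size = carried, carried_size
        current.append(unit)
        size += n
    if current:
        chunks.append(" ".join(current))

    return [
        {"heading": heading, "chunk_index": i, "text": text}
        for i, text in enumerate(chunks)
    ]
```

Because the function only ever sees one section, a chunk can never straddle a heading. Overlap is carried in whole sentences, so it may come out smaller than `overlap_tokens` but never cuts a sentence in half.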
Once parsing and chunking are solid, the embedding choice matters less than people think. The interesting work moves to retrieval quality and prompt design, where small fixes compound fast. A real production stack tends to converge on the same shape across teams: a vector database for similarity, a keyword index for exact matches, a reranker to combine them, and a tightly scoped prompt that refuses to answer outside the retrieved context. That last piece is what makes a RAG system feel trustworthy instead of confident-wrong.
If wiring all of that together sounds like a multi-week project, that is because it usually is. LangChain and LlamaIndex give you the parts; you still have to assemble, evaluate, and operate the pipeline yourself. The reason I built Sistava the way I did is that most founders do not want a RAG framework; they want an AI Employee that already knows their documents. Upload the PDFs, and the parsing, chunking, embedding, and retrieval happen in the background, and the employee answers from your real knowledge with citations. Both paths are valid; pick the one that matches the time you actually have.
The embedding model market settled around three honest defaults in 2026. OpenAI text-embedding-3-large gives the best quality-per-dollar for English-heavy workloads and ships with a tunable dimension count, which is useful when you want to trade recall for storage. Cohere embed-v3 is strong on multilingual content and clean to integrate. For self-hosting, BGE-large or BGE-M3 from BAAI are the open-weight workhorses and run cheaply on a single GPU. On the vector database side: pgvector is the answer for most teams under 10 million chunks because it lives next to your existing Postgres, with HNSW indexes giving low-latency lookups. Pinecone is the easy managed default if you want zero ops. Qdrant is the strongest open-source option for filterable hybrid search. Weaviate and Milvus exist and are fine; pick by what your team already operates.
text-embedding-3-large for hosted, BGE-M3 for self-hosted multilingual.
pgvector for most teams, Pinecone for zero-ops, Qdrant for open-source hybrid search.
Combine dense vector search with BM25 keyword search to catch exact terms.
Add a cross-encoder reranker like Cohere Rerank or BGE-Reranker to lift top-3 quality.
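The article does not prescribe how to merge the dense and BM25 result lists. One common option is reciprocal rank fusion, which only needs the rank positions and so sidesteps the problem that cosine similarities and BM25 scores live on different scales. A minimal sketch, assuming each retriever has already returned an ordered list of chunk IDs (the function name and the conventional constant `k=60` are my choices here, not something the stack above requires):

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=20):
    """Fuse several ranked lists of chunk IDs into one list.

    Each chunk scores 1 / (k + rank) per list it appears in, summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


dense_hits = ["c1", "c2", "c3"]   # from vector search
bm25_hits = ["c3", "c1", "c4"]    # from keyword search
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Chunks that both retrievers agree on rise to the top, and a chunk that only the keyword index found (an exact part number, say) still makes the candidate pool for the reranker.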
The last layer is where most RAG demos break in production. A grounded Q&A system needs three things in the prompt: the retrieved chunks with their source metadata, a clear instruction to answer only from the provided context, and a refusal pattern for when the context does not contain the answer. The model should cite chunk IDs or page numbers inline so users can verify. Add a confidence check: if retrieval similarity scores are below a threshold, route to a human or return a clean "I do not have that information" instead of a fabricated answer. Log every query, the retrieved chunks, and the final answer in an observability tool (Langfuse, LangSmith, or your own logging) so you can spot drift. Real production RAG is a feedback loop: you watch the bad answers, fix parsing or chunking or prompting, and re-evaluate against a frozen test set. The systems that work are the ones with that loop running weekly, not the ones with the cleverest initial design.
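Those three prompt ingredients and the confidence check fit in one small function. This is a minimal sketch under stated assumptions: each hit is a dict with `chunk_id`, `source`, `page`, `text`, and a similarity `score`, and the `min_score` threshold of 0.35 is a placeholder you would calibrate on your own evaluation set, not a recommended value.

```python
REFUSAL = "I do not have that information."


def build_grounded_prompt(question, hits, min_score=0.35):
    """Return (prompt, None) when retrieval is usable, else (None, REFUSAL)."""
    usable = [h for h in hits if h["score"] >= min_score]
    if not usable:
        # Nothing trustworthy retrieved: refuse without calling the LLM at all.
        return None, REFUSAL

    context = "\n\n".join(
        f"[{h['chunk_id']}] (source: {h['source']}, page {h['page']})\n{h['text']}"
        for h in usable
    )
    prompt = (
        "Answer the question using ONLY the context below.\n"
        "Cite the chunk ID in square brackets after each claim.\n"
        f'If the context does not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return prompt, None
```

Short-circuiting before the LLM call matters: a model handed an empty context will often answer anyway, so the cheapest refusal is the one the model never gets a chance to override.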
A working prototype with LangChain or LlamaIndex takes a day. Getting to production quality (good parsing, evaluated chunking, hybrid retrieval, reranking, monitoring, eval set) is typically 2 to 6 weeks of focused engineering, depending on PDF complexity and how strict your accuracy bar is.
Start at 500 tokens with 100 tokens of overlap and tune on an evaluation set. Dense reference documents (legal, technical) often want smaller chunks (300-400). Narrative content (reports, articles) tolerates larger chunks (600-800). Always split on structural boundaries before falling back to size.
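"Tune on an evaluation set" can start very small. The sketch below assumes a frozen list of questions, each labeled with the chunk IDs (or page IDs) that contain the answer, and a `retrieve(question, k)` callable per configuration you want to compare. Both the data shape and the function names are hypothetical. It reports hit rate at k: the fraction of questions for which at least one relevant chunk appears in the top k results.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of questions with at least one relevant ID in the top-k results."""
    if not eval_set:
        return 0.0
    hits = 0
    for example in eval_set:
        retrieved = retrieve(example["question"], k)[:k]
        if set(retrieved) & set(example["relevant_ids"]):
            hits += 1
    return hits / len(eval_set)


def compare_configs(eval_set, retrievers, k=5):
    """Score several retriever configurations against the same frozen eval set."""
    return {name: hit_rate_at_k(eval_set, fn, k) for name, fn in retrievers.items()}
```

If chunk IDs change whenever you change the chunk size, label the evaluation set with something stable such as source file and page, and map retrieved chunks back to that before comparing.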
Pure vector search is enough for demos. For production, a reranker (Cohere Rerank, BGE-Reranker, or Voyage rerank-2) lifts top-3 precision noticeably, especially on PDFs with repetitive content. Retrieve top 15-20 with vector search, rerank down to top 3-5, and feed those to the LLM.
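The retrieve-wide-then-rerank-narrow shape looks roughly like this. It is a sketch with a deliberately toy scorer: `overlap_score` just counts shared words and stands in for a real cross-encoder call (Cohere Rerank, BGE-Reranker, and so on), whose client code is not shown here. The function names and the candidate dict shape are assumptions.

```python
def overlap_score(query, text):
    # Toy stand-in for a cross-encoder: number of distinct query words found in the text.
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_then_rerank(query, candidates, score_fn=overlap_score, fetch_k=20, final_k=5):
    """Take the first-stage top fetch_k, rescore each against the query, keep final_k."""
    pool = candidates[:fetch_k]
    reranked = sorted(pool, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return reranked[:final_k]


# Candidates arrive in first-stage (vector search) order.
candidates = [
    {"chunk_id": "c1", "text": "general overview of the annual report"},
    {"chunk_id": "c2", "text": "refund policy for annual plans is 14 days"},
    {"chunk_id": "c3", "text": "refund requests need an order number"},
]
top = retrieve_then_rerank("refund policy annual plans", candidates, final_k=2)
```

The first stage is cheap and optimizes for recall; the reranker is expensive per pair, which is why it only sees the 15 to 20 candidates rather than the whole index.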
Tables should be parsed as structured elements and kept whole as a single chunk when they fit. For larger tables, chunk by row with the header repeated. For figures, extract any caption and OCR the figure if it contains text. Tools like LlamaParse and Unstructured handle this better than naive parsers.
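Chunking a table by row with the header repeated can be sketched as below. It assumes the parser already returned the table as a header list plus a list of rows, and it again approximates tokens with whitespace-separated words. The pipe-delimited rendering is just one readable choice, not a required format.

```python
def count_tokens(text):
    # Assumption: whitespace words stand in for tokens.
    return len(text.split())


def render_table(header, rows):
    lines = [" | ".join(str(cell) for cell in header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)


def chunk_table(header, rows, max_tokens=500):
    """Keep the table whole if it fits, else split by row with the header on every chunk."""
    whole = render_table(header, rows)
    if count_tokens(whole) <= max_tokens:
        return [whole]

    chunks, batch = [], []
    for row in rows:
        # Flush the current batch when adding this row would exceed the cap.
        if batch and count_tokens(render_table(header, batch + [row])) > max_tokens:
            chunks.append(render_table(header, batch))
            batch = []
        batch.append(row)
    if batch:
        chunks.append(render_table(header, batch))
    return chunks
```

Repeating the header costs a few tokens per chunk, but without it a retrieved row like "North | 10" carries no hint of what the 10 measures.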
Yes. Sistava AI Employees ship PDF training built in: upload the documents, the employee parses, chunks, embeds, and retrieves automatically, and answers with citations. It is the right call when you want the outcome (answers grounded in your docs) without operating the pipeline yourself. For deeper customization, building with LlamaIndex or LangChain still wins.
If you want a concrete next step after this overview, the question most founders ask once their RAG pipeline works is how to give that knowledge to a real AI Employee with memory, schedules, and channels. That is a different shape of problem than retrieval quality, and worth reading separately. The piece below walks through hiring an AI Employee that owns a knowledge function end to end, instead of being a search box bolted onto a chat window.
The honest takeaway after building this pipeline more than once is that the algorithm choices matter less than the operating loop. Parsing well, chunking with structure, embedding with a strong default, retrieving with a hybrid plus a reranker, and prompting for grounded answers will get you most of the way there. Everything else (fancier rerankers, custom embeddings, agentic retrieval) is a tuning knob you reach for after you have a baseline you trust. If you want to own that loop, LangChain and LlamaIndex are the right tools, and the patterns above are battle-tested. If you would rather skip the build and have AI Employees who already know your documents on day one, that is exactly what Sistava ships out of the box. Pick whichever path returns weeks of your time and gets you to grounded answers fastest.