Sistava

A Production RAG Pipeline for PDFs: Parsing, Chunking, Embeddings

How-to — by Mahmoud Zalt

How to ship a production RAG pipeline for PDFs end to end: parsing, chunking, embeddings, retrieval, and Q&A, without losing your weekend.

What are the five layers of a production RAG pipeline for PDFs?

A production RAG pipeline for PDFs is not a single LLM call. It is a small system with five layers, each of which can fail in a different way. The first layer parses the PDF into clean text plus structure (titles, tables, lists, page numbers). The second layer chunks that text into retrievable pieces sized for your embedding model. The third layer turns each chunk into a vector and stores it next to its metadata. The fourth layer retrieves the right chunks at query time, usually with a hybrid of vector search and keyword search plus a reranker. The fifth layer hands those chunks to an LLM with a prompt that forces grounded citations and refuses to answer when retrieval came back empty. Skip any layer and you ship a chatbot that hallucinates with confidence on the documents it just read.

At a Glance

5 layers: Parse, chunk, embed, retrieve, answer
300-800: Tokens per chunk for most PDF content
10-20%: Chunk overlap to preserve context across boundaries
top-k 8-20: Chunks retrieved before reranking down to 3-5

How do you parse PDFs without losing tables and structure?

Most teams lose the RAG game in the first 50 lines of code, because they reach for pdfminer or PyPDF2 and call it a day. Those libraries give you a text blob with the layout flattened: tables become word salad, multi-column pages interleave, and headers and footers leak into every chunk. The result is embeddings that look fine on a unit test and fall apart on the first real customer PDF. The honest options in 2026 are layout-aware parsers: Unstructured.io is the workhorse open-source choice with element-level metadata, LlamaParse from LlamaIndex handles complex tables and forms more cleanly than anything else I have tested, and Azure Document Intelligence or AWS Textract are the bigger guns for scanned PDFs that need real OCR. Pick one, normalize its output to a consistent schema (text, type, page, bounding box), and write the parsed result to disk so you never reparse the same PDF twice.

Benefits

Preserve structure

Keep headings, lists, and tables as typed elements, not flat text.

Page-level metadata

Tag every element with page number and source so citations are honest later.

OCR fallback

Detect scanned pages and route to Textract, Tesseract, or Document Intelligence.

Strip noise

Remove repeated headers, footers, and page numbers before chunking, not after.

Cache parsed output

Store the parsed JSON in object storage so reindexing never reparses.
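As a rough illustration of the normalize, strip, and cache ideas above, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `Element` schema mirrors the (text, type, page, bounding box) shape mentioned earlier, `parse_fn` stands in for whichever parser you picked, and the repeated-line heuristic (exact text seen on three or more pages, or a bare page number) is deliberately crude. A local directory stands in for object storage.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Element:
    """Hypothetical normalized schema shared by all parsers."""
    text: str
    type: str  # e.g. "title", "paragraph", "table", "header", "footer"
    page: int
    bbox: Optional[Tuple[float, float, float, float]] = None


def strip_repeated_lines(elements: List[Element], min_pages: int = 3) -> List[Element]:
    """Drop short text repeated on many pages, plus bare page numbers."""
    pages_seen: Dict[str, Set[int]] = {}
    for el in elements:
        pages_seen.setdefault(el.text.strip().lower(), set()).add(el.page)
    noisy = {
        key for key, pages in pages_seen.items()
        if len(pages) >= min_pages and len(key) < 80
    }
    return [
        el for el in elements
        if el.text.strip().lower() not in noisy and not el.text.strip().isdigit()
    ]


def cache_path(pdf_bytes: bytes, cache_dir: Path) -> Path:
    """Key the cache on file content, so a renamed PDF is still a cache hit."""
    return cache_dir / (hashlib.sha256(pdf_bytes).hexdigest() + ".json")


def load_or_parse(
    pdf_bytes: bytes,
    cache_dir: Path,
    parse_fn: Callable[[bytes], List[Element]],
) -> List[Element]:
    """Return cached elements if present; otherwise parse, clean, and cache."""
    path = cache_path(pdf_bytes, cache_dir)
    if path.exists():
        rows = json.loads(path.read_text(encoding="utf-8"))
        return [
            Element(r["text"], r["type"], r["page"],
                    tuple(r["bbox"]) if r["bbox"] else None)
            for r in rows
        ]
    elements = strip_repeated_lines(parse_fn(pdf_bytes))
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps([asdict(e) for e in elements]), encoding="utf-8")
    return elements
```

Noise is stripped before the result is cached, so every later reindex starts from clean elements. If you change the cleaning heuristic, you would want to version the cache key as well.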

What is the right chunking strategy for embeddings?

Chunking is where most pipelines silently degrade. A naive fixed-size split (1000 characters, no overlap) will cut sentences in half, split tables from their headers, and place a question in one chunk and its answer in the next. The pattern that works in production is semantic chunking with structural respect: split on heading boundaries first, then on paragraph boundaries, then on sentence boundaries, falling back to a hard size cap only as a last resort. Aim for chunks between 300 and 800 tokens, with 10 to 20 percent overlap to preserve context across boundaries. Keep tables as a single chunk if they fit, or chunk by row with the header repeated. Always attach metadata (source filename, page, section heading, chunk index) so the retriever can filter and the LLM can cite. Tools like LangChain and LlamaIndex ship recursive character splitters that are a fine starting point. Tune chunk size on a held-out evaluation set, not on vibes.

Chunk PDFs in production

  1. Split by structure first — Walk the parsed element tree and start new chunks at each heading or section break.
  2. Pack to target size — Accumulate sibling elements until you reach 300 to 800 tokens, then close the chunk.
  3. Add overlap windows — Carry the last 50 to 150 tokens of each chunk into the next so context is not lost.
  4. Attach metadata — Tag each chunk with source, page, heading path, chunk index, and a stable hash for upserts.
  5. Evaluate before scaling — Run 50 to 100 representative queries against your chunked index and measure recall before adding more PDFs.
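Steps 1 through 4 can be sketched in a few dozen lines. This is a simplified illustration, not a production splitter. It assumes parsed elements arrive as dicts with `text`, `type`, and `page` keys (a hypothetical schema), approximates tokens by whitespace-separated words, and does not split a single element that is larger than the target on its own. The sentence-level fallback and table handling described above are left out.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Chunk:
    text: str
    metadata: Dict[str, object]


def chunk_elements(
    elements: List[dict],
    source: str,
    target_tokens: int = 500,
    overlap_tokens: int = 75,
) -> List[Chunk]:
    """Split on headings, pack to a target size, carry overlap, attach metadata."""
    chunks: List[Chunk] = []
    words: List[str] = []        # words in the chunk being built (may begin with overlap)
    fresh = 0                    # words added since the last close, excluding overlap
    heading = ""                 # most recent heading seen
    page: Optional[int] = None   # page of the first new element in this chunk

    def close(carry: bool) -> None:
        nonlocal words, fresh, page
        emitted = fresh > 0
        if emitted:
            text = " ".join(words)
            # Stable content hash so re-indexing can upsert instead of duplicating.
            digest = hashlib.sha256(f"{source}|{text}".encode("utf-8")).hexdigest()[:16]
            chunks.append(Chunk(text, {
                "source": source,
                "page": page,
                "heading": heading,
                "chunk_index": len(chunks),
                "hash": digest,
            }))
        keep = carry and emitted and overlap_tokens > 0
        words = words[-overlap_tokens:] if keep else []
        fresh = 0
        page = None

    for el in elements:
        if el["type"] == "title":
            close(carry=False)   # a heading always starts a new chunk, with no overlap
            heading = el["text"]
            continue
        el_words = el["text"].split()
        if fresh > 0 and len(words) + len(el_words) > target_tokens:
            close(carry=True)    # size cap reached: close and carry the tail forward
        if page is None:
            page = el["page"]
        words.extend(el_words)
        fresh += len(el_words)

    close(carry=False)
    return chunks
```

Overlap is carried only within a section, never across a heading, on the assumption that a new heading marks a real topic change. For real token counts, swap the word split for the tokenizer that matches your embedding model.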

Once parsing and chunking are solid, the embedding choice matters less than people think. The interesting work moves to retrieval quality and prompt design, where small fixes compound fast. A real production stack tends to converge on the same shape across teams: a vector database for similarity, a keyword index for exact matches, a reranker to combine them, and a tightly scoped prompt that refuses to answer outside the retrieved context. That last piece is what makes a RAG system feel trustworthy instead of confident-wrong.

If wiring all of that together sounds like a multi-week project, that is because it usually is. LangChain and LlamaIndex give you the parts; you still have to assemble, evaluate, and operate the pipeline yourself. The reason I built Sistava the way I did is that most founders do not want a RAG framework; they want an AI Employee that already knows their documents. Upload the PDFs, and the parsing, chunking, embedding, and retrieval happen in the background while the employee answers from your real knowledge with citations. Both paths are valid; pick the one that matches the time you actually have.

Which embedding model and vector database should you pick?

The embedding model market settled around three honest defaults in 2026. OpenAI text-embedding-3-large gives the best quality-per-dollar for English-heavy workloads and ships with a tunable dimension count, which is useful when you want to trade recall for storage. Cohere embed-v3 is strong on multilingual content and clean to integrate. For self-hosted setups, BGE-large or BGE-M3 from BAAI are the open-weight workhorses and run cheaply on a single GPU. On the vector database side, pgvector is the answer for most teams under 10 million chunks because it lives next to your existing Postgres, with HNSW indexes giving low-latency lookups. Pinecone is the easy managed default if you want zero ops. Qdrant is the strongest open-source option for filterable hybrid search. Weaviate and Milvus exist and are fine; pick by what your team already operates.

Benefits

Embedding model

text-embedding-3-large for hosted, BGE-M3 for self-hosted multilingual.

Vector store

pgvector for most teams, Pinecone for zero-ops, Qdrant for open-source hybrid search.

Hybrid retrieval

Combine dense vector search with BM25 keyword search to catch exact terms.

Reranker

Add a cross-encoder reranker like Cohere Rerank or BGE-Reranker to lift top-3 quality.
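To make the hybrid-plus-reranker shape concrete, here is a small sketch. The article does not say how to merge dense and keyword results, so this uses reciprocal rank fusion (RRF) as one common, score-free option; treat that choice as an assumption. The `dense_search`, `keyword_search`, and `rerank` callables are hypothetical placeholders for your vector store, BM25 index, and cross-encoder.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of chunk IDs; each list adds 1 / (k + rank) per ID."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def hybrid_retrieve(
    query: str,
    dense_search: Callable[[str, int], List[str]],    # hypothetical: vector store lookup
    keyword_search: Callable[[str, int], List[str]],  # hypothetical: BM25 lookup
    rerank: Callable[[str, List[str]], List[str]],    # hypothetical: cross-encoder reranker
    fetch_k: int = 20,
    final_k: int = 5,
) -> List[str]:
    """Fetch broadly from both indexes, fuse, then rerank down to a short list."""
    fused = reciprocal_rank_fusion([
        dense_search(query, fetch_k),
        keyword_search(query, fetch_k),
    ])
    return rerank(query, fused[:fetch_k])[:final_k]
```

RRF only looks at rank positions, which sidesteps the problem that cosine similarities and BM25 scores live on different scales. The constant `k=60` is a conventional default, not a tuned value.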

How do you make Q&A grounded and trustworthy?

The last layer is where most RAG demos break in production. A grounded Q&A system needs three things in the prompt: the retrieved chunks with their source metadata, a clear instruction to answer only from the provided context, and a refusal pattern for when the context does not contain the answer. The model should cite chunk IDs or page numbers inline so users can verify. Add a confidence check: if retrieval similarity scores are below a threshold, route to a human or return a clean "I do not have that information" instead of a fabricated answer. Log every query, the retrieved chunks, and the final answer in an observability tool (Langfuse, LangSmith, or your own logging) so you can spot drift. Real production RAG is a feedback loop: you watch the bad answers, fix parsing or chunking or prompting, and re-evaluate against a frozen test set. The systems that work are the ones with that loop running weekly, not the ones with the cleverest initial design.
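The three prompt ingredients and the confidence check fit in a short sketch. The names here (`Retrieved`, `build_grounded_prompt`, `call_llm`) and the `min_score` value of 0.35 are illustrative assumptions. The right threshold depends on your embedding model and should come from your evaluation set.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

REFUSAL = "I do not have that information."


@dataclass
class Retrieved:
    chunk_id: str
    page: int
    text: str
    score: float  # retrieval similarity; higher means more relevant


def build_grounded_prompt(
    question: str,
    chunks: List[Retrieved],
    min_score: float = 0.35,  # placeholder threshold, not a recommendation
) -> Optional[str]:
    """Return a grounded prompt, or None when nothing clears the threshold."""
    usable = [c for c in chunks if c.score >= min_score]
    if not usable:
        return None
    context = "\n\n".join(
        f"[{c.chunk_id} | page {c.page}]\n{c.text}" for c in usable
    )
    return (
        "Answer the question using only the context below.\n"
        "Cite the chunk ID in square brackets after each claim.\n"
        f'If the context does not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(
    question: str,
    chunks: List[Retrieved],
    call_llm: Callable[[str], str],  # hypothetical wrapper around your LLM client
    min_score: float = 0.35,
) -> str:
    """Refuse without calling the model when retrieval came back weak or empty."""
    prompt = build_grounded_prompt(question, chunks, min_score)
    if prompt is None:
        return REFUSAL
    return call_llm(prompt)
```

Short-circuiting before the model call means an empty retrieval can never produce a fabricated answer, and it saves a request. Routing to a human instead of returning the refusal string would slot into the same branch.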

Frequently asked questions

How long does it take to build a production RAG pipeline for PDFs?

A working prototype with LangChain or LlamaIndex takes a day. Getting to production quality (good parsing, evaluated chunking, hybrid retrieval, reranking, monitoring, eval set) is typically 2 to 6 weeks of focused engineering, depending on PDF complexity and how strict your accuracy bar is.

What chunk size works best for PDF RAG?

Start at 500 tokens with 100 tokens of overlap and tune on an evaluation set. Dense reference documents (legal, technical) often want smaller chunks (300-400). Narrative content (reports, articles) tolerates larger chunks (600-800). Always split on structural boundaries before falling back to size.
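Tuning on an evaluation set can be as simple as measuring recall@k for each chunk-size setting. The sketch below assumes an evaluation set of (query, relevant chunk IDs) pairs and a hypothetical `search` function that returns ranked chunk IDs from the index under test.

```python
from typing import Callable, List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    eval_set: List[Tuple[str, Set[str]]],
    search: Callable[[str], List[str]],  # hypothetical: query -> ranked chunk IDs
    k: int = 5,
) -> float:
    """Average recall@k over every query in the evaluation set."""
    if not eval_set:
        return 0.0
    scores = [recall_at_k(search(query), relevant, k) for query, relevant in eval_set]
    return sum(scores) / len(scores)
```

Rebuild the index at each candidate chunk size, run the same frozen queries, and compare the numbers. Because chunk IDs change when you rechunk, in practice you would label relevance by source page or passage rather than by chunk ID.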

Do I need a reranker, or is vector search enough?

Pure vector search is enough for demos. For production, a reranker (Cohere Rerank, BGE-Reranker, or Voyage rerank-2) lifts top-3 precision noticeably, especially on PDFs with repetitive content. Retrieve top 15-20 with vector search, rerank down to top 3-5, and feed those to the LLM.

How do I handle tables and figures inside PDFs?

Tables should be parsed as structured elements and kept whole as a single chunk when they fit. For larger tables, chunk by row with the header repeated. For figures, extract any caption and OCR the figure if it contains text. Tools like LlamaParse and Unstructured handle this better than naive parsers.

Can I skip building this and use a managed service?

Yes. Sistava AI Employees ship with PDF training built in: upload the documents, and the employee parses, chunks, embeds, and retrieves automatically, then answers with citations. It is the right call when you want the outcome (answers grounded in your docs) without operating the pipeline yourself. For deeper customization, building with LlamaIndex or LangChain still wins.

If you want a concrete next step after this overview, the question most founders ask once their RAG pipeline works is how to give that knowledge to a real AI Employee with memory, schedules, and channels. That is a different shape of problem than retrieval quality, and worth reading separately. The piece below walks through hiring an AI Employee that owns a knowledge function end to end, instead of being a search box bolted onto a chat window.

The honest takeaway after building this pipeline more than once is that the algorithm choices matter less than the operating loop. Parsing well, chunking with structure, embedding with a strong default, retrieving with a hybrid plus a reranker, and prompting for grounded answers will get you most of the way there. Everything else (fancier rerankers, custom embeddings, agentic retrieval) is a tuning knob you reach for after you have a baseline you trust. If you want to own that loop, LangChain and LlamaIndex are the right tools, and the patterns above are battle-tested. If you would rather skip the build and have AI Employees who already know your documents on day one, that is exactly what Sistava ships out of the box. Pick whichever path returns weeks of your time and gets you to grounded answers fastest.