How-to, by Mahmoud Zalt
A practical guide to building a RAG pipeline for PDFs: ingest, parse, chunk, embed, index, and answer with citations, plus when to skip the build and use Sistava instead.
RAG (retrieval-augmented generation) for PDFs is the pattern that turns a folder of documents into something you can actually query, with answers grounded in the source text instead of model guesses. The pipeline matters because every step adds or loses fidelity. Bad parsing drops tables and headers, bad chunking splits a sentence across two retrievable units, weak embeddings retrieve the wrong section, and a sloppy prompt makes the model hallucinate even when the right chunk was found. I have rebuilt this stack three times across products, and the gap between a one-evening prototype and something a real user trusts is mostly invisible glue: page-level metadata, OCR fallbacks for scanned PDFs, deduplication across versions, and citation rendering. Most teams underestimate that glue by a factor of five, which is why most internal RAG projects ship a demo, then stall before anyone uses it for real work.
The three load-bearing choices are the PDF parser, the embedding model, and the vector store. For parsing, PyMuPDF is fast and accurate on text-native PDFs, Unstructured handles messy real-world layouts including tables and headers, and a vision model fallback like GPT-4o or Claude with vision rescues scanned or image-heavy pages where OCR alone fails. For embeddings, OpenAI text-embedding-3-small is the default cost-effective pick, Cohere embed v3 is strong on multilingual content, and Voyage AI tends to lead on retrieval benchmarks for English business documents. For the vector store, Qdrant is the cleanest open-source pick, pgvector keeps everything inside Postgres if you already run it, and Pinecone removes ops at a cost. Pick once, instrument retrieval quality early, and resist swapping mid-build because changing any one of these means re-embedding the whole corpus.
PyMuPDF for fast text PDFs, Unstructured for messy layouts with tables, vision model fallback for scans.
OpenAI text-embedding-3-small as default, Voyage AI for top retrieval quality, Cohere v3 for multilingual.
Qdrant for open-source, pgvector if Postgres is already in your stack, Pinecone for zero ops.
LangChain for breadth, LlamaIndex for document-centric pipelines, raw code if you want one less dependency.
Cohere Rerank or BAAI bge-reranker on top-k results lifts answer quality more than swapping the base embedding model.
The five-step pipeline is the same shape whether you build it in 80 lines of Python or buy it. Ingest watches a folder, drive, or upload endpoint and adds new files to a queue. Parse turns each PDF into structured text plus page numbers, headings, and table markers. Chunk splits that text into 200 to 800 token segments with 10 to 15 percent overlap and attaches metadata (source, page, section, version). Embed converts each chunk into a vector and writes it to the index alongside the metadata. At query time the user question becomes a vector, the index returns top-k matches, an optional reranker reorders them, and the model writes the answer with citations back to the original chunks. Each step has a quality knob, and tuning the knobs is where most of the work lives.
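The chunk step described above can be sketched in a few lines. This is a minimal illustration, not a production chunker: it treats whitespace-separated words as a stand-in for tokens (an assumption; a real tokenizer would give different counts), and the `Chunk` fields and the `chunk_page` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # file the chunk came from
    page: int    # page number, kept for citations
    index: int   # position of the chunk within the page


def chunk_page(text: str, source: str, page: int,
               size: int = 400, overlap_ratio: float = 0.125) -> list[Chunk]:
    """Split one page of text into overlapping windows of about `size` tokens.

    Words approximate tokens here (an assumption for the sketch).
    """
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    words = text.split()
    # Each window starts `step` words after the previous one, so consecutive
    # windows share `size - step` words.
    step = max(1, size - int(size * overlap_ratio))
    chunks: list[Chunk] = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append(Chunk(" ".join(window), source, page, len(chunks)))
        if start + size >= len(words):
            break  # the last window already reached the end of the page
    return chunks
```

With the defaults, a 1,000-word page yields three chunks, and each chunk repeats the last 50 words of the one before it. Section and version metadata would be attached the same way as `source` and `page`.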
Reading the pipeline laid out flat makes it look approachable, and the prototype really is approachable. The cost shows up later: handling 500-page contracts, scanned invoices mixed with native PDFs, document versions that supersede each other, multi-tenant isolation, and answering follow-up questions that depend on the previous turn. Each of those layers is a week of work that does not feel like progress and does not show up in any tutorial. This is the moment most founders ask whether the build is worth it for a side concern, or whether the time is better spent on the product the customer is actually paying for.
If your goal is an AI Employee that reads your PDFs and answers questions in chat, email, or Slack with citations, that is exactly the shape Sistava ships out of the box. You upload the documents, the employee handles ingest, embedding, retrieval, and answering, and you do not write a single line of pipeline code. For teams whose core product is something other than infrastructure, this trade is almost always worth it. If your core product is a search or document platform, building the pipeline yourself remains the right call because you want to own every retrieval knob.
Trustworthy retrieval comes from four practices that are easy to skip and expensive to add back. First, hybrid search: combine dense vector search with sparse BM25 keyword search, because vectors miss exact identifiers and BM25 misses semantic paraphrases. Second, reranking: pass the top 20 or 50 chunks through a cross-encoder reranker like Cohere Rerank before sending to the model, which lifts answer accuracy more than any embedding swap. Third, strict citation prompts: instruct the model to answer only from retrieved context and to include source plus page numbers inline, then refuse if no relevant chunk was found. Fourth, an evaluation harness: a small fixed set of question and expected-answer pairs you re-run on every change, so you notice when a parser tweak silently broke retrieval. Without all four, RAG demos look great and fail quietly the moment a real user asks something unexpected.
Combine dense vectors with BM25 keyword search so you catch both semantic matches and exact identifiers.
Rerank the top 20-50 chunks with Cohere Rerank or bge-reranker before passing to the model.
Force the model to answer only from retrieved context and refuse when no relevant chunk is found.
Run a fixed question-answer set on every change to catch silent regressions before users see them.
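One common way to combine the dense and BM25 result lists is reciprocal rank fusion (RRF). The article does not name a fusion method, so treat RRF as my choice for this sketch. The function name, the chunk IDs, and the constant `k=60` are illustrative assumptions; the two input rankings would come from your vector store and your keyword index.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks that
    rank well in more than one list rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c3", "c1", "c7"]  # hypothetical vector-search results
bm25_hits = ["c9", "c3", "c1"]   # hypothetical keyword-search results
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Here `c3` and `c1` appear in both lists and end up ahead of `c9`, even though `c9` was the top keyword hit. The fused list is what you would hand to the reranker.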
Build it yourself when the document workflow is the product you are charging for, when you need custom ranking signals that no platform exposes, or when compliance forces a specific deployment topology like on-premise embeddings. In those cases LangChain or LlamaIndex plus Qdrant plus your chosen embedding model is the right stack, and the engineering time pays back because the pipeline IS the moat. Use a platform when document Q&A is internal tooling, a feature inside a larger product, or part of an AI Employee role like a researcher, sales rep, or support agent who needs to read your knowledge base. Sistava ships ingest, embed, retrieve, and answer per employee with citations, memory, and channel delivery already wired. Lindy and CrewAI cover overlapping territory at the framework level, and n8n with a vector node works for lightweight DIY flows. The honest test: if no customer pays you specifically for retrieval quality, buy the pipeline and spend the time on what they do pay for.
Between 200 and 800 tokens with 10 to 15 percent overlap is the sweet spot for most business documents. Smaller chunks improve precision on factual lookups, larger chunks preserve context for reasoning questions. Test both with your real questions before locking the value in.
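Comparing chunk sizes against real questions needs a fixed evaluation set and one number to watch. The sketch below measures hit rate at k: the share of questions for which the expected page appears in the top-k results. Scoring by expected source page instead of by expected answer text is a simplification I am assuming here, and `hit_rate_at_k` and the `retrieve` callable are hypothetical names standing in for your own retrieval function.

```python
from typing import Callable

# A case pairs a question with the (source, page) a correct answer must cite.
EvalCase = tuple[str, tuple[str, int]]
# A retriever takes (question, k) and returns (source, page) hits, best first.
Retriever = Callable[[str, int], list[tuple[str, int]]]


def hit_rate_at_k(cases: list[EvalCase], retrieve: Retriever, k: int = 5) -> float:
    """Fraction of questions whose expected page is among the top-k hits."""
    if not cases:
        return 0.0
    hits = sum(
        1 for question, expected in cases
        if expected in retrieve(question, k)[:k]
    )
    return hits / len(cases)
```

Build one index per chunk size, run the same cases against each, and keep the setting with the higher hit rate. Re-running the same set after a parser or embedding change shows whether retrieval regressed.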
Qdrant is the cleanest open-source pick with fast filtering and hybrid search, pgvector is best if you already run Postgres and want one less moving part, and Pinecone removes all ops at a per-month cost. Weaviate and Milvus are credible alternatives for larger scale.
Adding a reranker is usually worth it. A cross-encoder reranker like Cohere Rerank applied to the top 20 to 50 vector results typically lifts answer accuracy by 10 to 25 percent and helps more than swapping the base embedding model. The latency cost is 100 to 300 milliseconds per query.
Add an OCR step using Tesseract or AWS Textract for scanned pages, and use a vision-capable model like GPT-4o or Claude for pages with charts and diagrams. Mark image-derived chunks in metadata so the prompt knows the source quality varies.
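A simple way to decide which pages need the OCR or vision path is to check how much text the parser recovered from each page. The sketch below assumes you already have one extracted string per page; the 50-character threshold and the names `route_page` and `tag_pages` are illustrative assumptions, not values from any library.

```python
def route_page(extracted_text: str, min_chars: int = 50) -> str:
    """Return 'native' if the text layer looks usable, otherwise 'ocr'."""
    return "native" if len(extracted_text.strip()) >= min_chars else "ocr"


def tag_pages(pages: list[str], source: str, min_chars: int = 50) -> list[dict]:
    """Build per-page metadata that records how each page should be read."""
    tagged = []
    for number, text in enumerate(pages, start=1):
        route = route_page(text, min_chars)
        tagged.append({
            "source": source,
            "page": number,
            "route": route,
            # Carried into chunk metadata so the prompt can flag lower-quality text.
            "image_derived": route == "ocr",
        })
    return tagged
```

Pages routed to `ocr` go through Tesseract, Textract, or a vision model, and the `image_derived` flag travels with every chunk cut from them.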
Sistava can handle this without you writing pipeline code. Upload PDFs to the employee, and Sistava handles ingest, embedding, retrieval, and answering with citations across chat, email, and Slack. You pick the employee role, the platform runs the pipeline, and pricing starts at {PERSONAL_USD}.
If your interest in PDF RAG comes from wanting an AI Employee that reads company knowledge and answers questions across channels, the next read goes one level deeper on deployment. It covers how to scope the knowledge base, which roles benefit most, and the integrations that turn document Q&A from a chat toy into actual internal tooling people use weekly. Treat it as the companion piece for anyone choosing buy over build.
The honest framing for building RAG over PDFs: the prototype is a weekend, the production system is a quarter, and the gap is mostly metadata, evaluation, and edge cases. If retrieval IS your product, that quarter is the most valuable engineering you will do, and the pipeline you own becomes the moat. If retrieval is a means to a different end (an AI Employee that reads your contracts, a support agent that quotes your docs, a researcher that summarizes whitepapers), a platform that handles ingest, embed, retrieve, and answer in the box pays back inside the first month. Sistava fits the second case cleanly, ships citations and memory per employee, and frees the founder to spend that quarter on the product the customer is paying for. Pick the path that matches what your users actually pay you for, and the rest sorts itself out.