How to Build RAG for PDFs From Ingest to Q&A

How-to — 2026-01-11 — by Mahmoud Zalt

A practical guide to building a RAG pipeline for PDFs: ingest, parse, chunk, embed, index, and answer with citations, plus when to skip and use Sistava.

What is RAG for PDFs and why does the pipeline matter?

RAG (retrieval-augmented generation) for PDFs is the pattern that turns a folder of documents into something you can ask questions of, with answers grounded in the source text instead of model guesses. The pipeline matters because every step can add or lose fidelity: bad parsing drops tables and headers, bad chunking splits a sentence across two retrievable units, weak embeddings retrieve the wrong section, and a sloppy prompt lets the model hallucinate even when the right chunk was found. I have rebuilt this stack three times across products, and the gap between a one-evening prototype and something a real user trusts is mostly invisible glue: page-level metadata, OCR fallbacks for scanned PDFs, deduplication across versions, and citation rendering. Most teams underestimate that glue by a factor of five, which is why most internal RAG projects ship a demo and then stall before anyone uses it for real work.

At a Glance

5: Pipeline stages (ingest, parse, chunk, embed, retrieve)
200-800: Recommended chunk size in tokens
3-8: Typical top-k chunks per query
10-15%: Token overlap between adjacent chunks

How do you choose the parser, embeddings, and vector store?

The three load-bearing choices are the PDF parser, the embedding model, and the vector store. For parsing, PyMuPDF is fast and accurate on text-native PDFs, Unstructured handles messy real-world layouts including tables and headers, and a vision model fallback like GPT-4o or Claude with vision rescues scanned or image-heavy pages where OCR alone fails. For embeddings, OpenAI text-embedding-3-small is the default cost-effective pick, Cohere Embed v3 is strong on multilingual content, and Voyage AI tends to lead on retrieval benchmarks for English business documents. For the vector store, Qdrant is the cleanest open-source pick, pgvector keeps everything inside Postgres if you already run it, and Pinecone removes ops at a cost. Pick once, instrument retrieval quality early, and resist swapping mid-build: changing the parser or the embedding model means re-embedding the whole corpus, and changing the vector store means re-indexing it.

Benefits

PDF parser

PyMuPDF for fast text PDFs, Unstructured for messy layouts with tables, vision model fallback for scans.

Embedding model

OpenAI text-embedding-3-small as default, Voyage AI for top retrieval quality, Cohere v3 for multilingual.

Vector store

Qdrant for open-source, pgvector if Postgres is already in your stack, Pinecone for zero ops.

Orchestration

LangChain for breadth, LlamaIndex for document-centric pipelines, raw code if you want one less dependency.

Reranker

Cohere Rerank or BAAI bge-reranker on top-k results lifts answer quality more than swapping the base embedding model.
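
The parser choice usually ends up as a per-page routing decision and not a single library. Here is a minimal sketch of that routing, assuming you already have the native text for a page (for example from PyMuPDF) and two fallback callables. The names extract_page, run_ocr, run_vision, and the MIN_TEXT_CHARS threshold are hypothetical and only there for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical threshold: pages with fewer extracted characters than this
# are treated as scanned or image-heavy. Tune it on your own documents.
MIN_TEXT_CHARS = 40


@dataclass
class PageResult:
    page: int          # 1-based page number, kept for citations
    text: str          # text that goes on to chunking
    extraction: str    # "native", "ocr", or "vision"


def extract_page(
    page: int,
    native_text: str,
    run_ocr: Callable[[int], str],
    run_vision: Callable[[int], str],
) -> PageResult:
    """Try the cheapest extraction first and fall back only when needed."""
    if len(native_text.strip()) >= MIN_TEXT_CHARS:
        return PageResult(page, native_text.strip(), "native")
    # Native text is too thin, so the page is probably a scan: try OCR.
    ocr_text = run_ocr(page).strip()
    if len(ocr_text) >= MIN_TEXT_CHARS:
        return PageResult(page, ocr_text, "ocr")
    # OCR also came back thin: hand the page to a vision model.
    return PageResult(page, run_vision(page).strip(), "vision")
```

Recording the extraction path on every page is cheap, and it lets you filter or down-weight OCR and vision output later without re-parsing the corpus.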

What does the ingest to Q&A pipeline look like step by step?

The five-step pipeline has the same shape whether you build it in 80 lines of Python or buy it. Ingest watches a folder, drive, or upload endpoint and adds new files to a queue. Parse turns each PDF into structured text plus page numbers, headings, and table markers. Chunk splits that text into 200 to 800 token segments with 10 to 15 percent overlap and attaches metadata (source, page, section, version). Embed converts each chunk into a vector and writes it to the index alongside the metadata. Retrieve and answer runs at query time: the user question becomes a vector, the index returns the top-k matches, an optional reranker reorders them, and the model writes the answer with citations back to the original chunks. Each step has a quality knob, and tuning the knobs is where most of the work lives.
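
The chunk step is the easiest one to get subtly wrong, so here is a minimal sketch of sentence-aware chunking with overlap. It assumes whitespace-separated words are a good enough proxy for tokens (a real tokenizer will count differently) and uses a naive regex sentence splitter. The names Chunk and chunk_page are illustrative, not from any library.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str   # file name, kept for citations
    page: int     # page number, kept for citations


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., !, or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_page(text: str, source: str, page: int,
               max_tokens: int = 400, overlap_ratio: float = 0.12) -> list[Chunk]:
    """Pack whole sentences into chunks; max_tokens is a soft target."""
    budget = int(max_tokens * overlap_ratio)  # words carried into the next chunk
    chunks: list[Chunk] = []
    current: list[str] = []   # sentences in the chunk being built
    current_len = 0           # approximate token count (words)
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(Chunk(" ".join(current), source, page))
            # Carry trailing sentences forward while they fit the overlap budget.
            tail: list[str] = []
            tail_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if tail_len + prev_len > budget:
                    break
                tail.insert(0, prev)
                tail_len += prev_len
            current, current_len = tail, tail_len
        # A single sentence longer than max_tokens becomes its own oversized chunk.
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(Chunk(" ".join(current), source, page))
    return chunks
```

Because the overlap is made of whole sentences, a chunk never starts or ends mid-sentence, at the cost of the overlap being a little under the configured ratio.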

The five-step PDF RAG pipeline

Ingest — Watch a folder, drive, or upload endpoint, dedupe by hash, and queue new and updated PDFs for processing.
Parse — Extract text with PyMuPDF or Unstructured, OCR scanned pages, and preserve page numbers, headings, and tables as metadata.
Chunk — Split into 200-800 token segments with 10-15 percent overlap, never split mid-sentence, and attach source plus page metadata.
Embed and index — Embed each chunk and upsert into Qdrant or pgvector with vector, text, and metadata for citation lookup.
Retrieve and answer — Embed the query, fetch top-k matches, rerank, and prompt the model to answer only from retrieved context with citations.
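
To make the flow concrete, here is a toy, in-memory version of ingest, index, and retrieve using only the standard library. Everything in it is a stand-in: toy_embed is a bag-of-words counter instead of a real embedding model, TinyIndex replaces Qdrant or pgvector, and each page is treated as one chunk for brevity. It shows the shape of the steps, not production code.

```python
import hashlib
import math
import re
from collections import Counter


def file_hash(data: bytes) -> str:
    # Content hash used to skip files that were already ingested.
    return hashlib.sha256(data).hexdigest()


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyIndex:
    """In-memory stand-in for a vector store with metadata."""

    def __init__(self) -> None:
        self.seen_hashes: set[str] = set()
        self.records: list[tuple[Counter, str, dict]] = []  # (vector, text, metadata)

    def ingest(self, source: str, data: bytes, pages: list[str]) -> bool:
        """Index one document; return False if identical bytes were seen before."""
        digest = file_hash(data)
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        for number, text in enumerate(pages, start=1):
            metadata = {"source": source, "page": number}
            self.records.append((toy_embed(text), text, metadata))
        return True

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, str, dict]]:
        """Return the top_k (score, text, metadata) records for the query."""
        query_vec = toy_embed(query)
        scored = [(cosine(query_vec, vec), text, meta) for vec, text, meta in self.records]
        scored.sort(key=lambda record: record[0], reverse=True)
        return scored[:top_k]
```

Storing the text and the metadata next to the vector is what makes citations possible later: the answer step only has to echo back source and page from the records it was given.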

Reading the pipeline laid out flat makes it look approachable, and the prototype really is approachable. The cost shows up later: handling 500-page contracts, scanned invoices mixed with native PDFs, document versions that supersede each other, multi-tenant isolation, and answering follow-up questions that depend on the previous turn. Each of those layers is a week of work that does not feel like progress and does not show up in any tutorial. This is the moment most founders ask whether the build is worth it for a side concern, or whether the time is better spent on the product the customer is actually paying for.

If your goal is an AI Employee that reads your PDFs and answers questions in chat, email, or Slack with citations, that is exactly the shape Sistava ships out of the box. You upload the documents, the employee handles ingest, embedding, retrieval, and answering, and you do not write a single line of pipeline code. For teams whose core product is something other than infrastructure, this trade is almost always worth it. If your core product is a search or document platform, building the pipeline yourself remains the right call because you want to own every retrieval knob.

How do you make retrieval and citations actually trustworthy?

Trustworthy retrieval comes from four practices that are easy to skip and expensive to add back. First, hybrid search: combine dense vector search with sparse BM25 keyword search, because vectors miss exact identifiers and BM25 misses semantic paraphrases. Second, reranking: pass the top 20 or 50 chunks through a cross-encoder reranker like Cohere Rerank before sending to the model, which lifts answer accuracy more than any embedding swap. Third, strict citation prompts: instruct the model to answer only from retrieved context and to include source plus page numbers inline, then refuse if no relevant chunk was found. Fourth, an evaluation harness: a small fixed set of question and expected-answer pairs you re-run on every change, so you notice when a parser tweak silently broke retrieval. Without all four, RAG demos look great and fail quietly the moment a real user asks something unexpected.
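
For the hybrid search practice, one common way to merge the dense and BM25 result lists is reciprocal rank fusion (RRF). That is my choice for this sketch and only one of several fusion methods. It needs only the ranked chunk ids from each retriever, not comparable scores. The chunk ids below are made up, and k=60 is the conventional default for RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each chunk scores 1 / (k + rank) per list it appears in, so chunks that
    rank well in both dense and keyword search rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, using the id to break ties deterministically.
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused[:top_n]


# Hypothetical results from the two retrievers, best match first.
dense_hits = ["c7", "c2", "c9"]     # vector search
keyword_hits = ["c4", "c7", "c1"]   # BM25
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
# fused == ["c7", "c4", "c2", "c1", "c9"]
```

The chunk that both retrievers agree on (c7) wins, while a chunk found only by exact keyword match (c4) still makes the list.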

Benefits

Hybrid search

Combine dense vectors with BM25 keyword search so you catch both semantic matches and exact identifiers.

Cross-encoder reranking

Rerank the top 20-50 chunks with Cohere Rerank or bge-reranker before passing to the model.

Strict citation prompt

Force the model to answer only from retrieved context and refuse when no relevant chunk is found.

Evaluation harness

Run a fixed question-answer set on every change to catch silent regressions before users see them.
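
A strict citation prompt can be sketched as a small builder function. The wording of the prompt, the REFUSAL text, and the min_score cutoff are all assumptions for illustration. The point is the structure: number the chunks, attach source and page, and refuse before calling the model when nothing relevant was retrieved.

```python
REFUSAL = "I could not find this in the provided documents."


def build_prompt(question, chunks, min_score=0.25):
    """Build a context-only prompt, or return None when nothing is relevant.

    chunks: list of dicts with 'text', 'source', 'page', and 'score' keys.
    min_score is a hypothetical relevance cutoff; calibrate it on real queries.
    """
    relevant = [c for c in chunks if c["score"] >= min_score]
    if not relevant:
        # Caller should answer with REFUSAL and skip the model call entirely.
        return None
    context = "\n\n".join(
        f"[{i}] ({c['source']}, p. {c['page']})\n{c['text']}"
        for i, c in enumerate(relevant, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        "Cite every claim inline as (source, p. N), taken from the context entries.\n"
        f'If the context does not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Refusing in code when no chunk clears the cutoff is more reliable than asking the model to refuse, so it is worth doing both.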

When should you build RAG yourself versus use a platform like Sistava?

Build it yourself when the document workflow is the product you are charging for, when you need custom ranking signals that no platform exposes, or when compliance forces a specific deployment topology like on-premise embeddings. In those cases LangChain or LlamaIndex plus Qdrant plus your chosen embedding model is the right stack, and the engineering time pays back because the pipeline IS the moat. Use a platform when document Q&A is internal tooling, a feature inside a larger product, or part of an AI Employee role like a researcher, sales rep, or support agent who needs to read your knowledge base. Sistava ships ingest, embed, retrieve, and answer per employee with citations, memory, and channel delivery already wired. Lindy and CrewAI cover overlapping territory at the framework level, and n8n with a vector node works for lightweight DIY flows. The honest test: if no customer pays you specifically for retrieval quality, buy the pipeline and spend the time on what they do pay for.

Frequently asked questions

What is the best chunk size for PDF RAG?

Between 200 and 800 tokens with 10 to 15 percent overlap is the sweet spot for most business documents. Smaller chunks improve precision on factual lookups, while larger chunks preserve context for reasoning questions. Test both with your real questions before locking the value in.
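
Testing two chunk sizes against real questions can be as simple as a hit-rate check. This sketch assumes each configuration is wrapped in a retrieve function that returns (source, page) pairs. The function names and the shape of the evaluation set are hypothetical.

```python
def hit_rate(retrieve, eval_set, top_k=5):
    """Fraction of questions whose expected page shows up in the top-k results.

    retrieve(question, top_k) must return a list of (source, page) tuples.
    eval_set is a list of (question, expected_source, expected_page) tuples.
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for question, source, page in eval_set:
        if (source, page) in retrieve(question, top_k)[:top_k]:
            hits += 1
    return hits / len(eval_set)


def compare(configs, eval_set, top_k=5):
    """configs maps a label such as 'chunks-200' to its retrieve function."""
    return {label: hit_rate(fn, eval_set, top_k) for label, fn in configs.items()}
```

Hit rate only measures retrieval, not answer quality, but it is cheap enough to re-run on every parser or chunking change, which is where silent regressions tend to come from.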

Which vector database should I pick in 2026?

Qdrant is the cleanest open-source pick with fast filtering and hybrid search, pgvector is best if you already run Postgres and want one less moving part, and Pinecone removes all ops at a per-month cost. Weaviate and Milvus are credible alternatives for larger scale.

Do I need a reranker on top of vector search?

Usually yes. A cross-encoder reranker like Cohere Rerank applied to the top 20 to 50 vector results typically lifts answer accuracy by 10 to 25 percent and helps more than swapping the base embedding model. The latency cost is 100 to 300 milliseconds per query.
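
The reranking stage itself is a small wrapper: fetch a wide candidate set, re-score each candidate against the query, and keep the best few. In this sketch, overlap_score is a toy stand-in so the code runs without a model. In practice you would pass a function backed by Cohere Rerank or bge-reranker as score_pair. All names here are illustrative.

```python
import re


def overlap_score(query: str, passage: str) -> float:
    # Toy relevance score: share of query terms that appear in the passage.
    # A real cross-encoder scores the (query, passage) pair jointly with a model.
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    p = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(q & p) / len(q) if q else 0.0


def rerank(query, candidates, score_pair=overlap_score, keep=5):
    """Re-score candidate passages against the query and keep the best few."""
    scored = [(score_pair(query, text), i, text) for i, text in enumerate(candidates)]
    # Sort by score descending; the original retrieval position breaks ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:keep]]
```

Keeping the scorer pluggable means the retrieval code does not change when you swap rerankers, and the evaluation set can compare them directly.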

How do I handle scanned PDFs and images inside documents?

Add an OCR step using Tesseract or AWS Textract for scanned pages, and use a vision-capable model like GPT-4o or Claude for pages with charts and diagrams. Mark image-derived chunks in metadata so the prompt can warn the model that extraction quality varies.

Can Sistava AI Employees answer questions from my PDFs without me building the pipeline?

Yes. Upload PDFs to the employee, and Sistava handles ingest, embedding, retrieval, and answering with citations across chat, email, and Slack. You pick the employee role and the platform runs the pipeline. See the pricing page for current plans.

If your interest in PDF RAG comes from wanting an AI Employee that reads company knowledge and answers questions across channels, the next read goes one level deeper on deployment. It covers how to scope the knowledge base, which roles benefit most, and the integrations that turn document Q&A from a chat toy into actual internal tooling people use weekly. Treat it as the companion piece for anyone choosing buy over build.

The honest framing for building RAG over PDFs: the prototype is a weekend, the production system is a quarter, and the gap is mostly metadata, evaluation, and edge cases. If retrieval IS your product, that quarter is the most valuable engineering you will do, and the pipeline you own becomes the moat. If retrieval is a means to a different end (an AI Employee that reads your contracts, a support agent that quotes your docs, a researcher that summarizes whitepapers), a platform that handles ingest, embed, retrieve, and answer out of the box pays back inside the first month. Sistava fits the second case cleanly, ships citations and memory per employee, and frees the founder to spend that quarter on the product the customer is paying for. Pick the path that matches what your users actually pay you for, and the rest sorts itself out.