Chat with PDFs

If you have a lot of documents like PDFs and slides in your company knowledge base, you might want to find a specific page or figure just by describing it.

In this cookbook, we build a document search engine with Q&A using Mixedbread. It showcases rich document support (PDFs, scanned files, PPTX), high-quality parsing, and native question answering - everything you need to find and chat with your documents.

Prerequisites & Setup

Before you begin, make sure you have:

  1. API Key: Get your API key from the Mixedbread platform
  2. SDK: Install the Mixedbread SDK for your preferred language:
pip install mixedbread
  3. (Optional) For the RAG with OpenAI section, you'll need an OpenAI API key and the OpenAI SDK:
pip install openai

Get Sample Documents

We'll use a sample of 3 documents (PDFs, scanned files, and PPTX). Download and extract:

curl -L -o doc-qa-sample.zip https://github.com/mixedbread-ai/cookbook-assets/releases/download/v1.0.0/doc-qa-sample.zip
unzip doc-qa-sample.zip

The sample includes:

  • attention.pdf - 3 PDF pages from the paper "Attention Is All You Need"
  • UFO.pdf - 3 PDF pages of a scanned UFO document released under FOIA
  • OBGYN.pptx - 3 PPTX slides from an RCSI presentation on Obstetrics and Gynecology

Create a Store

First, create a Mixedbread store for your documents:

Create Document Store
from mixedbread import Mixedbread

mxbai = Mixedbread(api_key="YOUR_API_KEY")

store = mxbai.stores.create(
  name="doc-qa-cookbook"
)

Upload Documents

Upload your PDFs and slides:

Upload PDFs and Slides
from pathlib import Path

for filename in ["attention.pdf", "UFO.pdf", "OBGYN.pptx"]:
  mxbai.stores.files.upload(
      store_identifier="doc-qa-cookbook",
      file=Path(filename)
  )
  print(f"Uploaded: {filename}")

(Optional) High Quality Parsing

Search works directly on visual content (screenshots of pages), so Mixedbread stores can find relevant pages even when the embedded text is poor or missing, as in scans.

For downstream tasks, Mixedbread stores return one result per page, each with image_url and ocr_text fields. When your LLM doesn't support images or struggles with complex visuals, you can use the ocr_text field to provide more context.

Enable high quality parsing to automatically get better OCR text:

Upload with High Quality Parsing
for filename in ["attention.pdf", "UFO.pdf", "OBGYN.pptx"]:
  mxbai.stores.files.upload(
      store_identifier="doc-qa-cookbook",
      file=Path(filename),
      config={
          "parsing_strategy": "high_quality",
      }
  )

  • Fast (default): Standard PDF text extraction; quick and efficient
  • High quality: Runs advanced OCR to extract cleaner, more accurate text

Search Documents

Once your documents are uploaded and indexed, search using natural language:

Search Documents
results = mxbai.stores.search(
  store_identifiers=["doc-qa-cookbook"],
  query="architecture diagram showing the transformer model layers",
  top_k=1
)

for result in results.data:
  print(f"{result.score:.3f} - {result.filename}")
  print(f"  {result.ocr_text[:200]}...")

Try searching for content in the scanned UFO document or the RCSI slides:

Search Scanned Documents
ufo_results = mxbai.stores.search(
  store_identifiers=["doc-qa-cookbook"],
  query="Eastern Air Line pilots UFO sighting",
  top_k=1
)

rcsi_results = mxbai.stores.search(
  store_identifiers=["doc-qa-cookbook"],
  query="Acceptable MCA Doppler shift waveform",
  top_k=1
)

Build Q&A with LLMs

Now for the main event: getting AI-powered answers from your documents.

Native Question Answering

Mixedbread stores provide question answering capabilities out of the box:

Native Question Answering
response = mxbai.stores.question_answering(
  store_identifiers=["doc-qa-cookbook"],
  query="What date and location did each sighting of UFO occur?",
  top_k=2
)

print("Answer:", response.answer)
print("\nSources:")
for i, source in enumerate(response.sources):
  print(f"  [{i}] {source.filename} - {source.ocr_text[:200]}...")

The response includes:

  • answer: AI-generated answer with citation tags like <cite i="0"/>
  • sources: The document pages used to generate the answer

Citations map directly to sources - <cite i="0"/> refers to sources[0].
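
Because each <cite i="N"/> tag points at sources[N], you can turn the answer into plain text with numbered footnotes using only the standard library. A minimal sketch (the answer and sources fields are the ones shown above; the regex is the only new piece):

Render Citations as Footnotes
import re

# Replace each <cite i="N"/> tag with a bracketed marker like [N],
# then append the referenced source filenames as footnotes.
def render_citations(answer, sources):
  text = re.sub(r'<cite i="(\d+)"\s*/>', r'[\1]', answer)
  footnotes = "\n".join(
      f"[{i}] {source.filename}" for i, source in enumerate(sources)
  )
  return f"{text}\n\n{footnotes}"

print(render_citations(response.answer, response.sources))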

RAG with OpenAI

Want more control? Use Mixedbread stores for retrieval and OpenAI or other LLM providers for generation:

  1. Retrieve - Call Mixedbread to search documents with your query
  2. Build context - Convert chunks to messages with image URLs and OCR text
  3. Generate - Call the LLM with the query and context to get an answer
RAG with OpenAI
from openai import OpenAI

openai = OpenAI(api_key="YOUR_OPENAI_KEY")

# Step 1: Retrieve relevant context from Mixedbread
query = "Explain the self-attention mechanism"

results = mxbai.stores.search(
  store_identifiers=["doc-qa-cookbook"],
  query=query,
  top_k=2
)

# Step 2: Build multimodal content with images and text
content = [{"type": "input_text", "text": f"Question: {query}\n\nHere are the relevant document pages:"}]

for i, r in enumerate(results.data):
  # Add the page image for visual context
  content.append({
      "type": "input_image",
      "image_url": r.image_url.url
  })
  # Add OCR text and source info
  content.append({
      "type": "input_text",
      "text": f"[Source {i}: {r.filename}]\n{r.ocr_text}"
  })

# Step 3: Generate answer with OpenAI Responses API
response = openai.responses.create(
  model="gpt-5",
  instructions="Answer questions based on the provided document pages. Cite sources when possible.",
  input=content
)

print(response.output_text)

This pattern gives you full control over:

  • The prompt template and system message
  • Which model to use (GPT-5, Claude, etc.)
  • How to format and present citations
  • Streaming responses (sketched below)
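
For streaming, the same Responses API call accepts stream=True and yields events as the model generates; printing the text deltas gives a live answer. A minimal sketch reusing the content list built in Step 2:

Stream the Answer
stream = openai.responses.create(
  model="gpt-5",
  instructions="Answer questions based on the provided document pages. Cite sources when possible.",
  input=content,
  stream=True
)

# Print text deltas as they arrive for a token-by-token answer
for event in stream:
  if event.type == "response.output_text.delta":
      print(event.delta, end="", flush=True)
print()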

Next Steps

You now have a working document Q&A system. Here are ways to extend it:

  • Add more documents: Upload your own PDFs, Word docs, or presentations
  • Filter by metadata: Create document collections by topic, date, or author
  • Build a chat interface: Create a conversational UI with chat history (a minimal sketch follows)
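
As a starting point, here is a bare-bones terminal chat loop that reuses the native question_answering call from above; each turn retrieves fresh pages from the store, so no extra retrieval plumbing is needed:

Minimal Chat Loop (sketch)
# A simple REPL over the store using the native question answering
# shown earlier. Enter an empty line to quit.
while True:
  query = input("You: ").strip()
  if not query:
      break
  response = mxbai.stores.question_answering(
      store_identifiers=["doc-qa-cookbook"],
      query=query,
      top_k=2
  )
  print(f"Assistant: {response.answer}\n")
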
Last updated: February 2, 2026