BrowseComp-Plus
Multi-hop deep-research retrieval over a 100k-doc web corpus.
Dataset
BrowseComp-Plus is a deep-research benchmark with multi-hop questions over ~100k web documents, designed to isolate retrieval quality from model capability. We report results in both the default (standardized) scaffold and the stronger get_document scaffold.
Evaluation methodology
We evaluate retrieval pipelines using the BrowseComp-Plus harness, reporting accuracy and tool-call counts as published on the benchmark leaderboard.
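The two reported metrics can be aggregated from per-question run records. This is a minimal sketch, not the harness's actual API: the `QuestionResult` record and its field names are hypothetical, assuming the harness exposes a per-question correctness flag and a tool-call count.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    # Hypothetical per-question record; field names are illustrative
    # assumptions, not the BrowseComp-Plus harness's actual schema.
    correct: bool      # whether the graded answer matched
    tool_calls: int    # number of tool invocations the agent made


def summarize(results: list[QuestionResult]) -> tuple[float, float]:
    """Return (accuracy, mean tool calls) over one evaluation run."""
    n = len(results)
    accuracy = sum(r.correct for r in results) / n
    mean_calls = sum(r.tool_calls for r in results) / n
    return accuracy, mean_calls


# Toy run of three questions: two correct, 50 tool calls in total.
run = [
    QuestionResult(correct=True, tool_calls=12),
    QuestionResult(correct=False, tool_calls=30),
    QuestionResult(correct=True, tool_calls=8),
]
acc, calls = summarize(run)
```

Here `acc` is 2/3 and `calls` is 50/3, matching the leaderboard's two headline numbers for a run.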
Testing dates
March 2026