Research Staff - Synthetic Data

Research, Full-time

Remote

We encourage you to apply even if you don't meet 100% of the requirements

At Mixedbread, we believe that search is the ultimate bottleneck for AI. Our mission is to create the next generation of Information Retrieval systems: truly multimodal, transformer-native, and capable of surfacing just the right context across any format.

About the role

We are building a full-stack retrieval research lab that rethinks how search works in the era of large models. Our research spans late interaction methods, multimodal retrievers, and LLM-augmented search, with the goal of pushing retrieval beyond its current limits. Building retrieval for every imaginable use case requires enough data for our models to learn how to represent all kinds of information in all kinds of scenarios. Synthetic data is becoming a core component of efficient models, and recent efforts have shown that good synthetic data takes far more than a one-and-done model call. How to build the best synthetic data pipelines for retrieval is still an open question: if you would like to be the person breaking new ground in this area, this role is for you.

What you'll do

Design synthetic data generation pipelines across modalities
Lead and contribute to applied research projects in Information Retrieval
Advance Omni, our core search platform, improving retrieval performance and efficiency
Create innovative feedback loops to continuously improve data generation
Collaborate across engineering and product teams to identify key areas for improvement in our models' performance and design the data needed to close those gaps
Publish and share results through papers, blog posts, and conferences

Research Directions

Late Interaction: Designing the next generation of fine-grained retrieval models beyond ColBERT
Omni-Modality: Building retrievers that unify text, image, and beyond
Data Refinement: Improving existing training datasets by modifying and filtering them to raise both training efficiency and downstream performance
End-to-End Synthetic Pipelines: Understanding every stage of the pipeline, from generation to quality scoring to rewriting and curation, and defining new components as needed

What we're looking for

Experience in information retrieval, embeddings, or related fields.
A deep interest in synthetic data pipelines and data curation work.
A degree is appreciated but not mandatory for this position.
Proficiency in Python and the ability to implement models from scratch
Passion for advancing search beyond single-vector similarity methods
Strong written and verbal communication skills

Nice to have

Publications in top-tier conferences (NeurIPS, ICML, SIGIR, ACL, etc.) or high-quality blog posts.
Open-source contributions to data, IR, NLP, or ML libraries.
Familiarity with retrieval and late interaction systems.

What we offer

Competitive compensation + equity
Comprehensive health, dental, and vision coverage
Visa sponsorship + relocation support
Professional development budget
Access to the best AI tools and subscriptions
Team off-sites + conference attendance
Transportation support
Wellness support, including gym membership and sports club subscriptions
Food support

Mixedbread is an equal opportunity employer committed to building a diverse and inclusive team. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Join us in building the future of search.