Research Staff - Synthetic Data

Research, Full-time
Remote

We encourage you to apply even if you don't meet 100% of the requirements

At Mixedbread, we believe that search is the ultimate bottleneck for AI. Our mission is to create the next generation of Information Retrieval systems: truly multimodal, transformer-native, and capable of surfacing just the right context across any format.

About the role

We are building a full-stack retrieval research lab that rethinks how search works in the era of large models. Our research spans late interaction methods, multimodal retrievers, and LLM-augmented search, with the goal of pushing retrieval beyond its current limits. Building retrieval for every imaginable use case requires enough data for the model to learn how to represent all kinds of information, in all kinds of scenarios. Synthetic data is increasingly becoming a core component of all efficient models. Recent efforts have shown that good synthetic data is a lot more than a one-and-done model call. How to build the best synthetic data pipelines for retrieval is still an open question: if you would like to be the person breaking new ground in this area, this role is for you.

What you'll do

  • Design synthetic data generation pipelines across modalities
  • Lead and contribute to applied research projects in Information Retrieval
  • Advance Omni, our core search platform, improving retrieval performance and efficiency
  • Create innovative feedback loops to continuously improve data generation
  • Collaborate across engineering and product teams to identify key areas for improvement in our models' performance and design the data needed to close those gaps
  • Publish and share results through papers, blog posts, and conferences

Research Directions

  • Late Interaction: Designing the next generation of fine-grained retrieval models beyond ColBERT
  • Omni-Modality: Building retrievers that unify text, image, and beyond
  • Data Refinement: Improving existing training datasets by modifying and filtering them to raise both training efficiency and downstream performance
  • End-to-End Synthetic Pipelines: Understanding every stage of the pipeline, from generation to quality scoring to rewriting and curation, and defining new components as needed

What we're looking for

  • Experience in information retrieval, embeddings, or related fields
  • A deep interest in synthetic data pipelines and data curation work
  • A degree is appreciated but not mandatory for this position
  • Proficiency in Python and ability to implement models from scratch
  • Passion for advancing search beyond single-vector similarity methods
  • Strong written and verbal communication skills

Nice to have

  • Publications in top-tier conferences (NeurIPS, ICML, SIGIR, ACL, etc.) or high-quality blog posts
  • Open-source contributions to data, IR, NLP, or ML libraries
  • Familiarity with retrieval and late interaction systems

Benefits & Perks

  • Competitive compensation with strong equity incentives
  • Flexible, remote-first work environment
  • Low-meeting, async-first research culture
  • Frequent team off-sites and conference attendance
  • Work with a world-class team to redefine the future of search

Mixedbread is an equal opportunity employer committed to building a diverse and inclusive team. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Apply for this position
Join us in building the future of search.