mxbai-embed-large-v1
Discover mxbai-embed-large-v1, our state-of-the-art English embedding model. Learn about its powerful performance, versatility across various NLP tasks, and how to effectively use it for semantic search, information retrieval, and other applications.
Blog post: Open Source Strikes Bread - New Fluffy Embedding Model
Model Description
mxbai-embed-large-v1 is our powerful English embedding model that provides state-of-the-art performance among efficiently sized models. It outperforms closed-source models such as OpenAI's text-embedding-ada-002.
The model was trained on a vast dataset of over 700 million pairs using contrastive training and fine-tuned on more than 30 million high-quality triplets using the AnglE loss function. This extensive training enables the model to adapt to a wide range of topics and domains, making it suitable for various real-world applications and Retrieval-Augmented Generation (RAG) use cases.
mxbai-embed-large-v1 is well-suited for binary quantization of its embeddings. This reduces storage by 32x and speeds up retrieval by up to 40x, while retaining over 96% of the model's performance.
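To make the storage arithmetic concrete, here is a minimal sketch of binary quantization using NumPy: each float32 dimension becomes a single bit (1 where the value is positive), and packed vectors can be compared with Hamming distance. The helper names and the random stand-in vectors are illustrative, not part of the SDK.

```python
import numpy as np

def to_binary(embeddings: np.ndarray) -> np.ndarray:
    """Threshold each dimension at zero, then pack every 8 bits into one
    uint8 byte: 1024 float32 values (4096 bytes) become 128 bytes (32x)."""
    return np.packbits(embeddings > 0, axis=-1)

def hamming_score(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two packed vectors (lower = more similar)."""
    return int(np.unpackbits(a ^ b).sum())

rng = np.random.default_rng(0)
vecs = rng.standard_normal((2, 1024)).astype(np.float32)  # stand-in for model output
packed = to_binary(vecs)

print(packed.shape)                       # (2, 128): 1024 bits -> 128 bytes each
print(vecs.nbytes // packed.nbytes)       # 32x storage reduction
print(hamming_score(packed[0], packed[1]))
```

Hamming distance over packed bytes is also what makes retrieval faster: it is a XOR plus a popcount rather than a float dot product.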
mxbai-embed-large-v1 achieves top performance on the Massive Text Embedding Benchmark (MTEB), which measures embedding models across seven tasks: classification, clustering, pair classification, re-ranking, retrieval, semantic textual similarity, and summarization. The model's strong performance across these diverse tasks demonstrates its versatility and robustness.
| Layers | Embedding Dimension | Recommended Sequence Length | Language |
|---|---|---|---|
| 24 | 1024 | 512 | English |
Using a Prompt
Adding a domain-specific prompt to a text helps the model understand how the embedding will be used.

For retrieval tasks, precede the query with the prompt `Represent this sentence for searching relevant passages: `. For other tasks, the text can be used as-is without any additional prompt.

The `prompt` parameter is available via our /embeddings endpoint, SDKs, and some third-party integrations, and automatically prepends the prompt to the texts for you. By default, we calculate the embeddings using the provided text directly.
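If you are not using the `prompt` parameter, the same effect can be achieved client-side. The hypothetical helper below mirrors that behavior: queries get the retrieval prompt, passages are embedded as-is.

```python
# Illustrative helper (not part of the SDK): prepend the retrieval prompt
# to queries only, mirroring what the `prompt` parameter does server-side.
RETRIEVAL_PROMPT = "Represent this sentence for searching relevant passages: "

def prepare_inputs(queries, passages):
    # Passages in the corpus are embedded without any prompt.
    return [RETRIEVAL_PROMPT + q for q in queries] + list(passages)

inputs = prepare_inputs(["what do men eat?"], ["A man is eating pasta."])
print(inputs[0])
# Represent this sentence for searching relevant passages: what do men eat?
```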
Suitable Scoring Methods
- Cosine Similarity: Ideal for measuring the similarity between text vectors, commonly used in tasks like semantic textual similarity and information retrieval.
- Euclidean Distance: Useful for measuring dissimilarity between embeddings, especially effective in clustering and outlier detection.
- Dot Product: Appropriate when embeddings are normalized; used in tasks where alignment of vector orientation is critical.
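The three scoring methods above can be sketched in a few lines of NumPy. The toy 3-dimensional vectors stand in for real 1024-dimensional embeddings; the point is the formulas and the fact that dot product equals cosine similarity once vectors are unit-normalized.

```python
import numpy as np

# Toy vectors standing in for embeddings (real ones are 1024-dimensional).
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # orientation only
euclidean = float(np.linalg.norm(a - b))                                # dissimilarity
dot = float(np.dot(a, b))

print(cosine, euclidean, dot)

# With unit-normalized vectors, dot product and cosine similarity coincide:
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(np.dot(an, bn), cosine)
```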
Limitations
- Language: mxbai-embed-large-v1 is trained on English text and is specifically designed for the English language.
- Sequence Length: The suggested maximum sequence length is 512 tokens. Longer sequences may be truncated, leading to a loss of information.
Examples
Calculate Sentence Similarities
The following code illustrates how to compute similarities between sentences using the cosine similarity score function.
from mixedbread_ai.client import MixedbreadAI
from sentence_transformers.util import cos_sim
mxbai = MixedbreadAI(api_key="YOUR_API_KEY")
model = "mixedbread-ai/mxbai-embed-large-v1"
docs = [
"A man is eating food.",
"A man is eating pasta.",
]
result = mxbai.embeddings(
model=model,
input=docs,
)
embeddings = [item.embedding for item in result.data]
# Calculate cosine similarity
similarity = cos_sim(embeddings[0], embeddings[1])
print(similarity)
Information Retrieval
The following code snippet demonstrates retrieving information related to a specific query from a given corpus. Note that the prompt `Represent this sentence for searching relevant passages: ` is used for the query.
from mixedbread_ai.client import MixedbreadAI
from sentence_transformers.util import cos_sim
mxbai = MixedbreadAI(api_key="YOUR_API_KEY")
model = "mixedbread-ai/mxbai-embed-large-v1"
prompt = 'Represent this sentence for searching relevant passages:'
query = "A man is eating a piece of bread"
docs = [
"A man is eating food.",
"A man is eating pasta.",
"The girl is carrying a baby.",
"A man is riding a horse.",
]
query_result = mxbai.embeddings(
model=model,
prompt=prompt,
input=[query],
)
docs_result = mxbai.embeddings(
model=model,
input=docs
)
query_embedding = query_result.data[0].embedding
docs_embeddings = [item.embedding for item in docs_result.data]
# Calculate cosine similarity
similarities = cos_sim(query_embedding, docs_embeddings)
similarity_scores = similarities.squeeze().tolist()
# Retrieve documents sorted by similarity
retrieved_docs = sorted(zip(docs, similarity_scores), key=lambda x: x[1], reverse=True)
# Print the retrieved documents and their similarity scores
for doc, score in retrieved_docs:
print(f"Document: {doc}\nSimilarity Score: {score}\n")
Last updated on 3/25/2025