Generated Metadata
Mixedbread Stores automatically generate metadata for each file ingested. This generated metadata provides structured information about the content of the file, including language, size, headings, number of pages, and more.
The generated metadata can be retrieved using the generated_metadata chunk field.
Metadata Types
The generated_metadata object is a typed structure discriminated by the type field.
If type is not present, it is inferred from file_type (MIME); if file_type is also missing or unrecognized, type defaults to text.
Supported type values: markdown, text, pdf, code, audio, video, image.
Every metadata object includes a file_extension field (e.g., ".mp3", ".pdf") reflecting the original file extension.
For a full list of supported file formats, see Supported File Types.
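The resolution order described above can be sketched on the client side. This is a minimal illustration, not the service's implementation: the function name resolve_type is hypothetical, the exact MIME values are taken from the per-type descriptions on this page, and the prefix rule for audio, video, and image is an assumption.

```python
# MIME values documented on this page for each metadata type.
EXACT_MIME_TYPES = {
    "text/markdown": "markdown",
    "text/plain": "text",
    "application/pdf": "pdf",
}
CODE_MIME_TYPES = {
    "text/x-python",
    "text/x-script.python",
    "application/typescript",
    "text/typescript",
    "text/x-java-source",
    "text/x-csharp",
    "application/javascript",
}
# Assumption: media types can be recognized by their MIME prefix.
MIME_PREFIXES = {"audio/": "audio", "video/": "video", "image/": "image"}


def resolve_type(metadata: dict) -> str:
    """Return the declared type, else infer it from file_type, else "text"."""
    declared = metadata.get("type")
    if declared is not None:
        return declared
    mime = metadata.get("file_type") or ""
    if mime in EXACT_MIME_TYPES:
        return EXACT_MIME_TYPES[mime]
    if mime in CODE_MIME_TYPES:
        return "code"
    for prefix, inferred in MIME_PREFIXES.items():
        if mime.startswith(prefix):
            return inferred
    return "text"  # documented default


print(resolve_type({"file_type": "audio/mpeg"}))  # audio
print(resolve_type({}))  # text
```

Callers can then branch on the returned value to decide which type-specific fields to read.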
Markdown - Heading Extraction
When processing markdown files, the system automatically extracts and preserves
heading structure to enhance search relevance and provide context. This feature
works for all markdown formats (.md, .markdown, .mdx).
What You Get
Each markdown chunk’s generated_metadata includes:
- type: Always "markdown".
- file_type: Always "text/markdown".
- language: Detected language of the text.
- word_count: Word count for the chunk.
- file_size: File size in bytes.
- chunk_headings: Headings found within the current chunk ([{ level: number, text: string }]).
- heading_context: The document structure context leading up to this chunk ([{ level: number, text: string }]).
- start_line: Starting line number of the chunk in the source file.
- num_lines: Number of lines in the chunk.
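One way to use heading_context and chunk_headings together is to rebuild a breadcrumb for a chunk. The sketch below is illustrative only: heading_path is a hypothetical helper, and it assumes that a new heading closes any open heading at the same or a deeper level.

```python
def heading_path(heading_context: list, chunk_headings: list) -> list:
    """Return the heading trail that is active at the end of a chunk."""
    stack = []
    for heading in [*heading_context, *chunk_headings]:
        # A new heading closes any open heading at the same or a deeper level.
        while stack and stack[-1]["level"] >= heading["level"]:
            stack.pop()
        stack.append(heading)
    return [h["text"] for h in stack]


context = [
    {"level": 1, "text": "Getting Started"},
    {"level": 2, "text": "Installation"},
    {"level": 3, "text": "Prerequisites"},
]
in_chunk = [{"level": 3, "text": "Setup"}]
breadcrumb = " > ".join(heading_path(context, in_chunk))
print(breadcrumb)  # Getting Started > Installation > Setup
```

A breadcrumb like this can be shown next to a search result so readers see where the chunk sits in the document.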
Example Output
Consider this markdown document:
# Getting Started
## Installation
### Prerequisites
You need Python 3.8+ installed.
...
### Setup
Run the following command:
```bash
pip install package
```
...
## Configuration
### Environment Variables
Set these variables in your `.env` file.
### Database Setup
Configure your database connection.
When processed, the chunks would have generated_metadata like this:
Chunk 1 (Prerequisites section):
{
"type": "markdown",
"file_type": "text/markdown",
"language": "en",
"word_count": 6,
"file_size": 1234,
"chunk_headings": [
{"level": 1, "text": "Getting Started"},
{"level": 2, "text": "Installation"},
{"level": 3, "text": "Prerequisites"}
],
"heading_context": []
}
Chunk 2 (Setup section):
{
"type": "markdown",
"file_type": "text/markdown",
"language": "en",
"word_count": 12,
"file_size": 1234,
"chunk_headings": [
{"level": 3, "text": "Setup"}
],
"heading_context": [
{"level": 1, "text": "Getting Started"},
{"level": 2, "text": "Installation"},
{"level": 3, "text": "Prerequisites"}
]
}
Chunk 3 (Configuration section):
{
"type": "markdown",
"file_type": "text/markdown",
"language": "en",
"word_count": 20,
"file_size": 1234,
"chunk_headings": [
{"level": 2, "text": "Configuration"},
{"level": 3, "text": "Environment Variables"},
{"level": 3, "text": "Database Setup"}
],
"heading_context": [
{"level": 1, "text": "Getting Started"},
{"level": 2, "text": "Installation"},
{"level": 3, "text": "Setup"}
]
}
Text - Common Fields
Plain text chunks include a simpler generated_metadata shape:
- type: "text"
- file_type: "text/plain"
- language: Detected language of the text
- word_count: Word count for the chunk
- file_size: File size in bytes
- start_line: Starting line number of the chunk in the source file
- num_lines: Number of lines in the chunk
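The start_line and num_lines fields can be used to map a chunk back to its lines in the original file. The sketch below assumes start_line is zero-based, which the documentation does not state explicitly; chunk_lines is a hypothetical helper name.

```python
def chunk_lines(source_text: str, metadata: dict) -> list:
    """Return the source lines covered by a chunk."""
    lines = source_text.splitlines()
    start = metadata["start_line"]  # assumption: zero-based line index
    return lines[start : start + metadata["num_lines"]]


source = "alpha\nbeta\ngamma\ndelta\n"
print(chunk_lines(source, {"start_line": 1, "num_lines": 2}))  # ['beta', 'gamma']
```

If start_line turns out to be one-based in your data, subtract one before slicing.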
Example
{
"type": "text",
"file_type": "text/plain",
"language": "en",
"word_count": 57,
"file_size": 2048,
"start_line": 0,
"num_lines": 15
}
Code - Language and Size
For supported source files (e.g., Python, TypeScript, Java, C#), generated_metadata includes:
- type: "code"
- file_type: One of text/x-python, text/x-script.python, application/typescript, text/typescript, text/x-java-source, text/x-csharp, or application/javascript
- language: Detected programming language
- word_count: Tokenized word count approximation for code
- file_size: File size in bytes
- start_line: Starting line number of the chunk in the source file
- num_lines: Number of lines in the chunk
Example
{
"type": "code",
"file_type": "text/x-python",
"language": "python",
"word_count": 120,
"file_size": 8192,
"start_line": 0,
"num_lines": 42
}
PDF - Document Stats
PDF chunks have specialized document-level stats:
- type: "pdf"
- file_type: "application/pdf"
- total_pages: Total number of pages in the document
- total_size: Total size of the original file in bytes
Example
{
"type": "pdf",
"file_type": "application/pdf",
"total_pages": 42,
"total_size": 1048576
}
Image - Dimensions
Image chunks include basic dimensional metadata:
- type: "image"
- file_type: String (MIME type, e.g., image/jpeg, image/png)
- file_size: File size in bytes
- width: Image width in pixels
- height: Image height in pixels
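As an illustration of what the dimensional fields allow, the sketch below derives an aspect ratio and a megapixel count from width and height. Both helper names are hypothetical and are not part of the Mixedbread API.

```python
from math import gcd


def aspect_ratio(metadata: dict) -> str:
    """Reduce width:height to its simplest integer ratio."""
    width, height = metadata["width"], metadata["height"]
    divisor = gcd(width, height)
    if divisor == 0:
        raise ValueError("width and height must not both be zero")
    return f"{width // divisor}:{height // divisor}"


def megapixels(metadata: dict) -> float:
    """Return the pixel count in millions, rounded to two decimals."""
    return round(metadata["width"] * metadata["height"] / 1_000_000, 2)


image_meta = {"type": "image", "width": 1920, "height": 1080}
print(aspect_ratio(image_meta))  # 16:9
print(megapixels(image_meta))  # 2.07
```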
Example
{
"type": "image",
"file_type": "image/jpeg",
"file_size": 204800,
"width": 1920,
"height": 1080
}
Audio - Media Information
Audio chunks include specialized media metadata:
- type: "audio"
- file_type: String (MIME type, e.g., audio/mpeg, audio/wav)
- file_size: File size in bytes
- total_duration_seconds: Total duration of the audio in seconds
- sample_rate: Audio sample rate in Hz
- channels: Number of audio channels
- audio_format: Audio format code
- bpm: Detected beats per minute (optional, only present when detected)
- start_time_seconds: Start time of the chunk in seconds
- end_time_seconds: End time of the chunk in seconds
- duration_seconds: Duration of the chunk in seconds
- chunk_size_bytes: Size of the chunk in bytes
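The timing fields make it possible to show where a chunk sits within the full recording. The following sketch uses hypothetical helper names and assumes that duration_seconds equals end_time_seconds minus start_time_seconds, within a small tolerance.

```python
import math


def format_timestamp(seconds: float) -> str:
    """Format a second offset as MM:SS, dropping fractional seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def describe_audio_chunk(metadata: dict) -> str:
    """Summarize a chunk's time window and its share of the whole file."""
    start = metadata["start_time_seconds"]
    end = metadata["end_time_seconds"]
    # Assumption: the chunk duration matches its time window.
    if not math.isclose(metadata["duration_seconds"], end - start, abs_tol=0.01):
        raise ValueError("duration_seconds does not match the chunk's time window")
    share = (end - start) / metadata["total_duration_seconds"]
    return f"{format_timestamp(start)}-{format_timestamp(end)} ({share:.0%} of file)"


audio_meta = {
    "total_duration_seconds": 180.5,
    "start_time_seconds": 0,
    "end_time_seconds": 38.93,
    "duration_seconds": 38.93,
}
print(describe_audio_chunk(audio_meta))  # 00:00-00:38 (22% of file)
```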
Example
{
"type": "audio",
"file_type": "audio/mpeg",
"file_size": 5242880,
"total_duration_seconds": 180.5,
"sample_rate": 44100,
"channels": 2,
"audio_format": 1,
"start_time_seconds": 0,
"end_time_seconds": 38.93,
"duration_seconds": 38.93,
"chunk_size_bytes": 650736
}
Video - Media Information
Video chunks include specialized media metadata:
- type: "video"
- file_type: String (MIME type, e.g., video/mp4, video/webm)
- file_size: File size in bytes
- total_duration_seconds: Total duration of the video in seconds
- fps: Frames per second
- width: Video width in pixels
- height: Video height in pixels
- frame_count: Total number of frames
- has_audio_stream: Whether the video contains an audio track
- bpm: Detected beats per minute (optional, only present when detected)
- start_time_seconds: Start time of the chunk in seconds
- end_time_seconds: End time of the chunk in seconds
- duration_seconds: Duration of the chunk in seconds
- chunk_size_bytes: Size of the chunk in bytes
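For video, the timing fields can be combined with fps to estimate which frames a chunk covers. This is a rough sketch that assumes a constant frame rate and zero-based frame indices; chunk_frame_range is a hypothetical helper.

```python
import math


def chunk_frame_range(metadata: dict) -> tuple:
    """Return (first_frame, end_frame_exclusive) for a video chunk."""
    fps = metadata["fps"]
    first = math.floor(metadata["start_time_seconds"] * fps)
    end = math.ceil(metadata["end_time_seconds"] * fps)
    # Never report more frames than the file contains.
    return first, min(end, metadata["frame_count"])


video_meta = {
    "fps": 30.0,
    "frame_count": 3600,
    "start_time_seconds": 0.0,
    "end_time_seconds": 120.0,
}
print(chunk_frame_range(video_meta))  # (0, 3600)
```

With variable frame rate sources, the result is only an approximation.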
Example
{
"type": "video",
"file_type": "video/mp4",
"file_size": 15728640,
"total_duration_seconds": 120.0,
"fps": 30.0,
"width": 1920,
"height": 1080,
"frame_count": 3600,
"has_audio_stream": true,
"start_time_seconds": 0.0,
"end_time_seconds": 120.0,
"duration_seconds": 120.0,
"chunk_size_bytes": 2097152
}
Type Inference and Flexibility
- If type is not present in generated_metadata, it is inferred from file_type when possible.
- If neither type nor a recognized file_type is present, the type defaults to "text".
- The system may include additional fields as needed for future enhancements; clients should read the documented fields and ignore unknown ones.
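A tolerant client can follow this guidance by selecting only the fields it knows about. The sketch below is one possible approach: pick_documented is a hypothetical helper, and color_space stands in for an unknown future field that the service might add.

```python
# Fields documented for image metadata, plus the shared file_extension field.
DOCUMENTED_IMAGE_FIELDS = (
    "type", "file_type", "file_extension", "file_size", "width", "height",
)


def pick_documented(metadata: dict, documented: tuple) -> dict:
    """Keep documented fields that are present and silently drop the rest."""
    return {key: metadata[key] for key in documented if key in metadata}


raw = {
    "type": "image",
    "file_type": "image/png",
    "width": 10,
    "height": 5,
    "color_space": "sRGB",  # hypothetical unknown field
}
print(pick_documented(raw, DOCUMENTED_IMAGE_FIELDS))
```

Because missing keys are skipped rather than raising, the same helper also tolerates optional fields such as bpm.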