Generated Metadata

Mixedbread Stores automatically generate metadata for each file ingested. This generated metadata provides structured information about the content of the file, including language, size, headings, number of pages, and more.

The generated metadata can be retrieved using the generated_metadata chunk field.

Metadata Types

The generated_metadata object is a typed structure discriminated by the type field. If type is absent, it is inferred from file_type (the MIME type); if file_type is also missing or unrecognized, type defaults to text.

Supported type values: markdown, text, pdf, code, audio, video, image.

Every metadata object includes a file_extension field (e.g., ".mp3", ".pdf") reflecting the original file extension.

For a full list of supported file formats, see the supported file formats documentation.
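Because the object is discriminated by type, client code can branch on that field before reading type-specific keys. The following is a minimal sketch, assuming the metadata has already been parsed into a plain dict; the function name summarize is hypothetical and not part of any Mixedbread SDK.

```python
from typing import Any, Mapping


def summarize(meta: Mapping[str, Any]) -> str:
    """Return a one-line summary of a generated_metadata object.

    Hypothetical helper: it only reads fields documented on this page.
    """
    kind = meta.get("type")
    ext = meta.get("file_extension", "")
    if kind == "pdf":
        detail = f"{meta.get('total_pages')} pages"
    elif kind == "image":
        detail = f"{meta.get('width')}x{meta.get('height')} px"
    elif kind in ("audio", "video"):
        detail = f"{meta.get('total_duration_seconds')} s total"
    elif kind in ("markdown", "text", "code"):
        detail = f"{meta.get('word_count')} words"
    else:
        # Unknown or missing type: report it rather than guess.
        detail = "unrecognized type"
    return f"{kind} ({ext}): {detail}"


print(summarize({"type": "pdf", "file_extension": ".pdf", "total_pages": 42}))
# pdf (.pdf): 42 pages
```

Branching on type first keeps each branch limited to the fields documented for that type.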

Markdown - Heading Extraction

When processing markdown files, the system automatically extracts and preserves heading structure to enhance search relevance and provide context. This feature works for all markdown formats (.md, .markdown, .mdx).

What You Get

Each markdown chunk’s generated_metadata includes:

  1. type: Always "markdown".
  2. file_type: Always "text/markdown".
  3. language: Detected language of the text.
  4. word_count: Word count for the chunk.
  5. file_size: File size in bytes.
  6. chunk_headings: Headings found within the current chunk ([{ level: number, text: string }]).
  7. heading_context: The document structure context leading up to this chunk ([{ level: number, text: string }]).
  8. start_line: Starting line number of the chunk in the source file.
  9. num_lines: Number of lines in the chunk.
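For typed client code, the field list above can be mirrored as a TypedDict. This is a sketch only: the class names are hypothetical, the value types (str, int) are inferred from the examples on this page, and total=False is used because not every example shows every field.

```python
from typing import List, Literal, TypedDict


class Heading(TypedDict):
    level: int
    text: str


class MarkdownMetadata(TypedDict, total=False):
    # Field names follow the list above; value types are assumptions.
    type: Literal["markdown"]
    file_type: Literal["text/markdown"]
    language: str
    word_count: int
    file_size: int
    chunk_headings: List[Heading]
    heading_context: List[Heading]
    start_line: int
    num_lines: int


def heading_texts(meta: MarkdownMetadata) -> List[str]:
    """Return the text of every heading found inside the chunk."""
    return [h["text"] for h in meta.get("chunk_headings", [])]


sample: MarkdownMetadata = {
    "type": "markdown",
    "file_type": "text/markdown",
    "chunk_headings": [{"level": 3, "text": "Setup"}],
    "heading_context": [],
}
print(heading_texts(sample))  # ['Setup']
```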

Example Output

Consider this markdown document:

# Getting Started
## Installation
### Prerequisites
You need Python 3.8+ installed.
...

### Setup
Run the following command:

```bash
pip install package
```
...

## Configuration
### Environment Variables
Set these variables in your `.env` file.

### Database Setup
Configure your database connection.

When processed, the chunks would have generated_metadata like this:

Chunk 1 (Prerequisites section):

{
  "type": "markdown",
  "file_type": "text/markdown",
  "language": "en",
  "word_count": 6,
  "file_size": 1234,
  "chunk_headings": [
    {"level": 1, "text": "Getting Started"},
    {"level": 2, "text": "Installation"},
    {"level": 3, "text": "Prerequisites"}
  ],
  "heading_context": []
}

Chunk 2 (Setup section):

{
  "type": "markdown",
  "file_type": "text/markdown",
  "language": "en",
  "word_count": 12,
  "file_size": 1234,
  "chunk_headings": [
    {"level": 3, "text": "Setup"}
  ],
  "heading_context": [
    {"level": 1, "text": "Getting Started"},
    {"level": 2, "text": "Installation"},
    {"level": 3, "text": "Prerequisites"}
  ]
}

Chunk 3 (Configuration section):

{
  "type": "markdown",
  "file_type": "text/markdown",
  "language": "en",
  "word_count": 20,
  "file_size": 1234,
  "chunk_headings": [
    {"level": 2, "text": "Configuration"},
    {"level": 3, "text": "Environment Variables"},
    {"level": 3, "text": "Database Setup"}
  ],
  "heading_context": [
    {"level": 1, "text": "Getting Started"},
    {"level": 2, "text": "Installation"},
    {"level": 3, "text": "Setup"}
  ]
}
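In the three chunks above, heading_context behaves like a stack of the headings that were open when the chunk started. The sketch below uses that observation to build a breadcrumb for a chunk and to predict the context of the chunk that follows. It reproduces the example output, but whether Mixedbread derives the field this way is an assumption, and all function names are hypothetical.

```python
from typing import Any, Dict, List, Mapping

Heading = Dict[str, Any]  # {"level": int, "text": str}


def push_heading(stack: List[Heading], heading: Heading) -> List[Heading]:
    """Drop entries at the same or a deeper level, then append the heading."""
    kept = [h for h in stack if h["level"] < heading["level"]]
    return kept + [heading]


def breadcrumb(meta: Mapping[str, Any], sep: str = " > ") -> str:
    """Path to the first heading in the chunk, or to the open context if none."""
    stack = list(meta.get("heading_context") or [])
    headings = meta.get("chunk_headings") or []
    if headings:
        stack = push_heading(stack, headings[0])
    return sep.join(h["text"] for h in stack)


def context_after(meta: Mapping[str, Any]) -> List[Heading]:
    """Heading stack once every heading in the chunk has been applied."""
    stack = list(meta.get("heading_context") or [])
    for heading in meta.get("chunk_headings") or []:
        stack = push_heading(stack, heading)
    return stack


# Chunk 2 from the example above.
chunk_2 = {
    "chunk_headings": [{"level": 3, "text": "Setup"}],
    "heading_context": [
        {"level": 1, "text": "Getting Started"},
        {"level": 2, "text": "Installation"},
        {"level": 3, "text": "Prerequisites"},
    ],
}
print(breadcrumb(chunk_2))  # Getting Started > Installation > Setup
```

Applied to Chunk 2, context_after returns Getting Started, Installation, Setup, which is the heading_context shown for Chunk 3.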

Text - Common Fields

Plain text chunks include a simpler generated_metadata shape:

  • type: "text"
  • file_type: "text/plain"
  • language: Detected language of the text
  • word_count: Word count for the chunk
  • file_size: File size in bytes
  • start_line: Starting line number of the chunk in the source file
  • num_lines: Number of lines in the chunk

Example

{
  "type": "text",
  "file_type": "text/plain",
  "language": "en",
  "word_count": 57,
  "file_size": 2048,
  "start_line": 0,
  "num_lines": 15
}
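The start_line and num_lines fields can be used to locate a chunk in the original file. A minimal sketch, assuming start_line is zero-based (the example above starts at 0, but the indexing convention is not stated on this page) and that lines are split on standard newline characters; the function name is hypothetical.

```python
from typing import Any, List, Mapping


def chunk_lines(source: str, meta: Mapping[str, Any]) -> List[str]:
    """Return the source lines covered by a chunk.

    Assumes start_line is a zero-based index into the file's lines.
    """
    lines = source.splitlines()
    start = meta["start_line"]
    return lines[start:start + meta["num_lines"]]


text = "alpha\nbeta\ngamma\ndelta"
print(chunk_lines(text, {"start_line": 1, "num_lines": 2}))  # ['beta', 'gamma']
```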

Code - Language and Size

For supported source files (e.g., Python, TypeScript, Java, C#), generated_metadata includes:

  • type: "code"
  • file_type: One of text/x-python, text/x-script.python, application/typescript, text/typescript, text/x-java-source, text/x-csharp, or application/javascript
  • language: Detected programming language
  • word_count: Approximate word count, based on tokenizing the code
  • file_size: File size in bytes
  • start_line: Starting line number of the chunk in the source file
  • num_lines: Number of lines in the chunk

Example

{
  "type": "code",
  "file_type": "text/x-python",
  "language": "python",
  "word_count": 120,
  "file_size": 8192,
  "start_line": 0,
  "num_lines": 42
}
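When citing a code chunk in search results, a human-readable line range is often more useful than the raw fields. The sketch below formats one, assuming start_line is zero-based and converting it to the one-based numbering most editors display; the helper name and the label format are hypothetical.

```python
from typing import Any, Mapping


def line_span_label(meta: Mapping[str, Any]) -> str:
    """Format a chunk's line range as 'L<first>-L<last>' (one-based, inclusive)."""
    first = meta["start_line"] + 1  # assumption: start_line is zero-based
    last = meta["start_line"] + meta["num_lines"]
    return f"L{first}-L{last}"


print(line_span_label({"start_line": 0, "num_lines": 42}))  # L1-L42
```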

PDF - Document Stats

PDF chunks have specialized document-level stats:

  • type: "pdf"
  • file_type: "application/pdf"
  • total_pages: Total number of pages in the document
  • total_size: Total size of the original file in bytes

Example

{
  "type": "pdf",
  "file_type": "application/pdf",
  "total_pages": 42,
  "total_size": 1048576
}
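These two fields are enough for a short display label. A minimal sketch with hypothetical helper names, using binary units (1 MiB = 1,048,576 bytes):

```python
from typing import Any, Mapping


def format_bytes(n: int) -> str:
    """Render a byte count using binary units."""
    size = float(n)
    for unit in ("B", "KiB", "MiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GiB"


def pdf_label(meta: Mapping[str, Any]) -> str:
    """Combine total_pages and total_size into one display string."""
    return f"{meta['total_pages']} pages, {format_bytes(meta['total_size'])}"


print(pdf_label({"total_pages": 42, "total_size": 1048576}))  # 42 pages, 1.0 MiB
```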

Image - Dimensions

Image chunks include basic dimensional metadata:

  • type: "image"
  • file_type: String (MIME type, e.g., image/jpeg, image/png)
  • file_size: File size in bytes
  • width: Image width in pixels
  • height: Image height in pixels

Example

{
  "type": "image",
  "file_type": "image/jpeg",
  "file_size": 204800,
  "width": 1920,
  "height": 1080
}
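Derived values such as aspect ratio and megapixel count follow directly from width and height. The sketch below computes both; the function names are hypothetical.

```python
import math
from typing import Any, Mapping


def aspect_ratio(meta: Mapping[str, Any]) -> str:
    """Reduce width:height to lowest terms using the greatest common divisor."""
    width, height = meta["width"], meta["height"]
    divisor = math.gcd(width, height)
    return f"{width // divisor}:{height // divisor}"


def megapixels(meta: Mapping[str, Any]) -> float:
    """Total pixel count in millions."""
    return meta["width"] * meta["height"] / 1_000_000


image = {"type": "image", "width": 1920, "height": 1080}
print(aspect_ratio(image))  # 16:9
print(megapixels(image))    # 2.0736
```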

Audio - Media Information

Audio chunks include specialized media metadata:

  • type: "audio"
  • file_type: String (MIME type, e.g., audio/mpeg, audio/wav)
  • file_size: File size in bytes
  • total_duration_seconds: Total duration of the audio in seconds
  • sample_rate: Audio sample rate in Hz
  • channels: Number of audio channels
  • audio_format: Audio format code
  • bpm: Detected beats per minute (optional, only present when detected)
  • start_time_seconds: Start time of the chunk in seconds
  • end_time_seconds: End time of the chunk in seconds
  • duration_seconds: Duration of the chunk in seconds
  • chunk_size_bytes: Size of the chunk in bytes

Example

{
  "type": "audio",
  "file_type": "audio/mpeg",
  "file_size": 5242880,
  "total_duration_seconds": 180.5,
  "sample_rate": 44100,
  "channels": 2,
  "audio_format": 1,
  "start_time_seconds": 0,
  "end_time_seconds": 38.93,
  "duration_seconds": 38.93,
  "chunk_size_bytes": 650736
}
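In the example above, duration_seconds equals end_time_seconds minus start_time_seconds. The sketch below checks that relationship and derives a rough data rate from chunk_size_bytes. Both helper names are hypothetical, and because this page does not say whether chunk_size_bytes counts encoded or decoded audio, the rate is only an estimate.

```python
import math
from typing import Any, Mapping


def timing_is_consistent(meta: Mapping[str, Any], tol: float = 1e-6) -> bool:
    """True when duration_seconds matches end minus start within a tolerance."""
    span = meta["end_time_seconds"] - meta["start_time_seconds"]
    return math.isclose(span, meta["duration_seconds"], abs_tol=tol)


def chunk_rate_kbps(meta: Mapping[str, Any]) -> float:
    """Average kilobits per second implied by the chunk's size and duration."""
    return meta["chunk_size_bytes"] * 8 / meta["duration_seconds"] / 1000


audio = {
    "start_time_seconds": 0,
    "end_time_seconds": 38.93,
    "duration_seconds": 38.93,
    "chunk_size_bytes": 650736,
}
print(timing_is_consistent(audio))  # True
print(round(chunk_rate_kbps(audio), 1))
```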

Video - Media Information

Video chunks include specialized media metadata:

  • type: "video"
  • file_type: String (MIME type, e.g., video/mp4, video/webm)
  • file_size: File size in bytes
  • total_duration_seconds: Total duration of the video in seconds
  • fps: Frames per second
  • width: Video width in pixels
  • height: Video height in pixels
  • frame_count: Total number of frames
  • has_audio_stream: Whether the video contains an audio track
  • bpm: Detected beats per minute (optional, only present when detected)
  • start_time_seconds: Start time of the chunk in seconds
  • end_time_seconds: End time of the chunk in seconds
  • duration_seconds: Duration of the chunk in seconds
  • chunk_size_bytes: Size of the chunk in bytes

Example

{
  "type": "video",
  "file_type": "video/mp4",
  "file_size": 15728640,
  "total_duration_seconds": 120.0,
  "fps": 30.0,
  "width": 1920,
  "height": 1080,
  "frame_count": 3600,
  "has_audio_stream": true,
  "start_time_seconds": 0.0,
  "end_time_seconds": 120.0,
  "duration_seconds": 120.0,
  "chunk_size_bytes": 2097152
}
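The timing fields can be combined with fps to estimate which frames a chunk covers. This is a sketch under the assumption of a constant frame rate; the function name is hypothetical, and the returned range is half-open (first frame included, last excluded).

```python
import math
from typing import Any, Mapping, Tuple


def chunk_frame_range(meta: Mapping[str, Any]) -> Tuple[int, int]:
    """Estimate the half-open [first, last) frame range covered by a chunk."""
    fps = meta["fps"]
    first = math.floor(meta["start_time_seconds"] * fps)
    last = math.ceil(meta["end_time_seconds"] * fps)
    # Never report more frames than the file contains.
    return first, min(last, meta["frame_count"])


video = {
    "fps": 30.0,
    "frame_count": 3600,
    "total_duration_seconds": 120.0,
    "start_time_seconds": 0.0,
    "end_time_seconds": 120.0,
}
print(chunk_frame_range(video))  # (0, 3600)
```

In the example above, fps multiplied by total_duration_seconds (30.0 × 120.0) equals frame_count (3600), which is what a constant frame rate would produce.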

Type Inference and Flexibility

  • If type is not present in generated_metadata, it is inferred from file_type when possible.
  • If neither type nor a recognized file_type is present, the type defaults to "text".
  • The system may include additional fields as needed for future enhancements; clients should read the documented fields and ignore unknown ones.
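The fallback order described above can be mirrored on the client side. The sketch below is an illustration, not Mixedbread's actual inference logic: the exact MIME types come from the sections on this page, while matching any image/, audio/, or video/ prefix is an assumption, and the function name is hypothetical.

```python
from typing import Any, Mapping

# MIME types listed on this page.
EXACT_TYPES = {
    "text/markdown": "markdown",
    "text/plain": "text",
    "application/pdf": "pdf",
}
CODE_MIME_TYPES = {
    "text/x-python", "text/x-script.python", "application/typescript",
    "text/typescript", "text/x-java-source", "text/x-csharp",
    "application/javascript",
}
# Assumption: any MIME type with these prefixes maps to the media type.
PREFIX_TYPES = {"image/": "image", "audio/": "audio", "video/": "video"}


def resolve_type(meta: Mapping[str, Any]) -> str:
    """Use type if present, else infer it from file_type, else default to text."""
    explicit = meta.get("type")
    if explicit:
        return explicit
    mime = meta.get("file_type") or ""
    if mime in EXACT_TYPES:
        return EXACT_TYPES[mime]
    if mime in CODE_MIME_TYPES:
        return "code"
    for prefix, kind in PREFIX_TYPES.items():
        if mime.startswith(prefix):
            return kind
    return "text"


print(resolve_type({"file_type": "audio/mpeg"}))  # audio
print(resolve_type({"file_type": "application/x-unknown", "extra": 1}))  # text
```

Reading fields with get and never iterating over unexpected keys means fields added in the future are ignored rather than causing errors.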
Last updated: April 7, 2026