Embedding Model Guide

What Is an Embedding Model?

An embedding model converts text, documents, or images into numerical vectors that represent their meaning. These vectors are stored in a vector database, allowing systems to search for information based on similarity rather than exact keywords.
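Similarity between embedding vectors is commonly measured with cosine similarity. A minimal sketch (the toy 3-dimensional vectors and the `cosine_similarity` helper are invented for illustration; real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- real models produce far higher-dimensional vectors.
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.15, 0.05]
invoice = [0.0, 0.2, 0.95]

print(cosine_similarity(cat, kitten))   # near 1.0 -> similar meaning
print(cosine_similarity(cat, invoice))  # near 0.0 -> unrelated
```

A vector database performs this comparison (or an approximate version of it) between a query vector and every stored document vector, returning the closest matches.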

Different embedding models are designed for different purposes. Some are optimized for long document retrieval, others for multilingual understanding, code search, or semantic similarity. Choosing the right embedding model helps ensure that the system retrieves the most relevant information for a given task.

| Embedding Model | Best Uses | Description |
| --- | --- | --- |
| Text Embedding 3 Large | Enterprise Search, Large Document Retrieval, Knowledge Base Indexing | Handles accurate semantic search and retrieval across complex datasets. |
| Text Embedding Ada 002 | Legacy Systems, Lightweight Semantic Search, Simple Vector Databases | Suited for basic semantic search and lightweight applications at lower cost. |
| Gemini Embedding 001 | Multilingual Datasets, RAG Pipelines | Best suited for multilingual and long-context document retrieval; maintains strong semantic understanding across many languages and in longer documents. |
| Multilingual Embedding 2 | Multilingual Search, Cross-Language Document Retrieval | For multilingual text retrieval, similarity, and search across many languages. |
| English Embedding 4 | English-only Datasets, Document Retrieval | Optimized for English documents. |

Advanced Settings

  • Language: Select the language(s) used in the database to improve retrieval accuracy.

  • Chunk Size: Determines how much text from a document is processed in each segment. Smaller chunks focus on specific details and improve precision, while larger chunks include more context but may introduce less relevant information. Important: Chunk Size must always be larger than Chunk Overlap.

  • Chunk Overlap: Controls how much text is shared between neighboring chunks. More overlap helps maintain context between chunks, while less overlap improves processing efficiency.

  • Smart Table Processing: Detects tables in PDFs and converts them into structured text that is readable for LLMs. This incurs additional compute cost.

  • Smart Image Processing: Detects images in PDFs and converts any readable content into structured information for LLMs. This incurs additional compute cost.

  • Smart OCR Processing: Adds an OCR-based upload option for scanned or complex PDFs. This incurs additional compute cost.

  • Image Extraction: Extracts images from PDFs or image files (e.g. PNG, JPEG) so these images can be referenced in the responses.

  • Contextualized Chunking [Experimental]: Adds an LLM-generated summary header to each PDF chunk. This helps retrieval systems understand the context of each section, improving search and answer relevance.

  • Enable Large PDF Chunk: Concatenates multiple PDF pages into a single chunk, producing larger chunks.
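The interaction between Chunk Size and Chunk Overlap can be sketched as a simple character-based splitter (a hypothetical illustration; production systems typically split on tokens or sentence boundaries). Note how the chunk-size-must-exceed-overlap rule falls out of the math: if the overlap were as large as the chunk, the window would never advance.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size chunks, each sharing `chunk_overlap`
    characters with its neighbor to preserve context across boundaries."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must be larger than chunk_overlap")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Larger overlap keeps more shared context between neighboring chunks but produces more chunks overall (higher storage and embedding cost); smaller overlap is cheaper but risks cutting a sentence's context in half.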
