LangChain Series: A Deep Dive into Embedding Models
Introduction
In our previous posts, we’ve seen how chains and agents form the backbone of LangChain applications. We’ve also touched on a critical concept that powers features like Retrieval-Augmented Generation (RAG) and semantic search: embeddings. An embedding model is a neural network that converts text (or other data types) into a dense vector of numbers. This vector captures the “meaning” or semantic content of the text, allowing us to perform powerful similarity comparisons.
This post will focus on the retrieval side of RAG: the embedding models that make it work. We’ll explore how LangChain provides a simple, unified interface for interacting with dozens of different embedding models, making it easy to integrate them into your applications and even swap them out with minimal code changes.
The Standard Embeddings Interface
One of LangChain’s most powerful design principles is its use of standardized interfaces. For embedding models, this means that whether you’re using a model from OpenAI, Hugging Face, Cohere, or another provider, LangChain wraps it in a class that behaves in a consistent way.
This standard Embeddings class has two core methods you need to know:
embed_query(text: str) -> List[float]: This method takes a single string of text and returns a single embedding vector. It’s typically used for embedding a user’s search query right before performing a similarity search.
embed_documents(texts: List[str]) -> List[List[float]]: This method takes a list of strings and returns a list of embeddings, one for each document. It’s designed for efficiency, as it can often process multiple documents in a single batch request to the underlying model provider. This is the method you’d use when indexing a large knowledge base.
This standardization is a huge benefit. It means you can write your application’s logic once and then easily experiment with different embedding models to see which one performs best for your specific use case.
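To make the interface concrete, here is a minimal sketch of a custom class that satisfies it. The hashing “model” below is purely illustrative, a stand-in for a real provider, but the Embeddings base class from langchain_core is the real contract that every provider wrapper implements.
import hashlib
from typing import List
from langchain_core.embeddings import Embeddings

class ToyHashEmbeddings(Embeddings):
    """Illustrative only: derives a fixed-size pseudo-vector from a text hash."""

    def __init__(self, size: int = 8):
        self.size = size

    def embed_query(self, text: str) -> List[float]:
        # Deterministically map the text to `size` floats in [0, 1]
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255.0 for byte in digest[: self.size]]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Batch method: one vector per input document
        return [self.embed_query(text) for text in texts]

model = ToyHashEmbeddings()
print(model.embed_query("hello"))              # one 8-float vector
print(len(model.embed_documents(["a", "b"])))  # 2
Because it implements both core methods, anything downstream that expects an Embeddings object can use this toy class and a production model interchangeably.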
Initializing Embedding Models
While LangChain doesn’t have a single init_embeddings function, a conceptual init_embeddings helper is a useful way to understand the different patterns for initializing these models. In practice, you will import a specific class for each provider, such as OpenAIEmbeddings or HuggingFaceEmbeddings.
Let’s explore these patterns using that conceptual helper as a guide, and show the actual LangChain code for each.
Method 1: Using a Model String
Conceptually, you might imagine identifying a model with a single string that combines the provider and model name.
Conceptual Code:
# init_embeddings("openai:text-embedding-3-small")
This is a clean pattern that encapsulates all the necessary information.
Actual LangChain Code: In LangChain, you achieve this by importing the specific provider class and passing the model name to its constructor.
# pip install langchain-openai
from langchain_openai import OpenAIEmbeddings
# Initialize the model
model = OpenAIEmbeddings(model="text-embedding-3-small")
# Embed a single query
vector = model.embed_query("Hello, world!")
print(f"Vector dimension: {len(vector)}")
Method 2: Using Explicit Provider and Model Arguments
Another way to think about it is by explicitly separating the provider and the model.
Conceptual Code:
# init_embeddings(model="text-embedding-3-small", provider="openai")
Actual LangChain Code: This translates to choosing the correct class to import based on the provider. The core logic remains the same, which highlights the consistency of the LangChain interface. Let’s look at OpenAI and then conceptually at another provider like Cohere.
from langchain_openai import OpenAIEmbeddings
# from langchain_cohere import CohereEmbeddings # Example for another provider
# OpenAI
openai_model = OpenAIEmbeddings(model="text-embedding-3-small")
documents = ["Hello, world!", "Goodbye, world!"]
openai_vectors = openai_model.embed_documents(documents)
print(f"OpenAI embedded {len(openai_vectors)} documents.")
# Cohere (Conceptual Example)
# cohere_model = CohereEmbeddings(model="embed-english-v3.0")
# cohere_vectors = cohere_model.embed_documents(documents)
# print(f"Cohere embedded {len(cohere_vectors)} documents.")
Method 3: Passing Additional Parameters
You often need to pass extra parameters to the model, such as an API key, endpoint URL, or other configuration options.
Conceptual Code:
# init_embeddings("openai:text-embedding-3-small", api_key="sk-...")
Actual LangChain Code:
Provider-specific keyword arguments you pass to a LangChain embeddings class constructor, such as api_key, are forwarded to the underlying client for that provider. This makes it easy to configure authentication, timeouts, and other settings.
from langchain_openai import OpenAIEmbeddings
# The api_key is passed directly to the underlying OpenAI client
model = OpenAIEmbeddings(
model="text-embedding-3-small",
api_key="sk-..." # Replace with your actual key if not using .env
)
vector = model.embed_query("This is a test.")
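Other constructor options follow the same pattern. For example, OpenAI’s text-embedding-3 models can return shortened vectors, and OpenAIEmbeddings exposes this through a dimensions parameter (a sketch; check your provider’s documentation for the options it supports):
from langchain_openai import OpenAIEmbeddings

# Request shorter vectors; supported by the text-embedding-3 family
small_model = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=256)
vector = small_model.embed_query("This is a test.")
print(len(vector))  # 256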
A Practical Example
Let’s tie this all together in a complete, runnable script that demonstrates the process.
import os
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
# 1. Load environment variables from .env
load_dotenv()
# 2. Initialize the OpenAI embedding model
# LangChain will automatically use the OPENAI_API_KEY from your .env file
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")
# 3. Define a list of documents to embed
documents = [
"The cat sat on the mat.",
"The dog chased the ball.",
"Photosynthesis is the process used by plants to convert light energy.",
]
# 4. Embed the documents
document_vectors = embeddings_model.embed_documents(documents)
print(f"Successfully embedded {len(document_vectors)} documents.")
print(f"Dimension of the first vector: {len(document_vectors[0])}")
# 5. Define a user query to embed
user_query = "What is a feline?"
# 6. Embed the user query
query_vector = embeddings_model.embed_query(user_query)
print(f"\nSuccessfully embedded the query.")
print(f"Dimension of the query vector: {len(query_vector)}")
Conclusion
LangChain’s standardized Embeddings interface is a cornerstone of its design. It abstracts away the provider-specific details, allowing you to focus on your application’s logic. This makes it incredibly easy to get started with any embedding model and to experiment with different ones as you optimize your application for performance, cost, and accuracy.