Post

LangChain Series: Documents and Loaders - The Gateway for Your Data

LangChain Series: Documents and Loaders - The Gateway for Your Data

Introduction

In our journey through the LangChain ecosystem, we’ve learned that we can build powerful agents and chains that reason and take action. But to make our LLM applications truly useful, we need to connect them to our own data. The first step in any Retrieval-Augmented Generation (RAG) pipeline is to load that data.

This post will focus on this crucial first step. We’ll explore how LangChain represents data using its standard Document object, and how its vast ecosystem of Document Loaders provides a simple, unified way to ingest data from almost any source imaginable.

The Document - LangChain’s Standard Unit of Data

Before we can work with data, we need a standard way to represent it. In LangChain, the primary unit of data is the Document object. It’s a simple but powerful container with two key attributes:

  1. page_content (string): This holds the actual text content of a piece of data.
  2. metadata (dictionary): This is a flexible dictionary used to store any extra information about the content. This is crucial for tracking the source of the data (e.g., file name, URL, page number), its creation date, or any other attributes you might want to filter by later.

Here’s how you can create a Document manually:

1
2
3
4
5
6
7
8
9
from langchain_core.documents import Document

doc = Document(
    page_content="LangChain is a framework for developing applications powered by language models.",
    metadata={"source": "intro.txt", "chapter": 1}
)

print(doc.page_content)
print(doc.metadata)

While you can create them manually, you’ll most often get Document objects from a Document Loader.

The Blob - A Raw Data Container

Before data is parsed into a structured Document, it often starts as raw, unstructured data. LangChain provides a Blob object to represent this raw data in a flexible way. Think of a Blob as a low-level container that holds the data “as-is” from its source.

It’s a simple object that can be created from a string or bytes and then read in various formats.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
from langchain_core.documents import Blob

# Create a blob from a string
blob = Blob.from_data("Hello, world!")

# Read the blob as a string
print(blob.as_string())
# Output: Hello, world!

# Read the blob as bytes
print(blob.as_bytes())
# Output: b'Hello, world!'

# Read the blob as a file-like byte stream
with blob.as_bytes_io() as f:
    print(f.read())
# Output: b'Hello, world!'

While you may not interact with Blob objects directly very often, they are a fundamental building block that Document Loaders use internally to handle data ingestion before parsing it into the final Document format.

Document Loaders: Your Gateway to Any Data Source

Manually reading files, parsing their content, and creating Document objects would be tedious. Document Loaders are designed to do this for you. LangChain has a massive ecosystem of over 100 different loaders that can ingest data from a huge variety of sources, including:

  • Files (Text, PDF, CSV, JSON, HTML, Word, etc.)
  • Web pages
  • YouTube transcripts
  • Notion pages
  • Slack conversations
  • Databases (SQL, NoSQL)
  • And many more…

Each loader provides a simple, consistent interface for loading data. Let’s see it in action.

A Practical Example: Loading a Text File

First, ensure you have the necessary community package installed:

1
pip install langchain-community

Next, let’s create a sample file named sample.txt with the following content:

1
2
This is the first line of our sample document.
This is the second line.

Now, we can use the TextLoader to load this file into a Document.

1
2
3
4
5
6
7
8
9
10
11
from langchain_community.document_loaders import TextLoader

# Initialize the loader with the file path
loader = TextLoader("sample.txt")

# Load the documents
documents = loader.load()

print(f"Loaded {len(documents)} document.")
print("Content:", documents[0].page_content)
print("Metadata:", documents[0].metadata)

Output:

1
2
3
4
5
Loaded 1 document.
Content: This is the first line of our sample document.
This is the second line.

Metadata: {'source': 'sample.txt'}

Notice how the loader automatically read the file’s content into page_content and populated the metadata with the source file path.

Another Example: Loading from a URL

To show the versatility, let’s load the content from a web page using WebBaseLoader. You’ll need to install the beautifulsoup4 library for this.

1
pip install beautifulsoup4
1
2
3
4
5
6
7
8
9
10
11
from langchain_community.document_loaders import WebBaseLoader

# Initialize the loader with the URL
loader = WebBaseLoader("http://example.com/")

# Load the documents
documents = loader.load()

print(f"Loaded {len(documents)} document from the URL.")
print("First 100 characters:", documents[0].page_content[:100])
print("Metadata:", documents[0].metadata)

The Loading Process: load vs. lazy_load

Every document loader has two primary methods for loading data:

  1. load(): This method loads all the data from the source at once and returns a complete list of Document objects. It’s simple and fine for small sources, but can consume a lot of memory if you’re loading thousands of large files.
  2. lazy_load(): This method is much more memory-efficient. It returns a Python iterator that yields one Document at a time as you loop over it. This is the recommended approach for large datasets or for building streaming data pipelines.

Here’s how you would use lazy_load:

1
2
3
4
5
6
7
# Using the same TextLoader from before
loader = TextLoader("sample.txt")

# Lazily load documents one by one
for doc in loader.lazy_load():
    print("Processing a lazily loaded document...")
    print(doc.metadata)

Conclusion

Document Loaders are the essential first step in building any RAG application. They provide a powerful and consistent interface for ingesting data from virtually any source. By understanding the standard Document object and the load vs. lazy_load pattern, you are now equipped to bring your own data into the LangChain ecosystem.

Our next logical step is to process these loaded documents. Often, they are too large to fit into an embedding model’s context window, so we need to split them into smaller chunks. This will be the topic of our next post in the series.

This post is licensed under CC BY 4.0 by the author.