Post

Vector Database Series: A Deep Dive into Weaviate

Vector Database Series: A Deep Dive into Weaviate

Introduction

Welcome to the second post in our series on vector databases. In our previous article, we explored the fundamentals of vector databases and took a close look at Pinecone. Now, we turn our attention to another powerful and popular player in this space: Weaviate.

Weaviate is an open-source, AI-native vector database with a distinct philosophy. While Pinecone excels as a highly optimized and managed service for vector search, Weaviate provides a more holistic data management platform. It’s designed to be a flexible, all-in-one solution that can not only store and search vectors but also generate them and manage complex data relationships.

What is Weaviate? Key Concepts

Weaviate is built with the understanding that the vector embedding is just one part of your data. It stores “data objects” that can have various properties, just like a document in a NoSQL database, with the vector being a key property.

Here are some of Weaviate’s key differentiators:

  • Open-Source: At its core, Weaviate is open-source. This gives you the flexibility to self-host it on your own infrastructure (e.g., using Docker or Kubernetes) for maximum control, or use the managed Weaviate Cloud Service (WCS) for convenience.
  • Vectorization Modules: This is perhaps Weaviate’s most significant feature. It can integrate directly with vectorization models from providers like OpenAI, Cohere, and Hugging Face. This means you can send your raw data (like text or images) directly to Weaviate, and it will generate the vector embeddings for you. This simplifies your application architecture by removing the need for a separate vectorization step.
  • GraphQL and RESTful APIs: Weaviate provides a rich GraphQL API that allows for incredibly expressive queries. You can perform vector searches, apply complex metadata filters, and even traverse relationships between data objects, all in a single API call.
  • Hybrid Search: Weaviate combines traditional keyword-based (BM25) search with modern vector search out-of-the-box, allowing you to build robust search systems that leverage the best of both worlds.

Getting Started with Weaviate: A Code Walkthrough

Let’s explore how to use Weaviate with its Python client. For this example, we’ll assume you are using the Weaviate Cloud Service (WCS), which provides a free sandbox tier.

Step 1: Installation

First, install the Weaviate Python client.

1
pip install weaviate-client

Step 2: Connecting to a Weaviate Instance

You’ll need your Weaviate cluster URL and API key, which you can get from the WCS console.

1
2
3
4
5
6
7
8
9
10
11
12
13
import weaviate
import os

# Connect to your Weaviate Cloud Service instance
client = weaviate.Client(
    url=os.getenv("WEAVIATE_CLUSTER_URL"),  # Your WCS URL
    auth_client_secret=weaviate.AuthApiKey(api_key=os.getenv("WEAVIATE_API_KEY")),  # Your WCS API key
    additional_headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY") # Needed for the vectorizer module
    }
)

print("Client is ready!" if client.is_ready() else "Client is not ready!")

Step 3: Creating a Schema (Class)

In Weaviate, you define the structure of your data by creating a class (similar to a table in SQL). In the schema, you define the properties of your data objects and, crucially, configure the vectorizer module.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
class_name = "Article"

class_obj = {
    "class": class_name,
    "description": "A collection of articles",
    "vectorizer": "text2vec-openai",  # Use OpenAI's model to vectorize the data
    "moduleConfig": {
        "text2vec-openai": {
            "model": "ada",
            "type": "text"
        }
    },
    "properties": [
        {
            "name": "title",
            "dataType": ["text"],
            "description": "The title of the article",
        },
        {
            "name": "content",
            "dataType": ["text"],
            "description": "The full content of the article",
        },
        {
            "name": "category",
            "dataType": ["string"],
            "description": "The category of the article",
        }
    ]
}

# Clean up previous runs
if client.schema.exists(class_name):
    client.schema.delete_class(class_name)

# Create the class
client.schema.create_class(class_obj)
print(f"Class '{class_name}' created successfully.")

Step 4: Adding Data Objects

Now, let’s add some articles. Notice that we are sending the raw text for title and content. Weaviate will automatically use the text2vec-openai module to generate embeddings for these objects.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
articles = [
    {"title": "The Future of AI", "content": "Artificial intelligence is rapidly evolving...", "category": "tech"},
    {"title": "A Guide to Healthy Cooking", "content": "Eating healthy starts in the kitchen...", "category": "lifestyle"},
    {"title": "Global Economic Trends", "content": "The world economy shows signs of recovery...", "category": "business"},
    {"title": "The Art of Storytelling", "content": "Crafting a compelling narrative is essential...", "category": "writing"},
]

# Use a batch process for efficient data import
with client.batch as batch:
    for article in articles:
        batch.add_data_object(
            data_object=article,
            class_name=class_name
        )

print("Data objects added successfully.")

Weaviate’s query language is very expressive. The most common type of search is near_text, which takes a text concept and finds the most semantically similar objects.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# Perform a vector search to find articles related to "technology trends"
response = (
    client.query
    .get(class_name, ["title", "category"])
    .with_near_text({"concepts": ["technology trends"]})
    .with_limit(2)
    .do()
)

print("\nVector search results for 'technology trends':")
print(response)

# Perform a hybrid search with a metadata filter
hybrid_response = (
    client.query
    .get(class_name, ["title", "content"])
    .with_near_text({"concepts": ["creative expression"]})
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueString": "writing"
    })
    .with_limit(1)
    .do()
)

print("\nHybrid search results for 'creative expression' in the 'writing' category:")
print(hybrid_response)

Step 6: Cleaning Up

You can delete individual objects by their UUID or delete an entire class.

1
2
3
# To delete the class and all its objects (use with caution!)
# client.schema.delete_class(class_name)
# print(f"\nClass '{class_name}' deleted.")

Conclusion

Weaviate stands out as a powerful, open-source vector database that offers a high degree of flexibility and control. Its integrated vectorization modules and rich GraphQL API make it an excellent choice for developers who want a more all-in-one solution for building complex, AI-native applications.

If you value open-source technology, need to self-host, or want your database to handle the vectorization process for you, Weaviate is a fantastic option to consider.

Stay tuned for the final post in our series, where we’ll explore how to add vector search capabilities to a classic relational database with pgvector for PostgreSQL.

Suggested Reading

This post is licensed under CC BY 4.0 by the author.