Vector Database Series: A Deep Dive into pgvector
Vector Database Series: A Deep Dive into pgvector
Introduction
Welcome to the final installment of our vector database series. We’ve explored Pinecone, a highly-optimized managed service, and Weaviate, a flexible, open-source platform with built-in vectorization. Now, we’ll look at a different but incredibly practical approach: pgvector.
pgvector is an open-source extension for PostgreSQL that brings vector similarity search capabilities directly into your traditional relational database. This isn’t a separate database system; it’s an enhancement to the one that millions of developers already know and trust. For teams with a significant investment in PostgreSQL, pgvector offers a compelling way to add AI-powered features without overhauling their tech stack.
What is pgvector?
pgvector seamlessly integrates vector operations into PostgreSQL by providing three core features:
- A New
vectorData Type: It introduces a new data type to store your high-dimensional embeddings directly within your tables. - New Similarity Operators: It adds new SQL operators to calculate the distance between vectors. The three main ones are:
<->: Euclidean distance (L2 norm)<#>: Negative inner product<=>: Cosine distance (1 minus cosine similarity)
- Index Support for ANN: To make searching fast,
pgvectorsupports creating indexes for Approximate Nearest Neighbor (ANN) search. It currently supports two popular indexing methods:- IVFFlat: A simple and fast inverted file index.
- HNSW (Hierarchical Navigable Small World): A more modern, graph-based index that often provides better performance-accuracy trade-offs.
The Pros and Cons of Using pgvector
Choosing pgvector means embracing a different set of trade-offs compared to using a dedicated vector database.
The Pros
- Unified Data Store: This is the killer feature. You can keep your vectors in the same database, even the same table, as your existing business data. This allows for powerful
JOINs and complex queries that combine relational filters and vector search in a single, atomic transaction. - Leverages Existing Infrastructure: You can use your existing PostgreSQL backups, security policies, monitoring tools, and operational expertise. There’s no new system to learn how to manage.
- Transactional Guarantees: Your vector data benefits from the full power of PostgreSQL’s ACID compliance, ensuring data consistency.
- Mature Ecosystem: You can tap into the vast ecosystem of clients, libraries, and tools already available for PostgreSQL.
The Cons
- Not a Specialized Solution: While
pgvectoris highly optimized, it may not match the raw query latency or scale of a dedicated, distributed vector database that is purpose-built for vector search across billions of records. - Manual Management: You are responsible for managing, tuning, and scaling your PostgreSQL instance. With a managed service like Pinecone, this is handled for you.
- Bring Your Own Vectors: Like Pinecone,
pgvectordoes not have built-in vectorization modules. You must generate your own embeddings in your application before storing them in the database.
Getting Started with pgvector: A Code Walkthrough
Let’s see how to use pgvector with Python and the popular SQLAlchemy library.
Step 1: Setup and Installation
You need a PostgreSQL instance with the pgvector extension enabled. Many managed database providers like Supabase, Neon, Timescale, and AWS RDS now offer easy support for this. In your database, you must run:
1
CREATE EXTENSION IF NOT EXISTS vector;
Next, install the necessary Python libraries.
1
pip install sqlalchemy psycopg2-binary numpy
Step 2: Connecting and Creating a Table
We’ll use SQLAlchemy to connect to the database and define our table. We need to import the Vector type from the pgvector.sqlalchemy module.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
from sqlalchemy import create_engine, text, insert, select, Index
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy import Column, Integer, Text
from pgvector.sqlalchemy import Vector
import numpy as np
# Replace with your database connection string
DATABASE_URL = "postgresql://user:password@host:port/database"
engine = create_engine(DATABASE_URL)
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base()
# Define our table schema
class Item(Base):
__tablename__ = 'items'
id = Column(Integer, primary_key=True)
content = Column(Text)
embedding = Column(Vector(3)) # We'll use 3-dimensional vectors for this example
# Create the table
Base.metadata.drop_all(engine) # Drop table if it exists, for a clean run
Base.metadata.create_all(engine)
print("Table 'items' created successfully.")
Step 3: Generating and Inserting Data
We’ll create some dummy 3D vectors and insert them into our items table.
1
2
3
4
5
6
7
8
9
10
11
12
13
# Generate some dummy data
items_to_insert = [
{'content': 'This is a document about cats', 'embedding': [0.1, 0.2, 0.7]},
{'content': 'This is a document about dogs', 'embedding': [0.3, 0.5, 0.2]},
{'content': 'This is a document about cars', 'embedding': [0.8, 0.1, 0.1]},
]
# Insert the data
for item in items_to_insert:
session.execute(insert(Item).values(content=item['content'], embedding=item['embedding']))
session.commit()
print("Data inserted successfully.")
Step 4: Creating an Index for Performance
For a small number of vectors, a sequential scan is fine. But for any real application, you must create an index to enable fast ANN search. Here, we create an HNSW index for cosine similarity search.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
# The number of lists for IVFFlat, or the number of neighbors for HNSW
lists = 100
# Create the index
index = Index(
'hnsw_index',
Item.embedding,
postgresql_using='hnsw',
postgresql_with={'m': 16, 'ef_construction': 64},
postgresql_ops={'embedding': 'vector_cosine_ops'}
)
index.create(engine)
print("HNSW index created.")
Step 5: Performing Vector Search
Now, we can perform a similarity search using the <=> operator for cosine distance.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
# Create a query vector
query_embedding = np.array([0.2, 0.3, 0.5])
# Find the 2 most similar items
results = session.scalars(
select(Item).order_by(Item.embedding.cosine_distance(query_embedding)).limit(2)
).all()
print("\nTop 2 most similar items:")
for item in results:
print(f" - ID: {item.id}, Content: {item.content}")
# The real power: combining vector search with a WHERE clause
print("\nCombining with a WHERE clause:")
filtered_results = session.scalars(
select(Item)
.filter(Item.content.like('%document%'))
.order_by(Item.embedding.cosine_distance(query_embedding))
.limit(2)
).all()
for item in filtered_results:
print(f" - ID: {item.id}, Content: {item.content}")
session.close()
Conclusion
pgvector represents a pragmatic and powerful solution for adding vector search capabilities to applications already built on PostgreSQL. It simplifies the tech stack, leverages the robustness and maturity of the Postgres ecosystem, and allows for seamless integration of relational and vector data.
While dedicated vector databases may offer better performance at extreme scale, pgvector is an outstanding choice for many real-world use cases. This concludes our series on vector databases. The choice between a dedicated service like Pinecone, a flexible platform like Weaviate, or an integrated extension like pgvector ultimately depends on your project’s specific needs, existing infrastructure, and operational preferences.