Vector Databases
Discovery Node supports multiple vector database backends for storing and searching embeddings. This guide covers the configuration, features, and trade-offs of each supported backend.
Overview
Vector databases enable semantic search by storing and efficiently searching high-dimensional embeddings. Discovery Node uses these embeddings to understand the meaning of search queries and find semantically similar products.
Supported Backends
PGVector (PostgreSQL Extension)
PGVector is the default and recommended backend for most deployments. It adds vector similarity search directly to PostgreSQL.
Advantages:
- No additional infrastructure required
- Unified data and vector storage
- Full ACID compliance
- Cost-effective for small to medium datasets
- Simple backup and recovery
- Native PostgreSQL features (joins, transactions, etc.)
Configuration:
# .env configuration
VECTOR_STORAGE_BACKEND=pgvector
DATABASE_URL=postgresql://user:password@localhost:5432/discovery_db
# No additional configuration needed - uses the main database
Installation:
-- Enable the extension (run as superuser)
CREATE EXTENSION IF NOT EXISTS vector;
Performance considerations:
- Best for datasets up to 1M products
- Supports HNSW indexing for faster searches
- Can leverage PostgreSQL query optimization
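To make this concrete, here is a minimal query sketch using psycopg2. The products table, embedding column, and the zero-filled query vector are assumptions for illustration; Discovery Node manages its own schema.
import psycopg2

# Hypothetical query vector; in practice this comes from the embedding model
query_embedding = [0.0] * 1536
vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

conn = psycopg2.connect("postgresql://user:password@localhost:5432/discovery_db")
with conn.cursor() as cur:
    # <=> is pgvector's cosine distance operator; smaller means more similar
    cur.execute(
        "SELECT id FROM products ORDER BY embedding <=> %s::vector LIMIT 10;",
        (vector_literal,),
    )
    results = cur.fetchall()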
Pinecone (Cloud Vector Database)
Pinecone is a fully-managed vector database service optimized for production scale.
Advantages:
- Highly scalable (billions of vectors)
- Managed service (no maintenance)
- Real-time updates
- Advanced filtering capabilities
- Global deployment options
- Optimized for similarity search
Configuration:
# .env configuration
VECTOR_STORAGE_BACKEND=pinecone
PINECONE_API_KEY=your-api-key-here
PINECONE_ENVIRONMENT=your-environment
PINECONE_CLOUD=aws
PINECONE_REGION=us-east-1
PINECONE_DENSE_INDEX=discovery-dense
PINECONE_SPARSE_INDEX=discovery-sparse
PINECONE_BATCH_SIZE=96
Index setup:
# The setup script creates these indexes automatically
# Dense index: For semantic embeddings
# Sparse index: For keyword-based search
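For reference, dense index creation with the pinecone Python client (v3+) looks like the sketch below. The dimension and names mirror the configuration above, but the setup script remains the authoritative source, and sparse index creation is omitted here.
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-api-key-here")
# Dense index for semantic embeddings; dimension must match the embedding model
pc.create_index(
    name="discovery-dense",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)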
Performance considerations:
- Best for large-scale production deployments
- Sub-50ms query latency at any scale
- Automatic scaling and optimization
How Vector Storage Affects Search
Search Quality
PGVector:
- Uses cosine similarity for semantic matching
- Supports exact and approximate nearest neighbor search
- Quality depends on index type and parameters
- Can combine with PostgreSQL full-text search
Pinecone:
- Optimized similarity algorithms
- Hybrid search combining dense and sparse vectors
- Built-in re-ranking capabilities
- Consistent quality at scale
Search Performance
| Aspect | PGVector | Pinecone |
|---|---|---|
| Query latency (1K products) | ~5-10ms | ~20-30ms |
| Query latency (1M products) | ~50-200ms | ~20-40ms |
| Query latency (10M+ products) | ~1-5s | ~30-50ms |
| Indexing speed | Fast | Moderate |
| Real-time updates | Immediate | Near real-time |
Hybrid Search Implementation
Discovery Node implements hybrid search differently based on the backend:
With PGVector:
-- Combines vector similarity with metadata filtering;
-- query_embedding stands in for the query vector parameter
SELECT * FROM products
WHERE category = 'Electronics'
  AND (embedding <=> query_embedding) < 0.5
ORDER BY embedding <=> query_embedding
LIMIT 20;
With Pinecone:
# Uses Pinecone's native hybrid search
results = index.query(
    vector=query_embedding,
    sparse_vector=sparse_query,
    filter={"category": "Electronics"},
    top_k=20,
    include_metadata=True,
)
Embedding Models
Both backends work with the same embedding models:
# Configure embedding model
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_API_KEY=your-openai-api-key
# Model dimensions
# text-embedding-3-small: 1536 dimensions
# text-embedding-3-large: 3072 dimensions
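For reference, generating an embedding with the OpenAI Python client looks like this; the query text is illustrative:
from openai import OpenAI

client = OpenAI(api_key="your-openai-api-key")
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="wireless noise-cancelling headphones",
)
embedding = response.data[0].embedding  # list of 1536 floats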
Migration Between Backends
PGVector to Pinecone
# Use the migration script
python scripts/migrate_vectors.py --from pgvector --to pinecone
# Or re-ingest all data
python main.py ingest --force-regenerate-embeddings
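Conceptually, this migration streams vectors out of PostgreSQL and batch-upserts them into Pinecone. The sketch below is a simplification, not the actual script; the table, column, and index names are assumptions, and it relies on the pgvector-python package to decode vector columns.
import psycopg2
from pgvector.psycopg2 import register_vector
from pinecone import Pinecone

conn = psycopg2.connect("postgresql://user:password@localhost:5432/discovery_db")
register_vector(conn)  # return vector columns as numpy arrays
index = Pinecone(api_key="your-api-key-here").Index("discovery-dense")

with conn.cursor(name="vector_export") as cur:  # named cursor streams rows server-side
    cur.execute("SELECT id, embedding FROM products;")
    while True:
        rows = cur.fetchmany(96)  # matches PINECONE_BATCH_SIZE
        if not rows:
            break
        index.upsert(vectors=[(str(pid), emb.tolist()) for pid, emb in rows])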
Pinecone to PGVector
# Export from Pinecone and import to PGVector
python scripts/migrate_vectors.py --from pinecone --to pgvector
Choosing the Right Backend
Use PGVector when:
- Starting a new project
- Dataset is under 1M products
- Want simplified architecture
- Need transactional consistency
- Cost is a primary concern
- Already using PostgreSQL
Use Pinecone when:
- Scaling beyond 1M products
- Need consistent sub-50ms latency
- Require global distribution
- Want managed infrastructure
- Need advanced search features
- Have dedicated search traffic
Performance Optimization
PGVector Optimization
-- Create HNSW index for better performance
CREATE INDEX ON products
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Tune PostgreSQL for vector workloads
-- postgresql.conf (example values; adjust to your hardware)
shared_buffers = 8GB            # roughly 25% of available RAM
work_mem = 256MB
maintenance_work_mem = 2GB      # speeds up HNSW index builds
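At query time, hnsw.ef_search controls the recall/speed trade-off (pgvector's default is 40; higher values scan more of the graph). A minimal sketch of raising it for one session:
import psycopg2

conn = psycopg2.connect("postgresql://user:password@localhost:5432/discovery_db")
with conn.cursor() as cur:
    # Better recall at the cost of slower queries for this session
    cur.execute("SET hnsw.ef_search = 100;")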
Pinecone Optimization
# Optimize batch processing
PINECONE_BATCH_SIZE=96 # Tune based on your data
# Use metadata filtering to reduce search space
filter = {
    "category": {"$in": ["Electronics", "Computers"]},
    "price": {"$gte": 100, "$lte": 1000},
}
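PINECONE_BATCH_SIZE corresponds to client-side batching during ingestion. With the pinecone Python client you can let upsert handle the chunking; the record below is a placeholder, and you should confirm the batch_size parameter against your client version's upsert signature.
from pinecone import Pinecone

index = Pinecone(api_key="your-api-key-here").Index("discovery-dense")
vectors = [
    ("prod-1", [0.0] * 1536, {"category": "Electronics", "price": 499}),
]
# Upsert in chunks of 96 to stay within request size limits
index.upsert(vectors=vectors, batch_size=96)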
Monitoring and Debugging
PGVector Monitoring
-- Check index usage
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE indexname LIKE '%embedding%';
-- Monitor query performance
EXPLAIN ANALYZE
SELECT * FROM products
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;
Pinecone Monitoring
# Check index statistics
stats = pinecone_index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")
print(f"Dimensions: {stats.dimension}")
# Monitor API usage in Pinecone console
Cost Considerations
PGVector Costs
- Infrastructure: PostgreSQL server costs only
- Storage: ~6KB per product (1536-dim embeddings: 1536 × 4 bytes per float)
- Compute: Scales with your PostgreSQL instance
- Example: 100K products ≈ 600MB additional storage
Pinecone Costs
- Storage: $0.096/GB/month (as of 2024)
- Queries: Included in plan limits
- Example: 100K products (~600MB) ≈ $0.06/month storage
- Plans: Free tier available for testing
Best Practices
- Start with PGVector for proof of concept
- Monitor query latency as dataset grows
- Plan migration path if expecting rapid growth
- Test both backends with your actual data
- Use appropriate embedding models for your domain
- Implement caching for frequent queries (see the sketch after this list)
- Schedule regular maintenance (reindexing, optimization)
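For the caching point above, even a process-local LRU cache around the embedding call avoids repeated API round-trips for popular queries; a minimal sketch assuming the OpenAI client:
from functools import lru_cache

from openai import OpenAI

client = OpenAI(api_key="your-openai-api-key")

@lru_cache(maxsize=10_000)
def embed_query(text: str) -> tuple[float, ...]:
    # Identical query strings hit the cache instead of the embeddings API
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return tuple(response.data[0].embedding)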
Troubleshooting
Common PGVector Issues
Slow queries:
- Check if HNSW index exists
- Increase work_mem
- Consider partitioning large tables
Out of memory:
- Reduce embedding dimensions
- Increase server memory
- Use approximate search (lower ef_search)
Common Pinecone Issues
Rate limits:
- Implement exponential backoff (see the sketch after this list)
- Use batch operations
- Consider upgrading plan
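A simple version of the backoff pattern referenced above; in practice, catch the client's specific rate-limit exception rather than a bare Exception:
import random
import time

def upsert_with_backoff(index, vectors, max_retries=5):
    # Retry with exponential backoff plus jitter when Pinecone rejects a request
    for attempt in range(max_retries):
        try:
            return index.upsert(vectors=vectors)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())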
Inconsistent results:
- Check if all embeddings are indexed
- Verify metadata filters
- Ensure consistent embedding model
Next Steps
- Configure Data Ingestion for your sources
- Understand How Search Works with each backend
- Learn about MCP Integration for AI assistants