Discovery Node v0.1.0

🎉 Initial Alpha Release

We're excited to announce the first release of Discovery Node, a data ingestion and vector search platform for e-commerce product catalogs that follows the Commerce Mesh Protocol (CMP) standards.

🚀 Key Features

Core Capabilities

Multi-Source Data Ingestion: Support for local files and CMP-compliant feeds
Vector Search: Dual vector database support (PostgreSQL with pgvector and Pinecone)
Flexible Architecture: Modular design with support for multiple embedding providers
Background Processing: Celery-based async task processing for scalable ingestion
RESTful API: FastAPI-based API for search and data access
MCP Server: SSE-based MCP server for search and data access

Data Ingestion

Brand Registry Support: Ingest organization and brand data from CMP-compliant registries
Product Feed Processing: Handle sharded product feeds with automatic shard discovery
Batch Processing: Efficient batch processing of large product catalogs
Error Handling: Robust error handling with retry mechanisms
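The batch processing and retry behaviour listed above can be pictured with a small sketch. This is not Discovery Node's actual code: the helper names `batched` and `with_retries`, the default of three attempts, and the exponential backoff schedule are all assumptions chosen for illustration.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls the next `size` items; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


def with_retries(
    func: Callable[[], R],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Call `func`, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Assumed schedule: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A caller would typically wrap the work for each batch, for example `with_retries(lambda: store(batch))`, where `store` stands in for whatever writes a batch to the database.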

Vector Search

Embedding Providers: OpenAI embeddings supported today, with an extensible architecture for adding other providers
Hybrid Search: Combined dense and sparse vector search capabilities
Configurable Indexes: Switch between pgvector and Pinecone based on requirements
Real-time Updates: Automatic vector updates on product data changes
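One common way to combine dense and sparse results is a weighted sum of the two scores. The sketch below illustrates that idea only; the function name `hybrid_scores`, the `alpha` weight, and the assumption that both score sets are already normalized to comparable ranges are illustrative and may not match how Discovery Node ranks results.

```python
def hybrid_scores(
    dense: dict[str, float],
    sparse: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Rank ids by alpha * dense + (1 - alpha) * sparse.

    An id missing from one result set contributes 0 for that component.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    ids = dense.keys() | sparse.keys()
    combined = {
        item: alpha * dense.get(item, 0.0) + (1 - alpha) * sparse.get(item, 0.0)
        for item in ids
    }
    # Highest score first; ties are broken by id so the order is deterministic.
    return sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))
```

With `alpha=1.0` the ranking is purely dense (semantic), and with `alpha=0.0` it is purely sparse (keyword-like); values in between trade one off against the other.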

Database & Storage

PostgreSQL: Primary database with pgvector extension for vector similarity search
Alembic Migrations: Database version control and migration management
Efficient Schema: Optimized schema for product, brand, and organization data
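For readers new to vector similarity search, the following sketch shows the computation that a cosine-based index approximates: score every stored vector against the query and keep the best matches. pgvector exposes this as cosine distance (one minus cosine similarity) through its `<=>` operator. The function names and the in-memory dictionary here are hypothetical stand-ins, not part of Discovery Node.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Return the keys of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda key: cosine_similarity(query, vectors[key]),
        reverse=True,
    )
    return ranked[:k]
```

This brute-force scan is linear in the number of stored vectors; a vector database exists largely to answer the same question faster with an index.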

📋 Requirements

Python 3.11+
PostgreSQL 15+ with pgvector extension
Redis (for Celery task queue)
OpenAI API key (for embeddings)
Optional: Pinecone API key (for Pinecone vector store)

🛠️ Configuration

The system is configured via environment variables:

# Database
DATABASE_URL=postgresql://user:pass@localhost/discovery_node

# Vector Store
VECTOR_PROVIDER=pgvector  # or pinecone

# Embeddings
EMBEDDING_MODEL_PROVIDER=openai
EMBEDDING_API_KEY=your-openai-api-key
EMBEDDING_MODEL_NAME=text-embedding-3-small

# Optional Pinecone
PINECONE_API_KEY=your-pinecone-key
PINECONE_ENVIRONMENT=your-environment
PINECONE_INDEX_NAME=your-index-name
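Before starting the services, it can help to check that the variables above are set. The sketch below is a hypothetical pre-flight check, not part of Discovery Node: it assumes every non-Pinecone variable is required and that the Pinecone variables are needed only when `VECTOR_PROVIDER` is `pinecone`, which may be stricter than what the application actually enforces.

```python
import os
from typing import Mapping

# Assumption: all of these must be set; the application may define defaults.
REQUIRED = (
    "DATABASE_URL",
    "VECTOR_PROVIDER",
    "EMBEDDING_MODEL_PROVIDER",
    "EMBEDDING_API_KEY",
    "EMBEDDING_MODEL_NAME",
)
# Assumption: only needed when Pinecone is the selected vector store.
PINECONE_ONLY = (
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "PINECONE_INDEX_NAME",
)


def missing_variables(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of expected variables that are unset or blank."""
    needed = list(REQUIRED)
    if env.get("VECTOR_PROVIDER", "").strip().lower() == "pinecone":
        needed.extend(PINECONE_ONLY)
    return [name for name in needed if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        raise SystemExit("Missing configuration: " + ", ".join(missing))
    print("Configuration looks complete.")
```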

🏃 Getting Started

Install Dependencies
```
pip install -r requirements.txt
```
Setup Database
```
alembic upgrade head
```
Configure Ingestion

Edit ingestion.yaml to define your data sources.

Start Services

# Start Redis
redis-server

# Start Celery Worker
celery -A app.worker worker --loglevel=info

# Start API Server
uvicorn main:app --reload

Trigger Ingestion
```
python main.py ingest-all
```

📊 Sample Data

The release includes sample data for testing:

samples/acme-solutions/: Example brand registry and product feed
Includes TVs and cameras with multiple variants
Demonstrates proper CMP data structure

🐛 Known Issues

Feed Index Organization URN: The system now supports both orgid and organization.urn formats in feed indexes
Vector Updates: Products are updated by URN, not UUID
Embedding Service: Currently requires OpenAI API key; local embedding support planned
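To make the URN-keyed update behaviour concrete, here is a minimal in-memory sketch. The `VectorIndex` class and the example URN are hypothetical. The point is only that re-ingesting a product with the same URN replaces its vector instead of creating a second entry, whatever internal UUID the record carries.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Toy stand-in for a vector store whose entries are keyed by product URN."""

    _by_urn: dict[str, list[float]] = field(default_factory=dict)

    def upsert(self, urn: str, vector: list[float]) -> bool:
        """Store `vector` under `urn`; return True if an entry was replaced."""
        replaced = urn in self._by_urn
        self._by_urn[urn] = list(vector)  # copy so later caller edits don't leak in
        return replaced

    def get(self, urn: str) -> list[float]:
        """Return a copy of the vector stored for `urn` (KeyError if absent)."""
        return list(self._by_urn[urn])

    def __len__(self) -> int:
        return len(self._by_urn)
```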

🔮 Future Enhancements

Additional embedding providers (Cohere, local models)
More data source adapters (Shopify, BigCommerce, etc.)
Advanced search features (filters, facets, recommendations)
Admin UI for monitoring and management
Comprehensive test coverage
Performance optimizations for large catalogs

🤝 Contributing

We welcome contributions! Please see our contributing guidelines (coming soon).

📄 License

Discovery Node is released under the MIT License. See the LICENSE file for details.

🙏 Acknowledgments

Built with the Commerce Mesh Protocol (CMP) standards for interoperable commerce data.

Note: This is an alpha release (v0.1.0) of Discovery Node. APIs and interfaces may change in future releases. We're actively developing new features and welcome feedback from the community.
