
Query 36M+ vectors in under 30ms ⚡
Here's the technique behind it:
(Perplexity, Azure AI, and others use it in production)
We built a RAG system that queries 36M+ vectors in <30ms using Binary Quantization.
It's a vector compression technique that trades a small amount of retrieval accuracy for large gains in speed and memory.
Essentially, we generate text embeddings (in float32) and convert them to binary vectors, resulting in a 32x reduction in memory and storage.
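The conversion itself is simple: keep only the sign bit of each float32 dimension, then pack 8 bits per byte. A minimal NumPy sketch (toy random vectors standing in for real model embeddings) shows where the 32x figure comes from:

```python
import numpy as np

# Toy float32 "embeddings": 4 vectors, 1024 dims (stand-in for real model output)
rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 1024)).astype(np.float32)

# Binary quantization: keep only the sign of each dimension (1 if > 0, else 0),
# then pack 8 bits into one byte.
bits = (emb > 0).astype(np.uint8)
packed = np.packbits(bits, axis=1)  # shape (4, 1024 // 8) = (4, 128)

print(emb.nbytes)     # 4 vectors * 1024 dims * 4 bytes = 16384 bytes
print(packed.nbytes)  # 4 vectors * 128 bytes = 512 bytes → 32x smaller
```

Each dimension goes from 32 bits (float32) to 1 bit, hence the 32x reduction.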
Here's the tech stack:
→ LlamaIndex for orchestration
→ Milvus (by Zilliz) as the vector DB
→ Moonshot AI's Kimi-K2 as the LLM hosted on Groq
Here's the workflow:
1️⃣ Ingest documents and generate binary embeddings
2️⃣ Create a binary vector index and store embeddings in vector DB
3️⃣ Retrieve top-k similar documents to user's query
4️⃣ LLM generates a response grounded in the retrieved context
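Step 3 is where binary vectors pay off: similarity search becomes Hamming distance, i.e. XOR plus a popcount, which is far cheaper than float dot products. A toy NumPy sketch of the retrieval step (random packed vectors standing in for the Milvus binary index; in production the DB does this with a dedicated binary index):

```python
import numpy as np

rng = np.random.default_rng(1)

# Packed binary embeddings: 10,000 docs x 1024 bits (128 bytes each) — a toy
# stand-in for the binary vector index in the real pipeline.
docs = rng.integers(0, 256, size=(10_000, 128), dtype=np.uint8)
query = rng.integers(0, 256, size=(128,), dtype=np.uint8)

# Hamming distance = popcount(XOR): unpack the XORed bytes to bits and sum them.
dists = np.unpackbits(docs ^ query, axis=1).sum(axis=1)

# Top-k retrieval: the k documents with the smallest Hamming distance.
k = 5
top_k = np.argsort(dists)[:k]
print(top_k, dists[top_k])
```

For higher accuracy, a common refinement (not shown) is to over-fetch with Hamming distance and rescore the short list with the original float32 vectors.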
After building this, we wrapped the app in a Streamlit interface. The video shows the interaction.
We tested the deployed setup on the PubMed dataset (36M+ vectors).
This app:
✅ Queried 36M+ vectors in <30ms
✅ Generated a response in <1s
GitHub repo with the code in the comments!
👉 Over to you: Have you tried binary quantization for RAG?
#ai #rag #PerplexityAI
@dailydoseofds_
