Key Considerations When Designing Data Engineering Pipelines for RAG/LLM Applications
Data Quality is Everything
Low-quality retrieved context is a common cause of hallucinations: the model can only ground its answer in what the retriever returns.
Critical issues:
- Duplicates
- Stale data
- Conflicting documents
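A minimal sketch of how the first two issues might be handled before indexing. It assumes each document is a dict with hypothetical keys `id`, `text`, and `updated_at` (a timezone-aware datetime); the one-year cutoff is an illustrative default, not a recommendation. It only catches duplicates whose text is identical after whitespace and case normalization. Near-duplicates and conflicting documents need separate logic and are not covered here.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def clean_corpus(docs, max_age_days=365, now=None):
    """Drop stale documents and keep only the newest copy of each duplicate.

    `docs` is a list of dicts with the assumed keys 'id', 'text', 'updated_at'.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    newest_by_hash = {}
    for doc in docs:
        if doc["updated_at"] < cutoff:
            continue  # stale: older than the allowed age
        key = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        kept = newest_by_hash.get(key)
        if kept is None or doc["updated_at"] > kept["updated_at"]:
            newest_by_hash[key] = doc  # keep the most recent duplicate
    return sorted(newest_by_hash.values(), key=lambda d: d["id"])
```

Hashing normalized text is cheap and deterministic, which makes it a reasonable first pass before more expensive near-duplicate detection.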
Document Parsing
Challenges:
- PDFs
- Tables
- OCR errors
- Scanned docs
Tools:
- Unstructured
- PyMuPDF
- Docling
- OCR pipelines
Chunking Strategy
Chunking is one of the biggest factors in RAG quality, because chunk boundaries decide which text can be retrieved together.
Strategies:
- Fixed chunks
- Recursive chunks
- Semantic chunks
- Graph chunks
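The simplest strategy in the list above, fixed chunks, can be sketched as follows. This version counts characters and adds an overlap between neighboring chunks. The function name, the defaults, and the use of characters rather than tokens are all assumptions made for illustration.

```python
def fixed_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary partly present in both chunks, at the cost of some duplicated storage. Recursive and semantic strategies replace the fixed window with splits at paragraph, sentence, or topic boundaries.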
Metadata Engineering
Essential metadata:
- Source
- Timestamp
- Department
- Security label
- Document type
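Rich metadata lets the retriever narrow candidates (for example by department, date, or security label) before or alongside similarity search.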
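One way to carry the fields listed above alongside each chunk is a small typed record. The class names, field types, and the `filter_chunks` helper below are hypothetical; a real vector database would usually apply such filters on the server side.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkMetadata:
    source: str
    timestamp: date
    department: str
    security_label: str
    document_type: str


@dataclass(frozen=True)
class Chunk:
    text: str
    meta: ChunkMetadata


def filter_chunks(chunks, *, department=None, document_type=None, not_before=None):
    """Return chunks matching every filter that was provided (None = no filter)."""
    selected = []
    for chunk in chunks:
        if department is not None and chunk.meta.department != department:
            continue
        if document_type is not None and chunk.meta.document_type != document_type:
            continue
        if not_before is not None and chunk.meta.timestamp < not_before:
            continue
        selected.append(chunk)
    return selected
```

Calling `filter_chunks(chunks, department="finance", not_before=date(2023, 1, 1))` would keep only recent finance chunks before any similarity scoring takes place.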
Embedding Selection
The embedding model should match the domain and language of the corpus.
Examples:
- General semantic search
- Code embeddings
- Legal embeddings
- Multilingual embeddings
Vector Database Design
Popular DBs:
- Pinecone
- Weaviate
- Milvus
- Qdrant
- Chroma
Key considerations:
- Hybrid search
- Filtering
- Scalability
- Replication
Hybrid Retrieval
Combine:
- BM25
- Dense vectors
- Reranking
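This combination usually outperforms pure vector search, especially for queries that contain exact terms such as IDs, names, or error codes.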
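The notes do not say how the BM25 and dense results should be merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method intended here. It needs only the ranked ID lists from each retriever, not their raw scores. The constant `k=60` is a conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores the sum of 1 / (k + rank) over the lists it appears in,
    where rank starts at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF works on ranks, it avoids having to normalize BM25 scores and cosine similarities onto a common scale. The fused list can then be passed to a reranker.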
Reranking
A second-stage reranker rescores the top candidates from first-stage retrieval, which typically improves the ordering of the final context.
Examples:
- Cross encoders
- Cohere rerank
- BGE rerankers
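The two-stage pattern can be sketched with a pluggable scoring function. The `token_overlap_score` function below is a toy stand-in so that the example runs without external models. In practice `score_fn` would wrap one of the rerankers listed above, and that wrapper is not shown here.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Rescore first-stage candidates with score_fn and keep the best top_n."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # sort() is stable, so equal scores keep their first-stage order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def token_overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query tokens that occur in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)
```

Rerankers are usually run on only a few dozen candidates because scoring each query-passage pair is far more expensive than a vector lookup.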
Real-time Ingestion
Typical building blocks:
- CDC (change data capture) pipelines
- Kafka
- Streaming ETL
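On the consuming side, a change event stream can be applied to an index roughly as follows. The event shape (`op`, `id`, `version`, `text`) is hypothetical, and a plain dict stands in for the vector index. In a real pipeline the events would arrive from a stream such as a Kafka topic, and the text would be re-chunked and re-embedded on each upsert.

```python
def apply_change_event(index: dict, event: dict) -> None:
    """Apply one CDC-style event to an in-memory index, in place.

    Upserts are ignored when their version is not newer than the stored one,
    so replayed or out-of-order updates do not overwrite fresher data.
    """
    op = event["op"]
    doc_id = event["id"]
    if op in ("insert", "update"):
        current = index.get(doc_id)
        if current is None or event["version"] > current["version"]:
            index[doc_id] = {"text": event["text"], "version": event["version"]}
    elif op == "delete":
        index.pop(doc_id, None)  # deleting a missing document is a no-op
    else:
        raise ValueError(f"unknown operation: {op!r}")
```

Making each operation idempotent matters because streaming systems commonly deliver a message more than once.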
Access Control
Enterprise RAG must support:
- Row-level security
- Tenant isolation
- RBAC (role-based access control)
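A minimal sketch of how tenant isolation and role checks might be enforced at retrieval time. The `User` and `Doc` types and their fields are assumptions for illustration. The key point is that the filter runs before any text reaches the prompt, so the model never sees documents the user is not allowed to read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    tenant_id: str
    roles: frozenset


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant_id: str
    allowed_roles: frozenset


def visible_docs(user: User, docs: list) -> list:
    """Keep documents in the user's tenant that share at least one role with the user."""
    return [
        doc for doc in docs
        if doc.tenant_id == user.tenant_id and (user.roles & doc.allowed_roles)
    ]
```

In production the same conditions would normally be pushed down as a metadata filter in the vector database query, rather than applied after retrieval.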
Observability
Track:
- Retrieval precision
- Hallucination rate
- Latency
- Context utilization
Tools:
- LangSmith
- Arize Phoenix
- Weights & Biases
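Of the metrics listed under Track, retrieval precision is the easiest to compute offline, given a labeled set of relevant document IDs per query. The sketch below uses the common precision@k definition, with k as the denominator. The function name and input shapes are assumptions.

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the top-k retrieved IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k
```

Averaging this value over a fixed evaluation set of queries gives a single number that can be tracked across pipeline changes.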





