AI & NLP | RAG System | Production-Ready

Advanced RAG System for Confidential Data Processing

Production-ready Retrieval-Augmented Generation system with hybrid search, multi-cloud LLM orchestration, and 100% on-premise deployment capability for sensitive data processing.

Full demonstration of the RAG system: ingestion, hybrid retrieval, and Local/Cloud comparison | User interface walkthrough starts @1min37

1. Streamlit interface (option 1)
2. Ingested documents (Telecom Paris courses: AI, ML, Statistics, MLOps, Databases, Hadoop, Cloud, Kubernetes, Docker…)
3. Downloading 3 random documents for the demo: Bureau Veritas NR467 (technical rules, guidelines, notations), Code du travail (French labor law), SBTi Report (scientific technical report)
4. Ingesting the 3 downloaded documents (NR467, Code du travail & SBTi Report)
5. Testing RAG & agents
6. Q1) "What are the performance requirements for the towing winch emergency release system?"
7. A1) Local (deepseek-r1:1.5b) vs Cloud (Claude Sonnet 4.5) | Answers (NR467)
8. A1) Exploring chunks retrieved from the ingested documents in the local database
9. A1) NR467 Part B was also ingested, and the system still retrieves useful information since the question relates to that document too → PATH & SPECIFIC PAGE given to locate the source data
10. Exploring one chunk in detail (NR467)
11. 🔒 Disabling internet for the local-only demo (hybrid mode still runs, but the cloud LLM fails with an HTTPS error, as expected)
12. Q2) "For an escort tug, what is the relationship between the Winch Brake Holding (BHL), the rated steady towline force (T_ESC,R), and the towline breaking strength? What minimum BHL is required for escort operations in calm water?"
13. A2) Local (llama3.1:8b) | Answer & formulas (🔐 SECURITY, CONFIDENTIALITY, PRIVATE USAGE... NO INTERNET NEEDED)
14. No internet → Claude streaming error
15. No internet → RAG runs fully locally (ingested database and private documents only) | Local (llama3.1:8b → deepseek-r1:1.5b)
16. Q3) "Describe the three main indirect modes for escort tugs and explain the key difference in thrust application between them"
17. A3) Local (deepseek-r1:1.5b) | Answer & top chunks (🔐 SECURITY, CONFIDENTIALITY, PRIVATE USAGE... NO INTERNET NEEDED)
18. Q4) "According to the Code du travail, what are the conditions for renewing a probation period, and what is the maximum total duration possible for an executive (cadre)?"
19. A4) Local (deepseek-r1:1.5b) | Answer & path | Code du travail
20. Q5) "Describe the notice periods the employer must observe when ending a probation period, depending on the employee's length of service"
21. A5) Local (deepseek-r1:1.5b) | Answer & chunks | Code du travail
22. Q6) "Write a Python function that computes the statutory severance pay under articles R1234-1 and R1234-2 (brackets: 1/4 for <10 years, 1/3 for >10 years)"
23. Changing settings (Local: deepseek-r1:1.5b → deepseek-r1:8b & RAG mode: Local → Hybrid, still with no internet)
24. A6) Local (deepseek-r1:8b) | Answer & code | Code du travail
25. Q7) "What are the top three areas where companies reported the highest positive impact from setting science-based targets according to the SBTi Survey?"
26. 🔄 Exploring the data pipeline: user prompt via the Streamlit interface or widget → Orchestrator → agent routing (Retrieval for this question) & query classification → RETRIEVAL PIPELINE (embedding and reranking models run locally) → LLM generation (local LLM via Ollama, fully local | cloud LLM via API key, internet required)
27. A7) Local (llama3.1:8b) vs Cloud LLM (Claude) | Correct answers, chunks & path | SBTi report
28. Q8) "According to the SBTi report's literature review, what are the estimated future annual savings in CO2 emissions and costs that result from companies' initial climate investments?"
29. A8) Local (llama3.1:8b) vs Cloud LLM (Claude) | Correct answers, chunks & path | SBTi report
30. 🔄 Exploring the data pipeline: feedback-learning system (implemented, work in progress)
31. Streamlit interface exploration
32. 🤖 Event detection system & bot widget (implemented, work in progress)

1. Context and Objectives

In the era of Large Language Models (LLMs), organizations face a critical challenge: leveraging AI capabilities while maintaining data sovereignty and confidentiality. This project addresses this need by developing an advanced Retrieval-Augmented Generation (RAG) system that can operate entirely on-premise or integrate with cloud providers based on security requirements.

Key Objectives

  • Privacy-First Architecture: Enable 100% local deployment for confidential data processing without cloud dependencies
  • Hybrid Retrieval: Combine dense vector search, sparse BM25, and advanced fusion techniques for optimal accuracy
  • Multi-Cloud Flexibility: Support real-time comparison between local LLMs (Ollama) and cloud providers (Claude, GPT-4, Gemini, Mistral, Cohere)
  • Production-Ready Pipeline: Automated document ingestion with intelligent chunking, quality filtering, and metadata enrichment
  • Performance Optimization: GPU-accelerated embeddings and efficient indexing for low-latency retrieval
RAG System Architecture Overview

Figure 1 - High-level architecture of the RAG system

The main challenge was to design a system that balances performance, cost, and privacy while maintaining production-grade reliability and scalability.

2. Hybrid Retrieval Architecture

The retrieval system combines dense vector search (mxbai-embed-large), sparse BM25, and Reciprocal Rank Fusion (RRF) to achieve optimal accuracy across diverse query types.

  • Dense Search: Semantic embeddings mxbai-embed-large (1024-dim) with PostgreSQL PGVector and HNSW indexing (m=32, ef_construction=128)
  • Sparse Search: BM25 keyword matching for exact terminology retrieval
  • RRF Fusion: SQL-native algorithm combining results (60% dense + 40% sparse); a Python sketch of the scoring follows Figure 2
  • Reranking: Optional CrossEncoder (BAAI/bge-reranker-v2-m3) for 5-10% precision improvement
Hybrid Retrieval Pipeline

Figure 2 - Hybrid retrieval pipeline combining dense, sparse, and fusion techniques
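
To make the fusion step concrete, here is a minimal Python sketch of weighted Reciprocal Rank Fusion under the 60/40 split described above. The production fusion runs natively in SQL; the function name, the k=60 smoothing constant (the common RRF default), and the example chunk IDs are illustrative assumptions.

```python
# Minimal sketch of weighted Reciprocal Rank Fusion (RRF).
# Assumptions: k=60 (the common RRF default) and toy chunk IDs;
# the production system computes this fusion natively in SQL.
def weighted_rrf(dense_ranking, sparse_ranking,
                 w_dense=0.6, w_sparse=0.4, k=60):
    """Fuse two ranked lists of chunk IDs into a single ranking."""
    scores = {}
    for weight, ranking in ((w_dense, dense_ranking),
                            (w_sparse, sparse_ranking)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "c12" ranks first in both lists, so it stays first after fusion.
print(weighted_rrf(["c12", "c7", "c33"], ["c12", "c33", "c9"]))
```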

Advanced Features

  • Multi-Agent System: 5 specialized agents (Retrieval, Code, Math, Planning, Memory) with ReAct orchestration for complex tasks; a routing sketch follows this list
  • Meeting Transcription: Automated audio transcription and LLM-powered analysis system
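
As a rough illustration of the routing idea behind the multi-agent layer, here is a hypothetical keyword-based dispatcher. The real system uses ReAct orchestration and LLM-based query classification; the keyword rules and function name below are placeholder assumptions, not the actual implementation.

```python
# Hypothetical dispatcher illustrating agent routing (NOT the actual
# ReAct orchestration): classify a query, then hand it to one of the
# five agents named in the architecture.
def route(query: str) -> str:
    q = query.lower()
    if any(kw in q for kw in ("write a function", "code", "python")):
        return "code"
    if any(kw in q for kw in ("calculate", "compute", "formula")):
        return "math"
    if any(kw in q for kw in ("plan", "steps", "schedule")):
        return "planning"
    return "retrieval"  # default: answer from the ingested corpus

print(route("Write a Python function that computes severance pay"))  # -> code
```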

3. LLM Orchestration: Local vs Cloud

The system provides flexible LLM integration with real-time benchmarking capabilities.

  • Local Deployment: Ollama with llama3.1:8b, deepseek-r1:8b, mistral (GPU-optimized, 100% private)
  • Cloud Providers: Claude Sonnet 4.5, GPT-4o, Gemini 2.0, Mistral Large, Cohere Command R+
  • Benchmarking: Side-by-side comparison with latency, cost, and quality metrics (see the sketch after Figure 3)
Local vs Cloud LLM Comparison
Streamlit Cloud vs Local Interface

Figure 3 - Real-time comparison between local and cloud LLM responses
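
A minimal sketch of the benchmarking flow, assuming the `ollama` and `anthropic` Python packages and an ANTHROPIC_API_KEY in the environment; the prompt, helper names, and cloud model ID are illustrative.

```python
# Sketch of the side-by-side benchmark: send the same prompt to a local
# Ollama model and to Claude, timing each call. Model names follow the
# demo; the prompt and helper names are illustrative.
import time
import ollama
import anthropic

def ask_local(prompt: str, model: str = "llama3.1:8b") -> str:
    resp = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"]

def ask_cloud(prompt: str, model: str = "claude-sonnet-4-5") -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    resp = client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

for name, ask in (("local", ask_local), ("cloud", ask_cloud)):
    start = time.perf_counter()
    ask("Summarize the emergency release requirements for a towing winch.")
    print(f"{name}: {time.perf_counter() - start:.1f}s")
```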

4. Automated Document Ingestion Pipeline

A production-grade ETL pipeline processes documents through parsing, chunking, quality filtering, and GPU-accelerated embedding generation.

Document Ingestion Pipeline

Figure 4 - Automated document ingestion pipeline architecture

  • Format Support: PDF, DOCX, TXT, Markdown, code files, Jupyter notebooks
  • Chunking: 512-1024 chars with semantic boundary detection and metadata preservation (sketched after this list)
  • Quality Filtering: Language detection, deduplication, coherence scoring
  • Storage: PostgreSQL/PGVector with HNSW indexing and full-text search
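
A minimal sketch of the chunking rule, assuming a paragraph/sentence heuristic stands in for the semantic boundary detection; the real pipeline also attaches metadata and quality scores to each chunk.

```python
# Minimal sketch of the chunking rule: 512-1024 character chunks that
# prefer to break at a paragraph or sentence boundary. The boundary
# heuristic is an assumption standing in for the semantic detection.
def chunk_text(text: str, min_size: int = 512, max_size: int = 1024):
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_size, len(text))
        window = text[start:end]
        # Past min_size, cut at the last paragraph or sentence break.
        cut = max(window.rfind("\n\n"), window.rfind(". "))
        if cut > min_size:
            end = start + cut + 1
        chunks.append(text[start:end].strip())
        start = end
    return [c for c in chunks if c]
```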

5. Technical Implementation

PostgreSQL 15+ with PGVector extension, Docker containerization, and Streamlit interface.

  • Database: PGVector with HNSW indexing (m=32, ef_construction=128) and full-text search capabilities; see the index sketch after this list
  • Framework: LangChain with multi-provider support (Ollama, Claude, GPT-4, Gemini, Mistral)
  • ML/NLP: PyTorch, Sentence-Transformers, spaCy, CrossEncoder
  • Infrastructure: Docker deployment with Streamlit UI and Python 3.9+
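
As a sketch of the storage layer, the pgvector DDL below creates an HNSW index with the parameters cited above. The table name, column names, and connection string are illustrative assumptions; it requires PostgreSQL 15+ with the pgvector extension and the `psycopg` package.

```python
# Sketch of the PGVector schema and HNSW index (m=32, ef_construction=128).
# Assumptions: table/column names and the connection string are illustrative.
import psycopg

STATEMENTS = (
    "CREATE EXTENSION IF NOT EXISTS vector",
    """CREATE TABLE IF NOT EXISTS chunks (
           id bigserial PRIMARY KEY,
           content text NOT NULL,
           embedding vector(1024)  -- mxbai-embed-large dimensionality
       )""",
    """CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
           ON chunks USING hnsw (embedding vector_cosine_ops)
           WITH (m = 32, ef_construction = 128)""",
)

with psycopg.connect("dbname=rag user=postgres") as conn:
    for stmt in STATEMENTS:
        conn.execute(stmt)
```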

6. User Interface

Streamlit-based interface with chat interaction, source citations, side-by-side local/cloud comparison, and document upload capabilities.

Streamlit Interactive Interface

Figure 5 - Interactive Streamlit interface
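
A minimal sketch of the chat surface in Streamlit; `answer_query` is a hypothetical stand-in for the RAG backend (retrieval + generation), not the actual implementation.

```python
# Minimal sketch of the chat surface; `answer_query` is a hypothetical
# stand-in for the RAG backend (retrieval + generation).
import streamlit as st

def answer_query(question: str) -> str:
    return f"(stub) retrieved answer for: {question}"

st.title("RAG Assistant")
if prompt := st.chat_input("Ask the ingested documents..."):
    with st.chat_message("user"):
        st.write(prompt)
    with st.chat_message("assistant"):
        st.write(answer_query(prompt))
```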

7. Conclusion

This project demonstrates a production-ready RAG system combining 100% on-premise capability with multi-cloud flexibility. Hybrid retrieval (dense + sparse + RRF) delivers superior accuracy, while SQL-native fusion provides 10x performance gains over Python implementations.

Future enhancements include multi-agent integration, graph-based retrieval, multimodal support, and enterprise scalability for millions of chunks.

Project Information

Status: Production-ready, fully deployable

Contact: For repository access or technical inquiries, contact Martin LE CORRE

Documentation: 📄 View detailed README