Big Data & Analytics · NLP & Clustering · Academic Team Project

Streamlit Food.com: Large-Scale Recipe Analytics Platform

A Big Data platform for ingesting, preprocessing, and analyzing Food.com recipes, with an interactive Streamlit interface, BERTopic thematic clustering, a CI/CD pipeline, and dynamic dashboards for real-time insights.

1. Project Overview

This academic team project built a platform for ingesting, preprocessing, and analyzing large-scale Food.com recipe data. The goal was to extract key indicators of how the site evolved while providing reusable tools for future data science projects. The project covers the full Big Data pipeline, from raw data ingestion to interactive visualization.

  • Objective: Extract actionable insights from Food.com recipe corpus for trend analysis and user behavior understanding
  • Scale: Large-scale data processing with robust pipelines for handling millions of recipes and reviews
  • Collaboration: Agile team workflow with iterative development and multidisciplinary expertise integration

2. Interactive Streamlit Application

User-friendly web interface for data preparation and real-time analysis presentation.

  • Data Management: Interactive controls for data ingestion, cleaning, and preprocessing workflows
  • Real-time Results: Live visualization of analysis outputs with dynamic filtering and exploration
  • User Experience: Intuitive interface design for both technical and non-technical stakeholders
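
As an illustration of how such a page might be wired, here is a minimal sketch. It assumes a small in-memory sample rather than the project's actual data; the `filter_recipes` helper, the field names, and the sample rows are hypothetical. Only the Streamlit calls themselves (`st.title`, `st.slider`, `st.text_input`, `st.metric`, `st.dataframe`) are real API.

```python
"""Hypothetical Streamlit page: filter recipes by rating and keyword."""

# Illustrative rows only; the real application would load the Food.com data.
SAMPLE_RECIPES = [
    {"name": "Chicken Curry", "minutes": 45, "avg_rating": 4.6},
    {"name": "Chocolate Cake", "minutes": 70, "avg_rating": 4.8},
    {"name": "Quick Salad", "minutes": 10, "avg_rating": 3.9},
]


def filter_recipes(recipes, min_rating=0.0, keyword=""):
    """Keep recipes at or above min_rating whose name contains keyword."""
    keyword = keyword.strip().lower()
    return [
        r for r in recipes
        if r["avg_rating"] >= min_rating and keyword in r["name"].lower()
    ]


def main():
    # Imported here so the filtering logic stays testable without Streamlit.
    import streamlit as st

    st.title("Food.com recipe explorer")
    min_rating = st.slider(
        "Minimum average rating", min_value=0.0, max_value=5.0, value=4.0, step=0.1
    )
    keyword = st.text_input("Name contains")

    selected = filter_recipes(SAMPLE_RECIPES, min_rating, keyword)
    st.metric("Matching recipes", len(selected))
    st.dataframe(selected)
```

Saved as `app.py` with a final `main()` call, this would be launched with `streamlit run app.py`. Keeping the filter in a plain function means it can be unit-tested separately from the UI.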

3. BERTopic Thematic Clustering

Advanced NLP pipeline leveraging transformer embeddings for recipe topic discovery.

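  • BERTopic Framework: Topic modeling that combines transformer sentence embeddings with dimensionality reduction and density-based clustering (UMAP and HDBSCAN by default), then describes each cluster with class-based TF-IDF keywords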
  • Trend Extraction: Automatic identification of dominant recipe themes and evolving culinary trends
  • Data Organization: Hierarchical topic structure for intuitive navigation through recipe categories
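
BERTopic labels each cluster with keywords ranked by class-based TF-IDF (c-TF-IDF), which its documentation gives as the term frequency within a cluster multiplied by log(1 + A / f_t), where A is the average number of words per cluster and f_t is the term's frequency across all clusters. The sketch below reproduces only that scoring step in plain Python, under stated assumptions: cluster labels are supplied by hand (standing in for the embedding and clustering stages), tokenization is a whitespace split, and the function name and sample documents are hypothetical. It is not the library's implementation.

```python
import math
from collections import Counter, defaultdict


def top_terms_per_cluster(docs, labels, top_n=3):
    """Rank the terms of each cluster by a c-TF-IDF style score."""
    # Merge all documents of a cluster into one bag of words.
    cluster_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        cluster_counts[label].update(doc.lower().split())

    # f_t: frequency of each term across all clusters.
    overall = Counter()
    for counts in cluster_counts.values():
        overall.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(overall.values()) / len(cluster_counts)

    result = {}
    for label, counts in cluster_counts.items():
        n_words = sum(counts.values())
        # Term frequency is normalized by cluster size before weighting.
        scores = {
            term: (count / n_words) * math.log(1 + avg_words / overall[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        result[label] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return result


# Toy documents and hand-assigned clusters, for illustration only.
docs = [
    "easy chicken curry",
    "easy spicy chicken curry",
    "easy chocolate cake",
    "easy vanilla cake frosting",
]
labels = [0, 0, 1, 1]
top_terms = top_terms_per_cluster(docs, labels, top_n=2)
```

Because "easy" appears in both clusters, its weight is pushed down, so cluster-specific words such as "chicken" or "cake" rank above it.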

4. Continuous Integration & Deployment

Automated workflow for seamless model and data updates.

  • Automated Testing: Unit and integration tests ensuring pipeline reliability across updates
  • Continuous Deployment: Streamlined deployment process for rapid iteration and feature releases
  • Version Control: Git-based workflow with code review and automated quality checks
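
The kind of unit test such a pipeline might run on every commit can be sketched as follows. The `clean_recipe` function and its rules are hypothetical, chosen only to give the tests something to check. The `test_*` naming follows the pytest discovery convention, which is an assumption since the test runner used by the project is not stated.

```python
def clean_recipe(raw):
    """Normalize one raw recipe record; return None if it is unusable.

    Hypothetical cleaning rules, for illustration only.
    """
    # Collapse repeated whitespace and treat a missing name as empty.
    name = " ".join(str(raw.get("name") or "").split())
    if not name:
        return None
    try:
        minutes = int(raw.get("minutes"))
    except (TypeError, ValueError):
        return None
    if minutes < 0:
        return None
    return {"name": name.lower(), "minutes": minutes}


def test_clean_recipe_normalizes_whitespace_and_case():
    assert clean_recipe({"name": "  Chicken   CURRY ", "minutes": "45"}) == {
        "name": "chicken curry",
        "minutes": 45,
    }


def test_clean_recipe_rejects_missing_or_invalid_fields():
    assert clean_recipe({"name": "", "minutes": 10}) is None
    assert clean_recipe({"name": "Soup", "minutes": None}) is None
    assert clean_recipe({"name": "Soup", "minutes": "soon"}) is None
    assert clean_recipe({"name": "Soup", "minutes": -5}) is None
```

A CI job would then fail the build whenever a change to the cleaning logic breaks one of these expectations.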

5. Dynamic Dashboards & Visualization

Comprehensive visualization suite for real-time KPI monitoring and trend analysis.

  • KPI Dashboards: Real-time tracking of key metrics including recipe popularity, user engagement, and trend evolution
  • Interactive Charts: Dynamic graphs with drill-down capabilities for detailed exploration
  • Temporal Analysis: Time series visualizations revealing seasonal patterns and long-term trends
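
The aggregation behind a temporal dashboard can be sketched in a few lines. This is an illustrative example, not the project's code: the function names are hypothetical, and the sample reviews are invented rather than taken from the Food.com corpus.

```python
from collections import defaultdict
from datetime import date


def monthly_kpis(reviews):
    """Aggregate (review_date, rating) pairs into per-month KPIs."""
    buckets = defaultdict(list)
    for review_date, rating in reviews:
        buckets[(review_date.year, review_date.month)].append(rating)
    # Sorted keys give a chronological series ready for plotting.
    return {
        month: {"reviews": len(ratings), "avg_rating": sum(ratings) / len(ratings)}
        for month, ratings in sorted(buckets.items())
    }


def seasonal_profile(reviews):
    """Count reviews per calendar month (1-12), pooled across years."""
    profile = dict.fromkeys(range(1, 13), 0)
    for review_date, _ in reviews:
        profile[review_date.month] += 1
    return profile


# Illustrative data only.
reviews = [
    (date(2007, 12, 24), 5),
    (date(2007, 12, 26), 4),
    (date(2008, 1, 3), 3),
    (date(2008, 12, 20), 5),
]
kpis = monthly_kpis(reviews)
profile = seasonal_profile(reviews)
```

The per-month series supports long-term trend charts, while the pooled calendar-month profile is one simple way to surface seasonal peaks.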

6. Performance Optimization & Big Data Scalability

Technical adaptations for handling large data volumes with acceptable performance.

  • Computation Acceleration: Vectorized operations and parallel processing for compute-intensive tasks
  • Memory Management: Chunked processing and streaming for datasets exceeding RAM capacity
  • Scalable Architecture: Modular design enabling horizontal scaling for production deployment
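
Chunked processing can be illustrated with a streaming aggregate that never holds more than one chunk in memory. The sketch below uses only the standard library; the `rating` column name, the chunk size, and both function names are assumptions made for the example.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable."""
    rows = iter(rows)
    while chunk := list(islice(rows, chunk_size)):
        yield chunk


def mean_rating(csv_file, chunk_size=10_000):
    """Compute the mean of the 'rating' column one chunk at a time."""
    total, count = 0.0, 0
    for chunk in iter_chunks(csv.DictReader(csv_file), chunk_size):
        total += sum(float(row["rating"]) for row in chunk)
        count += len(chunk)
    return total / count if count else None


# In-memory stand-in for a large file; a real run would pass open(path, newline="").
sample = io.StringIO("recipe_id,rating\n1,5\n2,4\n3,3\n")
result = mean_rating(sample, chunk_size=2)
```

Only running totals survive between chunks, so memory use depends on `chunk_size` rather than on the size of the file.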

7. Agile Team Collaboration

Iterative, multidisciplinary approach for efficient cross-functional coordination.

  • Sprint Methodology: 2-week sprints with daily stand-ups and retrospectives for continuous improvement
  • Role Distribution: Clear ownership across data engineering, ML modeling, and frontend development
  • Knowledge Sharing: Regular tech talks and documentation for team-wide skill development

8. Results & Learnings

The project delivered a production-ready Big Data analytics platform that combines robust data engineering, advanced NLP clustering, and intuitive visualization. The modular architecture makes components reusable in future projects, while CI/CD integration enables rapid iteration. Key learnings include the importance of scalable design patterns, the power of transformer-based topic modeling for unstructured text, and the value of interactive dashboards for stakeholder communication.

Future enhancements include recommendation system integration, sentiment analysis on user reviews, real-time streaming analytics, and deep learning models for recipe generation and personalization.

Technologies & Resources

Project Information

Type: Academic team project (Big Data & Analytics)

Contact: For technical inquiries, contact Martin LE CORRE