Big Data & ETL · ML Pipeline · Production 24/7

Instagram Following Surveillance Pipeline

Automated ETL pipeline for large-scale data collection and analysis, with intelligent multi-pass scraping, ML gender prediction, quality tracking, and real-time dashboards built on Airflow, Spark, and the Elastic Stack.

1. Context and Objectives

This project implements a production-ready ETL pipeline for monitoring Instagram account followings with automated change detection, machine learning predictions, and comprehensive data quality tracking. The system runs 24/7 with complete autonomy once Docker Desktop is launched.

Key Objectives

  • Automated Surveillance: Scrape Instagram followings every ~4 hours (6 times/day) with anti-detection strategies
  • Change Detection: Identify new followings and unfollows through intelligent daily comparisons
  • ML Predictions: Automatic gender prediction with confidence scores using machine learning
  • Quality Tracking: Comprehensive quality score system tracking scraping completeness and accuracy
  • Real-Time Dashboards: Modern web dashboard and Kibana visualizations with advanced filters
  • Structured Data Lake: Three-layer architecture (RAW → FORMATTED → USAGE)

The main challenge was designing a robust system that balances data collection frequency, anti-detection mechanisms, and quality assurance while maintaining 24/7 autonomous operation.

ETL System Overview

Figure 1 - High-level overview of the Instagram surveillance pipeline

Demo of a pipeline run (visual mode enabled via an X11 server)

⚠️ WARNING: THE DATA COLLECTED DOES NOT COMPLY WITH THE GDPR; THIS PROJECT WAS CARRIED OUT IN AN ACADEMIC SETTING ONLY

2. System Architecture

The ETL pipeline follows a modular microservices architecture orchestrated by Apache Airflow with full Docker containerization.

Automated Execution Flow

The system executes on a precisely timed schedule (Europe/Paris timezone); a minimal DAG sketch follows the list:

  • 6 Daily Scrapings: 02:00, 06:00, 10:00, 14:00, 18:00, 23:00 (plus a random 0-45 min delay for anti-detection)
  • 3 Passes Per Scraping: multiple passes separated by random 60-120 s delays to simulate human behavior
  • Daily Aggregation: at 23:00, all 6 daily scrapings are merged and deduplicated
  • Daily Comparison: at 23:00, new followings and unfollows are detected (Day vs Day-1)
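A minimal sketch of how this schedule could be declared in Airflow 2.10 (the DAG id, task names, and jitter helper are illustrative, not the project's actual code):

```python
import random
import time

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

def random_startup_delay() -> None:
    """Sleep 0-45 min so runs never start at a predictable minute (anti-detection)."""
    time.sleep(random.randint(0, 45 * 60))

with DAG(
    dag_id="instagram_following_surveillance",  # illustrative name
    schedule="0 2,6,10,14,18,23 * * *",         # the 6 daily slots listed above
    start_date=pendulum.datetime(2024, 1, 1, tz="Europe/Paris"),
    catchup=False,
) as dag:
    jitter = PythonOperator(task_id="random_delay", python_callable=random_startup_delay)
    scrape = PythonOperator(task_id="run_scraping", python_callable=lambda: None)  # placeholder
    jitter >> scrape
```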

Microservices Stack

Orchestration

  • Apache Airflow 2.10.3
  • LocalExecutor
  • DAG Scheduler 24/7

Data Collection

  • Selenium 4.36
  • Chrome Headless
  • Cookie Authentication

Processing

  • PySpark 4.0.1
  • Pandas
  • Scikit-learn 1.6.0

Storage & Visualization

  • PostgreSQL 14
  • Elasticsearch 8.11
  • Kibana 8.11
  • Flask Dashboard
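To illustrate the loading side of this stack, bulk-indexing one day's aggregated documents into Elasticsearch 8 with the official Python client might look like this (host, index name, and document fields are assumptions):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # service URL inside Docker Compose may differ

def index_daily_followings(docs: list[dict], day: str) -> None:
    """Bulk-index one day's aggregated followings into a dated index."""
    helpers.bulk(
        es,
        ({"_index": f"followings-{day}", "_source": d} for d in docs),
    )

index_daily_followings(
    [{"username": "example_account", "gender": "F", "confidence": 0.87}],
    day="2025-01-15",
)
```

Dated indices keep each day's snapshot separate, which maps naturally onto Kibana time filters.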

Data Lake Architecture

data/
├── raw/         # Raw JSON from scraping
│   └── YYYY-MM-DD_HH-MM-SS/
├── formatted/   # Cleaned data with ML predictions
│   └── YYYY-MM-DD_aggregated/
└── usage/       # Daily aggregations & comparisons
    └── YYYY-MM-DD_comparatif/
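As an illustration, the per-run directories could be created along these lines (a hedged sketch; the repo's actual helpers may differ):

```python
from datetime import datetime
from pathlib import Path

DATA_ROOT = Path("data")  # assumed data-lake root

def new_raw_dir(now: datetime | None = None) -> Path:
    """Create the timestamped RAW directory for a single scraping run."""
    now = now or datetime.now()
    run_dir = DATA_ROOT / "raw" / now.strftime("%Y-%m-%d_%H-%M-%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir

def usage_dir(day: datetime) -> Path:
    """USAGE layer: one comparison folder per day."""
    return DATA_ROOT / "usage" / f"{day:%Y-%m-%d}_comparatif"
```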

Prerequisites and Installation Guide

Prerequisites

  • Docker Desktop - For containerizing all services
  • VS Code - Recommended code editor
  • Chrome extension "Cookies.txt" - To extract Instagram authentication cookies
  • Personal Instagram account - Required for authentication and scraping

Installation Steps

1. git clone https://github.com/martin-lcr/Datalake_Instagram_Following_Surveillance.git
2. make install
3. Add your own Instagram cookies using the free Chrome extension "Get cookies.txt"
4. Press Enter to continue
5. Check in Docker Desktop that the containers are running and healthy
6. Wait for the final checks while the Dashboard, Elasticsearch, and Kibana become accessible
7. Optional: make help → see all available commands
8. Optional: make status

3. ETL Pipeline Stages

Three-stage pipeline combining Selenium scraping, PySpark processing, and multi-target storage.

  • Extraction: Selenium-based scraping with cookie authentication, 3 passes, and rate limiting (60-120 s delays)
  • Transformation: PySpark processing with ML gender prediction (sketched below), quality scoring, and daily aggregation
  • Loading: PostgreSQL, Elasticsearch, and the JSON data lake (RAW/FORMATTED/USAGE layers)
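The section does not spell out the gender model, so here is a minimal scikit-learn sketch, assuming a character n-gram classifier trained on labeled first names (the training data, features, and names are illustrative, not the project's actual model):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set; the real model would be fit on a large labeled name corpus.
names = ["marie", "julie", "claire", "martin", "pierre", "lucas"]
labels = ["F", "F", "F", "M", "M", "M"]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 3)),  # character bigrams/trigrams
    LogisticRegression(max_iter=1000),
)
model.fit(names, labels)

# Predict a gender plus the confidence score stored alongside it.
proba = model.predict_proba(["camille"])[0]
print(model.classes_[proba.argmax()], round(float(proba.max()), 2))
```

Character n-grams keep the model robust to unseen names, which matters when scraped display names only loosely resemble real first names.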
1. make open
2. Dashboard, Kibana & Airflow launched
3. Change the Instagram accounts to scrape (targets) via the dashboard or the .txt file
4. Delete 8 followings from my personal account for the demo
5. Add 7 new accounts as new followings to my personal account for the demo

Modifying the list of monitored Instagram accounts

1. DAG launch (visual mode enabled)
2. DAG Task 1 "generate_scripts": read the target accounts
3. DAG Task 2 "run_single_scripts" [3 for the demo]: 3 scraping passes executed for every target, 6× a day with random delays to avoid detection
4. DAG Tasks 3 & 4, "aggregate_results" & "index_to_elasticsearch": automatically aggregate all of the day's scrapings at 23:00, enabling day-over-day comparison and passive full surveillance without getting caught
5. Format: JSON (RAW) → Parquet (FORMATTED) → PostgreSQL, stamped "YYYYMMDDHHMM" (sketched below)
6. Results from the day's multiple scrapings and comparison with previous detections; gender detection from the ML model
7. Demo validation

Pipeline launch and manual DAG trigger for the demo
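A minimal PySpark sketch of the JSON (RAW) → Parquet (FORMATTED) step from point 5 above (paths, column names, and the batch-id format are assumptions; the PostgreSQL load via JDBC is omitted):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("format_followings").getOrCreate()

# Read one run's raw JSON scrape results (path is illustrative).
raw = spark.read.json("data/raw/2025-01-15_14-00-00/*.json")

# Basic cleaning: deduplicate by username and stamp the batch id.
formatted = (
    raw.dropDuplicates(["username"])
       .withColumn("batch_id", F.date_format(F.current_timestamp(), "yyyyMMddHHmm"))
)

# Write the FORMATTED layer as Parquet, one folder per day.
formatted.write.mode("overwrite").parquet("data/formatted/2025-01-15_aggregated")
```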

4. Quality Tracking System

Automated quality assurance with completeness scoring and intelligent comparison to eliminate false positives (see the scoring sketch below).

  • Completeness Score: (scraped / total_instagram × 100) with HIGH/MEDIUM/LOW confidence levels
  • Threshold Filtering: Configurable minimum (default 80%) to ignore incomplete scrapings
  • False Positive Prevention: Only detects genuine changes, not artifacts from partial scrapings

Airflow logs for detailed tracking of each run
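A hedged sketch of the scoring and comparison logic described above (the 80% minimum comes from this section; the HIGH/MEDIUM boundary at 95% is an assumption):

```python
MIN_COMPLETENESS = 80.0  # configurable threshold; default stated in this section

def completeness(scraped: int, total_instagram: int) -> tuple[float, str]:
    """Completeness score (scraped / total_instagram * 100) with a confidence bucket."""
    score = scraped / total_instagram * 100 if total_instagram else 0.0
    if score >= 95:  # assumed HIGH boundary
        level = "HIGH"
    elif score >= MIN_COMPLETENESS:
        level = "MEDIUM"
    else:
        level = "LOW"
    return score, level

def diff_followings(today: set[str], yesterday: set[str]) -> tuple[set[str], set[str]]:
    """New followings and unfollows between two complete daily snapshots.

    Runs scoring below MIN_COMPLETENESS are skipped upstream, so a partial
    scrape never shows up as a wave of fake unfollows.
    """
    return today - yesterday, yesterday - today

print(completeness(scraped=940, total_instagram=1000))  # (94.0, 'MEDIUM')
```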

5. Anti-Detection Strategy

Comprehensive mechanisms to avoid Instagram rate limiting through timing strategies and behavioral simulation (see the sketch after this list).

  • Timing: 6×/day with irregular intervals (3-5h) and random delays (0-45 min)
  • Behavior: 3 passes per scraping, 60-120s delays, persistent cookies, headless Chrome
  • Risk Mitigation: Account limits, cookie rotation (1-3 months), graceful error handling
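A minimal sketch of the behavioral layer, assuming Selenium 4 with headless Chrome and session cookies re-injected from a prior export (the file name and JSON format are assumptions):

```python
import json
import random
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def human_pause() -> None:
    """Wait a random 60-120 s between passes to mimic human browsing."""
    time.sleep(random.uniform(60, 120))

options = Options()
options.add_argument("--headless=new")  # headless Chrome, as listed in the stack
driver = webdriver.Chrome(options=options)

# Cookies can only be set for the current domain, so open Instagram first.
driver.get("https://www.instagram.com/")
with open("instagram_cookies.json") as f:  # assumed export of the cookies.txt file
    for c in json.load(f):
        driver.add_cookie({"name": c["name"], "value": c["value"],
                           "domain": c.get("domain", ".instagram.com")})

for _ in range(3):    # 3 passes per scraping run
    driver.refresh()  # placeholder for the real followings scrape
    human_pause()

driver.quit()
```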

6. Docker Architecture

Figure 8 - Docker containerization architecture with all services

7. Technical Implementation

Docker containerization with Makefile automation and optional Oracle Cloud deployment.

  • Docker: Full containerization with custom Airflow image, service isolation, persistent volumes
  • Makefile: One-command install (5-7 min), start/stop, dashboard auto-open, cookie validation
  • Cloud: Oracle Free Tier option (4 OCPU, 24 GB RAM, 200 GB storage)

8. Conclusion

This project demonstrates a production-ready ETL pipeline achieving 24/7 autonomous operation with robust quality tracking and effective anti-detection mechanisms. Multi-pass scraping significantly improves data completeness while Docker containerization enables seamless cloud deployment.

Future enhancements include advanced ML for profile analysis, real-time alerts, API development, and multi-platform extension.

Technologies & Resources

Project Information

Status: Production-ready, running 24/7

Deployment: Docker containerization with Oracle Cloud support

Contact: For technical inquiries, contact Martin LE CORRE

Documentation: 📄 View detailed README

Legal Notice

This project is provided for educational and research purposes only. Use responsibly and in compliance with Instagram's Terms of Service and data protection laws (GDPR).