
MLB Data Warehouse Data Pipeline

Python, Postgres, Airflow, R, Docker

Project Overview

This project demonstrates how I designed and deployed a modern, end-to-end data pipeline that ingests, transforms, and delivers MLB statistics for real-world applications. The pipeline is fully containerized with Docker, orchestrated using Apache Airflow, and runs on a Raspberry Pi — proving that scalable data engineering practices can be applied even in lightweight, resource-constrained environments.

1. Data Ingestion – baseballr + MLB API

The pipeline begins by collecting raw baseball data from two primary sources:

- baseballr: an R package that provides access to MLB statistics through scraping and API queries.
- MLB API: the official data feed that includes game results, player stats, and other performance metrics.

These sources provide a rich dataset covering everything from individual player performance to team-level results. At this stage, the data is still raw: unstructured, inconsistently formatted, and not yet suitable for analytics.
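To give a feel for the ingestion step, here is a minimal sketch of flattening an MLB API-style schedule payload into flat per-game records. The payload shape and field names below are simplified assumptions for illustration; the real API response is much larger.

```python
import json

# Simplified, hypothetical MLB API schedule payload (field names are
# assumptions for illustration; the real feed is richer).
SAMPLE_PAYLOAD = json.dumps({
    "dates": [
        {"date": "2024-06-01",
         "games": [
             {"gamePk": 745001,
              "teams": {"home": {"team": {"name": "Yankees"}, "score": 5},
                        "away": {"team": {"name": "Red Sox"}, "score": 3}}}
         ]}
    ]
})

def flatten_schedule(payload: str) -> list[dict]:
    """Turn a nested schedule payload into flat per-game rows for staging."""
    rows = []
    for day in json.loads(payload)["dates"]:
        for game in day["games"]:
            rows.append({
                "game_pk": game["gamePk"],
                "game_date": day["date"],
                "home_team": game["teams"]["home"]["team"]["name"],
                "home_score": game["teams"]["home"]["score"],
                "away_team": game["teams"]["away"]["team"]["name"],
                "away_score": game["teams"]["away"]["score"],
            })
    return rows

rows = flatten_schedule(SAMPLE_PAYLOAD)
print(rows[0]["home_team"])  # Yankees
```

Each flat row maps directly onto a column layout in the raw staging layer described next.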

2. Raw Data Layer – PostgreSQL Staging

Once extracted, the raw data is ingested into a PostgreSQL database. Data lands in a dedicated raw schema, which serves as a staging area. This staging layer ensures that raw feeds are preserved before any transformations are applied. It provides traceability and makes the pipeline more resilient by separating unprocessed data from production-ready models. At this point, data quality issues such as duplicates, missing values, or inconsistent formats are still present.

3. Data Transformation – dbt + Kimball Modeling

The transformation layer is powered by dbt (data build tool), which enables modular, version-controlled SQL models. Using dbt, I applied Kimball-style dimensional modeling to structure the raw data into:

- Fact tables: capturing measurable events like player performance metrics, game outcomes, and team stats.
- Dimension tables: providing descriptive attributes such as player details, teams, and game dates.

This modeling approach turns raw, messy data into a well-structured, analytics-friendly data warehouse within PostgreSQL. The result is a semantic layer that is easy to query, consistent across use cases, and optimized for analytical workloads.
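The fact/dimension split can be illustrated with a toy example. In the real pipeline this happens in dbt SQL models; the Python below only demonstrates the shape of the result, and all table and column names are illustrative.

```python
# Toy Kimball split: raw game rows become a team dimension with surrogate
# keys, plus a fact table that references the dimension by key.
raw_games = [
    {"game_pk": 745001, "game_date": "2024-06-01",
     "home_team": "Yankees", "home_score": 5,
     "away_team": "Red Sox", "away_score": 3},
]

# Dimension: one row per distinct team, each given a surrogate key.
team_names = sorted({g["home_team"] for g in raw_games} |
                    {g["away_team"] for g in raw_games})
dim_team = {name: key for key, name in enumerate(team_names, start=1)}

# Fact: the measurable event (a game and its scores), with foreign keys
# into the team dimension instead of repeated team names.
fct_game = [
    {"game_pk": g["game_pk"], "game_date": g["game_date"],
     "home_team_key": dim_team[g["home_team"]],
     "away_team_key": dim_team[g["away_team"]],
     "home_score": g["home_score"], "away_score": g["away_score"]}
    for g in raw_games
]
print(fct_game[0]["home_team_key"])  # 2 (the surrogate key for "Yankees")
```

Descriptive attributes live once in the dimension, while the fact table stays narrow and numeric, which is what makes the warehouse cheap to scan and consistent to join.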

4. Orchestration & Infrastructure – Airflow + Docker + Raspberry Pi

The pipeline is orchestrated end-to-end with Apache Airflow, which manages scheduling, dependencies, and error handling. Each step, from ingestion to transformation to downstream consumption, is containerized with Docker, ensuring portability and consistency. What makes this project unique is that the entire pipeline runs on a Raspberry Pi. This lightweight deployment highlights the ability to design cost-efficient, scalable systems even in constrained environments, demonstrating both technical adaptability and resource optimization.
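The dependency-driven scheduling that Airflow provides boils down to executing tasks in topological order. A minimal sketch of that core idea, using only the standard library (task names are illustrative; in the actual pipeline each step runs as a containerized Airflow task):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on. Airflow resolves the
# same kind of graph before running anything; names here are illustrative.
dag = {
    "ingest_baseballr":    set(),
    "ingest_mlb_api":      set(),
    "load_raw_schema":     {"ingest_baseballr", "ingest_mlb_api"},
    "run_dbt_models":      {"load_raw_schema"},
    "post_twitter_update": {"run_dbt_models"},
}

# A valid execution order: every task appears after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Both ingestion tasks always precede the raw load, and dbt always runs before any downstream consumer, exactly the guarantee the Airflow DAG enforces.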

5. Downstream Consumption – Real-World Applications

Once the data is transformed and modeled, it becomes available for downstream applications:

- NLP-to-SQL Chatbot: a natural language processing interface allows users to ask questions like "Which pitcher has the most strikeouts this season?" The chatbot converts these queries into SQL, executes them against the warehouse, and returns results in real time.
- Automated Twitter Updates: a separate script consumes warehouse insights and posts automated content to Twitter, such as player leaderboards, fun facts, or daily highlights. This creates a live, automated reporting system powered entirely by the pipeline.
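The NLP-to-SQL idea can be sketched with a deliberately tiny keyword-based translator; the actual chatbot is presumably far more capable, and the table layout and stat values below are made up for illustration.

```python
import sqlite3

# Toy NLP-to-SQL sketch: match a recognized question pattern, generate SQL,
# run it against the warehouse (SQLite stand-in with illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pitcher_stats (player TEXT, strikeouts INTEGER)")
conn.executemany("INSERT INTO pitcher_stats VALUES (?, ?)",
                 [("Cole", 220), ("Snell", 234), ("Gausman", 237)])

def answer(question: str) -> str:
    """Translate a recognized question into SQL and return the result."""
    if "most strikeouts" in question.lower():
        sql = ("SELECT player FROM pitcher_stats "
               "ORDER BY strikeouts DESC LIMIT 1")
        return conn.execute(sql).fetchone()[0]
    raise ValueError("question not recognized by this toy translator")

print(answer("Which pitcher has the most strikeouts this season?"))  # Gausman
```

The same query path serves the Twitter script: instead of returning the answer to a user, it formats the result into a post.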

Key Takeaways & Skills Demonstrated

- End-to-end design: built a pipeline covering ingestion, transformation, orchestration, and consumption.
- Modern practices: applied dbt for modular transformations, Airflow for orchestration, and Kimball modeling for warehouse design.
- Lightweight scalability: deployed on a Raspberry Pi using Docker to highlight cost-efficiency without sacrificing functionality.
- Real-world applications: enabled practical use cases like automated reporting and natural language querying.

Lastly, here is a diagram of the data warehouse created by this pipeline. I hope you enjoy it!


Technologies

Python

Postgres

Airflow

R

Docker

SQL