
MLB Data Warehouse Data Pipeline

Python, Postgres, Airflow, R, Docker

Project Overview

This project demonstrates how I designed and deployed a modern, end-to-end data pipeline that ingests, transforms, and delivers MLB statistics for real-world applications. The pipeline is fully containerized with Docker, orchestrated using Apache Airflow, and runs on a Raspberry Pi — proving that scalable data engineering practices can be applied even in lightweight, resource-constrained environments.

1. Data Ingestion – baseballr + MLB API

The pipeline begins by collecting raw baseball data from two primary sources:

- baseballr: an R package that provides access to MLB statistics through scraping and API queries.
- MLB API: the official data feed that includes game results, player stats, and other performance metrics.

These sources provide a rich dataset covering everything from individual player performance to team-level results. At this stage, the data is still raw: unstructured, inconsistently formatted, and not yet suitable for analytics.
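To give a feel for the ingestion step, here is a minimal sketch of flattening an MLB API-style schedule payload into flat per-game records. The payload shape and field names below are simplified assumptions for illustration; the real API response is much larger.

```python
import json

# Simplified, hypothetical MLB API schedule payload (field names are
# assumptions for illustration; the real feed is richer).
SAMPLE_PAYLOAD = json.dumps({
    "dates": [
        {"date": "2024-06-01",
         "games": [
             {"gamePk": 745001,
              "teams": {"home": {"team": {"name": "Yankees"}, "score": 5},
                        "away": {"team": {"name": "Red Sox"}, "score": 3}}}
         ]}
    ]
})

def flatten_schedule(payload: str) -> list[dict]:
    """Turn a nested schedule payload into flat per-game rows for staging."""
    rows = []
    for day in json.loads(payload)["dates"]:
        for game in day["games"]:
            rows.append({
                "game_pk": game["gamePk"],
                "game_date": day["date"],
                "home_team": game["teams"]["home"]["team"]["name"],
                "home_score": game["teams"]["home"]["score"],
                "away_team": game["teams"]["away"]["team"]["name"],
                "away_score": game["teams"]["away"]["score"],
            })
    return rows

rows = flatten_schedule(SAMPLE_PAYLOAD)
print(rows[0]["home_team"])  # Yankees
```

Each flat row maps directly onto a column layout in the raw staging layer described next.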

2. Raw Data Layer – PostgreSQL Staging

Once extracted, the raw data is ingested into a PostgreSQL database. Data lands in a dedicated raw schema, which serves as a staging area. This staging layer ensures that raw feeds are preserved before any transformations are applied. It provides traceability and makes the pipeline more resilient by separating unprocessed data from production-ready models. At this point, data quality issues such as duplicates, missing values, or inconsistent formats are still present.

3. Data Transformation – dbt + Kimball Modeling

The transformation layer is powered by dbt (data build tool), which enables modular, version-controlled SQL models. Using dbt, I applied Kimball-style dimensional modeling to structure the raw data into:

- Fact tables: capturing measurable events like player performance metrics, game outcomes, and team stats.
- Dimension tables: providing descriptive attributes such as player details, teams, and game dates.

This modeling approach turns raw, messy data into a well-structured, analytics-friendly data warehouse within PostgreSQL. The result is a semantic layer that is easy to query, consistent across use cases, and optimized for analytical workloads.
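The fact/dimension split can be illustrated with a toy example. In the real pipeline this happens in dbt SQL models; the Python below only demonstrates the shape of the result, and all table and column names are illustrative.

```python
# Toy Kimball split: raw game rows become a team dimension with surrogate
# keys, plus a fact table that references the dimension by key.
raw_games = [
    {"game_pk": 745001, "game_date": "2024-06-01",
     "home_team": "Yankees", "home_score": 5,
     "away_team": "Red Sox", "away_score": 3},
]

# Dimension: one row per distinct team, each given a surrogate key.
team_names = sorted({g["home_team"] for g in raw_games} |
                    {g["away_team"] for g in raw_games})
dim_team = {name: key for key, name in enumerate(team_names, start=1)}

# Fact: the measurable event (a game and its scores), with foreign keys
# into the team dimension instead of repeated team names.
fct_game = [
    {"game_pk": g["game_pk"], "game_date": g["game_date"],
     "home_team_key": dim_team[g["home_team"]],
     "away_team_key": dim_team[g["away_team"]],
     "home_score": g["home_score"], "away_score": g["away_score"]}
    for g in raw_games
]
print(fct_game[0]["home_team_key"])  # 2 (the surrogate key for "Yankees")
```

Descriptive attributes live once in the dimension, while the fact table stays narrow and numeric, which is what makes the warehouse cheap to scan and consistent to join.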

4. Orchestration & Infrastructure – Airflow + Docker + Raspberry Pi

The pipeline is orchestrated end-to-end with Apache Airflow, which manages scheduling, dependencies, and error handling. Each step, from ingestion to transformation to downstream consumption, is containerized with Docker, ensuring portability and consistency. What makes this project unique is that the entire pipeline runs on a Raspberry Pi. This lightweight deployment highlights the ability to design cost-efficient, scalable systems even in constrained environments, demonstrating both technical adaptability and resource optimization.
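The dependency-driven scheduling that Airflow provides boils down to executing tasks in topological order. A minimal sketch of that core idea, using only the standard library (task names are illustrative; in the actual pipeline each step runs as a containerized Airflow task):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on. Airflow resolves the
# same kind of graph before running anything; names here are illustrative.
dag = {
    "ingest_baseballr":    set(),
    "ingest_mlb_api":      set(),
    "load_raw_schema":     {"ingest_baseballr", "ingest_mlb_api"},
    "run_dbt_models":      {"load_raw_schema"},
    "post_twitter_update": {"run_dbt_models"},
}

# A valid execution order: every task appears after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Both ingestion tasks always precede the raw load, and dbt always runs before any downstream consumer, exactly the guarantee the Airflow DAG enforces.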

5. Downstream Consumption – Real-World Applications

Once the data is transformed and modeled, it becomes available for downstream applications:

- NLP-to-SQL Chatbot: a natural language processing interface allows users to ask questions like "Which pitcher has the most strikeouts this season?" The chatbot converts these queries into SQL, executes them against the warehouse, and returns results in real time.
- Automated Twitter Updates: a separate script consumes warehouse insights and posts automated content to Twitter, such as player leaderboards, fun facts, or daily highlights. This creates a live, automated reporting system powered entirely by the pipeline.
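The NLP-to-SQL idea can be sketched with a deliberately tiny keyword-based translator; the actual chatbot is presumably far more capable, and the table layout and stat values below are made up for illustration.

```python
import sqlite3

# Toy NLP-to-SQL sketch: match a recognized question pattern, generate SQL,
# run it against the warehouse (SQLite stand-in with illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pitcher_stats (player TEXT, strikeouts INTEGER)")
conn.executemany("INSERT INTO pitcher_stats VALUES (?, ?)",
                 [("Cole", 220), ("Snell", 234), ("Gausman", 237)])

def answer(question: str) -> str:
    """Translate a recognized question into SQL and return the result."""
    if "most strikeouts" in question.lower():
        sql = ("SELECT player FROM pitcher_stats "
               "ORDER BY strikeouts DESC LIMIT 1")
        return conn.execute(sql).fetchone()[0]
    raise ValueError("question not recognized by this toy translator")

print(answer("Which pitcher has the most strikeouts this season?"))  # Gausman
```

The same query path serves the Twitter script: instead of returning the answer to a user, it formats the result into a post.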

Key Takeaways & Skills Demonstrated

- End-to-end design: built a pipeline covering ingestion, transformation, orchestration, and consumption.
- Modern practices: applied dbt for modular transformations, Airflow for orchestration, and Kimball modeling for warehouse design.
- Lightweight scalability: deployed on a Raspberry Pi using Docker to highlight cost-efficiency without sacrificing functionality.
- Real-world applications: enabled practical use cases like automated reporting and natural language querying.

Lastly, here is a diagram of the data warehouse created by this pipeline. I hope you enjoy it!


Technologies

Python

Postgres

Airflow

R

Docker

SQL