of my data engineering journey series. In part one, I shared my 12-month roadmap for transitioning from data analyst to data engineer. This is where the actual building begins.
When I published my first article documenting my data engineering journey, something unexpected happened. People resonated with it. I had strangers reaching out saying they were excited to follow along. That felt good.
But it also came with pressure.
Suddenly this wasn’t just a personal goal I could quietly abandon if things got hard. People were watching. People were in the same boat. And that accountability, honestly, is part of why you’re reading this right now.
So I had to move. And like anyone starting a new skill, the first thing I did was look for resources. There are countless tutorials on the internet for data engineering. YouTube videos, courses, written guides. More than you could ever finish.
But I couldn’t bring myself to just consume theory. I needed to build something. Something real, with real data, that actually worked at the end.
So I closed the tutorials and opened a Google Colab notebook instead. I found the GitHub API documentation and decided I was going to build my first ETL pipeline from scratch. No hand-holding. Just me, some Python, and a goal.
This article is that experience documented in full. The code, the confusion, the small wins, and what I actually learned by doing it.
First, what is ETL?
Before I get into what I built, let me quickly explain what ETL actually means because I had to look this up myself not too long ago.
ETL stands for Extract, Transform, Load. It’s one of the most fundamental concepts in data engineering.
- Extract means going somewhere to get data. An API, a database, a website, a file. You’re pulling raw information from a source.
- Transform means cleaning and shaping that data. Removing bad rows, adding new columns, restructuring it so it’s actually useful.
- Load means saving the cleaned data somewhere. A database, a data warehouse, a simple CSV file.
That’s it. Those three steps, done in sequence, are what a data pipeline is. Everything else in data engineering, Airflow, Spark, Databricks, is just more sophisticated ways of doing those same three things at scale.
I’m at the beginning of my roadmap, so I kept it simple. Pure Python, no orchestration tools yet. But the shape of the problem is the same.
What I built
I extracted data from the GitHub API, specifically the most starred Python repositories created in the last 30 days. I then cleaned it, added a new column, and saved the output as a CSV file.
Simple. Real. Entirely mine.
Here’s how it went.
Step 1: Extract
The first thing I had to do was figure out how to talk to the GitHub API. An API is basically a door that a company or platform opens so that developers can request data from it programmatically, without having to manually copy and paste anything.
GitHub has a free, public API. No account or paid plan needed for basic searches.
Here’s the code I wrote to extract the data:
import requests
url = "https://api.github.com/search/repositories"
params = {
"q": "language:python created:>2025-04-22",
"sort": "stars",
"order": "desc",
"per_page": 30
}
response = requests.get(url, params=params)
data = response.json()
print(response.status_code)
print(data.keys())
I’ll be honest. This block confused me at first. The requests library was new to me. The params dictionary with that q syntax felt alien. I didn’t immediately know what .json() was doing or why I needed it.
Let me break it down simply.
requests.get()is how you knock on GitHub’s door and ask for something. Theurlis the address of what you’re asking for. Theparamsdictionary is the specific question you’re asking. In this case: “give me Python repos, sorted by stars, created after April 22, show me 30 results.”.json()converts GitHub’s response from raw text into a Python dictionary that you can actually work with.
When I ran it, I got this:
200
dict_keys(['total_count', 'incomplete_results', 'items'])
The 200 means success. That’s the internet’s way of saying “your request worked.” If you see 403 or 404, something went wrong.
The dictionary has three keys. total_count tells you how many repos matched the search. incomplete_results tells you if GitHub had to cut anything short. And items is where the actual data lives.
I then ran a second block to peek inside:
print("Total matches on GitHub:", data['total_count'])
print("Repos returned:", len(data['items']))
first_repo = data['items'][0]
print("nFirst repo name:", first_repo['name'])
print("Stars:", first_repo['stargazers_count'])
print("Language:", first_repo['language'])
print("URL:", first_repo['html_url'])
Output:
Total matches on GitHub: 9228201
Repos returned: 30
First repo name: skills
Stars: 139136
Language: Python
URL: https://github.com/anthropics/skills
The first result was an Anthropic repo with 139k stars. Real data. Live. Pulled by code I wrote.
That’s Extract done.
Step 2: Transform
Now I had 30 repos sitting in a Python list, each one a nested dictionary with dozens of fields. Most of which I didn’t need. The Transform step is where you take that raw, messy data and shape it into something clean and purposeful.
First I pulled out only the fields I cared about and loaded them into a Pandas dataframe:
import pandas as pd
repos = []
for repo in data['items']:
repos.append({
"name": repo['name'],
"owner": repo['owner']['login'],
"stars": repo['stargazers_count'],
"forks": repo['forks_count'],
"language": repo['language'],
"description": repo['description'],
"url": repo['html_url'],
"created_at": repo['created_at']
})
df = pd.DataFrame(repos)
df.head()
Seeing that dataframe appear was a proper “wow” moment. I went from a wall of JSON to a clean, readable table with labelled columns in a few lines.
Then I did three transformations:
# Drop rows where description is missing
df_clean = df.dropna(subset=['description'])
# Add a viral flag for repos with over 50k stars
df_clean = df_clean.copy()
df_clean['viral'] = df_clean['stars'].apply(lambda x: 'Yes' if x > 50000 else 'No')
# Sort by stars descending
df_clean = df_clean.sort_values('stars', ascending=False).reset_index(drop=True)
print("Before cleaning:", len(df))
print("After cleaning:", len(df_clean))
Output:
Before cleaning: 30
After cleaning: 29
One repo had no description and got dropped. The viral column showed up cleanly. The data was now sorted and structured.
That’s Transform done.
Step 3: Load
The final step. Take the clean data and save it somewhere. I kept this simple and loaded it into a CSV file:
df_clean.to_csv('github_trending_repos.csv', index=False)
print("Pipeline complete. File saved.")
print(f"{len(df_clean)} repos loaded into github_trending_repos.csv")
Output:
Pipeline complete. File saved.
29 repos loaded into github_trending_repos.csv
I downloaded the file and opened it. A clean spreadsheet with 29 rows and 9 columns. Real GitHub data, shaped and saved by a pipeline I built from scratch.
That’s Load done.
What this actually felt like
Before this, whenever I wanted data to work with, I’d go looking for a public dataset someone had already cleaned and uploaded. Kaggle, Google Dataset Search, wherever. I was always a consumer of data that someone else had prepared.
This changed something for me.
The moment I realised I could just point Python at an API I was curious about and extract live data myself, the possibilities felt completely different. I’m not limited to datasets that already exist. I can build the pipeline that creates the dataset.
That’s a different kind of power. And it’s one of the things that drew me toward data engineering in the first place.
What’s next
This pipeline is simple by design. I’m at the start of my roadmap and I’m not going to pretend I’m using Airflow or Spark yet. But the foundation is real. Extract, Transform, Load. It works. I built it. I understand it.
The next step is to make it more robust. Schedule it to run daily. Store the output in a SQLite database instead of a flat CSV. Start tracking how repos trend over time.
And eventually, orchestrate the whole thing with Airflow. But that’s a future article.
For now, the most important thing I proved to myself is that building teaches you things that watching never will. I spent weeks in tutorial land and barely moved. I spent one afternoon actually building, and I understand ETL better than any video made it feel.
Stop watching. Start building.
This is part two of my ongoing data engineering series. Follow along as I document every step of the journey, including the parts that don’t go smoothly. Feel free to check out my more in-depth ETL take on my YouTube channel below.

