Close Menu
AI News TodayAI News Today

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    Disneyland and Disney World: Summer Deals, New Lands and Rides in 2026 and Beyond

    Cox Media fined after bragging it spied on users through their phones

    I Built My First ETL Pipeline as a Complete Beginner. Here’s How.

    Facebook X (Twitter) Instagram
    • About Us
    • Contact Us
    Facebook X (Twitter) Instagram Pinterest Vimeo
    AI News TodayAI News Today
    • Home
    • Shop
    • AI News
    • AI Reviews
    • AI Tools
    • AI Tutorials
    • Chatbots
    • Free AI Tools
    AI News TodayAI News Today
    Home»AI Tools»I Built My First ETL Pipeline as a Complete Beginner. Here’s How.
    AI Tools

    I Built My First ETL Pipeline as a Complete Beginner. Here’s How.

    By No Comments7 Mins Read
    Share Facebook Twitter Pinterest LinkedIn Tumblr Reddit Telegram Email
    I Built My First ETL Pipeline as a Complete Beginner. Here’s How.
    Share
    Facebook Twitter LinkedIn Pinterest Email

    of my data engineering journey series. In part one, I shared my 12-month roadmap for transitioning from data analyst to data engineer. This is where the actual building begins.

    When I published my first article documenting my data engineering journey, something unexpected happened. People resonated with it. I had strangers reaching out saying they were excited to follow along. That felt good.

    But it also came with pressure.

    Suddenly this wasn’t just a personal goal I could quietly abandon if things got hard. People were watching. People were in the same boat. And that accountability, honestly, is part of why you’re reading this right now.

    So I had to move. And like anyone starting a new skill, the first thing I did was look for resources. There are countless tutorials on the internet for data engineering. YouTube videos, courses, written guides. More than you could ever finish.

    But I couldn’t bring myself to just consume theory. I needed to build something. Something real, with real data, that actually worked at the end.

    So I closed the tutorials and opened a Google Colab notebook instead. I found the GitHub API documentation and decided I was going to build my first ETL pipeline from scratch. No hand-holding. Just me, some Python, and a goal.

    This article is that experience documented in full. The code, the confusion, the small wins, and what I actually learned by doing it.

    First, what is ETL?

    Before I get into what I built, let me quickly explain what ETL actually means because I had to look this up myself not too long ago.

    ETL stands for Extract, Transform, Load. It’s one of the most fundamental concepts in data engineering.

    • Extract means going somewhere to get data. An API, a database, a website, a file. You’re pulling raw information from a source.
    • Transform means cleaning and shaping that data. Removing bad rows, adding new columns, restructuring it so it’s actually useful.
    • Load means saving the cleaned data somewhere. A database, a data warehouse, a simple CSV file.

    That’s it. Those three steps, done in sequence, are what a data pipeline is. Everything else in data engineering, Airflow, Spark, Databricks, is just more sophisticated ways of doing those same three things at scale.

    I’m at the beginning of my roadmap, so I kept it simple. Pure Python, no orchestration tools yet. But the shape of the problem is the same.

    What I built

    I extracted data from the GitHub API, specifically the most starred Python repositories created in the last 30 days. I then cleaned it, added a new column, and saved the output as a CSV file.
    Simple. Real. Entirely mine.

    Here’s how it went.

    Step 1: Extract

    The first thing I had to do was figure out how to talk to the GitHub API. An API is basically a door that a company or platform opens so that developers can request data from it programmatically, without having to manually copy and paste anything.

    GitHub has a free, public API. No account or paid plan needed for basic searches.

    Here’s the code I wrote to extract the data:

    import requests
    
    url = "https://api.github.com/search/repositories"
    
    params = {
        "q": "language:python created:>2025-04-22",
        "sort": "stars",
        "order": "desc",
        "per_page": 30
    }
    
    response = requests.get(url, params=params)
    data = response.json()
    
    print(response.status_code)
    print(data.keys())

    I’ll be honest. This block confused me at first. The requests library was new to me. The params dictionary with that q syntax felt alien. I didn’t immediately know what .json() was doing or why I needed it.

    Let me break it down simply.

    • requests.get() is how you knock on GitHub’s door and ask for something. The url is the address of what you’re asking for. The
    • params dictionary is the specific question you’re asking. In this case: “give me Python repos, sorted by stars, created after April 22, show me 30 results.”
    • .json() converts GitHub’s response from raw text into a Python dictionary that you can actually work with.

    When I ran it, I got this:

    200 
    dict_keys(['total_count', 'incomplete_results', 'items'])

    The 200 means success. That’s the internet’s way of saying “your request worked.” If you see 403 or 404, something went wrong.
    The dictionary has three keys. total_count tells you how many repos matched the search. incomplete_results tells you if GitHub had to cut anything short. And items is where the actual data lives.

    I then ran a second block to peek inside:

    print("Total matches on GitHub:", data['total_count'])
    print("Repos returned:", len(data['items']))
    
    first_repo = data['items'][0]
    print("nFirst repo name:", first_repo['name'])
    print("Stars:", first_repo['stargazers_count'])
    print("Language:", first_repo['language'])
    print("URL:", first_repo['html_url'])

    Output:

    Total matches on GitHub: 9228201
    Repos returned: 30
    
    First repo name: skills
    Stars: 139136
    Language: Python
    URL: https://github.com/anthropics/skills

    The first result was an Anthropic repo with 139k stars. Real data. Live. Pulled by code I wrote.

    That’s Extract done.

    Step 2: Transform

    Now I had 30 repos sitting in a Python list, each one a nested dictionary with dozens of fields. Most of which I didn’t need. The Transform step is where you take that raw, messy data and shape it into something clean and purposeful.

    First I pulled out only the fields I cared about and loaded them into a Pandas dataframe:

    import pandas as pd
    
    repos = []
    
    for repo in data['items']:
        repos.append({
            "name": repo['name'],
            "owner": repo['owner']['login'],
            "stars": repo['stargazers_count'],
            "forks": repo['forks_count'],
            "language": repo['language'],
            "description": repo['description'],
            "url": repo['html_url'],
            "created_at": repo['created_at']
        })
    
    df = pd.DataFrame(repos)
    df.head()

    Seeing that dataframe appear was a proper “wow” moment. I went from a wall of JSON to a clean, readable table with labelled columns in a few lines.

    Then I did three transformations:

    # Drop rows where description is missing
    df_clean = df.dropna(subset=['description'])
    
    # Add a viral flag for repos with over 50k stars
    df_clean = df_clean.copy()
    df_clean['viral'] = df_clean['stars'].apply(lambda x: 'Yes' if x > 50000 else 'No')
    
    # Sort by stars descending
    df_clean = df_clean.sort_values('stars', ascending=False).reset_index(drop=True)
    
    print("Before cleaning:", len(df))
    print("After cleaning:", len(df_clean))

    Output:

    Before cleaning: 30 
    After cleaning: 29

    One repo had no description and got dropped. The viral column showed up cleanly. The data was now sorted and structured.
    That’s Transform done.

    Step 3: Load

    The final step. Take the clean data and save it somewhere. I kept this simple and loaded it into a CSV file:

    df_clean.to_csv('github_trending_repos.csv', index=False)
    
    print("Pipeline complete. File saved.")
    print(f"{len(df_clean)} repos loaded into github_trending_repos.csv")

    Output:

    Pipeline complete. File saved.
    29 repos loaded into github_trending_repos.csv

    I downloaded the file and opened it. A clean spreadsheet with 29 rows and 9 columns. Real GitHub data, shaped and saved by a pipeline I built from scratch.

    That’s Load done.

    What this actually felt like

    Before this, whenever I wanted data to work with, I’d go looking for a public dataset someone had already cleaned and uploaded. Kaggle, Google Dataset Search, wherever. I was always a consumer of data that someone else had prepared.

    This changed something for me.

    The moment I realised I could just point Python at an API I was curious about and extract live data myself, the possibilities felt completely different. I’m not limited to datasets that already exist. I can build the pipeline that creates the dataset.

    That’s a different kind of power. And it’s one of the things that drew me toward data engineering in the first place.

    What’s next

    This pipeline is simple by design. I’m at the start of my roadmap and I’m not going to pretend I’m using Airflow or Spark yet. But the foundation is real. Extract, Transform, Load. It works. I built it. I understand it.

    The next step is to make it more robust. Schedule it to run daily. Store the output in a SQLite database instead of a flat CSV. Start tracking how repos trend over time.

    And eventually, orchestrate the whole thing with Airflow. But that’s a future article.

    For now, the most important thing I proved to myself is that building teaches you things that watching never will. I spent weeks in tutorial land and barely moved. I spent one afternoon actually building, and I understand ETL better than any video made it feel.

    Stop watching. Start building.

    This is part two of my ongoing data engineering series. Follow along as I document every step of the journey, including the parts that don’t go smoothly. Feel free to check out my more in-depth ETL take on my YouTube channel below.

    Connect with me on LinkedIn, YouTube, and Twitter.

    Beginner built complete ETL heres Pipeline
    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Previous ArticleThe pitch trick that helped an eSports startup raise $20M when VCs only wanted AI
    Next Article Cox Media fined after bragging it spied on users through their phones
    • Website

    Related Posts

    AI Tools

    From TF-IDF to Transformers: Implementing Four Generations of Semantic Search

    AI Tools

    datasette 1.0a30

    AI Tools

    The Ultimate Beginners’ Guide to Building an AI Agent in Python

    Add A Comment
    Leave A Reply Cancel Reply

    Top Posts

    Disneyland and Disney World: Summer Deals, New Lands and Rides in 2026 and Beyond

    0 Views

    Cox Media fined after bragging it spied on users through their phones

    0 Views

    I Built My First ETL Pipeline as a Complete Beginner. Here’s How.

    0 Views
    Stay In Touch
    • Facebook
    • YouTube
    • TikTok
    • WhatsApp
    • Twitter
    • Instagram
    Latest Reviews
    AI Tutorials

    Quantization from the ground up

    AI Tools

    David Sacks is done as AI czar — here’s what he’s doing instead

    AI Reviews

    Judge sides with Anthropic to temporarily block the Pentagon’s ban

    Subscribe to Updates

    Get the latest tech news from FooBar about tech, design and biz.

    Most Popular

    Disneyland and Disney World: Summer Deals, New Lands and Rides in 2026 and Beyond

    0 Views

    Cox Media fined after bragging it spied on users through their phones

    0 Views

    I Built My First ETL Pipeline as a Complete Beginner. Here’s How.

    0 Views
    Our Picks

    Quantization from the ground up

    David Sacks is done as AI czar — here’s what he’s doing instead

    Judge sides with Anthropic to temporarily block the Pentagon’s ban

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    Facebook X (Twitter) Instagram Pinterest
    • About Us
    • Contact Us
    • Terms & Conditions
    • Privacy Policy
    • Disclaimer

    © 2026 ainewstoday.co. All rights reserved. Designed by DD.

    Type above and press Enter to search. Press Esc to cancel.