In this project, we use the PokeAPI, a RESTful API that provides comprehensive Pokémon data, to showcase data engineering skills by building an Extract, Transform, Load (ETL) pipeline on Databricks. The pipeline retrieves data from PokeAPI endpoints, including information on Pokémon species, abilities, types, and more, using HTTP requests. We then process and refine the raw data into a structured format suitable for analysis and visualization, which involves cleaning, parsing, and transforming the data to ensure consistency and usability. Finally, we store the transformed data in a scalable and efficient storage layer, Delta Lake, enabling seamless querying and analysis.
This allows us to answer questions such as:
What is the average weight of Pokémon by type?
List the highest-accuracy move for each Pokémon type.
Count the number of moves per Pokémon, ordered from greatest to least.
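As a rough sketch of how the last question could be answered once the dimensional model is in place, the PySpark aggregation below counts moves per Pokémon. The table and column names (`fct_pokemon_move`, `dim_pokemon`, `move_id`) are hypothetical stand-ins, not the project's actual schema.

```python
# Sketch only: table and column names are hypothetical stand-ins for the
# project's dimensional model stored in Delta Lake.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

fct_moves = spark.read.table("fct_pokemon_move")   # one row per (pokemon_id, move_id)
dim_pokemon = spark.read.table("dim_pokemon")      # pokemon_id, name, weight, height, ...

# Count moves per Pokemon and order from greatest to least
moves_per_pokemon = (
    fct_moves.join(dim_pokemon, "pokemon_id")
             .groupBy("name")
             .agg(F.count("move_id").alias("move_count"))
             .orderBy(F.desc("move_count"))
)
moves_per_pokemon.show()
```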
In our project, the ingestion process uses PySpark to retrieve data from the REST API and land it in Amazon S3 as JSON files partitioned by timestamp. The landed files can be viewed in DBFS. After landing, we read them into the raw layer as streaming tables; here we follow SCD Type 2 and incrementally carry only the most recent records forward. Following ingestion, preprocessing flattens the nested structures for easier use. We then construct dimension and fact tables according to the established data models.
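A minimal sketch of the landing and raw-layer steps is shown below. It assumes the `requests` library for the HTTP call, a hypothetical S3 bucket and paths, and Databricks Auto Loader (`cloudFiles`) for the incremental streaming read; `spark` and `dbutils` are provided by the Databricks runtime. The endpoint, paths, and table names are illustrative, not the project's exact configuration.

```python
# Sketch only: bucket, paths, and table names are illustrative assumptions.
# Assumes a Databricks notebook, where `spark` and `dbutils` are predefined.
import json
from datetime import datetime, timezone

import requests

LANDING_PATH = "s3://pokemon-etl-bucket/landing/pokemon"   # hypothetical bucket
ingest_ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")

# Extract: pull a page of Pokemon from the PokeAPI over HTTP
resp = requests.get("https://pokeapi.co/api/v2/pokemon", params={"limit": 100})
resp.raise_for_status()
records = resp.json()["results"]

# Land as JSON, partitioned by ingestion timestamp
dbutils.fs.put(
    f"{LANDING_PATH}/ingest_ts={ingest_ts}/pokemon.json",
    json.dumps(records),
    overwrite=True,
)

# Raw layer: read landed files incrementally as a streaming table (Auto Loader)
raw_stream = (
    spark.readStream.format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "s3://pokemon-etl-bucket/_schemas/raw_pokemon")
         .load(LANDING_PATH)
)

(raw_stream.writeStream
    .option("checkpointLocation", "s3://pokemon-etl-bucket/_checkpoints/raw_pokemon")
    .trigger(availableNow=True)
    .toTable("raw_pokemon"))
```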
The CDM and dimensional model layers involve PySpark transformations to calculate average weight and BMI and to join the tables. These dimensional models can be extracted into Tableau to build dashboards.
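As an illustration of the kind of transformation this layer performs, the sketch below derives BMI and averages weight per type. The unit conversions assume PokeAPI's hectogram/decimetre units, and all table and column names are hypothetical rather than the project's actual schema.

```python
# Sketch only: table/column names and the exact BMI convention are assumptions.
# PokeAPI reports weight in hectograms and height in decimetres, so both are
# converted to metric units before computing BMI = kg / m^2.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

dim_pokemon = spark.read.table("dim_pokemon")        # pokemon_id, name, weight, height
dim_type = spark.read.table("dim_type")              # type_id, type_name
fct_types = spark.read.table("fct_pokemon_type")     # pokemon_id, type_id

cdm_pokemon = (
    dim_pokemon
    .withColumn("weight_kg", F.col("weight") / 10)
    .withColumn("height_m", F.col("height") / 10)
    .withColumn("bmi", F.round(F.col("weight_kg") / (F.col("height_m") ** 2), 2))
)

# Join to the type dimension so Tableau dashboards can slice by type
avg_by_type = (
    fct_types.join(cdm_pokemon, "pokemon_id")
             .join(dim_type, "type_id")
             .groupBy("type_name")
             .agg(F.avg("weight_kg").alias("avg_weight_kg"),
                  F.avg("bmi").alias("avg_bmi"))
)

avg_by_type.write.format("delta").mode("overwrite").saveAsTable("cdm_avg_by_type")
```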
One challenge concerns the synchronization of write streams:
The fact table does not wait for the dimensions → if a new Pokémon is added, it is only included in the next processing run of the fact table.
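To make this concrete, the simplified sketch below rebuilds the fact table as a batch job rather than with the project's actual write streams; because the dimension is read as it exists at run time, a Pokémon that lands afterwards drops out of the join and only appears the next time the fact is processed. All names are hypothetical.

```python
# Simplified sketch (batch rather than streaming) with hypothetical names,
# illustrating why a newly added Pokemon only appears in the next fact run.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Dimension as of this run; a Pokemon added after this read is not visible here
dim_pokemon = spark.read.table("dim_pokemon")
raw_moves = spark.read.table("raw_pokemon_moves")

fct_pokemon_move = (
    raw_moves.join(dim_pokemon, "pokemon_id")   # unmatched (new) Pokemon drop out
             .select("pokemon_id", "move_id", "ingest_ts")
)

# Overwrite the fact; the dropped Pokemon is picked up on the next run,
# once the dimension has been refreshed ahead of the fact build.
(fct_pokemon_move.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("fct_pokemon_move"))
```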
Management of PII (personally identifiable information) could also be handled within this pipeline.