dbt pipeline use case (part 1): Data Transformation
As discussed in our previous post, dbt is a framework that uses SQL as its base syntax for processing/transforming analytical data. In other words, its primary focus is the Transform (T) stage of ETL (Extract, Transform, Load).
In that post, we covered the fundamental features of dbt, aiming to give an overview of the functionality needed to use the framework. If you haven’t seen it yet, click here to learn more about the topic.
Great! Now that we understand what dbt is and some of its basic, essential features, let’s dive into a real-world case from start to finish using dbt to solidify our understanding, shall we? Absolutely!
The goal of this article is to explore a data transformation and processing case, using this project as the reference for all the content described here.
LET'S TALK ABOUT OUR USE CASE!
We need to support a brand-new consumer goods company in conducting analyses and addressing some of the questions below:
- What is the company’s monthly revenue?
- What are the top 5 most-purchased product categories?
- Which products are the best-sellers and should, therefore, be analyzed in greater depth for inventory maintenance?
- Which products had the highest application of offers and/or discounts, resulting in increased sales? And which ones did not?
- Which stores have the highest sales volume? And so on!
As you can see, our goal is to build an analytical structure that helps the company perform its descriptive analyses and explore its data!
Exciting! Now, what will be our data source?
As described in detail below, the available data consists of two CSV files, which need to be imported into the database so they can be processed and serve as the source for the subsequent analyses, as we will discuss later.
- These files are sourced from Kaggle and can be downloaded via this link.
- Since Kaggle datasets are generally built for Data Science models and often lack the descriptive domain data, most attributes are provided only as IDs, which is precisely the case with our files.
Description of the files and their respective attributes:
1. transactions: Contains the transaction history for all customers over a period of at least 1 year before the offered incentive.
- id: A unique identifier representing a customer.
- chain: An integer representing a store chain.
- dept: An aggregated category grouping (e.g., water).
- category: The product category (e.g., carbonated water).
- company: The company ID selling the item.
- brand: The brand ID to which the item belongs.
- date: The date of the purchase.
- productsize: The size of the product purchased (e.g., 16 for 16 oz of water).
- productmeasure: Units of the product purchase.
- purchasequantity: Number of units purchased.
- purchaseamount: Dollar value of the purchase.
Note: Negative values in purchasequantity and/or purchaseamount indicate product returns (a quick check for this is sketched right after the file descriptions).
2. offers: Contains information about the offers:
- offer: An ID representing a specific offer.
- category: Product category (e.g., carbonated water).
- quantity: Number of units to be purchased to receive the discount.
- company: The company ID selling the item.
- offervalue: Dollar value of the offer.
- brand: The brand ID to which the item belongs.
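The negative-value rule above will matter again in the staging layer. Purely as an illustration (this query is not part of the project and assumes the seeds have already been loaded into a transactions table, as described in the next section), a quick sanity check could look like this:

```sql
-- Hypothetical sanity check: count how many transaction rows are returns,
-- i.e. rows with a negative quantity or a negative purchase amount.
select
    count(*) as total_rows,
    sum(case when purchasequantity < 0 or purchaseamount < 0 then 1 else 0 end) as return_rows
from transactions;
```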
Structuring the model for the data transformation process!
DATA IMPORT:
The first step in the data transformation is to make the files available in the /data folder and import them using the command described below.
When the command is executed, dbt creates tables with the same names as the files. Therefore, keep the file names free of anything case-sensitive: no spaces, no special characters, and no initcap (mixed case). The same rule applies to the files' column names.
Important: when the dbt command is executed, the tables are created and the data is imported. Consequently, if the same command is executed again, an error will be raised because the tables already exist.
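For reference, the folder that `dbt seed` reads from is declared in dbt_project.yml. A minimal sketch, with hypothetical project and profile names (the exact configuration key depends on your dbt version):

```yaml
# dbt_project.yml (fragment) -- illustrative only; the names are hypothetical
name: dbt_purchases_project
profile: dbt_purchases_project   # must match a profile in ~/.dbt/profiles.yml

# folder that `dbt seed` reads the CSV files from
data-paths: ["data"]             # renamed to `seed-paths` in dbt 1.0+
```

If you change the files and need to reload them, `dbt seed --full-refresh` drops and recreates the seed tables instead of failing because they already exist.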
DATA TRANSFORMATION:
As previously discussed, since this project considers a limited dataset (without the domains of each ID), the ‘flat table technique’ was chosen for analytical data modeling. This technique is commonly used by Data Science teams to create models, with columns representing features (also known as variables). If we had domain-specific data, the modeling strategies would be considerably expanded, as different data modeling techniques could be applied, resulting in greater richness in the analysis process.
That said, let’s begin discussing the structure for the transformation process! In the end, the following DAG (Directed Acyclic Graph) should be created:
For the purpose of organizing and processing the data, the ETL process is divided into 3 stages (and folders):
- raw: Contains the sources file (purchases.yml), which declares the source tables.
- staging: Contains the staging models, i.e., intermediate models with transformations applied.
- processed: Contains the final models, which apply the last transformations and create the analytical tables that serve as data sources for analysis.
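Putting the stages together, the project layout could look roughly like the sketch below (the CSV file names are an assumption based on the table names, and the actual repository may organize things slightly differently):

```
dbt_purchases_project/
├── data/                  # CSV files loaded by `dbt seed`
│   ├── transactions.csv
│   └── offers.csv
├── models/
│   ├── raw/               # purchases.yml (source declarations)
│   ├── staging/           # stg_* models + schema.yml
│   └── processed/         # analytics_* models + schema.yml
└── dbt_project.yml
```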
RAW:
purchases.yml: This file declares the raw data sources for the entire pipeline.
- sources.name: Holds the name of the schema in the warehouse that contains the raw data.
- tables.name: Holds the names of the created tables with the raw data, in this case, transactions and offers.
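As a sketch of what purchases.yml might contain (the source name is a placeholder for the schema where `dbt seed` created the tables; adapt it to your warehouse):

```yaml
# models/raw/purchases.yml -- illustrative sketch
version: 2

sources:
  - name: public             # sources.name: placeholder for the schema holding the raw data
    tables:
      - name: transactions   # tables.name: the raw tables created by `dbt seed`
      - name: offers
```

The staging models can then reference these tables with {{ source('public', 'transactions') }}, using whatever name you gave the source, instead of hard-coding the schema.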
STAGING:
stg_offers.sql: A model created to hold all offers, retaining only the latest offer per brand, category, and supplier.
stg_transactions.sql: A model created to list all transactions, flagging information about purchases and returns.
stg_transactions_offers.sql: A model created to list all transactions, identifying the transactions + products that had offers and/or discounts, in addition to the previously processed information.
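To make the staging layer concrete, here is a hedged sketch of what stg_transactions_offers.sql could look like. The join keys (category, company, and brand, i.e. the columns the two files share) are an assumption, and the actual model in the repository may differ:

```sql
-- models/staging/stg_transactions_offers.sql -- illustrative sketch
with transactions as (
    select * from {{ ref('stg_transactions') }}
),

offers as (
    select * from {{ ref('stg_offers') }}
)

select
    transactions.*,
    offers.offer                as offer_id,
    offers.offervalue,
    (offers.offer is not null)  as has_offer   -- flags purchases with an offer/discount
from transactions
left join offers
    on  transactions.category = offers.category
    and transactions.company  = offers.company
    and transactions.brand    = offers.brand
```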
schema.yml: This file contains data validation rules applied to the staging tables. In this case, it validates that the 'id' column of stg_transactions contains no null records, since it serves as the table's primary key.
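In dbt, that rule translates into a not_null test in the staging schema.yml; a minimal sketch:

```yaml
# models/staging/schema.yml -- illustrative sketch
version: 2

models:
  - name: stg_transactions
    columns:
      - name: id
        tests:
          - not_null   # `id` is the primary key, so null records are not expected
```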
PROCESSED:
analytics_offers.sql: A model created to list the volume of offers by category, supplier, and brand.
analytics_transactions.sql: A model created for analyzing the transaction and offer data with all treatments applied.
analytics_transactions_yearmonth.sql: A model that aggregates transactions at a monthly grain, identifying product categories, suppliers, product sales volumes, and revenue.
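As an illustration of the processed layer, here is a sketch of how analytics_transactions_yearmonth could compute monthly sales volume and revenue. It assumes the date column was typed as a date when the seed was loaded, and the real model most likely computes more than this:

```sql
-- models/processed/analytics_transactions_yearmonth.sql -- illustrative sketch
with transactions as (
    select * from {{ ref('stg_transactions_offers') }}
)

select
    to_char(date_trunc('month', transactions.date), 'YYYY-MM') as yearmonth,
    category,
    company,
    brand,
    sum(purchasequantity) as total_units_sold,
    sum(purchaseamount)   as total_revenue
from transactions
group by 1, 2, 3, 4
```

A model like this is what ultimately answers the monthly revenue question from the beginning of the post.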
schema.yml: This file contains data validation rules for the analytical tables, in this case applied to two main tables:
- analytics_offers: Checks that the combination of category, supplier, and brand is unique, since it serves as the key of the final table; it also validates that these columns are never null, as expected.
- analytics_transactions: Checks that the combination of transaction ID, category, supplier, and brand never contains null records.
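A sketch of how these rules could be declared. Testing the uniqueness of a column combination is not one of dbt's built-in tests, so this sketch assumes the dbt_utils package (a surrogate key column tested with unique would be an alternative); supplier corresponds to the company column:

```yaml
# models/processed/schema.yml -- illustrative sketch (assumes dbt_utils is installed)
version: 2

models:
  - name: analytics_offers
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: [category, company, brand]
    columns:
      - name: category
        tests: [not_null]
      - name: company
        tests: [not_null]
      - name: brand
        tests: [not_null]

  - name: analytics_transactions
    columns:
      - name: id
        tests: [not_null]
      - name: category
        tests: [not_null]
      - name: company
        tests: [not_null]
      - name: brand
        tests: [not_null]
```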
How do you run the full data loading process?
1. In the terminal, within your project, execute 'dbt seed' to import the files available in the /data folder.
dbt seed
2. In the terminal, within your project, execute ‘dbt run’ to run the entire ETL pipeline.
dbt run
3. In the terminal, within your project, execute ‘dbt test’ to run tests related to the created models.
dbt test
Would you like to run the entire project without having to reproduce the code yourself?
Absolutely!
If you don't want to recreate the entire code, simply download/clone it from the git repository (https://github.com/jmilhomem/dbt_purchases_project) and run it! To do this, consider the following prerequisites:
- Use Linux/Ubuntu.
- Have dbt installed.
- Have a local or cloud-based Postgres instance (this project uses AWS RDS).
- Have the profiles.yml file, located at ~/.dbt, filled in with the connection details of the analytical database you created (see the sketch after this list).
- Have Python 3.6 installed.
- Have Make installed.
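Regarding profiles.yml, a minimal Postgres profile could look like the sketch below; the profile name must match the one set in dbt_project.yml, and every value is a placeholder:

```yaml
# ~/.dbt/profiles.yml -- illustrative sketch
dbt_purchases_project:                        # hypothetical profile name
  target: dev
  outputs:
    dev:
      type: postgres
      host: your-instance.rds.amazonaws.com   # local host or AWS RDS endpoint
      port: 5432
      user: your_user
      password: your_password
      dbname: your_database
      schema: your_schema
      threads: 1
```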
The process for executing the project is available and described in detail in the readme. Just follow it and enjoy!
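Since Make is listed as a prerequisite, the repository presumably wraps the dbt commands in Makefile targets. A purely hypothetical example of what such a target could look like (the actual Makefile in the repository may differ):

```makefile
# Hypothetical Makefile fragment -- check the repository for the real targets
.PHONY: pipeline

pipeline:
	dbt seed    # import the CSV files from /data
	dbt run     # build the staging and processed models
	dbt test    # run the schema tests
```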
Pretty cool, right?! I hope you’ve learned and had fun in the process!
The next step is the integration with Python, which aims to orchestrate the data load and monitor its logs.
Stay tuned, and see you soon! 😉 ❤
To check out previous/other related posts:
In English:
- dbt: the framework that uses SQL and how to install it
- dbt: Principles and the first steps with the framework!
This post in Portuguese: Case de pipeline com dbt(parte 1): Transformação de dados