dbt: Principles and the first steps with the framework!
Following my plan to write all my posts in English, here is the second one in the dbt framework saga! So, let’s get started with it! 💃 🎉
As I mentioned in the previous post, dbt is a framework that uses SQL as its syntax base for processing and transforming analytical data. In other words, it focuses on the Transformation (T) step of ETL (Extract, Transform, and Load).
Once you understand what the framework is and how to install and configure it, it’s time to learn the fundamental features you need to take your first steps in developing pipelines with this incredible framework!
How do I create my transformation data processes?
The first important concept is that the well-known ETL job, a step of a complete transformation design, is called a model in dbt. Models are .sql files in which the transformations are actually developed.
When executed, the models become objects in the analytical data store. There are four types of materializations:
- View (default): when the pipeline is executed, a view is created in the database from this model, based on the model’s SQL.
- Table: when the pipeline runs, a table is created in the database and data is loaded into it based on the model’s SQL. Every time the model is executed, the existing records are deleted and the data is reloaded, performing a full data load.
- Incremental: a table is created in the database to receive the data load and, at each execution, new rows are inserted (if the data is new) or existing rows are updated (if the data already exists). The merge is carried out based on a key pre-established in the model.
- Ephemeral: no object is created or stored in the database when the pipeline runs; instead, this materialization type is injected into dependent models as a common table expression (CTE).
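To make this concrete, here is a minimal sketch of how a materialization can be configured directly in a model’s config block. The model, table, and column names are illustrative assumptions, not part of any real project:

```sql
-- models/my_incremental_model.sql (illustrative names)
{{ config(materialized='incremental', unique_key='transaction_id') }}

select
    transaction_id,
    customer_id,
    amount,
    updated_at
from raw.transactions  -- raw source table (illustrative)

{% if is_incremental() %}
  -- on incremental runs, only process rows newer than what is already loaded
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```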
In the example above, we can see a workflow composed through model references. The final model, analytics_transactions (4), was created by referencing the stg_transactions_offers (3) model, which in turn referenced the stg_transactions (2) model, which in turn referenced the transactions (1) table, configured in the source file.
Below we can see the code content of the analytics_transactions model, exemplifying the reference that must be applied using the ref function.
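Here is a minimal sketch of what this model could look like, assuming illustrative column names:

```sql
-- models/analytics_transactions.sql (columns are illustrative)
select
    transaction_id,
    customer_id,
    offer_id,
    amount
from {{ ref('stg_transactions_offers') }}
```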
In addition to the ref function, there is also the source function, which you should use to reference a table defined in the source file (we’ll talk about it shortly). This function identifies the source table to be used and follows the syntax below:
{{ source('source_name', 'table_name') }}
Below we can see the code content of the stg_transactions model, exemplifying a reference using the source function.
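A minimal sketch of what this staging model could look like, assuming an illustrative source named raw_app and illustrative column names:

```sql
-- models/stg_transactions.sql (source and column names are illustrative)
select
    transaction_id,
    customer_id,
    amount,
    created_at
from {{ source('raw_app', 'transactions') }}
```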
How to define data sources?
Since the Extraction (and Load) steps of ETL must be done by another tool or framework, it is important to have all data sources configured so that you can develop the first model to be worked on.
Therefore, the best practice is to create a source .yml file, which I suggest naming after the source application of the raw data, precisely to make the code easier to understand and maintain, thanks to the transparent semantics!
The purpose of this file is to contain the database name, schema name, table names and their descriptions, in addition to their columns and data dictionaries!
This material is extremely relevant and important as documentation, supporting developers when reading and maintaining the code!
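As an illustration, here is a minimal sketch of what such a source file could look like, assuming hypothetical database, schema, table, and column names:

```yaml
# models/raw_app.yml (all names are illustrative)
version: 2

sources:
  - name: raw_app                  # source name used in {{ source('raw_app', ...) }}
    database: analytics_db
    schema: raw
    tables:
      - name: transactions
        description: "Raw transactions loaded from the source application"
        columns:
          - name: transaction_id
            description: "Unique identifier of the transaction"
```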
And can I work with a sample data file?
If you want to run a test using a sample of data, dbt provides a command called seed, which ingests files with a .csv extension. For this, the file must be made available in the /data folder.
When you execute the ingestion with the command described below, a new table is created in the analytical repository and loaded with the file’s data.
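For example, a hypothetical seed file placed at data/sample_transactions.csv (file name and columns are illustrative) could look like this:

```csv
transaction_id,customer_id,amount
1,100,25.90
2,101,13.50
3,102,7.25
```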
dbt seed
Important: if this command is executed again, the data will be inserted again, generating duplicate records in the database.
How do I run my pipeline?
The entire pipeline can be executed using the command:
dbt run
It is also possible to run only a specific model using the command:
dbt run --models [model-name-to-be-executed]
Can I apply tests to my pipeline?
A very important and truly necessary feature is pipeline testing! Personally, I think this feature is wonderful and really well designed by the founders! It helps a lot with data quality assurance! 💃 🎉
To use it, you need to create a test .yml file or use the already defined source .yml file itself. In this file, you describe the tests to be applied when the jobs are executed, such as checks for duplicate records, validations of data domains, null records, among others.
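Here is a minimal sketch of how such tests could be declared in the source .yml file, using the illustrative table_name1 and column_name referenced in the examples below (source and schema names are assumptions):

```yaml
# illustrative test definitions inside the source .yml file
version: 2

sources:
  - name: raw_app
    schema: raw
    tables:
      - name: table_name1
        columns:
          - name: column_name
            tests:
              - unique      # values must not be duplicated
              - not_null    # values must not be null
```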
As you can see in the example source.yml file above, some test validations have been defined. For example:
- Validation that column column_name of table table_name1 must contain unique values (no duplicates)
- Validation that every row must contain a value; that is, column column_name of table table_name1 cannot contain null records.
If at any time the rules described above are violated, dbt will return an error related to the given rule.
The validation process can be performed by running the commands described below.
To run all tests defined in the test and source files:
dbt test
To run specific tests:
dbt test --schema  # Tests only the schema tests
dbt test --data    # Tests only the data tests
As you can imagine, dbt has several other features, which I will share here in the future! The objective here was to share the knowledge you need to understand the principles of dbt and, with that, take the first steps in creating pipelines using this incredible framework!
Interested in reading and learning more?
Soon, I’ll post additional content here about this amazing framework! Stay tuned! 😉
To check out:
- The previous post: dbt: the framework that uses SQL and how to install it
- This post in Portuguese: dbt: Princípios e primeiros passos com o framework!