dbt: the framework that uses SQL syntax and how to install it
I’ve been planning to write all my posts in English for some time, and now I’m finally starting the DBT framework saga! This is the first post about a framework I came across and worked with a few years ago, which is now one of the buzzwords of the moment, and it is indeed really good!
So, let’s get started with the first post in English! 💃 🎉
How did I find out about DBT?
As the company I was working for at the time was growing, our team aimed to migrate our architecture to a more scalable stack, one that would make data pipeline development easier. As the person responsible for ETL definitions, I started researching with the team to explore and test some possibilities beyond the Python and Spark stack we already knew. During that technology exploration, I came across the DBT framework, which I was responsible for studying and testing!
Tell me something: is there a more democratic language than SQL in the data world? Yeah, exactly! SQL is must-have knowledge! Exactly for that reason, needless to say, my curiosity was at its peak, and of course my expectations too!
After analyzing the pros and cons, which we’ll talk about later, we decided to adopt it as part of the data stack! And from there, we started the refactoring work to continue the migration!
Great! But what is DBT after all?
DBT (data build tool) is a framework that uses SQL as its syntax base for processing/transforming analytical data. That is, it focuses on the Transformation (T) step of ETL (Extraction, Transformation, and Load).
In addition, the framework allows the use of macros and Jinja in development, so you can reuse existing features and create your own customizations.
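To make that concrete, here is a minimal sketch of what a dbt model looks like (the model name, the raw_data.orders table, and the start_date variable are all hypothetical, just for illustration):

```sql
-- models/stg_orders.sql (hypothetical model name)
-- The config() block is Jinja: it tells dbt how to materialize this model.
{{ config(materialized='view') }}

select
    order_id,
    customer_id,
    order_date,
    amount
from raw_data.orders  -- assumed raw source table
where order_date >= '{{ var("start_date", "2020-01-01") }}'  -- Jinja variable with a default
```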
The framework can connect to a series of analytical repositories, such as AWS Redshift, BigQuery, Snowflake, among others, as described below! Interestingly, it allows specific configurations for each database/data platform, enabling customizations that improve performance on each solution.
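Just to illustrate, a hedged sketch of such a platform-specific setting: on Redshift, for instance, you can declare distribution and sort keys right in the model’s config block, something that would not apply on BigQuery, where partitioning is used instead (the column names below are hypothetical):

```sql
-- Redshift-specific settings in a model (hypothetical columns)
{{ config(
    materialized='table',
    dist='customer_id',    -- distribution key (Redshift only)
    sort='order_date'      -- sort key (Redshift only)
) }}

select * from raw_data.orders
```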
Another super important feature of this framework is the creation of DAGs, the chaining between the models (jobs) you develop. This composition is nothing more than the entire data processing workflow, the data pipeline (considering, of course, only the transformation step)!
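A minimal sketch of how that chaining happens in the code itself: by using the ref() function, the model below declares that it depends on stg_orders, and dbt uses that reference to build the DAG (model names are hypothetical):

```sql
-- models/fct_daily_revenue.sql (hypothetical model name)
-- ref() tells dbt that this model depends on stg_orders,
-- which creates an edge in the DAG and controls execution order.
select
    order_date,
    sum(amount) as daily_revenue
from {{ ref('stg_orders') }}
group by order_date
```

If you want to browse that graph locally, the dbt docs generate and dbt docs serve commands render it in the browser.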
The DAGs are created during model development itself, through the references in the models’ code (like the ref() above). These DAGs can even be visualized graphically, as in the example below:
DBT is open-source and is constantly expanding, with new features added as versions are released!
To follow the releases, just check the link for the desired version.
There is also a paid, cloud-based version, provided by Fishtown Analytics, the company founded by the creators and main developers of the framework (the community also contributes)! The difference is that the paid solution is a managed product with features that allow, for example, full management and administration of workflow schedules, access security, and monitoring of all logs through a graphical interface.
The open-source framework, of course, requires some development on your side to automate the scheduling and monitoring process.
And what are the pros and cons of this Framework?
Of course, this type of analysis is very particular, as each company has its own needs and expectations in different aspects, and these factors should be considered individually during the evaluation process.
That being said, I’ve consolidated the most relevant topics to consider, including updated points, since the first post (in Portuguese) covered the main issues faced in the past.
Nice! And how can I install and configure DBT?
Installation is mainly supported on Linux (Ubuntu, Debian, and CentOS), Windows, and macOS, and the requirement is having Python 3.6+ installed! Nowadays, with Docker, it’s even easier to get it running, tailored to your needs and customizations, based on the images they already provide.
To install, follow the step-by-step instructions provided in this doc.
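For reference, a minimal sketch of that process using pip (the project name my_dbt_project is just a placeholder; in more recent versions the package was split into dbt-core plus one adapter package, e.g. dbt-postgres):

```bash
# Install dbt (older releases shipped as a single package)
pip install dbt

# Or, on recent versions, core plus the adapter for your warehouse:
# pip install dbt-core dbt-postgres

# Create a new project skeleton with the folder structure described below
dbt init my_dbt_project
```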
Once the framework is installed and a project is created, a directory is generated with the structure described in the image below:
Each folder created within this directory has a specific purpose, namely:
- analysis: Where analytical queries, such as reports, are stored
- data: Where raw data sample files should be stored
- dbt_modules: Where dbt packages are installed
- logs: Where the dbt logs are stored
- macros: Where reusable macros are stored
- models: Where the data models themselves are created, as .sql files
- snapshots: Where snapshots are stored
- target: Where dbt stores the compiled SQL scripts used to create the models
- tests: Where test files are stored
- dbt_project.yml: The file that contains the configuration parameters for pipeline execution, the internal directories used for data processing, and any variables to be considered (a minimal example follows this list)
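To give an idea of what that looks like, here is a minimal, hypothetical dbt_project.yml matching the directory names above (note that some keys were renamed in later dbt versions, e.g. source-paths became model-paths and data-paths became seed-paths):

```yaml
# dbt_project.yml: minimal hypothetical example
name: my_dbt_project
version: '1.0.0'
profile: my_dbt_project        # must match an entry in ~/.dbt/profiles.yml

source-paths: ["models"]       # model-paths in later versions
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]           # seed-paths in later versions
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
log-path: "logs"

vars:
  start_date: '2020-01-01'     # read in models with var('start_date')

models:
  my_dbt_project:
    materialized: view         # default materialization for every model
```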
After installation, in addition to the project folder, a new hidden folder called .dbt is created in your home directory, containing a file called profiles.yml. This file must be updated with the access parameters for the analytical repository you’ll use, as seen below.
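A hedged sketch of what that file could look like for a Postgres target (all connection values are placeholders; use the type and fields of your own warehouse, e.g. redshift, bigquery, or snowflake):

```yaml
# ~/.dbt/profiles.yml: hypothetical Postgres example
my_dbt_project:                # must match the profile name in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: postgres           # swap for redshift, bigquery, snowflake, ...
      host: localhost          # placeholder connection details
      port: 5432
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"   # keep secrets out of the file
      dbname: analytics
      schema: dbt_dev
      threads: 4
```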
Once installation and configuration are complete, just start the data transformation development process using DBT!
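From there, the day-to-day loop boils down to a few commands run inside the project folder:

```bash
dbt debug   # validates the profiles.yml connection and project configuration
dbt run     # compiles the models and materializes them in the warehouse
dbt test    # runs the tests defined for your models
```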
Interested in reading and learning more?
Soon, I’ll post additional content here about this amazing framework! Stay tuned! 😉
- To check out the post in Portuguese: DBT: o framework que usa de sintaxe SQL para processamento de dados