dbt (data build tool) enables analytics engineers to transform data in their data warehouse simply by writing SQL select statements. dbt takes care of turning these select statements into tables and views.
dbt does the T in ELT (Extract, Load, Transform): it does not extract or load data, but sits on top of the database as a layer around it. It excels at transforming data that has already been loaded into your warehouse, so you do not spend time extracting data from the warehouse and re-uploading it.
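To make this concrete, here is a minimal sketch of a dbt model (the table and column names are made up for illustration): a model is just a select statement saved as a .sql file, which dbt compiles into a view or table in the warehouse.

```sql
-- models/active_customers.sql
-- A dbt "model" is nothing more than a select statement;
-- dbt wraps it in the DDL needed to create a view or table.
select
    customer_id,
    email,
    created_at
from raw.customers
where is_active = true
```

When you run `dbt run`, dbt compiles this into something like `create view analytics.active_customers as (select ...)` and executes it against your warehouse.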
How do you use dbt?
dbt connects to your data warehouse to perform data transformation queries. As such, you need a data warehouse with source data loaded into it to use dbt. dbt supports native connections to Snowflake, BigQuery, Redshift and Postgres data warehouses, and there are a number of community-supported adapters for other warehouses.
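As a sketch of what such a connection looks like, below is a `profiles.yml` for a Postgres warehouse; every value (project name, host, credentials, schema) is a placeholder you would replace with your own.

```yaml
# profiles.yml — tells dbt how to reach the warehouse.
# All values below are placeholders for illustration.
my_project:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: analytics
      password: "{{ env_var('DBT_PASSWORD') }}"  # keep secrets out of the file
      dbname: warehouse
      schema: dbt_dev
      threads: 4
```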
After connecting to the database, you are asked which git system you want to use, enabling one of dbt's strengths: version control.
After these first steps, you are presented with a folder structure that says little at first glance:
Yet this directory structure is how dbt works: it is essentially nothing more than a collection of .sql and .yml files. The .sql files contain the select statements you want to build the views or tables from; the .yml files let you configure and steer dbt further.
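For reference, a freshly initialized project looks roughly like this (the exact folders vary by dbt version):

```
my_project/
├── dbt_project.yml    # project configuration
├── models/            # .sql select statements and .yml configuration
├── macros/            # reusable SQL snippets
├── seeds/             # small CSV files to load as tables
├── snapshots/         # slowly-changing-dimension captures
├── tests/             # custom data tests
└── analyses/          # ad-hoc queries that are compiled but not built
```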
Benefits of using dbt
Version control: git
We've already hinted at it, but one of the benefits of dbt is its git integration. This allows you to version the code behind your data warehouse and to branch and merge it easily. Where extensive documentation and backups used to be essential, git keeps a clear record of every change and lets you fall back to an older version at any time.
Being sure your data is accurate and up to date is always valuable, but rarely a given. To help with this, dbt lets you test your data before it is processed further in the database. This includes not only the built-in tests, such as uniqueness constraints or checking whether a column contains null values, but also tests you write yourself: are all values in a column within an expected range? This allows the analytics engineer to catch incorrect assumptions or faulty data before the business suffers from them.
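Tests are declared in a .yml file next to the models. The sketch below uses dbt's built-in `unique` and `not_null` tests; `within_expected_range` is a hypothetical custom test you would define yourself, and the model and column names are illustrative.

```yaml
# models/schema.yml — data tests, run with `dbt test`
version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique        # built-in: no duplicate keys
          - not_null      # built-in: no missing keys
      - name: age
        tests:
          # hypothetical custom test checking a value range
          - within_expected_range:
              min_value: 0
              max_value: 120
```

Running `dbt test` executes each of these as a query and fails if any rows violate the assertion.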
In addition to testing for errors, you can also examine the "freshness" of the data: when was it last updated? This way, pipeline issues can be detected before the business starts working with outdated data.
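Freshness checks are configured on sources. In this sketch (source, table, and column names are assumptions), dbt warns when the newest row is older than 12 hours and errors after 24:

```yaml
# models/sources.yml — checked with `dbt source freshness`
version: 2

sources:
  - name: raw
    tables:
      - name: orders
        loaded_at_field: _loaded_at   # timestamp column set on each load
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
```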
dbt also lets you quickly understand the sources of your data and tables. The lineage overview shows at a glance how all the data is related and where the dependencies lie:
Using this lineage, dbt can also determine the order in which the various tables should be built. In the example above, there is little point in building dim_customers before stg_customers and fct_orders are complete.
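The lineage comes from the `ref()` function: referencing another model instead of hard-coding a table name is what lets dbt build the dependency graph. A sketch (the join and column names are made up; the model names match the example above):

```sql
-- models/marts/dim_customers.sql
-- ref() declares the dependency, so dbt knows to build
-- stg_customers and fct_orders before this model.
select
    c.customer_id,
    c.first_name,
    count(o.order_id) as number_of_orders
from {{ ref('stg_customers') }} c
left join {{ ref('fct_orders') }} o
    on o.customer_id = c.customer_id
group by 1, 2
```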
This all happens automatically and does not have to be set up manually by the analytics engineer.
Everyone knows how important accurate and complete documentation is, but no one ever feels like writing it (or has the time). dbt therefore ships with a comprehensive documentation feature that describes all tables and views in detail:
This documentation provides a quick and clear overview of each table and view. It includes:
- database details
- descriptions provided by the engineer
- column information, including data type and the tests applied
- dependencies, both upstream and downstream
- the code behind the table or view
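The engineer-provided descriptions come from the same .yml files as the tests. A minimal sketch (model and column names are illustrative):

```yaml
# models/schema.yml — descriptions feed the generated docs site
version: 2

models:
  - name: dim_customers
    description: "One row per customer, enriched with order counts."
    columns:
      - name: customer_id
        description: "Primary key of the customer."
```

Running `dbt docs generate` followed by `dbt docs serve` then renders these descriptions, together with the lineage graph and column details, as a browsable website.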
All these extra features give you more control over your database and your process. Changes can be made quickly and safely, while you keep a clear eye on the quality of your data. At The Information Lab, we are big fans!