Intermediatedbtdatabricksanalyticssqldata-engineeringetldata-transformation

dbt: Data Build Tool for Analytics Engineering

Transform raw data into analytics-ready datasets using dbt's modular SQL-based workflow. Build, test, and document your data transformations with version control and CI/CD.

Step 1
Install Python and pip
dbt requires Python 3.8 or higher. Verify your installation before proceeding.
```
python --version
pip --version
```
Step 2
Install dbt-core and adapter
Install dbt-core along with the appropriate database adapter. For Databricks, use dbt-databricks. Other popular adapters include dbt-postgres, dbt-snowflake, dbt-bigquery, and dbt-redshift.
```
# For Databricks
pip install dbt-core dbt-databricks

# For PostgreSQL
pip install dbt-core dbt-postgres

# For Snowflake
pip install dbt-core dbt-snowflake

# For BigQuery
pip install dbt-core dbt-bigquery
```
Step 3
Verify dbt installation
Check that dbt is installed correctly and view the version.
```
dbt --version
```
Step 4
Initialize a new dbt project
Create a new dbt project with a starter directory structure. This creates models/, tests/, macros/, and configuration files.
```
dbt init my_analytics_project
cd my_analytics_project
```

Step 5

Configure database connection (Databricks)

Edit profiles.yml to configure your Databricks connection. This file is typically located at ~/.dbt/profiles.yml. You'll need your Databricks workspace URL, HTTP path, and personal access token.

my_analytics_project:
  target: dev
  outputs:
    dev:
      type: databricks
      host: <your-workspace>.cloud.databricks.com
      http_path: /sql/1.0/warehouses/<warehouse-id>
      token: <your-personal-access-token>
      schema: analytics_dev
      threads: 4
    prod:
      type: databricks
      host: <your-workspace>.cloud.databricks.com
      http_path: /sql/1.0/warehouses/<warehouse-id>
      token: <your-personal-access-token>
      schema: analytics_prod
      threads: 8

⚠ Heads up: Never commit your profiles.yml file with credentials. Use environment variables in CI/CD.

Step 6

Configure database connection (PostgreSQL)

Example configuration for PostgreSQL. Adjust the profiles.yml for your database type.

my_analytics_project:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: <your-username>
      password: <your-password>
      dbname: analytics
      schema: analytics_dev
      threads: 4

Step 7
Test your connection
Verify that dbt can connect to your database successfully.
```
dbt debug
```

Step 8

Understand the project structure

A dbt project contains models (SQL transformations), tests, macros (reusable SQL), and configuration files. The dbt_project.yml file configures project-level settings.

my_analytics_project/
├── dbt_project.yml       # Project configuration
├── models/               # SQL transformation files
│   ├── staging/          # Raw data cleaning
│   ├── intermediate/     # Business logic
│   └── marts/            # Final analytics tables
├── tests/                # Custom data tests
├── macros/               # Reusable SQL snippets
├── seeds/                # CSV files to load
├── snapshots/            # SCD Type 2 tracking
└── analyses/             # Ad-hoc queries

Step 9

Create your first model

Models are SELECT statements saved as .sql files. dbt will materialize them as views or tables in your database. Create a simple staging model.

-- models/staging/stg_customers.sql
with source as (
    select * from raw.customers
),

cleaned as (
    select
        id as customer_id,
        lower(email) as email,
        first_name,
        last_name,
        created_at
    from source
    where email is not null
)

select * from cleaned

Step 10

Configure model materialization

Use the config block to specify how dbt should materialize your model. Options include view (default), table, incremental, or ephemeral.

-- models/marts/fct_orders.sql
{{ config(
    materialized='table',
    tags=['daily']
) }}

select
    order_id,
    customer_id,
    order_date,
    total_amount,
    status
from {{ ref('stg_orders') }}
where status != 'cancelled'

Step 11

Run your models

Execute all models in your project. dbt will build them in dependency order based on ref() calls.

# Run all models
dbt run

# Run a specific model
dbt run --select stg_customers

# Run models matching a tag
dbt run --select tag:daily

# Run models in a directory
dbt run --select models/staging

Step 12

Use the ref() function for dependencies

The ref() function references other models and creates a dependency graph. This ensures models run in the correct order.

-- models/marts/customer_orders.sql
select
    c.customer_id,
    c.email,
    count(o.order_id) as total_orders,
    sum(o.total_amount) as lifetime_value
from {{ ref('stg_customers') }} c
left join {{ ref('stg_orders') }} o
    on c.customer_id = o.customer_id
group by 1, 2

Step 13

Add schema tests

Tests validate your data quality. Built-in tests include unique, not_null, accepted_values, and relationships. Define tests in a schema.yml file.

# models/staging/schema.yml
version: 2

models:
  - name: stg_customers
    description: "Cleaned customer records from raw data"
    columns:
      - name: customer_id
        description: "Primary key"
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - not_null
          - unique
      - name: created_at
        tests:
          - not_null

Step 14

Run tests

Execute all configured tests to validate data quality.

# Run all tests
dbt test

# Test a specific model
dbt test --select stg_customers

# Run models and tests together
dbt build

Step 15

Create incremental models

Incremental models only process new or updated records, improving performance for large datasets. Use the is_incremental() macro to add a where clause on subsequent runs.

-- models/marts/fct_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    user_id,
    event_type,
    event_timestamp
from {{ ref('stg_events') }}

{% if is_incremental() %}
    where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}

Step 16
Add documentation
Document your models, columns, and tests using descriptions in schema.yml files. Generate a documentation website with dbt docs.
```
# Generate documentation
dbt docs generate

# Serve documentation locally
dbt docs serve
```

Step 17

Create reusable macros

Macros are Jinja functions that generate SQL. They promote DRY principles and can accept arguments.

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)::decimal(10,2)
{% endmacro %}

-- Usage in a model:
select
    order_id,
    {{ cents_to_dollars('amount_cents') }} as amount_dollars
from {{ ref('stg_orders') }}

Step 18
Load seed data
Seeds are CSV files in the seeds/ directory. Use them for small reference datasets like country codes or status mappings.
```
# Load all CSV files from seeds/
dbt seed

# Reference in models using ref()
select * from {{ ref('country_codes') }}
```

Step 19

Create snapshots for SCD Type 2

Snapshots track historical changes using slowly changing dimension (SCD) Type 2 logic. They add dbt_valid_from and dbt_valid_to timestamps.

-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='customer_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('raw', 'customers') }}

{% endsnapshot %}

Step 20
Run snapshots
Execute snapshots to capture the current state and track changes over time.
```
dbt snapshot
```

Step 21

Use sources for raw data

Define sources in schema.yml to reference raw tables and add freshness checks. Use source() instead of hardcoding table names.

# models/staging/sources.yml
version: 2

sources:
  - name: raw
    database: production
    schema: raw_data
    tables:
      - name: customers
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
        loaded_at_field: _loaded_at
      - name: orders

Step 22
Check source freshness
Verify that your source data is up-to-date based on the freshness rules defined in sources.yml.
```
dbt source freshness
```

Step 23

Set up dbt Cloud (optional)

dbt Cloud provides a hosted IDE, scheduler, and job orchestration. Sign up at getdbt.com and connect your repository.

1. Create account at cloud.getdbt.com
2. Connect your git repository
3. Configure database credentials
4. Schedule jobs for production runs
5. Set up CI checks on pull requests

Step 24

Set up CI/CD (GitHub Actions example)

Automate dbt runs and tests in your CI/CD pipeline. This example runs tests on pull requests.

# .github/workflows/dbt-ci.yml
name: dbt CI

on:
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dbt
        run: pip install dbt-databricks
      - name: Run dbt tests
        env:
          DBT_DATABRICKS_TOKEN: ${{ secrets.DBT_DATABRICKS_TOKEN }}
          DBT_DATABRICKS_HOST: ${{ secrets.DBT_DATABRICKS_HOST }}
        run: |
          dbt deps
          dbt build --target ci

Step 25
Install dbt packages
Extend dbt with community packages for common patterns. Packages are defined in packages.yml and installed with dbt deps.
```
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.10.0
```
Step 26
Install packages
Download and install dbt packages defined in packages.yml.
```
dbt deps
```

Step 27

Common dbt commands reference

Quick reference of frequently used dbt commands for daily workflows.

dbt run              # Run all models
dbt test             # Run all tests
dbt build            # Run models + tests
dbt compile          # Compile SQL without running
dbt clean            # Delete target/ and dbt_packages/
dbt debug            # Test database connection
dbt deps             # Install packages
dbt seed             # Load CSV seeds
dbt snapshot         # Run snapshots
dbt source freshness # Check source freshness
dbt docs generate    # Generate docs
dbt docs serve       # Serve docs locally

# Selection syntax
dbt run --select model_name
dbt run --select models/staging
dbt run --select tag:daily
dbt run --select +model_name  # model + upstream
dbt run --select model_name+  # model + downstream

dbt: Data Build Tool for Analytics Engineering

Install Python and pip

Install dbt-core and adapter

Verify dbt installation

Initialize a new dbt project

Configure database connection (Databricks)

Configure database connection (PostgreSQL)

Test your connection

Understand the project structure

Create your first model

Configure model materialization

Run your models

Use the ref() function for dependencies

Add schema tests

Run tests

Create incremental models

Add documentation

Create reusable macros

Load seed data

Create snapshots for SCD Type 2

Run snapshots

Use sources for raw data

Check source freshness

Set up dbt Cloud (optional)

Set up CI/CD (GitHub Actions example)

Install dbt packages

Install packages

Common dbt commands reference

Feature requests

Discussion