TechSetupGuides
Intermediatedbtdatabricksanalyticssqldata-engineeringetldata-transformation

dbt: Data Build Tool for Analytics Engineering

Transform raw data into analytics-ready datasets using dbt's modular SQL-based workflow. Build, test, and document your data transformations with version control and CI/CD.

  1. Step 1

    Install Python and pip

    dbt requires Python 3.8 or higher. Verify your installation before proceeding.

    python --version
    pip --version
  2. Step 2

    Install dbt-core and adapter

    Install dbt-core along with the appropriate database adapter. For Databricks, use dbt-databricks. Other popular adapters include dbt-postgres, dbt-snowflake, dbt-bigquery, and dbt-redshift.

    # For Databricks
    pip install dbt-core dbt-databricks
    
    # For PostgreSQL
    pip install dbt-core dbt-postgres
    
    # For Snowflake
    pip install dbt-core dbt-snowflake
    
    # For BigQuery
    pip install dbt-core dbt-bigquery
  3. Step 3

    Verify dbt installation

    Check that dbt is installed correctly and view the version.

    dbt --version
  4. Step 4

    Initialize a new dbt project

    Create a new dbt project with a starter directory structure. This creates models/, tests/, macros/, and configuration files.

    dbt init my_analytics_project
    cd my_analytics_project
  5. Step 5

    Configure database connection (Databricks)

    Edit profiles.yml to configure your Databricks connection. This file is typically located at ~/.dbt/profiles.yml. You'll need your Databricks workspace URL, HTTP path, and personal access token.

    my_analytics_project:
      target: dev
      outputs:
        dev:
          type: databricks
          host: <your-workspace>.cloud.databricks.com
          http_path: /sql/1.0/warehouses/<warehouse-id>
          token: <your-personal-access-token>
          schema: analytics_dev
          threads: 4
        prod:
          type: databricks
          host: <your-workspace>.cloud.databricks.com
          http_path: /sql/1.0/warehouses/<warehouse-id>
          token: <your-personal-access-token>
          schema: analytics_prod
          threads: 8
    ⚠ Heads up: Never commit your profiles.yml file with credentials. Use environment variables in CI/CD.
  6. Step 6

    Configure database connection (PostgreSQL)

    Example configuration for PostgreSQL. Adjust the profiles.yml for your database type.

    my_analytics_project:
      target: dev
      outputs:
        dev:
          type: postgres
          host: localhost
          port: 5432
          user: <your-username>
          password: <your-password>
          dbname: analytics
          schema: analytics_dev
          threads: 4
  7. Step 7

    Test your connection

    Verify that dbt can connect to your database successfully.

    dbt debug
  8. Step 8

    Understand the project structure

    A dbt project contains models (SQL transformations), tests, macros (reusable SQL), and configuration files. The dbt_project.yml file configures project-level settings.

    my_analytics_project/
    ├── dbt_project.yml       # Project configuration
    ├── models/               # SQL transformation files
    │   ├── staging/          # Raw data cleaning
    │   ├── intermediate/     # Business logic
    │   └── marts/            # Final analytics tables
    ├── tests/                # Custom data tests
    ├── macros/               # Reusable SQL snippets
    ├── seeds/                # CSV files to load
    ├── snapshots/            # SCD Type 2 tracking
    └── analyses/             # Ad-hoc queries
  9. Step 9

    Create your first model

    Models are SELECT statements saved as .sql files. dbt will materialize them as views or tables in your database. Create a simple staging model.

    -- models/staging/stg_customers.sql
    with source as (
        select * from raw.customers
    ),
    
    cleaned as (
        select
            id as customer_id,
            lower(email) as email,
            first_name,
            last_name,
            created_at
        from source
        where email is not null
    )
    
    select * from cleaned
  10. Step 10

    Configure model materialization

    Use the config block to specify how dbt should materialize your model. Options include view (default), table, incremental, or ephemeral.

    -- models/marts/fct_orders.sql
    {{ config(
        materialized='table',
        tags=['daily']
    ) }}
    
    select
        order_id,
        customer_id,
        order_date,
        total_amount,
        status
    from {{ ref('stg_orders') }}
    where status != 'cancelled'
  11. Step 11

    Run your models

    Execute all models in your project. dbt will build them in dependency order based on ref() calls.

    # Run all models
    dbt run
    
    # Run a specific model
    dbt run --select stg_customers
    
    # Run models matching a tag
    dbt run --select tag:daily
    
    # Run models in a directory
    dbt run --select models/staging
  12. Step 12

    Use the ref() function for dependencies

    The ref() function references other models and creates a dependency graph. This ensures models run in the correct order.

    -- models/marts/customer_orders.sql
    select
        c.customer_id,
        c.email,
        count(o.order_id) as total_orders,
        sum(o.total_amount) as lifetime_value
    from {{ ref('stg_customers') }} c
    left join {{ ref('stg_orders') }} o
        on c.customer_id = o.customer_id
    group by 1, 2
  13. Step 13

    Add schema tests

    Tests validate your data quality. Built-in tests include unique, not_null, accepted_values, and relationships. Define tests in a schema.yml file.

    # models/staging/schema.yml
    version: 2
    
    models:
      - name: stg_customers
        description: "Cleaned customer records from raw data"
        columns:
          - name: customer_id
            description: "Primary key"
            tests:
              - unique
              - not_null
          - name: email
            tests:
              - not_null
              - unique
          - name: created_at
            tests:
              - not_null
  14. Step 14

    Run tests

    Execute all configured tests to validate data quality.

    # Run all tests
    dbt test
    
    # Test a specific model
    dbt test --select stg_customers
    
    # Run models and tests together
    dbt build
  15. Step 15

    Create incremental models

    Incremental models only process new or updated records, improving performance for large datasets. Use the is_incremental() macro to add a where clause on subsequent runs.

    -- models/marts/fct_events.sql
    {{ config(
        materialized='incremental',
        unique_key='event_id'
    ) }}
    
    select
        event_id,
        user_id,
        event_type,
        event_timestamp
    from {{ ref('stg_events') }}
    
    {% if is_incremental() %}
        where event_timestamp > (select max(event_timestamp) from {{ this }})
    {% endif %}
  16. Step 16

    Add documentation

    Document your models, columns, and tests using descriptions in schema.yml files. Generate a documentation website with dbt docs.

    # Generate documentation
    dbt docs generate
    
    # Serve documentation locally
    dbt docs serve
  17. Step 17

    Create reusable macros

    Macros are Jinja functions that generate SQL. They promote DRY principles and can accept arguments.

    -- macros/cents_to_dollars.sql
    {% macro cents_to_dollars(column_name) %}
        ({{ column_name }} / 100.0)::decimal(10,2)
    {% endmacro %}
    
    -- Usage in a model:
    select
        order_id,
        {{ cents_to_dollars('amount_cents') }} as amount_dollars
    from {{ ref('stg_orders') }}
  18. Step 18

    Load seed data

    Seeds are CSV files in the seeds/ directory. Use them for small reference datasets like country codes or status mappings.

    # Load all CSV files from seeds/
    dbt seed
    
    # Reference in models using ref()
    select * from {{ ref('country_codes') }}
  19. Step 19

    Create snapshots for SCD Type 2

    Snapshots track historical changes using slowly changing dimension (SCD) Type 2 logic. They add dbt_valid_from and dbt_valid_to timestamps.

    -- snapshots/customers_snapshot.sql
    {% snapshot customers_snapshot %}
    
    {{
        config(
          target_schema='snapshots',
          unique_key='customer_id',
          strategy='timestamp',
          updated_at='updated_at'
        )
    }}
    
    select * from {{ source('raw', 'customers') }}
    
    {% endsnapshot %}
  20. Step 20

    Run snapshots

    Execute snapshots to capture the current state and track changes over time.

    dbt snapshot
  21. Step 21

    Use sources for raw data

    Define sources in schema.yml to reference raw tables and add freshness checks. Use source() instead of hardcoding table names.

    # models/staging/sources.yml
    version: 2
    
    sources:
      - name: raw
        database: production
        schema: raw_data
        tables:
          - name: customers
            freshness:
              warn_after: {count: 12, period: hour}
              error_after: {count: 24, period: hour}
            loaded_at_field: _loaded_at
          - name: orders
  22. Step 22

    Check source freshness

    Verify that your source data is up-to-date based on the freshness rules defined in sources.yml.

    dbt source freshness
  23. Step 23

    Set up dbt Cloud (optional)

    dbt Cloud provides a hosted IDE, scheduler, and job orchestration. Sign up at getdbt.com and connect your repository.

    1. Create account at cloud.getdbt.com
    2. Connect your git repository
    3. Configure database credentials
    4. Schedule jobs for production runs
    5. Set up CI checks on pull requests
  24. Step 24

    Set up CI/CD (GitHub Actions example)

    Automate dbt runs and tests in your CI/CD pipeline. This example runs tests on pull requests.

    # .github/workflows/dbt-ci.yml
    name: dbt CI
    
    on:
      pull_request:
        branches: [main]
    
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - uses: actions/setup-python@v4
            with:
              python-version: '3.11'
          - name: Install dbt
            run: pip install dbt-databricks
          - name: Run dbt tests
            env:
              DBT_DATABRICKS_TOKEN: ${{ secrets.DBT_DATABRICKS_TOKEN }}
              DBT_DATABRICKS_HOST: ${{ secrets.DBT_DATABRICKS_HOST }}
            run: |
              dbt deps
              dbt build --target ci
  25. Step 25

    Install dbt packages

    Extend dbt with community packages for common patterns. Packages are defined in packages.yml and installed with dbt deps.

    # packages.yml
    packages:
      - package: dbt-labs/dbt_utils
        version: 1.1.1
      - package: calogica/dbt_expectations
        version: 0.10.0
  26. Step 26

    Install packages

    Download and install dbt packages defined in packages.yml.

    dbt deps
  27. Step 27

    Common dbt commands reference

    Quick reference of frequently used dbt commands for daily workflows.

    dbt run              # Run all models
    dbt test             # Run all tests
    dbt build            # Run models + tests
    dbt compile          # Compile SQL without running
    dbt clean            # Delete target/ and dbt_packages/
    dbt debug            # Test database connection
    dbt deps             # Install packages
    dbt seed             # Load CSV seeds
    dbt snapshot         # Run snapshots
    dbt source freshness # Check source freshness
    dbt docs generate    # Generate docs
    dbt docs serve       # Serve docs locally
    
    # Selection syntax
    dbt run --select model_name
    dbt run --select models/staging
    dbt run --select tag:daily
    dbt run --select +model_name  # model + upstream
    dbt run --select model_name+  # model + downstream

Feature requests

Sign in to suggest features or vote on existing ones.

No feature requests yet.

Discussion

0 people marked this as worked·Sign in to mark your own.

Sign in to join the discussion.

No comments yet.