dbt generates documentation automatically from your data models. Problem is, nobody reads it because it lives in JSON files buried in your project directory.
Solution: Push it to Notion where people actually look at documentation. Here's how I automated the entire pipeline using R and the Notion API.
The Problem
dbt creates manifest.json and catalog.json files with complete documentation -- model descriptions, column metadata, SQL code, dependencies. Everything you need. But JSON files don't make for accessible documentation.
Teams want documentation in Notion where they already work. Manually copying hundreds of tables and columns isn't happening. So I built an ETL pipeline to do it automatically.
Setup: Notion Integration
First, create a Notion integration at notion.so/profile/integrations. Set it as internal, give it read/write permissions for content. Copy the integration token -- you'll need it.
Create a Notion page where the documentation will live, and connect your integration to it through the page settings. The page's ID -- the long string at the end of its URL -- is what the script uses as the database parent, so grab that too.
Store both in environment variables:
SECRET_NOTION=your_integration_token
DATABASE_PARENT_ID=your_page_id
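In R, I read both back with Sys.getenv at the top of the script, alongside the packages the snippets below lean on (assuming the variables live in .Renviron or are exported before the script runs):
library(tidyverse)  # pipes, purrr, tibble, stringr
library(jsonlite)
library(httr)
library(glue)
library(cli)

NOTION_TOKEN       <- Sys.getenv("SECRET_NOTION")
DATABASE_PARENT_ID <- Sys.getenv("DATABASE_PARENT_ID")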
Loading dbt Documentation
dbt outputs two files after running dbt docs generate:
manifest <- jsonlite::read_json("../target/manifest.json")
catalog <- jsonlite::read_json("../target/catalog.json")
Extract only models with descriptions:
manifest_nodes <- manifest$nodes %>%
  keep(~ .x$description != "")

models <- manifest_nodes %>%
  keep(~ .x$resource_type == "model") %>%
  enframe(name = "node_name", value = "data")
This filters out tests, seeds, and anything without documentation. No point uploading empty records.
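One thing these snippets don't show is where column data types come from. manifest.json often has data_type set to null; the warehouse types live in catalog.json, which is presumably why it gets loaded. If you want to pull types from there instead, a lookup could look like this (get_column_types is my name for it, not from the original pipeline):
get_column_types <- function(node_name) {
  # catalog.json stores the warehouse type for each column under "type"
  catalog$nodes[[node_name]]$columns %>%
    map_df(~tibble(name = .x$name, data_type = .x$type))
}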
Notion API Wrapper
The Notion API requires authenticated requests with specific headers. Wrapped it in a helper function:
make_request <- function(endpoint, method = "GET", body = NULL) {
  Sys.sleep(0.34)  # rate limit: Notion allows roughly 3 requests per second

  url <- glue("https://api.notion.com/v1/{endpoint}")
  headers <- c(
    "Authorization"  = paste("Bearer", NOTION_TOKEN),
    "Content-Type"   = "application/json",
    "Notion-Version" = "2022-02-22"
  )

  response <- VERB(
    verb = method,
    url  = url,
    # body is NULL, a pre-built JSON string, or a list to serialise
    body = if (is.character(body)) body
           else if (!is.null(body)) toJSON(body, auto_unbox = TRUE),
    add_headers(.headers = headers)
  )

  content(response, "parsed")
}
The Sys.sleep(0.34) is critical -- Notion limits you to an average of 3 requests per second. Without the delay, the script starts failing partway through.
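A quick way to confirm the token and headers are right (not part of the original script, just a sanity check) is the users/me endpoint, which returns the integration's bot user:
me <- make_request("users/me")
me$name  # should print your integration's name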
Creating Notion Records
Each dbt model becomes a Notion page with:
- Title (model name)
- Description
- Table of columns with data types
- Full SQL code block
Building the JSON structure for Notion's API was the annoying part. Properties need specific formatting:
make_properties_json <- function(data) {
  glue('
    "properties": {{
      "Name": {{
        "title": [{{ "text": {{ "content": "{data$name}" }} }}]
      }},
      "Description": {{
        "rich_text": [{{ "text": {{ "content": "{data$description}" }} }}]
      }}
    }}
  ')
}
Columns get formatted as a table:
make_table_children <- function(data) {
  df_columns <- data$columns %>%
    map_df(~tibble(
      name        = .x$name,
      data_type   = .x$data_type %||% "",     # data_type can be null in manifest.json
      description = .x$description %||% ""
    ))

  df_columns %>%
    pmap(~make_3_cell_row(..1, ..2, ..3)) %>%
    paste(collapse = ",")
}
SQL code blocks need special handling for line breaks and escaping:
make_dbt_code_block <- function(data) {
  # escape the SQL for embedding in a JSON string:
  # backslashes first, then double quotes, then newlines
  code_str <- data$raw_code %>%
    str_replace_all(fixed("\\"), "\\\\") %>%
    str_replace_all(fixed('"'), '\\"') %>%
    str_replace_all(fixed("\n"), "\\n")

  glue('{{ "type": "code", "code": {{ "rich_text": [{{
    "text": {{ "content": "{code_str}" }}
  }}], "language": "sql" }} }}')
}
Avoiding Duplicates
Before creating a record, check if it already exists:
check_record_exists <- function(database_id, data) {
  query_json <- glue('{{
    "filter": {{
      "property": "Name",
      "title": {{ "equals": "{data$name}" }}
    }}
  }}')

  record_query_resp <- make_request(
    endpoint = paste0("databases/", database_id, "/query"),
    method   = "POST",
    body     = query_json
  )

  length(record_query_resp$results) > 0
}
Only create the record if it doesn't exist. Updates would require more logic -- for now, this prevents duplicates on reruns.
Running the Pipeline
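The loop below needs a DATABASE_ID, which the original doesn't show being set. If your target page isn't already a database, one option is to create one under the parent page via the databases endpoint -- the title and property names here are assumptions, but the properties must match what make_properties_json writes:
db_body <- glue('{{
  "parent": {{ "type": "page_id", "page_id": "{DATABASE_PARENT_ID}" }},
  "title": [{{ "type": "text", "text": {{ "content": "dbt models" }} }}],
  "properties": {{
    "Name": {{ "title": {{}} }},
    "Description": {{ "rich_text": {{}} }}
  }}
}}')

DATABASE_ID <- make_request("databases", method = "POST", body = db_body)$id
Create the database once and store its ID if you plan to rerun the script; otherwise every run spawns a fresh database. With that in place, loop over the models: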
for (i in seq_len(nrow(models))) {
  model_name <- models$node_name[i]
  data       <- models$data[[i]]

  if (!check_record_exists(DATABASE_ID, data)) {
    create_dbt_record(DATABASE_ID, data)
    cli_alert_success("Created: {data$name}")
  } else {
    cli_alert_info("Exists: {data$name}")
  }
}
Run this after dbt docs generate and the entire data catalog syncs to Notion automatically.
What Could Be Better
This works but has limitations:
- No update logic -- records are create-only
- No deletion of removed models
- Doesn't use dbt tags or meta fields
- Error handling is minimal
For a quick proof of concept, it does the job. A full production version would need proper state management and incremental updates.
Why This Matters
Documentation that nobody reads is worthless. Putting dbt docs where people actually work -- Notion, Confluence, wherever -- increases the chance someone uses it.
Automating this means the documentation stays current without manual updates. Run it as part of your CI/CD pipeline after dbt builds and you have a self-updating data catalog.
Code is available on request. Modify it for your setup.