Skip to main content

dagster-clickhouse integration reference

This page covers ClickHouse integration details beyond the Using ClickHouse with Dagster guide.

ClickHouse resource

The ClickhouseResource exposes a clickhouse_driver.Client from get_connection(). Use it to run DDL/DML and queries with full control over SQL.

Executing custom SQL

from dagster_clickhouse import ClickhouseResource

from dagster import asset

# This example executes a query against the `iris.iris_dataset` table created in the
# Using ClickHouse with Dagster guide.
iris_dataset = asset(name="iris_dataset")(lambda: None)


@asset(deps=[iris_dataset])
def small_petals(clickhouse: ClickhouseResource) -> None:
with clickhouse.get_connection() as client:
client.execute("DROP TABLE IF EXISTS iris.small_petals")
client.execute(
"CREATE TABLE iris.small_petals ENGINE = MergeTree() ORDER BY tuple() AS"
" SELECT * FROM iris.iris_dataset WHERE petal_length_cm < 2 AND petal_width_cm < 0.5"
)

This builds on the iris.iris_dataset table created in the resource tutorial.

ClickHouse I/O managers

The Pandas and Polars I/O managers create MergeTree tables when needed and use ALTER TABLE … DELETE (with mutations_sync = 1) or TRUNCATE when replacing partitions or full tables, consistent with ClickhouseDbClient.

Selecting specific columns in a downstream asset

You can restrict loaded columns by attaching columns metadata on the input asset, similar to other database I/O managers:

import pandas as pd

from dagster import AssetIn, asset


@asset(
ins={
"iris_sepal": AssetIn(
key="iris_dataset",
metadata={"columns": ["sepal_length_cm", "sepal_width_cm"]},
)
}
)
def sepal_data(iris_sepal: pd.DataFrame) -> pd.DataFrame:
iris_sepal["sepal_area_cm2"] = (
iris_sepal["sepal_length_cm"] * iris_sepal["sepal_width_cm"]
)
return iris_sepal

Dagster schema vs connection database

  • schema on the I/O manager / asset metadata: the ClickHouse database that contains the table (for example `analytics`.`revenue_daily`).
  • database on the I/O manager / resource: the default database on the client session (ClickHouse USE semantics). Table locations for assets still use the Dagster schema / asset metadata path.

Legacy @io_manager factories

The packages expose clickhouse_pandas_io_manager and clickhouse_polars_io_manager built with build_clickhouse_io_manager for setups that prefer the older factory style.