# Modeling ETL pipelines with assets
When modeling an ETL pipeline (or any multi-step process) in Dagster, you can represent it at different levels of granularity. There are three common approaches:
- Combining all phases into a single asset
- Splitting each phase into its own asset with explicit dependencies
- Producing multiple assets from a single operation with the @dg.multi_asset decorator
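The examples in this guide call three ETL helper functions that aren't shown. A minimal sketch of what they might look like, along with the shared import, follows; the function bodies are placeholder assumptions:

```python
import dagster as dg  # shared import assumed by all examples below


def extract_from_source() -> list[dict]:
    """Placeholder: pull raw records from a source system."""
    return [{"order_id": 1, "amount": 100}]


def load_to_staging(data: list[dict]) -> list[dict]:
    """Placeholder: write records to a staging area and return them."""
    return data


def transform_data(data: list[dict]) -> list[dict]:
    """Placeholder: apply business transformations."""
    return [{**row, "transformed": True} for row in data]
```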
## Combining all phases into a single asset
The simplest approach is to combine all phases—extract, load, and transform—into one asset. A single materialization runs one function that performs all phases. This works well for simple pipelines and quick prototypes, or when you don't need to retry individual phases or see each step in the UI. In the example below, customer_etl_single runs extract, load, and transform in one function.
```python
@dg.asset
def customer_etl_single(context: dg.AssetExecutionContext):
    """One asset: extract, load, and transform in a single function."""
    data = extract_from_source()
    loaded = load_to_staging(data)
    transformed = transform_data(loaded)
    return transformed
```
| Benefits | Downsides |
|---|---|
| Simple to define and reason about; data passes between phases in memory with no intermediate I/O. | Phases can't be retried or materialized individually, and the UI shows a single asset with no per-phase visibility. |
## Splitting each phase into its own asset
When you need visibility into each phase, the ability to retry steps independently, or downstream assets that depend on intermediate results, you can split extract, load, and transform into separate assets. Each phase becomes its own asset, and downstream assets depend on upstream ones through function parameters or deps. Data between phases typically flows through an I/O manager or external storage.
In the example below, orders_extract, orders_load, and orders_transform are three assets. Dagster infers the dependency chain from the function parameters: orders_load receives the output of orders_extract, and orders_transform receives the output of orders_load. Each phase appears in the asset graph and can be materialized or retried on its own.
```python
@dg.asset
def orders_extract(context: dg.AssetExecutionContext):
    """First asset: extract only."""
    return extract_from_source()


@dg.asset
def orders_load(context: dg.AssetExecutionContext, orders_extract):
    """Second asset: load from extract output (dependency inferred)."""
    return load_to_staging(orders_extract)


@dg.asset
def orders_transform(context: dg.AssetExecutionContext, orders_load):
    """Third asset: transform from load output."""
    return transform_data(orders_load)
```
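If a downstream step doesn't need the upstream value passed in memory (for example, when the load step writes to a warehouse table that the next step reads directly), you can declare the dependency with deps instead of a function parameter. A minimal sketch; build_report_from_staging is a hypothetical helper:

```python
@dg.asset(deps=["orders_load"])
def orders_report(context: dg.AssetExecutionContext):
    """Runs only after orders_load, without receiving its output in memory."""
    # Hypothetical helper that reads the staged data directly from storage.
    return build_report_from_staging()
```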
| Benefits | Downsides |
|---|---|
| Each phase appears in the asset graph with lineage and can be materialized or retried independently; downstream assets can depend on intermediate results. | More definitions to maintain, and data between phases must pass through an I/O manager or external storage rather than memory. |
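The handoff between these assets goes through an I/O manager. As one example, you could wire Dagster's built-in FilesystemIOManager into your Definitions so each asset's output is persisted to local disk between phases (the base_dir path is an assumption):

```python
defs = dg.Definitions(
    assets=[orders_extract, orders_load, orders_transform],
    resources={
        # Default I/O manager: persists each asset's output to local disk.
        "io_manager": dg.FilesystemIOManager(base_dir="/tmp/dagster_etl"),
    },
)
```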
## Producing multiple assets from a single operation
When operations are tightly coupled but you still want separate assets for downstream consumers, and you want to pass data between phases in memory without writing to storage in between, use the @dg.multi_asset decorator. A single function produces multiple assets in one run, and all outputs are produced atomically (all succeed or all fail).
In the example below, inventory_etl_multi_asset returns three values that become the assets inventory_raw, inventory_staged, and inventory_final. Data flows from extract to load to transform in memory, so there is no intermediate I/O. Downstream assets can depend on any of the three outputs. With can_subset=False, all three are always materialized together.
```python
@dg.multi_asset(
    outs={
        "raw": dg.AssetOut(key="inventory_raw"),
        "staged": dg.AssetOut(key="inventory_staged"),
        "final": dg.AssetOut(key="inventory_final"),
    },
    can_subset=False,  # All outputs produced together
)
def inventory_etl_multi_asset(context: dg.AssetExecutionContext):
    """One function produces three assets; data passed in memory."""
    raw_data = extract_from_source()
    staged_data = load_to_staging(raw_data)
    final_data = transform_data(staged_data)
    return raw_data, staged_data, final_data
```
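If you do want to let a run materialize only some of the outputs, you can set can_subset=True, mark each output as not required, and yield only the selected outputs. A minimal sketch of a subsettable variant of the asset above, under those assumptions:

```python
@dg.multi_asset(
    outs={
        "raw": dg.AssetOut(key="inventory_raw", is_required=False),
        "staged": dg.AssetOut(key="inventory_staged", is_required=False),
    },
    can_subset=True,  # Outputs may be materialized individually
)
def inventory_etl_subsettable(context: dg.AssetExecutionContext):
    raw_data = extract_from_source()
    # Yield only the outputs selected for this run.
    if "raw" in context.selected_output_names:
        yield dg.Output(raw_data, output_name="raw")
    if "staged" in context.selected_output_names:
        yield dg.Output(load_to_staging(raw_data), output_name="staged")
```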
| Benefits | Downsides |
|---|---|
| One run produces multiple assets for downstream consumers; data passes between phases in memory, and outputs are atomic (all succeed or all fail). | Phases can't be retried independently, and with can_subset=False the outputs can't be materialized selectively. |
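Downstream assets depend on the individual output keys, not on the multi-asset as a whole. For example, a hypothetical dashboard asset could depend only on inventory_final:

```python
@dg.asset(deps=["inventory_final"])
def inventory_dashboard(context: dg.AssetExecutionContext) -> None:
    """Hypothetical downstream asset that depends only on inventory_final."""
    context.log.info("Refreshing dashboard from inventory_final")
```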
## Comparing the approaches
| Approach | Assets created | Execution model | Independent retry | Selective execution | UI visibility | I/O between phases | Downstream dependencies | Complexity | Best for |
|---|---|---|---|---|---|---|---|---|---|
| Single asset | One | Single function | No | No | One asset | In-memory | One target | Low | Prototypes |
| Multiple separate assets | One per phase | Multiple functions with deps | Yes | Yes | Multiple assets with lineage | External storage | Multiple targets | Medium | Production, debugging |
| Multi-asset | Multiple | Single function, multiple outputs | No | No (unless subsettable) | Multiple assets (one operation) | In-memory | Multiple targets | Medium | Coupled ops, many consumers |
Whichever approach you choose, you can materialize assets on a schedule, on demand, or with declarative automation. To target which assets run in a job or schedule, use asset selection; for example, you can target assets by group, as in the sketch below.
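A minimal sketch of a scheduled job that targets every asset in a group; the group name "etl" and the cron cadence are assumptions:

```python
etl_job = dg.define_asset_job(
    name="etl_job",
    selection=dg.AssetSelection.groups("etl"),  # assumed group name
)

etl_schedule = dg.ScheduleDefinition(
    job=etl_job,
    cron_schedule="0 6 * * *",  # daily at 6:00 AM; assumed cadence
)
```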
You can use the following flowchart to decide which approach fits your pipeline:

[Flowchart: choosing between a single asset, separate assets per phase, and a multi-asset]
## Next steps
- Defining assets — decorators and basic patterns
- Passing data between assets — I/O managers and explicit storage
- Graph-backed assets — multiple ops behind a single asset or multi-asset
- Asset selection syntax — select assets by group, key, and more for jobs and schedules