Modeling ETL pipelines with assets

When modeling an ETL pipeline (or any multi-step process) in Dagster, you can represent it at different levels of granularity. There are three common approaches:

Combining all phases into a single asset

The simplest approach is to combine all phases—extract, load, and transform—into one asset. A single materialization runs one function that performs all phases. This works well for simple pipelines and quick prototypes, or when you don't need to retry individual phases or see each step in the UI. In the example below, customer_etl_single runs extract, load, and transform in one function.

src/<project_name>/assets.py
import dagster as dg


@dg.asset
def customer_etl_single(context: dg.AssetExecutionContext):
    """One asset: extract, load, and transform in a single function."""
    data = extract_from_source()
    loaded = load_to_staging(data)
    transformed = transform_data(loaded)
    return transformed
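The helper functions extract_from_source, load_to_staging, and transform_data are not defined in these snippets; they stand in for your own extract, load, and transform logic, and the snippets assume import dagster as dg at the top of the file. A minimal sketch of what the helpers might look like, using hypothetical in-memory placeholders:

def extract_from_source() -> list[dict]:
    # Hypothetical placeholder: pull raw records from an upstream system or API.
    return [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]


def load_to_staging(data: list[dict]) -> list[dict]:
    # Hypothetical placeholder: write raw records to a staging area, then return them.
    return data


def transform_data(data: list[dict]) -> list[dict]:
    # Hypothetical placeholder: apply business transformations to the staged records.
    return [{**row, "amount_doubled": row["amount"] * 2} for row in data]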

Benefits
  • Simple to understand and reason about
  • One materialization = one execution
  • Minimal orchestration overhead
  • Good for operations that always run together

Downsides
  • No visibility into individual phases in the UI
  • Cannot independently retry extract vs. transform
  • Cannot selectively run only one step
  • If extract succeeds but transform fails, you must re-run the entire pipeline

Splitting each phase into its own asset

When you need visibility into each phase, the ability to retry steps independently, or downstream assets that depend on intermediate results, you can split extract, load, and transform into separate assets. Each phase becomes its own asset; downstream assets depend on upstream ones through function parameters or deps. Data between phases typically flows through an I/O manager or external storage.

In the example below, orders_extract, orders_load, and orders_transform are three assets. Dagster infers the dependency chain from the function parameters: orders_load receives the output of orders_extract, and orders_transform receives the output of orders_load. Each phase appears in the asset graph and can be materialized or retried on its own.

src/<project_name>/assets.py
@dg.asset
def orders_extract(context: dg.AssetExecutionContext):
    """First asset: extract only."""
    return extract_from_source()


@dg.asset
def orders_load(context: dg.AssetExecutionContext, orders_extract):
    """Second asset: load from extract output (dependency inferred)."""
    return load_to_staging(orders_extract)


@dg.asset
def orders_transform(context: dg.AssetExecutionContext, orders_load):
    """Third asset: transform from load output."""
    return transform_data(orders_load)
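The snippet above passes data between assets through function parameters, and therefore through the I/O manager. When a downstream asset only needs the upstream step to have run (for example, because it reads the staged table directly), you can declare the dependency with deps instead. A minimal sketch, where read_from_staging is a hypothetical helper that re-reads whatever load_to_staging wrote:

@dg.asset(deps=[orders_load])
def orders_transform_via_deps(context: dg.AssetExecutionContext):
    """Depends on orders_load without receiving its output as a parameter."""
    staged = read_from_staging()  # hypothetical helper: read the staged data back from storage
    return transform_data(staged)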

Benefits
  • Full visibility into each phase in the UI
  • Can independently retry failed phases
  • Can selectively run only certain phases (e.g., just transform)
  • Intermediate results are visible in the Dagster UI
  • Better for debugging and monitoring
  • Can set different retry policies per phase

Downsides
  • More assets to manage
  • Explicit dependency management
  • Data between phases typically goes through storage (I/O manager or external storage), which may add I/O overhead (see the sketch after this list)
  • More complex job and dependency definitions
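When separate assets pass data through function parameters, Dagster hands each value from one step to the next via the configured I/O manager. A minimal sketch of pointing the default I/O manager at local filesystem storage for the three orders_* assets above (the base_dir path is an arbitrary example):

import dagster as dg

defs = dg.Definitions(
    assets=[orders_extract, orders_load, orders_transform],
    resources={
        # Persist each asset's output as a pickled file under this directory
        "io_manager": dg.FilesystemIOManager(base_dir="/tmp/dagster_storage"),
    },
)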

Producing multiple assets from a single operation

When operations are tightly coupled, but you still want separate assets for downstream consumers—and you want to pass data between phases in memory without writing to storage in between—use the @dg.multi_asset decorator. A single function produces multiple assets in one run; all outputs are produced atomically (all succeed or all fail).

In the example below, inventory_etl_multi_asset returns three values that become the assets inventory_raw, inventory_staged, and inventory_final. Data flows from extract to load to transform in memory, so there is no intermediate I/O. Downstream assets can depend on any of the three outputs. With can_subset=False, all three are always materialized together.

src/<project_name>/assets.py
@dg.multi_asset(
    outs={
        "raw": dg.AssetOut(key="inventory_raw"),
        "staged": dg.AssetOut(key="inventory_staged"),
        "final": dg.AssetOut(key="inventory_final"),
    },
    can_subset=False,  # All outputs produced together
)
def inventory_etl_multi_asset(context: dg.AssetExecutionContext):
    """One function produces three assets; data passed in memory."""
    raw_data = extract_from_source()
    staged_data = load_to_staging(raw_data)
    final_data = transform_data(staged_data)
    return raw_data, staged_data, final_data

Benefits
  • Atomic execution: all outputs succeed or all fail
  • Separate assets for downstream consumers to depend on
  • Can pass data between outputs in memory, without intermediate I/O
  • Lower I/O overhead than multiple separate assets
  • All assets visible separately in the Dagster UI
  • Best balance of simplicity and observability when you need both atomic execution and separate assets for consumers

Downsides
  • Cannot independently retry individual outputs
  • All outputs must succeed or all fail
  • Slightly more complex than a single asset
  • Cannot selectively materialize just one output (unless you use subsetting; see Graph-backed assets and the sketch after this list)
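If you do need to materialize individual outputs, one option is to set can_subset=True and yield only the outputs selected for the run. A minimal sketch of a subsettable variant of the asset above; the control flow here is an illustrative assumption, not the only way to structure it:

@dg.multi_asset(
    outs={
        "raw": dg.AssetOut(key="inventory_raw"),
        "staged": dg.AssetOut(key="inventory_staged"),
        "final": dg.AssetOut(key="inventory_final"),
    },
    can_subset=True,  # Allow materializing a subset of the outputs
)
def inventory_etl_subsettable(context: dg.AssetExecutionContext):
    """Yield only the outputs that were selected for this run."""
    selected = context.selected_output_names
    raw_data = extract_from_source()
    if "raw" in selected:
        yield dg.Output(raw_data, output_name="raw")
    if "staged" in selected or "final" in selected:
        staged_data = load_to_staging(raw_data)
        if "staged" in selected:
            yield dg.Output(staged_data, output_name="staged")
        if "final" in selected:
            yield dg.Output(transform_data(staged_data), output_name="final")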

Comparing the approaches

| Approach | Assets created | Execution model | Independent retry | Selective execution | UI visibility | I/O between phases | Downstream dependencies | Complexity | Best for |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Single asset | One | Single function | No | No | One asset | In-memory | One target | Low | Prototypes |
| Multiple separate assets | One per phase | Multiple functions with deps | Yes | Yes | Multiple assets with lineage | External storage | Multiple targets | Medium | Production, debugging |
| Multi-asset | Multiple | Single function, multiple outputs | No | No (unless subsettable) | Multiple assets (one operation) | In-memory | Multiple targets | Medium | Coupled ops, many consumers |
tip

Whichever approach you use, you can schedule materializations or run assets on demand with declarative automation. To target which assets run in a job or schedule, you can use asset selection. For example, you can target assets by group.
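As an illustration, here is a minimal sketch of a daily schedule that targets every asset in a hypothetical "etl" group. The group name and cron string are assumptions, and the assets would need group_name="etl" on their @dg.asset decorators for this selection to match them:

import dagster as dg

# Job that materializes every asset in the (assumed) "etl" asset group
etl_job = dg.define_asset_job(
    name="etl_job",
    selection=dg.AssetSelection.groups("etl"),
)

# Run that job every day at 06:00
etl_schedule = dg.ScheduleDefinition(job=etl_job, cron_schedule="0 6 * * *")

defs = dg.Definitions(
    assets=[orders_extract, orders_load, orders_transform],  # assumed to be defined with group_name="etl"
    jobs=[etl_job],
    schedules=[etl_schedule],
)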

You can use the following flowchart to decide which approach fits your pipeline:

[Flowchart: decision guide for choosing between a single asset, multiple separate assets, or a multi-asset]

Next steps