Modeling ETL pipelines with assets

When modeling an ETL pipeline (or any multi-step process) in Dagster, you can represent it at different levels of granularity. There are three common approaches:

Combining all phases into a single asset

The simplest approach is to combine all phases—extract, load, and transform—into one asset. A single materialization runs one function that performs all phases. This works well for simple pipelines and quick prototypes, or when you don't need to retry individual phases or see each step in the UI. In the example below, customer_etl_single runs extract, load, and transform in one function.

src/<project_name>/assets.py
import dagster as dg


@dg.asset
def customer_etl_single(context: dg.AssetExecutionContext):
    """One asset: extract, load, and transform in a single function."""
    data = extract_from_source()
    loaded = load_to_staging(data)
    transformed = transform_data(loaded)
    return transformed
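The helper functions extract_from_source, load_to_staging, and transform_data are not defined in these snippets; they stand in for your own extract, load, and transform logic, and the snippets assume import dagster as dg at the top of the file. A minimal sketch of what the helpers might look like, using hypothetical in-memory placeholders:

def extract_from_source() -> list[dict]:
    # Hypothetical placeholder: pull raw records from an upstream system or API.
    return [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]


def load_to_staging(data: list[dict]) -> list[dict]:
    # Hypothetical placeholder: write raw records to a staging area, then return them.
    return data


def transform_data(data: list[dict]) -> list[dict]:
    # Hypothetical placeholder: apply business transformations to the staged records.
    return [{**row, "amount_doubled": row["amount"] * 2} for row in data]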

Benefits
  • Simple to understand and reason about
  • One materialization = one execution
  • Minimal orchestration overhead
  • Good for operations that always run together

Downsides
  • No visibility into individual phases in the UI
  • Cannot independently retry extract vs. transform
  • Cannot selectively run only one step
  • If extract succeeds but transform fails, you must re-run the entire pipeline

Splitting each phase into its own asset

When you need visibility into each phase, the ability to retry steps independently, or downstream assets that depend on intermediate results, you can split extract, load, and transform into separate assets. Each phase becomes its own asset; downstream assets depend on upstream ones through function parameters or deps. Data between phases typically flows through an I/O manager or external storage.

In the example below, orders_extract, orders_load, and orders_transform are three assets. Dagster infers the dependency chain from the function parameters: orders_load receives the output of orders_extract, and orders_transform receives the output of orders_load. Each phase appears in the asset graph and can be materialized or retried on its own.

src/<project_name>/assets.py
@dg.asset
def orders_extract(context: dg.AssetExecutionContext):
    """First asset: extract only."""
    return extract_from_source()


@dg.asset
def orders_load(context: dg.AssetExecutionContext, orders_extract):
    """Second asset: load from extract output (dependency inferred)."""
    return load_to_staging(orders_extract)


@dg.asset
def orders_transform(context: dg.AssetExecutionContext, orders_load):
    """Third asset: transform from load output."""
    return transform_data(orders_load)
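The snippet above passes data between assets through function parameters, and therefore through the I/O manager. When a downstream asset only needs the upstream step to have run (for example, because it reads the staged table directly), you can declare the dependency with deps instead. A minimal sketch, where read_from_staging is a hypothetical helper that re-reads whatever load_to_staging wrote:

@dg.asset(deps=[orders_load])
def orders_transform_via_deps(context: dg.AssetExecutionContext):
    """Depends on orders_load without receiving its output as a parameter."""
    staged = read_from_staging()  # hypothetical helper: read the staged data back from storage
    return transform_data(staged)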

Benefits
  • Full visibility into each phase in the UI
  • Can independently retry failed phases
  • Can selectively run only certain phases (e.g., just transform)
  • Intermediate results are visible in the Dagster UI
  • Better for debugging and monitoring
  • Can set different retry policies per phase

Downsides
  • More assets to manage
  • Explicit dependency management
  • Data between phases typically goes through storage (I/O manager or external storage), which may add I/O overhead (see the sketch after this list)
  • More complex job and dependency definitions
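When separate assets pass data through function parameters, Dagster hands each value from one step to the next via the configured I/O manager. A minimal sketch of pointing the default I/O manager at local filesystem storage for the three orders_* assets above (the base_dir path is an arbitrary example):

import dagster as dg

defs = dg.Definitions(
    assets=[orders_extract, orders_load, orders_transform],
    resources={
        # Persist each asset's output as a pickled file under this directory
        "io_manager": dg.FilesystemIOManager(base_dir="/tmp/dagster_storage"),
    },
)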

Producing multiple assets from a single operation

When operations are tightly coupled, but you still want separate assets for downstream consumers—and you want to pass data between phases in memory without writing to storage in between—use the @dg.multi_asset decorator. A single function produces multiple assets in one run; all outputs are produced atomically (all succeed or all fail).

In the example below, inventory_etl_multi_asset returns three values that become the assets inventory_raw, inventory_staged, and inventory_final. Data flows from extract to load to transform in memory, so there is no intermediate I/O. Downstream assets can depend on any of the three outputs. With can_subset=False, all three are always materialized together.

src/<project_name>/assets.py
@dg.multi_asset(
    outs={
        "raw": dg.AssetOut(key="inventory_raw"),
        "staged": dg.AssetOut(key="inventory_staged"),
        "final": dg.AssetOut(key="inventory_final"),
    },
    can_subset=False,  # All outputs produced together
)
def inventory_etl_multi_asset(context: dg.AssetExecutionContext):
    """One function produces three assets; data passed in memory."""
    raw_data = extract_from_source()
    staged_data = load_to_staging(raw_data)
    final_data = transform_data(staged_data)
    return raw_data, staged_data, final_data

Benefits
  • Atomic execution: all outputs succeed or all fail
  • Separate assets for downstream consumers to depend on
  • Can pass data between outputs in memory, without intermediate I/O
  • Lower I/O overhead than multiple separate assets
  • All assets visible separately in the Dagster UI
  • Best balance of simplicity and observability when you need both atomic execution and separate assets for consumers

Downsides
  • Cannot independently retry individual outputs
  • All outputs must succeed or all fail
  • Slightly more complex than a single asset
  • Cannot selectively materialize just one output (unless you use subsetting; see Graph-backed assets and the sketch after this list)
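If you do need to materialize individual outputs, one option is to set can_subset=True and yield only the outputs selected for the run. A minimal sketch of a subsettable variant of the asset above; the control flow here is an illustrative assumption, not the only way to structure it:

@dg.multi_asset(
    outs={
        "raw": dg.AssetOut(key="inventory_raw"),
        "staged": dg.AssetOut(key="inventory_staged"),
        "final": dg.AssetOut(key="inventory_final"),
    },
    can_subset=True,  # Allow materializing a subset of the outputs
)
def inventory_etl_subsettable(context: dg.AssetExecutionContext):
    """Yield only the outputs that were selected for this run."""
    selected = context.selected_output_names
    raw_data = extract_from_source()
    if "raw" in selected:
        yield dg.Output(raw_data, output_name="raw")
    if "staged" in selected or "final" in selected:
        staged_data = load_to_staging(raw_data)
        if "staged" in selected:
            yield dg.Output(staged_data, output_name="staged")
        if "final" in selected:
            yield dg.Output(transform_data(staged_data), output_name="final")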

Comparing the approaches

| Approach | Assets created | Execution model | Independent retry | Selective execution | UI visibility | I/O between phases | Downstream dependencies | Complexity | Best for |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Single asset | One | Single function | No | No | One asset | In-memory | One target | Low | Prototypes |
| Multiple separate assets | One per phase | Multiple functions with deps | Yes | Yes | Multiple assets with lineage | External storage | Multiple targets | Medium | Production, debugging |
| Multi-asset | Multiple | Single function, multiple outputs | No | No (unless subsettable) | Multiple assets (one operation) | In-memory | Multiple targets | Medium | Coupled ops, many consumers |
tip

Whichever approach you use, you can schedule materializations or run assets on demand with declarative automation. To target which assets run in a job or schedule, you can use asset selection. For example, you can target assets by group.
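As an illustration, here is a minimal sketch of a daily schedule that targets every asset in a hypothetical "etl" group. The group name and cron string are assumptions, and the assets would need group_name="etl" on their @dg.asset decorators for this selection to match them:

import dagster as dg

# Job that materializes every asset in the (assumed) "etl" asset group
etl_job = dg.define_asset_job(
    name="etl_job",
    selection=dg.AssetSelection.groups("etl"),
)

# Run that job every day at 06:00
etl_schedule = dg.ScheduleDefinition(job=etl_job, cron_schedule="0 6 * * *")

defs = dg.Definitions(
    assets=[orders_extract, orders_load, orders_transform],  # assumed to be defined with group_name="etl"
    jobs=[etl_job],
    schedules=[etl_schedule],
)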

You can use the following flowchart to decide which approach fits your pipeline:

[Flowchart: decision guide for choosing between a single asset, multiple separate assets, or a multi-asset]

Next steps