Understanding how software-defined assets relate to ops and graphs#

Dagster’s main abstraction for building data pipelines is the software-defined asset. However, Dagster also has abstractions called ops and graphs.

If you're not sure which one to use, this guide is for you. In this guide, we'll cover:

  • When you should use assets or ops and graphs
  • How assets, ops, and graphs relate to each other

When should I use assets or ops and graphs?#

Dagster is mainly used to build data pipelines, and most data pipelines can be expressed in Dagster as sets of software-defined assets. If you’re a new Dagster user and your goal is to build a data pipeline, we recommend starting with software-defined assets and not worrying about ops or graphs. This is because most of the code you’ll be writing will directly relate to producing data assets.

However, there are some situations where you want to run code without thinking about data assets that the code is producing. In these cases, it’s appropriate to use ops and graphs. For example:

Situation 1: You’re not building a data pipeline#

You want to schedule a workflow where the goal is not to keep a set of data assets up-to-date. It might do something like:

  • Send emails to a set of users
  • Scan a data warehouse for tables that haven't been used in months and delete them
  • Record metadata about a set of data assets

In these cases, you should define your workflow in terms ops and graphs, not software-defined assets. The Intro to ops and jobs guide is a good place to start learning how to do this.

Additionally, note that a single Dagster deployment can contain software-defined assets and op/graph-based jobs side-by-side, which means that you’re not bound to one particular choice. If your workflow reads from software-defined assets, you can model that explicitly in Dagster, which is discussed in a a later section.

Situation 2: You want to break an asset into multiple steps#

If you're in a situation like the following:

  • An asset requires multiple steps
  • Some of the steps don't produce assets of their own
  • You need to be able to re-execute individual steps

In this case, you might want to use a graph-backed asset. This is discussed more later in this guide.

Situation 3: You’re anchored in task-based workflows#

Task-based workflows have been a popular way of defining data pipelines for a long time. While we believe that software-defined assets provide a superior way of writing and operating data pipelines, we acknowledge that teams often have existing codebases or mindsets that are heavily anchored in task-based workflows.

Op-based graphs resemble task-based workflows very closely, so they’re a natural choice for data pipelines that want to stick to that paradigm, either permanently or temporarily, while migrating to software-defined assets.


How do assets relate to ops and graphs?#

Next, we'll discuss how assets relate to ops and graphs. By the end of this section, you should understand how each type of software-defined asset relates to ops and graphs.

Software-defined assets#

A software-defined asset is a description of how to compute the contents of a particular data asset.

Under the hood, every software-defined asset contains an op (or graph of ops), which is the function that’s invoked to compute its contents. In most cases, the underlying op is invisible to the user.

Op and software-defined asset

Multi-assets#

When you use the @multi_asset decorator, you’re defining a single op that produces multiple assets:

Multi- software-defined asset

Graph-backed assets#

Dagster supports composing a set of ops into an op graph, usually by using the @graph decorator. A software-defined asset can be backed by an op graph, instead of an op.

Op graph and graph-backed asset

Graph-backed assets are useful when you want to execute multiple separate steps to compute an asset and some of those steps don’t produce assets of their own.

For example, to compute the contents of a table, you need to fetch data from an API and then perform a heavy data transformation on it. You don’t care about writing the fetched, pre-transformed data to any known location, but you want the fetching and transforming to happen in two separate steps that can run in different processes. If there’s a failure, you’d like to be able to re-execute the transformation step without re-executing the fetching step.

Refer to the Graph-backed asset documentation for code examples and details on how to define and use graph-backed assets.

Op graphs that read from an asset#

In some cases, you might want to build a job that doesn't produce any assets, but does read from at least one asset. Dagster facilitates this by allowing you to designate assets as inputs to ops within a graph or graph-based job:

Op graph with source asset

For example, you have a table that represents a list of emails that you want to send. A job reads data from the table and uses it to send the emails:

from dagster import asset, job, op


@asset
def emails_to_send():
    ...


@op
def send_emails(emails) -> None:
    ...


@job
def send_emails_job():
    send_emails(emails_to_send.to_source_asset())

In this case, the asset - specifically, the table the job reads from - is only used as a data source for the job. It’s not materialized when the graph is run.

The Graph documentation contains more details on how this works.