Dagster’s main abstraction for building data pipelines is the software-defined asset. However, Dagster also has abstractions called ops and graphs.
If you're not sure which one to use, this guide is for you. In this guide, we'll cover:
Dagster is mainly used to build data pipelines, and most data pipelines can be expressed in Dagster as sets of software-defined assets. If you’re a new Dagster user and your goal is to build a data pipeline, we recommend starting with software-defined assets and not worrying about ops or graphs. This is because most of the code you’ll be writing will directly relate to producing data assets.
You want to schedule a workflow where the goal is not to keep a set of data assets up-to-date. It might do something like:
In these cases, you should define your workflow in terms of ops and graphs, not software-defined assets. The Intro to ops and jobs guide is a good place to start learning how to do this.
Additionally, note that a single Dagster deployment can contain software-defined assets and op/graph-based jobs side by side, so you’re not bound to one particular choice. If your workflow reads from software-defined assets, you can model that explicitly in Dagster, as discussed in a later section.
If you're in a situation like the following:
Task-based workflows have been a popular way of defining data pipelines for a long time. While we believe that software-defined assets provide a superior way of writing and operating data pipelines, we acknowledge that teams often have existing codebases or mindsets that are heavily anchored in task-based workflows.
Op-based graphs resemble task-based workflows very closely, so they’re a natural choice for teams that want to keep their data pipelines in that paradigm, either permanently or temporarily while migrating to software-defined assets.
Next, we'll discuss how assets relate to ops and graphs. By the end of this section, you should understand how each type of software-defined asset relates to ops and graphs.
A software-defined asset is a description of how to compute the contents of a particular data asset.
Under the hood, every software-defined asset contains an op (or graph of ops), which is the function that’s invoked to compute its contents. In most cases, the underlying op is invisible to the user.
When you use the @multi_asset decorator, you’re defining a single op that produces multiple assets.
Dagster supports composing a set of ops into an op graph, usually by using the @graph decorator. A software-defined asset can be backed by an op graph instead of a single op.
Graph-backed assets are useful when you want to execute multiple separate steps to compute an asset and some of those steps don’t produce assets of their own.
For example, suppose that to compute the contents of a table, you need to fetch data from an API and then perform a heavy data transformation on it. You don’t care about writing the fetched, pre-transformed data to any known location, but you want the fetching and transforming to happen in two separate steps that can run in different processes. If there’s a failure, you’d like to be able to re-execute the transformation step without re-executing the fetching step.
Refer to the Graph-backed asset documentation for code examples and details on how to define and use graph-backed assets.
In some cases, you might want to build a job that doesn't produce any assets, but does read from at least one asset. Dagster facilitates this by allowing you to designate assets as inputs to ops within a graph or graph-based job.
For example, you have a table that represents a list of emails that you want to send. A job reads data from the table and uses it to send the emails:
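```python
from dagster import asset, job, op


# The table that holds the emails to send.
@asset
def emails_to_send(): ...


# Reads the emails and sends them; produces no asset.
@op
def send_emails(emails) -> None: ...


@job
def send_emails_job():
    # The asset is passed in as a source, so it is read but not materialized.
    send_emails(emails_to_send.to_source_asset())
```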
In this case, the asset (specifically, the table the job reads from) is used only as a data source for the job. It’s not materialized when the graph is run.
The Graph documentation contains more details on how this works.