Familiar with ops and graphs? Want to understand when, why, and how to use asset definitions in Dagster? If so, this guide is for you. We'll also demonstrate what some common Dagster jobs look like before and after using asset definitions.
Before we jump in, here's a quick refresher:
An asset is a persistent object in storage, such as a table, machine learning (ML) model, or file.
An op is the core unit of computation in Dagster. For example, an op might accept tabular data as its input and produce transformed tabular data as its output.
A graph is a directed acyclic graph of ops or other graphs, which execute in order and pass data to each other.
An asset definition is a declaration of an asset that should exist and a description of how to compute it: the op or graph that needs to run and the upstream assets that it should run on.
Asset definitions aren't a replacement for Dagster's core computational concepts. Ops are, in fact, the core unit of computation that occurs within an asset. Think of asset definitions as a top layer that links ops, graphs, and jobs to the long-lived objects they interact with.
Using asset definitions means building Dagster jobs in a way that declares ahead of time the assets they produce and consume. This is different from using the AssetMaterialization API, which only informs Dagster at runtime about the assets a job interacted with.
Assets help track and define cross-job dependencies. For example, when viewing a job that materializes assets, you can navigate to the jobs that produce the assets that it depends on. Additionally, when an upstream asset has been updated more recently than a downstream asset, Dagster will indicate that the downstream asset might be out of date.
Asset definitions provide sizeable improvements when it comes to code ergonomics:
You'll usually write less code. Specifying the inputs to an asset definition defines the assets it depends on. This means you don't need to use @graph and @job to wire dependencies between ops.
This approach also scales better, because it halves the number of times an asset's name appears in your codebase. Refer to the I/O manager-based example below to see this in action.
You no longer have to choose between easy dependency tracking and manageable organization. Without asset definitions, you're often forced to:
Contain everything in a single mega-job, which allows for easy dependency tracking but creates maintenance difficulties, OR
Split your pipeline into smaller jobs, which allows for easy maintenance but makes dependency tracking difficult
Because assets track their own dependencies, you can avoid interruptions in dependency graphs and eliminate the need for input managers.
Asset definitions are a good fit when:
You’re using Dagster to produce or maintain assets, AND
You know what those assets will be before you launch any runs.
Note that using asset definitions in one job doesn’t mean they need to be used in all your jobs. If your use case doesn't meet these criteria, you can still use graphs and ops.
Still not sure? Check out these examples to see what's a good fit and what isn't:
| Use case | Good fit? | Explanation |
| --- | --- | --- |
| Every day, drop and recreate the users table and the user_recommender_model model that depends on it | Yes | Assets are known before a run and are being updated. |
| Every hour, add a partition to the events table | Yes | Assets are known before a run and are being updated. |
| Clicking a button refreshes the recommender model | Yes | Assets are known before a run and are being updated. |
| Every day, send emails to a set of users | No | No assets are being updated. |
| Every day, read a file of user IDs and change the value of a particular attribute for each user | No | The set of assets to update is not known before running the job. |
| Every day, scan my warehouse for tables that haven't been used in months and delete them | No | The set of assets to update is not known before running the job. |
Let's say you've written jobs that you want to enrich using asset definitions. Assuming assets are known and being updated, what would upgrading look like?
Generally, every op output in a job that corresponds to a long-lived object in storage should have an asset definition. The following examples demonstrate some realistic Dagster jobs, both with and without asset definitions:
This isn't an exhaustive list! We're adding the ability to define jobs that materialize assets and then run arbitrary ops. Interested? We'd love to hear from you in Slack or a GitHub discussion.
Materialize two interdependent tables without an I/O manager
This example does the same thing as the previous example, except that the job performs I/O inside the ops instead of delegating it to I/O managers and input managers:
from pandas import read_sql

from dagster import Definitions, In, Nothing, job, op

from .mylib import create_db_connection, pickle_to_s3, train_recommender_model


@op
def build_users():
    # Read the raw table, drop incomplete rows, and write the result back.
    raw_users_df = read_sql("select * from raw_users", con=create_db_connection())
    users_df = raw_users_df.dropna()
    users_df.to_sql(name="users", con=create_db_connection())


# In(Nothing) orders this op after build_users without passing any data.
@op(ins={"users": In(Nothing)})
def build_user_recommender_model():
    users_df = read_sql("select * from users", con=create_db_connection())
    users_recommender_model = train_recommender_model(users_df)
    pickle_to_s3(users_recommender_model, key="users_recommender_model")


@job
def users_recommender_job():
    build_user_recommender_model(build_users())


defs = Definitions(
    jobs=[users_recommender_job],
)
Here's an example of an equivalent job that uses asset definitions:
This example demonstrates a job where some of the ops (extract_products and get_categories) don't produce assets of their own. Instead, they produce transient data that downstream ops will use to produce assets:
Note: Because some ops don't correspond to assets, this job uses @op and @graph APIs and AssetsDefinition.from_graph to wrap a graph in an asset definition:
How do asset definitions work with other Dagster concepts?
Still not sure how asset definitions fit into your current Dagster usage? In this section, we'll touch on how asset definitions work with some of Dagster's core concepts.
The Nothing Dagster type enables declaring that Dagster doesn't need to store or load the object corresponding to an op output or input
The deps argument when defining an asset enables specifying dependencies without relying on Dagster to store or load objects corresponding to that dependency