Asset versioning and caching
This feature is considered in a beta stage. It is still being tested and may change.
This guide demonstrates how to build memoizable graphs of assets. Memoizable assets help avoid unnecessary recomputations, speed up the developer workflow, and save computational resources.
Context
There's no reason to spend time materializing an asset if the result is going to be the same as the result of its last materialization.
Dagster's versioning system helps you determine ahead of time whether materializing an asset will produce a different result. It's based on the idea that the result of an asset materialization shouldn't change as long as:
- The code used is the same code as the last time the asset was materialized.
- The input data is the same input data as the last time the asset was materialized.
Dagster has two versioning concepts to represent the code and input data used for each materialization:
- Code version. A string that represents the version of the code that computes an asset. This is the code_version argument of @dg.asset.
- Data version. A string that represents the version of the data represented by the asset. This is represented as a DataVersion object.
By keeping track of code and data versions, Dagster can predict whether a materialization will change the underlying value. This allows Dagster to skip redundant materializations and instead return the previously computed value. In more technical terms, Dagster offers a limited form of memoization for assets: the last-computed asset value is always cached.
For computationally expensive pipelines, skipping redundant materializations can save significant time and compute resources.
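The limited memoization described above can be pictured as a single-entry cache. The sketch below is a plain-Python illustration, not Dagster's implementation: LastValueCache and its method are hypothetical names, and a tuple of the code version and input data versions stands in for the version tracking Dagster performs.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


class LastValueCache:
    """Hypothetical single-entry cache: remembers only the most recent result."""

    def __init__(self) -> None:
        self._key: Optional[Tuple[str, Tuple[str, ...]]] = None
        self._value: Any = None
        self.compute_count = 0  # how many times the compute function actually ran

    def get_or_compute(
        self,
        code_version: str,
        input_data_versions: Sequence[str],
        compute: Callable[[], Any],
    ) -> Any:
        # Same code and same inputs as last time: assume the result is unchanged.
        key = (code_version, tuple(input_data_versions))
        if key != self._key:
            self._value = compute()
            self._key = key
            self.compute_count += 1
        return self._value


cache = LastValueCache()
first = cache.get_or_compute("v1", [], lambda: 1)    # computed
second = cache.get_or_compute("v1", [], lambda: 1)   # reused, nothing recomputed
third = cache.get_or_compute("v2", [], lambda: 11)   # code version changed, recomputed
```

Because only the last key is kept, switching back to an earlier code version triggers a recomputation in this model.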
Step one: Understanding data versions
By default, Dagster automatically computes a data version for each materialization of an asset. It does this by hashing a code version together with the data versions of any input assets.
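As a rough illustration of the hashing described above, the sketch below derives a data version from a code version and the data versions of the inputs. The function name default_data_version, the use of SHA-256, and the sorting of inputs are assumptions made for this example; Dagster's internal hashing scheme may differ.

```python
import hashlib
from typing import List


def default_data_version(code_version: str, input_data_versions: List[str]) -> str:
    """Hypothetical helper: hash a code version with the input data versions."""
    hasher = hashlib.sha256()
    # Sort the inputs so the result does not depend on argument order
    # (an assumption of this sketch).
    for part in [code_version, *sorted(input_data_versions)]:
        hasher.update(part.encode("utf-8"))
        hasher.update(b"\x00")  # separator so "ab" + "c" differs from "a" + "bc"
    return hasher.hexdigest()


# An asset with no inputs: the data version depends on the code version only.
source_version = default_data_version("v1", [])
# A downstream asset: the data version also depends on the upstream data version.
downstream_version = default_data_version("v1", [source_version])
```

In this model, changing either the code version or any upstream data version yields a different hash, which is what lets a change propagate downstream.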
Let's start with a trivial asset that returns a hardcoded number:
from dagster import asset


@asset
def a_number():
    return 1
Next, start the Dagster UI:
dagster dev
Navigate to the Asset catalog and click Materialize to materialize the asset.
Next, look at the entry for the materialization under the "Events" tab in the Asset Catalog. Take note of the two hashes in the Tags section of the materialization details: code_version and data_version.
The code version shown is a copy of the run ID for the run that generated this materialization. Because a_number has no user-defined code_version, Dagster assumes a different code version on every run, which it represents with the run ID.
The data_version is also generated by Dagster. This is a hash of the code version together with the data versions of any inputs. Since a_number has no inputs, the data version in this case is a hash of the code version only.
If you materialize the asset again, you'll notice that both the code version and data version change. The code version becomes the ID of the new run and the data version becomes a hash of the new code version.
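The behavior above can be modeled in a few lines. In this sketch, effective_code_version and the other helpers are hypothetical (not Dagster APIs), uuid.uuid4() stands in for a Dagster run ID, and SHA-256 stands in for whatever hash Dagster uses internally.

```python
import hashlib
import uuid
from typing import Optional, Tuple


def effective_code_version(code_version: Optional[str], run_id: str) -> str:
    """Hypothetical: use the user-defined code version, or fall back to the run ID."""
    return code_version if code_version is not None else run_id


def data_version_without_inputs(code_version: str) -> str:
    """Hypothetical: with no inputs, the data version is a hash of the code version only."""
    return hashlib.sha256(code_version.encode("utf-8")).hexdigest()


def simulate_run(code_version: Optional[str]) -> Tuple[str, str]:
    """Return the (code version, data version) pair recorded for one simulated run."""
    run_id = str(uuid.uuid4())  # stand-in for a Dagster run ID
    recorded = effective_code_version(code_version, run_id)
    return recorded, data_version_without_inputs(recorded)


# No code version set: both recorded values change on every run.
first_unversioned = simulate_run(None)
second_unversioned = simulate_run(None)

# A fixed code version: both recorded values stay the same across runs in this model.
first_versioned = simulate_run("v1")
second_versioned = simulate_run("v1")
```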
Let's improve this situation by setting an explicit code version. Add a code_version argument to the asset:
from dagster import asset


@asset(code_version="v1")
def versioned_number():
    return 1
Now, materialize the asset. The user-defined code version v1 will be associated with the latest materialization.
Now, let's update the code and inform Dagster that the code has changed. Do this by changing the code_version argument:
from dagster import asset


@asset(code_version="v2")
def versioned_number():
    return 11
Click Reload definitions to pick up the changes.
The asset now has an "Unsynced" label to indicate that its code version has changed since it was last materialized. We can see this in both the asset graph and the sidebar, where details about the last materialization of a selected node are visible. You can see the code version associated with the last materialization of versioned_number is v1, but its current code version is v2. This is also explained in the tooltip that appears if you hover over the (i) icon on the indicator tag.
The "Unsynced" label can appear for three reasons:
- The code version of the asset has changed.
- The dependencies of the asset have changed (a dependency was added or removed).
- The data version of a parent asset has changed due to a new materialization. Note that if you are not using code versions, every new materialization of a dependency changes its data version. In this case, the UI reports a "new materialization" rather than a "new data version".
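The three conditions above can be expressed as a comparison between an asset's current state and its state at the last materialization. This is a minimal sketch in plain Python: AssetSnapshot, unsynced_reasons, and the reason strings are hypothetical and do not mirror Dagster's internal data structures or the exact wording shown in the UI.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AssetSnapshot:
    """Hypothetical record of an asset's code version and its parents' data versions."""

    code_version: str
    parent_data_versions: Dict[str, str] = field(default_factory=dict)


def unsynced_reasons(current: AssetSnapshot, last: AssetSnapshot) -> List[str]:
    """Compare the current state with the state at the last materialization."""
    reasons = []
    if current.code_version != last.code_version:
        reasons.append("code version changed")
    current_parents = set(current.parent_data_versions)
    last_parents = set(last.parent_data_versions)
    if current_parents != last_parents:
        reasons.append("dependencies changed")
    # Only parents present in both snapshots can have a changed data version.
    for name in sorted(current_parents & last_parents):
        if current.parent_data_versions[name] != last.parent_data_versions[name]:
            reasons.append(f"upstream data version changed: {name}")
    return reasons


last = AssetSnapshot("v1", {"versioned_number": "abc"})
code_changed = unsynced_reasons(AssetSnapshot("v2", {"versioned_number": "abc"}), last)
upstream_changed = unsynced_reasons(AssetSnapshot("v1", {"versioned_number": "def"}), last)
```

An empty list means the asset is in sync in this model; more than one reason can apply at the same time.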
The versioned_number asset must be materialized again to become up-to-date. Click the toggle to the right side of the Materialize button to display the Materialize unsynced option. Confirm the materialization of versioned_number. This will update the latest materialization code_version shown in the sidebar to v2 and bring the asset up-to-date.
Step two: Data versions with dependencies
Tracking changes becomes more powerful when there are dependencies in play. Let's add an asset downstream of our first asset:
from dagster import asset


@asset(code_version="v2")
def versioned_number():
    return 11


@asset(code_version="v1")
def multiplied_number(versioned_number):
    return versioned_number * 2
In the Dagster UI, click Reload definitions. The multiplied_number asset will be marked as Never materialized.
Once again, click the toggle to the right side of the Materialize button to display the Materialize unsynced option. This will also provide the option to materialize "Never materialized" assets. This time, you will not see versioned_number as an option, because the system knows that versioned_number is up to date. Confirm the materialization of multiplied_number.
Now, let's update the versioned_number asset. Specifically, we'll change its return value and code version:
from dagster import asset


@asset(code_version="v3")
def versioned_number():
    return 15


@asset(code_version="v1")
def multiplied_number(versioned_number):
    return versioned_number * 2
As before, this will cause versioned_number to get an "Unsynced" label indicating that its code version has changed since its latest materialization. You might think that, since multiplied_number depends on versioned_number, it would also appear to be "Unsynced". However, "Unsynced" status is not transitive in Dagster. multiplied_number will only appear to be "Unsynced" if its last materialization was computed against an outdated data version of versioned_number. Materialize versioned_number and you will see that multiplied_number then becomes "Unsynced", with a reported reason of "Upstream data version change".
Materialize multiplied_number to get both assets up-to-date again.
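The sequence above can be replayed with a toy model to make the non-transitive behavior concrete. ToyAsset is a hypothetical class written for this illustration, not part of Dagster, and SHA-256 stands in for Dagster's internal hashing. The model assumes parents are materialized before their children.

```python
import hashlib
from typing import Dict, List, Optional, Sequence


class ToyAsset:
    """Hypothetical stand-in for an asset; tracks what its last materialization used."""

    def __init__(self, name: str, code_version: str, parents: Sequence["ToyAsset"] = ()):
        self.name = name
        self.code_version = code_version
        self.parents: List["ToyAsset"] = list(parents)
        self.materialized_code_version: Optional[str] = None
        self.data_version: Optional[str] = None
        self.parent_versions_used: Dict[str, Optional[str]] = {}

    def materialize(self) -> None:
        # Record the code version and parent data versions this run was based on.
        self.parent_versions_used = {p.name: p.data_version for p in self.parents}
        self.materialized_code_version = self.code_version
        parts = [self.code_version, *(str(p.data_version) for p in self.parents)]
        self.data_version = hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()

    def is_unsynced(self) -> bool:
        # Only this asset's own code and its direct parents' data versions are compared.
        if self.materialized_code_version != self.code_version:
            return True
        return any(
            p.data_version != self.parent_versions_used.get(p.name) for p in self.parents
        )


versioned_number = ToyAsset("versioned_number", "v2")
multiplied_number = ToyAsset("multiplied_number", "v1", parents=[versioned_number])
versioned_number.materialize()
multiplied_number.materialize()

# Change only the upstream code version: the downstream asset is not yet unsynced.
versioned_number.code_version = "v3"
after_code_change = (versioned_number.is_unsynced(), multiplied_number.is_unsynced())

# Materializing upstream produces a new data version, which unsyncs the downstream asset.
versioned_number.materialize()
after_upstream_run = (versioned_number.is_unsynced(), multiplied_number.is_unsynced())

multiplied_number.materialize()
after_downstream_run = (versioned_number.is_unsynced(), multiplied_number.is_unsynced())
```

The downstream asset compares against the upstream data version rather than the upstream code version, which is why the status does not propagate until the upstream asset is actually rematerialized.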