Asset versioning and caching

info

This feature is considered in a beta stage. It is still being tested and may change. For more information, see the API lifecycle stages documentation.

This guide demonstrates how to build memoizable graphs of assets. Memoizable assets help avoid unnecessary recomputations, speed up the developer workflow, and save computational resources.

Context

There's no reason to spend time materializing an asset if the result is going to be the same as the result of its last materialization.

Dagster's versioning system helps you determine ahead of time whether materializing an asset will produce a different result. It's based on the idea that the result of an asset materialization shouldn't change as long as:

The code used is the same code as the last time the asset was materialized.
The input data is the same input data as the last time the asset was materialized.

Dagster has two versioning concepts to represent the code and input data used for each materialization:

Code version. A string that represents the version of the code that computes an asset. This is the code_version argument of @dg.asset.
Data version. A string that represents the version of the data represented by the asset. This is represented as a DataVersion object.

By keeping track of code and data versions, Dagster can predict whether a materialization will change the underlying value. This allows Dagster to skip redundant materializations and instead return the previously computed value. In more technical terms, Dagster offers a limited form of memoization for assets: the last-computed asset value is always cached.

In computationally expensive data pipelining, this approach can yield tremendous benefits.

tip

You can scaffold assets from the command line by running dg scaffold defs dagster.asset <path/to/asset_file.py>. For more information, see the dg CLI docs.

Step one: Understanding data versions

By default, Dagster automatically computes a data version for each materialization of an asset. It does this by hashing a code version together with the data versions of any input assets.

Let's start with a trivial asset that returns a hardcoded number:

src/<project_name>/defs/assets.py
from dagster import asset


@asset
def a_number():
    return 1

Next, start the Dagster UI:

dg dev

Navigate to the Asset catalog and click Materialize to materialize the asset.

Next, look at the entry for the materialization under the "Events" tab in the Asset Catalog. Take note of the two hashes in the Tags section of the materialization details - code_version and data_version:

Simple asset data version

The code version shown is a copy of the run ID for the run that generated this materialization. Because a_number has no user-defined code_version, Dagster assumes a different code version on every run, which it represents with the run ID.

The data_version is also generated by Dagster. This is a hash of the code version together with the data versions of any inputs. Since a_number has no inputs, in this case, the data version is a hash of the code version only.

If you materialize the asset again, you'll notice that both the code version and data version change. The code version becomes the ID of the new run and the data version becomes a hash of the new code version.

Let's improve this situation by setting an explicit code version. Add a code_version on the asset:

src/<project_name>/defs/assets.py
from dagster import asset


@asset(code_version="v1")
def versioned_number():
    return 1

Now, materialize the asset. The user-defined code version v1 will be associated with the latest materialization:

Simple asset data version with code version

Now, let's update the code and inform Dagster that the code has changed. Do this by changing the code_version argument:

src/<project_name>/defs/assets.py
from dagster import asset


@asset(code_version="v2")
def versioned_number():
    return 11

Click Reload definitions to pick up the changes.

Simple asset data version with code version

The asset now has an "Unsynced" label to indicate that its code version has changed since it was last materialized. We can see this in both the asset graph and the sidebar, where details about the last materialization of a selected node are visible. You can see the code version associated with the last materialization of versioned_number is v1, but its current code version is v2. This is also explained in the tooltip that appears if you hover over the (i) icon on the indicator tag.

note

The "Unsynced" label can appear for three reasons:

The code version of the asset is changed.
The dependencies of the asset have changed (a dependency was added or removed).
The data version of a parent asset has changed due to a new materialization. Note that if you are not using code versions, all new materialization of a dependency will change the data version. The UI will in this case report a "new materialization" rather than a "new data version".

The versioned_number asset must be materialized again to become up-to-date. Click the toggle to the right side of the Materialize button to display the Materialize unsynced option. Confirm the materialization of versioned_number. This will update the latest materialization code_version shown in the sidebar to v2 and bring the asset up-to-date.

Step two: data versions with dependencies

Tracking changes becomes more powerful when there are dependencies in play. Let's add an asset downstream of our first asset:

src/<project_name>/defs/assets.py
from dagster import asset


@asset(code_version="v2")
def versioned_number():
    return 11


@asset(code_version="v1")
def multiplied_number(versioned_number):
    return versioned_number * 2

In the Dagster UI, click Reload definitions. The multiplied_number asset will be marked as Never materialized.

Once again, click the toggle to the right side of the Materialize button to display the Materialize unsynced option. This will also provide the option to materialize "Never materialized" assets. This time, you will not see versioned_number as an option, because the system knows that versioned_number is up to date. Confirm the materialization of multiplied_number.

Run event log

Now, let's update the versioned_number asset. Specifically, we'll change its return value and code version:

src/<project_name>/defs/assets.py
from dagster import asset


@asset(code_version="v3")
def versioned_number():
    return 15


@asset(code_version="v1")
def multiplied_number(versioned_number):
    return versioned_number * 2

As before, this will cause versioned_number to get an "Unsynced" label indicating that its code version has changed since its latest materialization. You might think that, since multiplied_number depends on versioned_number, it would also appear to be "Unsynced". However, "Unsynced" status is not transitive in Dagster. multiplied_number will only appear to be "Unsynced" if its last materialization is against an outdated version of versioned_number. Materialize versioned_number and you will see that multiplied_number then becomes "Unsynced", with a reported reason of "Upstream data version change".

Dependencies code version only

Materialize multiplied_number to get both assets up-to-date again.

Step three: Computing your own data versions

A data version is like a fingerprint for the value that an asset represents, i.e. the output of its materialization function. Therefore, we want our data versions to correspond on a one-to-one basis to the possible return values of a materialization function. Dagster auto-generates data versions by hashing the code version together with input data versions. This satisfies the above criterion in many cases, but sometimes a different approach is necessary.

For example, when a materialization function contains an element of randomness, then multiple materializations of the asset with the same code over the same inputs will produce the same data version for different outputs. On the flip side, if we are generating code versions with an automated approach like source-hashing, then materializing an asset after a cosmetic refactor will produce a different data version (which is derived from the code version) but the same output.

Dagster accommodates these and similar scenarios by allowing user code to supply its own data versions. To do so, include the data version alongside the returned asset value in an Output object. Let's update versioned_number to do this. For simplicity, you'll use the stringified return value as the data version:

src/<project_name>/defs/assets.py
import dagster as dg


@dg.asset(code_version="v4")
def versioned_number():
    value = 20
    return dg.Output(value, data_version=dg.DataVersion(str(value)))


@dg.asset(code_version="v1")
def multiplied_number(versioned_number):
    return versioned_number * 2

If you reload definitions, as before, you will see versioned_number gets an "Unsynced" label to indicate the latest materialization is out of sync with its code version. We also know that if we materialize versioned_number, multiplied_number will become unsynced. Let's re-materialize them both in one run to avoid that intermediate state. Notice the DataVersion of versioned_number is now 20:

Manual data versions 1

Let's simulate a cosmetic refactor by updating versioned_number again, but without changing the returned value. Bump the code version to v5 and change 20 to 10 + 10:

src/<project_name>/defs/assets.py
import dagster as dg


@dg.asset(code_version="v5")
def versioned_number():
    value = 10 + 10
    return dg.Output(value, data_version=dg.DataVersion(str(value)))


@dg.asset(code_version="v1")
def multiplied_number(versioned_number):
    return versioned_number * 2

Once again, versioned_asset will have an "Unsynced" label to indicate the change in the code version.

Let's see what happens if only versioned_number is materialized. Select it in the asset graph and click Materialize selected. The sidebar shows the latest materialization now has a code_version of v5, and the data version is again 20:

Manual data versions 2

Notice that, unlike the last time we materialized versioned_number, multiplied_number does not have an "Unsynced" label! Here's what happened: the new materialization of versioned_number with the explicitly supplied data version supersedes the code version of versioned_number. Dagster then compared the data version of versioned_number last used to materialize multiplied_number to the current data version of versioned_number. Since this comparison shows that the data version of versioned_number hasn't changed, Dagster knows that the change to the code version of versioned_number doesn't affect multiplied_number.

If versioned_number had used a Dagster-generated data version, the data version of versioned_number would have changed due to its updated code version despite the fact that the returned value did not change. multiplied_number would have a label indicating that an upstream data version had changed.

Step four: data versions with source assets

In the real world, data pipelines depend on external upstream data. So far in this guide, we haven't used any external data; we've been substituting hardcoded data in the asset at the root of our graph and using a code version as a stand-in for the version of that data. We can do better than this.

External data sources in Dagster are modeled by SourceAssets. We can add versioning to a SourceAsset by making it observable. An observable source asset has a user-defined function that computes and returns a data version.

Let's add an @dg.observable_source_asset called input_number. This will represent a file written by an external process upstream of our pipeline:

The body of the input_number function computes a hash of the file contents and returns it as a DataVersion. We'll set input_number as an upstream dependency of versioned_number and have versioned_number return the value it reads from the file:

src/<project_name>/defs/assets.py
from hashlib import sha256

import dagster as dg


def sha256_digest_from_str(string: str) -> str:
    hash_sig = sha256()
    hash_sig.update(bytearray(string, "utf8"))
    return hash_sig.hexdigest()


FILE_PATH = dg.file_relative_path(__file__, "input_number.txt")


@dg.observable_source_asset
def input_number():
    with open(FILE_PATH) as ff:
        return dg.DataVersion(sha256_digest_from_str(ff.read()))


@dg.asset(code_version="v6", deps=[input_number])
def versioned_number():
    with open(FILE_PATH) as ff:
        value = int(ff.read())
        return dg.Output(value, data_version=dg.DataVersion(str(value)))


@dg.asset(code_version="v1")
def multiplied_number(versioned_number):
    return versioned_number * 2

Adding an observable source asset to an asset graph will cause a new button, Observe sources, to appear:

Source asset in graph

Click this button to kick off a run that executes the observation function of input_number. Let's look at the entry in the asset catalog for input_number:

Source asset in catalog

Take note of the data_version listed here that you computed.

We also see that versioned_number and multiplied_number have labels indicating that they have new upstream dependencies (because we added the observable source asset to the graph). Click Materialize all to bring them up to date.

Finally, let's manually alter the file to simulate the activity of an external process. Change the content of input_number.txt:

If we click the Observe Sources button again, the downstream assets will again have labels indicating that upstream data has changed. The observation run generated a new data version for input_number because its content changed.

Context​

Step one: Understanding data versions​

Step two: data versions with dependencies​

Step three: Computing your own data versions​

Step four: data versions with source assets​

Context

Step one: Understanding data versions

Step two: data versions with dependencies

Step three: Computing your own data versions

Step four: data versions with source assets