Backfilling is the process of running partitions for assets or ops, either to fill in partitions that don't exist yet or to update existing records. Dagster supports backfilling every partition or only a subset of partitions.
After defining a partition, you can launch a backfill that will submit runs to fill in multiple partitions at the same time.
Backfills are common when setting up a pipeline for the first time. The assets you want to materialize might have historical data that needs to be materialized to get the assets up to date. Another common reason to run a backfill is when you’ve changed the logic for an asset and need to update historical data with the new logic.
To launch backfills for a partitioned asset, click the Materialize button on either the Asset details or the Global asset lineage page. The backfill modal will display.
Backfills can also be launched for a selection of partitioned assets as long as the most upstream assets share the same partitioning. For example: All partitions use a
To observe the progress of an asset backfill, navigate to the Backfill details page for the backfill. This page can be accessed by clicking Overview (top navigation bar) > Backfills tab, then clicking the ID of the backfill:
By default, if you launch a backfill that covers N partitions, Dagster will launch N separate runs, one for each partition. This approach can help avoid overwhelming Dagster or your resources with large amounts of data. However, if you're using a parallel-processing engine like Spark or Snowflake, you often don't need Dagster to help with parallelism, so splitting the backfill into multiple runs just adds overhead.
Dagster supports backfills that execute as a single run that covers a range of partitions, such as executing a backfill as a single Snowflake query. After the run completes, Dagster will track that all the partitions have been filled.
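The difference between the two strategies can be sketched in plain Python. This is only an illustration of how partitions map to runs, not Dagster's implementation; `daily_partition_keys` and `plan_runs` are hypothetical helpers introduced for this example.

```python
from datetime import date, timedelta


def daily_partition_keys(start: date, end: date) -> list[str]:
    """Return one ISO-formatted key per day from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def plan_runs(keys: list[str], single_run: bool) -> list[list[str]]:
    """Group partition keys into runs.

    Default strategy: one run per partition key.
    Single-run strategy: one run that covers every key.
    """
    if single_run:
        return [keys] if keys else []
    return [[key] for key in keys]


keys = daily_partition_keys(date(2021, 4, 1), date(2021, 4, 3))
per_partition_plan = plan_runs(keys, single_run=False)  # three runs, one key each
single_run_plan = plan_runs(keys, single_run=True)      # one run, three keys
```

The second plan corresponds to the single-run behavior described above: the whole range is handed to one run, and the underlying engine decides how to parallelize the work.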
To get this behavior, you need to:
Set the asset's backfill_policy to BackfillPolicy.single_run().
Write code that operates on a range of partitions instead of just single partitions. This means that, if your code uses the partition_key context property, you'll need to update it to use one of the following properties instead:
Which property to use depends on whether it's most convenient for you to operate on start/end datetime objects, start/end partition keys, or a list of partition keys.
from dagster import (
    AssetExecutionContext,
    AssetKey,
    BackfillPolicy,
    DailyPartitionsDefinition,
    asset,
)


# read_data_in_datetime_range, compute_events_from_raw_events, and
# overwrite_data_in_datetime_range are placeholders for your own logic.
@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2020-01-01"),
    backfill_policy=BackfillPolicy.single_run(),
    deps=[AssetKey("raw_events")],
)
def events(context: AssetExecutionContext):
    # The time window spans every partition covered by this run.
    start_datetime, end_datetime = context.partition_time_window

    input_data = read_data_in_datetime_range(start_datetime, end_datetime)
    output_data = compute_events_from_raw_events(input_data)

    # Replace the whole range at once so the run can be retried safely.
    overwrite_data_in_datetime_range(start_datetime, end_datetime, output_data)
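To make the three ways of looking at a partition range concrete (start/end datetimes, start/end partition keys, or a list of keys), here is a plain-Python sketch for daily partitions. It doesn't call Dagster; `describe_range` is a hypothetical helper, and the half-open time window, whose end is midnight after the last day, is an assumption of this sketch.

```python
from datetime import datetime, timedelta


def describe_range(partition_keys: list[str]) -> dict:
    """Express a set of daily partition keys in three equivalent shapes."""
    if not partition_keys:
        raise ValueError("At least one partition key is required.")

    ordered = sorted(partition_keys)  # ISO dates sort chronologically as strings
    first, last = ordered[0], ordered[-1]

    # Assumed convention: the start is inclusive and the end is exclusive.
    window_start = datetime.strptime(first, "%Y-%m-%d")
    window_end = datetime.strptime(last, "%Y-%m-%d") + timedelta(days=1)

    return {
        "keys": ordered,                           # list of partition keys
        "key_range": (first, last),                # start/end partition keys
        "time_window": (window_start, window_end)  # start/end datetime objects
    }


shapes = describe_range(["2021-04-02", "2021-04-01", "2021-04-03"])
```

A time window tends to suit range queries against a warehouse, while a key list is more convenient when each partition maps to its own file or object.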
If you are using an I/O manager to handle saving and loading your data, you'll need to ensure the I/O manager is also using these methods. If you're using any of the built-in database I/O managers, like Snowflake, BigQuery, or DuckDB, you'll have this out-of-the-box. Note: This doesn't apply to file system I/O managers.
from dagster import InputContext, IOManager, OutputContext


class MyIOManager(IOManager):
    def load_input(self, context: InputContext):
        start_datetime, end_datetime = context.asset_partitions_time_window
        return read_data_in_datetime_range(start_datetime, end_datetime)

    def handle_output(self, context: OutputContext, obj):
        start_datetime, end_datetime = context.asset_partitions_time_window
        return overwrite_data_in_datetime_range(start_datetime, end_datetime, obj)
To launch and monitor backfills for a job, use the Partitions tab in the job's Details page:
Click the Launch backfill button in the Partitions tab. This opens the Launch backfill modal.
Select the partitions to backfill. A run will be launched for each partition.
Click the Submit [N] runs button in the bottom right to submit the runs. What happens when you click this button depends on your Run Coordinator:
After all the runs have been submitted, you'll be returned to the Partitions page, with a filter for runs inside the backfill. This page refreshes periodically and allows you to see how the backfill is progressing. Boxes will become green or red as steps in the backfill runs succeed or fail:
Backfills can also be launched using the
Let's say we defined a date-partitioned job named trips_update_job. To execute the backfill for this job, we can run the dagster job backfill command as follows:
$ dagster job backfill -p trips_update_job
This will display a list of all the partitions in the job, ask you if you want to proceed, and then launch a run for each partition.
To execute a subset of a partition set, use the --partitions argument and provide a comma-separated list of partition names you want to backfill:
$ dagster job backfill -p do_stuff_partitioned --partitions 2021-04-01,2021-04-02
Alternatively, you can specify a range of partitions using the --from and --to arguments:
$ dagster job backfill -p do_stuff_partitioned --from 2021-04-01 --to 2021-05-01
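If you script backfills, it can help to assemble the command in one place. The following is a minimal sketch that builds the argument list using only the flags shown above (-p, --partitions, --from, --to). `build_backfill_command` is a hypothetical helper, and refusing to mix an explicit list with a range is a choice made for this sketch, not documented CLI behavior.

```python
from typing import Optional, Sequence


def build_backfill_command(
    job_name: str,
    partitions: Optional[Sequence[str]] = None,
    start: Optional[str] = None,
    end: Optional[str] = None,
) -> list[str]:
    """Build the argument list for a `dagster job backfill` invocation."""
    if partitions and (start or end):
        # Sketch-level guard: keep the two selection styles separate.
        raise ValueError("Pass either a partition list or a from/to range, not both.")

    command = ["dagster", "job", "backfill", "-p", job_name]
    if partitions:
        # The CLI expects a single comma-separated value.
        command += ["--partitions", ",".join(partitions)]
    if start:
        command += ["--from", start]
    if end:
        command += ["--to", end]
    return command


all_partitions = build_backfill_command("trips_update_job")
explicit_list = build_backfill_command(
    "do_stuff_partitioned", partitions=["2021-04-01", "2021-04-02"]
)
date_range = build_backfill_command(
    "do_stuff_partitioned", start="2021-04-01", end="2021-05-01"
)
```

The resulting list can be passed to `subprocess.run`. Keep in mind that, as noted above, the command asks for confirmation before launching runs, so a non-interactive script has to account for that prompt.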