Backfilling is the process of running partitions for assets or ops, either to fill in partitions that don’t exist yet or to update existing records. Dagster supports backfills across all partitions or a subset of partitions.
After defining a partition, you can launch a backfill that will submit runs to fill in multiple partitions at the same time.
Backfills are common when setting up a pipeline for the first time. The assets you want to materialize might have historical data that needs to be materialized to get the assets up to date. Another common reason to run a backfill is when you’ve changed the logic for an asset and need to update historical data with the new logic.
To launch backfills for a partitioned asset, click the Materialize button on either the Asset details or the Global asset lineage page. The backfill modal will display.
Backfills can also be launched for a selection of partitioned assets, as long as the most upstream assets share the same partitioning. For example, all of the selected assets use the same DailyPartitionsDefinition.
To observe the progress of an asset backfill, navigate to the Backfill details page for the backfill. To get there, click Overview (top navigation bar) > Backfills tab, then click the ID of the backfill.
Launching single-run backfills using backfill policies
By default, if you launch a backfill that covers N partitions, Dagster will launch N separate runs, one for each partition. This approach can help avoid overwhelming Dagster or your resources with large amounts of data. However, if you're using a parallel-processing engine like Spark or Snowflake, you often don't need Dagster to help with parallelism, so splitting the backfill into multiple runs just adds overhead.
Dagster supports backfills that execute as a single run that covers a range of partitions, such as executing a backfill as a single Snowflake query. After the run completes, Dagster will track that all the partitions have been filled.
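To make the difference concrete, here is a plain-Python sketch of how the number of runs changes between the two modes. It does not use or reproduce Dagster's internals; `plan_runs` and its `max_partitions_per_run` parameter are hypothetical names chosen for this illustration.

```python
from typing import List, Optional, Tuple


def plan_runs(
    partition_keys: List[str], max_partitions_per_run: Optional[int] = 1
) -> List[Tuple[str, str]]:
    """Return (first_key, last_key) for each run a backfill would launch.

    max_partitions_per_run=1 models the default of one run per partition;
    None models a single run that spans every requested partition.
    """
    if not partition_keys:
        return []
    size = max_partitions_per_run or len(partition_keys)
    # Split the ordered keys into consecutive chunks of at most `size` keys.
    chunks = [
        partition_keys[i : i + size] for i in range(0, len(partition_keys), size)
    ]
    return [(chunk[0], chunk[-1]) for chunk in chunks]


keys = ["2024-01-01", "2024-01-02", "2024-01-03"]
per_partition = plan_runs(keys)  # three runs, one per key
single_run = plan_runs(keys, max_partitions_per_run=None)  # one run for the range
```

With three partitions, the default mode yields three runs, while the single-run mode yields one run covering the range from the first key to the last.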
Single-run backfills only work for backfills that target assets directly, i.e. those launched from the asset graph or asset page. Backfills launched from the Job page will not respect the backfill policies of assets included in the job.
To support single-run backfills, write code that operates on a range of partitions instead of just single partitions. This means that, if your code uses the partition_key context property, you'll need to update it to use a property that describes the whole range instead.
Which property to use depends on whether it's most convenient for you to operate on start/end datetime objects, start/end partition keys, or a list of partition keys.
If you're using an I/O manager, you'll also need to make sure that the I/O manager uses these range-based properties. Note: File system I/O managers do not support single-run backfills, but Dagster's built-in database I/O managers - like Snowflake, BigQuery, or DuckDB - include this functionality out of the box.
To launch and monitor backfills for a job, use the Partitions tab in the job's Details page:
Click the Launch backfill button in the Partitions tab. This opens the Launch backfill modal.
Select the partitions to backfill. A run will be launched for each partition.
Click the Submit [N] runs button on the bottom right to submit the runs. What happens when you click this button depends on your run coordinator:
For the default run coordinator, the modal will exit after all runs have been launched.
For the queued run coordinator, the modal will exit after all runs have been queued.
After all the runs have been submitted, you'll be returned to the Partitions page, with a filter for runs inside the backfill. This page refreshes periodically and allows you to see how the backfill is progressing. Boxes will become green or red as steps in the backfill runs succeed or fail.
Backfills can also be launched using the backfill CLI.
Let's say we defined a date-partitioned job named trips_update_job. To execute the backfill for this job, we can run the dagster job backfill command as follows:
$ dagster job backfill -j trips_update_job
This will display a list of all the partitions in the job, ask you if you want to proceed, and then launch a run for each partition.