An asset definition can represent a collection of partitions that can be tracked and materialized independently. In many ways, each partition functions like its own mini-asset, but they all share a common materialization function and dependencies. Typically, each partition will correspond to a separate file, or a slice of a table in a database.
Consider an online store's order data. In a database, the order data might be stored as a single orders table that contains multiple days' worth of orders. However, if the data were ingested into Amazon S3 as Parquet files, you could create a new Parquet file per day, with each day's file forming one partition.
Using partitions provides the following benefits:
- Cost efficiency: Process only the data that's needed and gain granular control over slices. For example, you can store recent orders in hot storage and older orders in cheaper, cold storage.
- Speed up compute: Divide large datasets into smaller, more manageable parts to speed up queries.
- Scalability: As data grows, distribute it across multiple servers or storage systems, or run multiple partitions in parallel.
- Concurrent processing: Boost computational speed with parallel processing, significantly reducing the time and cost of data processing tasks.
- Speed up debugging: Test on an individual partition before trying to run larger ranges of data.
Partitions are supported for both asset definitions and ops, but they work differently for each. Refer to the documentation for each concept for more information.
With partitions, you can:
- View runs by partition in the Dagster UI
- Define a schedule that fills in a partition each time it runs. For example, a job might run each day and process the data that arrived during the previous day.
- Launch backfills, which are sets of runs that each process a different partition. For example, after making a code change, you might want to run your job on all partitions instead of just one of them.