Partitions

An asset definition can represent a collection of partitions that can be tracked and materialized independently. In many ways, each partition functions like its own mini-asset, but they all share a common materialization function and dependencies. Typically, each partition will correspond to a separate file, or a slice of a table in a database.

Consider an online store's order data. In a database, the order data might be stored as a single orders table that contains multiple days' worth of orders. However, if the data were ingested into Amazon Web Services (AWS) S3 as Parquet files, you could write a new Parquet file for each day, with each file corresponding to one partition.
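As a rough illustration of the per-day layout described above, the sketch below maps each day to its own Parquet file location using plain Python. It does not use Dagster's API. The bucket name, the key layout, and the helper names `partition_keys` and `s3_path_for` are all hypothetical and chosen only for this example.

```python
from datetime import date, timedelta


def partition_keys(start: date, end: date) -> list[str]:
    """Return one ISO-formatted key per day in the half-open range [start, end)."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]


def s3_path_for(partition_key: str, bucket: str = "example-orders-bucket") -> str:
    """Build the (hypothetical) S3 location of one day's Parquet file."""
    return f"s3://{bucket}/orders/{partition_key}.parquet"


# Three daily partitions, each backed by its own file.
keys = partition_keys(date(2024, 1, 1), date(2024, 1, 4))
paths = [s3_path_for(key) for key in keys]
```

Because every key resolves to a distinct file, one day's data can be rewritten or reprocessed without touching the others.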


Benefits

Using partitions provides the following benefits:

  • Cost efficiency: Process only the data that's needed and gain granular control over slices. For example, you can store recent orders in hot storage and older orders in cheaper cold storage.
  • Speed up compute: Divide large datasets into smaller, more manageable parts to speed up queries.
  • Scalability: As data grows, distribute it across multiple servers or storage systems, or run multiple partitions in parallel.
  • Concurrent processing: Boost computational speed with parallel processing, significantly reducing the time and cost of data processing tasks.
  • Speed up debugging: Test on an individual partition before trying to run larger ranges of data.

Uses

Partitions are supported for both asset definitions and ops, though they are used differently with each.

With partitions, you can:

  • View runs by partition in the Dagster UI.
  • Define a schedule that fills in a partition each time it runs. For example, a job might run each day and process the data that arrived during the previous day.
  • Launch backfills, which are sets of runs that each process a different partition. For example, after making a code change, you might want to run your job on all partitions instead of just one of them.
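The schedule and backfill behavior in the list above can be sketched in plain Python. This is a simplified model under stated assumptions, not Dagster's API: partitions are assumed to be daily and keyed by ISO date, and the helper names `scheduled_partition`, `backfill_partitions`, and `run_backfill` are hypothetical.

```python
from datetime import date, timedelta


def scheduled_partition(run_date: date) -> str:
    """Partition a daily schedule would fill: the day before the run date."""
    return (run_date - timedelta(days=1)).isoformat()


def backfill_partitions(first: date, last: date) -> list[str]:
    """Partitions covered by a backfill over the inclusive range [first, last]."""
    span = (last - first).days + 1
    return [(first + timedelta(days=i)).isoformat() for i in range(span)]


def run_backfill(partitions, process):
    """Simulate one run per partition and collect each result by partition key."""
    return {key: process(key) for key in partitions}


# A backfill over three days produces three independent runs.
results = run_backfill(
    backfill_partitions(date(2024, 1, 1), date(2024, 1, 3)),
    process=lambda key: f"processed orders for {key}",
)
```

In this model, a scheduled run targets exactly one partition, while a backfill fans out over a range, which is why a code change can be applied to historical data by re-running every partition it covers.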