In this guide, we'll walk through a fully featured Dagster project that takes advantage of a wide range of Dagster features. This example can be useful as a point of reference for using different Dagster APIs and integrating other tools.
At a high level, this project consists of three asset groups, all centered around a contrived organization that wants to do ML and analysis on Hacker News user activity data.
To follow along with this guide, you can bootstrap your own project with this example:
dagster project from-example \
  --name my-dagster-project \
  --example project_fully_featured
To install this example and its Python dependencies, run:
cd my-dagster-project
pip install -e .
Once you've done this, you can run:
to view this example in Dagster's UI, Dagit.
This example shows useful patterns for many Dagster concepts, including:
Software-defined assets - An asset is a software object that models a data asset. The prototypical example is a table in a database or a file in cloud storage.
This example contains three asset groups:
recommender: A machine learning model that recommends stories to specific users based on their comment history, as well as the features and training set used to fit that model. These are dropped and recreated whenever the core assets receive updates.
activity_analytics: Aggregate statistics computed about Hacker News activity, represented by dbt models and a Python model that depends on them. These are dropped and recreated whenever the core assets receive updates.
Resources - A resource is an object that models a connection to a (typically) external service. Resources can be shared between assets, and different implementations of resources can be used depending on the environment. In this example, we built multiple Hacker News API resources, all of which have the same interface but different implementations:
HNAPIClient interacts with the real Hacker News API and gets the full data set, which will be used in production.
HNAPISubsampleClient talks to the real API but subsamples the data, which is much faster than the normal implementation and is great for demoing purposes.
HNSnapshotClient reads from a local snapshot, which is useful for unit testing or environments where the connection isn't available.
Modeling resources this way separates business logic from environment-specific details: you can switch resources without changing your pipeline code.
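The idea of one interface with interchangeable implementations can be sketched in plain Python. This is a minimal illustration and not the project's actual resource code: the `HNClient` protocol, the `InMemorySnapshotClient` and `SubsamplingClient` classes, the method names, and `count_stories` are all hypothetical stand-ins, and no Dagster API is used.

```python
from typing import Dict, Optional, Protocol


class HNClient(Protocol):
    """Hypothetical shared interface that every client implementation satisfies."""

    def fetch_item_by_id(self, item_id: int) -> Optional[dict]: ...

    def fetch_max_item_id(self) -> int: ...


class InMemorySnapshotClient:
    """Serves items from a local dict, similar in spirit to reading a snapshot."""

    def __init__(self, items: Dict[int, dict]) -> None:
        self._items = dict(items)

    def fetch_item_by_id(self, item_id: int) -> Optional[dict]:
        return self._items.get(item_id)

    def fetch_max_item_id(self) -> int:
        return max(self._items) if self._items else 0


class SubsamplingClient:
    """Wraps another client and exposes only items whose ID is a multiple of sample_rate."""

    def __init__(self, inner: HNClient, sample_rate: int) -> None:
        self._inner = inner
        self._sample_rate = sample_rate

    def fetch_item_by_id(self, item_id: int) -> Optional[dict]:
        if item_id % self._sample_rate != 0:
            return None  # skipped by the subsample
        return self._inner.fetch_item_by_id(item_id)

    def fetch_max_item_id(self) -> int:
        return self._inner.fetch_max_item_id()


def count_stories(client: HNClient) -> int:
    """Business logic that depends only on the interface, not on the environment."""
    total = 0
    for item_id in range(1, client.fetch_max_item_id() + 1):
        item = client.fetch_item_by_id(item_id)
        if item is not None and item.get("type") == "story":
            total += 1
    return total
```

Because `count_stories` only relies on the two interface methods, the same function can be handed a snapshot-backed client in tests and a different client elsewhere without being modified.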
DuckDBPartitionedParquetIOManager: interacts with Spark and dbt without any long-running process. It minimizes setup difficulty and is useful for local development.
SnowflakeIOManager: handles outputs that are either Spark or Pandas DataFrames and writes data to a Snowflake table specified by metadata on the relevant Out. The metadata is helpful for observability, especially in production.
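To make the role of an IO manager more concrete, here is a minimal sketch of the underlying pattern: one object owns how outputs are written and how inputs are read back, and reports metadata about what it stored. The `LocalJsonStore` class is hypothetical, writes JSON files rather than Parquet or Snowflake tables, and uses simplified method signatures that take an asset key string instead of Dagster's context objects.

```python
import json
from pathlib import Path
from typing import Any, List


class LocalJsonStore:
    """Hypothetical stand-in for an IO manager: one JSON file per asset key."""

    def __init__(self, base_dir: str) -> None:
        self._base_dir = Path(base_dir)

    def _path_for(self, asset_key: str) -> Path:
        return self._base_dir / f"{asset_key}.json"

    def handle_output(self, asset_key: str, records: List[dict]) -> dict:
        """Persist the records and return metadata describing what was written."""
        path = self._path_for(asset_key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(records))
        # Metadata like this is what makes stored outputs observable later.
        return {"path": str(path), "num_records": len(records)}

    def load_input(self, asset_key: str) -> Any:
        """Read back whatever was stored for an upstream asset."""
        return json.loads(self._path_for(asset_key).read_text())
```

Swapping this object for one that targets a different storage system changes where data lives without touching the code that produces or consumes the data.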
Sensors - A sensor allows you to instigate runs based on some external state change. In this example, we have sensors to react to different state changes:
Testing - All Dagster entities are unit-testable. This example illustrates lightweight invocations in unit tests, including:
@asset-decorated functions. Read more about testing assets on the Testing page.
InputContext with the mocks. Check out Testing an IO manager to learn more.
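As a rough illustration of testing with mocks, the sketch below calls a plain function directly and passes it a `unittest.mock.Mock` in place of a real API client. The function `newest_item_type` and the client method names are hypothetical and not taken from the project; the project's own tests invoke Dagster-decorated functions and contexts instead.

```python
from unittest.mock import Mock


def newest_item_type(client) -> str:
    """Return the type of the most recent item, or "missing" if it cannot be fetched."""
    newest_id = client.fetch_max_item_id()
    item = client.fetch_item_by_id(newest_id)
    return item["type"] if item else "missing"


def test_newest_item_type_with_mock() -> None:
    # The mock replaces the external service, so the test needs no network access.
    client = Mock()
    client.fetch_max_item_id.return_value = 7
    client.fetch_item_by_id.return_value = {"id": 7, "type": "comment"}

    assert newest_item_type(client) == "comment"
    # Verify that the function asked for the item it was told is newest.
    client.fetch_item_by_id.assert_called_once_with(7)


test_newest_item_type_with_mock()
```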
This example is meant to be loaded from three deployments:
By default, it will load for the local deployment. You can toggle deployments by setting the DAGSTER_DEPLOYMENT env var to
Beyond leveraging Dagster core concepts, this project also uses several Dagster integration libraries:
dbt_project, and loads dbt models from an existing dbt manifest.json file in the dbt project to Dagster assets. It is useful for larger dbt projects as you may not want to recompile the entire dbt project every time you load the Dagster project.
PartitionedParquetIOManager that can take a PySpark DataFrame and store it in Parquet at the given path. It uses pyspark_resource to access a PySpark SparkSession for executing PySpark code within Dagster.
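To illustrate why loading from an existing dbt manifest.json avoids recompiling the dbt project, the sketch below reads model names and their upstream dependencies directly from a parsed manifest. It relies only on the manifest's top-level `nodes` mapping, where each node carries a `resource_type` and a `depends_on.nodes` list. The `model_dependencies` helper and the sample manifest fragment are hypothetical, and this is not how the Dagster dbt integration is implemented internally.

```python
import json
from typing import Dict, List


def model_dependencies(manifest: dict) -> Dict[str, List[str]]:
    """Map each dbt model's unique ID to the unique IDs it depends on."""
    deps: Dict[str, List[str]] = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, and other node types
        deps[unique_id] = list(node.get("depends_on", {}).get("nodes", []))
    return deps


# A tiny hand-written fragment in the shape of a manifest, for illustration only.
# In practice the dict would come from json.load() on the project's manifest file.
sample_manifest = json.loads("""
{
  "nodes": {
    "model.demo.daily_stats": {
      "resource_type": "model",
      "depends_on": {"nodes": ["source.demo.core.stories"]}
    },
    "model.demo.weekly_stats": {
      "resource_type": "model",
      "depends_on": {"nodes": ["model.demo.daily_stats"]}
    },
    "test.demo.not_null_daily_stats": {
      "resource_type": "test",
      "depends_on": {"nodes": ["model.demo.daily_stats"]}
    }
  }
}
""")

dependencies = model_dependencies(sample_manifest)
```

Since the dependency graph is already present in the file, it can be read with a single JSON parse, which is the reason a precompiled manifest is attractive for larger dbt projects.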
As time goes on, this guide will be kept up to date, taking advantage of new Dagster features and learnings from the community. If you have anything you'd like to add, or an additional example you'd like to see, don't hesitate to reach out!