Dagster is a highly componentized system built around a few core packages:
Contains the core programming model, which you'll use to write solids, pipelines, and all the other components of a data application, as well as the dagster CLI tool for executing and managing pipelines. Every Dagster project starts here and will import from the public API of this library.
Our GUI tool for visualizing, testing, scheduling, running, and monitoring Dagster pipelines, written against the GraphQL API. You'll use Dagit locally to develop and monitor pipelines, as well as in production for long-running deployments.
Defines a GraphQL API for executing pipelines and includes a CLI tool for executing queries against the API. Users will typically not import from this package, but tools like Dagit, integrations like dagster-airflow, and some containerization strategies are all built on the GraphQL API. Most users should not install this package directly.
Dagster also provides a growing set of optional add-on libraries to integrate with infrastructure and other components of the data ecosystem. These libraries vary in maturity and are under active development as the community surfaces its needs.
We distinguish between "beta" and "experimental" libraries as a rough indication of their production-readiness, and we are always excited to identify new design partners and collaborators interested in pushing their capabilities forward.
As a rule, beta libraries are ready for real use, but subject to change at a faster rate than the core libraries. In particular, breaking changes in public APIs may not be limited to minor releases.
Experimental libraries are exploratory and the functionality they expose may not be complete or ready for production.
If you are working on a library that helps Dagster interact with another part of the data ecosystem, please reach out to us. We welcome external contributions and have already incorporated experimental libraries from the community (dagster-github).
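Because the add-on libraries are optional and change at different rates, it can help to see which ones are actually present in an environment before relying on them. The following is a minimal sketch that uses only the Python standard library. The helper names `installed_dagster_packages` and `pinned_requirements` are hypothetical, and the `dagster`/`dagit` name prefixes are an assumption based on the package names mentioned in this document (for example dagster-airflow and dagster-github).

```python
from importlib import metadata


def installed_dagster_packages(prefixes=("dagster", "dagit")):
    """Return {distribution name: version} for installed packages whose
    names start with one of the given prefixes.

    The prefixes are an assumption about naming, not an official registry.
    """
    found = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        # Skip distributions whose metadata is missing or has no name.
        name = meta.get("Name") if meta is not None else None
        if name and name.lower().startswith(prefixes):
            found[name] = dist.version
    return dict(sorted(found.items()))


def pinned_requirements(packages):
    """Render one exact-version requirement line per package."""
    return [f"{name}=={version}" for name, version in packages.items()]


if __name__ == "__main__":
    # Prints nothing if no matching packages are installed.
    for line in pinned_requirements(installed_dagster_packages()):
        print(line)
```

Pinning exact versions is one possible way to guard against the faster rate of change in beta and experimental libraries described above. It is a suggestion rather than an official recommendation.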
Enables incremental adoption of Dagster in existing Apache Airflow environments by providing a facility for compiling Dagster pipelines to Airflow DAGs. Not recommended for greenfield installations.
Tools for working with Amazon Web Services, including custom CloudWatch loggers, EMR for hosted cluster compute, and S3 for persistent storage of logs and intermediate artifacts. Includes a CLI tool for quick proof-of-concept deployment of hosted Dagit to AWS. S3 and GCS are the preferred persistence solutions for long-running deploys.
Pluggable executor to run Dagster pipelines using the Celery task queue, including a CLI tool for managing worker processes. Preferred parallel execution solution for long-running deploys.
Uses system cron to enable scheduled pipeline runs, integrated with Dagit. (Does not support Windows.) Preferred scheduling solution.
Tools for working with Google Cloud Platform, including BigQuery databases and Dataproc for hosted cluster compute, as well as GCS for persistent storage of logs and intermediate artifacts. S3 and GCS are the preferred persistence solutions for long-running deploys.
Model Helm chart for deploying Dagster on Kubernetes, as well as a pluggable Kubernetes-aware run launcher. Preferred deployment solution for long-running deploys.
Wraps Jupyter notebooks as solids to enable repeatable execution that is integrated and interoperable with Dagster pipelines, as well as interaction with the Dagster environment during notebook development. Built on papermill.
Tools for working with the widely used Pandas library for Python data frames, including custom type definitions with arbitrary columnar or dataframe-wide semantic constraints.
Pluggable Postgres-backed storage for run history and event logs, allowing Dagit and other Dagster tools to point at shared remote databases. Preferred storage solution for long-running deploys.
Pluggable executor to run Dagster pipelines using the Dask framework for parallel computing.
Library resources for reporting metrics to Datadog from within Dagster pipelines.
Library resource to interact with GitHub from within Dagster pipelines.
Library resource to trigger PagerDuty alerts from within Dagster pipelines.
Custom logger using Papertrail.
Library resource for reporting metrics to Prometheus from within Dagster pipelines.
Library solid factory, resources, and custom data types for working with PySpark in Dagster pipelines.
Library solids to execute shell commands.
Library resource to post to Slack from within Dagster pipelines.
Library solids and resources for connecting to and querying Snowflake data warehouses.
Library solid factory, resources, and data types for working with Spark clusters and jobs.
Library resources and solids for SSH and SFTP execution.
Library resource that makes a Twilio client available to Dagster pipelines.