There are many approaches to writing integrations in Dagster. The choice of approach depends on the specific requirements of the integration, the level of control needed, and the complexity of the external system being integrated. By reviewing the pros and cons of each approach, it is possible to make an informed decision on the best method for a specific use case. The following are typical approaches that align with Dagster's best practices.
One of the most fundamental features that can be implemented in an integration is a resource object to interface with an external service. For example, the dagster-snowflake integration provides a custom SnowflakeResource that is a wrapper around the Snowflake connector object.
The factory pattern is used for creating multiple similar objects based on a set of specifications. This is often useful in data engineering when similar processing must operate on multiple objects with varying parameters.
For example, imagine you would like to perform an operation on a set of tables in a database. You could construct a factory function that takes a table specification and returns an asset; applying it to each specification yields a list of assets.
from dagster import Definitions, asset

parameters = [
    {"name": "asset1", "table": "users"},
    {"name": "asset2", "table": "orders"},
]


def process_table(table_name: str) -> None:
    pass


def build_asset(params):
    @asset(name=params["name"])
    def _asset():
        process_table(params["table"])

    return _asset


assets = [build_asset(params) for params in parameters]

defs = Definitions(assets=assets)
In the scenario where a single API call or configuration can result in multiple assets, with a shared runtime or dependencies, one may consider creating a multi-asset decorator. Example implementations of this approach include dbt, dlt, and Sling.
The Pipes protocol is used to integrate with systems that have their own execution environments. It enables running code in these external environments while allowing Dagster to maintain control and visibility. Example implementations of this approach include AWS Lambda, Databricks, and Kubernetes.
Separation of Environments: Allows running code in external environments, which can be useful for integrating with systems that have their own execution environments.
Flexibility: Can integrate with a wide range of external systems and languages.
Streaming logs and metadata: Provides support for streaming logs and structured metadata back into Dagster.