Tools in the Dagster ecosystem such as Dagit and the Dagster CLI (we will focus on Dagit in this guide) need to know what user code to load. This process is managed by Dagster workspaces. A workspace is a collection of user-defined repositories and information about where they reside. Currently we support repositories residing in the same environment as Dagit itself, as well as repositories living in separate virtual environments, cleanly separating their dependencies from Dagit and from each other. We refer to where a repository lives as a repository location.
Currently the only repository location type we support is a Python environment. We will add other location types (e.g. containers) as the system develops.
The structure of a workspace is encoded in a YAML document, by convention named workspace.yaml. The goal of workspace.yaml is to provide enough information to load all the repositories that the tool needs access to. We support two use cases:
- Loading in the current Python environment.
- Loading in a different Python environment.
Loading in the current environment
The user needs to provide the system with either a path to a Python file or the name of an installed Python package where a repository is defined.
If there is only one repository defined in the target package or file, it is loaded automatically.
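Conceptually, loading a repository from a file amounts to importing the module at that path and reading a named attribute off it. The sketch below is illustrative only — it does not use Dagster's actual loader, and a plain list stands in for the repository object:

```python
import importlib.util
import pathlib
import tempfile

# Hypothetical user file; in real usage this would be
# hello_world_repository.py containing @repository definitions.
source = "hello_world_repository = ['hello_world_pipeline']\n"

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "hello_world_repository.py"
    path.write_text(source)

    # Load the module from its file path, as a workspace loader might.
    spec = importlib.util.spec_from_file_location("hello_world_repository", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # Fetch the repository attribute from the loaded module by name.
    repo = getattr(module, "hello_world_repository")

print(repo)
```

When only one repository-like attribute exists in the module, the loader can pick it up without being told its name; otherwise, an explicit attribute is needed, as shown in the YAML schemas in this section.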
For example, given a single repository defined in hello_world_repository.py:

```python
from dagster import pipeline, repository, solid


@solid
def hello_world(_):
    pass


@pipeline
def hello_world_pipeline():
    hello_world()


@repository
def hello_world_repository():
    return [hello_world_pipeline]
```

the following workspace.yaml loads it:

```yaml
load_from:
  - python_file: hello_world_repository.py
```
Now, if you run dagit in that folder, it will automatically discover workspace.yaml and load the repository in the same Python environment. However, the user code will reside in its own process; Dagit will not load the user code into its process.
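The discovery step can be sketched as a check of the working directory for a workspace.yaml file. This is illustrative only, not Dagster's actual lookup code:

```python
import pathlib
import tempfile

def find_workspace(directory):
    """Return the workspace.yaml in the given directory, or None if absent."""
    candidate = pathlib.Path(directory) / "workspace.yaml"
    return candidate if candidate.is_file() else None

with tempfile.TemporaryDirectory() as tmp:
    # Before the file exists, discovery finds nothing.
    assert find_workspace(tmp) is None
    # Create a minimal workspace file, then discover it.
    (pathlib.Path(tmp) / "workspace.yaml").write_text("load_from: []\n")
    found = find_workspace(tmp)

print(found.name)
```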
Sometimes you might have more than one repository in scope and want to specify a particular one. Our schema supports this as well:
```yaml
load_from:
  - python_file:
      relative_path: hello_world_repository.py
      attribute: hello_world_repository
```
You can also load from an installed package:
```yaml
load_from:
  # works if hello_world_repository is installed by pip
  - python_package: hello_world_repository
```
Similarly, you can also specify an attribute:
```yaml
load_from:
  - python_package:
      package_name: yourproject.hello_world_repository
      attribute: hello_world_repository
```
And lastly, you can load multiple repositories from multiple packages:
```yaml
load_from:
  - python_package: team_one
  - python_package: team_two
  - python_file: path/to/team_that_refuses_to_install_packages/repo.py
```
Loading from an external environment
It is useful for repositories to have independent environments. A data engineering team running Spark can have dramatically different dependencies from an ML team running TensorFlow. Dagster supports this by having its tools communicate with those user environments over an IPC layer. To do this, you must configure your workspace to load the correct repository in the correct virtual environment.
```yaml
load_from:
  - python_environment:
      executable_path: venvs/path/to/dataengineering_spark_team/bin/python
      target:
        python_package:
          package_name: dataengineering_spark_repository
          location_name: dataengineering_spark_team_py_38_virtual_env
  - python_environment:
      executable_path: venvs/path/to/ml_tensorflow/bin/python
      target:
        python_file:
          relative_path: path/to/team_repos.py
          location_name: ml_team_py_36_virtual_env
```
Note that not only can these be distinct sets of installed dependencies; they can also be completely different Python versions.
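Because each python_environment entry points at its own interpreter via executable_path, each location can run under a different Python version. The sketch below illustrates the idea by asking an interpreter which version it runs; the current interpreter stands in for a venv's bin/python:

```python
import subprocess
import sys

# Ask the interpreter at a given executable path for its major.minor
# version, as one might do to verify an executable_path in workspace.yaml.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"],
    capture_output=True,
    text=True,
    check=True,
)
version = result.stdout.strip()
print(version)
```

Swapping in a different executable path (e.g. venvs/path/to/ml_tensorflow/bin/python) would report that environment's version instead.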
Using the built-in Dagster gRPC server, it is possible to interact with repositories that are completely remote. This allows for complete separation between tools like the Dagster CLI and Dagit, and your repository code.
The Dagster gRPC server needs to have access to your code. This server is responsible for serving information about your repositories over gRPC. To initialize the server, run the dagster api grpc command and pass it a target. The target can be either a Python file or a Python module. The server will automatically find and load all repositories within the specified target. If you want to manually specify where to find a single repository within a target, you can use the --attribute flag. You also need to specify a host, and either a port or a socket to run the server on.
```shell
# Load gRPC Server using python file:
dagster api grpc --python-file /path/to/file.py --host 0.0.0.0 --port 4266
dagster api grpc --python-file /path/to/file.py --host 0.0.0.0 --socket /path/to/socket

# Load gRPC Server using python module:
dagster api grpc --module-name my_module_name --host 0.0.0.0 --port 4266
dagster api grpc --module-name my_module_name --host 0.0.0.0 --socket /path/to/socket

# Specify an attribute within the target to load a specific repository:
dagster api grpc --python-file /path/to/file.py --attribute my_repository --host 0.0.0.0 --port 4266
dagster api grpc --module-name my_module_name --attribute my_repository --host 0.0.0.0 --port 4266
```
Then, in your workspace.yaml, you can configure a new gRPC server repository location to load from:
```yaml
load_from:
  - grpc_server:
      host: localhost
      port: 4266
      location_name: 'my_grpc_server'
```
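Before pointing a grpc_server location at a host and port, it can be useful to sanity-check that something is actually listening there. The helper below is an illustrative utility, not part of Dagster's API:

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, port_is_open("localhost", 4266) should return True once the gRPC server from the commands above is running.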
Executing runs against a gRPC server
If you are using the DefaultRunLauncher, which is configured by default on your DagsterInstance, the run launcher will launch runs against your hosted gRPC server. The gRPC server needs to be able to access your run storage in order to execute launched runs.
If you have implemented a custom run launcher and would like to host your code using the Dagster gRPC server, please reach out to us.