Workspaces

Tools in the Dagster ecosystem, such as Dagit and the Dagster CLI (this guide focuses on Dagit), need to know what user code to load. Dagster workspaces manage this process. A workspace is a collection of user-defined repositories and information about where they reside. Repositories can reside in the same environment as Dagit itself or in separate virtual environments, which cleanly separates their dependencies from Dagit and from each other. We refer to where a repository lives as a repository location.

Currently the only repository location type we support is a Python environment. We will add other location types (e.g., containers) as the system develops.

Workspace YAML

The structure of a workspace is encoded in a YAML document. By convention it is named workspace.yaml.

The goal of workspace.yaml is to provide enough information to load all the repositories that the tool needs access to. We support two use cases:

  • Loading in the current Python environment.
  • Loading in a different Python environment.

Loading in the current environment

You need to provide either the path to a file or the name of an installed Python package in which a repository is defined.

If there is only one repository defined in the target package or file, it is automatically loaded.

Example yaml for loading a single repository in a file:

hello_world_repository.py
from dagster import pipeline, repository, solid

@solid
def hello_world(_):
    pass

@pipeline
def hello_world_pipeline():
    hello_world()

@repository
def hello_world_repository():
    return [hello_world_pipeline]

workspace.yaml
load_from:
  - python_file: hello_world_repository.py

Now if you run dagit in that folder, it will automatically discover workspace.yaml and load the repository in the same Python environment. However, the user code will reside in its own process; Dagit will not load the user code into its own process.

Sometimes you might have more than one repository in scope and want to load a specific one. Our schema supports this as well:

workspace.yaml
load_from:
  - python_file:
      relative_path: hello_world_repository.py
      attribute: hello_world_repository
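
To make the relative_path and attribute pair more concrete, here is a minimal sketch of what loading one named attribute from a file can look like in plain Python. It is not Dagster's own loader; the function name load_attribute_from_file and the base_dir parameter are hypothetical, chosen only for this illustration.

```python
import importlib.util
from pathlib import Path


def load_attribute_from_file(relative_path, attribute, base_dir="."):
    """Import the Python file at base_dir/relative_path and return one attribute.

    Hypothetical helper for illustration; Dagster's real loading code may differ.
    """
    path = Path(base_dir) / relative_path
    # Build an import spec directly from the file path, using the file stem as module name.
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot import {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, attribute)
```

Under these assumptions, load_attribute_from_file("hello_world_repository.py", "hello_world_repository") would return the object bound to that name in the file.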

You can also load from an installed package.

workspace.yaml
load_from:
  # works if hello_world_repository is installed by pip
  - python_package: hello_world_repository

Similarly you can also specify an attribute:

workspace.yaml
load_from:
  - python_package:
      package_name: yourproject.hello_world_repository
      attribute: hello_world_repository
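
For an installed package, the same idea can be sketched with a dotted import instead of a file path. Again this is only an illustration under our own assumptions, not Dagster's implementation; load_attribute_from_package is a hypothetical name.

```python
import importlib


def load_attribute_from_package(package_name, attribute):
    """Import an installed package or module by dotted name and return one attribute.

    Hypothetical helper for illustration; Dagster's real loading code may differ.
    """
    module = importlib.import_module(package_name)
    return getattr(module, attribute)
```

The call fails with ModuleNotFoundError when the package is not installed in the current environment, which is the situation the pip comment in the earlier example refers to.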

And lastly you can load multiple repositories from multiple packages:

workspace.yaml
load_from:
  - python_package: team_one
  - python_package: team_two
  - python_file: path/to/team_that_refuses_to_install_packages/repo.py
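
When a workspace grows to several entries, a quick structural check can catch mistakes before a tool tries to load it. The sketch below assumes the YAML has already been parsed into Python dictionaries and lists (for example with PyYAML's yaml.safe_load). The set of location keys is taken only from the examples in this guide and may not be exhaustive, and check_load_from is a hypothetical helper, not part of Dagster.

```python
# Location types that appear in this guide; the real schema may accept more.
KNOWN_LOCATION_KEYS = {"python_file", "python_package", "python_environment", "grpc_server"}


def check_load_from(workspace):
    """Return a list of problems found in a parsed workspace document."""
    entries = workspace.get("load_from")
    if not isinstance(entries, list) or not entries:
        return ["load_from must be a non-empty list"]
    problems = []
    for index, entry in enumerate(entries):
        # Each list item is expected to be a mapping with a single location-type key.
        if not isinstance(entry, dict) or len(entry) != 1:
            problems.append(f"entry {index}: expected a mapping with exactly one key")
            continue
        (key,) = entry
        if key not in KNOWN_LOCATION_KEYS:
            problems.append(f"entry {index}: unknown location type {key!r}")
    return problems
```

An empty list means the sketch found nothing to complain about; it does not verify that the referenced files or packages exist.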

Loading from an external environment

Python Environment

It is useful for repositories to have independent environments. A data engineering team running Spark can have dramatically different dependencies than an ML team running TensorFlow. Dagster supports this by having its tools communicate with those user environments over an IPC layer. To do this, you must configure your workspace to load the correct repository in the correct virtual environment.

workspace.yaml
load_from:
  - python_environment:
      executable_path: venvs/path/to/dataengineering_spark_team/bin/python
      target:
        python_package:
          package_name: dataengineering_spark_repository
          location_name: dataengineering_spark_team_py_38_virtual_env
  - python_environment:
      executable_path: venvs/path/to/ml_tensorflow/bin/python
      target:
        python_file:
          relative_path: path/to/team_repos.py
          location_name: ml_team_py_36_virtual_env

Note that these environments can differ not only in their installed dependencies but also in their Python versions.
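
As a rough illustration of why a separate executable_path implies a separate process, the sketch below asks another interpreter for its version by starting it as a child process. It does not show Dagster's IPC layer; describe_interpreter is a hypothetical helper, and the exchange over standard output is just a stand-in for real inter-process communication.

```python
import json
import subprocess


def describe_interpreter(executable_path):
    """Ask the interpreter at executable_path for its version, in a child process.

    Hypothetical helper for illustration; this is not how Dagster talks to user code.
    """
    # The probe runs inside the other interpreter and prints its version as JSON.
    probe = "import json, sys; print(json.dumps(list(sys.version_info[:3])))"
    result = subprocess.run(
        [executable_path, "-c", probe],
        capture_output=True,
        text=True,
        check=True,
    )
    return tuple(json.loads(result.stdout))
```

Calling describe_interpreter("venvs/path/to/ml_tensorflow/bin/python") would report that environment's Python version, which may differ from the version running the calling code.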

gRPC Server

Using the built-in Dagster gRPC server, it is possible to interact with repositories that are completely remote. This allows for complete separation between tools like the Dagster CLI and Dagit and your repository code.

The Dagster gRPC server needs to have access to your code. This server is responsible for serving information about your repositories over gRPC. To initialize the server, you need to run the dagster api grpc command and pass it a target.

The target can be either a Python file or a Python module. The server will automatically find and load all repositories within the specified target. If you want to manually specify where to find a single repository within a target, you can use the attribute flag.

You also need to specify a host and either a port or a socket to run the server on.

# Load gRPC Server using python file:
dagster api grpc --python-file /path/to/file.py --host 0.0.0.0 --port 4266
dagster api grpc --python-file /path/to/file.py --host 0.0.0.0 --socket /path/to/socket

# Load gRPC Server using python module:
dagster api grpc --module-name my_module_name --host 0.0.0.0 --port 4266
dagster api grpc --module-name my_module_name --host 0.0.0.0 --socket /path/to/socket

# Specify an attribute within the target to load a specific repository:
dagster api grpc --python-file /path/to/file.py --attribute my_repository --host 0.0.0.0 --port 4266
dagster api grpc --module-name my_module_name --attribute my_repository --host 0.0.0.0 --port 4266
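
If you start the server from a script, the rules above (one target, an optional attribute, a host, and either a port or a socket) can be encoded in a small helper that assembles the argument list. This is a sketch that uses only the flags shown in the commands above; build_grpc_command is a hypothetical name and is not part of Dagster.

```python
def build_grpc_command(host, python_file=None, module_name=None,
                       attribute=None, port=None, socket=None):
    """Return the argument list for `dagster api grpc` using the flags shown above."""
    # Exactly one target: a Python file or a module name.
    if (python_file is None) == (module_name is None):
        raise ValueError("pass exactly one of python_file or module_name")
    # Exactly one endpoint: a port or a socket path.
    if (port is None) == (socket is None):
        raise ValueError("pass exactly one of port or socket")
    command = ["dagster", "api", "grpc"]
    if python_file is not None:
        command += ["--python-file", python_file]
    else:
        command += ["--module-name", module_name]
    if attribute is not None:
        command += ["--attribute", attribute]
    command += ["--host", host]
    if port is not None:
        command += ["--port", str(port)]
    else:
        command += ["--socket", socket]
    return command
```

The resulting list can be passed to subprocess.Popen to launch the server in the environment where your repository code is installed.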

Then, in your workspace.yaml, you can configure a new gRPC server repository location to load from:

workspace.yaml
load_from:
  - grpc_server:
      host: localhost
      port: 4266
      location_name: 'my_grpc_server'
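
Before pointing a tool at this location, it can help to confirm that something is listening on the configured host and port. The sketch below only tests TCP reachability with the standard library; it does not speak gRPC or confirm that the listener is a Dagster server, and is_port_open is a hypothetical helper.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

With the configuration above, is_port_open("localhost", 4266) should return True once the server started by dagster api grpc is accepting connections.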

Executing runs against a gRPC Server:

If you are using the DefaultRunLauncher, which is configured by default on your DagsterInstance, the run launcher will launch runs against your hosted gRPC server. The gRPC server needs to be able to access your run storage in order to be able to execute launched runs.

If you have implemented a custom run launcher and would like to host your code using the Dagster gRPC server, please reach out to us.