Workspaces

Tools in Dagster ecosystem such as Dagit and the Dagster CLI need to know where to find user code in order to load it. This process is managed by Dagster workspaces. A workspace is a collection of user-defined repositories and information about where they reside.

Workspace YAML

The structure of a workspace is encoded in a yaml document. By convention is it named workspace.yaml.

Each entry in the workspace is referred to as a repository location. A repository location can include more than one repository. Each repository location is loaded in its own server process that Dagster tools use an RPC protocol to communicate with. This process separation allows multiple repository locations in different environments to be loaded independently, and ensures that errors in user code can't impact Dagster system code.

Python environments

If you want to load repositories from Python code, use the python_file or python_package keys in your workspace YAML.

If you use python_file, it must specify a path relative to the workspace file leading to a file containing at least one repository definition. For example, the repository here:

hello_world_repository.py
from dagster import pipeline, repository, solid

@solid
def hello_world(_):
    pass

@pipeline
def hello_world_pipeline():
    hello_world()

@repository
def hello_world_repository():
    return [hello_world_pipeline]

could be loaded using the following workspace file in the same folder:

workspace.yaml
load_from:
  - python_file: hello_world_repository.py

Sometimes you might have more than one repository in scope and want to specify a specific one. Our schema supports this as well through the attribute key, which must be a repository name or the name of a function that returns a RepositoryDefinition. For example:

workspace.yaml
load_from:
  - python_file:
      relative_path: file_with_many_repositories.py
      attribute: hello_world_repository

The example above also illustrates that the python_file key can be a single string if the only configuration needed is the relative file path, but must be a map if more parameters are added.

By default, if you use python_file your code will load with no working directory available to resolve imports in your code. You can supply a custom working directory for relative imports using the working_directory key. For example:

workspace.yaml
load_from:
  - python_file:
      relative_path: hello_world_repository.py
      working_directory: /var/my_working_directory/

python_package can also be used instead of python_file to load code from an installed Python package.

workspace.yaml
load_from:
  - python_package: hello_world_package

A deprecated python_module key with similar semantics also exists, but you should prefer using either python_package if the code is in an installed package and does not require specifying a working directory, or python_file if it does not.

You can also specify an attribute to identify a single repository within a package, as with python_file:

workspace.yaml
load_from:
  - python_package:
      package_name: yourproject.hello_world_repository
      attribute: hello_world_repository

Customizing the Python environment

By default, Dagit and other Dagster tools assume that repository locations should be loaded using the same Python environment that was used to load Dagster. However, it is often useful for repository locations to use independent environments. For example, a data engineering team running Spark can have dramatically different dependencies than an ML team running Tensorflow.

To enable this use case, Dagster supports customizing the Python environment for each repository location, by adding the executable_path key to the YAML for a location. These environments can involve distinct sets of installed dependencies, or even completely different Python versions.

workspace.yaml
load_from:
  - python_package:
      package_name: dataengineering_spark_repository
      location_name: dataengineering_spark_team_py_38_virtual_env
      executable_path: venvs/path/to/dataengineering_spark_team/bin/python
  - python_file:
      relative_path: path/to/team_repos.py
      location_name: ml_team_py_36_virtual_env
      executable_path: venvs/path/to/ml_tensorflow/bin/python

Identifying repository locations

The example above also illustrates the location_name key. Each repository location in a workspace has a unique name that is displayed in Dagit, and is also used to disambiguate definitions with the same name across multiple repository locations. Dagster will supply a default name for each location based on its workspace entry if a custom one is not supplied.

Renaming a repository location (or changing its workspace configuration if you're using the default name) may cause parts of the Dagster system that rely on that name to need to be re-configured. For example, you may need to restart any schedules in that repository location, since those schedules were marked as started using the previous repository location name. You can avoid this situation by giving each of your locations a unique name as soon as you add them to the workspace.

Repository locations in Dagit

If you run dagit in the same folder as a workspace.yaml file, it will load all the repositories in each repository location defined by the workspace. (You can also specify a different workspace file using the -w command-line argument).

It's possible that one or more of your repository locations can't be loaded - for example, there might be a syntax error or some other unrecoverable error in one of your definitions. When this occurs, a warning message will appear in Dagit in the left-hand panel, directing you to a status page with an error message and stack trace for any locations that were unable to load.

Running your own gRPC server

By default, Dagster tools will automatically create a server process on your local machine for each of your repository locations. However, it is also possible to run your own gRPC server that's responsible for serving information about your repositories. This can be useful in more complex system architectures that deploy user code separately from Dagit.

The Dagster gRPC server needs to have access to your code. To initialize the server, run the dagster api grpc command and pass it a target file or module to load.

Similar to when specifying code in your workspace.yaml, the target can be either a python file or installed python package, and you can use an 'attribute' flag to identify a single repository within that target. The server will automatically find and load the specified repositories.

You also need to specify a host and either a post or socket to run the server on.

# Load gRPC Server using python file:
dagster api grpc --python-file /path/to/file.py --host 0.0.0.0 --port 4266
dagster api grpc --python-file /path/to/file.py --host 0.0.0.0 --port /path/to/socket

# Load gRPC Server using python package:
dagster api grpc --package-name my_package_name --host 0.0.0.0 --port 4266
dagster api grpc --package-name my_package_name --host 0.0.0.0 --socket /path/to/socket

# Specify an attribute within the target to load a specific repository:
dagster api grpc --python-file /path/to/file.py --attribute my_repository --host 0.0.0.0 --port 4266
dagster api grpc --package-name my_package_name --attribute my_repository --host 0.0.0.0 --port 4266

Then, in your workspace.yaml, you can configure a new gRPC server repository location to load:

workspace.yaml
load_from:
  - grpc_server:
      host: localhost
      port: 4266
      location_name: 'my_grpc_server'