DagsterDocs

Deploying Dagster to AWS #

To deploy Dagster to AWS, EC2 or ECS can host Dagit, RDS can store runs and events, and S3 can act as an IO manager.

Hosting Dagit or Dagster Daemon on EC2 or ECS #

To host Dagit or Dagster Daemon on a bare VM or in Docker on EC2 or ECS, see Running Dagster as a service.

Example #

You can find the code for this example on Github

This example demonstrates how you can use docker-compose to manage a Dagster deployment on ECS. It includes a Dagit container for loading and launching jobs, a dagster-daemon container for managing a run queue and submitting runs from schedules and sensors, a postgres container for persistent storage, and a container with user job code. The Dagster instance uses the EcsRunLauncher to launch each run in its own ECS task.

Using RDS for run and event log storage #

You can use a hosted RDS PostgreSQL database for your Dagster run/events data. You can do this by setting blocks in your $DAGSTER_HOME/dagster.yaml appropriately.

run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      username: { username }
      password: { password }
      hostname: { hostname }
      db_name: { database }
      port: { port }

event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      username: { username }
      password: { password }
      hostname: { hostname }
      db_name: { db_name }
      port: { port }

schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      username: { username }
      password: { password }
      hostname: { hostname }
      db_name: { db_name }
      port: { port }

In this case, you'll want to ensure you provide the right connection strings for your RDS instance, and ensure that the node or container hosting Dagit is able to connect to RDS.

Be sure that this file is present, and DAGSTER_HOME is set, on the node where Dagit is running.

Note that using RDS for run and event log storage does not require that Dagit be running in the cloud. If you are connecting a local Dagit instance to a remote RDS storage, double check that your local node is able to connect to RDS.

Using S3 for IO Management #

To enable parallel computation (e.g., with the multiprocessing or Dagster celery executors), you will need to configure persistent IO Managers -- for instance, using an S3 bucket to store data passed between ops.

You'll first need to need to use s3_pickle_io_manager as your IO Manager or customize your own persistent io managers (see example).

from dagster import Int, Out, job, op
from dagster_aws.s3.io_manager import s3_pickle_io_manager
from dagster_aws.s3.resources import s3_resource


@op(out=Out(Int))
def my_op():
    return 1


@job(
    resource_defs={
        "io_manager": s3_pickle_io_manager,
        "s3": s3_resource,
    }
)
def my_job():
    my_op()

Then, add the following YAML block in your job's config:

resources:
  io_manager:
    config:
      s3_bucket: my-cool-bucket
      s3_prefix: good/prefix-for-files-

The resource uses boto under the hood, so if you are accessing your private buckets, you will need to provide the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables or follow one of the other boto authentication methods.

With this in place, your job runs will store data passed between ops on S3 in the location s3://<bucket>/dagster/storage/<job run id>/<op name>.compute.