IBM DataStage with Dagster

In this example, you'll build a pipeline with Dagster that:

  • Wraps IBM DataStage replication jobs as Dagster multi-assets
  • Runs inline data quality checks in the same step as materialization
  • Uses a translator pattern to map DataStage tables to Dagster asset keys
  • Configures everything with a YAML-based Dagster component
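As a rough illustration of two of these ideas, the translator pattern and inline quality checks, here is a minimal sketch in plain Python. It does not use the Dagster API or the example project's code. `TableTranslator`, `TableResult`, `replicate_tables`, the `datastage` key prefix, and the two checks are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class TableTranslator:
    """Hypothetical translator: maps a DataStage table name to an asset key path."""

    prefix: tuple[str, ...] = ("datastage",)

    def asset_key(self, table: str) -> tuple[str, ...]:
        # "SALES.ORDERS" becomes ("datastage", "sales", "orders")
        return self.prefix + tuple(part.lower() for part in table.split("."))


@dataclass
class TableResult:
    """Hypothetical per-table result: the key, a row count, and check outcomes."""

    asset_key: tuple[str, ...]
    row_count: int
    checks: dict[str, bool] = field(default_factory=dict)


def replicate_tables(
    rows_by_table: dict[str, list[dict]], translator: TableTranslator
) -> list[TableResult]:
    """Simulate one replication step that emits a result per table.

    Each table's quality checks run in the same loop iteration that
    "materializes" it, rather than in a separate downstream step.
    """
    results = []
    for table, rows in rows_by_table.items():
        result = TableResult(translator.asset_key(table), len(rows))
        result.checks["non_empty"] = len(rows) > 0
        result.checks["no_null_ids"] = all(r.get("id") is not None for r in rows)
        results.append(result)
    return results


# Made-up demo data standing in for replicated tables.
demo_rows = {
    "SALES.ORDERS": [{"id": 1}, {"id": 2}],
    "SALES.CUSTOMERS": [{"id": None}],
}
results = replicate_tables(demo_rows, TableTranslator())
for r in results:
    print("/".join(r.asset_key), r.row_count, r.checks)
```

Keeping the name mapping in one translator object means a change in key layout (for example, a different prefix) touches a single place, and running the checks next to the materialization lets a failed check be reported together with the table that produced it.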

Prerequisites

To follow the steps in this guide, you'll need:

  • Basic Python knowledge
  • Python 3.10+ installed on your system. For more information, see the Installation guide.
  • Familiarity with IBM DataStage
Note: This example runs in demo mode and doesn't require the cpdctl CLI. To run it against a real DataStage instance, follow the IBM installation instructions and set demo_mode: false in the YAML configuration.

Step 1: Set up your Dagster environment

First, set up the example Dagster project.

  1. Clone the Dagster repo, then navigate to the example project from the repo root:

    cd examples/docs_projects/project_datastage
  2. Install the required dependencies with uv:

    uv sync
  3. Activate the virtual environment:

    source .venv/bin/activate

Step 2: Launch the Dagster webserver

Navigate to the project root directory and start the Dagster webserver:

    dg dev
Note: With demo_mode: true set in the YAML configuration, the project simulates a DataStage replication job locally, so no cpdctl installation is needed.
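To make the demo-mode behavior concrete, here is a minimal sketch of how such a switch could work. It is an assumption about the general shape, not the example project's actual code. `run_replication_job`, its return fields, and the `command` parameter are hypothetical, and the caller supplies the real CLI invocation, so no specific cpdctl arguments are assumed.

```python
from __future__ import annotations

import shutil
import subprocess


def run_replication_job(
    job_name: str,
    demo_mode: bool = True,
    command: list[str] | None = None,
) -> dict:
    """Run a replication job, or simulate it when demo_mode is true."""
    if demo_mode:
        # Demo mode never touches an external CLI; it returns a canned result.
        return {"job": job_name, "status": "simulated", "tables_replicated": 2}

    if not command:
        raise ValueError("command is required when demo_mode is false")
    if shutil.which(command[0]) is None:
        # Fail early with a clear message if the CLI is not installed.
        raise RuntimeError(f"{command[0]} not found on PATH")

    # check=True raises CalledProcessError if the CLI exits with a non-zero code.
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return {"job": job_name, "status": "completed", "stdout": completed.stdout}


print(run_replication_job("orders_replication"))
```

Branching on the flag at the top keeps the simulated path free of external dependencies, which is what lets the example run on a machine without the CLI installed.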

Next steps