IBM DataStage with Dagster
In this example, you'll build a pipeline with Dagster that:
- Wraps IBM DataStage replication jobs as Dagster multi-assets
- Runs inline data quality checks in the same step as materialization
- Uses a translator pattern to map DataStage tables to Dagster asset keys
- Configures everything with a YAML-based Dagster component
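As a rough illustration of the translator pattern mentioned above, the sketch below maps a table description to the path components of an asset key. It uses plain Python only. `DataStageTable`, `DataStageTranslator`, `get_asset_key`, and the `datastage` prefix are hypothetical names chosen for this illustration, not the classes shipped with the example project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStageTable:
    """Hypothetical description of one replicated table."""

    schema: str
    name: str


class DataStageTranslator:
    """Hypothetical translator: turns a table into asset key path components."""

    def __init__(self, prefix: str = "datastage"):
        self.prefix = prefix

    def get_asset_key(self, table: DataStageTable) -> list[str]:
        # Lowercase the parts so keys stay stable regardless of source casing.
        return [self.prefix, table.schema.lower(), table.name.lower()]


tables = [DataStageTable("SALES", "ORDERS"), DataStageTable("SALES", "CUSTOMERS")]
translator = DataStageTranslator()
asset_keys = [translator.get_asset_key(t) for t in tables]
print(asset_keys)
```

Keeping the mapping in one small class means the naming convention can be changed by subclassing or swapping the translator, without touching the code that runs the replication job.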
Prerequisites
To follow the steps in this guide, you'll need:
- Basic Python knowledge
- Python 3.10+ installed on your system. For more information, see the Installation guide.
- Familiarity with IBM DataStage
This example runs in demo mode and doesn't require the cpdctl CLI. If you want to run this example against a real DataStage instance, follow the IBM installation instructions and set demo_mode: false in the YAML configuration.
Step 1: Set up your Dagster environment
First, set up a new Dagster project.
- Clone the Dagster repo and navigate to the project:
  cd examples/docs_projects/project_datastage
- Install the required dependencies with uv:
  uv sync
- Activate the virtual environment:
  - MacOS: source .venv/bin/activate
  - Windows: .venv\Scripts\activate
Step 2: Launch the Dagster webserver
Navigate to the project root directory and start the Dagster webserver:
dg dev
With demo_mode: true set in the YAML configuration, the project simulates a DataStage replication job locally without a cpdctl installation.
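One way to picture a demo-mode switch like this is sketched below. It is a minimal illustration under stated assumptions, not the project's implementation: `run_replication` is a hypothetical helper, the row count of 100 is made up, and the real cpdctl invocation is deliberately left out.

```python
import shutil


def run_replication(tables: list[str], demo_mode: bool = True) -> dict[str, int]:
    """Return a row count per table (hypothetical helper)."""
    if demo_mode:
        # Simulated result: a fixed, made-up row count for every table.
        return {table: 100 for table in tables}
    if shutil.which("cpdctl") is None:
        raise RuntimeError("cpdctl was not found on PATH")
    # The real cpdctl invocation is deliberately left out of this sketch.
    raise NotImplementedError("real DataStage execution is not sketched here")


result = run_replication(["orders", "customers"], demo_mode=True)
print(result)
```

Because the demo branch returns before any lookup of the CLI, the simulated path works on a machine with no cpdctl installation.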
Next steps
- Continue this example by defining assets