Notebooks are an indispensable tool for data science. They allow for easy exploration of datasets, fast iteration, and the ability to create a rich report with Markdown blocks and inline plotting. The Dagstermill (Dagster & Papermill) package makes it straightforward to run notebooks using Dagster tools and to integrate them with your Dagster assets or jobs.
Using the Dagstermill library enables you to:
Developing in notebooks is an important part of many data science workflows. However, notebooks are often considered standalone artifacts during development: The notebook is responsible for fetching any data it needs, running the analysis, and presenting results.
This presents a number of problems, the main two being:
Data is siloed in the notebook. If another developer wants to write a notebook to do additional analysis of the same dataset, they have to replicate the data loading logic in their notebook. Each developer must then remember to update the data loading logic in every notebook whenever that logic needs to change. If the two versions of the data-fetching logic begin to drift (i.e., the logic is changed in one notebook but not the other), you may reach misleading conclusions.
Running notebooks on a schedule may miss important data. If your notebook fetches data each day, running it on a schedule could mean that the notebook runs before the new data is available. This makes it less certain that the conclusions produced by the notebook are based on the correct data, and could force you to manually re-run the notebook.
Integrating your notebooks into a broader Dagster project allows you to:
Separate data-fetching logic from analysis. Factoring data retrieval into separate assets allows any notebook to individually retrieve data required for analysis. This creates a common source of truth for your data.
If you need to change how an asset is created, you only have to modify one piece of code. Notebooks also become more discoverable, because you can easily see which notebooks analyze a particular dataset.
Execute notebooks when new data is detected. Using sensors, Dagster can execute notebooks in response to new data being made available. Not only does this simplify scheduling, it also ensures that notebooks use the most up-to-date data and prevents redundant executions if the data doesn't change.
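The two ideas above can be illustrated with a minimal plain-Python sketch. It deliberately does not use the Dagster or Dagstermill APIs: `load_measurements`, `mean_value`, and `NewDataTrigger` are hypothetical names invented for this illustration, and the data is made up. In a Dagster project, the shared loader would roughly correspond to an asset and the trigger to a sensor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


def load_measurements() -> List[Dict[str, float]]:
    """Single shared definition of how the dataset is fetched (hypothetical data)."""
    return [{"value": 5.0}, {"value": 6.0}]


def mean_value(rows: List[Dict[str, float]]) -> float:
    """One analysis that reuses the shared loader's output instead of re-fetching."""
    return sum(row["value"] for row in rows) / len(rows)


@dataclass
class NewDataTrigger:
    """Toy stand-in for a sensor: runs the analysis only when the data version changes."""

    run_analysis: Callable[[], None]
    last_seen_version: Optional[str] = None

    def tick(self, current_version: str) -> bool:
        if current_version == self.last_seen_version:
            return False  # no new data, so skip the redundant execution
        self.run_analysis()
        self.last_seen_version = current_version
        return True


runs: List[float] = []
trigger = NewDataTrigger(run_analysis=lambda: runs.append(mean_value(load_measurements())))

# Three polls: new data, unchanged data, new data again.
results = [trigger.tick(version) for version in ["2024-01-01", "2024-01-01", "2024-01-02"]]
```

Because the analysis depends on one shared loader, a change to the fetching logic happens in a single place, and the second poll is skipped because the data version has not changed.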
In this tutorial, we'll walk you through integrating Jupyter notebooks with Dagster using an example Jupyter notebook and the dagstermill library.
By the end of the tutorial, you'll have a working Jupyter and Papermill integration and a handful of materialized Dagster assets.