Tutorial
If you're new to Dagster, we recommend working through this tutorial to become familiar with Dagster's feature set and tooling, using small examples that are intended to be illustrative of real data problems.
Before We Start
The tutorial is divided into several sections:
- Setup for the Tutorial will give you a starting point for following the tutorial.
- Overview will teach you the fundamental concepts of Dagster: solids and pipelines.
- ETL with Dagster will teach you ways to construct and execute a simple data pipeline using the basics of Dagster.
- Advanced Tutorials will showcase Dagster's advanced features like scheduling and materializations.
What Are We Building
We'll build examples around a simple but scary .csv dataset, cereal.csv, which contains nutritional facts about 80 breakfast cereals. You can find this dataset on GitHub.
Or, if you've cloned the dagster git repository, you'll find this dataset at dagster/examples/dagster_examples/intro_tutorial/cereal.csv.
To get the flavor of this dataset, let's look at the header and the first five rows:
cereal.csv
name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
100% Bran,N,C,70,4,1,130,10,5,6,280,25,3,1,0.33,68.402973
100% Natural Bran,Q,C,120,3,5,15,2,8,8,135,0,3,1,1,33.983679
All-Bran,K,C,70,4,1,260,9,7,5,320,25,3,1,0.33,59.425505
All-Bran with Extra Fiber,K,C,50,4,0,140,14,8,0,330,25,3,1,0.5,93.704912
Almond Delight,R,C,110,2,2,200,1,14,8,-1,25,3,1,0.75,34.384843
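If you'd like to poke at these rows before starting the tutorial, the following is a minimal sketch using only Python's standard csv module, not any Dagster API. The sample rows are copied from the excerpt above so that the snippet runs on its own; the names SAMPLE and read_cereals are hypothetical and chosen purely for illustration.

```python
import csv
import io

# Header and first five rows of cereal.csv, copied from the excerpt above so
# the sketch is self-contained. With a local copy of the file you would pass
# an open file handle to read_cereals instead.
SAMPLE = """\
name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
100% Bran,N,C,70,4,1,130,10,5,6,280,25,3,1,0.33,68.402973
100% Natural Bran,Q,C,120,3,5,15,2,8,8,135,0,3,1,1,33.983679
All-Bran,K,C,70,4,1,260,9,7,5,320,25,3,1,0.33,59.425505
All-Bran with Extra Fiber,K,C,50,4,0,140,14,8,0,330,25,3,1,0.5,93.704912
Almond Delight,R,C,110,2,2,200,1,14,8,-1,25,3,1,0.75,34.384843
"""


def read_cereals(csv_file):
    """Parse an open CSV file into a list of dicts keyed by column name.

    Hypothetical helper for illustration; every value is left as a string.
    """
    return list(csv.DictReader(csv_file))


cereals = read_cereals(io.StringIO(SAMPLE))
print(len(cereals))                                # 5
print(cereals[0]["name"], cereals[0]["calories"])  # 100% Bran 70
```

Note that csv.DictReader leaves every field as a string, so numeric columns such as calories need an explicit conversion before you can do arithmetic on them. The -1 in the potass column for Almond Delight looks like a placeholder for a missing value, though that is an assumption based on this excerpt alone.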
You can find all of the tutorial code checked into the dagster repository.