Ingest puzzle data
NYT Connections is a word puzzle where players find four groups of four related words in a 16-word grid. Each group has a difficulty level: Yellow (easiest), Green, Blue, and Purple (hardest). Our DSPy solver needs structured data to learn effective puzzle-solving strategies.
Data structure and loading
The foundation of our AI system starts with well-structured data classes that model the puzzle domain. These classes capture not just the raw puzzle data, but also the game state needed to track solving progress:
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class PuzzleGroup:
    """Represents a group in a Connections puzzle."""

    name: str
    color: str
    words: List[str]


@dataclass
class Puzzle:
    """Represents a complete Connections puzzle."""

    id: int
    date: str
    difficulty: float
    words: List[str]
    groups: List[PuzzleGroup]


@dataclass
class GameState:
    """Tracks the state of a game in progress."""

    puzzle: Puzzle
    solved_groups: Set[str]  # group colors that have been solved
    guess_count: int
    mistake_count: int
    invalid_count: int
    finished: bool
    won: bool
    start_time: Optional[float]
    end_time: Optional[float]
The Puzzle class serves as our core data structure, containing the 16 words and their correct group assignments with difficulty levels. The GameState class tracks the dynamic aspects of puzzle-solving: which groups have been solved, how many mistakes were made, and whether the game is complete. This separation allows our DSPy solver to reason about both the static puzzle structure and the evolving game dynamics.
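For example, a fresh game might be set up like this. The puzzle below is purely illustrative (invented words and group names, not from a real NYT puzzle), and serves only to show how the two dataclasses fit together:

import time

# Illustrative puzzle; real puzzles are loaded from the CSV dataset.
sample_puzzle = Puzzle(
    id=1,
    date="2024-01-01",
    difficulty=2.5,
    words=[
        "BASS", "FLOUNDER", "SOLE", "TROUT",
        "DRUM", "GUITAR", "PIANO", "VIOLIN",
        "BOLT", "DASH", "SPRINT", "TEAR",
        "CRANE", "JACK", "LIFT", "RAISE",
    ],
    groups=[
        PuzzleGroup(name="FISH", color="Yellow",
                    words=["BASS", "FLOUNDER", "SOLE", "TROUT"]),
        PuzzleGroup(name="INSTRUMENTS", color="Green",
                    words=["DRUM", "GUITAR", "PIANO", "VIOLIN"]),
        PuzzleGroup(name="MOVE QUICKLY", color="Blue",
                    words=["BOLT", "DASH", "SPRINT", "TEAR"]),
        PuzzleGroup(name="WAYS TO HOIST", color="Purple",
                    words=["CRANE", "JACK", "LIFT", "RAISE"]),
    ],
)

# A fresh GameState before the solver makes any guesses.
state = GameState(
    puzzle=sample_puzzle,
    solved_groups=set(),
    guess_count=0,
    mistake_count=0,
    invalid_count=0,
    finished=False,
    won=False,
    start_time=time.time(),
    end_time=None,
)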
The data loading process includes comprehensive validation to ensure puzzle integrity. Every puzzle must have exactly 16 words arranged in 4 groups of 4, with each group assigned one of the four difficulty colors. This validation prevents corrupted data from affecting model training and ensures consistent puzzle structure across the dataset.
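The structural checks are straightforward to express. Here is a minimal sketch of what they might look like; the validate_puzzle helper and its error messages are illustrative, not the tutorial's exact implementation:

VALID_COLORS = {"Yellow", "Green", "Blue", "Purple"}


def validate_puzzle(puzzle: Puzzle) -> None:
    """Raise ValueError if a puzzle violates the expected structure."""
    if len(puzzle.words) != 16:
        raise ValueError(f"Puzzle {puzzle.id}: expected 16 words, got {len(puzzle.words)}")
    if len(puzzle.groups) != 4:
        raise ValueError(f"Puzzle {puzzle.id}: expected 4 groups, got {len(puzzle.groups)}")
    for group in puzzle.groups:
        if len(group.words) != 4:
            raise ValueError(f"Puzzle {puzzle.id}: group '{group.name}' must have exactly 4 words")
    if {g.color for g in puzzle.groups} != VALID_COLORS:
        raise ValueError(f"Puzzle {puzzle.id}: groups must use each difficulty color exactly once")
    # Every grouped word must appear in the 16-word grid, and vice versa.
    grouped_words = {w for g in puzzle.groups for w in g.words}
    if grouped_words != set(puzzle.words):
        raise ValueError(f"Puzzle {puzzle.id}: group words do not match the 16-word grid")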
Creating training data
The data preparation process is handled by a Dagster asset that orchestrates the entire pipeline from raw CSV files to DSPy training examples. This approach ensures reproducible data splits and consistent preprocessing:
@dg.asset(
    name="connections_puzzle_data",
    compute_kind="python",
    group_name="data",
    description="Loads and splits Connections puzzle data for training and evaluation",
)
def puzzle_data_asset(
    context: dg.AssetExecutionContext,
) -> Dict[str, Any]:
    """Load and split Connections puzzle data."""
    # Note: this asset is defined inside the ConnectionsModelBuilder component
    # (see the component configuration below), so `self` refers to that
    # component and its configured attributes.
    # Get absolute path to connections data
    data_path = Path(self.connections_data_path)
    if not data_path.is_absolute():
        data_path = Path.cwd() / data_path

    context.log.info(f"Loading Connections data from: {data_path}")

    if not data_path.exists():
        raise FileNotFoundError(f"Connections data not found at: {data_path}")

    # Load puzzles
    puzzles = load_puzzles_from_csv(str(data_path))
    context.log.info(f"Loaded {len(puzzles)} puzzles")

    # Split data
    train_puzzles, test_puzzles = shuffle_and_split_puzzles(
        puzzles, self.train_test_split
    )

    # Take subset for evaluation if specified
    eval_puzzles = (
        test_puzzles[: self.eval_subset_size]
        if self.eval_subset_size
        else test_puzzles
    )
    context.log.info(
        f"Split: {len(train_puzzles)} training, {len(eval_puzzles)} evaluation puzzles"
    )

    context.add_output_metadata(
        {
            "total_puzzles": len(puzzles),
            "train_puzzles": len(train_puzzles),
            "eval_puzzles": len(eval_puzzles),
            "train_test_split": self.train_test_split,
            "data_path": str(data_path),
        }
    )

    return {
        "train_puzzles": train_puzzles,
        "eval_puzzles": eval_puzzles,
        "total_puzzles": len(puzzles),
    }
This asset performs several critical functions:
- It loads puzzles from the CSV file and validates their structure using our domain-specific validation logic.
- It splits the data into training and evaluation sets using a configurable ratio (default 25% for training), ensuring we have sufficient data for both optimization and testing.
- The asset also handles practical considerations, like taking a subset of puzzles for efficient evaluation during development, while maintaining the full dataset for comprehensive testing.
All operations are tracked with detailed metadata, providing visibility into dataset size, split ratios, and data quality metrics. This transparency is crucial for understanding model performance and debugging potential data issues.
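The shuffle_and_split_puzzles helper used in the asset can be a thin wrapper around a seeded shuffle. Here is a minimal sketch, assuming the second argument is the fraction of puzzles assigned to the training set (the actual helper in the project may differ):

import random
from typing import List, Tuple


def shuffle_and_split_puzzles(
    puzzles: List[Puzzle], train_fraction: float, seed: int = 42
) -> Tuple[List[Puzzle], List[Puzzle]]:
    """Shuffle puzzles deterministically and split them into train/test sets."""
    shuffled = puzzles.copy()
    random.Random(seed).shuffle(shuffled)
    split_index = int(len(shuffled) * train_fraction)
    return shuffled[:split_index], shuffled[split_index:]

Seeding the shuffle keeps the train/test split reproducible across runs, which matters when comparing optimized and unoptimized solvers on the same evaluation set.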
Component configuration
The data pipeline leverages Dagster Components to create reusable, configurable data processing workflows. This YAML configuration defines all the key parameters for the Connections puzzle solver:
type: project_dspy.components.connections_model_builder.ConnectionsModelBuilder

attributes:
  model_name: connections
  connections_data_path: data/Connections_Data.csv
  performance_threshold: 0.3
  optimization_enabled: true
  train_test_split: 0.25
  eval_subset_size: 10
  num_threads: 4
This configuration encapsulates specific defaults optimized for Connections puzzles. The data path points to our puzzle CSV file, the performance threshold (0.3) represents a reasonable baseline success rate for puzzle-solving, and the train/test split (0.25) balances having enough training data with sufficient evaluation coverage.
The component approach provides several benefits:
- Configurations can be easily modified for different environments or puzzle variants.
- The same component can be deployed across development and production with different parameters.
- The declarative YAML format makes it easy to understand and version control the pipeline settings.
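For orientation, the component referenced by the type: field might be structured roughly as follows. This is a minimal sketch, assuming a recent Dagster release where components subclass dg.Component and dg.Resolvable; the field names mirror the YAML attributes above, but the actual ConnectionsModelBuilder implementation may differ:

from typing import Optional

import dagster as dg


class ConnectionsModelBuilder(dg.Component, dg.Model, dg.Resolvable):
    """Builds the Connections data and model assets from YAML-configured attributes."""

    # These fields correspond one-to-one with the `attributes:` block above.
    model_name: str
    connections_data_path: str
    performance_threshold: float
    optimization_enabled: bool
    train_test_split: float
    eval_subset_size: Optional[int]
    num_threads: int

    def build_defs(self, context: dg.ComponentLoadContext) -> dg.Definitions:
        # The puzzle_data_asset shown earlier is defined inside this method
        # (omitted here), which is how it can read self.connections_data_path,
        # self.train_test_split, and self.eval_subset_size.
        ...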
Next steps
Now that we have a robust data ingestion pipeline that loads, validates, and prepares puzzle data for training, we can move on to building the core AI solver that will use this data to learn puzzle-solving strategies.
- Continue this tutorial by building the DSPy solver