Ingest puzzle data
NYT Connections is a word puzzle where players find four groups of four related words in a 16-word grid. Each group has a difficulty level: Yellow (easiest), Green, Blue, and Purple (hardest). Our DSPy solver needs structured data to learn effective puzzle-solving strategies.
Data structure and loading
The foundation of our AI system starts with well-structured data classes that model the puzzle domain. These classes capture not just the raw puzzle data, but also the game state needed to track solving progress:
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class PuzzleGroup:
    """Represents a group in a Connections puzzle."""

    name: str
    color: str
    words: List[str]


@dataclass
class Puzzle:
    """Represents a complete Connections puzzle."""

    id: int
    date: str
    difficulty: float
    words: List[str]
    groups: List[PuzzleGroup]


@dataclass
class GameState:
    """Tracks the state of a game in progress."""

    puzzle: Puzzle
    solved_groups: Set[str]  # group colors that have been solved
    guess_count: int
    mistake_count: int
    invalid_count: int
    finished: bool
    won: bool
    start_time: Optional[float]
    end_time: Optional[float]
The Puzzle class serves as our core data structure, containing the 16 words and their correct group assignments with difficulty levels. The GameState class tracks the dynamic aspects of puzzle-solving: which groups have been solved, how many mistakes were made, and whether the game is complete. This separation allows our DSPy solver to reason about both the static puzzle structure and the evolving game dynamics.
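For example, a fresh game might be set up like this. The puzzle below is purely illustrative (invented words and group names, not from a real NYT puzzle), and serves only to show how the two dataclasses fit together:

import time

# Illustrative puzzle; real puzzles are loaded from the CSV dataset.
sample_puzzle = Puzzle(
    id=1,
    date="2024-01-01",
    difficulty=2.5,
    words=[
        "BASS", "FLOUNDER", "SOLE", "TROUT",
        "DRUM", "GUITAR", "PIANO", "VIOLIN",
        "BOLT", "DASH", "SPRINT", "TEAR",
        "CRANE", "JACK", "LIFT", "RAISE",
    ],
    groups=[
        PuzzleGroup(name="FISH", color="Yellow",
                    words=["BASS", "FLOUNDER", "SOLE", "TROUT"]),
        PuzzleGroup(name="INSTRUMENTS", color="Green",
                    words=["DRUM", "GUITAR", "PIANO", "VIOLIN"]),
        PuzzleGroup(name="MOVE QUICKLY", color="Blue",
                    words=["BOLT", "DASH", "SPRINT", "TEAR"]),
        PuzzleGroup(name="WAYS TO HOIST", color="Purple",
                    words=["CRANE", "JACK", "LIFT", "RAISE"]),
    ],
)

# A fresh GameState before the solver makes any guesses.
state = GameState(
    puzzle=sample_puzzle,
    solved_groups=set(),
    guess_count=0,
    mistake_count=0,
    invalid_count=0,
    finished=False,
    won=False,
    start_time=time.time(),
    end_time=None,
)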
The data loading process includes comprehensive validation to ensure puzzle integrity. Every puzzle must have exactly 16 words arranged in 4 groups of 4, with each group assigned one of the four difficulty colors. This validation prevents corrupted data from affecting model training and ensures consistent puzzle structure across the dataset.
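The structural checks are straightforward to express. Here is a minimal sketch of what they might look like; the validate_puzzle helper and its error messages are illustrative, not the tutorial's exact implementation:

VALID_COLORS = {"Yellow", "Green", "Blue", "Purple"}


def validate_puzzle(puzzle: Puzzle) -> None:
    """Raise ValueError if a puzzle violates the expected structure."""
    if len(puzzle.words) != 16:
        raise ValueError(f"Puzzle {puzzle.id}: expected 16 words, got {len(puzzle.words)}")
    if len(puzzle.groups) != 4:
        raise ValueError(f"Puzzle {puzzle.id}: expected 4 groups, got {len(puzzle.groups)}")
    for group in puzzle.groups:
        if len(group.words) != 4:
            raise ValueError(f"Puzzle {puzzle.id}: group '{group.name}' must have exactly 4 words")
    if {g.color for g in puzzle.groups} != VALID_COLORS:
        raise ValueError(f"Puzzle {puzzle.id}: groups must use each difficulty color exactly once")
    # Every grouped word must appear in the 16-word grid, and vice versa.
    grouped_words = {w for g in puzzle.groups for w in g.words}
    if grouped_words != set(puzzle.words):
        raise ValueError(f"Puzzle {puzzle.id}: group words do not match the 16-word grid")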
Creating training data
The data preparation process is handled by a Dagster asset that orchestrates the entire pipeline from raw CSV files to DSPy training examples. This approach ensures reproducible data splits and consistent preprocessing:
@dg.asset(
    name="connections_puzzle_data",
    compute_kind="python",
    group_name="data",
    description="Loads and splits Connections puzzle data for training and evaluation",
)
def puzzle_data_asset(
    context: dg.AssetExecutionContext,
) -> Dict[str, Any]:
    """Load and split Connections puzzle data."""
    # Note: this asset is defined inside the ConnectionsModelBuilder component
    # (see the component configuration below), so `self` refers to that
    # component and its configured attributes.
    # Get absolute path to connections data
    data_path = Path(self.connections_data_path)
    if not data_path.is_absolute():
        data_path = Path.cwd() / data_path

    context.log.info(f"Loading Connections data from: {data_path}")

    if not data_path.exists():
        raise FileNotFoundError(f"Connections data not found at: {data_path}")

    # Load puzzles
    puzzles = load_puzzles_from_csv(str(data_path))
    context.log.info(f"Loaded {len(puzzles)} puzzles")

    # Split data
    train_puzzles, test_puzzles = shuffle_and_split_puzzles(
        puzzles, self.train_test_split
    )

    # Take subset for evaluation if specified
    eval_puzzles = (
        test_puzzles[: self.eval_subset_size]
        if self.eval_subset_size
        else test_puzzles
    )
    context.log.info(
        f"Split: {len(train_puzzles)} training, {len(eval_puzzles)} evaluation puzzles"
    )

    context.add_output_metadata(
        {
            "total_puzzles": len(puzzles),
            "train_puzzles": len(train_puzzles),
            "eval_puzzles": len(eval_puzzles),
            "train_test_split": self.train_test_split,
            "data_path": str(data_path),
        }
    )

    return {
        "train_puzzles": train_puzzles,
        "eval_puzzles": eval_puzzles,
        "total_puzzles": len(puzzles),
    }
This asset performs several critical functions:
- It loads puzzles from the CSV file and validates their structure using our domain-specific validation logic.
- It splits the data into training and evaluation sets using a configurable ratio (default 25% for training), ensuring we have sufficient data for both optimization and testing.
- The asset also handles practical considerations, like taking a subset of puzzles for efficient evaluation during development, while maintaining the full dataset for comprehensive testing.
All operations are tracked with detailed metadata, providing visibility into dataset size, split ratios, and data quality metrics. This transparency is crucial for understanding model performance and debugging potential data issues.
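The shuffle_and_split_puzzles helper used in the asset can be a thin wrapper around a seeded shuffle. Here is a minimal sketch, assuming the second argument is the fraction of puzzles assigned to the training set (the actual helper in the project may differ):

import random
from typing import List, Tuple


def shuffle_and_split_puzzles(
    puzzles: List[Puzzle], train_fraction: float, seed: int = 42
) -> Tuple[List[Puzzle], List[Puzzle]]:
    """Shuffle puzzles deterministically and split them into train/test sets."""
    shuffled = puzzles.copy()
    random.Random(seed).shuffle(shuffled)
    split_index = int(len(shuffled) * train_fraction)
    return shuffled[:split_index], shuffled[split_index:]

Seeding the shuffle keeps the train/test split reproducible across runs, which matters when comparing optimized and unoptimized solvers on the same evaluation set.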
Component configuration
The data pipeline leverages Dagster Components to create reusable, configurable data processing workflows. This YAML configuration defines all the key parameters for the Connections puzzle solver:
type: project_dspy.components.connections_model_builder.ConnectionsModelBuilder

attributes:
  model_name: connections
  connections_data_path: data/Connections_Data.csv
  performance_threshold: 0.3
  optimization_enabled: true
  train_test_split: 0.25
  eval_subset_size: 10
  num_threads: 4
This configuration encapsulates specific defaults optimized for Connections puzzles. The data path points to our puzzle CSV file, the performance threshold (0.3) represents a reasonable baseline success rate for puzzle-solving, and the train/test split (0.25) balances having enough training data with sufficient evaluation coverage.
The component approach provides several benefits:
- Configurations can be easily modified for different environments or puzzle variants.
- The same component can be deployed across development and production with different parameters.
- The declarative YAML format makes it easy to understand and version control the pipeline settings.
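For orientation, the component referenced by the type: field might be structured roughly as follows. This is a minimal sketch, assuming a recent Dagster release where components subclass dg.Component and dg.Resolvable; the field names mirror the YAML attributes above, but the actual ConnectionsModelBuilder implementation may differ:

from typing import Optional

import dagster as dg


class ConnectionsModelBuilder(dg.Component, dg.Model, dg.Resolvable):
    """Builds the Connections data and model assets from YAML-configured attributes."""

    # These fields correspond one-to-one with the `attributes:` block above.
    model_name: str
    connections_data_path: str
    performance_threshold: float
    optimization_enabled: bool
    train_test_split: float
    eval_subset_size: Optional[int]
    num_threads: int

    def build_defs(self, context: dg.ComponentLoadContext) -> dg.Definitions:
        # The puzzle_data_asset shown earlier is defined inside this method
        # (omitted here), which is how it can read self.connections_data_path,
        # self.train_test_split, and self.eval_subset_size.
        ...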
Next steps
Now that we have a robust data ingestion pipeline that loads, validates, and prepares puzzle data for training, we can move on to building the core AI solver that will use this data to learn puzzle-solving strategies.
- Continue this tutorial by building the DSPy solver