Databricks (dagster_databricks)

The dagster_databricks package provides two main pieces of functionality:

  • A resource, databricks_pyspark_step_launcher, which executes a solid in a Databricks context on a cluster, so that the pyspark resource uses the cluster’s Spark instance.

  • A function, create_databricks_job_solid, which creates a solid that submits an external configurable job to Databricks using the ‘Run Now’ API.

Note that for databricks_pyspark_step_launcher, either S3 or Azure Data Lake Storage config must be specified for solids to succeed. The credentials for this storage must also be stored as a Databricks Secret and referenced in the resource config so that the Databricks cluster can access storage.
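As a rough illustration of the storage requirement above, the following sketch builds a storage block that refers to secrets by name rather than embedding credential values. The helper make_storage_config and every key name in it (s3, adls2, secret_scope, access_key_key, secret_key_key) are assumptions made for illustration and are not taken from this page; check the resource's config schema in your installed version before relying on them.

```python
from typing import Dict, Optional


def make_storage_config(
    secret_scope: str,
    s3_keys: Optional[Dict[str, str]] = None,
    adls2_keys: Optional[Dict[str, str]] = None,
) -> Dict[str, Dict[str, str]]:
    """Build a hypothetical 'storage' block that names secrets, not secret values."""
    # Exactly one storage backend (S3 or ADLS) must be configured.
    if (s3_keys is None) == (adls2_keys is None):
        raise ValueError("Specify exactly one of s3_keys or adls2_keys")
    if s3_keys is not None:
        return {"s3": {"secret_scope": secret_scope, **s3_keys}}
    return {"adls2": {"secret_scope": secret_scope, **adls2_keys}}


storage = make_storage_config(
    "dagster-example-scope",  # hypothetical Databricks secret scope name
    s3_keys={
        "access_key_key": "aws-access-key",  # hypothetical secret names (assumed keys)
        "secret_key_key": "aws-secret-key",
    },
)
```

The point of this shape is that only secret names appear in config; the credential values themselves stay in the Databricks secret store.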

dagster_databricks.create_databricks_job_solid(name='databricks_job', num_inputs=1, description=None, required_resource_keys=frozenset({'databricks_client'}))

Creates a solid that launches a Databricks job.

As config, the solid accepts a blob of the form described in Databricks’ job API: https://docs.databricks.com/dev-tools/api/latest/jobs.html.
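To give a sense of what such a config blob might look like, here is a minimal sketch that assembles run config as a plain Python dict. The field names inside the blob (run_name, new_cluster, notebook_task) come from the Databricks Jobs API. The helper build_job_run_config, the nesting under a "job" key, and the placeholder cluster values are assumptions for illustration; this page does not state which subset of the Jobs API the solid accepts.

```python
from typing import Any, Dict


def build_job_run_config(notebook_path: str, num_workers: int = 1) -> Dict[str, Any]:
    """Wrap a Jobs-API-style blob in run config for a solid named 'databricks_job'."""
    job_blob = {
        "run_name": "dagster-example-run",
        "new_cluster": {
            "spark_version": "6.5.x-scala2.11",  # placeholder runtime version
            "node_type_id": "i3.xlarge",  # placeholder node type
            "num_workers": num_workers,
        },
        "notebook_task": {"notebook_path": notebook_path},
    }
    # Assumption: the solid reads the blob from a "job" key in its config.
    return {"solids": {"databricks_job": {"config": {"job": job_blob}}}}


run_config = build_job_run_config("/Users/someone@example.com/example-notebook")
```

The outer solid key would need to match the name passed to create_databricks_job_solid, which defaults to 'databricks_job'.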

Returns

A solid definition.

Return type

SolidDefinition

dagster_databricks.databricks_pyspark_step_launcher ResourceDefinition

Resource for running solids as a Databricks Job.

When this resource is used, the solid will be executed in Databricks using the ‘Run Submit’ API. Pipeline code will be zipped up and copied to a directory in DBFS along with the solid’s execution context.

Use the ‘run_config’ configuration to specify the details of the Databricks cluster used, and the ‘storage’ key to configure persistent storage on that cluster. Storage is accessed by setting the credentials in the Spark context, as described in the Databricks documentation for S3 and for ADLS.
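To make "setting the credentials in the Spark context" more concrete, the sketch below shows the kind of Hadoop configuration entries typically involved. The property names are the standard Hadoop S3A and Azure ABFS keys; whether the step launcher sets exactly these keys is an assumption, and the helper names s3_hadoop_conf and adls2_hadoop_conf are hypothetical.

```python
from typing import Dict


def s3_hadoop_conf(access_key: str, secret_key: str) -> Dict[str, str]:
    """Hadoop S3A properties that carry AWS credentials."""
    return {
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
    }


def adls2_hadoop_conf(storage_account: str, account_key: str) -> Dict[str, str]:
    """Hadoop ABFS property that carries an ADLS Gen2 storage account key."""
    return {
        f"fs.azure.account.key.{storage_account}.dfs.core.windows.net": account_key,
    }


# Dummy values only; on a real cluster these would be read from a Databricks Secret.
conf = s3_hadoop_conf("EXAMPLE_ACCESS_KEY", "EXAMPLE_SECRET_KEY")
```

On a cluster, the values would normally be read with dbutils.secrets.get(scope=..., key=...) and then applied to the Spark or Hadoop configuration; that step is omitted here because it requires a live Spark session.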

class dagster_databricks.DatabricksError