Azure (dagster_azure)
Utilities for using Azure Storage Accounts with Dagster. This is mostly aimed at Azure Data Lake Storage Gen 2 (ADLS2) but also contains some utilities for Azure Blob Storage.
NOTE: This package is incompatible with dagster-snowflake. This is due to a version conflict over the underlying azure-storage-blob package: dagster-snowflake has a transitive dependency on an old version via snowflake-connector-python.
dagster_azure.adls2.adls2_resource ResourceDefinition

Resource that gives solids access to Azure Data Lake Storage Gen2.

The underlying client is a DataLakeServiceClient.

Attach this resource definition to a ModeDefinition in order to make it available to your solids.

Example:
from dagster import ModeDefinition, execute_solid, solid
from dagster_azure.adls2 import adls2_resource

@solid(required_resource_keys={'adls2'})
def example_adls2_solid(context):
    return list(context.resources.adls2.list_file_systems())

result = execute_solid(
    example_adls2_solid,
    run_config={
        'resources': {
            'adls2': {
                'config': {
                    'storage_account': 'my_storage_account'
                }
            }
        }
    },
    mode_def=ModeDefinition(resource_defs={'adls2': adls2_resource}),
)
Note that your solids must also declare that they require this resource with required_resource_keys, or it will not be initialized for the execution of their compute functions.
You may pass credentials to this resource using either a SAS token or a key, using environment variables if desired:
resources:
  adls2:
    config:
      storage_account: my_storage_account   # str: The storage account name.
      credential:
        sas: my_sas_token                   # str: The SAS token for the account.
        key:
          env: AZURE_DATA_LAKE_STORAGE_KEY  # str: The shared access key for the account.
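For illustration, here is a sketch of the equivalent run_config in Python, assuming the shared key is exported as the AZURE_DATA_LAKE_STORAGE_KEY environment variable (the solid name is hypothetical):

from dagster import ModeDefinition, execute_solid, solid
from dagster_azure.adls2 import adls2_resource

@solid(required_resource_keys={'adls2'})
def list_file_systems_solid(context):
    # Use the configured ADLS2 client to list file systems in the account.
    return list(context.resources.adls2.list_file_systems())

result = execute_solid(
    list_file_systems_solid,
    run_config={
        'resources': {
            'adls2': {
                'config': {
                    'storage_account': 'my_storage_account',
                    # Read the shared key from an environment variable rather
                    # than embedding it in the config.
                    'credential': {'key': {'env': 'AZURE_DATA_LAKE_STORAGE_KEY'}},
                }
            }
        }
    },
    mode_def=ModeDefinition(resource_defs={'adls2': adls2_resource}),
)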
class dagster_azure.adls2.FakeADLS2Resource(account_name, credential='fake-creds')

Stateful mock of an ADLS2Resource for testing.

Wraps a mock.MagicMock. Containers are implemented using an in-memory dict.
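A minimal sketch of using this mock in a test, assuming ResourceDefinition.hardcoded_resource to wire it in (the solid and test names are illustrative):

from dagster import ModeDefinition, ResourceDefinition, execute_solid, solid
from dagster_azure.adls2 import FakeADLS2Resource

@solid(required_resource_keys={'adls2'})
def my_adls2_solid(context):
    # Calls made against the fake resource are recorded by the underlying
    # MagicMock instead of reaching a real storage account.
    return context.resources.adls2

def test_my_adls2_solid():
    result = execute_solid(
        my_adls2_solid,
        mode_def=ModeDefinition(
            resource_defs={
                'adls2': ResourceDefinition.hardcoded_resource(
                    FakeADLS2Resource(account_name='fake-account')
                )
            }
        ),
    )
    assert result.success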
dagster_azure.adls2.adls2_intermediate_storage IntermediateStorageDefinition

Persistent intermediate storage using Azure Data Lake Storage Gen2 for storage.
Suitable for intermediate storage with distributed executors, so long as each execution node has network connectivity and credentials for ADLS and the backing container.
Attach this intermediate storage definition, as well as the adls2_resource it requires, to a ModeDefinition in order to make it available to your pipeline:

pipeline_def = PipelineDefinition(
    mode_defs=[
        ModeDefinition(
            resource_defs={'adls2': adls2_resource, ...},
            intermediate_storage_defs=[adls2_intermediate_storage],
            ...
        ),
        ...
    ],
    ...
)
You may configure this storage as follows:
intermediate_storage:
  adls2:
    config:
      adls2_sa: my-best-storage-account
      adls2_file_system: my-cool-file-system
      adls2_prefix: good/prefix-for-files-
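For illustration, a sketch of wiring this together end to end; the pipeline, solid, and storage names below are placeholders, and the config keys mirror the example above:

from dagster import ModeDefinition, PipelineDefinition, execute_pipeline, solid
from dagster_azure.adls2 import adls2_intermediate_storage, adls2_resource

@solid
def return_one(_):
    return 1

pipeline_def = PipelineDefinition(
    name='adls2_intermediates_example',
    solid_defs=[return_one],
    mode_defs=[
        ModeDefinition(
            resource_defs={'adls2': adls2_resource},
            intermediate_storage_defs=[adls2_intermediate_storage],
        )
    ],
)

result = execute_pipeline(
    pipeline_def,
    run_config={
        'resources': {
            'adls2': {'config': {'storage_account': 'my-best-storage-account'}}
        },
        'intermediate_storage': {
            'adls2': {
                'config': {
                    'adls2_sa': 'my-best-storage-account',
                    'adls2_file_system': 'my-cool-file-system',
                    'adls2_prefix': 'good/prefix-for-files-',
                }
            }
        },
    },
)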
class dagster_azure.blob.AzureBlobComputeLogManager(storage_account, container, secret_key, local_dir=None, inst_data=None, prefix='dagster')

Logs solid compute function stdout and stderr to Azure Blob Storage.

This is also compatible with Azure Data Lake Storage.

Users should not instantiate this class directly. Instead, use a YAML block in dagster.yaml such as the following:

compute_logs:
  module: dagster_azure.blob.compute_log_manager
  class: AzureBlobComputeLogManager
  config:
    storage_account: my-storage-account
    container: my-container
    credential: sas-token-or-secret-key
    prefix: "dagster-test-"
    local_dir: "/tmp/cool"
Parameters:
- storage_account (str) – The storage account name to which to log.
- container (str) – The container (or ADLS2 filesystem) to which to log.
- secret_key (str) – Secret key for the storage account. SAS tokens are not supported because we need a secret key to generate a SAS token for a download URL.
- local_dir (Optional[str]) – Path to the local directory in which to stage logs. Default: dagster.seven.get_system_temp_directory().
- prefix (Optional[str]) – Prefix for the log file keys.
- inst_data (Optional[ConfigurableClassData]) – Serializable representation of the compute log manager when newed up from config.
dagster_azure.adls2.adls2_pickle_io_manager IOManagerDefinition

Persistent IO manager using Azure Data Lake Storage Gen2 for storage.
Serializes objects via pickling. Suitable for object storage with distributed executors, so long as each execution node has network connectivity and credentials for ADLS and the backing container.
Attach this resource definition to a ModeDefinition in order to make it available to your pipeline:

pipeline_def = PipelineDefinition(
    mode_defs=[
        ModeDefinition(
            resource_defs={
                'io_manager': adls2_pickle_io_manager,
                'adls2': adls2_resource,
                ...
            },
        ),
        ...
    ],
    ...
)
You may configure this storage as follows:
resources:
  io_manager:
    config:
      adls2_file_system: my-cool-file-system
      adls2_prefix: good/prefix-for-files-
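For illustration, a minimal sketch of a pipeline whose solid outputs are persisted through this IO manager; the solid, pipeline, and storage names are placeholders:

from dagster import ModeDefinition, execute_pipeline, pipeline, solid
from dagster_azure.adls2 import adls2_pickle_io_manager, adls2_resource

@solid
def produce_value(_):
    return {'hello': 'world'}

@solid
def consume_value(_, value):
    # The upstream output is pickled to ADLS2 by the IO manager and loaded
    # back here before this solid runs.
    return len(value)

@pipeline(
    mode_defs=[
        ModeDefinition(
            resource_defs={
                'io_manager': adls2_pickle_io_manager,
                'adls2': adls2_resource,
            }
        )
    ]
)
def adls2_io_manager_pipeline():
    consume_value(produce_value())

result = execute_pipeline(
    adls2_io_manager_pipeline,
    run_config={
        'resources': {
            'adls2': {'config': {'storage_account': 'my_storage_account'}},
            'io_manager': {
                'config': {
                    'adls2_file_system': 'my-cool-file-system',
                    'adls2_prefix': 'good/prefix-for-files-',
                }
            },
        }
    },
)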