
Astro Databricks

Affordable Databricks Workflows in Apache Airflow


Astro Databricks is an Apache Airflow provider created by Astronomer for an optimal Databricks experience. With the DatabricksWorkflowTaskGroup, Astro Databricks allows you to run Databricks Workflows without triggering each Job individually, which can result in up to a 75% cost reduction.

Prerequisites

  • Apache Airflow >= 2.2.4
  • Python >= 3.7
  • Databricks account
  • Previously created Databricks Notebooks

Install

pip install astro-provider-databricks

Quickstart

  1. Use two pre-existing Databricks Notebooks or create two simple ones. Their identifiers will be used in step (5). The original example DAG uses:

    • Shared/Notebook_1
    • Shared/Notebook_2
  2. Generate a Databricks Personal Token. This will be used in step (6).

  3. Ensure that your Airflow environment is set up correctly by running the following commands:

    export AIRFLOW_HOME=`pwd`
    
    airflow db init
    
  4. Create a Databricks Airflow connection using your preferred method, so Airflow can access Databricks with your credentials. For example, run the following command, replacing the login, host, and password (use your personal access token as the password):

airflow connections add 'databricks_conn' \
    --conn-json '{
        "conn_type": "databricks",
        "login": "some.email@yourcompany.com",
        "host": "https://dbc-c9390870-65ef.cloud.databricks.com/",
        "password": "personal-access-token"
    }'
  5. Copy the following workflow into a file named example_databricks_workflow.py and add it to the dags directory of your Airflow project:

    https://github.com/astronomer/astro-provider-databricks/blob/45897543a5e34d446c84b3fbc4f6f7a3ed16cdf7/example_dags/example_databricks_workflow.py#L48-L101

    Alternatively, you can download example_databricks_workflow.py

     curl -O https://raw.githubusercontent.com/astronomer/astro-provider-databricks/main/example_dags/example_databricks_workflow.py
    
  6. Run the example DAG:

    airflow dags test example_databricks_workflow `date -Iseconds`
    

This will create a Databricks Workflow with two Notebook jobs.
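Step (4) can also be done without the Airflow CLI: since Airflow 2.3, connections can be defined through `AIRFLOW_CONN_<CONN_ID>` environment variables holding a JSON value. A minimal sketch, assuming the same placeholder host, login, and token used in the command above:

```python
import json
import os

# Same connection JSON as in the `airflow connections add` command.
# Host, login, and password below are placeholders -- substitute your own.
conn = {
    "conn_type": "databricks",
    "login": "some.email@yourcompany.com",
    "host": "https://dbc-c9390870-65ef.cloud.databricks.com/",
    "password": "personal-access-token",
}

# Airflow resolves connections from AIRFLOW_CONN_<CONN_ID> environment
# variables, so setting this is equivalent to creating `databricks_conn`
# via the CLI (JSON values require Airflow >= 2.3; older versions expect
# a URI-formatted string instead).
os.environ["AIRFLOW_CONN_DATABRICKS_CONN"] = json.dumps(conn)
```

This is convenient for containerized deployments, where connections are often injected as secrets rather than stored in the metadata database.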

Available features

  • DatabricksWorkflowTaskGroup: Airflow task group that allows users to create a Databricks Workflow.
  • DatabricksNotebookOperator: Airflow operator which abstracts a pre-existing Databricks Notebook. Can be used independently to run the Notebook, or within a Databricks Workflow Task Group.
  • AstroDatabricksPlugin: An Airflow plugin which is installed by default. It allows users to view a Databricks job from the Airflow UI and retry it in case of failure.
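Put together, the first two building blocks look roughly like the linked example DAG: a DatabricksWorkflowTaskGroup wrapping DatabricksNotebookOperator tasks, all of which run inside a single Databricks Workflow. A minimal sketch, assuming the `databricks_conn` connection and the notebook paths from the quickstart; the job-cluster specification is workspace-specific and elided here, and parameter names should be checked against the linked example DAG for your installed version:

```python
from datetime import datetime

from airflow import DAG
from astro_databricks import DatabricksNotebookOperator, DatabricksWorkflowTaskGroup

with DAG(
    dag_id="example_databricks_workflow",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
):
    # The task group creates ONE Databricks Workflow (job) covering every
    # task defined inside it, instead of one Job run per task.
    workflow = DatabricksWorkflowTaskGroup(
        group_id="workflow",
        databricks_conn_id="databricks_conn",
        job_clusters=[],  # workspace-specific new_cluster spec goes here
    )
    with workflow:
        notebook_1 = DatabricksNotebookOperator(
            task_id="notebook_1",
            databricks_conn_id="databricks_conn",
            notebook_path="/Shared/Notebook_1",
            source="WORKSPACE",
        )
        notebook_2 = DatabricksNotebookOperator(
            task_id="notebook_2",
            databricks_conn_id="databricks_conn",
            notebook_path="/Shared/Notebook_2",
            source="WORKSPACE",
        )
        # Ordinary Airflow dependency syntax orders tasks within the Workflow.
        notebook_1 >> notebook_2
```

Because the operators are regular Airflow tasks, the usual dependency operators and UI views apply; the task group is what batches them into a single Databricks Workflow run.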

Documentation

The documentation is a work in progress; we aim to follow the Diátaxis system.

Changelog

Astro Databricks follows semantic versioning for releases. Read the changelog to learn more about the changes introduced in each version.

Contribution guidelines

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Read the Contribution Guidelines for a detailed overview on how to contribute.

Contributors and maintainers should abide by the Contributor Code of Conduct.

License

Apache License 2.0

