
Affordable Databricks Workflows in Apache Airflow


Astro Databricks

Orchestrate your Databricks notebooks in Airflow and execute them as Databricks Workflows


The Astro Databricks Provider is an Apache Airflow provider created by Astronomer to run your Databricks notebooks as Databricks Workflows while keeping Airflow as the authoring interface. When you use the DatabricksWorkflowTaskGroup and DatabricksNotebookOperator, notebooks run as a Databricks Workflow, which can cut compute cost by up to 75% ($0.40/DBU for all-purpose compute vs. $0.10/DBU for Jobs compute).
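The two building blocks compose as in the following sketch, modeled on the provider's example DAG. The job cluster spec values (Spark version, node type, worker count) are illustrative placeholders, not recommendations; adjust them for your workspace:

```python
from datetime import datetime

from airflow import DAG
from astro_databricks import DatabricksNotebookOperator, DatabricksWorkflowTaskGroup

# Illustrative job cluster spec -- substitute values appropriate for your workspace.
job_cluster_spec = [
    {
        "job_cluster_key": "example_cluster",
        "new_cluster": {
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
        },
    }
]

with DAG("example_databricks", schedule_interval=None, start_date=datetime(2023, 1, 1)) as dag:
    # All notebooks inside this task group run as a single Databricks Workflow
    # on Jobs compute, rather than as separate all-purpose-compute runs.
    workflow = DatabricksWorkflowTaskGroup(
        group_id="workflow",
        databricks_conn_id="databricks_conn",
        job_clusters=job_cluster_spec,
    )
    with workflow:
        notebook_1 = DatabricksNotebookOperator(
            task_id="notebook_1",
            databricks_conn_id="databricks_conn",
            notebook_path="/Shared/Notebook_1",
            source="WORKSPACE",
            job_cluster_key="example_cluster",
        )
        notebook_2 = DatabricksNotebookOperator(
            task_id="notebook_2",
            databricks_conn_id="databricks_conn",
            notebook_path="/Shared/Notebook_2",
            source="WORKSPACE",
            job_cluster_key="example_cluster",
        )
        notebook_1 >> notebook_2
```

This is a DAG definition, so it only declares the workflow; the notebooks run when Airflow triggers the DAG.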

Prerequisites

  • Apache Airflow >= 2.2.4
  • Python >= 3.7
  • Databricks account
  • Previously created Databricks Notebooks

Install

pip install astro-provider-databricks

Quickstart (with Astro CLI)

  1. Use two pre-existing Databricks Notebooks, or create two simple ones. Their identifiers will be used in step (6). The original example DAG uses:

    • Shared/Notebook_1
    • Shared/Notebook_2
  2. Generate a Databricks Personal Token. This will be used in step (5).

  3. Create a new Astro CLI project (if you don't have one already):

    mkdir my_project && cd my_project
    astro dev init
    
  4. Add the following to your requirements.txt file:

    astro-provider-databricks
    
  5. Create a Databricks connection in Airflow. You can do this via the Airflow UI or the airflow_settings.yaml file by specifying the following fields:

    connections:
      - conn_id: databricks_conn
        conn_type: databricks
        conn_login: <your email, e.g. julian@astronomer.io>
        conn_password: <your personal access token, e.g. dapi1234567890abcdef>
        conn_host: <your databricks host, e.g. https://dbc-9c390870-65ef.cloud.databricks.com>
    
  6. Copy the following workflow into a file named example_databricks.py in your dags directory:

    https://github.com/astronomer/astro-provider-databricks/blob/45897543a5e34d446c84b3fbc4f6f7a3ed16cdf7/example_dags/example_databricks_workflow.py#L48-L101

  7. Run the following command to start your Airflow environment:

    astro dev start
    
  8. Open the Airflow UI at http://localhost:8080 and trigger the DAG. You can click on a task, and under the Details tab select "See Databricks Job Run" to open the job in the Databricks UI.

Quickstart (without Astro CLI)

  1. Use two pre-existing Databricks Notebooks, or create two simple ones. Their identifiers will be used in step (5). The original example DAG uses:

    • Shared/Notebook_1
    • Shared/Notebook_2
  2. Generate a Databricks Personal Token. This will be used in step (6).

  3. Ensure that your Airflow environment is set up correctly by running the following commands:

    export AIRFLOW_HOME=`pwd`
    
    airflow db init
    
  4. Create a Databricks connection in Airflow by running the following command, replacing the login with your Databricks email and the password with your personal access token:

    # If using Airflow 2.3 or higher:
    airflow connections add 'databricks_conn' \
        --conn-json '{
            "conn_type": "databricks",
            "login": "some.email@yourcompany.com",
            "host": "https://dbc-c9390870-65ef.cloud.databricks.com/",
            "password": "personal-access-token"
        }'
    
    # If using Airflow >= 2.2.4 and < 2.3:
    airflow connections add 'databricks_conn' --conn-type 'databricks' --conn-login 'some.email@yourcompany.com' --conn-host 'https://dbc-9c390870-65ef.cloud.databricks.com/' --conn-password 'personal-access-token'
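The payload passed to `--conn-json` above is plain JSON, so a quick stdlib round-trip can catch quoting or escaping mistakes before you paste it into the shell. The field values here are the same placeholders as above:

```python
import json

# Placeholder values -- substitute your own email, host, and token.
conn = {
    "conn_type": "databricks",
    "login": "some.email@yourcompany.com",
    "host": "https://dbc-c9390870-65ef.cloud.databricks.com/",
    "password": "personal-access-token",
}

payload = json.dumps(conn)
# Round-trip to confirm the string is valid JSON with the expected keys.
assert json.loads(payload)["conn_type"] == "databricks"
print(payload)
```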
    
  5. Copy the following workflow into a file named example_databricks_workflow.py and add it to the dags directory of your Airflow project:

    https://github.com/astronomer/astro-provider-databricks/blob/45897543a5e34d446c84b3fbc4f6f7a3ed16cdf7/example_dags/example_databricks_workflow.py#L48-L101

    Alternatively, you can download example_databricks_workflow.py directly:

    curl -O https://raw.githubusercontent.com/astronomer/astro-provider-databricks/main/example_dags/example_databricks_workflow.py
    
  6. Run the example DAG:

    airflow dags test example_databricks_workflow `date -Iseconds`
    

    This command logs, among other lines, the Databricks job run URL:

    [2023-03-13 15:27:09,934] {notebook.py:158} INFO - Check the job run in Databricks: https://dbc-c9390870-65ef.cloud.databricks.com/?o=4256138892007661#job/950578808520081/run/14940832
    

    This creates a Databricks Workflow with two notebook jobs. The workflow takes approximately two minutes to complete if the cluster is already up and running, or around five minutes if the cluster has to start first, depending on its initialisation time.
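The backticked `date -Iseconds` in step 6 simply supplies an ISO-8601 timestamp to use as the run's logical date. If you prefer, the same kind of timestamp can be produced from Python's standard library (shown here in UTC):

```python
from datetime import datetime, timezone

# ISO-8601 timestamp with second precision, e.g. 2023-03-13T15:27:09+00:00
exec_date = datetime.now(timezone.utc).isoformat(timespec="seconds")
print(exec_date)
```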

Available features

  • DatabricksWorkflowTaskGroup: Airflow task group that allows users to create a Databricks Workflow.
  • DatabricksNotebookOperator: Airflow operator which abstracts a pre-existing Databricks Notebook. Can be used independently to run the Notebook, or within a Databricks Workflow Task Group.
  • AstroDatabricksPlugin: An Airflow plugin installed by default. It lets users view a Databricks job run from the Airflow UI and retry it in case of failure.

Documentation

The documentation is a work in progress; we aim to follow the Diátaxis system.

Changelog

Astro Databricks follows semantic versioning for releases. Read the changelog to learn about the changes introduced in each version.

Contribution guidelines

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Read the Contribution Guidelines for a detailed overview on how to contribute.

Contributors and maintainers should abide by the Contributor Code of Conduct.

License

Apache License 2.0

