
A CLI to work with DataHub metadata


Metadata Ingestion

Python version 3.6+

This module hosts an extensible Python-based metadata ingestion system for DataHub. This supports sending data to DataHub using Kafka or through the REST API. It can be used through our CLI tool, with an orchestrator like Airflow, or as a library.

Getting Started

Prerequisites

Before running any metadata ingestion job, make sure that the DataHub backend services are all running. If you are trying this out locally, the easiest way to do that is through the quickstart Docker images.

Install from PyPI

The folks over at Acryl Data maintain a PyPI package for DataHub metadata ingestion.

# Requires Python 3.6+
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
datahub version
# If you see "command not found", try running this instead: python3 -m datahub version
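
A quick pre-flight check can catch the two install problems mentioned above: an interpreter older than 3.6, and a `datahub` executable that is not on `PATH`. The sketch below is only an illustration; the helper names `python_is_supported` and `datahub_invocation` are hypothetical and are not part of the DataHub package.

```python
import shutil
import sys

MINIMUM_PYTHON = (3, 6)  # the requirement stated for this package


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


def datahub_invocation():
    """Return the argv prefix for the CLI.

    Falls back to the module form (python3 -m datahub) when no
    `datahub` executable is found on PATH.
    """
    if shutil.which("datahub"):
        return ["datahub"]
    return [sys.executable, "-m", "datahub"]


print("Python supported:", python_is_supported())
print("Invoke the CLI with:", " ".join(datahub_invocation() + ["version"]))
```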

If you run into an error, try checking the common setup issues.

Installing Plugins

We use a plugin architecture so that you can install only the dependencies you actually need. Click the plugin name to learn more about the specific source recipe and any FAQs!

Sources:

Plugin Name Install Command Provides
file included by default File source and sink
athena pip install 'acryl-datahub[athena]' AWS Athena source
bigquery pip install 'acryl-datahub[bigquery]' BigQuery source
bigquery-usage pip install 'acryl-datahub[bigquery-usage]' BigQuery usage statistics source
datahub-business-glossary no additional dependencies Business Glossary File source
dbt no additional dependencies dbt source
druid pip install 'acryl-datahub[druid]' Druid source
feast pip install 'acryl-datahub[feast]' Feast source
glue pip install 'acryl-datahub[glue]' AWS Glue source
hive pip install 'acryl-datahub[hive]' Hive source
kafka pip install 'acryl-datahub[kafka]' Kafka source
kafka-connect pip install 'acryl-datahub[kafka-connect]' Kafka Connect source
ldap pip install 'acryl-datahub[ldap]' (extra requirements) LDAP source
looker pip install 'acryl-datahub[looker]' Looker source
lookml pip install 'acryl-datahub[lookml]' LookML source, requires Python 3.7+
mongodb pip install 'acryl-datahub[mongodb]' MongoDB source
mssql pip install 'acryl-datahub[mssql]' SQL Server source
mysql pip install 'acryl-datahub[mysql]' MySQL source
mariadb pip install 'acryl-datahub[mariadb]' MariaDB source
openapi pip install 'acryl-datahub[openapi]' OpenAPI source
oracle pip install 'acryl-datahub[oracle]' Oracle source
postgres pip install 'acryl-datahub[postgres]' Postgres source
redash pip install 'acryl-datahub[redash]' Redash source
redshift pip install 'acryl-datahub[redshift]' Redshift source
sagemaker pip install 'acryl-datahub[sagemaker]' AWS SageMaker source
snowflake pip install 'acryl-datahub[snowflake]' Snowflake source
snowflake-usage pip install 'acryl-datahub[snowflake-usage]' Snowflake usage statistics source
sql-profiles pip install 'acryl-datahub[sql-profiles]' Data profiles for SQL-based systems
sqlalchemy pip install 'acryl-datahub[sqlalchemy]' Generic SQLAlchemy source
superset pip install 'acryl-datahub[superset]' Superset source
trino pip install 'acryl-datahub[trino]' Trino source
starburst-trino-usage pip install 'acryl-datahub[starburst-trino-usage]' Starburst Trino usage statistics source
nifi pip install 'acryl-datahub[nifi]' NiFi source

Sinks:

Plugin Name Install Command Provides
file included by default File source and sink
console included by default Console sink
datahub-rest pip install 'acryl-datahub[datahub-rest]' DataHub sink over REST API
datahub-kafka pip install 'acryl-datahub[datahub-kafka]' DataHub sink over Kafka

These plugins can be mixed and matched as desired. For example:

pip install 'acryl-datahub[bigquery,datahub-rest]'
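
If you generate install commands from a list of plugins (for example in a provisioning script), the extras syntax can be assembled as in this small sketch. The function name `pip_install_command` is hypothetical; only the `acryl-datahub[...]` extras form comes from the tables above.

```python
def pip_install_command(plugins, package="acryl-datahub"):
    """Build a pip command that installs the package with the given extras."""
    extras = ",".join(dict.fromkeys(plugins))  # drop duplicates, keep order
    if not extras:
        return f"pip install '{package}'"
    return f"pip install '{package}[{extras}]'"


print(pip_install_command(["bigquery", "datahub-rest"]))
# pip install 'acryl-datahub[bigquery,datahub-rest]'
```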

You can check the active plugins:

datahub check plugins

Basic Usage

pip install 'acryl-datahub[datahub-rest]'  # install the required plugin
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml

The --dry-run option of the ingest command performs all of the ingestion steps except writing to the sink. This is useful for checking that the ingestion recipe produces the desired workunits before ingesting them into DataHub.

# Dry run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml --dry-run
# Short-form
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml -n

The --preview option of the ingest command performs all of the ingestion steps, but limits the processing to only the first 10 workunits produced by the source. This option helps with quick end-to-end smoke testing of the ingestion recipe.

# Preview
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml --preview
# Preview with dry-run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml -n --preview
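
To make the difference between the two flags concrete, here is a toy model of their semantics. It is not DataHub's implementation: `run_pipeline`, the `write` callable standing in for a sink, and the string workunits are all assumptions made for illustration. Only the limit of 10 workunits comes from the description above.

```python
from itertools import islice

PREVIEW_WORKUNITS = 10  # the --preview limit described above


def run_pipeline(workunits, write, dry_run=False, preview=False):
    """Process workunits and return how many were processed.

    `workunits` is any iterable; `write` is a callable standing in for the sink.
    """
    if preview:
        # Preview: stop after the first few workunits from the source.
        workunits = islice(workunits, PREVIEW_WORKUNITS)
    processed = 0
    for workunit in workunits:
        processed += 1
        if not dry_run:
            # Dry run: every step happens except the write to the sink.
            write(workunit)
    return processed


written = []
source = (f"workunit-{i}" for i in range(25))
count = run_pipeline(source, written.append, preview=True)
print(count, len(written))
# 10 10
```

Combining both flags in this model processes 10 workunits and writes none, which matches the "preview with dry-run" command above.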

Install using Docker


If you don't want to install locally, you can run metadata ingestion within a Docker container instead. We have prebuilt images available on Docker Hub. All plugins will be installed and enabled automatically.

Limitation: the datahub_docker.sh convenience script assumes that the recipe and any input/output files are accessible in the current working directory or its subdirectories. Files outside the current working directory will not be found, and you'll need to invoke the Docker image directly.

# Assumes the DataHub repo is cloned locally.
./metadata-ingestion/scripts/datahub_docker.sh ingest -c ./examples/recipes/example_to_datahub_rest.yml
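
Before calling the convenience script, you may want to confirm that a recipe path falls inside the current working directory. The helper below is a hypothetical sketch using only the standard library; it is not part of datahub_docker.sh.

```python
from pathlib import Path


def is_under_cwd(path):
    """Return True if `path` resolves to a location inside the current directory."""
    cwd = Path.cwd().resolve()
    try:
        # relative_to raises ValueError when the path is outside cwd.
        Path(path).resolve().relative_to(cwd)
    except ValueError:
        return False
    return True


print(is_under_cwd("./examples/recipes/example_to_datahub_rest.yml"))
```

If the check returns False, the limitation above applies and the Docker image has to be invoked directly.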

Install from source

If you'd like to install from source, see the developer guide.

Recipes

A recipe is a configuration file that tells our ingestion scripts where to pull data from (source) and where to put it (sink). Here's a simple example that pulls metadata from MSSQL and puts it into DataHub.

# A sample recipe that pulls metadata from MSSQL and puts it into DataHub
# using the REST API.
source:
  type: mssql
  config:
    username: sa
    password: ${MSSQL_PASSWORD}
    database: DemoData

transformers:
  - type: "fully-qualified-class-name-of-transformer"
    config:
      some_property: "some.value"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

We automatically expand environment variables in the config, similar to variable substitution in GNU bash or in docker-compose files. For details, see https://docs.docker.com/compose/compose-file/compose-file-v2/#variable-substitution.
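
As an illustration of what this expansion does to a recipe, the sketch below walks a nested config and substitutes `$VAR` and `${VAR}` references using the standard library's `string.Template`. It is not DataHub's own implementation, the name `expand_env` is hypothetical, and it does not handle the `${VAR:-default}` forms that docker-compose supports.

```python
import os
from string import Template


def expand_env(value, env=None):
    """Recursively expand $VAR / ${VAR} references in strings inside a config."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # safe_substitute leaves unknown variables untouched instead of raising.
        return Template(value).safe_substitute(env)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value


recipe = {
    "source": {
        "type": "mssql",
        "config": {"username": "sa", "password": "${MSSQL_PASSWORD}"},
    }
}
expanded = expand_env(recipe, {"MSSQL_PASSWORD": "example-secret"})
print(expanded["source"]["config"]["password"])
# example-secret
```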

To run a recipe, pass its path to the ingest command.

datahub ingest -c ./examples/recipes/mssql_to_datahub.yml

A number of recipes are included in the examples/recipes directory. For full info and context on each source and sink, see the pages described in the table of plugins.

Transformations

If you'd like to modify data before it reaches the ingestion sinks (for instance, to add owners or tags), you can write a transformer as your own module and integrate it with DataHub.

Check out the transformers guide for more info!
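
Conceptually, a transformer sits between the source and the sink and rewrites each record as it passes through. The sketch below shows that idea with plain dictionaries standing in for metadata events; it does not use DataHub's transformer interface, and `add_tags` and the record layout are assumptions made for illustration. The transformers guide describes the real interface.

```python
def add_tags(records, tags):
    """Yield copies of each record with `tags` appended, skipping duplicates."""
    for record in records:
        existing = list(record.get("tags", []))
        merged = existing + [tag for tag in tags if tag not in existing]
        # Build a new dict so the incoming record is left unmodified.
        yield {**record, "tags": merged}


records = [
    {"name": "DemoData.dbo.orders", "tags": ["pii"]},
    {"name": "DemoData.dbo.customers"},
]
for record in add_tags(records, ["pii", "demo"]):
    print(record["name"], record["tags"])
```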

Using as a library

In some cases, you might want to construct the MetadataChangeEvents yourself but still use this framework to emit that metadata to DataHub. In this case, take a look at the emitter interfaces, which can easily be imported and called from your own code.
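
A minimal sketch of emitting over REST follows. It assumes `acryl-datahub[datahub-rest]` is installed and a DataHub instance is reachable at http://localhost:8080. The wrapper name `emit_lineage` and the dataset names are hypothetical; the `mce_builder` helpers and `DatahubRestEmitter` are used as I understand them from the library's examples, so check them against the version you have installed.

```python
def emit_lineage(upstream_names, downstream_name, platform="bigquery",
                 server="http://localhost:8080"):
    """Build a lineage MetadataChangeEvent and send it to DataHub over REST."""
    # Imported lazily so this module loads even without DataHub installed.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    # Construct the lineage event from dataset URNs.
    lineage_mce = builder.make_lineage_mce(
        [builder.make_dataset_urn(platform, name) for name in upstream_names],
        builder.make_dataset_urn(platform, downstream_name),
    )

    # Create an emitter pointing at the REST endpoint and send the event.
    emitter = DatahubRestEmitter(server)
    emitter.emit_mce(lineage_mce)
    return lineage_mce
```

A call such as `emit_lineage(["upstream1", "upstream2"], "downstream")` would then send one lineage event for the hypothetical downstream dataset.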

Programmatic Pipeline

In some cases, you might want to configure and run a pipeline entirely from within your custom Python script. Here is an example of how to do it.
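
A minimal sketch, assuming the `mssql` and `datahub-rest` plugins are installed: the pipeline configuration is a dictionary with the same shape as the MSSQL recipe shown earlier. The helper names `build_recipe` and `run_recipe` are hypothetical, and reading the password from `MSSQL_PASSWORD` mirrors the recipe rather than any requirement of the library.

```python
import os


def build_recipe(server="http://localhost:8080"):
    """Return a recipe dict equivalent to the MSSQL YAML example."""
    return {
        "source": {
            "type": "mssql",
            "config": {
                "username": "sa",
                "password": os.environ.get("MSSQL_PASSWORD", ""),
                "database": "DemoData",
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


def run_recipe(recipe):
    """Create a pipeline from the recipe dict, run it, and fail on errors."""
    # Imported lazily so build_recipe can be used without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
```

Calling `run_recipe(build_recipe())` runs the ingestion in-process, much like `datahub ingest -c` does for a YAML file.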

Developing

See the guides on developing, adding a source and using transformers.
