Opinionated framework based on Airflow 2.0 for building pipelines to ingest data into a BigQuery data warehouse

Project description

gcp-airflow-foundations

Airflow is an awesome open source orchestration framework and the go-to choice for building data ingestion pipelines on GCP (using Composer, a hosted Airflow service). However, most companies using it face the same set of problems:

  • Learning curve: Airflow requires Python knowledge and has some gotchas that take time to learn. Further, writing Python DAGs for every single table that needs to be ingested becomes cumbersome. Most companies end up building utilities that create DAGs from configuration files, both to simplify DAG creation and to allow non-developers to configure ingestion.
  • Data lake and data pipeline design best practices: Airflow only provides the building blocks. Users still have to understand and implement the nuances of building a proper ingestion pipeline for the data lake or data warehouse platform they are using.
  • Core reusability and best practice enforcement across the enterprise: Usually each team maintains its own Airflow source code and deployment.
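The config-to-DAG utilities mentioned in the first bullet can be sketched roughly as follows. This is a minimal illustration of the general pattern, not this framework's configuration schema or API: the field names (`source`, `schedule`, `tables`), the task names, and the helpers `TableConfig`, `parse_config`, and `build_dag_specs` are all hypothetical. The output is plain dictionaries rather than Airflow DAG objects so that the sketch runs without Airflow installed.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class TableConfig:
    """One table to ingest. Field names are hypothetical, for illustration."""
    name: str
    schedule: str


def parse_config(raw: Dict[str, Any]) -> List[TableConfig]:
    """Validate a raw config mapping and return one TableConfig per table."""
    if not raw.get("source"):
        raise ValueError("config must name a 'source'")
    # Tables inherit the top-level schedule unless they override it.
    default_schedule = raw.get("schedule", "@daily")
    return [
        TableConfig(name=t["name"], schedule=t.get("schedule", default_schedule))
        for t in raw.get("tables", [])
    ]


def build_dag_specs(raw: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one config into one DAG description per table.

    A real utility would construct Airflow DAG objects here; plain dicts
    keep the sketch runnable without Airflow installed.
    """
    source = raw.get("source")
    return [
        {
            "dag_id": f"{source}_to_bq_{table.name}",
            "schedule": table.schedule,
            # Hypothetical task names, assumed for the example.
            "tasks": ["extract", "load_to_staging", "merge_to_target"],
        }
        for table in parse_config(raw)
    ]


# Example: what a parsed YAML or JSON config file might look like.
config = {
    "source": "crm",
    "schedule": "@hourly",
    "tables": [{"name": "users"}, {"name": "orders", "schedule": "@daily"}],
}
specs = build_dag_specs(config)
```

Keeping parsing and validation separate from DAG construction means a malformed file fails early with a clear error, and a non-developer only ever edits the configuration.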

We have written an opinionated yet flexible framework for building ingestion pipelines into a BigQuery data warehouse. It supports the following features:

  • Zero-code, config-file-based ingestion - anybody can start ingesting from the growing number of sources by just providing a simple configuration file. Zero Python or Airflow knowledge is required.
  • Modular and extendable - The core of the framework is a lightweight library. Ingestion sources are added as plugins. Adding a new source can be done by extending the provided base classes.
  • Opinionated automatic creation of ODS (Operational Data Store) and HDS (Historical Data Store) in BigQuery while enforcing best practices such as schema migration, data quality validation, idempotency, partitioning, etc.
  • Dataflow job support for ingesting large datasets from SQL sources and deploying jobs into a specific network or shared VPC.
  • Support for advanced Airflow features for job prioritization, such as slots and priorities.
  • Integration with GCP data services such as DLP and Data Catalog [work in progress].
  • Well tested - We maintain a rich suite of both unit and integration tests.
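To make the ODS/HDS distinction and the idempotency requirement concrete, here is a small in-memory sketch. It assumes the common reading of the terms: the ODS keeps only the latest version of each row, and the HDS keeps every version with validity timestamps. It does not show how the framework implements these tables in BigQuery; the function names and the `valid_from`/`valid_to` columns are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def upsert_ods(ods: Dict[object, Row], batch: List[Row], key: str) -> Dict[object, Row]:
    """Return a new ODS in which each key maps to its most recent row."""
    result = dict(ods)
    for row in batch:
        result[row[key]] = dict(row)
    return result


def apply_hds(hds: List[Row], batch: List[Row], key: str, load_ts: str) -> List[Row]:
    """Return a new history: close the open version of changed rows, open a new one."""
    result = [dict(r) for r in hds]  # copy so the input history is not mutated
    open_by_key = {r[key]: r for r in result if r["valid_to"] is None}
    for row in batch:
        current = open_by_key.get(row[key])
        if current is not None:
            payload = {k: v for k, v in current.items()
                       if k not in ("valid_from", "valid_to")}
            if payload == row:
                continue  # unchanged: re-running the same batch adds nothing
            current["valid_to"] = load_ts
        new_version = {**row, "valid_from": load_ts, "valid_to": None}
        result.append(new_version)
        open_by_key[row[key]] = new_version
    return result


# First load, then the same load replayed, then a real change.
batch1 = [{"id": 1, "status": "new"}]
batch2 = [{"id": 1, "status": "shipped"}]

ods = upsert_ods({}, batch1, "id")
hds = apply_hds([], batch1, "id", "2024-01-01")
hds_replayed = apply_hds(hds, batch1, "id", "2024-01-01")
ods2 = upsert_ods(ods, batch2, "id")
hds2 = apply_hds(hds, batch2, "id", "2024-01-02")
```

Replaying `batch1` leaves both stores unchanged, which is the property that makes a retried or backfilled run safe.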

Installing from PyPI

pip install 'gcp-airflow-foundations'

Usage

See the gcp-airflow-foundations documentation for more details.

Project details

Version: 0.3.5

Source Distribution

gcp-airflow-foundations-dev-0.3.5.tar.gz (57.3 kB)

File details

Details for the file gcp-airflow-foundations-dev-0.3.5.tar.gz.

File metadata

  • Download URL: gcp-airflow-foundations-dev-0.3.5.tar.gz
  • Upload date:
  • Size: 57.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for gcp-airflow-foundations-dev-0.3.5.tar.gz

  • SHA256: 3dd7dfa1d0ae9829da9d9bf9048bcfd070d0dfd15ca58f093ce71d52b7cfae5d
  • MD5: df75b5a55bec2983ea7b456f0a9084db
  • BLAKE2b-256: 3fbb5bfcc98d550a1b0672518a451c01993fff6092107896a0d04b6b4d37366c
