
Airbyte made easy (no UI, no database, no cluster)




Why airbyte_serverless?

At Unytics, we ❤️ Airbyte, which provides a catalog of open-source connectors to move your data from any source to your data warehouse.

Airbyte Open-Source Platform is "batteries included" 🔋.
You get a server, workers, a database, a UI, an orchestrator, connectors, a secret manager, a logs manager, etc. All of this is very well packaged and deployable on Kubernetes. While we believe this is great for most people, we strive for lightweight and simple assets that are easy to deploy and maintain. What's more, we ❤️ serverless.

👉 We wanted a simple tool to manage Airbyte connectors, run them locally or deploy them in serverless mode.


Airbyte Open-Source Platform vs airbyte_serverless

💡 Airbyte Serverless deliberately does less than the Airbyte Open-Source Platform.

Deployment
  • Airbyte Open-Source Platform: deployed on a VM or Kubernetes cluster.
  • Airbyte Serverless: deployed with serverless compute.
    - Each Airbyte source docker image is upgraded with a destination connector from airbyte_serverless.
    - Each upgraded docker image can then be deployed as an isolated Cloud Run Job (or Cloud Run Service).
    - Cloud Run is natively monitored with metrics, dashboards, logs, error reporting, alerting, etc.
    - Jobs can be scheduled or triggered by cloud events.

Database
  • Airbyte Open-Source Platform: has a database.
  • Airbyte Serverless: has NO database.
    - The destination stores the state (the record of where the sync stopped).
    - The destination stores the logs, which can then be visualized with your preferred BI tool.
    - Connector configurations can be stored in config files and versioned in git.

UI
  • Airbyte Open-Source Platform: has a UI to edit configuration.
  • Airbyte Serverless: has NO UI. Configurations are generated as documented YAML files that you can edit and version.

Scalability
  • Airbyte Open-Source Platform: is scalable if deployed on an autoscaled Kubernetes cluster.
  • Airbyte Serverless: is scalable. Each connector is deployed independently of the others; you can have as many as you want.

Transform layer
  • Airbyte Open-Source Platform: has a transform layer. Airbyte loads your data in raw format but then lets you perform basic transformations such as replace, upsert, and schema normalization.
  • Airbyte Serverless: has NO transform layer. Data is appended to your destination in raw format. We believe less is more: airbyte_serverless is dedicated to doing one thing well, Extract-Load, which makes it easier to maintain and evolve.

Features

  1. ⚡ A lightweight Python wrapper around any Airbyte Source executable.
  2. ⚡ Destination connectors (only BigQuery for now - contributions are welcome 🤗) which store logs and states in addition to data. Thus, there is no need for a database anymore!
  3. ⚡ Examples to deploy to serverless compute (only Google Cloud Run for now - contributions are welcome 🤗).

Getting Started

0. Install

pip install airbyte-serverless

1. Create an Airbyte Source from an Airbyte Source Executable

If you have Docker installed on your laptop, the easiest way is to write the following code in a file getting_started.py (replace surveymonkey with the source you want). It should then work directly when you run python getting_started.py. If it does not, please raise an issue.

from airbyte_serverless.sources import AirbyteSource

airbyte_source_executable = 'docker run --rm -i airbyte/source-surveymonkey:latest'
source = AirbyteSource(airbyte_source_executable)
If you don't have Docker (or don't want to use it)

It is also possible to clone the Airbyte repo and install a Python source connector:

  1. Clone the repo.
  2. Go to the directory of the connector: cd airbyte-integrations/connectors/source-surveymonkey
  3. Install the Python connector: pip install -r requirements.txt
  4. Create the file getting_started.py there and set airbyte_source_executable = 'python main.py' (see the sketch after this list).
  5. You can now run python getting_started.py; it should also work. If it does not, please raise an issue.
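
For illustration, a minimal sketch of this non-Docker variant, assuming getting_started.py sits inside the connector directory:

from airbyte_serverless.sources import AirbyteSource

# Run the connector's entrypoint directly with Python instead of Docker
airbyte_source_executable = 'python main.py'
source = AirbyteSource(airbyte_source_executable)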

2. Update config for your Airbyte Source

Your Airbyte Source needs some config to be able to connect. Show a pre-filled config for your connector with:

print(source.config)

Copy the content, edit it and update the variable:

source.config = '''
YOUR UPDATED CONFIG
'''
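
For illustration only: the printed config is a documented YAML template whose fields depend entirely on the connector. The field names below are hypothetical placeholders, not the real surveymonkey schema:

source.config = '''
# REQUIRED | string | hypothetical credential field, for illustration only
access_token: YOUR_ACCESS_TOKEN
# OPTIONAL | string | hypothetical replication start date, for illustration only
start_date: "2021-01-01T00:00:00Z"
'''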

3. Check your config

print(source.connection_status)

4. Update configured_catalog for your Airbyte Source

The source catalog lists the available streams (think entities) that the source is able to retrieve. The configured_catalog specifies which streams to extract and how. Show the default configured_catalog with:

print(source.configured_catalog)

If needed, copy the content, edit it and update the variable:

source.configured_catalog = {
   ...YOUR UPDATED CONFIG
}
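
As an illustration, a configured catalog follows the Airbyte protocol shape: a list of streams, each with a sync_mode and a destination_sync_mode. The stream name surveys below is an assumption made for the example:

source.configured_catalog = {
    "streams": [
        {
            "stream": {
                "name": "surveys",  # hypothetical stream name, for illustration
                "json_schema": {},
                "supported_sync_modes": ["full_refresh", "incremental"],
            },
            "sync_mode": "incremental",
            "destination_sync_mode": "append",
        },
    ],
}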

5. Test the retrieval of one data record

print(source.first_record)

6. Create a destination and run Extract-Load

from airbyte_serverless.destinations import BigQueryDestination

destination = BigQueryDestination(dataset='YOUR-PROJECT.YOUR_DATASET')
data = source.extract()
destination.load(data)

7. Run Extract-Load from where you stopped

The state keeps track of where the latest extract-load stopped (for incremental extract-loads). To resume from this state, run:

state = destination.get_state()
data = source.extract(state=state)
destination.load(data)
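
Putting steps 6 and 7 together, a recurring job can always resume from the stored state. A minimal sketch, assuming get_state returns an empty state before the first run:

def run_extract_load(source, destination):
    # Resume from where the previous run stopped (assumption: get_state
    # returns an empty state before the first run, so this covers both cases)
    state = destination.get_state()
    data = source.extract(state=state)
    destination.load(data)

run_extract_load(source, destination)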

End to End Example

from airbyte_serverless.sources import AirbyteSource
from airbyte_serverless.destinations import BigQueryDestination

airbyte_source_executable = 'docker run --rm -i airbyte/source-surveymonkey:latest'
config = 'YOUR CONFIG'
configured_catalog = {YOUR CONFIGURED CATALOG}
source = AirbyteSource(airbyte_source_executable, config=config, configured_catalog=configured_catalog)

destination = BigQueryDestination(dataset='YOUR-PROJECT.YOUR_DATASET')

state = destination.get_state()
data = source.extract(state=state)
destination.load(data)

Deploy

To deploy as a Cloud Run job, edit the Dockerfile to pick the Airbyte source you want, then build the image and deploy it as a Cloud Run job.

Limitations

  • The BigQuery destination connector only works in append mode
  • Data at the destination is in raw format; no parsing is done

We believe, like Airbyte, that decoupling data movement from data transformation is a good thing. To shape your data, you may want to use a tool such as dbt; thus, we follow the EL-T philosophy.

Credits

The generation of the sample connector configuration in YAML is heavily inspired by the code of the octavia CLI developed by Airbyte.

Contribute

Any contribution is more than welcome 🤗!

  • Add a ⭐ on the repo to show your support
  • Open an issue to report a bug or suggest improvements
  • Open a PR! Below are some suggestions of work to be done:
    • improve secrets management
    • implement a CLI
    • manage configurations as yaml files
    • implement the get_logs method of BigQueryDestination
    • add a new destination connector (Cloud Storage?)
    • add more serverless deployment examples
    • implement optional post-processing (replace, upsert data at destination instead of append?)
