Airbyte made easy (no UI, no database, no cluster)
🔍️ What is AirbyteServerless?
AirbyteServerless is a simple tool to manage Airbyte connectors, run them locally or deploy them in serverless mode.
💡 Why AirbyteServerless?
Airbyte is a must-have in your data-stack with its catalog of open-source connectors to move your data from any source to your data-warehouse.
To manage these connectors, Airbyte offers Airbyte-Open-Source-Platform which includes a server, workers, database, UI, orchestrator, connectors, secret manager, logs manager, etc.
AirbyteServerless aims to offer a lightweight alternative to Airbyte-Open-Source-Platform to simplify connector management.
📝 Comparing Airbyte-Open-Source-Platform & AirbyteServerless
Airbyte-Open-Source-Platform | AirbyteServerless |
---|---|
Has a UI | Has NO UI - Connection configurations are managed by documented yaml files |
Has a database | Has NO database - Configuration files are versioned in git - The destination stores the state (the checkpoint of where a sync stops) and the logs, which can then be visualized with your preferred BI tool |
Has a transform layer - Airbyte loads your data in a raw format but then enables you to perform basic transforms such as replace, upsert, schema normalization | Has NO transform layer - Data is appended to your destination in raw format - airbyte_serverless is dedicated to doing one thing and doing it well: Extract-Load |
NOT Serverless - Can be deployed on a VM or a Kubernetes cluster - The platform is made of tens of dependent containers that you CANNOT deploy in a serverless way | Serverless - An Airbyte source docker image is upgraded with a destination connector - The upgraded docker image can then be deployed as an isolated Cloud Run Job (or Cloud Run Service) - Cloud Run is natively monitored with metrics, dashboards, logs, error reporting, alerting, etc - It can be scheduled or triggered by events |
Is scalable with conditions - Scalable if deployed on an autoscaled Kubernetes cluster and if you are skilled enough - 👉 Check that you are skilled enough with Kubernetes by watching this video 😁 | Is scalable - Each connector is deployed independently of the others - You can have as many as you want |
💥 Getting Started with abs CLI
`abs` is the CLI (command-line interface) of AirbyteServerless which facilitates connector management.
Install abs 🛠️
pip install airbyte-serverless
Create your first Connection 👨‍💻
abs create my_first_connection --source="airbyte/source-faker:0.1.4" --destination="bigquery" --remote-runner "cloud_run_job"
- Docker is required. Make sure you have it installed.
- The `source` param can be any Public Docker Airbyte Source (here is the list). We recommend that you use the `faker` source to get started.
- The `destination` param must be one of the following: `bigquery` (contributions are welcome to offer more destinations 🤗).
- The `remote-runner` param must be `cloud_run_job`. More integrations will come in the future. This remote runner is only used if you want to run the connection on a remote runner and schedule it.
- The command will create a configuration file `./connections/my_first_connection.yaml` with an initialized configuration (a hedged sketch of such a file follows this list).
- Update this configuration file to suit your needs.
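For orientation, here is a hedged sketch of what the generated file can look like. It is illustrative only: `abs create` generates the real file from the connector's spec, and the field names below that are not mentioned elsewhere in this README are assumptions.

```yaml
# ./connections/my_first_connection.yaml -- illustrative sketch only
source:
  docker_image: "airbyte/source-faker:0.1.4"  # the Airbyte source docker image
  config:                                     # source-specific settings taken from the connector spec
    count: 100                                # example faker setting (assumption)
destination:
  connector: "bigquery"                       # field name is an assumption; bigquery is the only destination today
  config:
    ...                                       # destination settings (see the "Run it!" section)
remote_runner:
  type: "cloud_run_job"                       # only needed for abs remote-run (field name is an assumption)
  config:
    ...                                       # remote runner settings (see the "Remote Runner" section)
```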
Run it! ⚡
abs run my_first_connection
- This will launch an Extract-Load Job from the source to the destination.
- The `run` command will only work if you have correctly edited the `./connections/my_first_connection.yaml` configuration file.
- If you chose the `bigquery` destination, you must:
  - have `gcloud` installed on your machine with default credentials initialized with the command `gcloud auth application-default login`,
  - have correctly edited the `destination` section of the `./connections/my_first_connection.yaml` configuration file (a hedged example follows this list). You must have `dataEditor` permission on the chosen BigQuery dataset.
- Data is always appended at the destination (not replaced nor upserted). It will be in raw format.
- If the connector supports incremental extract (extracting only new or recently modified data), then this mode is chosen.
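As a hedged example of an edited `destination` section (the field names under `config` are assumptions; check the generated file for the real ones):

```yaml
# destination section of ./connections/my_first_connection.yaml
# (sketch; field names under config are assumptions)
destination:
  connector: "bigquery"
  config:
    dataset: "my-gcp-project.my_dataset"  # dataset where raw data is appended;
                                          # your credentials need dataEditor on it
```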
Select only some streams 🧛🏼
You may not want to copy all the data that the source can get. To see all available streams, run:
abs list-available-streams my_first_connection
If you want to configure your connection with only some of these streams, run:
abs set-streams my_first_connection "stream1,stream2"
Next `run` executions will extract the selected streams only.
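Under the hood, the selection ends up in the connection's yaml configuration; a hedged sketch (the exact name and placement of the field are assumptions):

```yaml
# ./connections/my_first_connection.yaml (sketch; field placement is an assumption)
source:
  docker_image: "airbyte/source-faker:0.1.4"
  config:
    ...
  streams: "stream1,stream2"  # comma-separated list; next runs extract only these streams
```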
Handle Secrets 🔒
For security reasons, you do NOT want to store secrets such as api tokens in your yaml files. Instead, add your secrets to Google Secret Manager by following this documentation. Then you can reference the secret resource name in the yaml file as shown below:

```yaml
source:
  docker_image: "..."
  config:
    api_token: GCP_SECRET({SECRET_RESOURCE_NAME})
```
Replace `{SECRET_RESOURCE_NAME}` with your secret resource name, which must have the format `projects/{PROJECT_ID}/secrets/{SECRET_ID}/versions/{SECRET_VERSION}`. To get this path:
- Go to the Secret Manager page in the Google Cloud console.
- On the Secret Manager page, click on the Name of a secret.
- On the Secret details page, in the Versions table, locate a secret version to access.
- In the Actions column, click on the three dots.
- Click on 'Copy Resource Name' from the menu.
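For illustration only, with dummy values for the project, secret, and version, the config line would then look like:

```yaml
# all values below are dummy placeholders; use your own resource name
api_token: GCP_SECRET(projects/my-gcp-project/secrets/my-api-token/versions/1)
```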
Run from the Remote Runner 🚀
abs remote-run my_first_connection
- The `remote-run` command will only work if you have correctly edited the `./connections/my_first_connection.yaml` configuration file, including the `remote_runner` part (a hedged sketch of that part follows this list).
- This command will launch an Extract-Load Job like the `abs run` command. The main difference is that the job will run on a remotely deployed container (we use Cloud Run Job as the only container runner for now).
- If you chose the `bigquery` destination, the service account you put in the `service_account` field of the `remote_runner` section of the yaml must be `bigquery.dataEditor` on the target dataset and have permission to create BigQuery jobs in the project.
- If your yaml config contains some Google Secrets, the service account you put in the `service_account` field of the `remote_runner` section of the yaml must have read access to the secrets.
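As a hedged sketch of that `remote_runner` part (only the `service_account` field is named in this README; the other field names are assumptions):

```yaml
# remote_runner section of ./connections/my_first_connection.yaml
# (sketch; field names other than service_account are assumptions)
remote_runner:
  type: "cloud_run_job"
  config:
    project: "my-gcp-project"     # where the Cloud Run Job is deployed
    region: "europe-west1"
    service_account: "abs-runner@my-gcp-project.iam.gserviceaccount.com"
    # needs bigquery.dataEditor on the target dataset, permission to create
    # BigQuery jobs in the project, and read access to any referenced secrets
```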
Schedule the run from the Remote Runner ⏱️
abs schedule-remote-run my_first_connection "0 * * * *"
⚠️ THIS IS NOT IMPLEMENTED YET
Get help 📙
```
$ abs --help
Usage: abs [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  create                  Create CONNECTION
  list                    List created connections
  list-available-streams  List available streams of CONNECTION
  remote-run              Run CONNECTION Extract-Load Job from remote runner
  run                     Run CONNECTION Extract-Load Job
  run-env-vars            Run Extract-Load Job configured by environment...
  set-streams             Set STREAMS to retrieve for CONNECTION (STREAMS...
```
Keep in touch 🧑‍💻
Join our Slack for any question, to get help getting started, to report a bug, to suggest improvements, or simply if you want to have a chat 🙂.
👋 Contribute
Any contribution is more than welcome 🤗!
- Add a ⭐ on the repo to show your support
- Join our Slack and talk with us
- Raise an issue to report a bug or suggest improvements
- Open a PR! Below are some suggestions of work to be done:
  - implement a scheduler
  - implement the `get_logs` method of `BigQueryDestination`
  - enable updating the Cloud Run Job instead of deleting/creating it when it already exists
  - add a new destination connector (Cloud Storage?)
  - add more remote runners, such as compute instances
  - implement VPC access
  - implement optional post-processing (replace or upsert data at the destination instead of appending?)
🏆 Credits
- Big kudos to Airbyte for all the hard work on connectors!
- The generation of the sample connector configuration in yaml is heavily inspired by the code of the `octavia` CLI developed by Airbyte.