
jx-bigquery

JSON Expressions for BigQuery

Status

June 2020 - The library can insert JSON documents into BigQuery while managing the schema. Queries are not yet supported.

Overview

The library is intended to manage multiple BigQuery tables to give the illusion of one table with a dynamically managed schema.

Definitions

  • partition - Big data is split into separate containers based on age. This lets queries on recent data use fewer resources, and allows old data to be dropped quickly.
  • cluster - another name for the sorted order of the data within a partition. Sorting by the most commonly used lookup columns makes queries faster.
  • id - The set of columns that identifies the document
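The partition definition above can be sketched in plain Python. This is an illustration only, assuming a monthly bucket and a fixed "now"; the field name `submit_time` and both helpers are hypothetical, not part of the jx-bigquery API:

```python
from datetime import datetime, timedelta

def partition_key(doc, field="submit_time"):
    """Route a document to a partition (here, one per month) by its time field."""
    return doc[field].strftime("%Y%m")

def is_expired(doc, now, field="submit_time", max_age=timedelta(days=730)):
    """A whole partition older than max_age can be dropped at once."""
    return now - doc[field] > max_age

doc = {"submit_time": datetime(2020, 5, 15)}
print(partition_key(doc))                         # 202005
print(is_expired(doc, now=datetime(2020, 6, 1)))  # False
```

Because all rows in a partition share an age bucket, expiry is a cheap metadata operation rather than a scan.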

Configuration

  • table - Any name you wish to give to this table series
  • top_level_fields - BigQuery demands that control columns are top-level. Define them here.
  • partition -
    • field - The dot-delimited field used to partition the tables (must be time datatype)
    • expire - When BigQuery will automatically drop your data.
  • id - The identification of documents
    • field - the set of columns to uniquely identify this document
    • version - the column used to determine the age of a document; a newer version replaces an older one
  • cluster - Columns used to sort the partitions
  • schema - {name: type} dictionary - needed when there is no data; BigQuery demands column definitions
  • sharded - boolean - set to true to allow this library to track multiple tables. This enables schema migration (expansion only) and faster inserts from a multitude of machines
  • account_info - The information BigQuery provides to connect
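A rough sketch of what "expansion only" schema migration means: new columns may be added, but an existing column may never change type. The `expand_schema` helper below is hypothetical, not the library's API:

```python
def expand_schema(old, new):
    """Merge two {name: type} schemas; adding columns is allowed,
    changing the type of an existing column is not."""
    merged = dict(old)
    for name, type_ in new.items():
        if name in merged and merged[name] != type_:
            raise ValueError(f"cannot change type of column {name}")
        merged[name] = type_
    return merged

print(expand_schema({"id": "integer"}, {"id": "integer", "note": "string"}))
# {'id': 'integer', 'note': 'string'}
```

Expansion-only migration is what makes it safe for many machines to insert into separate shards concurrently: their schemas can always be reconciled later.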

Example

This is a complicated example. See tests/config.json for a minimal example.

{
    "table": "my_table_name",
    "top_level_fields": {},
    "partition": {
        "field": "submit_time",
        "expire": "2year"
    },
    "id": {
        "field": "id",
        "version": "last_modified"
    },
    "cluster": [
        "id",
        "last_modified"
    ],
    "schema": {
        "id": "integer",
        "submit_time": "time",
        "last_modified": "time"
    },
    "sharded": true,
    "account_info": {
        "private_key_id": {
            "$ref": "env://BIGQUERY_PRIVATE_KEY_ID"
        },
        "private_key": {
            "$ref": "env://BIGQUERY_PRIVATE_KEY"
        },
        "type": "service_account",
        "project_id": "my-project-id",
        "client_email": "me@my_project.iam.gserviceaccount.com",
        "client_id": "12345",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/my-project.iam.gserviceaccount.com"
    }
}
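The `"$ref": "env://..."` markers above pull secrets from environment variables so they need not be written into the config file. A sketch of how such references might be resolved; the helper is illustrative only, since the library's own config loading does the real work:

```python
import os

def expand_env_refs(node):
    """Recursively replace {"$ref": "env://NAME"} nodes with os.environ["NAME"]."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith("env://"):
            return os.environ[ref[len("env://"):]]
        return {key: expand_env_refs(value) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_env_refs(value) for value in node]
    return node

os.environ["BIGQUERY_PRIVATE_KEY_ID"] = "demo-key-id"  # for the demo only
config = {"account_info": {"private_key_id": {"$ref": "env://BIGQUERY_PRIVATE_KEY_ID"}}}
print(expand_env_refs(config))
# {'account_info': {'private_key_id': 'demo-key-id'}}
```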

Usage

Set up a Dataset with an application name

    dataset = bigquery.Dataset(
        dataset=application_name, 
        kwargs=settings
    )

Create a table

    destination = dataset.get_or_create_table(settings.destination)

Insert documents as you please

    destination.extend(documents)

Request a merge when done

    destination.merge()
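Conceptually, a merge over sharded tables unions the shard rows and keeps the newest version per id, per the id/version configuration. The sketch below is plain Python with illustrative shard names; the library performs the real merge inside BigQuery:

```python
# Two hypothetical shards accumulated by concurrent inserts.
shards = {
    "my_table_name_20200101": [{"id": 1, "last_modified": 1, "value": "a"}],
    "my_table_name_20200201": [{"id": 1, "last_modified": 2, "value": "b"},
                               {"id": 2, "last_modified": 1, "value": "c"}],
}

merged = {}
for rows in shards.values():          # union all shards
    for row in rows:
        current = merged.get(row["id"])
        if current is None or row["last_modified"] > current["last_modified"]:
            merged[row["id"]] = row   # newest version wins

print(sorted(merged.values(), key=lambda r: r["id"]))
# [{'id': 1, 'last_modified': 2, 'value': 'b'}, {'id': 2, 'last_modified': 1, 'value': 'c'}]
```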

Running tests

Fork and clone this repo.

git clone https://github.com/klahnakoski/jx-bigquery.git 
cd jx-bigquery
pip install -r requirements.txt

You will need a Google service-account key to run the tests. The Google Cloud console will let you generate one and download it as a JSON file. Update tests/config.json to point to that file:

# contents of tests/config.json
{
  "destination": {
    "account_info": {
      "$ref": "file:///e:/moz-fx-dev-ekyle-treeherder-a838a7718652.json"
    }
  },
  "constants": {},
  "debug": {
    "trace": true
  }
}

Then you can run the tests:

python -m unittest discover tests

NOTE - the tests will create a testing dataset and will create and drop tables

Download files

Source Distribution

jx-bigquery-3.80.20196.tar.gz (33.1 kB)

File details

Details for the file jx-bigquery-3.80.20196.tar.gz:

  • Size: 33.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/45.1.0 requests-toolbelt/0.8.0 tqdm/4.43.0 CPython/3.7.6
File hashes

Hashes for jx-bigquery-3.80.20196.tar.gz:

  • SHA256: fd4fc4ce5acf90b7e16b76baf06f4724c284ad1320eeb9e283cc42e5221bc593
  • MD5: ea683b99b72b406605516a0564eda1ec
  • BLAKE2b-256: 6e72f09fc0cceffd46b2ed624aa450486acbaab1336e7ef768ad159066350285

