Skip to main content

Infer SQL DDL statements from tabular data

Project description


ci-tests Test suite code coverage license pypi-downloads python-versions status pypi-version

About

Skeem infers SQL DDL statements from tabular data.

Skeem is, amongst others, based on the excellent ddlgenerator, frictionless, fsspec, pandas, ScipPy, SQLAlchemy and xarray packages, and can be used both as a standalone program, and as a library.

Supported input data:

Supported input sources:

Please note that Skeem is beta-quality software, and a work in progress. Contributions of all kinds are very welcome, in order to make it more solid. Breaking changes should be expected until a 1.0 release, so version pinning is recommended, especially when you use it as a library.

Synopsis

skeem infer-ddl --dialect=postgresql data.ndjson
CREATE TABLE "data" (
    "id" SERIAL NOT NULL,
    "name" TEXT NOT NULL,
    "date" TIMESTAMP WITHOUT TIME ZONE,
    "fruits" TEXT NOT NULL,
    "price" DECIMAL(2, 2) NOT NULL,
    PRIMARY KEY ("id")
);

Quickstart

If you are in a hurry, and want to run Skeem without any installation, just use the OCI image on Podman or Docker.

docker run --rm ghcr.io/daq-tools/skeem-standard \
    skeem infer-ddl --dialect=postgresql \
    https://github.com/daq-tools/skeem/raw/main/tests/testdata/basic.ndjson

Setup

Install Skeem from PyPI.

pip install skeem

Install Skeem with support for additional data formats like NetCDF.

pip install 'skeem[scientific]'

Usage

This section outlines some example invocations of Skeem, both on the command line, and per library use. Other than the resources available from the web, testing data can be acquired from the repository’s testdata folder.

Command line use

Help

skeem info
skeem --help
skeem infer-ddl --help

Read from files

# NDJSON, Parquet, and InfluxDB line protocol (ILP) formats.
skeem infer-ddl --dialect=postgresql data.ndjson
skeem infer-ddl --dialect=postgresql data.parquet
skeem infer-ddl --dialect=postgresql data.lp

# CSV, JSON, ODS, and XLSX formats.
skeem infer-ddl --dialect=postgresql data.csv
skeem infer-ddl --dialect=postgresql data.json
skeem infer-ddl --dialect=postgresql data.ods
skeem infer-ddl --dialect=postgresql data.xlsx
skeem infer-ddl --dialect=postgresql data.xlsx --address="Sheet2"

Read from URLs

# CSV, NDJSON, XLSX
skeem infer-ddl --dialect=postgresql https://github.com/daq-tools/skeem/raw/main/tests/testdata/basic.csv
skeem infer-ddl --dialect=postgresql https://github.com/daq-tools/skeem/raw/main/tests/testdata/basic.ndjson
skeem infer-ddl --dialect=postgresql https://github.com/daq-tools/skeem/raw/main/tests/testdata/basic.xlsx --address="Sheet2"

# Google Sheets: Address first sheet, and specific sheet of workbook.
skeem infer-ddl --dialect=postgresql --table-name=foo https://docs.google.com/spreadsheets/d/1ExyrawjlyksbC6DOM6nLolJDbU8qiRrrhxSuxf5ScB0/view
skeem infer-ddl --dialect=postgresql --table-name=foo https://docs.google.com/spreadsheets/d/1ExyrawjlyksbC6DOM6nLolJDbU8qiRrrhxSuxf5ScB0/view#gid=883324548

# InfluxDB line protocol (ILP)
skeem infer-ddl --dialect=postgresql https://github.com/influxdata/influxdb2-sample-data/raw/master/air-sensor-data/air-sensor-data.lp

# Compressed files in gzip format
skeem --verbose infer-ddl --dialect=crate --content-type=ndjson https://s3.amazonaws.com/crate.sampledata/nyc.yellowcab/yc.2019.07.gz

# CSV on S3
skeem --verbose infer-ddl --dialect=postgresql s3://noaa-ghcn-pds/csv/by_year/2022.csv

# CSV on Google Cloud Storage
skeem --verbose infer-ddl --dialect=postgresql gs://tinybird-assets/datasets/nations.csv
skeem --verbose infer-ddl --dialect=postgresql gs://tinybird-assets/datasets/medals1.csv

# CSV on GitHub
skeem --verbose infer-ddl --dialect=postgresql github://daq-tools:skeem@/tests/testdata/basic.csv

# GRIB2, NetCDF
skeem infer-ddl --dialect=postgresql https://github.com/earthobservations/testdata/raw/main/opendata.dwd.de/weather/nwp/icon/grib/18/t/icon-global_regular-lat-lon_air-temperature_level-90.grib2
skeem infer-ddl --dialect=postgresql https://www.unidata.ucar.edu/software/netcdf/examples/sresa1b_ncar_ccsm3-example.nc
skeem infer-ddl --dialect=postgresql https://www.unidata.ucar.edu/software/netcdf/examples/WMI_Lear.nc

OCI

OCI images are available on the GitHub Container Registry (GHCR). In order to run them on Podman or Docker, invoke:

docker run --rm ghcr.io/daq-tools/skeem-standard \
    skeem infer-ddl --dialect=postgresql \
    https://github.com/daq-tools/skeem/raw/main/tests/testdata/basic.csv

If you want to work with files on your filesystem, you will need to either mount the working directory into the container using the --volume option, or use the --interactive option to consume STDIN, like:

docker run --rm --volume=$(pwd):/data ghcr.io/daq-tools/skeem-standard \
    skeem infer-ddl --dialect=postgresql /data/basic.ndjson

docker run --rm --interactive ghcr.io/daq-tools/skeem-standard \
    skeem infer-ddl --dialect=postgresql --content-type=ndjson - < basic.ndjson

In order to always run the latest nightly development version, and to use a shortcut for that, this section outlines how to use an alias for skeem, and a variable for storing the input URL. It may be useful to save a few keystrokes on subsequent invocations.

docker pull ghcr.io/daq-tools/skeem-standard:nightly
alias skeem="docker run --rm --interactive ghcr.io/daq-tools/skeem-standard:nightly skeem"
URL=https://github.com/daq-tools/skeem/raw/main/tests/testdata/basic.ndjson

skeem infer-ddl --dialect=postgresql $URL

More

Use a different backend (default: ddlgen):

skeem infer-ddl --dialect=postgresql --backend=frictionless data.ndjson

Reading data from STDIN needs to obtain both the table name and content type separately:

skeem infer-ddl --dialect=crate --table-name=foo --content-type=ndjson - < data.ndjson

Reading data from STDIN also works like this, if you prefer to use pipes:

cat data.ndjson | skeem infer-ddl --dialect=crate --table-name=foo --content-type=ndjson -

Library use

import io
from skeem.core import SchemaGenerator
from skeem.model import Resource, SqlTarget

INDATA = io.StringIO(
    """
    {"id":1,"name":"foo","date":"2014-10-31 09:22:56","fruits":"apple,banana","price":0.42}
    {"id":2,"name":"bar","date":null,"fruits":"pear","price":0.84}
    """
)

sg = SchemaGenerator(
    resource=Resource(data=INDATA, content_type="ndjson"),
    target=SqlTarget(dialect="crate", table_name="testdrive"),
)

print(sg.to_sql_ddl().pretty)
CREATE TABLE "testdrive" (
    "id" INT NOT NULL,
    "name" STRING NOT NULL,
    "date" TIMESTAMP,
    "fruits" STRING NOT NULL,
    "price" DOUBLE NOT NULL,
    PRIMARY KEY ("id")
);

Development

For installing the project from source, please follow the development documentation.

Project information

Credits

Prior art

We are maintaining a list of other projects with the same or similar goals like Skeem.

Etymology

The program was about to be called Eskema, but it turned out that there is already another Eskema out there. So, it has been renamed to Skeem, which is Estonian, and means “schema”, “outline”, or “(to) plan”.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

skeem-0.1.2.tar.gz (55.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

skeem-0.1.2-py3-none-any.whl (42.4 kB view details)

Uploaded Python 3

File details

Details for the file skeem-0.1.2.tar.gz.

File metadata

  • Download URL: skeem-0.1.2.tar.gz
  • Upload date:
  • Size: 55.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for skeem-0.1.2.tar.gz
Algorithm Hash digest
SHA256 952adf869546bd7c2dd49a3e9f4a3a8f3dbf47ca3b5ea88b9702e3ff7b5f3ccf
MD5 b1ab53505c6103828ef303a6e7c67ba9
BLAKE2b-256 9bb7f7dd3403a094fdbecc7984d8661cb19808b28f3aa745efb8d48e4481cb05

See more details on using hashes here.

File details

Details for the file skeem-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: skeem-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 42.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for skeem-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 160d90e1f89130da04fc33ba3740e72adea5d5df6b971856045adca0dba0c2a0
MD5 8cebaaf66a46a9be63417b2b14014b34
BLAKE2b-256 4037036ad6092b44d8c93ab1ee7b8922eeaa3801b4f4b1988d965058b123d665

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page