Build and run queries against data

DataFusion in Python

This is a Python library that binds to DataFusion, Apache Arrow's in-memory query engine.

Like PySpark, it lets you build a plan through SQL or a DataFrame API against in-memory data, Parquet files, or CSV files, run it in a multi-threaded environment, and get the result back in Python.

It also allows you to use UDFs and UDAFs for complex operations.

The major advantage of this library over other execution engines is that it achieves zero-copy between Python and its execution engine: using UDFs and UDAFs and collecting results to Python incurs no copying cost. The only overhead is having to hold the GIL while those operations run.

Its query engine, DataFusion, is written in Rust, which provides strong guarantees about thread safety and memory safety.

Technically, zero-copy is achieved via the Arrow C Data Interface.

How to use it

Simple usage:

import datafusion
import pyarrow

# an alias
f = datafusion.functions

# create a context
ctx = datafusion.ExecutionContext()

# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
    [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
    names=["a", "b"],
)
df = ctx.create_dataframe([[batch]])

# create a new statement
df = df.select(
    f.col("a") + f.col("b"),
    f.col("a") - f.col("b"),
)

# execute and collect the first (and only) batch
result = df.collect()[0]

assert result.column(0) == pyarrow.array([5, 7, 9])
assert result.column(1) == pyarrow.array([-3, -3, -3])

UDFs

def is_null(array: pyarrow.Array) -> pyarrow.Array:
    return array.is_null()

udf = f.udf(is_null, [pyarrow.int64()], pyarrow.bool_())

df = df.select(udf(f.col("a")))

UDAFs

from typing import List

import pyarrow
import pyarrow.compute


class Accumulator:
    """
    Interface of a user-defined accumulation.
    """
    def __init__(self):
        self._sum = pyarrow.scalar(0.0)

    def to_scalars(self) -> List[pyarrow.Scalar]:
        return [self._sum]

    def update(self, values: pyarrow.Array) -> None:
        # pyarrow scalars do not support `+`, so the sum goes through Python floats.
        # This raises when the batch sum is `None` (e.g. an all-null batch).
        self._sum = pyarrow.scalar(self._sum.as_py() + pyarrow.compute.sum(values).as_py())

    def merge(self, states: pyarrow.Array) -> None:
        # Same caveat as in `update`: this raises when the sum of the states is `None`.
        self._sum = pyarrow.scalar(self._sum.as_py() + pyarrow.compute.sum(states).as_py())

    def evaluate(self) -> pyarrow.Scalar:
        return self._sum


df = ...

udaf = f.udaf(Accumulator, pyarrow.float64(), pyarrow.float64(), [pyarrow.float64()])

df = df.aggregate(
    [],
    [udaf(f.col("a"))]
)

How to install (from pip)

pip install datafusion
# or
python -m pip install datafusion

How to develop

This assumes that you have Rust and Cargo installed. We use the workflow recommended by PyO3 and maturin.

Bootstrap:

# fetch this repo
git clone git@github.com:datafusion-contrib/datafusion-python.git
# prepare development environment (used to build wheel / install in development)
python3 -m venv venv
# activate the venv
source venv/bin/activate
# update pip itself if necessary
python -m pip install -U pip
# install dependencies (for Python 3.8+)
python -m pip install -r requirements-310.txt

Whenever the Rust code changes (through your own changes or via git pull):

# make sure you activate the venv using "source venv/bin/activate" first
maturin develop
python -m pytest

How to update dependencies

To change test dependencies, edit requirements.in and run:

# install pip-tools (this only needs to be done once); consider running it inside the venv
python -m pip install pip-tools
python -m piptools compile --generate-hashes -o requirements-310.txt

To update dependencies, run the same command with -U:

python -m piptools compile -U --generate-hashes -o requirements-310.txt
