Container library for working with tabular Arrow data

Project description

quivr

Quivr is a Python library which provides typed, schema-enforcing containers for Arrow data.

Quivr's Tables are like DataFrames, but with strict schemas to enforce types and expectations. They are backed by the high-performance Arrow memory model, making them well-suited for streaming IO, RPCs, and serialization/deserialization to Parquet.

why?

Data engineering involves taking analysis code and algorithms which were prototyped, often on pandas DataFrames, and shoring them up for production use.

While DataFrames are great for ad-hoc exploration, visualization, and prototyping, they aren't as great for building sturdy applications:

Loose and dynamic typing makes it difficult to be sure that code is correct without lots of explicit checks of the dataframe's state.
Performance of pandas operations can be unpredictable and have surprising characteristics, which makes it harder to provision resources.
DataFrames can use an extremely large amount of memory (typical numbers cited are between 2x and 10x the "raw" data's size), and often are forced to copy data in intermediate computations, which imposes unnecessarily heavy memory requirements.
The mutability of DataFrames can make debugging difficult and lead to confusing state.

We don't want to throw everything out here. Vectorized computations are often absolutely necessary for data work. But what if we could keep those vectorized computations and also get:

Types enforced at runtime, with no dynamic column information.
Relatively uniform performance due to a no-copy orientation.
Immutable data, allowing multiple views at very fast speed.

This is what Quivr's Tables try to provide.

Installation

Check out this repo, and pip install it.

Usage

Your main entrypoint to Quivr is through defining classes which represent your tables. You write a subclass of quivr.Table, annotating it with Fields that describe the data you're working with, and quivr will handle the rest.

import numpy as np
import pyarrow as pa
from quivr import Float64Field, Table


class Coordinates(Table):
    x = Float64Field()
    y = Float64Field()
    z = Float64Field()
    vx = Float64Field()
    vy = Float64Field()
    vz = Float64Field()

Then, you can construct tables from data:

coords = Coordinates.from_data(
    x=np.array([ 1.00760887, -2.06203093,  1.24360546, -1.00131722]),
    y=np.array([-2.7227298 ,  0.70239707,  2.23125432,  0.37269832]),
    z=np.array([-0.27148738, -0.31768623, -0.2180482 , -0.02528401]),
    vx=np.array([ 0.00920172, -0.00570486, -0.00877929, -0.00809866]),
    vy=np.array([ 0.00297888, -0.00914301,  0.00525891, -0.01119134]),
    vz=np.array([-0.00160217,  0.00677584,  0.00091095, -0.00140548])
)

# Sort the table by the z column. This returns a copy.
coords_z_sorted = coords.sort_by("z")

print(len(coords))
# prints 4

# Access any of the columns as a numpy array with zero copy:
xs = coords.x.to_numpy()

# Present the table as a pandas DataFrame, with zero copy if possible:
df = coords.to_dataframe()

Embedded definitions and nullable fields

You can embed one table's definition within another, and you can make fields nullable:

from quivr import StringField

class AsteroidOrbit(Table):
    designation = StringField()
    mass = Float64Field(nullable=True)
    radius = Float64Field(nullable=True)
    coords = Coordinates.as_field()

# You can construct embedded fields from Arrow StructArrays, which you can get from
# other Quivr tables using the to_structarray() method with zero copy.
orbits = AsteroidOrbit.from_data(
    designation=np.array(["Ceres", "Pallas", "Vesta", "2023 DW"]),
    mass=np.array([9.393e20, 2.06e21, 2.59e20, None]),
    radius=np.array([4.6e6, 2.7e6, 2.6e6, None]),
    coords=coords.to_structarray(),
)

Computing

Using Numpy

When you reference columns, you'll get numpy arrays which you can use to do computations:

import numpy as np

print(np.quantile(orbits.mass + 10, 0.5))

Using pyarrow.compute

You can also access columns of the data as Arrow Arrays to do computations using the pyarrow compute kernels:

import pyarrow.compute as pc

# pc.quantile returns an array with one entry per requested quantile, so take the first
median_mass = pc.quantile(pc.add(orbits.column("mass", as_numpy=False), 10), q=0.5)[0]
# median_mass is a pyarrow.Scalar, which you can get the value of with .as_py()
print(median_mass.as_py())

The pyarrow.compute package provides a very extensive set of functions, which are listed in the Arrow documentation. These computations are vectorized and will, in general, use all available cores, which makes them very fast.

Customizing behavior with methods

Because Quivr tables are just Python classes, you can customize the behavior of your tables by adding or overriding methods. For example, if you want to add a method to compute the total mass of the asteroids in the table, you can do so like this:

class AsteroidOrbit(Table):
    designation = StringField()
    mass = Float64Field(nullable=True)
    radius = Float64Field(nullable=True)
    coords = Coordinates.as_field()

    def total_mass(self):
        return pc.sum(self.mass)

You can also use this to add "meta-fields" which are combinations of other fields. For example:

from quivr import ListField

class CoordinateCovariance(Table):
    matrix_values = ListField(pa.float64(), 36)

    @property
    def matrix(self):
        # This is a numpy array of shape (n, 6, 6)
        return self.matrix_values.to_numpy().reshape(-1, 6, 6)


class AsteroidOrbit(Table):
    designation = StringField()
    mass = Float64Field(nullable=True)
    radius = Float64Field(nullable=True)
    coords = Coordinates.as_field()
    covariance = CoordinateCovariance.as_field()

orbits = load_orbits() # Analogous to the example above

# Compute the determinant of the covariance matrix for each asteroid
determinants = np.linalg.det(orbits.covariance.matrix)

Filtering

You can also filter on conditions computed from the data; see the Arrow documentation for more details. Filtering the underlying Arrow Table gives back an Arrow Table with the same schema, which you can pass to your class to construct a quivr Table:

big_orbits = AsteroidOrbit(orbits.table.filter(pc.greater(orbits.table["mass"], 1e21)))

If you're plucking out rows that match a single value, you can use the "select" method on the Table:

# Get the orbit of Ceres
ceres_orbit = orbits.select("designation", "Ceres")

Indexes for Fast Lookups

If you're going to be doing a lot of lookups on a particular column, it can be useful to create an index for that column. You can do so using the quivr.StringIndex class, which builds an index for string values:

import quivr

# Build an index for the designation column
designation_index = quivr.StringIndex(orbits, "designation")

# Get the orbit of Ceres
ceres_orbit = designation_index.lookup("Ceres")

The lookup method on the StringIndex returns Quivr Tables, or None if there is no match. Keep in mind that the returned tables might have multiple rows if there are multiple matches.

TODO: Add numeric and time-based indexes.

Serialization

Feather

Feather is a fast, zero-copy serialization format for Arrow tables. It can be used for interprocess communication, or for working with data on disk via memory mapping.

orbits.to_feather("orbits.feather")

orbits_roundtripped = AsteroidOrbit.from_feather("orbits.feather")

# use memory mapping to work with a large file without copying it into memory
orbits_mmap = AsteroidOrbit.from_feather("orbits.feather", memory_map=True)

Parquet

You can serialize your tables to Parquet files, and read them back:

orbits.to_parquet("orbits.parquet")

orbits_roundtripped = AsteroidOrbit.from_parquet("orbits.parquet")

See the Arrow documentation for more details on the Parquet format used.

Project details

This version

0.2.0

May 5, 2023

