vinum

Vinum is a SQL processor written in pure Python, designed for data analysis workflows and in-memory analytics.

Development Status
- 3 - Alpha
Environment
- Console
Intended Audience
- Science/Research
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering

Project description

Vinum is a SQL processor written in pure Python, designed for data analysis workflows and in-memory analytics. Its design goal is deeper integration between the SQL language and Python data analysis tools such as Numpy and Pandas, or any Python code in general. Key features include native support for vectorized Numpy and Python functions as UDFs in SQL queries.

Key Features:

- Natively supports vectorized Numpy and Python functions inside SELECT, WHERE, GROUP BY, HAVING and ORDER BY clauses. All Numpy functions are available by default via the ‘np.*’ namespace.
- Written in pure Python and built from the ground up on top of Apache Arrow and Numpy.
- Apache Arrow provides the foundation for “moving” data and keeps the overhead of transferring data to and from Numpy and Pandas minimal.
- Designed for in-memory analytics workflows, based on a columnar memory layout.
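
To illustrate what a columnar memory layout means in practice, here is a minimal sketch that stores the same small table row-wise and column-wise. It is not Vinum code: the sample data and the names `rows` and `columns` are made up for illustration, and plain Numpy arrays stand in for Arrow columns.

```python
import numpy as np

# Row-oriented: one tuple per record (hypothetical sample data).
rows = [(1, 'air', 300.1), (2, 'bus', 2.8), (3, 'air', 880.0)]

# Column-oriented: one contiguous array per column.
columns = {
    'id': np.array([r[0] for r in rows]),
    'mode': np.array([r[1] for r in rows]),
    'value': np.array([r[2] for r in rows]),
}

# Aggregating a column reads a single array instead of every record.
total_rowwise = sum(r[2] for r in rows)
total_columnar = columns['value'].sum()

# A filter is a boolean mask computed from one column and applied to another.
air_values = columns['value'][columns['mode'] == 'air']
```

With this layout, a query that only uses `value` and `mode` never has to touch the `id` column.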

Design

Vinum’s query planner compiles a SQL SELECT statement into a DAG of vectorized Arrow and Numpy operators, so integration with Numpy, Arrow or native Python functions comes naturally. In Vinum, all Numpy functions are first-class citizens and can be used inside SELECT, WHERE, GROUP BY, HAVING and ORDER BY clauses.

Below is an example of a possible simplified query plan.

https://github.com/dmitrykoval/vinum/raw/main/doc/source/_static/dag_ex.png
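
The idea of a plan built from vectorized operators can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Vinum's planner: the `Node` class and the helpers `column`, `literal` and `call` are hypothetical names, the plan is assembled by hand rather than parsed from SQL, and shared nodes are re-evaluated instead of cached.

```python
import operator

import numpy as np


class Node:
    """One operator in a toy query plan: a function plus its input nodes."""

    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs

    def evaluate(self, table):
        # Evaluate the inputs first, then apply this operator to whole columns.
        args = [node.evaluate(table) for node in self.inputs]
        return self.func(table, *args)


def column(name):
    # Leaf node: read a column from the table.
    return Node(lambda table: table[name])


def literal(value):
    # Leaf node: a constant from the query text.
    return Node(lambda table: value)


def call(func, *inputs):
    # Inner node: apply a vectorized function to the evaluated inputs.
    return Node(lambda table, *args: func(*args), *inputs)


# Hand-built plan for: SELECT value, np.log(value) FROM t WHERE mode = 'air'
table = {'value': np.array([300.1, 2.8, 880.0]),
         'mode': np.array(['air', 'bus', 'air'])}

mask = call(operator.eq, column('mode'), literal('air'))    # WHERE predicate
value = call(lambda col, m: col[m], column('value'), mask)  # apply the filter
log_value = call(np.log, value)                             # projection

# `value` feeds both output columns, which makes the plan a DAG, not a tree.
result = {'value': value.evaluate(table),
          'np.log': log_value.evaluate(table)}
```

Because every node operates on whole arrays, a Numpy function such as `np.log` slots into the plan like any built-in operator.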

Install

pip install vinum

Examples

Query python dict

Create a Table from a Python dict and return the result of the query as a Pandas DataFrame.

>>> import vinum as vn
>>> data = {'value': [300.1, 2.8, 880], 'mode': ['air', 'bus', 'air']}
>>> tbl = vn.Table.from_pydict(data)
>>> tbl.sql_pd("SELECT value, np.log(value) FROM t WHERE mode='air'")
   value    np.log
0  300.1  5.704116
1  880.0  6.779922

Query pandas dataframe

>>> import pandas as pd
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> pdf = pd.DataFrame(data=data)
>>> tbl = vn.Table.from_pandas(pdf)
>>> tbl.sql_pd('SELECT * FROM t WHERE col2 > 10 ORDER BY col1 DESC')
   col1  col2
0     3    17
1     2    13

Query csv

>>> import vinum as vn
>>> tbl = vn.read_csv('test.csv')
>>> res_tbl = tbl.sql('SELECT * FROM t WHERE fare > 5 LIMIT 3')
>>> res_tbl.to_pandas()
   id                            ts        lat        lng  fare
0   1   2010-01-05 16:52:16.0000002  40.711303 -74.016048  16.9
1   2  2011-08-18 00:35:00.00000049  40.761270 -73.982738   5.7
2   3   2012-04-21 04:30:42.0000001  40.733143 -73.987130   7.7

Compute Euclidean distance with numpy functions

Use any numpy functions via the ‘np.*’ namespace.

>>> import vinum as vn
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT *, np.sqrt(np.square(x) + np.square(y)) dist '
...            'FROM t ORDER BY dist DESC')
   x   y       dist
0  3  17  17.262677
1  2  13  13.152946
2  1   7   7.071068

Compute Euclidean distance with vectorized UDF

>>> import numpy as np
>>> import vinum as vn
>>> vn.register_numpy('distance',
...                   lambda x, y: np.sqrt(np.square(x) + np.square(y)))
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT *, distance(x, y) AS dist '
...            'FROM t ORDER BY dist DESC')
   x   y       dist
0  3  17  17.262677
1  2  13  13.152946
2  1   7   7.071068

Compute Euclidean distance with python UDF

>>> import math
>>> import vinum as vn
>>> vn.register_python('distance', lambda x, y: math.sqrt(x**2 + y**2))
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT x, y, distance(x, y) AS dist FROM t')
   x   y       dist
0  1   7   7.071068
1  2  13  13.152946
2  3  17  17.262677
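
The two UDF flavours above differ in how the function is invoked. The sketch below mimics that difference outside of Vinum, assuming that a vectorized UDF receives whole columns in one call while a Python UDF is called once per row (which the use of `math.sqrt` on scalars suggests). The function names are illustrative only.

```python
import math

import numpy as np

x = np.array([1, 2, 3])
y = np.array([7, 13, 17])


def distance_vectorized(x, y):
    # Called once; x and y are whole columns.
    return np.sqrt(np.square(x) + np.square(y))


def distance_scalar(x, y):
    # Called once per row; x and y are single values.
    return math.sqrt(x**2 + y**2)


vectorized = distance_vectorized(x, y)
per_row = np.array([distance_scalar(a, b) for a, b in zip(x, y)])
```

Both approaches produce the same values; the vectorized form avoids a Python-level loop over the rows, which usually matters on large tables.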

Group by z-score

>>> import numpy as np
>>> import vinum as vn
>>> def z_score(x: np.ndarray):
...     """Compute Standard Score"""
...     mean = np.mean(x)
...     std = np.std(x)
...     return (x - mean) / std
...
>>> vn.register_numpy('score', z_score)
>>> tbl = vn.read_csv('test.csv')
>>> tbl.sql_pd('SELECT int(score(fare)) AS bucket, avg(fare), count(*) '
...            'FROM t GROUP BY bucket ORDER BY bucket')
   bucket        avg  count
0       0   8.111630     92
1       1  19.380000      3
2       2  27.433333      3
3       3  34.670000      1
4       6  58.000000      1
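
For readers who want to see what the grouping above computes, here is a rough Numpy-only equivalent on a small made-up array. The `fare` values are hypothetical (they are not from test.csv), and truncation toward zero via `astype(int)` is an assumption about how `int()` behaves in the query.

```python
import numpy as np


def z_score(x: np.ndarray) -> np.ndarray:
    """Compute the standard score of every element."""
    return (x - np.mean(x)) / np.std(x)


# Hypothetical fares (mean 5, population standard deviation 2).
fare = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Truncate each score toward zero to obtain an integer bucket.
bucket = z_score(fare).astype(int)

# Emulate GROUP BY bucket ORDER BY bucket with avg(fare) and count(*).
groups = {
    int(b): (float(fare[bucket == b].mean()), int((bucket == b).sum()))
    for b in np.unique(bucket)
}
```

Here `groups` maps each bucket to an `(average fare, row count)` pair, in ascending bucket order.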

Documentation

What Vinum is not

Vinum is not a Database Management System. There are no plans to support INSERT or UPDATE statements, or MVCC.

Dependencies

Future plans

- Performance improvements.
- Support for sub-queries and JOINs.
- Parallel execution.

Release history

- 0.3.2 (Mar 19, 2021)
- 0.2.0 (Mar 8, 2021)
- 0.1.4 (Nov 8, 2020), this version
- 0.1.3 (Nov 7, 2020)
- 0.1.2 (Nov 7, 2020)
- 0.1.1 (Nov 7, 2020)
- 0.0.1 (Oct 31, 2020)

Download files

Source Distribution

vinum-0.1.4.tar.gz (69.0 kB), uploaded Nov 8, 2020

Algorithm	Hash digest
SHA256	`5a5d70d1578163ae89897cf149302225e0f29412d50197ae04f1d1580c3497b3`
MD5	`ef6a08ae276566e265514c4aa05aa032`
BLAKE2b-256	`dc24ef1d5eba699e89ee105d1427c067d1c15e751d2f8243503ce3e275595131`

Built Distribution

vinum-0.1.4-py3-none-any.whl (59.2 kB), uploaded Nov 8, 2020, Python 3

Algorithm	Hash digest
SHA256	`653e2d543e3ebfefed51f912578455813e3cf02fbc0a034a675d7a78fbeb089b`
MD5	`1f95494d5d0b22825d03b5a05ce0350c`
BLAKE2b-256	`d7ff834080d48c5e7e3349d6e1b59c8ad0d2b77d10583b23f0e3bd1d8ee2c083`