Python support for Parquet file format

parquet-python

parquet-python is a pure-python implementation (currently with only read-support) of the parquet format. It comes with a script for reading parquet files and outputting the data to stdout as JSON or TSV (without the overhead of JVM startup). Performance has not yet been optimized, but it’s useful for debugging and quick viewing of data in files.

Not all parts of the parquet-format have been implemented or tested yet (for example, nested data); see Todos below for a full list. With that said, parquet-python is capable of reading all the data files from the parquet-compatibility project.

requirements

parquet-python has been tested on python 2.7, 3.4, and 3.5. It depends on thrift (0.9) and python-snappy (for snappy compressed files).
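
Because python-snappy is only needed for snappy compressed files, it can help to check which dependencies are importable before reading a file. The following is a minimal sketch for Python 3, assuming the import names `thrift` and `snappy` for the two packages; the helper `missing_modules` is hypothetical and not part of parquet-python.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names: `thrift` for thrift, `snappy` for python-snappy.
    missing = missing_modules(["thrift", "snappy"])
    if missing:
        print("missing modules: " + ", ".join(missing))
    else:
        print("all dependencies found")
```

A missing `snappy` module only matters if the files to be read are snappy compressed.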

getting started

parquet-python is available via PyPI and can be installed using pip install parquet. The package includes the parquet command for reading parquet files, e.g. parquet test.parquet. See parquet --help for full usage.
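
The command can also be driven from a script. The sketch below only uses the two invocations mentioned above (parquet FILE and parquet --help); the helper names `parquet_command` and `run_parquet` are hypothetical, and the code assumes Python 3 with the parquet command on the PATH.

```python
import shutil
import subprocess


def parquet_command(path=None):
    """Build the argument list for the `parquet` command-line tool."""
    # With no path given, ask for the usage text instead.
    return ["parquet", path] if path else ["parquet", "--help"]


def run_parquet(path=None):
    """Run the CLI and return its stdout, or None if `parquet` is not on PATH."""
    if shutil.which("parquet") is None:
        return None
    result = subprocess.run(
        parquet_command(path),
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout
```

For example, run_parquet("test.parquet") would return whatever the command prints for that file, or None when the package is not installed.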

Example

parquet-python currently has two programmatic interfaces with similar functionality to Python’s csv reader. First, it supports a DictReader which returns a dictionary per row. Second, it has a reader which returns a list of values for each row. Both functions require a file-like object and support an optional columns argument to read only the specified columns.

import json

import parquet

# Assuming a parquet file with two rows and three columns:
# foo bar baz
# 1   2   3
# 4   5   6

# Parquet is a binary format, so open the file in binary mode.
with open("test.parquet", "rb") as fo:
    # prints:
    # {"foo": 1, "bar": 2}
    # {"foo": 4, "bar": 5}
    for row in parquet.DictReader(fo, columns=['foo', 'bar']):
        print(json.dumps(row))


with open("test.parquet", "rb") as fo:
    # prints:
    # 1,2
    # 4,5
    for row in parquet.reader(fo, columns=['foo', 'bar']):
        print(",".join([str(r) for r in row]))

Todos

  • Support the deprecated bitpacking

  • Fix handling of repetition-levels and definition-levels

  • Tests for nested schemas, null data

  • Support reading of data from HDFS via snakebite and/or webhdfs.

  • Implement writing

  • Performance evaluation and optimization (i.e. how does it compare to the C++ and Java implementations)

Contributing

Contributions are made via pull requests. Please include tests with your changes and follow PEP 8.
