
S3 select utility package


S3 select

Example query run on 10GB of GZIP compressed JSON data (>60GB uncompressed)

Motivation

Amazon S3 select is one of the coolest features AWS released in 2018. Its benefits are:

  1. It is very fast and light on network usage, because a limited SQL select query lets you return only a subset of a file's contents from S3. Since the data is filtered on the AWS machine where the S3 file resides, network data transfer can be reduced significantly, depending on the query issued.
  2. It is lightweight on the client side, because all filtering is done on the machine where the S3 data is located.
  3. It's cheap, at $0.002 per GB scanned and $0.0007 per GB returned.
    For more details about S3 select, see this presentation.
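To make the pricing above concrete, here is a minimal sketch of a cost estimate built from the two per-GB rates quoted in the list. The function name and the choice of 1 GB = 2**30 bytes are assumptions made for illustration, not something s3select or AWS documentation is quoted on here.

```python
# Rates quoted above; everything else in this block is a hypothetical helper.
PRICE_PER_GB_SCANNED = 0.002
PRICE_PER_GB_RETURNED = 0.0007
BYTES_PER_GB = 2 ** 30  # assumption: binary gigabytes


def estimate_cost(bytes_scanned, bytes_returned, bytes_per_gb=BYTES_PER_GB):
    """Return the estimated query cost in USD for the given byte counts."""
    scanned_gb = bytes_scanned / bytes_per_gb
    returned_gb = bytes_returned / bytes_per_gb
    return (scanned_gb * PRICE_PER_GB_SCANNED
            + returned_gb * PRICE_PER_GB_RETURNED)


if __name__ == "__main__":
    # Scanning 10 GB and returning 1 GB of matching records.
    print(f"${estimate_cost(10 * BYTES_PER_GB, 1 * BYTES_PER_GB):.4f}")
```

Because the returned-data rate only applies to matching records, a selective filter keeps the total close to the scan cost alone.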

Unfortunately, an S3 select API query call is limited to a single file on S3 and its syntax is quite cumbersome, making it very impractical for daily usage. The s3select command is intended to fix these and other flaws.
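For a sense of what a raw single-file call involves, the sketch below builds the request for the `select_object_content` operation and collects its event stream. It assumes a boto3 S3 client is passed in and that the data is JSON with one record per line; the helper names `build_select_request` and `select_from_object` are hypothetical and are not part of s3select.

```python
def build_select_request(bucket, key, expression, gzip_compressed=True):
    """Build keyword arguments for one S3 Select call on one object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {
            "CompressionType": "GZIP" if gzip_compressed else "NONE",
            "JSON": {"Type": "LINES"},  # assumption: line-delimited JSON
        },
        "OutputSerialization": {"JSON": {}},
    }


def select_from_object(s3_client, **request):
    """Run the query and return (matching bytes, stats dict)."""
    response = s3_client.select_object_content(**request)
    chunks = []
    stats = {}
    # The response payload is a stream of events of different kinds.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
        elif "Stats" in event:
            stats = event["Stats"]["Details"]
    return b"".join(chunks), stats


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   request = build_select_request(
#       "my-bucket", "logs/a.json.gz",
#       "SELECT * FROM S3Object s LIMIT 5")
#   data, stats = select_from_object(boto3.client("s3"), **request)
```

Every object under a prefix would need its own call like this one, which is the repetition the paragraph above describes.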

Features at a glance

Most important features:

  1. Queries all files beneath a given S3 prefix
  2. The whole process is multi-threaded and fast (a scan of 1.1TB of data stored in 20,000 files takes 5 minutes). Threads don't slow down the client much, as the heavy lifting is done on AWS.
  3. The file format is inferred automatically, picking GZIP or plain text depending on the file extension
  4. Real-time progress
  5. Exact cost of the query returned for each run
  6. Ability to count only the records matching the filter in a fast and efficient manner
  7. You can easily limit the number of results returned while still keeping multi-threaded execution
  8. Failed requests are properly handled and retried if they are retriable (e.g. throttled calls)
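Several of the features above (extension-based format detection, multi-threaded execution, a result limit, and retries) can be illustrated together. The following is a rough sketch under stated assumptions, not s3select's actual implementation: `RetriableError`, `infer_compression`, `with_retries`, and `query_keys` are hypothetical names, the `.gz` suffix check is an assumed detection rule, and `query_one` stands in for whatever function runs the query against a single key.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


class RetriableError(Exception):
    """Hypothetical marker for errors worth retrying, e.g. throttling."""


def infer_compression(key):
    """Guess the compression type from the file extension (assumed rule)."""
    return "GZIP" if key.lower().endswith(".gz") else "NONE"


def with_retries(func, *args, attempts=5, base_delay=0.1):
    """Call func, retrying with exponential backoff on RetriableError."""
    for attempt in range(attempts):
        try:
            return func(*args)
        except RetriableError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * 2 ** attempt)


def query_keys(keys, query_one, limit=None, max_workers=20):
    """Run query_one(key, compression) for every key on a thread pool.

    query_one must return a list of records. Collection stops once
    `limit` records have been gathered; queries not yet started are
    cancelled, while queries already running are allowed to finish.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(with_retries, query_one, key, infer_compression(key))
            for key in keys
        ]
        for future in as_completed(futures):
            results.extend(future.result())
            if limit is not None and len(results) >= limit:
                for pending in futures:
                    pending.cancel()
                break
    return results if limit is None else results[:limit]
```

Threads suit this workload because each worker spends most of its time waiting on the network, which matches the note above that the heavy lifting happens on AWS.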

Installation

s3select is built in Python and uses pip. Here is how to install and update it:

$ pip install -U s3select

Authentication

s3select uses the same authentication and endpoint configuration as aws-cli. If the aws command works on your machine, no additional configuration is needed.

Example usage

License

Distributed under the MIT license. See LICENSE for more information.

