Snark Hub

Project description

Introduction

Data scientists and ML researchers spend most of their time on data management and preprocessing rather than on modeling. Deep learning often requires working with large datasets, which can grow to terabyte or even petabyte size. Such data is hard to manage, version-control, and track. Downloading it and linking it to training or inference code is time-consuming, and there is no easy way to access a chunk of it and possibly visualize it. Wouldn't it be more convenient to have large datasets stored and version-controlled as a single numpy-like array in the cloud, accessible from any machine at scale?

Hub Arrays: scalable numpy-like arrays stored in the cloud and accessible over the internet as if they were local numpy arrays.

Let's see how it works in action:

pip3 install hub

Create a large array remotely in the cloud, with some parts cached locally. You can read from and write to it from anywhere as if it were a local array!

> import hub
> bigarray = hub.array((10000000000, 512, 512, 3), name="test/bigarray:v0")
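To get a feel for the scale, the footprint of an array with that shape can be estimated from the shape alone. This is only back-of-the-envelope arithmetic: the element size is an assumption (one byte per value, as for 8-bit images), since the example does not state a data type.

```python
import math

shape = (10000000000, 512, 512, 3)
bytes_per_element = 1  # assumption: 8-bit values; the source does not specify a dtype

num_elements = math.prod(shape)
total_bytes = num_elements * bytes_per_element

print(f"{num_elements:,} elements")                    # 7,864,320,000,000,000 elements
print(f"{total_bytes / 1e15:.2f} PB at 1 byte each")   # 7.86 PB at 1 byte each
```

With a wider element type (for example 4-byte floats) the estimate grows proportionally, which is why such an array cannot be held in local RAM or on a single disk.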

Problems with Current Workflows

Through our experience working with deep learning companies and researchers, we found several problems with current data management workflows in deep learning.

Data locality. When you have local GPU servers but store the data in a secure remote data center or in the cloud, you need to plan ahead and download specific datasets to your GPU box, because downloading takes time. Sharing a preprocessed dataset from one GPU box across your team is also slow and error-prone if there are multiple preprocessing pipelines.
Code dependency on local folder structure. People use a folder structure to store images or videos. As a result, the data input pipeline has to take the raw folder structure into account, which creates an unnecessary and error-prone code dependency on the dataset folder structure.
Managing preprocessing pipelines. If you want to run some preprocessing, it would be ideal to save the preprocessed images as a local cache for training. But it's usually hard to manage and version-control the preprocessed images locally when there are multiple preprocessing pipelines and the dataset is very large.
Visualization. It's difficult to visualize the raw data or the preprocessed dataset on servers.
Reading a small slice of data. Another popular approach is to store the data in HDF5/TFRecords format and upload it to a cloud bucket, but you still have to manage many HDF5/TFRecords chunk files. If you want to read a small slice of data, it's not clear which TFRecord/HDF5 chunk you need to load. It's also inefficient to load a whole file for a small slice of data.
Synchronization across team. If multiple users modify the data, a data versioning and synchronization protocol needs to be implemented.
RAM management. Whenever you want to create a numpy array, you have to worry about whether it will fit within the local RAM/disk limit.
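To make the slice-reading problem concrete, here is a minimal sketch of the bookkeeping it implies. It assumes a hypothetical layout in which the dataset is split into files holding a fixed number of rows each; the function name and the layout are illustrative and are not taken from HDF5, TFRecords, or Hub.

```python
def chunks_for_slice(start, stop, rows_per_chunk):
    """Return the indices of the chunk files that hold rows [start, stop)."""
    if stop <= start:
        return []
    first = start // rows_per_chunk
    last = (stop - 1) // rows_per_chunk
    return list(range(first, last + 1))


# Hypothetical dataset stored as files of 1000 rows each.
print(chunks_for_slice(0, 10, 1000))       # [0]
print(chunks_for_slice(995, 1005, 1000))   # [0, 1]
print(chunks_for_slice(2500, 4000, 1000))  # [2, 3]
```

Even a ten-row read can straddle two files, and each of those files has to be fetched in full unless the storage format supports partial reads.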

Workflow with Hub Arrays

Simply declare an array with a namespace inside your code, and that's it. “Where and how is the data stored?” is totally abstracted away from the data scientist or machine learning engineer. You can create a numpy array up to petabyte scale without worrying whether it will fit into RAM or on local disk. The inner workings are as follows:

The actual array is created in a cloud bucket (object storage) and partially cached in your local environment. The array size can easily scale to 1 PB.
When you read from or write to the array, the package automatically synchronizes changes between the local cache and the cloud bucket over the internet.
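The two points above can be illustrated with a toy model. This is a sketch under stated assumptions, not Hub's implementation: the class `LazyChunkedArray` is hypothetical, a plain dict stands in for the cloud bucket, chunks are split along the first axis only, and only single-row integer indexing is supported.

```python
import numpy as np


class LazyChunkedArray:
    """Toy array: chunks live in `remote`; a local dict acts as the cache."""

    def __init__(self, shape, rows_per_chunk, remote, dtype=np.float64):
        self.shape = shape
        self.rows_per_chunk = rows_per_chunk
        self.remote = remote      # stand-in for a cloud bucket: chunk id -> bytes
        self.cache = {}           # chunk id -> ndarray held locally
        self.dtype = np.dtype(dtype)
        self.downloads = 0        # counts simulated fetches from the bucket

    def _chunk_shape(self, cid):
        start = cid * self.rows_per_chunk
        rows = min(self.rows_per_chunk, self.shape[0] - start)
        return (rows,) + tuple(self.shape[1:])

    def _load(self, cid):
        # Fetch a chunk only on first access; later accesses hit the cache.
        if cid not in self.cache:
            raw = self.remote.get(cid)
            if raw is None:
                chunk = np.zeros(self._chunk_shape(cid), dtype=self.dtype)
            else:
                self.downloads += 1
                flat = np.frombuffer(raw, dtype=self.dtype)
                chunk = flat.reshape(self._chunk_shape(cid)).copy()
            self.cache[cid] = chunk
        return self.cache[cid]

    def __getitem__(self, row):
        cid, offset = divmod(row, self.rows_per_chunk)
        return self._load(cid)[offset]

    def __setitem__(self, row, value):
        cid, offset = divmod(row, self.rows_per_chunk)
        chunk = self._load(cid)
        chunk[offset] = value
        self.remote[cid] = chunk.tobytes()  # write-through to the "bucket"


bucket = {}  # shared between the two "machines" below
writer = LazyChunkedArray((100, 4, 4, 3), rows_per_chunk=10, remote=bucket)
image = np.random.random((4, 4, 3))
writer[42] = image

reader = LazyChunkedArray((100, 4, 4, 3), rows_per_chunk=10, remote=bucket)
print(np.array_equal(reader[42], image))  # True
print(reader.downloads, sorted(bucket))   # 1 [4]
```

Only the chunk containing row 42 is ever stored or transferred; the other nine chunks never materialize, which is what lets the declared shape exceed local memory.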

We’re working on a simple authentication system, data management, advanced data caching and fetching, and version control.

> import hub
> import numpy as np

# Create a large array that you can read/write from anywhere.
> bigarray = hub.array((100000, 512, 512, 3), name="test/bigarray:v0")

# Writing to one slice of the array. Automatically syncs to cloud.
> image = np.random.random((512,512,3))
> bigarray[0, :,:, :] = image

# Lazy-Load an existing array from cloud without really downloading the entries
> imagenet = hub.load(name='imagenet')
> imagenet.shape
(1034908, 469, 387, 3)

# Download the entries from cloud to local on demand.
> imagenet[0,:,:,:].mean()

Usage

Step 1. Install

pip3 install hub

Step 2. Lazy-load a public dataset, fetch a single image at up to 50 MB/s, and plot it

> import hub
> imagenet = hub.load(name='imagenet')
> imagenet.shape
(1034908, 469, 387, 3)

> import matplotlib.pyplot as plt
> plt.imshow(imagenet[0])

Step 3. Compute the mean and standard deviation of any chunk of the full dataset

> imagenet[0:10,100:200,100:200].mean()
0.132
> imagenet[0:10,100:200,100:200].std()
0.005

Step 4. Create your own array and access it from another machine

# Create on one machine
> import hub
> import numpy as np
> mnist = hub.array((50000,28,28,1), name="name/random_name:v1")
# The assigned value matches the shape of a single entry: (28, 28, 1)
> mnist[0,:,:,:] = np.random.random((28,28,1))

# Access it from another machine
> import hub
> mnist = hub.load(name='name/random_name:v1')
> print(mnist[0])

Features

Data Management: Storing large datasets with version control
Collaboration: Multiple data scientists working on the same data in sync
Distribute: Accessing from multiple machines at the same time
Machine Learning: Native integration with NumPy, Dask, PyTorch, or TensorFlow
Scale: Create arrays as large as you want
Visualization: Visualize the data without trouble

Benchmarking

For full reproducibility please refer to the code

Download Parallelism

The following chart shows that hub on a single machine (AWS p3.2xlarge) can achieve up to 875 MB/s download speed with multithreading and multiprocessing enabled. The choice of chunk size plays a role in reaching the maximum speed-up. The chart below shows the trade-off across different numbers of threads and processes.
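The effect of thread count on chunk downloads can be reproduced in miniature. The sketch below is not the benchmark code: `fetch_chunk` is a hypothetical stand-in that sleeps instead of touching the network, and only multithreading (not multiprocessing) is shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(chunk_id):
    """Hypothetical download: sleeps to imitate network latency."""
    time.sleep(0.05)
    return chunk_id, bytes(1024)  # 1 KiB of dummy payload


def fetch_all(chunk_ids, num_threads):
    """Fetch all chunks concurrently and return {chunk_id: payload}."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return dict(pool.map(fetch_chunk, chunk_ids))


chunk_ids = list(range(16))
for num_threads in (1, 4, 16):
    started = time.perf_counter()
    chunks = fetch_all(chunk_ids, num_threads)
    elapsed = time.perf_counter() - started
    print(f"{num_threads:>2} threads: {len(chunks)} chunks in {elapsed:.2f}s")
```

Because each simulated request spends its time waiting, elapsed time should fall as threads are added. With real transfers, bandwidth and chunk size set the ceiling, which is the trade-off the chart describes.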

Training Deep Learning Model

The following benchmark shows that streaming data through the Hub package while training a deep learning model is equivalent to reading data from the local file system. The benchmarks were produced on AWS using a p3.2xlarge machine with a V100 GPU. The data is stored on S3 within the same region. In the asynchronous data loading figure, the first three models (VGG, ResNet101, and DenseNet) have no data bottleneck: the processing time is greater than the time needed to load the data in the background. However, for more lightweight models such as ResNet18 or SqueezeNet, training is bottlenecked by read speed. The number of parallel workers for reading the data was kept the same across runs. A smaller batch size was chosen for large models so that they fit in GPU RAM.

Figures: Training Deep Learning | Data Streaming
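Why heavy models hide the loading cost while light models do not can be seen in a small simulation of background loading. This is an illustrative sketch only: `prefetching_loader`, `load_batch`, and `train_step` are hypothetical names, and the sleep durations are made-up stand-ins for read time and compute time, not measured values.

```python
import queue
import threading
import time


def prefetching_loader(load_batch, num_batches, buffer_size=2):
    """Yield batches while a background thread loads the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def worker():
        for i in range(num_batches):
            buffer.put(load_batch(i))
        buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is done:
            return
        yield batch


def load_batch(i):
    time.sleep(0.02)  # stand-in for reading one batch from remote storage
    return i


def train_step(batch, compute_time):
    time.sleep(compute_time)  # stand-in for the forward/backward pass


for name, compute_time in (("heavy model", 0.05), ("light model", 0.005)):
    started = time.perf_counter()
    seen = []
    for batch in prefetching_loader(load_batch, num_batches=10):
        train_step(batch, compute_time)
        seen.append(batch)
    elapsed = time.perf_counter() - started
    print(f"{name}: {elapsed:.2f}s for {len(seen)} batches")
```

When a training step takes longer than a load, the next batch is already waiting in the buffer and total time is governed by compute. When the step is shorter, the loop stalls on the queue and total time is governed by read speed.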

Use Cases

Aerial Images: Satellite and drone imagery
Medical Images: Volumetric images such as MRI or X-ray
Self-Driving Cars: Radar, 3D LIDAR, Point Cloud, Semantic Segmentation, Video Objects
Retail: Self-checkout datasets
Media: Images, Video, Audio storage

Acknowledgement

This technology was inspired by our experience at SeungLab at Princeton University. We would like to thank William Silversmith @SeungLab and his awesome project cloud-volume.

Project details

Release history

3.0.1

Nov 14, 2022

3.0.0

Sep 28, 2022

2.8.7

Sep 28, 2022

2.8.6

Sep 28, 2022

2.8.5

Sep 20, 2022

2.8.4

Sep 15, 2022

2.8.3

Sep 14, 2022

2.8.2 yanked

Sep 13, 2022

Reason this release was yanked:

deeplake doesn't work

2.8.1

Sep 9, 2022

2.8.0 yanked

Sep 7, 2022

2.7.5

Aug 24, 2022

2.7.4

Aug 15, 2022

2.7.3

Aug 10, 2022

2.7.2

Jul 26, 2022

2.7.1

Jul 25, 2022

2.7.0

Jul 19, 2022

2.6.0

Jun 28, 2022

2.5.2

Jun 6, 2022

2.5.1

Jun 1, 2022

2.5.0

May 18, 2022

2.4.2

May 9, 2022

2.4.1

May 2, 2022

2.4.0

Apr 27, 2022

2.3.5

Apr 26, 2022

2.3.4

Apr 19, 2022

2.3.3

Apr 5, 2022

2.3.2

Mar 21, 2022

2.3.1

Mar 5, 2022

2.3.0

Feb 21, 2022

2.2.4

Feb 21, 2022

2.2.3

Feb 1, 2022

2.2.2

Jan 23, 2022

2.2.1

Jan 15, 2022

2.2.0

Dec 22, 2021

2.1.1

Nov 29, 2021

2.1.0

Nov 10, 2021

2.0.14

Oct 25, 2021

2.0.13

Oct 19, 2021

2.0.12

Oct 11, 2021

2.0.11

Sep 19, 2021

2.0.9

Aug 30, 2021

2.0.8

Aug 16, 2021

2.0.7

Aug 11, 2021

2.0.6

Aug 9, 2021

2.0.5.post1

Aug 9, 2021

2.0.5 yanked

Aug 9, 2021

Reason this release was yanked:

Older version

2.0.4

Aug 2, 2021

2.0.3

Jul 25, 2021

2.0.2

Jul 19, 2021

2.0.2a0 pre-release

Jul 19, 2021

2.0.1

Jul 7, 2021

2.0a8 pre-release

Jul 5, 2021

2.0a7 pre-release

Jun 23, 2021

2.0a6 pre-release

Jun 23, 2021

2.0a5 pre-release

Jun 23, 2021

2.0a4 pre-release

Jun 23, 2021

2.0a3 pre-release

Jun 17, 2021

2.0a2 pre-release

Jun 17, 2021

2.0a1 pre-release

Jun 15, 2021

2.0a0 pre-release

Jun 9, 2021

1.3.7

May 29, 2021

1.3.5

Apr 29, 2021

1.3.4

Apr 19, 2021

1.3.3

Apr 6, 2021

1.3.2

Mar 26, 2021

1.3.1a0 pre-release

Feb 18, 2021

1.3.0

Mar 8, 2021

1.3.0a0 pre-release

Feb 5, 2021

1.2.3

Mar 7, 2021

1.2.2

Feb 17, 2021

1.2.1

Feb 16, 2021

1.2.0

Jan 24, 2021

1.1.3

Jan 15, 2021

1.1.1

Jan 11, 2021

1.1

Jan 10, 2021

1.0.8

Dec 19, 2020

1.0.7

Dec 18, 2020

1.0.6

Dec 15, 2020

1.0.5

Dec 10, 2020

1.0.4

Dec 9, 2020

1.0.3

Dec 5, 2020

1.0.2

Dec 3, 2020

1.0.1

Dec 2, 2020

1.0.0

Dec 1, 2020

1.0.0rc2 pre-release

Dec 1, 2020

1.0.0rc1 pre-release

Dec 1, 2020

1.0.0rc0 pre-release

Dec 1, 2020

1.0.0b5 pre-release

Nov 21, 2020

1.0.0b4 pre-release

Nov 16, 2020

1.0.0b3 pre-release

Nov 16, 2020

1.0.0b2 pre-release

Nov 15, 2020

1.0.0b1 pre-release

Nov 14, 2020

1.0.0b0 pre-release

Nov 13, 2020

1.0.0a5 pre-release

Nov 10, 2020

1.0.0a4 pre-release

Nov 7, 2020

1.0.0a3 pre-release

Nov 7, 2020

1.0.0a2 pre-release

Nov 7, 2020

1.0.0a1 pre-release

Nov 7, 2020

1.0.0a0 pre-release

Nov 7, 2020

0.12.7

Sep 20, 2020

0.12.6

Sep 18, 2020

0.12.4

Sep 17, 2020

0.12.1

Sep 7, 2020

0.11.0.0

Aug 19, 2020

0.10.0.2

Aug 13, 2020

0.10.0.1

Aug 10, 2020

0.10.0.0

Aug 10, 2020

0.9.0.4

Aug 7, 2020

0.9.0.3

Jul 28, 2020

0.9.0.2

Jul 28, 2020

0.9.0.1

Jul 28, 2020

0.9.0.0

Jul 28, 2020

0.5.0.0

Jul 2, 2020

0.4.1.5

May 12, 2020

0.4.1.4

May 1, 2020

0.4.1.3

May 1, 2020

0.4.1.2

Mar 19, 2020

This version

0.4.1.0

Mar 9, 2020

0.4.0.1

Feb 21, 2020

0.4.0.0

Feb 21, 2020

0.2.0.4

Sep 6, 2019

0.2.0.3

Aug 24, 2019

0.2.0.1

Aug 23, 2019

0.2

Aug 23, 2019

0.1.6

Aug 16, 2019

0.1.5

Aug 15, 2019

0.1.4

Aug 15, 2019

0.1.3

Aug 15, 2019

0.1.2

Aug 15, 2019

0.1.1

Aug 15, 2019

0.1.0

Aug 15, 2019

0.0.1

Aug 11, 2019

Download files

Source Distribution

hub-0.4.1.0.tar.gz (29.3 kB)

Uploaded Mar 9, 2020 Source

Built Distribution

hub-0.4.1.0-py3-none-any.whl (51.7 kB)

Uploaded Mar 9, 2020 Python 3

File details

Details for the file hub-0.4.1.0.tar.gz.

File metadata

Download URL: hub-0.4.1.0.tar.gz
Upload date: Mar 9, 2020
Size: 29.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.19.1 setuptools/41.0.1 requests-toolbelt/0.8.0 tqdm/4.33.0 CPython/2.7.16

File hashes

Hashes for hub-0.4.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c0450387acfa185a081bea864a6f28d305d88b452bcbc9b845d430a2c274c951`
MD5	`c4fda5c4f3e1db3a696dc42ba7cbbf0f`
BLAKE2b-256	`8b9abd1e3872a2e401daa1707fcd77d80ad969acd956c1256d7827f5ab422899`

File details

Details for the file hub-0.4.1.0-py3-none-any.whl.

File metadata

Download URL: hub-0.4.1.0-py3-none-any.whl
Upload date: Mar 9, 2020
Size: 51.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.19.1 setuptools/41.0.1 requests-toolbelt/0.8.0 tqdm/4.33.0 CPython/2.7.16

File hashes

Hashes for hub-0.4.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0bd170e37ef53be11afedbca5ecf928f65a33a77717fba44c4983b6f5e2bcf74`
MD5	`3d75e3aa65a1b932a36827c6855208ec`
BLAKE2b-256	`3f2fea2954b3c5347850e33dc1d8dd1776326a0bf43a894a3ab143366a9172d1`

hub 0.4.1.0
