A lightweight data structure for unstructured data

docarray

The data structure for unstructured data.

🌌 All data types: super-expressive data structure for representing complicated/mixed/nested text, image, video, audio, 3D mesh data.

🧑‍🔬 Data science powerhouse: easy-to-use functions that help data scientists embed, match, visualize, and evaluate unstructured data via Torch/TensorFlow/ONNX/PaddlePaddle.

🚡 Portable: ready to go on the wire, with efficient conversion to and from Protobuf, binary, JSON, CSV, and dataframe.

Install

Requires Python 3.7+ and numpy:

pip install docarray

To install full dependencies, please use pip install docarray[full].

Documentation

Get Started

Let's use DocArray and ResNet50 to build a meme image search on Totally Looks Like. This dataset contains 6016 image pairs stored in /left and /right. Images that share the same filename are labeled as perceptually similar. For example:

[Figure: example image pairs, shown in alternating /left and /right columns]

Our task: given an image from /left, find its most similar image in /right (without looking at the filename, of course).

Load images

First, load the left images:

from docarray import DocumentArray, Document

left_da = DocumentArray.from_files('left/*.jpg')

To get a feel for the data you will handle, plot the images in one sprite image:

left_da.plot_image_sprites()

Figure: Loading the Totally Looks Like dataset with the DocArray API

Apply preprocessing

Let's do some standard computer vision preprocessing:

def preproc(d: Document):
    return (d.load_uri_to_image_blob()  # load
             .set_image_blob_normalization()  # normalize color 
             .set_image_blob_channel_axis(-1, 0))  # switch color axis

left_da.apply(preproc)
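To make the normalization and channel-axis steps concrete, here is a minimal NumPy sketch of the same idea applied to a raw array. It assumes the commonly used ImageNet mean and standard deviation and a channel-last uint8 input; `preprocess_array` and the constants are hypothetical names for illustration, and the values DocArray actually uses may differ.

```python
import numpy as np

# Assumed ImageNet statistics (a common convention, not confirmed by this README).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess_array(image: np.ndarray) -> np.ndarray:
    """Normalize an HxWx3 uint8 image and return it as a 3xHxW float32 array."""
    scaled = image.astype(np.float32) / 255.0             # map [0, 255] to [0, 1]
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD  # per channel, broadcast over the last axis
    return np.moveaxis(normalized, -1, 0)                 # channel-last -> channel-first


dummy = np.zeros((4, 5, 3), dtype=np.uint8)  # a tiny all-black stand-in image
out = preprocess_array(dummy)
print(out.shape)  # (3, 4, 5)
```

Moving the channel axis to the front matches the channel-first layout that PyTorch vision models such as ResNet50 expect.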

Did I mention that apply works in parallel?
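How DocArray parallelizes internally is not covered here. As a generic illustration of the idea, the sketch below applies a function to every item of a collection with a thread pool from the standard library; `parallel_apply` and `square` are hypothetical names, not part of the DocArray API.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x: int) -> int:
    """Hypothetical stand-in for a per-document preprocessing function."""
    return x * x


def parallel_apply(func, items, max_workers: int = 4):
    """Apply func to every item using a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


results = parallel_apply(square, range(5))
print(results)  # [0, 1, 4, 9, 16]
```

Threads suit I/O-bound work such as reading image files; CPU-bound preprocessing usually benefits more from processes.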

Embed images

Now convert images into embeddings using a pretrained ResNet50:

import torchvision
model = torchvision.models.resnet50(pretrained=True)  # load ResNet50
left_da.embed(model, device='cuda')  # embed on the GPU to speed things up

Visualize embeddings

You can visualize the embeddings via t-SNE in an interactive embedding projector:

left_da.plot_embeddings()

Figure: Visualizing embeddings via t-SNE and the embedding projector

Fun is fun, but recall our goal is to match left images against right images and so far we have only handled the left. Let's repeat the same procedure for the right:

right_da = (DocumentArray.from_files('right/*.jpg')
                         .apply(preproc)
                         .embed(model, device='cuda'))

Match nearest neighbours

We can now match the left to the right and take the top-9 results.

left_da.match(right_da, limit=9)

Let's inspect what's inside left_da now:

for d in left_da:
    for m in d.matches:
        print(d.uri, m.uri, m.scores['cosine'].value)

left/02262.jpg right/03459.jpg 0.21102
left/02262.jpg right/02964.jpg 0.13871843
left/02262.jpg right/02103.jpg 0.18265384
left/02262.jpg right/04520.jpg 0.16477376
...
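To build intuition for what nearest-neighbour matching computes, here is a small pure-Python sketch. It assumes the `cosine` score is a cosine distance (1 minus cosine similarity, so smaller means closer); the README does not state this. The helper names and toy vectors are made up for illustration and are not DocArray's implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k_matches(
    query: Sequence[float], candidates: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the k closest."""
    scored = [(name, cosine_distance(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 2-D "embeddings"; real ResNet50 embeddings have many more dimensions.
right = {
    'right/a.jpg': [1.0, 0.0],
    'right/b.jpg': [0.0, 1.0],
    'right/c.jpg': [1.0, 1.0],
}
matches = top_k_matches([1.0, 0.1], right, k=2)
print([name for name, _ in matches])  # ['right/a.jpg', 'right/c.jpg']
```

This brute-force version compares the query with every candidate, which is fine for a few thousand images.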

It is easier to judge the matches by looking at them:

(DocumentArray(left_da[12].matches, copy=True)
    .apply(lambda d: d.set_image_blob_channel_axis(0, -1)
                      .set_image_blob_inv_normalization())
    .plot_image_sprites('result.png'))

Figure: Visualizing top-9 matches using the DocArray API

Quantitative evaluation
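One simple way to quantify match quality, assuming the shared filename is the ground truth as described in the dataset introduction, is the hit rate at k: the fraction of left images whose top-k matches include the right image with the same filename. The sketch below is a plain-Python illustration on made-up URIs; `file_stem` and `hit_rate_at_k` are hypothetical helpers, not DocArray functions.

```python
import os
from typing import Dict, List


def file_stem(path: str) -> str:
    """Return the filename without directory or extension, e.g. '00001'."""
    return os.path.splitext(os.path.basename(path))[0]


def hit_rate_at_k(matches: Dict[str, List[str]], k: int) -> float:
    """Fraction of queries whose top-k matches contain the same-named file."""
    if not matches:
        return 0.0
    hits = 0
    for query_uri, matched_uris in matches.items():
        target = file_stem(query_uri)
        if any(file_stem(uri) == target for uri in matched_uris[:k]):
            hits += 1
    return hits / len(matches)


# Made-up match lists, ordered from best to worst match.
toy = {
    'left/00001.jpg': ['right/00007.jpg', 'right/00001.jpg'],
    'left/00002.jpg': ['right/00009.jpg', 'right/00004.jpg'],
}
print(hit_rate_at_k(toy, k=2))  # 0.5
print(hit_rate_at_k(toy, k=1))  # 0.0
```

With real data, the dictionary could be filled from the `d.uri` and `m.uri` values printed in the inspection loop.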
