The data structure for unstructured data

DocArray is a library for nested, unstructured data such as text, images, audio, video, and 3D meshes. It allows deep learning engineers to easily preprocess, embed, search, recommend, and transfer the data.

🌌 All data types: super-expressive data structure for representing complicated/mixed/nested text, image, video, audio, 3D mesh data.

🐍 Pythonic API: easy-to-use idioms and interfaces, just like the native Python list. If you know how to Python, you know how to DocArray.

🧑‍🔬 Data science powerhouse: greatly facilitates data scientists' work on embedding, matching, visualizing, and evaluating via Torch/TensorFlow/ONNX/PaddlePaddle.

🚡 Portable: ready to wire at any time, with efficient and compact serialization from/to Protobuf, binary, JSON, CSV, and dataframe.

Install

Requires Python 3.7+ and numpy only:

pip install docarray

Additional features can be enabled by installing the full dependencies: pip install "docarray[full]" (the quotes keep shells such as zsh from interpreting the brackets).

Documentation

Get Started

Let's use DocArray and the Totally Looks Like dataset to build a simple meme image search. The dataset contains 6016 image pairs stored in /left and /right. Images that share the same filename are perceptually similar. For example:

left/00018.jpg right/00018.jpg left/00131.jpg right/00131.jpg

Our problem: given an image from /left, find its most similar image in /right (without looking at the filename, of course).

Load images

First, load the images; we will preprocess them with standard computer vision techniques afterwards:

from docarray import DocumentArray, Document

left_da = DocumentArray.from_files('left/*.jpg')

To get a feel for the data you will handle, plot the images in one sprite image:

left_da.plot_image_sprites()

Apply preprocessing

Let's do some standard computer vision preprocessing:

def preproc(d: Document):
    return (d.load_uri_to_image_blob()  # load
             .set_image_blob_normalization()  # normalize color 
             .set_image_blob_channel_axis(-1, 0))  # switch color axis for the pytorch model later

left_da.apply(preproc)

Did I mention that apply works in parallel?
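
To give a rough idea of what applying a function to every element in parallel looks like, here is a minimal sketch built on the standard library only. It is not DocArray's implementation: apply_parallel and fake_preproc are hypothetical names introduced for illustration, and the thread pool is an assumed backend.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar('T')
R = TypeVar('R')


def apply_parallel(items: Iterable[T], func: Callable[[T], R],
                   num_workers: int = 4) -> List[R]:
    """Apply func to every item using a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(func, items))


def fake_preproc(uri: str) -> str:
    # Hypothetical stand-in for loading and normalizing one image.
    return uri.upper()


results = apply_parallel(['left/00001.jpg', 'left/00002.jpg'], fake_preproc)
print(results)
```

Threads suit I/O-bound steps such as reading image files; a CPU-heavy function would usually need processes instead.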

Embed images

Now convert images into embeddings using a pretrained ResNet50:

import torchvision
model = torchvision.models.resnet50(pretrained=True)  # load ResNet50
left_da.embed(model, device='cuda')  # embed via GPU to speedup

This step takes ~30 seconds on GPU. Besides PyTorch, you can also use TensorFlow, PaddlePaddle, or ONNX models in .embed(...).

Visualize embeddings

You can visualize the embeddings via t-SNE in an interactive embedding projector:

left_da.plot_embeddings()

Fun is fun, but recall that our goal is to match left images against right images, and so far we have only handled the left. Let's repeat the same procedure for the right:

right_da = (DocumentArray.from_files('right/*.jpg')
                         .apply(preproc)
                         .embed(model, device='cuda'))

Match nearest neighbours

We can now match the left to the right and take the top-9 results.

left_da.match(right_da, limit=9)

Let's inspect what's inside left_da now:

for d in left_da:
    for m in d.matches:
        print(d.uri, m.uri, m.scores['cosine'].value)

left/02262.jpg right/03459.jpg 0.21102
left/02262.jpg right/02964.jpg 0.13871843
left/02262.jpg right/02103.jpg 0.18265384
left/02262.jpg right/04520.jpg 0.16477376
...
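
Conceptually, matching ranks every right embedding by its distance to a left embedding and keeps the closest ones. The sketch below shows that idea in plain Python, assuming the printed score is a cosine distance (smaller means more similar). The helper names cosine_distance and top_k_matches and the toy two-dimensional embeddings are made up for illustration and are not part of DocArray.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k_matches(query: Sequence[float],
                  candidates: Dict[str, Sequence[float]],
                  k: int) -> List[Tuple[str, float]]:
    """Rank all candidates by distance to the query and keep the k closest."""
    scored = [(uri, cosine_distance(query, emb)) for uri, emb in candidates.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy embeddings standing in for the ResNet50 features.
right = {
    'right/a.jpg': [1.0, 0.0],
    'right/b.jpg': [0.0, 1.0],
    'right/c.jpg': [1.0, 1.0],
}
matches = top_k_matches([1.0, 0.1], right, k=2)
for uri, score in matches:
    print(uri, score)
```

This brute-force version compares the query against every candidate, which is fine for a few thousand images.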

Better yet, let's see it:

(DocumentArray(left_da[8].matches, copy=True)
    .apply(lambda d: d.set_image_blob_channel_axis(0, -1)
                      .set_image_blob_inv_normalization())
    .plot_image_sprites('result.png'))

What we did here is revert the preprocessing steps (i.e. switch the channel axis back and undo the normalization) on the copied matches, so that they can be visualized as image sprites.
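
To see why the normalization can be undone, here is a small per-pixel sketch of a normalize/denormalize round trip. The mean and standard deviation are the commonly used ImageNet statistics and are an assumption here, as are the helper names normalize_pixel and denormalize_pixel; the exact constants DocArray applies may differ.

```python
from typing import Sequence, Tuple

# Assumed per-channel statistics (the widely used ImageNet values).
IMG_MEAN = (0.485, 0.456, 0.406)
IMG_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb: Sequence[int]) -> Tuple[float, ...]:
    """Scale 0-255 values to 0-1, then standardize each channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, IMG_MEAN, IMG_STD))


def denormalize_pixel(normed: Sequence[float]) -> Tuple[int, ...]:
    """Invert normalize_pixel and round back to 0-255 integers."""
    return tuple(round((v * s + m) * 255.0) for v, m, s in zip(normed, IMG_MEAN, IMG_STD))


pixel = (12, 128, 255)
restored = denormalize_pixel(normalize_pixel(pixel))
print(restored)
```

Because each step is a simple affine transform per channel, applying the inverse recovers the original pixel values up to floating-point rounding.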

Quantitative evaluation

Serious as you are, visual inspection is surely not enough. Let's calculate recall@K. First, we construct the ground-truth matches:

groundtruth = DocumentArray(
    Document(uri=d.uri, matches=[Document(uri=d.uri.replace('left', 'right'))]) for d in left_da)

Here we create a new DocumentArray with real matches by simply replacing the filename, e.g. left/00001.jpg to right/00001.jpg. That's all we need: if the predicted match has the identical uri as the groundtruth match, then it is correct.
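
The plain string replacement works for this dataset because the filenames are numeric. As a hedged alternative, the sketch below swaps only the parent folder, so a filename that happens to contain the word left is untouched. The helper groundtruth_uri is hypothetical and not part of DocArray.

```python
from pathlib import PurePosixPath


def groundtruth_uri(left_uri: str) -> str:
    """Map '<dir>/left/<name>' to '<dir>/right/<name>' by renaming only the folder."""
    path = PurePosixPath(left_uri)
    # with_name replaces the last component of the parent path ('left').
    return str(path.parent.with_name('right') / path.name)


print(groundtruth_uri('left/00001.jpg'))
print(groundtruth_uri('left/leftover.jpg'))
```

With str.replace, the second example would also rewrite the filename itself.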

Now let's check recall@K for K from 1 to 5 over the full dataset:

for k in range(1, 6):
    print(f'recall@{k}',
          left_da.evaluate(
            groundtruth,
            hash_fn=lambda d: d.uri,
            metric='recall_at_k',
            k=k,
            max_rel=1))

recall@1 0.02726063829787234
recall@2 0.03873005319148936
recall@3 0.04670877659574468
recall@4 0.052194148936170214
recall@5 0.0573470744680851

Other metrics, such as precision_at_k, ndcg_at_k, and hit_at_k, can also be used.
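
For intuition about what the recall numbers above measure, here is a minimal sketch of recall@K averaged over queries, assuming each query has exactly one relevant document as in this dataset. The functions recall_at_k and mean_recall_at_k and the toy predictions are illustrative only and are not DocArray's implementation.

```python
from typing import Dict, List, Set


def recall_at_k(predicted: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant URIs that appear in the top-k predictions."""
    hits = sum(1 for uri in predicted[:k] if uri in relevant)
    return hits / len(relevant)


def mean_recall_at_k(predictions: Dict[str, List[str]],
                     groundtruth: Dict[str, Set[str]], k: int) -> float:
    """Average recall@k over all queries in the ground truth."""
    scores = [recall_at_k(predictions[q], groundtruth[q], k) for q in groundtruth]
    return sum(scores) / len(scores)


# Toy ranked matches for two queries.
predictions = {
    'left/1.jpg': ['right/1.jpg', 'right/7.jpg', 'right/3.jpg'],
    'left/2.jpg': ['right/9.jpg', 'right/2.jpg', 'right/4.jpg'],
}
groundtruth = {'left/1.jpg': {'right/1.jpg'}, 'left/2.jpg': {'right/2.jpg'}}

for k in range(1, 4):
    print(f'recall@{k}', mean_recall_at_k(predictions, groundtruth, k))
```

With a single relevant document per query, recall@K is simply the share of queries whose true match appears in the top K.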

Save results

You can save a DocumentArray to binary, JSON, dict, dataframe, CSV or Protobuf message. In its simplest form,

left_da.save('left_da.bin')

To reuse it, do left_da = DocumentArray.load('left_da.bin').

If you want to transfer a DocumentArray from one machine to another or share it with your colleagues, you can do:

left_da.push(token='my_shared_da')
left_da = DocumentArray.pull(token='my_shared_da')

Anyone who knows the token my_shared_da can pull the DocumentArray and work on it.

Intrigued? That's only scratching the surface of what DocArray is capable of. Read our docs to learn more.

Support

Join Us

DocArray is backed by Jina AI and licensed under Apache-2.0. We are actively hiring AI engineers and solution engineers to build the next neural search ecosystem in open source.
