meerkat-ml

Meerkat is building new data abstractions to make machine learning easier.

Project description

Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

Getting started

pip install meerkat-ml

Note: some parts of Meerkat rely on optional dependencies. If you know which optional dependencies you'd like to install, you can do so with something like pip install "meerkat-ml[dev,text]" instead (the quotes keep shells such as zsh from interpreting the square brackets). See setup.py for a full list of optional dependencies.

Load your dataset into a DataPanel and get going!

import meerkat as mk
dp = mk.DataPanel.from_csv("...")

What is Meerkat?

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by efficient and robust IO under the hood.

Meerkat's core contribution is the DataPanel, a simple columnar data abstraction. The Meerkat DataPanel can house columns of arbitrary type – from integers and strings to complex, high-dimensional objects like videos, images, medical volumes and graphs.

DataPanel loads high-dimensional data lazily. A full high-dimensional dataset won't typically fit in memory. Behind the scenes, DataPanel handles this by only materializing these objects when they are needed.

import meerkat as mk

# Images are NOT read from disk at DataPanel creation...
dp = mk.DataPanel({
    'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'image': mk.ImageColumn.from_filepaths(['fox.png', 'jump.png', 'dog.png']),
    'label': [0, 1, 0]
}) 

# ...only at this point is "fox.png" read from disk
dp["image"][0]
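The lazy-loading idea can be sketched in plain Python. This is only an illustration of deferring a read until access, not Meerkat's implementation: `LazyCell`, `fake_loader` and the `reads` log are hypothetical names, and the loader fabricates bytes instead of touching the disk.

```python
from typing import Callable, List


class LazyCell:
    """Hypothetical stand-in for a lazily loaded cell (not Meerkat's class)."""

    def __init__(self, filepath: str, loader: Callable[[str], bytes]):
        # Only the path is stored at creation time; nothing is loaded yet.
        self.filepath = filepath
        self._loader = loader

    def get(self) -> bytes:
        # The expensive read happens here, on access.
        return self._loader(self.filepath)


reads: List[str] = []  # records which files were actually "read"


def fake_loader(path: str) -> bytes:
    # Placeholder for a real image reader; it just logs the access.
    reads.append(path)
    return f"<pixels of {path}>".encode()


cells = [LazyCell(p, fake_loader) for p in ["fox.png", "jump.png", "dog.png"]]
reads_after_creation = list(reads)  # still empty: no file has been loaded

first_image = cells[0].get()        # only "fox.png" is loaded at this point
reads_after_access = list(reads)
```

Keeping the loader separate from the path is one way to make the cost of materialization explicit: creating a million cells stays cheap, and only the rows that are actually inspected pay for I/O.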

DataPanel supports advanced indexing. Using indexing patterns similar to those of Pandas and NumPy, we can access a subset of a DataPanel's rows and columns.

import meerkat as mk
import numpy as np
dp = ... # create DataPanel

# Pull a column out of the DataPanel
new_col: mk.ImageColumn = dp["image"]

# Create a new DataPanel from a subset of the columns in an existing one
new_dp: mk.DataPanel = dp[["image", "label"]] 

# Create a new DataPanel from a subset of the rows in an existing one
new_dp: mk.DataPanel = dp[10:20] 
new_dp: mk.DataPanel = dp[np.array([0,2,4,8])]

# Pull a column out of the DataPanel and get a subset of its rows 
new_col: mk.ImageColumn = dp["image"][10:20]
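To make the indexing rules concrete, here is a minimal sketch of how a columnar container might dispatch on the type of the key. `ToyPanel` is a hypothetical class written for this illustration and stores plain Python lists; it is not how `DataPanel` is implemented and it does not handle NumPy arrays.

```python
from typing import Dict


class ToyPanel:
    """Hypothetical columnar container; illustrates key-type dispatch only."""

    def __init__(self, columns: Dict[str, list]):
        lengths = {len(col) for col in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = dict(columns)

    def __getitem__(self, key):
        if isinstance(key, str):
            # A single name returns the column itself.
            return self.columns[key]
        if isinstance(key, slice):
            # A slice selects rows from every column.
            return ToyPanel({n: c[key] for n, c in self.columns.items()})
        if isinstance(key, list) and all(isinstance(k, str) for k in key):
            # A list of names selects a subset of columns.
            return ToyPanel({n: self.columns[n] for n in key})
        if isinstance(key, list) and all(isinstance(k, int) for k in key):
            # A list of integers selects the given rows from every column.
            return ToyPanel({n: [c[i] for i in key] for n, c in self.columns.items()})
        raise TypeError(f"unsupported index type: {type(key).__name__}")


panel = ToyPanel({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})
labels = panel["label"]         # one column
subset_cols = panel[["label"]]  # new panel with a subset of columns
rows = panel[1:3]               # new panel with rows 1 and 2
picked = panel[[0, 3]]          # new panel with rows 0 and 3
```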

DataPanel supports map, update and filter operations. When training and evaluating our models, we often perform operations on each example in our dataset (e.g. compute a model's prediction on each example, tokenize each sentence, compute a model's embedding for each example) and store the results. The DataPanel makes it easy to perform these operations and produce new columns (via DataPanel.map), store the columns alongside the original data (via DataPanel.update), and extract an important subset of the dataset (via DataPanel.filter). Under the hood, dataloading is multiprocessed so that costly I/O doesn't bottleneck our computation. Consider the example below, where we update a DataPanel with two new columns holding model predictions and probabilities.

    # A simple evaluation loop using Meerkat
    import meerkat as mk
    import torch
    import torch.nn as nn

    dp: mk.DataPanel = ... # get DataPanel
    model: nn.Module = ... # get the model
    model.to(0).eval() # move the model to device 0 and prepare it for evaluation

    @torch.no_grad()
    def predict(batch: dict):
        # Move probabilities to the CPU once and derive predictions from them
        probs = torch.softmax(model(batch["input"].to(0)), dim=-1).cpu()
        return {"probs": probs, "pred": probs.argmax(dim=-1)}

    # updated_dp has two new `TensorColumn`s: one for probabilities and one
    # for predictions
    updated_dp: mk.DataPanel = dp.update(function=predict, batch_size=128, is_batched_fn=True)
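The batched update pattern itself does not depend on PyTorch. The sketch below reproduces it on a plain dict of lists, under stated assumptions: `batched_update` and `filter_rows` are hypothetical helpers written for this illustration, they run in a single process, and the table is assumed to have at least one column. They are not Meerkat functions.

```python
from typing import Callable, Dict

Table = Dict[str, list]


def batched_update(table: Table, function: Callable[[Table], Table], batch_size: int) -> Table:
    """Apply `function` to consecutive batches and append its outputs as new columns."""
    n_rows = len(next(iter(table.values())))
    new_columns: Dict[str, list] = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict holding the same slice of every column.
        batch = {name: col[start:start + batch_size] for name, col in table.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**table, **new_columns}


def filter_rows(table: Table, predicate: Callable[[dict], bool]) -> Table:
    """Keep only the rows for which `predicate(row)` is true."""
    n_rows = len(next(iter(table.values())))
    keep = [i for i in range(n_rows)
            if predicate({name: col[i] for name, col in table.items()})]
    return {name: [col[i] for i in keep] for name, col in table.items()}


def predict(batch: Table) -> Table:
    # Toy "model": threshold each input at 0.5.
    return {"pred": [int(x > 0.5) for x in batch["input"]]}


table = {"input": [0.2, 0.9, 0.6, 0.1, 0.7]}
updated = batched_update(table, predict, batch_size=2)
positives = filter_rows(updated, lambda row: row["pred"] == 1)
```

Returning a dict of new columns from the batch function mirrors the `predict` example above: each key becomes a column, and the per-batch outputs are concatenated in row order.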

DataPanel is extendable. Meerkat makes it easy to create custom column types for your data. The easiest way to do this is by subclassing AbstractCell. Subclasses of AbstractCell are meant to represent one element in one column of a DataPanel. For example, say we want our DataPanel to include a column of videos we have stored on disk. We want these videos to be lazily loaded using scikit-video, so we implement a VideoCell class as follows:

    from typing import Collection

    import meerkat as mk
    import skvideo.io

    class VideoCell(mk.AbstractCell):

        # What information will we eventually need to materialize the cell?
        def __init__(self, filepath: str):
            super().__init__()
            self.filepath = filepath

        # How do we actually materialize the cell?
        def get(self):
            return skvideo.io.vread(self.filepath)

        # What attributes should be written to disk on `VideoCell.write`?
        @classmethod
        def _state_keys(cls) -> Collection:
            return {"filepath"}

    # We don't need to define a `VideoColumn` class and can instead just
    # create a CellColumn from a list of `VideoCell`
    vid_column = mk.CellColumn(map(VideoCell, ["vid1.mp4", "vid2.mp4", "vid3.mp4"]))
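One way to read `_state_keys` is as a whitelist of attributes that survive a round trip to disk. The sketch below illustrates that idea with JSON. `ToyCell`, `PathCell` and the `get_state`/`from_state` helpers are assumptions made for this example, so it should not be taken as a description of how `AbstractCell` actually serializes itself.

```python
import json
from typing import Collection


class ToyCell:
    """Hypothetical base class: persists only the attributes named in `_state_keys`."""

    @classmethod
    def _state_keys(cls) -> Collection[str]:
        raise NotImplementedError

    def get_state(self) -> dict:
        # Collect only the whitelisted attributes.
        return {key: getattr(self, key) for key in self._state_keys()}

    @classmethod
    def from_state(cls, state: dict) -> "ToyCell":
        # Rebuild an instance without calling __init__, then restore its state.
        cell = cls.__new__(cls)
        cell.__dict__.update(state)
        return cell


class PathCell(ToyCell):
    def __init__(self, filepath: str):
        self.filepath = filepath
        self._cache = None  # transient attribute, deliberately not persisted

    @classmethod
    def _state_keys(cls) -> Collection[str]:
        return {"filepath"}


encoded = json.dumps(PathCell("vid1.mp4").get_state())
restored = PathCell.from_state(json.loads(encoded))
```

Persisting only the file path keeps the stored state small; the heavy payload (here, the video frames) is re-materialized from the path when it is needed.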

Supported Columns

Meerkat ships with a number of core column types and the list is growing.

Core Columns

Column	Supported	Description
`ListColumn`	Yes	Flexible and can hold any type of data.
`NumpyArrayColumn`	Yes	`np.ndarray` behavior for vectorized operations.
`TensorColumn`	Yes	`torch.Tensor` behavior for vectorized operations on the GPU.
`ImageColumn`	Yes	Holds images stored on disk (e.g. as PNG or JPEG).
`CellColumn`	Yes	Like `ListColumn`, but optimized for `AbstractCell` objects.
`SpacyColumn`	Yes	Optimized to hold spaCy Doc objects.
`EmbeddingColumn`	Planned	Optimized for embeddings and operations on embeddings.
`PredictionColumn`	Planned	Optimized for model predictions.

Contributed Columns

Column	Supported	Description
`WILDSInputColumn`	Yes	Build `DataPanel`s for the WILDS benchmark.

About

Meerkat is being developed at Stanford's Hazy Research Lab. Please reach out to kgoel [at] cs [dot] stanford [dot] edu if you would like to use or contribute to Meerkat.

Project details

This version: 0.1.0 (Jun 24, 2021)

Download files

Source Distribution

meerkat-ml-0.1.0.tar.gz (64.5 kB)

Uploaded Jun 24, 2021 Source

Built Distribution

meerkat_ml-0.1.0-py2.py3-none-any.whl (83.1 kB)

Uploaded Jun 24, 2021 Python 2 Python 3

Hashes for meerkat-ml-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`556ae15a72ad6115d6ad152d8f6934abad50dd8b918f9701d252055634a946f4`
MD5	`df82974e60a9f17e584b75fe58b7f5c0`
BLAKE2b-256	`cc2b49f8b377ad023134e9105b81b9fef6d2fdffbc0386a8c6225befe64ef98d`

Hashes for meerkat_ml-0.1.0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`4a19453a6c2c320189b97ab0faec5e766bf6bfe7e779b7ed2b032b4fef9523e3`
MD5	`722c7512bb5a8fcf619f7f1df2473847`
BLAKE2b-256	`6b1ea1410ff2cc22332d22a1fc4e7288f7d5ef985b302136c5269f5a1b3fc481`