vineyard: an in-memory immutable data manager

Vineyard (v6d) is an in-memory immutable data manager that provides out-of-the-box high-level abstraction and zero-copy in-memory sharing for distributed data in big data tasks, such as graph analytics (e.g., GraphScope), numerical computing (e.g., Mars), and machine learning.

Vineyard is a CNCF sandbox project, made successful by its community.

What is vineyard

Vineyard is designed to enable zero-copy data sharing between big data systems. Let’s begin with a typical machine learning task: time series prediction with LSTM. The task is divided into four steps:

  • First, we read the data from the file system as a pandas.DataFrame.

  • Then, we apply some preprocessing to the dataframe, such as eliminating null values.

  • After that, we define the model, and train the model on the processed dataframe in PyTorch.

  • Finally, the performance of the model is evaluated.

On a single machine, although pandas and PyTorch are two different systems targeting different tasks, data can be shared between them efficiently at little extra cost, with everything happening end-to-end in a single Python script.

Figure: Comparing the workflow with and without vineyard.

What if the input data is too big to be processed on a single machine? As illustrated on the left side of the figure, a common practice is to store the data as tables on a distributed file system (e.g., HDFS), and replace pandas with ETL processes using SQL over a big data system such as Hive or Spark. To share the data with PyTorch, the intermediate results are typically saved back as tables on HDFS. This brings several headaches to developers:

  1. For the same task, users are forced to program for multiple systems (SQL & Python).

  2. Data could be polymorphic. Non-relational data, such as tensors, dataframes, and graphs/networks (in GraphScope), are becoming increasingly prevalent. Tables and SQL may not be the best way to store, exchange, or process them. Transforming the data to and from “tables” between different systems could be a huge overhead.

  3. Saving/loading the data to/from external storage requires lots of memory copies and I/O costs.

Vineyard is designed to solve these issues by providing:

  1. In-memory distributed data sharing in a zero-copy fashion, which avoids extra I/O costs by exploiting a shared memory manager derived from Plasma.

  2. Built-in, out-of-the-box high-level abstractions for sharing distributed data with complex structures (e.g., distributed graphs) at nearly zero extra development cost, while eliminating transformation costs.
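To make the zero-copy idea concrete, here is a minimal sketch that uses only Python's standard multiprocessing.shared_memory module. It is not vineyard's API or implementation; the segment name merely plays the role of an object ID, and both handles live in one process for brevity, whereas a real pipeline would attach from another process.

```python
from multiprocessing import shared_memory

payload = b"preprocessed rows"

# Producer side: allocate a named segment (the name acts like an object ID).
producer = shared_memory.SharedMemory(create=True, size=len(payload))

# Consumer side: map the same segment by name. Attaching copies no bytes.
consumer = shared_memory.SharedMemory(name=producer.name)

# The producer fills the segment after the consumer has attached ...
producer.buf[:len(payload)] = payload

# ... and the consumer still observes the bytes, so both handles are backed
# by the same memory rather than by private copies.
view = consumer.buf[:len(payload)]
received = bytes(view)
view.release()  # views must be released before the segment is closed

consumer.close()
producer.close()
producer.unlink()  # free the segment once nobody needs it
```

The only copy in this sketch is the final bytes(view) call, which is there to keep the result after the segment is closed; a consumer that works on the view directly never copies the payload.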

The right side of the above figure illustrates how to integrate vineyard to solve the task in the big data context.

First, we use Mars (a tensor-based unified framework for large-scale data computation that scales NumPy, pandas, and scikit-learn) to preprocess the raw data just as the single-machine solution does, and save the preprocessed dataframe into vineyard.

Single machine:

import pandas as pd
data_csv = pd.read_csv('./data.csv', usecols=[1])

Distributed:

import mars.dataframe as md
dataset = md.read_csv('hdfs://server/data_full', usecols=[1])
# after preprocessing, save the dataset to vineyard
vineyard_distributed_tensor_id = dataset.to_vineyard()

Then, we modify the training phase to get the preprocessed data from vineyard. Here, vineyard makes sharing distributed data between Mars and PyTorch feel just like passing a local variable in the single-machine solution.

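Single machine:

data_X, data_Y = create_dataset(dataset)

Distributed:

import vineyard
# connect to the local vineyard instance and fetch the partition local to this worker
client = vineyard.connect(vineyard_ipc_socket)
dataset = client.get(vineyard_distributed_tensor_id).local_partition()
data_X, data_Y = create_dataset(dataset)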
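The idea behind fetching only the local partition can be sketched in plain Python. Everything below is hypothetical and for illustration only: DistributedTensor, partition_on, the node names, and this toy create_dataset are assumptions, not vineyard's classes or the original helper.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DistributedTensor:
    """Hypothetical stand-in for a distributed object: one chunk per node."""
    chunks: Dict[str, Tuple[float, ...]]

    def partition_on(self, node: str) -> Tuple[float, ...]:
        # Return only the chunk that already lives on `node`;
        # nothing is moved between nodes.
        return self.chunks[node]


def create_dataset(series, look_back=2):
    """Toy windowing helper (an assumption; the original does not show it)."""
    count = len(series) - look_back
    xs = [series[i:i + look_back] for i in range(count)]
    ys = [series[i + look_back] for i in range(count)]
    return xs, ys


tensor = DistributedTensor({
    "node-a": (1.0, 2.0, 3.0, 4.0),
    "node-b": (5.0, 6.0, 7.0),
})

# Each worker trains on the chunk stored on its own node.
data_X, data_Y = create_dataset(tensor.partition_on("node-a"))
```

Each worker would run the same two lines with its own node name, so the training code stays identical to the single-machine version apart from how the dataset is obtained.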

Finally, we run the training phase in a distributed fashion across the cluster.

From the example, we see that with vineyard, the task in the big data context can be handled with only minor modifications to the single-machine solution. Compared with existing approaches, the I/O and transformation overheads are also eliminated.

Features

In-memory immutable data sharing

Vineyard is an in-memory immutable data manager that shares immutable data across different systems via shared memory without extra overhead. It eliminates the overhead of serialization/deserialization and I/O when exchanging immutable data between systems.
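Immutability is what makes it safe for several readers to share one buffer. The following sketch illustrates the principle with Python's built-in memoryview; it is an analogy under that assumption, not how vineyard seals objects.

```python
# A writable buffer that is "sealed" by exposing only a read-only view.
buffer = bytearray(b"sealed chunk")
sealed = memoryview(buffer).toreadonly()

# Slicing a memoryview shares the underlying memory; no bytes are copied.
reader_a = sealed[:6]
reader_b = sealed[7:]

# Any attempt to write through a read-only view is rejected.
try:
    reader_a[0] = 0
    write_rejected = False
except TypeError:
    write_rejected = True
```

Because no reader can modify the data, no locking or defensive copying is needed between them.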

Out-of-the-box high-level data abstraction

Computation frameworks usually have their own data abstractions for high-level concepts. For example, a tensor could be a torch.Tensor, tf.Tensor, mxnet.ndarray, etc., not to mention that every graph processing engine has its own graph structure representation.

This variety of data abstractions makes sharing hard. Vineyard provides out-of-the-box high-level data abstractions over in-memory blobs by describing objects using hierarchical metadata. Various computation systems can use the built-in high-level data abstractions to exchange data with other systems in a computation pipeline in a concise manner.
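One way to picture "hierarchical metadata over blobs" is a small tree whose leaves are raw byte blobs and whose inner nodes carry a type name and attributes. The class names, field names, and type names below are hypothetical and chosen for illustration; they are not vineyard's actual schema.

```python
import struct
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Blob:
    """A raw, untyped chunk of memory."""
    data: bytes


@dataclass(frozen=True)
class ObjectMeta:
    """A metadata node: a type name, scalar attributes, and named members."""
    typename: str
    attrs: Dict[str, Union[int, str]] = field(default_factory=dict)
    members: Dict[str, Union["ObjectMeta", Blob]] = field(default_factory=dict)


def total_bytes(node: Union[ObjectMeta, Blob]) -> int:
    """Walk the metadata tree and sum the sizes of all blobs beneath it."""
    if isinstance(node, Blob):
        return len(node.data)
    return sum(total_bytes(child) for child in node.members.values())


# Two columns, each described as a tensor over one blob.
price = ObjectMeta("tensor", {"dtype": "float64", "length": 3},
                   {"buffer": Blob(struct.pack("<3d", 1.0, 2.0, 3.0))})
volume = ObjectMeta("tensor", {"dtype": "int64", "length": 3},
                    {"buffer": Blob(struct.pack("<3q", 10, 20, 30))})

# A dataframe is just another metadata node whose members are the columns.
frame = ObjectMeta("dataframe", {"columns": 2},
                   {"price": price, "volume": volume})

frame_size = total_bytes(frame)
```

A consumer in another framework only needs to understand the metadata tree to rebuild its own native structure on top of the same blobs.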

Stream pipelining

A computation doesn’t need to wait for all of its predecessor’s results to arrive before starting to work. Vineyard provides streams as a special kind of immutable data for such pipelining scenarios. The preceding job can write immutable data to vineyard chunk by chunk while maintaining the data structure semantics, and the successor job reads shared-memory chunks from vineyard’s stream without extra copy cost, then triggers its own work. This overlap helps reduce the overall processing time and memory consumption.
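The pipelining pattern can be sketched with a bounded queue and two threads from the standard library. This is an illustration of chunk-by-chunk overlap only, not vineyard's stream API; the sentinel object and function names are assumptions.

```python
import queue
import threading

END_OF_STREAM = object()  # sentinel marking that no more chunks will arrive


def producer(stream, chunks):
    """Write immutable chunks one by one, then signal the end of the stream."""
    for chunk in chunks:
        stream.put(chunk)
    stream.put(END_OF_STREAM)


def consumer(stream, results):
    """Start working on each chunk as soon as it arrives."""
    while True:
        chunk = stream.get()
        if chunk is END_OF_STREAM:
            break
        results.append(sum(chunk))


# A small bound keeps at most two chunks in flight, limiting memory use.
stream = queue.Queue(maxsize=2)
results = []
chunks = [(1, 2, 3), (4, 5), (6,)]

worker = threading.Thread(target=consumer, args=(stream, results))
worker.start()
producer(stream, chunks)
worker.join()
```

The consumer processes the first chunk while the producer is still emitting later ones, which is the overlap the paragraph above describes.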

Drivers

Many big data analytical tasks have lots of boilerplate routines that are unrelated to the computation itself, e.g., various I/O adaptors, data partition strategies, and migration jobs. As the data structure abstraction usually differs between systems, such routines cannot be easily reused.

Vineyard provides such common manipulation routines on immutable data as drivers. Besides sharing the high-level data abstractions, vineyard extends the capability of data structures through drivers, enabling out-of-the-box reusable routines for the boilerplate parts of computation jobs.

Integration with Kubernetes

Vineyard helps share immutable data between different workloads, which makes it a natural fit for cloud-native computing. By embracing cloud-native big data processing, vineyard can provide efficient distributed data sharing in cloud-native environments, and Kubernetes lets vineyard leverage its scale-in/out and scheduling abilities.

Deployment

To better leverage the scale-in/out capability of Kubernetes for the worker pods of a data analytics job, vineyard can be deployed as a DaemonSet in a Kubernetes cluster. Vineyard pods share memory with worker pods using a UNIX domain socket with fine-grained access control.

The UNIX domain socket can be mounted either on a hostPath or via a PersistentVolumeClaim. When users bundle vineyard and the workload into the same pod, the UNIX domain socket can also be shared using an emptyDir.

Deployment with Helm

Vineyard also has tight integration with Kubernetes and Helm. Vineyard can be deployed with Helm:

helm repo add vineyard https://vineyard.oss-ap-southeast-1.aliyuncs.com/charts/
helm install vineyard vineyard/vineyard

In the future, vineyard will improve its integration with Kubernetes by abstracting vineyard objects as Kubernetes resources (i.e., CRDs) and leveraging a vineyard operator to operate the vineyard cluster.

Install vineyard

Vineyard is distributed as a Python package and can be easily installed with pip:

pip3 install vineyard

The latest version of online documentation can be found at https://v6d.io.

If you want to build vineyard from source, please refer to Installation.

FAQ

Vineyard shares many similarities with other open-source projects, but still differs from them in many ways. We are frequently asked the following questions about vineyard:

  • Q: Can clients look at the data while the stream is being filled?

    Sharing one piece of data among multiple clients is one of the target scenarios: data living in vineyard is immutable, so multiple clients can safely consume the same piece of data through memory sharing, without the extra cost and memory usage of copying data back and forth.

  • Q: How does vineyard avoid serialization/deserialization between systems in different languages?

    Vineyard provides higher-level data abstractions (e.g., ndarrays, dataframes) that could be shared in a natural way between different processes.

  • … …

For more detailed information, please refer to our FAQ page.

Getting involved

Thank you in advance for your contributions to vineyard!

Acknowledgements

We thank the following excellent open-source projects:

  • apache-arrow, a cross-language development platform for in-memory analytics;

  • boost-leaf, a C++ lightweight error augmentation framework;

  • ctti, a C++ compile-time type information library;

  • dlmalloc, Doug Lea’s memory allocator;

  • etcd-cpp-apiv3, a C++ API for etcd’s v3 client API;

  • flat_hash_map, an efficient hashmap implementation;

  • jemalloc, a general-purpose malloc(3) implementation;

  • nlohmann/json, a JSON library for modern C++;

  • pybind11, a library for seamless operability between C++11 and Python;

  • s3fs, a library that provides a convenient Python filesystem interface for S3;

  • tbb, a C++ library for threading building blocks.

License

Vineyard is distributed under the Apache License 2.0. Please note that third-party libraries may not have the same license as vineyard.
