vineyard: an in-memory immutable data manager

Vineyard (v6d) is an innovative in-memory immutable data manager that offers out-of-the-box high-level abstractions and zero-copy in-memory sharing for distributed data in various big data tasks, such as graph analytics (e.g., GraphScope), numerical computing (e.g., Mars), and machine learning.

Vineyard is a CNCF sandbox project, and it owes its success to its community.

What is vineyard

Vineyard is specifically designed to facilitate zero-copy data sharing among big data systems. To illustrate this, let’s consider a typical machine learning task of time series prediction with LSTM. This task can be broken down into several steps:

  • First, we read the data from the file system as a pandas.DataFrame.

  • Next, we apply various preprocessing tasks, such as eliminating null values, to the dataframe.

  • Once the data is preprocessed, we define the model and train it on the processed dataframe using PyTorch.

  • Finally, we evaluate the performance of the model.

In a single-machine environment, pandas and PyTorch, despite being two distinct systems designed for different tasks, can efficiently share data with minimal overhead. This is achieved through an end-to-end process within a single Python script.

Comparing the workflow with and without vineyard

What if the input data is too large to be processed on a single machine?

As depicted on the left side of the figure, a common approach is to store the data as tables in a distributed file system (e.g., HDFS) and replace pandas with ETL processes using SQL over a big data system such as Hive and Spark. To share the data with PyTorch, the intermediate results are typically saved back as tables on HDFS. However, this can introduce challenges for developers.

  1. For the same task, users must program for multiple systems (SQL & Python).

  2. Data can be polymorphic. Non-relational data, such as tensors, dataframes, and graphs/networks (in GraphScope) are becoming increasingly common. Tables and SQL may not be the most efficient way to store, exchange, or process them. Transforming the data from/to “tables” between different systems can result in significant overhead.

  3. Saving/loading the data to/from external storage incurs substantial memory copies and I/O costs.

Vineyard addresses these issues by providing:

  1. In-memory distributed data sharing in a zero-copy fashion, which avoids additional I/O costs by leveraging a shared memory manager derived from Plasma.

  2. Built-in out-of-the-box high-level abstractions to share distributed data with complex structures (e.g., distributed graphs) with minimal extra development cost, while eliminating transformation costs.
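
The zero-copy idea in the first item can be illustrated with Python's standard library alone. The sketch below is not vineyard's implementation or API: `publish` and `checksum_shared` are hypothetical helpers built on `multiprocessing.shared_memory`. They show a reader attaching to a named segment and working on the bytes in place, instead of loading its own copy from external storage.

```python
from multiprocessing import shared_memory


def publish(payload: bytes) -> shared_memory.SharedMemory:
    """Write `payload` once into a new named shared-memory segment."""
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    return shm


def checksum_shared(name: str, size: int) -> int:
    """Attach to an existing segment by name and sum its bytes in place."""
    reader = shared_memory.SharedMemory(name=name)
    view = reader.buf[:size]  # a memoryview slice: no bytes are copied
    try:
        return sum(view)
    finally:
        view.release()  # views must be released before closing the segment
        reader.close()


if __name__ == "__main__":
    segment = publish(bytes([1, 2, 3, 4]))
    try:
        # In a real deployment the reader would be a different process
        # that only knows the segment name and size.
        print(checksum_shared(segment.name, 4))
    finally:
        segment.close()
        segment.unlink()
```

The segment name plays the role that an object ID plays in the examples that follow: it is the only thing the consumer needs in order to reach the data.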

As depicted on the right side of the above figure, we demonstrate how to integrate vineyard to address the task in a big data context.

First, we utilize Mars (a tensor-based unified framework for large-scale data computation that scales Numpy, Pandas, and Scikit-learn) to preprocess the raw data, similar to the single-machine solution, and store the preprocessed dataframe in vineyard.

single

import pandas as pd
data_csv = pd.read_csv('./data.csv', usecols=[1])

distributed

import mars.dataframe as md
dataset = md.read_csv('hdfs://server/data_full', usecols=[1])
# after preprocessing, save the dataset to vineyard
vineyard_distributed_tensor_id = dataset.to_vineyard()

Then, we modify the training phase to get the preprocessed data from vineyard. With vineyard, sharing distributed data between Mars and PyTorch works much like passing a local variable in the single-machine solution.

single

data_X, data_Y = create_dataset(dataset)

distributed

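import vineyard
# connect to the vineyard instance through its IPC socket
client = vineyard.connect(vineyard_ipc_socket)
# get the shared object by ID and take its local partition
dataset = client.get(vineyard_distributed_tensor_id).local_partition()
data_X, data_Y = create_dataset(dataset)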

Finally, we execute the training phase in a distributed manner across the cluster.

From this example, it is evident that with vineyard, the task in the big data context can be addressed with only minor adjustments to the single-machine solution. Compared to existing approaches, vineyard effectively eliminates I/O and transformation overheads.
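
The snippets above call `create_dataset` without defining it. For time series prediction, such a helper commonly turns a series into sliding-window samples. The version below is a hypothetical sketch: its behavior and the `look_back` parameter are assumptions, not taken from vineyard. It is written against plain Python sequences so that it runs without pandas or PyTorch.

```python
from typing import List, Sequence, Tuple


def create_dataset(
    series: Sequence[float], look_back: int = 2
) -> Tuple[List[List[float]], List[float]]:
    """Build (window, next value) training pairs from a time series."""
    if look_back < 1:
        raise ValueError("look_back must be at least 1")
    data_x: List[List[float]] = []
    data_y: List[float] = []
    for start in range(len(series) - look_back):
        # The model sees `look_back` consecutive values ...
        data_x.append(list(series[start:start + look_back]))
        # ... and is trained to predict the value that follows them.
        data_y.append(series[start + look_back])
    return data_x, data_y


if __name__ == "__main__":
    windows, targets = create_dataset([1.0, 2.0, 3.0, 4.0, 5.0], look_back=2)
    print(windows, targets)
```

Because the helper only needs an indexable sequence, the same call works whether the data comes from a local file or from a partition fetched from shared memory.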

Features

Efficient In-Memory Immutable Data Sharing

Vineyard serves as an in-memory immutable data manager, enabling efficient data sharing across different systems via shared memory without additional overheads. By eliminating serialization/deserialization and IO costs during data exchange between systems, Vineyard significantly improves performance.

Out-of-the-Box High-Level Data Abstractions

Computation frameworks often have their own data abstractions for high-level concepts. For example, tensors can be represented as torch.Tensor, tf.Tensor, mxnet.ndarray.NDArray, etc. Moreover, every graph processing engine has its unique graph structure representation.

The diversity of data abstractions complicates data sharing. Vineyard addresses this issue by providing out-of-the-box high-level data abstractions over in-memory blobs, using hierarchical metadata to describe objects. Various computation systems can leverage these built-in high-level data abstractions to exchange data with other systems in a computation pipeline concisely and efficiently.
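
To make "hierarchical metadata over in-memory blobs" concrete, here is a minimal sketch using only the standard library. It does not reproduce vineyard's metadata schema: the `BLOBS` dictionary stands in for shared memory, and the field names (`typename`, `buffer`, `columns`) and helper functions are hypothetical. A table is described by a metadata tree whose leaves point at raw byte buffers, and resolving the tree yields typed views over those buffers without copying them.

```python
from array import array
from typing import Dict, List

# Stand-in for shared memory: blob id -> immutable payload.
BLOBS: Dict[str, bytes] = {}


def put_blob(payload: bytes) -> str:
    """Store a raw buffer and return its id."""
    blob_id = f"blob-{len(BLOBS)}"
    BLOBS[blob_id] = payload
    return blob_id


def put_column(values: List[int]) -> dict:
    """Describe a column of 64-bit integers by metadata plus one blob."""
    payload = array("q", values).tobytes()
    return {
        "typename": "column",
        "dtype": "q",
        "length": len(values),
        "buffer": put_blob(payload),
    }


def put_table(columns: Dict[str, List[int]]) -> dict:
    """Describe a table as a metadata node whose children are columns."""
    return {
        "typename": "table",
        "columns": {name: put_column(vals) for name, vals in columns.items()},
    }


def resolve(meta: dict):
    """Walk the metadata tree and rebuild typed views over the blobs."""
    if meta["typename"] == "column":
        # cast() reinterprets the bytes in place; nothing is copied.
        return memoryview(BLOBS[meta["buffer"]]).cast(meta["dtype"])
    if meta["typename"] == "table":
        return {name: resolve(child) for name, child in meta["columns"].items()}
    raise TypeError(f"unknown typename: {meta['typename']}")


if __name__ == "__main__":
    table_meta = put_table({"x": [1, 2, 3], "y": [10, 20, 30]})
    table = resolve(table_meta)
    print(list(table["x"]), list(table["y"]))
```

Any system that understands the metadata layout can rebuild its own native object from the same buffers, which is what removes the per-system conversion step.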

Stream Pipelining for Enhanced Performance

A computation doesn’t need to wait for all preceding results to arrive before starting its work. Vineyard provides a stream as a special kind of immutable data for pipelining scenarios. The preceding job can write immutable data chunk by chunk to Vineyard while maintaining data structure semantics. The successor job reads shared-memory chunks from Vineyard’s stream without extra copy costs and triggers its work. This overlapping reduces the overall processing time and memory consumption.
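
The overlap described above can be sketched with a bounded queue and two threads. This is an illustration of the pipelining pattern only, not vineyard's stream API; `producer`, `consumer`, `run_pipeline`, and the `END_OF_STREAM` sentinel are hypothetical names.

```python
import queue
import threading
from typing import Iterable, List

END_OF_STREAM = None  # sentinel that marks the end of the stream


def producer(chunks: Iterable[bytes], stream: "queue.Queue") -> None:
    """Write immutable chunks one by one, then signal completion."""
    for chunk in chunks:
        stream.put(chunk)  # blocks when the consumer falls behind
    stream.put(END_OF_STREAM)


def consumer(stream: "queue.Queue", results: List[int]) -> None:
    """Process each chunk as soon as it arrives."""
    while True:
        chunk = stream.get()
        if chunk is END_OF_STREAM:
            break
        results.append(sum(chunk))


def run_pipeline(chunks: List[bytes]) -> List[int]:
    # A small bound keeps at most two chunks in flight, which is why
    # pipelining can lower peak memory as well as latency.
    stream: "queue.Queue" = queue.Queue(maxsize=2)
    results: List[int] = []
    worker = threading.Thread(target=consumer, args=(stream, results))
    worker.start()
    producer(chunks, stream)
    worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([bytes([1, 2]), bytes([3]), bytes([4, 5, 6])]))
```

The consumer starts working after the first chunk is written rather than after the last one, and the queue hands over references to the chunk objects instead of copying them.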

Versatile Drivers for Common Tasks

Many big data analytical tasks involve numerous boilerplate routines that are unrelated to the computation itself, such as various IO adapters, data partition strategies, and migration jobs. Since data structure abstractions usually differ between systems, these routines cannot be easily reused.

Vineyard provides common manipulation routines for immutable data as drivers. In addition to sharing high-level data abstractions, Vineyard extends the capability of data structures with drivers, enabling out-of-the-box reusable routines for the boilerplate parts in computation jobs.
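
One way to picture drivers is as a registry of reusable routines keyed by data type. The sketch below makes that assumption explicit and does not reproduce vineyard's driver interface: the `driver` decorator, the `DRIVERS` registry, and `call_driver` are all hypothetical.

```python
from typing import Callable, Dict, Tuple

# Registry: (type name, routine name) -> implementation.
DRIVERS: Dict[Tuple[str, str], Callable] = {}


def driver(typename: str, name: str) -> Callable:
    """Register a routine that any job can reuse for `typename` objects."""
    def decorator(func: Callable) -> Callable:
        DRIVERS[(typename, name)] = func
        return func
    return decorator


@driver("table", "to_csv")
def table_to_csv(table: Dict[str, list]) -> str:
    """An I/O adapter written once and shared by every consumer of tables."""
    names = list(table)
    lines = [",".join(names)]
    for row in zip(*(table[name] for name in names)):
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines)


def call_driver(typename: str, name: str, obj):
    """Look up a routine by the object's type name and run it."""
    return DRIVERS[(typename, name)](obj)


if __name__ == "__main__":
    print(call_driver("table", "to_csv", {"a": [1, 2], "b": [3, 4]}))
```

Because the routine is attached to the shared data type rather than to one computation engine, each system that consumes the type gets the adapter without reimplementing it.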

Try Vineyard

Vineyard is available as a Python package and can be installed using pip:

pip3 install vineyard

For comprehensive and up-to-date documentation, please visit https://v6d.io.

If you wish to build vineyard from source, please consult the Installation guide. For instructions on building and running unit tests locally, refer to the Contributing section.

After installation, you can initiate a vineyard instance using the following command:

python3 -m vineyard

For further details on connecting to a locally deployed vineyard instance, please explore the Getting Started guide.

Deploying on Kubernetes

Vineyard is designed to efficiently share immutable data between different workloads, making it a natural fit for cloud-native computing. On Kubernetes, Vineyard enables efficient distributed data sharing while leveraging the cluster's scaling and scheduling capabilities.

To effectively manage all components of Vineyard within a Kubernetes cluster, we have developed the Vineyard Operator. For more information, please refer to the Vineyard Operator documentation.

FAQ

Vineyard shares many similarities with other open-source projects, yet it also has distinct features. We often receive the following questions about Vineyard:

  • Q: Can clients access the data while the stream is being filled?

    Sharing one piece of data among multiple clients is a target scenario for Vineyard, as the data stored in Vineyard is immutable. Multiple clients can safely consume the same piece of data through memory sharing, without incurring extra costs or additional memory usage from copying data back and forth.

  • Q: How does Vineyard avoid serialization/deserialization between systems in different languages?

    Vineyard provides high-level data abstractions (e.g., ndarrays, dataframes) that can be naturally shared between different processes, eliminating the need for serialization and deserialization between systems in different languages.

  • … …

For more detailed information, please refer to our FAQ page.

Get Involved

  • Join the CNCF Slack and participate in the #vineyard channel for discussions and collaboration.

  • Familiarize yourself with our contribution guide to understand the process of contributing to vineyard.

  • If you encounter any bugs or issues, please report them by submitting a GitHub issue or by starting a conversation in GitHub Discussions.

  • We welcome and appreciate your contributions! Submit them using pull requests.

Thank you in advance for your valuable contributions to vineyard!

Publications

If you use this software, please cite our paper using the following metadata:

@article{yu2023vineyard,
   author = {Yu, Wenyuan and He, Tao and Wang, Lei and Meng, Ke and Cao, Ye and Zhu, Diwen and Li, Sanhong and Zhou, Jingren},
   title = {Vineyard: Optimizing Data Sharing in Data-Intensive Analytics},
   year = {2023},
   issue_date = {June 2023},
   publisher = {Association for Computing Machinery},
   address = {New York, NY, USA},
   volume = {1},
   number = {2},
   url = {https://doi.org/10.1145/3589780},
   doi = {10.1145/3589780},
   journal = {Proc. ACM Manag. Data},
   month = {jun},
   articleno = {200},
   numpages = {27},
   keywords = {data sharing, in-memory object store}
}

Acknowledgements

We thank the following excellent open-source projects:

  • apache-arrow, a cross-language development platform for in-memory analytics.

  • boost-leaf, a C++ lightweight error augmentation framework.

  • cityhash, a family of hash functions for strings.

  • dlmalloc, Doug Lea’s memory allocator.

  • etcd-cpp-apiv3, a C++ API for etcd’s v3 client API.

  • flat_hash_map, an efficient hashmap implementation.

  • gulrak/filesystem, an implementation of C++17 std::filesystem.

  • libcuckoo, a high-performance, concurrent hash table.

  • mimalloc, a general purpose allocator with excellent performance characteristics.

  • nlohmann/json, a JSON library for modern C++.

  • pybind11, a library for seamless operability between C++11 and Python.

  • s3fs, a library that provides a convenient Python filesystem interface for S3.

  • skywalking-infra-e2e, a next-generation end-to-end testing framework.

  • skywalking-swck, a Kubernetes operator for Apache SkyWalking.

  • wyhash, C++ wrapper around wyhash and wyrand.

  • BBHash, a fast, minimal-memory perfect hash function.

License

Vineyard is distributed under the Apache License 2.0. Please note that third-party libraries may not have the same license as vineyard.
