
dcraft

Data management library based on the data lake concept, especially for data science and machine learning.
It helps with your daily data management through the raw, trusted, and refined layer concept from data lakes. The data is versioned and saved for each layer to the storages and tables you specify.

Concept

For daily individual work and for team work alike, we need to manage and organize our datasets to keep a clean workflow. This library helps with that, based on the data lake's layer concept.
For each layer, you can save the data and metadata to destinations of your choice, such as the local file system, GCP, or MongoDB.
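As an illustration of the layer concept (pandas only; this is not dcraft's API), raw data can flow through the three layers like this:

```python
import pandas as pd

# Raw layer: data exactly as ingested, duplicates and gaps included.
raw = pd.DataFrame({"a": [1, 2, 2], "b": [None, 4.0, 4.0]})

# Trusted layer: cleaned data (duplicates dropped, missing values filled).
trusted = raw.drop_duplicates().fillna({"b": 0.0})

# Refined layer: data reshaped for a specific analysis.
refined = trusted.groupby("a", as_index=False)["b"].sum()
```

Each stage would then be saved with its own metadata, so any layer can be reloaded or rebuilt later.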

Covered Data Types

  • pd.DataFrame
  • Dict
  • List of Dict

Covered Formats

  • csv
  • parquet
  • json

Covered Storages and Tables

You can save the metadata and data in several places. The lists below show the current coverage.

Metadata

  • Local File System
  • BigQuery
  • MongoDB

Data

  • Local File System
  • Google Cloud Storage
  • MinIO

Installation

pip install dcraft

To use GCP resources:

pip install dcraft[gcp]

Example

Create a layer's data. There are also create_trusted and create_refined.

from dcraft import create_raw
import pandas as pd

data = pd.DataFrame({"a": [1, 2], "b": [None, 4]})
raw_layer_data = create_raw(
    data,
    "fake-project",  # project name
    "Shuhei Kishi",  # author
    "This is a fake project",  # description
    {"version": "0.0.1"}  # extra metadata
)

You can choose where the data and metadata should be saved. In this example, both are saved locally.

import os
from dcraft import LocalDataRepository, LocalMetadataRepository

CURRENT_DIR = os.getcwd()
DATA_DIR_PATH = os.path.join(CURRENT_DIR, "data")
METADATA_DIR_PATH = os.path.join(CURRENT_DIR, "metadata")

data_repository = LocalDataRepository(DATA_DIR_PATH)
metadata_repository = LocalMetadataRepository(METADATA_DIR_PATH)
raw_layer_data.save("parquet", data_repository, metadata_repository)

The data was saved to the raw layer, and its information was saved as metadata.
You can read the saved data back with the id from its metadata. The format is preserved.

from dcraft import read_layer_data
loaded_raw_layer_data = read_layer_data(<id-from-metadata>, data_repository, metadata_repository)
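Conceptually, this round trip only needs the metadata record, stored under an id, to remember enough to locate and parse the data again. Here is a minimal stdlib-only sketch of that idea (illustrative only, not dcraft's internals; save_metadata and load_metadata are hypothetical helpers):

```python
import json
import os
import tempfile
import uuid

# A throwaway directory standing in for a metadata store.
STORE = tempfile.mkdtemp()

def save_metadata(meta: dict) -> str:
    """Persist a metadata record under a fresh id and return that id."""
    meta_id = str(uuid.uuid4())
    with open(os.path.join(STORE, f"{meta_id}.json"), "w") as f:
        json.dump({**meta, "id": meta_id}, f)
    return meta_id

def load_metadata(meta_id: str) -> dict:
    """Read the metadata record back by id."""
    with open(os.path.join(STORE, f"{meta_id}.json")) as f:
        return json.load(f)

meta_id = save_metadata({"layer": "raw", "format": "parquet"})
print(load_metadata(meta_id)["format"])  # → parquet
```

Because the record stores the format, a reader can reload the data exactly as it was written, which is why read_layer_data keeps the original format.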

If you want to save the metadata and data in different places, such as BigQuery and Google Cloud Storage, you can use different Repository classes.

from dcraft import BqMetadataRepository, GcsDataRepository

GCP_PROJECT = "your-project-id"
GCS_BUCKET = "your-bucket-name"

data_repository = GcsDataRepository(GCP_PROJECT, GCS_BUCKET)
metadata_repository = BqMetadataRepository(GCP_PROJECT, "test_dataset", "test_table")

raw_layer_data.save("csv", data_repository, metadata_repository)


