Data processing module implemented with numpy

carefree-data

carefree-data implements a data processing module with numpy.

Update 2021.02.04

carefree-data now uses datatable as its backend, which significantly improves performance on file inputs!

Why carefree-data?

carefree-data is a data processing module capable of handling 'dirty' and 'messy' datasets.

For tabular datasets, carefree-data is able to:
  • Elegantly deal with data pre-processing.
    • A Recognizer to recognize whether a column is STRING, NUMERICAL or CATEGORICAL.
    • A Converter to convert a column into a friendly format (["one", "two"] -> [0, 1]).
    • A Processor to further process columns (OneHot, Normalize, MinMax, ...).
    • All of these transforms can be inverted! (See tests\unittests\test_tabular.py -> test_recover_labels & test_recover_features).
    • All of these procedures are completed AUTOMATICALLY!
  • Handle datasets saved in files (.txt, .csv).
    • For .txt, " " will be the default delimiter.
    • For .csv, "," will be the default delimiter, and the first row will be skipped by default.
    • The delimiter, the label index, and whether to skip the first row can all be set manually.
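
The recognize / convert / process steps listed above, and the fact that each transform can be inverted, can be illustrated with a minimal numpy sketch. This is not carefree-data's implementation or API: the helper names (recognize, encode, decode, min_max, min_max_inverse) and the type-guessing rule are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical helpers for illustration only; NOT carefree-data's API.


def recognize(column):
    """Toy rule (an assumption): NUMERICAL if every value parses as a float,
    CATEGORICAL if some values repeat, STRING otherwise."""
    try:
        np.asarray(column, dtype=np.float64)
        return "NUMERICAL"
    except ValueError:
        pass
    return "CATEGORICAL" if len(set(column)) < len(column) else "STRING"


def encode(column):
    """Map each distinct value to an integer code; keep the categories
    so the mapping can be undone later."""
    categories, codes = np.unique(np.asarray(column), return_inverse=True)
    return codes, categories


def decode(codes, categories):
    """Inverse of encode: look each code up in the stored categories."""
    return categories[codes]


def min_max(values):
    """Scale values into [0, 1]; return the statistics needed to invert."""
    values = np.asarray(values, dtype=np.float64)
    lo, hi = values.min(), values.max()
    scale = hi - lo if hi > lo else 1.0  # avoid dividing by zero
    return (values - lo) / scale, (lo, scale)


def min_max_inverse(scaled, stats):
    """Inverse of min_max, using the stored (lo, scale) statistics."""
    lo, scale = stats
    return scaled * scale + lo


codes, categories = encode(["one", "two", "one"])
recovered = decode(codes, categories)
scaled, stats = min_max([2.0, 4.0, 6.0])
restored = min_max_inverse(scaled, stats)
```

The key design point is that every forward transform returns whatever state (categories, min/scale) its inverse needs, which is what makes recovering the original labels and features possible.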

Pandas-free

There is one more thing we'd like to mention: carefree-data is 'Pandas-free'. Pandas is an open source library providing easy-to-use data structures for structured datasets. Although it is widely used across popular Machine Learning and Deep Learning libraries, we ultimately decided to move away from it, for the reasons listed below:

  • carefree-data wants full control over the data, and Pandas is not flexible enough.
  • carefree-data needs higher performance. Pandas is fast, but not as fast as pure numpy (and sometimes cython) code on some critical code paths.
  • Pandas provides many powerful functions, but carefree-data doesn't need most of them, which makes Pandas a little 'heavy' for carefree-data.

In short, Pandas is a more general library, which is why we've written our own code to cover our needs instead of using it directly.

Currently carefree-data only supports tabular datasets.

Installation

carefree-data requires Python 3.8 or higher.

pip install carefree-data

or

git clone https://github.com/carefree0910/carefree-data.git
cd carefree-data
pip install -e .

Basic Usage

Get scikit-learn datasets

from cfdata.tabular import TabularDataset

iris = TabularDataset.iris()

Read from array / dataset

from cfdata.tabular import *

iris = TabularDataset.iris()
x, y = iris.xy
assert TabularData().read(x, y) == TabularData.from_dataset(iris)

Read from file

from cfdata.tabular import TabularData

file = "/path/to/your/file"
data = TabularData().read(file)
assert data.processed == data.transform(file)
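
The file-reading defaults described earlier (a space delimiter for .txt, a comma delimiter plus a skipped first row for .csv, all overridable) can be sketched with the standard library alone. The function names default_options and read_rows are hypothetical, and this is not how carefree-data reads files internally (it uses datatable as its backend). Not skipping the first row of a .txt file is also an assumption, since only the .csv behavior is stated.

```python
import csv
import io
import os

# Hypothetical helpers for illustration only; NOT carefree-data's API.


def default_options(path):
    """Return (delimiter, skip_first) based on the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        return ",", True
    # Assumption: .txt (and anything else) uses " " and keeps the first row.
    return " ", False


def read_rows(stream, path, delimiter=None, skip_first=None):
    """Parse rows from a text stream, falling back to extension defaults
    for any option the caller did not set explicitly."""
    default_delimiter, default_skip = default_options(path)
    delimiter = default_delimiter if delimiter is None else delimiter
    skip_first = default_skip if skip_first is None else skip_first
    rows = list(csv.reader(stream, delimiter=delimiter))
    return rows[1:] if skip_first else rows


# In-memory streams stand in for real files here.
csv_rows = read_rows(io.StringIO("a,b\n1,2\n"), "data.csv")
txt_rows = read_rows(io.StringIO("1 2\n3 4\n"), "data.txt")
custom_rows = read_rows(
    io.StringIO("1;2\n"), "data.csv", delimiter=";", skip_first=False
)
```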

License

carefree-data is MIT licensed, as found in the LICENSE file.
