missingpy

Missing Data Imputation for Python

Project description

missingpy

missingpy is a library for missing data imputation in Python. Its API is consistent with scikit-learn, so users already comfortable with that interface will find themselves in familiar terrain. Currently, the library supports only k-Nearest Neighbors-based imputation, but we plan to add other imputation tools in the future, so please stay tuned!

Installation

pip install missingpy

Example

from missingpy import KNNImputer
# X is assumed to be a 2-D array-like with missing entries encoded as np.nan
imputer = KNNImputer()
X_imputed = imputer.fit_transform(X)

Note: Please check out the imputer's docstring for more information.

k-Nearest Neighbors (kNN) Imputation

The KNNImputer class provides imputation for completing missing values using the k-Nearest Neighbors approach. Each sample's missing values are imputed using values from n_neighbors nearest neighbors found in the training set. Note that if a sample has more than one feature missing, then the sample can potentially have multiple sets of n_neighbors donors depending on the particular feature being imputed.
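The point about a sample having a different donor set for each missing feature can be illustrated with a small pure-Python sketch. This is not missingpy's implementation: the helper names `nan_euclidean` and `donors_for` are hypothetical, and the distance used here (Euclidean over the coordinates observed in both rows, rescaled by the share of coordinates that were usable) is an assumption made for illustration.

```python
import math

nan = float("nan")


def nan_euclidean(a, b):
    """Euclidean distance over the coordinates observed in both rows.

    Assumption of this sketch: the squared sum is rescaled by
    (total coordinates / shared coordinates) so that rows sharing only a
    few features are not automatically treated as closer.
    """
    shared = [(x, y) for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return math.inf  # nothing to compare on
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def donors_for(X, row, feature, n_neighbors):
    """Indices of the nearest rows that have `feature` observed."""
    candidates = []
    for i, other in enumerate(X):
        if i == row or math.isnan(other[feature]):
            continue  # a donor must actually hold a value for this feature
        distance = nan_euclidean(X[row], other)
        if math.isfinite(distance):
            candidates.append((distance, i))
    # Sort by distance, breaking ties by row index for determinism.
    return [i for _, i in sorted(candidates)[:n_neighbors]]


X = [
    [1.0, nan, nan],  # row 0: two features missing
    [1.5, 4.0, nan],  # row 1: can donate feature 1 only
    [2.0, 6.0, 3.0],  # row 2: can donate both features
    [0.5, nan, 5.0],  # row 3: can donate feature 2 only
    [9.0, 8.0, 7.0],  # row 4: can donate both, but is far away
]

donors_feature_1 = donors_for(X, row=0, feature=1, n_neighbors=2)
donors_feature_2 = donors_for(X, row=0, feature=2, n_neighbors=2)
print(donors_feature_1)  # [1, 2]
print(donors_feature_2)  # [3, 2]
```

Row 0 ends up with rows 1 and 2 as donors for its second feature, but rows 3 and 2 for its third, because a row can only donate a feature it has observed.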

Each missing feature is then imputed as the average, either weighted or unweighted, of these neighbors. Where the number of donor neighbors is less than n_neighbors, the training set average for that feature is used for imputation. The number of neighbors available for imputation can never exceed the number of samples in the training set; it depends on both the overall sample size and the number of samples excluded from the nearest neighbor calculation because they have too many missing features (as controlled by row_max_missing). For more information on the methodology, see [1].
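The averaging rule and its fallback can be sketched in a few lines of plain Python. This is an illustration only, not the library's code: `impute_value` is a hypothetical helper, and the `"distance"` option (inverse-distance weighting, following the scikit-learn naming convention) is an assumption about what a weighted average might look like here.

```python
def impute_value(donor_values, donor_distances, n_neighbors, column_mean,
                 weights="uniform"):
    """Impute one missing entry from its donors.

    Falls back to the training-set mean of the feature when fewer than
    n_neighbors donors are available, as the text describes.
    """
    if len(donor_values) < n_neighbors:
        return column_mean
    if weights == "uniform":
        return sum(donor_values) / len(donor_values)
    # Assumed "distance" weighting: closer donors count more (weight 1/d).
    exact = [v for v, d in zip(donor_values, donor_distances) if d == 0]
    if exact:
        # A donor at distance zero would get infinite weight; average those.
        return sum(exact) / len(exact)
    w = [1.0 / d for d in donor_distances]
    return sum(wi * v for wi, v in zip(w, donor_values)) / sum(w)


# Two donors available, two requested: plain mean of the donor values.
unweighted = impute_value([3.0, 5.0], [2.0, 4.0], n_neighbors=2, column_mean=5.0)
# Same donors, inverse-distance weights 1/2 and 1/4.
weighted = impute_value([3.0, 5.0], [2.0, 4.0], n_neighbors=2, column_mean=5.0,
                        weights="distance")
# Only one donor for two requested neighbors: use the column mean instead.
fallback = impute_value([3.0], [2.0], n_neighbors=2, column_mean=5.0)

print(unweighted)          # 4.0
print(round(weighted, 3))  # 3.667
print(fallback)            # 5.0
```

With inverse-distance weights the nearer donor (value 3.0) pulls the estimate below the unweighted mean of 4.0.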

The following snippet demonstrates how to replace missing values, encoded as np.nan, using the mean feature value of the two nearest neighbors of the rows that contain the missing values:

>>> import numpy as np
>>> from missingpy import KNNImputer
>>> nan = np.nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])

References

[1] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, "Missing value estimation methods for DNA microarrays", Bioinformatics, Vol. 17, No. 6, 2001, pp. 520-525.

Release history

0.2.0 (Dec 10, 2018)
0.1.1 (Jul 15, 2018)
0.1.0 (Jul 15, 2018, this version)
0.0.1 (Jul 8, 2018)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

missingpy-0.1.0.tar.gz (15.6 kB), uploaded Jul 15, 2018

Built Distribution

missingpy-0.1.0-py3-none-any.whl (16.7 kB), uploaded Jul 15, 2018, Python 3

Hashes for missingpy-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9efd0ade6631f7b65c678041f68476d7ede2048e1f5f240bf527ed78300bcb7c`
MD5	`ea249d24d919eef68b21fbf83c49bc3a`
BLAKE2b-256	`990d0986d55bef4366fd2779137199b8dab28e2c0b9e4ded9b2afbd61e0f2366`

Hashes for missingpy-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0949c984f12925abbaa2a0be4611ad4f2f8c7d911e660ff010badc175dcc057d`
MD5	`ef417ce0d7eb618bd5e4729b8cd5c687`
BLAKE2b-256	`65e3570b7f183cf1b9209100336ad96a4adcf9f4dc5d6322c56aab09e9b23d57`