Clean your data using a scikit-learn transformer in a single line of code

pandas_dq

Analyze and clean your data in a single line of code with a Scikit-Learn compatible Transformer.

Introduction

pandas_dq is a new Python library for data quality analysis and improvement. It is fast, efficient and scalable.

Alert: If you are using pandas version 2.0 ("the new pandas"), be aware that unexpected errors are appearing in many libraries that use pandas underneath, and pandas_dq is no exception. If you plan to use pandas_dq with pandas 2.0, you may see such errors, and we can't and won't fix them!

What is pandas_dq?

The new pandas_dq library in Python is a great addition to the pandas ecosystem. It provides a set of tools for data quality assessment, which can be used to identify and address potential problems with data sets. This can help to improve the quality of data analysis and ensure that results are reliable.

The pandas_dq library is still under development, but it already includes a number of useful features. These include:

  • Data profiling: pandas_dq displays a report either in-line or in HTML to give you a quick overview of your data, including its features, feature types, their null and unique value percentages, and their maximum and minimum values.
  • Data cleaning: pandas_dq allows you to quickly identify and remove data quality issues and inconsistencies in your data set.
  • Data imputation: pandas_dq allows you to fill missing values with your own choice of values for each feature in your data. For example, you can have one default for the age feature and another for the income feature.
  • Data transformation: pandas_dq allows you to transform skewed features into a more normal-like distribution.

The pandas_dq library is a valuable tool for anyone who works with data. It can help you to improve the quality of your data analysis and ensure that your results are reliable.

Here are some of the benefits of using the pandas_dq library:

  • It can help you to identify and address potential problems with data sets.
  • It can improve the quality of data analysis.
  • It can ensure that your results are reliable.
  • It is easy to use and can be integrated with other pandas tools.

pandas_dq has three main modules:

  • dq_report: This function displays a data quality report either inline or in HTML after it analyzes your dataset for various issues, such as missing values, outliers, duplicates, correlations, etc. It also checks the relationship between the features and the target variable (if provided) to detect data leakage.
  • Fix_DQ: This class is a scikit-learn compatible transformer that can detect and fix data quality issues in one line of code. It can remove ID columns, zero-variance columns, rare categories, infinite values, mixed data types, outliers, high cardinality features, highly correlated features, duplicate rows and columns, skewed distributions and imbalanced classes.
  • DataSchemaChecker: This class can check your dataset's data types against a specific schema and report any mismatches or errors.

pandas_dq is designed to give you the cleanest features in the fewest steps.

    Uses

    pandas_dq has multiple important modules: dq_report, Fix_DQ and now DataSchemaChecker.

    1. dq_report function

    `dq_report` displays a data quality report (inline or HTML) after it analyzes your dataset looking for these issues:

    1. It detects ID columns
    2. It detects zero-variance columns
    3. It identifies rare categories (i.e. categories that account for less than 5% of the rows in a column)
    4. It finds infinite values in a column
    5. It detects mixed data types (i.e. a column that has more than a single data type)
    6. It detects outliers (i.e. values in a float column that fall outside a range based on the Inter Quartile Range)
    7. It detects high cardinality features (i.e. a feature that has more than 100 categories)
    8. It detects highly correlated features (i.e. two features that have an absolute correlation higher than 0.8)
    9. It detects duplicate rows (i.e. the same row occurs more than once in the dataset)
    10. It detects duplicate columns (i.e. the same column occurs twice or more in the dataset)
    11. It detects skewed distributions (i.e. a feature that has a skew more than 1.0)
    12. It detects imbalanced classes (i.e. one class of the target variable occurs significantly more often than the others)
    13. It detects feature leakage (i.e. a feature that is highly correlated to target with correlation > 0.8)

    Note that report generation may take time for large datasets, so dq_report reads a 100K-row sample from your CSV file. If you want the whole dataset analyzed, pass it in as a dataframe.
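
A few of the checks in the list above can be approximated in plain pandas. The sketch below is not how pandas_dq implements them: the helper name `quick_checks` and the sample data are hypothetical, and the thresholds simply mirror the ones quoted in the list (0.8 for correlation, 1.0 for skew, applied here to the absolute skew).

```python
import pandas as pd


def quick_checks(df: pd.DataFrame, corr_limit: float = 0.8, skew_limit: float = 1.0) -> dict:
    """Hypothetical helper that approximates four of the checks listed above."""
    # Zero-variance columns contain a single distinct value.
    zero_variance = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

    # Work on the numeric columns that still carry information.
    numeric = df.select_dtypes(include="number").drop(columns=zero_variance, errors="ignore")

    # Column pairs whose absolute Pearson correlation exceeds the limit.
    corr = numeric.corr().abs()
    cols = list(corr.columns)
    correlated_pairs = [
        (a, b)
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] > corr_limit
    ]

    # Columns whose absolute sample skewness exceeds the limit.
    skewed = [c for c in cols if abs(numeric[c].skew()) > skew_limit]

    return {
        "zero_variance": zero_variance,
        "correlated_pairs": correlated_pairs,
        "skewed": skewed,
        "duplicate_rows": int(df.duplicated().sum()),
    }


df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 5],
    "b": [2, 4, 6, 8, 10, 10],        # exact multiple of "a"
    "const": [7, 7, 7, 7, 7, 7],      # zero variance
    "skewed": [1, 1, 100, 1, 1, 1],   # one extreme value
})
report = quick_checks(df)
print(report)
```

Each entry in the returned dictionary corresponds to one item in the list (items 2, 8, 9 and 11); the remaining checks would need their own logic.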

    2. Fix_DQ class: a scikit-learn transformer that can detect data quality issues and fix them all in one line of code

    `Fix_DQ` is a great way to clean an entire train data set and apply the same steps in an MLOps pipeline to a test dataset. `Fix_DQ` can detect most issues in your data (similar to dq_report, but without the `target`-related issues) in one step. The `fit` method finds the issues and the `transform` method fixes them. The fitted transformer can then be saved (or "pickled") to apply the same steps to test data, either at the same time or later.

    Fix_DQ performs the following data quality cleaning steps:

    1. It removes ID columns from further processing
    2. It removes zero-variance columns from further processing
    3. It identifies rare categories and groups them into a single category called "Rare"
    4. It finds infinite values and replaces them with an upper bound based on Inter Quartile Range
    5. It detects mixed data types and drops those mixed-type columns from further processing
    6. It detects outliers and suggests removing them or using robust statistics
    7. It detects high cardinality features but leaves them as they are
    8. It detects highly correlated features and drops one of them (whichever comes first in the column sequence)
    9. It detects duplicate rows and keeps only one copy of each
    10. It detects duplicate columns and keeps only one copy of each
    11. It detects skewed distributions and applies log or box-cox transformations on them
    12. It detects imbalanced classes and leaves them as they are
    13. It detects feature leakage and drops one of those features if they are highly correlated to target
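
To make steps 3 and 4 concrete, here is a minimal pandas sketch of grouping rare categories and capping infinite values. It is an illustration under stated assumptions, not Fix_DQ's code: `group_rare` and `cap_infinite` are hypothetical names, the 1.5 × IQR whisker is the conventional choice rather than something the text above specifies, and mapping -inf to a lower bound is likewise an assumption.

```python
import numpy as np
import pandas as pd


def group_rare(s: pd.Series, threshold: float = 0.05, label: str = "Rare") -> pd.Series:
    """Replace categories seen in fewer than `threshold` of the rows with `label`."""
    freq = s.value_counts(normalize=True)
    rare = freq[freq < threshold].index
    return s.where(~s.isin(rare), label)


def cap_infinite(s: pd.Series, quantile: float = 0.75) -> pd.Series:
    """Replace +inf / -inf with IQR-based bounds computed from the finite values."""
    finite = s[np.isfinite(s)]
    q_hi, q_lo = finite.quantile(quantile), finite.quantile(1 - quantile)
    spread = q_hi - q_lo
    # Assumption: conventional 1.5 * IQR whiskers on both sides.
    return s.replace({np.inf: q_hi + 1.5 * spread, -np.inf: q_lo - 1.5 * spread})


colors = pd.Series(["red"] * 25 + ["blue"] * 14 + ["teal"])
grouped = group_rare(colors)   # "teal" covers 2.5% of rows, so it becomes "Rare"

values = pd.Series([1.0, 2.0, 3.0, 4.0, np.inf])
capped = cap_infinite(values)  # the inf is replaced by the upper bound
print(grouped.value_counts().to_dict(), capped.tolist())
```

In a real transformer the rare categories and bounds would be learned in `fit` on the training data and reused unchanged in `transform`, so that test data is treated with the same rules.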

    How can we use Fix_DQ in GridSearchCV to find the best model pipeline?

    Placing Fix_DQ inside a model pipeline is another way to find the best data cleaning steps for your train data: the cleaned data then feeds hyperparameter tuning with GridSearchCV or RandomizedSearchCV, along with a LightGBM, XGBoost or scikit-learn model.
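
The general pattern can be sketched with scikit-learn alone. In the example below, `DropCorrelated` is a hypothetical stand-in for a cleaning transformer such as Fix_DQ (it is not part of pandas_dq), and the data is synthetic; the point is only that a cleaning threshold and a model hyperparameter can be searched together in one `Pipeline`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class DropCorrelated(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a cleaning step: drops highly correlated columns."""

    def __init__(self, correlation_threshold=0.8):
        self.correlation_threshold = correlation_threshold

    def fit(self, X, y=None):
        corr = X.corr().abs()
        cols = list(X.columns)
        # For each pair above the threshold, mark the later column for removal.
        to_drop = {
            b
            for i, a in enumerate(cols)
            for b in cols[i + 1:]
            if corr.loc[a, b] > self.correlation_threshold
        }
        self.drop_ = sorted(to_drop, key=cols.index)
        return self

    def transform(self, X):
        return X.drop(columns=self.drop_)


# Synthetic data: x2 is a near copy of x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=200),
    "x3": rng.normal(size=200),
})
y = (X["x1"] + X["x3"] > 0).astype(int)

pipe = Pipeline([("clean", DropCorrelated()), ("model", LogisticRegression())])
param_grid = {
    "clean__correlation_threshold": [0.8, 0.9999],  # cleaning-step setting
    "model__C": [0.1, 1.0],                         # model hyperparameter
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the cleaning step is refit inside every cross-validation fold, whatever it learns comes only from that fold's training portion.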

    3. DataSchemaChecker class: a scikit-learn transformer that checks whether a pandas dataframe conforms to a given schema and coerces the data to conform to it.

    The DataSchemaChecker class has two methods: fit and transform. You need to initialize the class with a schema that you want to compare your data's dtypes against. A schema is a dictionary that maps column names to data types.

    The fit method takes a dataframe as an argument and checks if it matches the schema. The fit method first checks if the number of columns in the dataframe and the schema are equal. If not, it creates an exception. Finally, the fit method displays a table of exceptions it found in your data against the given schema.

    The transform method takes a dataframe as an argument and, based on the given schema and the exceptions found, converts all the exception columns to the given schema. If it is not able to transform a column, it skips the column and displays an error message.
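
The check-then-coerce idea described above can be sketched in a few lines of pandas. This is an assumption-laden illustration, not DataSchemaChecker itself: `check_schema` and `coerce_schema` are hypothetical helpers, and the schema here uses plain pandas dtype strings rather than the type names pandas_dq accepts.

```python
import pandas as pd


def check_schema(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Return a table of columns whose dtype differs from the schema."""
    rows = []
    for col, expected in schema.items():
        actual = str(df[col].dtype) if col in df.columns else "missing"
        if actual != expected:
            rows.append({"column": col, "expected": expected, "actual": actual})
    return pd.DataFrame(rows, columns=["column", "expected", "actual"])


def coerce_schema(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Cast each column to its schema dtype, skipping columns that cannot be cast."""
    out = df.copy()
    for col, expected in schema.items():
        if col not in out.columns:
            continue
        try:
            out[col] = out[col].astype(expected)
        except (ValueError, TypeError) as exc:
            print(f"Could not convert {col!r} to {expected}: {exc}")
    return out


df = pd.DataFrame({
    "age": ["31", "45"],      # numbers stored as text
    "income": [1000, 2500],   # integers where floats are expected
    "score": [0.5, 0.7],      # already matches
    "bad": ["x", "y"],        # cannot be converted to float
})
schema = {"age": "float64", "income": "float64", "score": "float64", "bad": "float64"}

mismatches = check_schema(df, schema)
fixed = coerce_schema(df, schema)
print(mismatches)
```

The column that cannot be converted is left untouched and reported, which mirrors the skip-and-report behaviour described for the transform method.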

    Install

    Prerequisites:

    1. pandas_dq is built using pandas, numpy and scikit-learn - that's all. It should run on almost all Python 3 Anaconda installations without additional installs. You won't have to import any special libraries.

    The best way to install pandas_dq is with pip:

    pip install pandas_dq 
    

    To install from source:

    cd <pandas_dq_Destination>
    git clone git@github.com:AutoViML/pandas_dq.git
    

    or download and unzip https://github.com/AutoViML/pandas_dq/archive/master.zip

    conda create -n <your_env_name> python=3.7 anaconda
    conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>`
    cd pandas_dq
    pip install -r requirements.txt
    

    Usage

    To get a quick profile of your data, you can simply call dq_report:

    from pandas_dq import dq_report
    dqr = dq_report(data, target=target, html=False, csv_engine="pandas", verbose=1)
    

    It displays a data quality report inline or in HTML format (and it saves the HTML to your machine).

    To fix your data quality issues, use Fix_DQ as a scikit-learn compatible transformer

    from pandas_dq import Fix_DQ
    
    # Create an instance of the fix_data_quality transformer with default parameters
    fdq = Fix_DQ()
    
    # Fit the transformer on X_train and transform it
    X_train_transformed = fdq.fit_transform(X_train)
    
    # Transform X_test using the fitted transformer
    X_test_transformed = fdq.transform(X_test)
    
    

    To validate that your data conforms to a given schema, use DataSchemaChecker:

    Define the schema first, then pass it to the checker:

    schema = {'name': 'string',
            'age': 'float32',
            'gender': 'object',
            'income': 'float64',
            'date': 'date',
            'target': 'integer'}
    
    from pandas_dq import DataSchemaChecker
    
    # Fit on the training data, which reports any mismatches, and coerce it to the schema
    ds = DataSchemaChecker(schema=schema)
    ds.fit_transform(X_train)
    # Apply the same schema to the test data using the fitted checker
    ds.transform(X_test)
    

    API

    pandas_dq has a very simple API with one major goal: find data quality issues in your data and fix them.

    Arguments

    dq_report has the following arguments:

    Caution: For very large data sets, we randomly sample 100K rows from your CSV file to speed up reporting. If you want a larger sample, simply read in your file offline into a pandas dataframe and send it in as input, and we will load it as it is. This is one way to go around our speed limitations.

    • data: A path to a file (string) or a pandas DataFrame (df). It reads parquet, csv, feather, arrow and other file formats straight from disk. You just have to give it the path to the file, including the file name.
    • target: default: None. Otherwise, it should be a string name representing the name of a column in df. You can leave it as None if you don't want any target related issues.
    • html: default is False. If you want to display your report in HTML in a browser, set it to True. Otherwise, it defaults to inline in a notebook or prints on the terminal. It also saves the HTML file in the working directory on your machine.
    • csv_engine: default is pandas. If you want to load your CSV file using any other backend engine such as arrow or parquet please specify it here. This option only impacts CSV files.
    • verbose: This has 2 possible states:
      • 0 summary report. displays only the summary level data quality issues in the dataset. Great for managers.
      • 1 detailed report. displays all the gory details behind each DQ issue in your dataset and what to do about them. Great for engineers.

    dq_report returns a dataframe containing all the data quality issues in your data.

    Fix_DQ has the following arguments:

    Caution: X_train and y_train in Fix_DQ must be pandas Dataframes or pandas Series. I have not tested it on numpy arrays. You can try your luck.

    • quantile: float (0.75): Define a threshold for IQR for outlier detection. Could be any float between 0 and 1. If quantile is set to None, then no outlier detection will take place.
    • cat_fill_value: string ("missing") or a dictionary: Define a fill value for missing categories in your object or categorical variables. This is a global default for your entire dataset. You can also give a dictionary where you specify different fill values for different columns.
    • num_fill_value: integer (99) or float value (999.0) or a dictionary: Define a fill value for missing numbers in your integer or float variables. This is a global default for your entire dataset. You can also give a dictionary where you specify different fill values for different columns.
    • rare_threshold: float (0.05): Define a threshold for rare categories. If a certain category in a column is less than say 5% (0.05) of samples, then it will be considered rare. All rare categories in that column will be merged under a new category named "Rare".
    • correlation_threshold: float (0.8): Define a correlation limit. If two variables have a correlation above this limit, one of them will be dropped. The program will tell you which variable is being dropped. You can switch the sequence of variables in your dataset if you want the other one dropped.
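
To show what a dictionary of per-column fill values amounts to, here is a small sketch in plain pandas. It only illustrates the idea behind `num_fill_value` and `cat_fill_value` and is not Fix_DQ's internal logic; the column names and data are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [50000.0, 62000.0, np.nan],
    "city": ["Paris", None, "Lima"],
})

# One default per numeric column, as in the "age" and "income" example.
num_fill_value = {"age": 0, "income": 999.0}
# A single global default for every non-numeric column.
cat_fill_value = "missing"

filled = df.fillna(num_fill_value)
cat_cols = filled.select_dtypes(exclude="number").columns
filled[cat_cols] = filled[cat_cols].fillna(cat_fill_value)
print(filled)
```

With pandas_dq itself, the same choices would be passed through the `num_fill_value` and `cat_fill_value` arguments described above when creating `Fix_DQ`.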

    DataSchemaChecker is very similar in that it is also a scikit-learn transformer. It checks your data against a given schema.

    What is a schema? A schema is a dictionary (dict) that maps column names to data types.

    DataSchemaChecker has two methods:

    • fit method: Checks if the dataframe matches the schema and displays a table of errors if any.
    • transform method: Transforms the dataframe's dtypes to the given schema and displays errors if any.

    Maintainers

    Contributing

    See the contributing file!

    PRs accepted.

    License

    Apache License 2.0 © 2020 Ram Seshadri

    Note of Gratitude

    This library would not have been possible without the help of ChatGPT and Bard. This library is dedicated to the thousands of people who worked to create LLMs.

    DISCLAIMER

    This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.

    Download files

    Source Distribution: pandas_dq-1.12.tar.gz (22.9 kB)

    Built Distribution: pandas_dq-1.12-py3-none-any.whl (21.4 kB, Python 3)
