Shrink Pandas DataFrames with precision safe schema inference.

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Pandas Downcast

Shrink Pandas DataFrames with precision safe schema inference. pandas-downcast finds the minimum viable type for each column, ensuring that resulting values are within tolerance of original values.

Installation

pip install pandas-downcast

Dependencies

python >= 3.6
pandas
numpy

License

MIT

Usage

import pdcast as pdc
import numpy as np
import pandas as pd

data = {
    "integers": np.linspace(1, 100, 100),
    "floats": np.linspace(1, 1000, 100).round(2),
    "booleans": np.random.choice([1, 0], 100),
    "categories": np.random.choice(["foo", "bar", "baz"], 100),
}

df = pd.DataFrame(data)

# Downcast DataFrame to minimum viable schema.
df_downcast = pdc.downcast(df)

# Infer minimum schema for DataFrame.
schema = pdc.infer_schema(df)

# Coerce DataFrame to schema - required if converting float to Pandas Integer.
df_new = pdc.coerce_df(df, schema)

Smaller data types $\Rightarrow$ smaller memory footprint.

df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100 entries, 0 to 99
# Data columns (total 4 columns):
#  #   Column      Non-Null Count  Dtype  
# ---  ------      --------------  -----  
#  0   integers    100 non-null    float64
#  1   floats      100 non-null    float64
#  2   booleans    100 non-null    int64  
#  3   categories  100 non-null    object 
# dtypes: float64(2), int64(1), object(1)
# memory usage: 3.2+ KB

df_downcast.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100 entries, 0 to 99
# Data columns (total 4 columns):
#  #   Column      Non-Null Count  Dtype   
# ---  ------      --------------  -----   
#  0   integers    100 non-null    uint8   
#  1   floats      100 non-null    float32 
#  2   booleans    100 non-null    bool    
#  3   categories  100 non-null    category
# dtypes: bool(1), category(1), float32(1), uint8(1)
# memory usage: 932.0 bytes

Numerical data types will be downcast if the resulting values are within tolerance of the original values. For details on tolerance for numeric comparison, see the notes on np.allclose.

print(df.head())
#    integers  floats  booleans categories
# 0       1.0    1.00         1        foo
# 1       2.0   11.09         0        baz
# 2       3.0   21.18         1        bar
# 3       4.0   31.27         0        bar
# 4       5.0   41.36         0        foo

print(df_downcast.head())
#    integers     floats  booleans categories
# 0         1   1.000000      True        foo
# 1         2  11.090000     False        baz
# 2         3  21.180000      True        bar
# 3         4  31.270000     False        bar
# 4         5  41.360001     False        foo


print(pdc.options.ATOL)
# >>> 1e-08

print(pdc.options.RTOL)
# >>> 1e-05

Tolerance can be set at the module level or passed in function arguments.

pdc.options.ATOL = 1e-10
pdc.options.RTOL = 1e-10
df_downcast_new = pdc.downcast(df)

infer_dtype_kws = {
    "ATOL": 1e-10,
    "RTOL": 1e-10
}
df_downcast_new = pdc.downcast(df, infer_dtype_kws=infer_dtype_kws)

The floats column is now kept as float64 to meet the tolerance requirement. Values in the integers column are still safely cast to uint8.

df_downcast_new.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100 entries, 0 to 99
# Data columns (total 4 columns):
#  #   Column      Non-Null Count  Dtype   
# ---  ------      --------------  -----   
#  0   integers    100 non-null    uint8   
#  1   floats      100 non-null    float64 
#  2   booleans    100 non-null    bool    
#  3   categories  100 non-null    category
# dtypes: bool(1), category(1), float64(1), uint8(1)
# memory usage: 1.3 KB

Inferred schemas can be restricted to Numpy data types only.

# Downcast DataFrame to minimum viable Numpy schema.
df_downcast = pdc.downcast(df, numpy_dtypes_only=True)

# Infer minimum Numpy schema for DataFrame.
schema = pdc.infer_schema(df, numpy_dtypes_only=True)

Example

The following example shows how downcasting data often leads to size reductions of greater than 70%, depending on the original types.

import pdcast as pdc
import pandas as pd
import seaborn as sns

df_dict = {df: sns.load_dataset(df) for df in sns.get_dataset_names()}

results = []

for name, df in df_dict.items():
    size_pre = df.memory_usage(deep=True).sum()
    df_post = pdc.downcast(df)
    size_post = df_post.memory_usage(deep=True).sum()
    shrinkage = int((1 - (size_post / size_pre)) * 100)
    results.append(
        {"dataset": name, "size_pre": size_pre, "size_post": size_post, "shrink_pct": shrinkage}
    )

results_df = pd.DataFrame(results).sort_values("shrink_pct", ascending=False).reset_index(drop=True)
print(results_df)

           dataset  size_pre  size_post  shrink_pct
0             fmri    213232      14776          93
1          titanic    321240      28162          91
2        attention      5888        696          88
3         penguins     75711       9131          87
4             dots    122240      17488          85
5           geyser     21172       3051          85
6           gammas    500128     108386          78
7         anagrams      2048        456          77
8          planets    112663      30168          73
9         anscombe      3428        964          71
10            iris     14728       5354          63
11        exercise      3302       1412          57
12         flights      3616       1888          47
13             mpg     75756      43842          42
14            tips      7969       6261          21
15        diamonds   3184588    2860948          10
16  brain_networks   4330642    4330642           0
17     car_crashes      5993       5993           0

Project details

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

1.2.4

Mar 29, 2023

1.2.3

Feb 22, 2023

1.2.2

Oct 10, 2022

1.2.1

Jul 1, 2022

1.2.0

Jul 1, 2022

1.1.0

Jul 14, 2021

1.0.1

Jun 26, 2021

1.0.0

Jun 14, 2021

0.1.0.post3

Jun 13, 2021

0.1.0.post2

Jun 13, 2021

0.1.0.post1

Jun 13, 2021

0.1.0

Jun 13, 2021

0.1.0.dev1 pre-release

Jun 4, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pandas_downcast-1.2.4.tar.gz (7.6 kB view hashes)

Uploaded Mar 29, 2023 Source

Built Distribution

pandas_downcast-1.2.4-py3-none-any.whl (8.5 kB view hashes)

Uploaded Mar 29, 2023 Python 3

Hashes for pandas_downcast-1.2.4.tar.gz

Hashes for pandas_downcast-1.2.4.tar.gz
Algorithm	Hash digest
SHA256	`0cadbfe1a57ffdbb9b4c8c9aa1266c2d50f65cb9579e03956140e1db747e1efa`
MD5	`6f60626078df7c406f9c25cea61bb882`
BLAKE2b-256	`88801cf3e8ac26523c26fdaf4b50c951d3af0123cdd4d1706c1ece7e721d0097`

Hashes for pandas_downcast-1.2.4-py3-none-any.whl

Hashes for pandas_downcast-1.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b972f05fa98af04a6dca496c305406cb15607c41521d97328e2bc4390e30c14c`
MD5	`4ceb86665603eb5178c6b3218233720b`
BLAKE2b-256	`5ddee6f661357011bfe7e909595f46941d1e99018cdd749a6c2b1d2247203502`