
verstack 1.1.12 Documentation

Machine learning tools to make a Data Scientist's work efficient

verstack package contains the following tools:

  • LGBMTuner automated lightgbm model tuner with optuna
  • NaNImputer impute all missing values in a pandas dataframe using advanced machine learning with 1 line of code
  • Multicore execute any function in concurrency using all the available cpu cores
  • ThreshTuner tune threshold for binary classification predictions
  • stratified_continuous_split create train/test splits stratified on the continuous variable
  • categoric_encoders a collection of encoders for categoric variables (listed below)
  • Factorizer encode categoric variable by numeric labels
  • OneHotEncoder represent categoric variable as a set of binary variables
  • FrequencyEncoder encode categoric variable by class frequencies
  • MeanTargetEncoder encode categoric variable by mean of the target variable
  • WeightOfEvidenceEncoder encode categoric variable as a weight of evidence of a binary target variable
  • timer convenient timer decorator to quickly measure and display time of any function execution

Getting verstack

$ pip install verstack

$ pip install --upgrade verstack

LGBMTuner

Fully automated lightgbm model hyperparameter tuning class with optuna under the hood. LGBMTuner selects optimal hyperparameters based on executed trials (configurable), optimizes n_estimators and fits the final model to the whole train set. Feature importances are available in numeric format, as a static plot, and as an interactive plot (html). Optimization history and parameters importance in static and interactive formats are also accessible via built-in methods.

Logic

The only required user inputs are X (features), y (labels) and the evaluation metric name; LGBMTuner will handle the rest

  • lgbm model type (regression/classification) is inferred from the labels and the evaluation metric (passed by user)
  • the optimization metric may differ from the evaluation metric (passed by user). At the hyperparameter search stage LGBMTuner employs an error-reduction strategy, thus:
    • most regression task type metrics are supported for optimization directly; if a metric is not supported, MSE is selected for optimization
    • for classification task types hyperparameters are tuned by optimizing log_loss, n_estimators are tuned with the evaluation_metric
  • early stopping is engaged at each stage of LGBMTuner optimization
  • for every trial (iteration) a random train_test_split is performed (stratified for classification); a simplified sketch of this per-trial loop is shown after this list
  • lgbm model initial parameters are not the lightgbm defaults: they are inferred from the data stats and built-in logic
  • optimization parameters and their search space are inferred from the data stats and built-in logic
  • the LGBMTuner class instance (after optimization) can be used for making predictions with the conventional syntax (predict/predict_proba)
  • verbosity is controlled and by default outputs only the necessary optimization process/results information
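
For intuition, here is a heavily simplified sketch of such a per-trial loop, assuming a regression task optimized with MSE; the search space, initial parameters and metrics below are illustrative placeholders, not what LGBMTuner actually infers internally

import lightgbm as lgb
import optuna
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def objective(trial, X, y):
    # hypothetical search space for illustration only
    params = {'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
              'num_leaves': trial.suggest_int('num_leaves', 16, 255),
              'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0)}
    # a fresh random split is drawn for every trial
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                      random_state=trial.number)
    model = lgb.LGBMRegressor(n_estimators=10000, **params)
    # early stopping keeps every trial cheap
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(50, verbose=False)])
    return mean_squared_error(y_val, model.predict(X_val))  # error is minimized

study = optuna.create_study(direction='minimize')
# study.optimize(lambda trial: objective(trial, X, y), n_trials=100)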

Initialize LGBMTuner

from verstack import LGBMTuner

# initialize with default parameters
tuner = LGBMTuner('rmse')

# initialize with selected parameters
tuner = LGBMTuner(metric = 'rmse', 
                  trials = 200, 
                  refit = False, 
                  verbosity = 0, 
                  visualization = False, 
                  seed = 999)

Parameters

  • metric [default=None]

    Evaluation metric for hyperparameters optimization. LGBMTuner supports the following metrics (note the syntax)

    : ['mae', 'mse', 'rmse', 'rmsle', 'mape', 'smape', 'rmspe', 'r2', 'auc', 'gini', 'log_loss', 'accuracy', 'balanced_accuracy', 'precision', 'precision_weighted', 'precision_macro', 'recall', 'recall_weighted', 'recall_macro', 'f1', 'f1_weighted', 'f1_macro', 'lift']

  • trials [default=100]

    Number of trials to run

  • refit [default=True]

    Fit the model with optimized hyperparameters on the whole train set (required for feature_importances, plot_importances() and prediction methods)

  • verbosity [default=1]

    Console verbosity level: 0 - no output except for optuna CRITICAL errors and builtin exceptions; (1-5) based on optuna.logging options. The default is 1

  • visualization [default=True]

    Automatically output feature_importance & optimization plots into the console after tuning. Plots are also available on demand by corresponding methods

  • seed [default=42]

    Random state parameter

Methods

  • fit(X, y)

    Execute LGBM model hyperparameters tuning

    Parameters

    • X [pd.DataFrame]

      Train features

    • y [pd.Series]

      Train labels

  • optimize_n_estimators(X, y, params, verbose_eval = 100)

    Optimize n_estimators for lgb model.

    Parameters

    • X [np.array]

      Train features

    • y [np.array]

      Train labels

    • params [dict]

      parameters to use for training the model with early stopping

    • verbose_eval [int]

      evaluation output at each verbose_eval iteration

    returns

    : (best_iteration, best_score)

  • fit_optimized(X, y)

    Train model with tuned params on whole train data

    • X [np.array]

      Train features

    • y [np.array]

      Train labels

  • predict(test, threshold = 0.5)

    Predict by optimized model on new data

    • test [pd.DataFrame]

      Test features

    • threshold [default=0.5]

      Classification threshold (applicable for binary classification)

    returns

    : array of int

  • predict_proba(test)

    Predict probabilities by optimized model on new data

    • test [pd.DataFrame]

      Test features

    returns

    : array of float

  • plot_importances(n_features = 15, figsize = (10,6), interactive = False)

    Plot feature importance

    • n_features [default=15]

      Number of important features to plot

    • figsize [default=(10,6)]

      plot size

    • interactive [default=False]

      Create & display the interactive html plot with the default browser, or (if browser display is unavailable) save it to the current working directory.

  • plot_optimization_history(interactive = False)

    Plot optimization function improvement history

    • interactive [default=False]

      Create & display the interactive html plot with the default browser, or (if browser display is unavailable) save it to the current working directory.

  • plot_param_importances(interactive = False)

    Plot parameters importance

    • interactive [default=False]

      Create & display the interactive html plot with the default browser, or (if browser display is unavailable) save it to the current working directory.

  • plot_intermediate_values(interactive = False, legend = False)

    Plot optimization trials history. Shows successful and terminated trials. If trials > 50 it is better to study the interactive version

    • interactive [default=False]

      Create & display the interactive html plot with the default browser, or (if browser display is unavailable) save it to the current working directory.

    • legend [default=False]

      Plot legend on the static plot

Attributes

  • metric

    Evaluation metric defined by user at LGBMTuner init

  • refit

    Setting for refitting the optimized model on whole train dataset

  • verbosity

    Verbosity level settings

  • visualization

    Automatic plots output after optimization setting

  • seed

    Random state value

  • fitted_model

    Trained LGBM booster model with optimized parameters

  • feature_importances

    Feature importance values

  • study

    optuna.study.study.Study object after hyperparameters tuning

  • init_params

    initial LGBM model parameters

  • best_params

    learned optimized parameters

Examples

Using LGBMTuner with all default parameters

tuner = LGBMTuner('auc')
tuner.fit(X, y)
tuner.feature_importances
tuner.plot_importances()
tuner.plot_intermediate_values()
tuner.plot_optimization_history()
tuner.plot_param_importances()
tuner.best_params
tuner.predict(test)

LGBMTuner with custom settings

tuner = LGBMTuner(metric = 'auc', trials = 300, verbosity = 3, visualization = False)
tuner.fit(X, y)
tuner.plot_importances(n_features = 20)
tuner.plot_intermediate_values(interactive = True)
tuner.predict(test, threshold = 0.3)

NaNImputer

Impute all missing values in a pandas dataframe by xgboost models in multiprocessing mode using a single line of code.

Logic

With NaNImputer you can fill missing values in numeric, binary and categoric columns in your pandas dataframe using advanced XGBRegressor/XGBClassifier models with just 1 line of code. Regardless of the data types in your dataframe (string/bool/numeric):

  • all of the columns will be checked for missing values
  • transformed into numeric formats
  • split into subsets with and without missing values
  • applicable models will be selected and configured for each of the columns with NaNs
  • models will be trained in multiprocessing mode utilizing all the available cores and threads of your cpu (this saves a lot of time)
  • NaNs will be predicted and placed into the corresponding indices
  • columns with all NaNs will be dropped
  • columns containing NaNs and known values as a single constant will be dropped
  • data will be reverse-transformed into original format

The module is highly configurable with default arguments set for the highest performance and verbosity

The only limitation is:

  • NaNs in pure text columns are not imputed. By default they are filled with the 'Missing_data' value (configurable); if disabled, these columns are returned with missing values untouched
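
The per-column imputation idea described above can be sketched as follows; this is a simplification (NaNImputer additionally selects the n_feats most relevant features, encodes non-numeric data and runs columns in parallel), and the column/feature names are placeholders

from xgboost import XGBRegressor

def impute_numeric_column(df, col, feats):
    known = df[df[col].notna()]     # rows where the column value is known
    missing = df[df[col].isna()]    # rows to be imputed
    if known.empty or missing.empty:
        return df
    model = XGBRegressor()          # XGBClassifier would be used for binary/categoric columns
    model.fit(known[feats], known[col])
    df = df.copy()
    df.loc[missing.index, col] = model.predict(missing[feats])
    return df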

Initialize NaNImputer

from verstack import NaNImputer

# initialize with default parameters
imputer = NaNImputer()

# initialize with selected parameters
imputer = NaNImputer(conservative = False, 
                     n_feats = 10, 
                     nan_cols = None, 
                     fix_string_nans = True, 
                     multiprocessing_load = 3, 
                     verbose = True, 
                     fill_nans_in_pure_text = True, 
                     drop_empty_cols = True, 
                     drop_nan_cols_with_constant = True)

Parameters

  • conservative [default=False]

    Model complexity level used to impute missing values. If True: a less complex and much faster model configuration will be used.

  • n_feats [default=10]

    Number of correlated independent features to be used for the corresponding column's (with NaNs) model training and imputation.

  • nan_cols [default=None]

    List of columns to impute missing values in. If None: all the columns with missing values will be used.

  • fix_string_nans [default=True]

    Find possible missing values in numeric columns that had been (mistakenly) encoded as strings, E.g. 'Missing'/'NaN'/'No data' and replace them with np.nan for further imputation.

  • multiprocessing_load [default=3]

    • Levels of parallel multiprocessing compute
      • 1 = single core
      • 2 = half of all available cores
      • 3 = all available cores
  • verbose [default=True]

    Print the imputation progress.

  • fill_nans_in_pure_text [default=True]

    Fill the missing values in text fields with the string 'Missing_data'. Applicable for text fields (not categoric).

  • drop_empty_cols [default=True]

    Drop columns with all NaNs.

  • drop_nan_cols_with_constant [default=True]

    Drop columns containing NaNs and known values as a single constant.

  • feature_selection [default="correlation"]

    • Define algorithm to select most important feats for each column imputation. Quick option: "correlation" is based on selecting n_feats with the highest binary correlation with each column for NaNs imputation. Less quick but more precise: "feature_importance" is based on extracting feature_importances from an xgboost model.

Methods

  • impute(data)

    Execute NaNs imputation columnwise in a pd.DataFrame

    Parameters

    • data pd.DataFrame

      dataframe with missing values in a single/multiple columns

Examples

Using NaNImputer with all default parameters

imputer = NaNImputer()
df_imputed = imputer.impute(df)

Say you would like to impute missing values in a list of specific columns, use the 20 most important features for each of these columns' imputation and utilize half of the available cpu cores:

imputer = NaNImputer(nan_cols = ['col1', 'col2'], n_feats = 20, multiprocessing_load = 2)
df_imputed = imputer.impute(df)

Multicore

Execute any function in concurrency using all the available cpu cores.

Logic

The Multicore module is built on top of the concurrent.futures package. Passed iterables are divided into chunks according to the number of workers and passed into separate processes.

Results are extracted from finished processes and combined into a single/multiple output as per the defined function output requirements.

Multiple outputs are returned as a nested list.
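
The chunking approach can be illustrated with a simplified sketch (not verstack's internal code); _process_chunk stands in for the user function

import os
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def _process_chunk(chunk):
    # stands in for the user function applied to one chunk of the iterable
    return [item * 2 for item in chunk]

def run_in_parallel(iterable, workers=None):
    workers = workers or os.cpu_count()
    chunks = np.array_split(list(iterable), workers)        # one chunk per worker
    with ProcessPoolExecutor(max_workers=workers) as executor:
        results = executor.map(_process_chunk, chunks)       # one process per chunk
    return [item for chunk_result in results for item in chunk_result]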

Initialize Multicore

from verstack import Multicore

# initialize with default parameters
multicore = Multicore()

# initialize with selected parameters
multicore = Multicore(workers = 6,
                      multiple_iterables = True)

Parameters

  • workers int or bool [default=False]

    Number of workers if passed by user. If False: all available cpu cores will be used.

  • multiple_iterables bool [default=False]

    If function needs to iterate over multiple iterables, set to True.

    Multiple iterables must be passed as a list (see examples below).

Methods

  • execute(func, iterable)

    Execute passed function and iterable(s) in concurrency.

    Parameters

    • func function

      function to execute in parallel

    • iterable list/pd.Series/pd.DataFrame/dictionary

      data to iterate over

Examples

Use Multicore with all default parameters

multicore = Multicore()
result = multicore.execute(function, iterable_list)

If you want to use a limited number of cpu cores and need to iterate over two objects:

multicore = Multicore(workers = 2, multiple_iterables = True)
result = multicore.execute(function, [iterable_dataframe, iterable_list])

ThreshTuner

Find the best threshold to split your predictions in a binary classification task. Most applicable for imbalanced target cases. In addition to thresholds & loss_func scores, the predicted_ratio (predicted fraction of 1) will be calculated and saved for every threshold. This will help identify the appropriate threshold not only based on the score, but also based on the resulting distribution of 0 and 1 in the predictions.

Logic

Default behavior (only pass the labels and predictions):

  • Calculate the labels balance (fraction_of_1 in labels)
  • Define the min_threshold as fraction_of_1 * 0.8
  • Define the max_threshold as fraction_of_1 * 1.2, but not greater than 1
  • Define n_thresholds = 200
  • Create 200 threshold options uniformly distributed between min_threshold & max_threshold
  • Deploy balanced_accuracy_score as loss_func
  • Perform the loss function calculation and save results in class instance placeholders

Customization options

  • Change the n_thresholds to the desired value
  • Change the min_threshold & max_threshold to the desired values
  • Pass the loss_func of choice, e.g. sklearn.metrics.f1_score

This results in a user-defined granularity of thresholds to test.
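
The default behaviour above roughly corresponds to the following sketch; ThreshTuner itself adds result placeholders, best-score reporting and input validation

import numpy as np
import pandas as pd
from sklearn.metrics import balanced_accuracy_score

def tune_threshold(labels, pred, n_thresholds=200, loss_func=balanced_accuracy_score):
    labels, pred = np.asarray(labels), np.asarray(pred)
    fraction_of_1 = labels.mean()
    min_threshold = fraction_of_1 * 0.8
    max_threshold = min(fraction_of_1 * 1.2, 1.0)
    thresholds = np.linspace(min_threshold, max_threshold, n_thresholds)
    rows = []
    for t in thresholds:
        binary_pred = (pred >= t).astype(int)
        rows.append({'threshold': t,
                     'loss_func_score': loss_func(labels, binary_pred),
                     'fraction_of_1': binary_pred.mean()})
    return pd.DataFrame(rows)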

Initialize ThreshTuner

from verstack import ThreshTuner

# initialize with default parameters
thresh = ThreshTuner()

# initialize with selected parameters
thresh = ThreshTuner(n_thresholds = 500,
                     min_threshold = 0.3,
                     max_threshold = 0.7)

Parameters

  • n_thresholds int [default=200]

    Number of thresholds to test. If not set by user: 200 thresholds will be tested.

  • min_threshold float or int [default=None]

    Minimum threshold value. If not set by user: will be inferred from labels balance based on fraction_of_1

  • max_threshold float or int [default=None]

    Maximum threshold value. If not set by user: will be inferred from labels balance based on fraction_of_1

Methods

  • fit(labels, pred, loss_func)

    Calculate loss_func results for labels & preds for the defined/default thresholds. Print the threshold(s) with the best loss_func scores

    Parameters

    • labels array/list/series

      y_true labels represented as 0 or 1

    • pred array/list/series

      predicted probabilities of 1

    • loss_func function [default=balanced_accuracy_score]

      loss function for scoring the predictions, e.g. sklearn.metrics.f1_score

  • result()

    Display a dataframe with thresholds/loss_func_scores/fraction_of_1 for all the defined/default thresholds

  • best_score()

    Display a dataframe with thresholds/loss_func_scores/fraction_of_1 for the best loss_func_score

  • best_predict_ratio()

    Display a dataframe with thresholds/loss_func_scores/fraction_of_1 for the (predicted) fraction_of_1 which is closest to the (actual) labels_fraction_of_1

Examples

Use ThreshTuner with all default parameters

thresh = ThreshTuner()
thresh.fit(labels, pred)

Customized ThreshTuner application

from sklearn.metrics import f1_score

thresh = ThreshTuner(n_thresholds = 500, min_threshold = 0.2, max_threshold = 0.6)
thresh.fit(labels, pred, f1_score)

Access the results after .fit()

thresh = ThreshTuner()
thresh.fit(labels, pred)

# return pd.DataFrame with all the results
thresh.result()
# return pd.DataFrame with the best loss_func score
thresh.best_score()
thresh.best_score()['threshold']
# return pd.DataFrame with the best predicted fraction_of_1
thresh.best_predict_ratio()
# return the actual labels fraction_of_1
thresh.labels_fraction_of_1

stratified_continuous_split

Create stratified splits based on either continuous or categoric target variable.

  • For continuous target variable verstack uses binning and categoric split based on bins
  • For categoric target enhanced sklearn.model_selection.train_test_split is used: in case there are not enough categories for the split, the minority classes will be combined with nearest neighbors

Can accept only pandas.DataFrame/pandas.Series as data input.

verstack.stratified_continuous_split.scsplit(*args, 
                                             stratify, 
                                             test_size = 0.3, 
                                             train_size = 0.7, 
                                             continuous = True, 
                                             random_state = None)

Parameters

  • X,y,data

    data input for the split in pandas.DataFrame/pandas.Series format.

  • stratify

    target variable for the split in pandas.Series format.

  • test_size [default=0.3]

    test split ratio.

  • train_size [default=0.7]

    train split ratio.

  • continuous [default=True]

    stratification target definition. If True, verstack will perform the stratification on the continuous target variable, if False, sklearn.model_selection.train_test_split will be performed with verstack enhancements.

  • random_state [default=None]

    random state value.

Examples

from verstack.stratified_continuous_split import scsplit

train, test = scsplit(data, stratify = data['continuous_column_name'])
X_train, X_val, y_train, y_val = scsplit(X, y, stratify = y, 
                                         test_size = 0.3, random_state = 5)
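
For intuition, the continuous stratification can be approximated with plain pandas/sklearn as below; this is a simplified sketch with an assumed bin count, while scsplit's own binning and handling of sparse categories is more elaborate

import pandas as pd
from sklearn.model_selection import train_test_split

def continuous_stratified_split(X, y, test_size=0.3, random_state=None, bins=10):
    # bin the continuous target and stratify an ordinary split on the bins
    y_binned = pd.qcut(y, q=bins, labels=False, duplicates='drop')
    return train_test_split(X, y, test_size=test_size,
                            stratify=y_binned, random_state=random_state)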

categoric_encoders

Note

All the categoric encoders are conveniently integrated to work with pandas.DataFrame. Modules receive a pd.DataFrame and kwargs as inputs and return a pd.DataFrame with the encoded column. All the necessary attributes for further transform/inverse_transform are saved in the instance objects and can be serialized (e.g. with pickle) for later application.

Factorizer

Encode categoric column by numeric labels.

Logic

Assign numeric labels starting with 0 to all of the variable's unique categories.

Missing values can be encoded by an integer value (defaults to -1) / float / string or can be left untransformed.

When transform() is applied, unseen categories will be represented as NaN.
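
The underlying encoding idea can be sketched with pandas.factorize; this is a simplification, as the actual Factorizer also stores the encoding pattern for transform() and inverse_transform()

import pandas as pd

def factorize_column(df, colname, na_sentinel=-1):
    codes, _ = pd.factorize(df[colname])    # labels start at 0, NaN becomes -1
    df = df.copy()
    df[colname] = pd.Series(codes, index=df.index).replace(-1, na_sentinel)
    return df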

Initialize Factorizer

from verstack import Factorizer

# initialize with default parameters
factorizer = Factorizer()

# initialize with changing the NaN encoding value
factorizer = Factorizer(na_sentinel = np.nan) # or -999 / 0.33333 / 'No data'

Attributes

  • na_sentinel

    Defined (at init) missing values encoding value.

  • colname

    Defined (at fit_transform()) column that had been transformed.

  • pattern

    Defined (at fit_transform()) encoding map.

Parameters

  • na_sentinel [default=-1]

    Missing values encoding value. Can take int/float/str/np.nan values.

Methods

  • fit_transform(df, colname)

    Fit Factorizer to data and return transformed data.

    Parameters

    • df pd.DataFrame

      df containing the colname to transform.

    • colname str

      Column name in df to be transformed.

  • transform(df)

    Apply the fitted Factorizer to new data and return transformed data. Unseen categories will be represented by NaN.

    Parameters

    • df pd.DataFrame

      Data containing the colname to transform.

  • inverse_transform(df)

    Inverse transform data that had been encoded by Factorizer. Data must contain colname that was passed at fit_transform().

    Parameters

    • df pd.DataFrame

      Data containing the colname to transform.

Examples

Use with default na_sentinel:

factorizer = Factorizer()
train_encoded = factorizer.fit_transform(train, 'colname') # will encode NaN values by -1
test_encoded = factorizer.transform(test)

train_reversed_to_original = factorizer.inverse_transform(train_encoded)
test_reversed_to_original = factorizer.inverse_transform(test_encoded)

Keep missing values untransformed:

factorizer = Factorizer(na_sentinel = np.nan)
train_encoded = factorizer.fit_transform(train, 'colname')

OneHotEncoder

Encode categoric column by a set of binary columns.

Logic

Categoric 'column':['a','b','c'] will be represented by three binary columns 'a', 'b', 'c'. The original categoric 'column' is dropped.

Missing values can be represented by a separate column or omitted.

When transform() is applied, unseen categories will not be represented by new columns; missing categories will be represented by empty (all zeros) columns.
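
The fit_transform step is roughly equivalent to the following pandas.get_dummies sketch; verstack's OneHotEncoder additionally remembers the fitted categories so that transform() and inverse_transform() behave consistently on new data

import pandas as pd

def one_hot_column(df, colname, na_sentinel=True, prefix=None):
    # dummy_na=True adds a separate binary column for NaN values
    dummies = pd.get_dummies(df[colname], prefix=prefix, dummy_na=na_sentinel)
    return pd.concat([df.drop(columns=[colname]), dummies], axis=1)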

Initialize OneHotEncoder

from verstack import OneHotEncoder
ohe = OneHotEncoder()
train_encoded = ohe.fit_transform(train, 'colname') # will create a separate column for NaN values (if any)
test_encoded = ohe.transform(test)

train_reversed_to_original = ohe.inverse_transform(train_encoded)
test_reversed_to_original = ohe.inverse_transform(test_encoded)

Attributes

  • na_sentinel

    Defined (at init) missing values encoding value.

  • colname

    Defined (at fit_transform()) column that had been transformed.

  • categories

    Defined (at fit_transform()) unique class categories which will be represented by binary columns.

Parameters

  • na_sentinel [default=True]

    If True: create separate class column for NaN values.

Methods

  • fit_transform(df, colname, prefix)

    Fit OneHotEncoder to data and return transformed data.

    Parameters

    • df pd.DataFrame

      df containing the colname to transform.

    • colname str

      Column name in df to be transformed.

    • prefix str/int/float/bool/None, optional

      String to append DataFrame column names. The default is None.

  • transform(df)

    Apply the fitted OneHotEncoder to new data and return transformed data. Unseen categories will not be represented by new columns, missing categories will be represented by empty (all zeros) columns.

    Parameters

    • df pd.DataFrame

      Data containing the colname to transform.

  • inverse_transform(df)

    Inverse transform data that had been encoded by OneHotEncoder. Data must contain the one-hot-encoded columns that were created at fit_transform().

    Parameters

    • df pd.DataFrame

      Data containing the colname to transform.

Examples

ohe = OneHotEncoder()
train_encoded = ohe.fit_transform(train, 'colname', prefix = 'colname')
test_encoded = ohe.transform(test)

train_reversed_to_original = ohe.inverse_transform(train_encoded)
test_reversed_to_original = ohe.inverse_transform(test_encoded)

FrequencyEncoder

Encoder to represent categoric variable classes' frequency across the dataset.

Logic

Original column ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', np.nan]

Encoded column [0.3, 0.3, 0.3, 0.2, 0.2, 0.4, 0.4, 0.4, 0.4, 0.1] (NaN is encoded by its own frequency 0.1, or left as np.nan if na_sentinel = False)

When transform() is applied, unseen categories will be represented by the most common (highest) frequency.

Can handle missing values: encode NaN by the NaN frequency or leave NaN values untransformed. Resulting frequencies are normalized fractions of the dataset.
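
A minimal sketch of the encoding idea; the real FrequencyEncoder also encodes NaN by its own frequency when na_sentinel = True and keeps the learned pattern for transform()/inverse_transform()

def frequency_encode(df, colname):
    df = df.copy()
    freqs = df[colname].value_counts(normalize=True)   # fraction of the dataset per category
    df[colname] = df[colname].map(freqs)                # NaN stays NaN in this simplified version
    return df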

Initialize FrequencyEncoder

from verstack import FrequencyEncoder
fe = FrequencyEncoder()
train_encoded = fe.fit_transform(train, 'colname')
test_encoded = fe.transform(test)

train_reversed_to_original = fe.inverse_transform(train_encoded)
test_reversed_to_original = fe.inverse_transform(test_encoded)

Attributes

  • na_sentinel

    Defined (at init) missing values encoding value.

  • colname

    Defined (at fit_transform()) column that had been transformed.

  • pattern

    Defined (at fit_transform()) encoding map.

Parameters

  • na_sentinel [default=True]
    • If True: Encode NaN values by their frequency. If False return np.nan in the encoded column.

Methods

  • fit_transform(df, colname)

    Fit FrequencyEncoder to data and return transformed data.

    Parameters

    • df pd.DataFrame

      df containing the colname to transform.

    • colname str

      Column name in df to be transformed.

  • transform(df)

    Apply the fitted FrequencyEncoder to new data and return transformed data. Unseen categories will be represented as NaN.

    Parameters

    • df pd.DataFrame

      Data containing the colname to transform.

  • inverse_transform(df)

    Inverse transform data that had been encoded by FrequencyEncoder. Data must contain colname that was passed at fit_transform().

    Parameters

    • df pd.DataFrame

      Data containing the colname to transform.

Examples

frequency_encoder = FrequencyEncoder()
train_encoded = frequency_encoder.fit_transform(train, 'colname')
test_encoded = frequency_encoder.transform(test)

train_reversed_to_original = frequency_encoder.inverse_transform(train_encoded)
test_reversed_to_original = frequency_encoder.inverse_transform(test_encoded)

MeanTargetEncoder

Encode categoric columns by the mean target value for each category.

Logic

To avoid target leakage, train set encoding is performed by breaking the data into 5 folds & encoding categories of each fold with their respective target mean values calculated on the other 4 folds. This will introduce minor noise to the train data encoding (at fit_transform()) as a normalization technique. The test set (transform()) is encoded without normalization.

When transform() - unseen categories will be represented by the global target mean.

Can handle missing values - encode NaN by global mean or leave NaN values untransformed.
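
The out-of-fold scheme can be sketched as follows; this is a simplification, as the actual MeanTargetEncoder also handles NaN, stores the pattern for transform()/inverse_transform() and encodes the test set with means computed on the whole train set

import pandas as pd
from sklearn.model_selection import KFold

def mean_target_encode_train(df, colname, targetname, n_folds=5, seed=42):
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[targetname].mean()
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, val_idx in folds.split(df):
        # category means are computed on the other folds only, then applied to this fold
        fold_means = df.iloc[train_idx].groupby(colname)[targetname].mean()
        fold_encoded = df.iloc[val_idx][colname].map(fold_means).fillna(global_mean)
        encoded.iloc[val_idx] = fold_encoded.values
    return encoded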

Initialize MeanTargetEncoder

from verstack import MeanTargetEncoder
mean_target_encoder = MeanTargetEncoder(save_inverse_transform = True)
train_encoded = mean_target_encoder.fit_transform(train, 'colname', 'targetname')
test_encoded = mean_target_encoder.transform(test)

train_reversed_to_original = mean_target_encoder.inverse_transform(train_encoded)
test_reversed_to_original = mean_target_encoder.inverse_transform(test_encoded)

Attributes

  • na_sentinel

    Defined (at init) missing values encoding value.

  • colname

    Defined (at fit_transform()) column that had been transformed.

  • pattern

    Defined (at fit_transform()) encoding map.

  • save_inverse_transform

    Defined (at init) flag for saving the pattern for inverse transform.

Parameters

  • na_sentinel [default=True]

    If True: Encode NaN values by target global mean. If False return np.nan in the encoded column.

  • save_inverse_transform [default=False]

    If True: Saves mean target values for each category at each encoding fold. Enable if need to inverse_transform the encoded data. Defaults to False because for large datasets saved pattern can significantly increase instance object size.

Methods

  • fit_transform(df, colname, targetname)

    Fit MeanTargetEncoder to data and return transformed data.

    Parameters

    • df pd.DataFrame

      df containing the colname to transform.

    • colname str

      Column name in df to be transformed.

    • targetname str

      Target column name in df for extracting the mean values for each colname category.

  • transform(df)

    Apply the fitted MeanTargetEncoder to new data and return transformed data. Unseen categories will be encoded by the global target mean.

    Parameters

    • df pd.DataFrame

      Data containing the colname to transform.

  • inverse_transform(df)

    Inverse transform data that had been encoded by MeanTargetEncoder. Data must contain colname that was passed at fit_transform().

    Parameters

    • df pd.DataFrame

      Data containing the colname to transform.

Examples

mean_target_encoder = MeanTargetEncoder(save_inverse_transform = True)
train_encoded = mean_target_encoder.fit_transform(train, 'colname', 'targetname')
test_encoded = mean_target_encoder.transform(test)

train_reversed_to_original = mean_target_encoder.inverse_transform(train_encoded)
test_reversed_to_original = mean_target_encoder.inverse_transform(test_encoded)

WeightOfEvidenceEncoder

Encoder to represent categoric variables by their Weight of Evidence with regard to the binary target variable.

Logic

Built on top of the category_encoders package (category_encoders.woe.WOEEncoder).

If the encoded value is negative, it represents a category that is more heavily inclined to the negative target class (0). A positive encoding result represents inclination to the positive target class (1).

When fit_transform() is used on a train set, the variable is encoded with minor added noise to reduce the risk of overfitting.

Can handle missing values - encode NaN by zero WoE or leave NaN untransformed.
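
For reference, the plain (unregularized) Weight of Evidence of a category can be sketched as below; the encoder itself delegates to category_encoders.woe.WOEEncoder, which adds regularization and optional noise (categories seen in only one target class would need such smoothing)

import numpy as np

def weight_of_evidence(df, colname, targetname):
    # share of positive (1) and negative (0) targets that fall into each category
    pos = df[df[targetname] == 1].groupby(colname).size() / (df[targetname] == 1).sum()
    neg = df[df[targetname] == 0].groupby(colname).size() / (df[targetname] == 0).sum()
    return np.log(pos / neg)   # > 0 leans towards class 1, < 0 towards class 0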

Initialize WeightOfEvidenceEncoder

from verstack import WeightOfEvidenceEncoder
WOE = WeightOfEvidenceEncoder()
train_encoded = WOE.fit_transform(train, 'colname', 'targetname')
test_encoded = WOE.transform(test)

train_reversed_to_original = WOE.inverse_transform(train_encoded)
test_reversed_to_original = WOE.inverse_transform(test_encoded)

Attributes

  • na_sentinel

    Defined (at init) missing values encoding value.

  • colname

    Defined (at fit_transform()) column that had been transformed.

  • params

    Defined (at init) category_encoders.woe.WOEEncoder parameters

Parameters

  • na_sentinel [default=True]

    If True: Encode NaN values by zero WoE. If False return np.nan in the encoded column.

  • kwargs

    category_encoders.woe.WOEEncoder parameters. Following parameters are set by default: 'randomized':True, 'random_state':42, 'handle_missing':'return_nan' <- inferred from na_sentinel setting.

Methods

  • fit_transform(df, colname, targetname)

    Fit WeightOfEvidenceEncoder to data and return transformed data.

    Parameters

    • df pd.DataFrame

      df containing the colname to transform.

    • colname str

      Column name in df to be transformed.

    • targetname str

      Target column name in df for calculating WoE for each colname category.

  • transform(df)

    Apply the fitted WeightOfEvidenceEncoder to new data and return transformed data. Unseen categories' WoE is set to 0.

    Parameters

    • df pd.DataFrame

      Data containing the colname to transform.

  • inverse_transform(df)

    Inverse transform data that had been encoded by WeightOfEvidenceEncoder. Data must contain colname that was passed at fit_transform().

    Parameters

    • df pd.DataFrame

      Data containing the colname to transform.

Examples

WOE = WeightOfEvidenceEncoder()
train_encoded = WOE.fit_transform(train, 'colname', 'targetname')
test_encoded = WOE.transform(test)

train_reversed_to_original = WOE.inverse_transform(train_encoded)
test_reversed_to_original = WOE.inverse_transform(test_encoded)

timer

Timer decorator to measure any function's execution time and create an elapsed time output: hours/minutes/seconds will be calculated and returned conveniently.

verstack.tools.timer

Examples

timer is a decorator function: it must be placed above the definition of the function that needs to be timed

from verstack.tools import timer

@timer
def func(a,b):
    print(f'Result is: {a + b}')

func(2,3)

>>>Result is: 5
>>>Time elapsed for func execution: 0.0002 seconds

Links

Git

pypi

author
