MLimputer - Missing Data Imputation Framework for Machine Learning
Framework Contextualization: Advanced Missing Data Imputation for Tabular Data
The MLimputer project provides a comprehensive and integrated framework to automate the handling of missing values in datasets through advanced machine learning imputation. It aims to reduce bias and increase the precision of imputation results compared to traditional methods by leveraging supervised learning algorithms.
The package offers multiple algorithms for imputation: each column with missing values is predicted from the remaining features using robust preprocessing and state-of-the-art machine learning models.
The architecture includes three main components:
- Missing Data Analysis: Automatic detection and pattern analysis of missing values
- Data Preprocessing: Intelligent handling of categorical and numerical features
- Supervised Model Imputation: Multiple ML algorithms for accurate value prediction
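MLimputer performs the missing-data analysis step internally; the core idea, per-column missing counts and rates, can be sketched with plain pandas (the frame below is an illustrative stand-in, not part of the MLimputer API):

```python
import numpy as np
import pandas as pd

# Small illustrative frame with missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "income": [50000, 62000, np.nan, 58000, 47000],
    "city": ["NY", "LA", "NY", np.nan, "SF"],
})

# Per-column missing counts and rates
missing_summary = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean().round(2),
})
print(missing_summary)
```

Columns with a nonzero `pct_missing` are the ones an ML-based imputer would model as prediction targets.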
Key Capabilities
- General applicability on tabular datasets: Works with any tabular data for both regression and classification tasks
- Robust preprocessing: Automatic handling of categorical encoding and feature scaling
- Multiple imputation strategies: Choose from several ML algorithms based on your data characteristics
- Performance evaluation: Built-in evaluation framework to compare and select the best imputation strategy
- Production ready: Save and load fitted imputers for deployment
Main Development Tools
Major frameworks used to build this project:
- Pandas - Data manipulation and analysis
- Scikit-learn - Core ML algorithms
- XGBoost - Gradient boosting
- CatBoost - Gradient boosting with categorical support
- Pydantic - Data validation
Installation
Binary installers for the latest released version are available at the Python Package Index (PyPI).

```shell
pip install mlimputer
```
GitHub Project Link: https://github.com/TsLu1s/MLimputer
Quick Start Guide
Basic Usage Example
The first step is to import the package, load your dataset, and choose an imputation model. Available imputation models are:
- RandomForest
- ExtraTrees
- GBR
- KNN
- XGBoost
- Catboost
```python
import pandas as pd
from mlimputer import MLimputer
from mlimputer.schemas.parameters import imputer_parameters
from mlimputer.utils.splitter import DataSplitter
import warnings
warnings.filterwarnings("ignore")

# Load your data
data = pd.read_csv('your_dataset.csv')

# Split with automatic index reset (required for MLimputer)
splitter = DataSplitter(random_state=42)
X_train, X_test, y_train, y_test = splitter.split(
    data.drop(columns=['target']),
    data['target'],
    test_size=0.2
)

# Configure imputation parameters (optional)
params = imputer_parameters()
params["RandomForest"]["n_estimators"] = 50
params["RandomForest"]["max_depth"] = 10

# Create and fit imputer
imputer = MLimputer(imput_model="RandomForest", imputer_configs=params)
imputer.fit(X=X_train)

# Transform datasets
X_train_imputed = imputer.transform(X=X_train)
X_test_imputed = imputer.transform(X=X_test)

# Save fitted imputer for production use
import pickle
with open("fitted_imputer.pkl", 'wb') as f:
    pickle.dump(imputer, f)
```
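Restoring the pickled imputer later is the mirror of the save step. A minimal round-trip sketch, using a plain dictionary as a stand-in for the fitted object (any picklable object behaves the same way):

```python
import os
import pickle
import tempfile

# Stand-in for a fitted imputer (illustrative only)
fitted = {"model": "RandomForest", "columns": ["age", "income"]}

path = os.path.join(tempfile.mkdtemp(), "fitted_imputer.pkl")
with open(path, "wb") as f:
    pickle.dump(fitted, f)

# Later, e.g. in a serving process, restore the object
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["model"])  # RandomForest
```

With a real `MLimputer` instance, the restored object exposes the same `transform` method it had when saved.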
Advanced Configuration
Customize imputation model hyperparameters for better performance:
```python
from mlimputer import MLimputer
from mlimputer.schemas.parameters import imputer_parameters, update_model_config

# Get default parameters
params = imputer_parameters()

# Method 1: Direct modification
params["KNN"]["n_neighbors"] = 7
params["KNN"]["weights"] = "distance"

# Method 2: Using update function with validation
params["RandomForest"] = update_model_config(
    "RandomForest",
    {"n_estimators": 100, "max_depth": 15, "min_samples_split": 5}
)

# Apply different strategies
strategies = ["RandomForest", "KNN", "XGBoost"]
for strategy in strategies:
    imputer = MLimputer(imput_model=strategy, imputer_configs=params)
    imputer.fit(X=X_train)
    print(f"{strategy}: {imputer.get_summary()['n_columns_imputed']} columns imputed")
```
Performance Evaluation
The MLimputer framework includes a robust evaluation module to assess and compare different imputation strategies. This helps you select the most effective approach for your specific dataset.
Evaluation Framework
```python
from mlimputer.evaluation.evaluator import Evaluator
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression

# Define evaluation parameters
imputation_strategies = ["RandomForest", "ExtraTrees", "GBR", "KNN"]

# Choose models based on your task
if train_data["target_column"].dtype == "object":  # Classification
    models = [
        LogisticRegression(max_iter=1000),
        RandomForestClassifier(n_estimators=50)
    ]
else:  # Regression
    models = [
        LinearRegression(),
        RandomForestRegressor(n_estimators=50)
    ]

# Initialize evaluator
evaluator = Evaluator(
    imputation_models=imputation_strategies,
    train=train_data,
    target="target_column",
    n_splits=3,  # Cross-validation folds
    hparameters=params
)

# Run cross-validation evaluation
cv_results = evaluator.evaluate_imputation_models(models=models)

# Get best performing imputation strategy
best_imputer = evaluator.get_best_imputer()
print(f"Best imputation strategy: {best_imputer}")

# Evaluate on test set
test_results = evaluator.evaluate_test_set(
    test=test_data,
    imput_model=best_imputer,
    models=models
)
```
Custom Cross-Validation
For more control over the evaluation process:
```python
from mlimputer.evaluation.cross_validation import CrossValidator, CrossValidationConfig

# Configure custom cross-validation
custom_config = CrossValidationConfig(
    n_splits=5,
    shuffle=True,
    random_state=42,
    verbose=1
)

# Create validator
validator = CrossValidator(config=custom_config)

# Run validation
results = validator.validate(
    X=X_imputed,
    target='target',
    y=y,
    models=models,
    problem_type="regression"  # or "binary_classification", "multiclass_classification"
)

# Get leaderboard
leaderboard = validator.get_leaderboard()
print(leaderboard.head())
```
Working with Generated Data
MLimputer includes utilities for generating datasets with missing values for testing:
```python
from mlimputer.data.dataset_generator import ImputationDatasetGenerator

generator = ImputationDatasetGenerator(random_state=42)

# Regression dataset
X_reg, y_reg = generator.quick_regression(
    n_samples=2000,
    missing_rate=0.15
)

# Binary classification
X_bin, y_bin = generator.quick_binary(
    n_samples=2000,
    missing_rate=0.15
)

# Multiclass classification
X_multi, y_multi = generator.quick_multiclass(
    n_samples=2000,
    n_classes=4,
    missing_rate=0.15,
    n_categorical=3  # Include categorical features
)
```
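The generator above is MLimputer's own API; an equivalent dataset with missing-completely-at-random (MCAR) entries can be sketched with scikit-learn and pandas, which is useful for sanity-checking the achieved missing rate (names and the 15% rate below mirror the example, not a fixed API):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

rng = np.random.default_rng(42)

# Clean regression data, wrapped in a DataFrame
X, y = make_regression(n_samples=2000, n_features=5, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])

# Mask ~15% of the entries completely at random (MCAR)
mask = rng.random(X.shape) < 0.15
X = X.mask(mask)

print(round(X.isna().mean().mean(), 2))  # overall missing rate, close to 0.15
```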
Production Deployment
Saving and Loading Models
```python
from mlimputer.utils.serialization import ModelSerializer

# Save with metadata
ModelSerializer.save(
    obj=imputer,
    filepath="production_imputer.joblib",
    format="joblib",
    metadata={
        "model": "RandomForest",
        "train_shape": X_train.shape,
        "version": "1.0"
    }
)

# Load with metadata
loaded_imputer, metadata = ModelSerializer.load_with_metadata(
    filepath="production_imputer.joblib",
    format="joblib"
)

# Use loaded imputer on new data
new_data_imputed = loaded_imputer.transform(new_data)
```
Important Notes
- Index Reset Required: Always use DataSplitter or reset indices manually after splitting data
- Categorical Handling: The framework automatically detects and encodes categorical columns
- Missing Pattern Preservation: The imputer learns missing patterns from training data for consistent imputation
- Memory Efficient: Large datasets are processed in batches automatically
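`DataSplitter` handles the first note automatically; if you split with scikit-learn's `train_test_split` instead, the index reset can be done by hand, as in this sketch (the toy frame and column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame; column names are illustrative
data = pd.DataFrame({
    "feature": np.arange(10, dtype=float),
    "target": np.arange(10) % 2,
})

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=["target"]), data["target"],
    test_size=0.2, random_state=42,
)

# train_test_split keeps the original shuffled row labels; reset them
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

print(list(X_train.index)[:3])  # [0, 1, 2]
```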
Example Notebooks
1. Basic Usage Example
A complete walkthrough demonstrating fundamental imputation workflow:
- Dataset generation with controlled missing patterns
- Train/test splitting with automatic index handling
- Model configuration and fitting
- Imputation and evaluation
- Saving fitted models for production
2. Performance Evaluation Example
Comprehensive evaluation comparing multiple imputation strategies:
- Cross-validation setup for robust evaluation
- Comparison of multiple imputation algorithms
- Custom evaluation configurations
- Best model selection based on metrics
- Production deployment preparation
Interactive Notebooks
For a more interactive experience, explore the Jupyter notebooks, which offer step-by-step execution and guidance.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use MLimputer in your research, please cite:
```bibtex
@software{mlimputer,
  author = {Luis Fernando Santos},
  title  = {MLimputer: Missing Data Imputation Framework for Supervised Machine Learning},
  year   = {2023},
  url    = {https://github.com/TsLu1s/MLimputer}
}
```
Contact
Luis Santos - LinkedIn