Skip to main content

Python package to sanitize in a standard way ML-related labels.

Project description

Sanitize ML Labels

PyPI Downloads License CI

Sanitize ML Labels is a Python package designed to standardize and sanitize ML-related labels. Currently supports over 100 labels, including metric and model names.

If you have ML-related labels, and you find yourself renaming and sanitizing them in a consistent manner, with the proper capitalizaton, this package ensures they are always sanitized in a standard way.

How do I install this package?

You can install it using pip:

pip install sanitize_ml_labels

Usage examples

Here are some common use cases for normalizing labels:

Example for metrics

from sanitize_ml_labels import sanitize_ml_labels

labels = [
    "acc",
    "loss",
    "auroc",
    "lr"
]

assert sanitize_ml_labels(labels) == [
    "Accuracy",
    "Loss",
    "AUROC",
    "Learning rate"
]

Example for models

from sanitize_ml_labels import sanitize_ml_labels

labels = [
    "mlp",
    "cnn",
    "ffNN",
    "Feed-forward neural network",
    "perceptron",
    "recurrent neural network",
    "LStM"
]

assert sanitize_ml_labels(labels) == [
    "MLP",
    "CNN",
    "FFNN",
    "FFNN",
    "Perceptron",
    "RNN",
    "LSTM"
]

assert sanitize_ml_labels("vanilla mlp") == "MLP"
assert sanitize_ml_labels("vanilla cnn") == "CNN"

assert sanitize_ml_labels([
    "Large Language Model",
    "transe",
    "Generative Pre-trained Transformer",
    "Graph Convolutional Neural Network",
    "Convolutional Graph Neural Network",
    "Graph Neural Network",
    "Graph Attention Network",
    "Graph Attention Neural Network",
]) == ["LLM","TransE","GPT","GCN","GCN","GNN","GAT","GAT"]

Sometimes, it happens that you have prefixed all your models with "vanilla" or "simple" or "basic". This package can help you remove these prefixes.

from sanitize_ml_labels import sanitize_ml_labels

labels = [
    "vanilla mlp",
    "vanilla cnn",
    "vanilla ffnn",
    "vanilla perceptron"
]

assert sanitize_ml_labels(labels) == ["MLP", "CNN", "FFNN", "Perceptron"]

Corner cases

Sometimes, you might encounter hyphenated terms that need to be correctly identified and normalized. We use a heuristic approach based on an extended list of over 45K hyphenated English words, originally from the Metadata consulting website.

The lookup heuristic, written by Tommaso Fontana, ensures efficient and accurate hyphenated word recognition.

from sanitize_ml_labels import sanitize_ml_labels

# Running the following
assert sanitize_ml_labels("non-existent-edges-in-graph") == "Non-existent edges in graph"

Extra utilities

In addition to label sanitization, the package provides methods to check metric normalization:

Is normalized metric

Validates if a metric falls within the range [0, 1].

from sanitize_ml_labels import is_normalized_metric

assert not is_normalized_metric("MSE")
assert is_normalized_metric("acc")
assert is_normalized_metric("accuracy")
assert is_normalized_metric("AUROC")
assert is_normalized_metric("auprc")

Is absolutely normalized metric

Validates if a metric falls within the range [-1, 1].

from sanitize_ml_labels import is_absolutely_normalized_metric

assert not is_absolutely_normalized_metric("auprc")
assert is_absolutely_normalized_metric("MCC")
assert is_absolutely_normalized_metric("Markedness")

Shoud be maximized

Whether a metric should be maximized or minimized. Unknown metrics will raise a NotImplementedError.

from sanitize_ml_labels import should_be_maximized

assert not should_be_maximized("MSE")
assert should_be_maximized("AUROC")
assert should_be_maximized("accuracy")

License

This software is licensed under the MIT license. See the LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sanitize_ml_labels-1.1.5.tar.gz (327.4 kB view details)

Uploaded Source

File details

Details for the file sanitize_ml_labels-1.1.5.tar.gz.

File metadata

  • Download URL: sanitize_ml_labels-1.1.5.tar.gz
  • Upload date:
  • Size: 327.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for sanitize_ml_labels-1.1.5.tar.gz
Algorithm Hash digest
SHA256 af777b269aac26270faf501224836a145a3a0bdb028b688c45d777645098f77c
MD5 d6578e03e9d6e85f2289c5c1c4e4ddb7
BLAKE2b-256 6b43dc0d941b7476ac62884ae21d1c02dee2c36bf357859e9a429de0871d7855

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page