EntiPy

EntiPy is a Python library that implements an incremental clustering approach to entity resolution.

Motivation

Entity resolution (ER, also known as identity resolution, data deduplication, data matching, record linkage, merge-purge, and more) is the field concerned with grouping data records that are determined to point to the same real-world thing. The broad concept of ER has uses in master data management, customer data integration, and fraud detection.

ER is a difficult data problem. It is only necessary when matching data records do not share a common identifier, and a naïve implementation that compares every record against every other record takes time quadratic in the number of records. Modern approaches to ER tend to be divided into three general phases:

  • First, data preprocessing. Records from one or more data sources are transformed into a shape that later stages can consume.
  • Second, blocking. If possible, any given record is tagged as possibly belonging to one or more subsets, or blocks, of records to avoid the need to compare each record against every other record. For example, ER on customers might be blocked by ZIP code.
  • Third, resolution. Records within each block are compared against one another and are clustered based on their similarity to other records. Each cluster aims to be as close as possible to a real-world entity.
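
To make the cost argument concrete, the sketch below counts pairwise comparisons with and without blocking. The records and the ZIP-code blocking key are made-up illustration data, and none of this code is part of EntiPy.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical customer records as (name, zip_code) tuples.
records = [
    ("Ann Lee", "10001"),
    ("Anne Lee", "10001"),
    ("Bob Cruz", "94105"),
    ("Rob Cruz", "94105"),
    ("Bobby Cruz", "94105"),
    ("Cy Tan", "60601"),
]

# Naive resolution compares every record with every other record,
# which is n * (n - 1) / 2 pairs.
naive_pairs = list(combinations(records, 2))

# Blocking groups records by a cheap key and compares only within a group.
blocks = defaultdict(list)
for record in records:
    blocks[record[1]].append(record)

blocked_pairs = [
    pair for block in blocks.values() for pair in combinations(block, 2)
]

print(len(naive_pairs), len(blocked_pairs))  # 15 4
```

The saving grows with the number of records, at the cost of missing matches whose blocking key is itself wrong (for example, a mistyped ZIP code).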

This library, EntiPy, implements resolution based on research done by Tauer et al. and Ilagan and Ilagan.

Prerequisites

  • Python 3.11 or higher.
    • We built and tested this version of EntiPy with Python 3.11.2.
  • sortedcontainers
    • Your Python environment must be able to install and use the sortedcontainers library, which is available on PyPI.

Installation

You can install EntiPy with pip. We recommend installing EntiPy in a virtual environment.

pip install entipy

Getting Started

EntiPy's primary focus is on implementing the resolution algorithm, which means that you will need to model your data upfront. We provide tools and documentation to help you with this data modeling.

In this tutorial, a data record will be a potentially misspelled product name that was read from an OCR scan. Clustering these records therefore means resolving them to the real, underlying products behind the observed product names.

Modeling your data with References and Fields

We first need to define a data record type. We will call the overall shape of a data record a Reference. A Reference will have one or more field properties.

from entipy import Reference, Field

class ProductNameReference(Reference):
    observed_name: Field

Fields represent the different properties of a data record. A field class inherits from the generic Field class. Your custom field class will need to implement one method, compare, which will return whether or not a field instance should be considered to match with another field instance. You can use the value property of the generic Field class as the basis for this comparison.

from entipy import Reference, Field
from rapidfuzz import fuzz  # third-party package: pip install rapidfuzz

class ProductNameReference(Reference):
    observed_name: Field

class ObservedNameField(Field):
    value: str
    def compare(self, other) -> bool:
        return fuzz.ratio(self.value, other.value) >= 70
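
If you want to experiment with a thresholded comparison before installing anything, the following sketch mimics the shape of a compare method using only the standard library. ThresholdedName is a hypothetical stand-in and not an EntiPy class, and difflib.SequenceMatcher is swapped in for rapidfuzz, so its scores (on a 0.0 to 1.0 scale) may differ from those of fuzz.ratio.

```python
from difflib import SequenceMatcher


class ThresholdedName:
    """Hypothetical stand-in for a field value; not an EntiPy class."""

    def __init__(self, value: str, threshold: float = 0.7):
        self.value = value
        self.threshold = threshold

    def compare(self, other: "ThresholdedName") -> bool:
        # SequenceMatcher.ratio() returns a similarity between 0.0 and 1.0.
        score = SequenceMatcher(None, self.value, other.value).ratio()
        return score >= self.threshold


cheese_a = ThresholdedName("PrimeHarvestCheese10Qg")
cheese_b = ThresholdedName("PrimeHarvLstCheese1F0g")
yogurt = ThresholdedName("PureGourCetYogurt2.4kg")

print(cheese_a.compare(cheese_b))  # True
print(cheese_a.compare(yogurt))    # False
```

The threshold plays the same role as the 70 in the example above: raising it makes the field stricter, and lowering it makes the field more tolerant of OCR noise.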

A field class also has two additional properties. The float true_match_probability represents the probability that two coreferential References will match on the field. The float false_match_probability represents the probability that two non-coreferential References will match on the field. The default value for true_match_probability is 0.9, and the default value for false_match_probability is 0.1. You will likely need to change these values for each field, which you can do as follows:

from entipy import Reference, Field
from rapidfuzz import fuzz

class ProductNameReference(Reference):
    observed_name: Field

class ObservedNameField(Field):
    value: str
    true_match_probability = 0.85
    false_match_probability = 0.15
    def compare(self, other) -> bool:
        return fuzz.ratio(self.value, other.value) >= 70
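
For intuition about what these two probabilities express, the sketch below converts them into agreement and disagreement weights in the style of classical probabilistic record linkage (the Fellegi-Sunter model). This is an assumption made for illustration only: this page does not describe how EntiPy combines the probabilities internally, and match_weights is a hypothetical helper.

```python
import math


def match_weights(true_match_probability: float, false_match_probability: float):
    """Return (agreement, disagreement) log2 weights for one field.

    Hypothetical helper for illustration; not part of EntiPy.
    """
    m, u = true_match_probability, false_match_probability
    # Evidence gained when two records agree on the field.
    agreement = math.log2(m / u)
    # Evidence lost (negative) when two records disagree on the field.
    disagreement = math.log2((1 - m) / (1 - u))
    return agreement, disagreement


agree, disagree = match_weights(0.85, 0.15)
print(round(agree, 2), round(disagree, 2))
```

The further apart the two probabilities are, the more a match or mismatch on that field says about whether two records are coreferential. When both are equal, the field carries no evidence at all.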

Once you have modeled your field class, you can replace the type annotation in your ProductNameReference class with your custom ObservedNameField class.

from entipy import Reference, Field
from rapidfuzz import fuzz

# The field class is defined first so that it exists when the Reference class below refers to it
class ObservedNameField(Field):
    value: str # This line is optional. It indicates that Field inheritors are expected to have a `value` property.
    true_match_probability = 0.85
    false_match_probability = 0.15
    def compare(self, other) -> bool:
        return fuzz.ratio(self.value, other.value) >= 70

class ProductNameReference(Reference):
    # Assign the Field subclass itself (not an instance of it) to a property name on your Reference model.
    observed_name = ObservedNameField

Once you have modeled your reference class, you can use it to create Reference objects. Instantiate a Reference with keyword arguments, where the value of each kwarg is the value you intend the respective Field to take.

r1 = ProductNameReference(observed_name="PrimeHarvestCheese10Qg")
r2 = ProductNameReference(observed_name="PureGourCetYogurt2.4kg")
r3 = ProductNameReference(observed_name="PrimeHarvLstCheese1F0g")

Your reference objects can then be used with the SerialResolver entity resolution engine, which we will discuss next.

Implementing entity resolution with the SerialResolver

The central building block of EntiPy is the SerialResolver. This class represents a stateful agent that clusters data records serially.

At a basic level, the SerialResolver accepts a collection of references (a list or a set) when first instantiated.

from entipy import Reference, Field, SerialResolver

...

resolver = SerialResolver([r1, r2, r3])

You can then call the .resolve() method of the SerialResolver to begin entity resolution. The SerialResolver will then process your Reference instances in place.

resolver.resolve()

When resolution is complete, you can retrieve the generated clusters with .retrieve_clusters(), which returns a dictionary whose keys are arbitrary cluster IDs and whose values are lists of your Reference instances.

clusters = resolver.retrieve_clusters()

A working demonstration

We can tie everything we have seen so far into a short working demonstration of resolving a small batch of references.

from entipy import Field, Reference, SerialResolver
from rapidfuzz import fuzz


class ObservedNameField(Field):
    true_match_probability = 0.85
    false_match_probability = 0.15
    def compare(self, other):
        return fuzz.ratio(self.value, other.value) >= 70


class ProductNameReference(Reference):
    observed_name = ObservedNameField


r1 = ProductNameReference(observed_name='PrimeHarvestCheese10Qg')
r2 = ProductNameReference(observed_name='PureGourCetYogurt2.4kg')
r3 = ProductNameReference(observed_name='PrimeHarvLstCheese1F0g')
r4 = ProductNameReference(observed_name='NutSaFusionBakingSoda200g')
r5 = ProductNameReference(observed_name='PrimeIarvestCh~ose100g')
r6 = ProductNameReference(observed_name='PureGotrmetYogurt2_4kg')

sr = SerialResolver([r1, r2, r3, r4, r5, r6])

sr.resolve()

clusters = sr.retrieve_clusters()

The clusters variable should look something like this:

{10: [{'observed_name': 'NutSaFusionBakingSoda200g'}],
 12: [{'observed_name': 'PrimeHarvestCheese10Qg'},
      {'observed_name': 'PrimeHarvLstCheese1F0g'},
      {'observed_name': 'PrimeIarvestCh~ose100g'}],
 14: [{'observed_name': 'PureGourCetYogurt2.4kg'},
      {'observed_name': 'PureGotrmetYogurt2_4kg'}]}
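
A common next step is to turn this output into a lookup from each observed name to its cluster ID. The sketch below works on a literal copy of the output shown above and assumes that each entry in the returned lists can be read like the dictionaries printed there. The names cluster_of and sizes are illustrative only.

```python
# Literal copy of the printed cluster output; the IDs are arbitrary.
clusters = {
    10: [{'observed_name': 'NutSaFusionBakingSoda200g'}],
    12: [{'observed_name': 'PrimeHarvestCheese10Qg'},
         {'observed_name': 'PrimeHarvLstCheese1F0g'},
         {'observed_name': 'PrimeIarvestCh~ose100g'}],
    14: [{'observed_name': 'PureGourCetYogurt2.4kg'},
         {'observed_name': 'PureGotrmetYogurt2_4kg'}],
}

# Reverse lookup: observed name -> cluster ID.
cluster_of = {
    ref['observed_name']: cluster_id
    for cluster_id, refs in clusters.items()
    for ref in refs
}

# Number of references resolved to each cluster.
sizes = {cluster_id: len(refs) for cluster_id, refs in clusters.items()}

print(cluster_of['PrimeHarvLstCheese1F0g'])  # 12
print(sizes)  # {10: 1, 12: 3, 14: 2}
```

Because the cluster IDs are arbitrary, compare clusters by their members rather than by their IDs.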

Incremental resolution

A key feature of EntiPy is its ability to incrementally resolve references that arrive after the initial batch. A SerialResolver supports adding either a single reference or a list of references through its .add() method.

r7 = ProductNameReference(observed_name='PureGourmetCookinMOil300mL')

sr.add(r7)

sr.resolve()

r8 = ProductNameReference(observed_name='DeliFresqeoyXauce1L')
r9 = ProductNameReference(observed_name='DeliFreshSoySakcE1.2L')

sr.add([r8, r9])

sr.resolve()

clusters = sr.retrieve_clusters()

The clusters variable should now look something like this:

{10: [{'observed_name': 'NutSaFusionBakingSoda200g'}],
 12: [{'observed_name': 'PrimeHarvestCheese10Qg'},
      {'observed_name': 'PrimeHarvLstCheese1F0g'},
      {'observed_name': 'PrimeIarvestCh~ose100g'}],
 14: [{'observed_name': 'PureGourCetYogurt2.4kg'},
      {'observed_name': 'PureGotrmetYogurt2_4kg'}],
 16: [{'observed_name': 'PureGourmetCookinMOil300mL'}],
 21: [{'observed_name': 'DeliFresqeoyXauce1L'},
      {'observed_name': 'DeliFreshSoySakcE1.2L'}]}
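
To illustrate the general idea of incremental clustering, here is a deliberately simple, standard-library-only toy. ToyIncrementalClusterer is hypothetical and is not EntiPy's resolution algorithm: it greedily attaches each new record to the first existing cluster that contains a similar member, with difflib.SequenceMatcher standing in for a Field's compare method.

```python
from difflib import SequenceMatcher


def is_match(a: str, b: str, threshold: float = 0.7) -> bool:
    """Thresholded string similarity; a stand-in for a Field's compare."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


class ToyIncrementalClusterer:
    """Hypothetical greedy clusterer; NOT EntiPy's resolution algorithm."""

    def __init__(self):
        self.clusters = {}   # cluster ID -> list of records
        self._pending = []   # records added but not yet resolved
        self._next_id = 0

    def add(self, records):
        # Accept either a single record or a list of records.
        if isinstance(records, str):
            records = [records]
        self._pending.extend(records)

    def resolve(self):
        # Only pending records are processed; existing clusters are kept.
        for record in self._pending:
            for members in self.clusters.values():
                if any(is_match(record, member) for member in members):
                    members.append(record)
                    break
            else:
                # No existing cluster matched, so start a new one.
                self.clusters[self._next_id] = [record]
                self._next_id += 1
        self._pending.clear()


toy = ToyIncrementalClusterer()
toy.add(["PrimeHarvestCheese10Qg", "PureGourCetYogurt2.4kg", "PrimeHarvLstCheese1F0g"])
toy.resolve()

toy.add("PureGotrmetYogurt2_4kg")  # arrives after the initial batch
toy.resolve()
print(toy.clusters)
```

The point of the sketch is that the second resolve() call only does work for the newly added record, which is what makes incremental resolution cheaper than re-clustering everything from scratch.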

Including Reference metadata

When instantiating a Reference, you can assign a dictionary to the metadata kwarg. This is useful for knowing which row a Reference was sourced from and for managing similar tracking data.

r1 = ProductNameReference(observed_name='PrimeHarvestCheese10Qg', metadata={'id': 1})
r2 = ProductNameReference(observed_name='PureGourCetYogurt2.4kg', metadata={'id': 2})
r3 = ProductNameReference(observed_name='PrimeHarvLstCheese1F0g', metadata={'id': 3})
r4 = ProductNameReference(observed_name='NutSaFusionBakingSoda200g', metadata={'id': 4})
r5 = ProductNameReference(observed_name='PrimeIarvestCh~ose100g', metadata={'id': 5})
r6 = ProductNameReference(observed_name='PureGotrmetYogurt2_4kg', metadata={'id': 6})

The metadata dictionary must be JSON-serializable. Data assigned to the metadata kwarg in this way will remain attached to the reference as it is processed by EntiPy's resolvers, but it will not be included in reference comparisons.

When retrieving clusters from a SerialResolver, you can toggle whether reference metadata should be included in the dictionary representation of your clusters with the include_reference_metadata keyword. This kwarg is False by default.

sr.retrieve_clusters(include_reference_metadata=True)

''' Returns
{10: [{'metadata': '{"id": 4}', 'observed_name': 'NutSaFusionBakingSoda200g'}],
 12: [{'metadata': '{"id": 1}', 'observed_name': 'PrimeHarvestCheese10Qg'},
      {'metadata': '{"id": 3}', 'observed_name': 'PrimeHarvLstCheese1F0g'},
      {'metadata': '{"id": 5}', 'observed_name': 'PrimeIarvestCh~ose100g'}],
 14: [{'metadata': '{"id": 2}', 'observed_name': 'PureGourCetYogurt2.4kg'},
      {'metadata': '{"id": 6}', 'observed_name': 'PureGotrmetYogurt2_4kg'}]}
'''
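
In the output above, each metadata value appears as a JSON string. The sketch below shows one way to check that a metadata dictionary is JSON-serializable before attaching it, and to decode the strings back into dictionaries afterwards. It works on a literal copy of the output and assumes entries can be read like the dictionaries printed there. The helper is_json_serializable is hypothetical and not part of EntiPy.

```python
import json


def is_json_serializable(obj) -> bool:
    """Return True if json.dumps accepts obj (hypothetical helper)."""
    try:
        json.dumps(obj)
    except (TypeError, ValueError):
        return False
    return True


print(is_json_serializable({'id': 1}))         # True
print(is_json_serializable({'seen': {1, 2}}))  # False: sets are not JSON types

# Literal copy of the printed output; metadata values are JSON strings.
clusters = {
    10: [{'metadata': '{"id": 4}', 'observed_name': 'NutSaFusionBakingSoda200g'}],
    12: [{'metadata': '{"id": 1}', 'observed_name': 'PrimeHarvestCheese10Qg'},
         {'metadata': '{"id": 3}', 'observed_name': 'PrimeHarvLstCheese1F0g'},
         {'metadata': '{"id": 5}', 'observed_name': 'PrimeIarvestCh~ose100g'}],
    14: [{'metadata': '{"id": 2}', 'observed_name': 'PureGourCetYogurt2.4kg'},
         {'metadata': '{"id": 6}', 'observed_name': 'PureGotrmetYogurt2_4kg'}],
}

# Map each source row ID to the cluster it was resolved into.
row_to_cluster = {
    json.loads(ref['metadata'])['id']: cluster_id
    for cluster_id, refs in clusters.items()
    for ref in refs
}
print(row_to_cluster)  # {4: 10, 1: 12, 3: 12, 5: 12, 2: 14, 6: 14}
```

A mapping like row_to_cluster is what you would typically write back to the source table as a resolved-entity column.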

Other demonstrations

Other demonstrations may be found in the demos/ folder of this repository. We recommend trying the product_name_resolution demo to understand what motivated the development of EntiPy.

Roadmap

EntiPy is currently in pre-alpha. Do not expect the API to remain stable.

The EntiPy project aims to implement the following features in future versions:

  • Blocking
  • Parallel resolution
  • Weak cluster dispersion

It is not the aim of EntiPy to implement similarity functions for fields.

License

By default, EntiPy is licensed under the GNU Affero General Public License version 3 (AGPLv3). If you would like to use EntiPy for a project that cannot abide by the terms of AGPLv3, please contact us to purchase a commercial license, payable to Archmob Pte. Ltd.

Contributions

EntiPy is not currently accepting contributions. This may change once the use cases of the project develop.

Contact

The author and maintainer of this library is Joe Ilagan. He can be reached at joe@archmob.com.
