names-matcher

Fuzzy biject people's names between two lists.

Let's define an identity as a series of names belonging to the same person. The algorithm is:

  1. Parse, normalize, and split names in each identity. The result is a set of strings for each identity.
  2. Define the similarity between identities as the Jaccard similarity between their sets of strings.
  3. Construct the distance matrix between identities in two specified lists.
  4. Solve the Linear Assignment Problem (LAP) on that matrix.
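The four steps above can be sketched in plain Python. This is a simplified illustration, not the package's code: `name_parts`, `jaccard`, and `match` are hypothetical helpers, the normalization only lowercases and splits on non-letters (no metaphones), and the assignment is found by brute force over permutations instead of a real LAP solver, so it is only usable on tiny inputs.

```python
import re
from itertools import permutations


def name_parts(identity):
    """Step 1 (simplified): lowercase every name and split it on non-letters."""
    parts = set()
    for name in identity:
        parts.update(p for p in re.split(r"[^a-z]+", name.lower()) if p)
    return parts


def jaccard(a, b):
    """Step 2: Jaccard similarity |a & b| / |a | b|, 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match(left, right):
    """Steps 3 and 4: build the matrix and pick the cheapest assignment.

    Brute-force stand-in for an LAP solver. Assumes len(left) <= len(right).
    Returns (mapping, confidence), both lists of length len(left).
    """
    left_sets = [name_parts(identity) for identity in left]
    right_sets = [name_parts(identity) for identity in right]
    sim = [[jaccard(l, r) for r in right_sets] for l in left_sets]
    # Distance is 1 - similarity; minimize the total distance of the assignment.
    best = min(
        permutations(range(len(right)), len(left)),
        key=lambda perm: sum(1.0 - sim[i][j] for i, j in enumerate(perm)),
    )
    return list(best), [sim[i][j] for i, j in enumerate(best)]
```

For example, `match([["Ann Lee"], ["Bob Ray", "bray"]], [["bray"], ["Lee, Ann"], ["Cy Fox"]])` maps the first identity to index 1 and the second to index 0. Because this sketch skips the phonetic normalization, its scores will generally differ from those of the real package.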

We use metaphones in the normalization step to reduce the influence of spelling variations and typos. We use lapjv to solve the LAP, so our solution scales to thousands of identities. For bigger problems, you should use MinHashes (e.g. http://ekzhu.com/datasketch/) over the identity sets produced by reap_identity(). Pull requests adding this are welcome.
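To illustrate the MinHash idea mentioned above, here is a minimal standard-library sketch. It is not part of names-matcher and does not use datasketch: `minhash_signature` and `estimate_jaccard` are hypothetical helpers, and the token sets are assumed to be plain sets of strings standing in for whatever reap_identity() produces.

```python
import hashlib


def minhash_signature(tokens, num_perm=64):
    """Per seed, keep the smallest 64-bit salted hash over all tokens.

    `tokens` must be a non-empty iterable of strings.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in tokens
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of equal signature positions, an estimate of the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Signatures have a fixed size regardless of how many strings an identity contains, and similar signatures can be bucketed (for example with locality-sensitive hashing) so that only candidate pairs are compared instead of the full distance matrix.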

Example:

>>> from names_matcher import NamesMatcher
>>> NamesMatcher()([["Vadim Markovtsev", "vmarkovtsev"], ["Long, Waren", "warenlg"]],
...                [["Warren"], ["VMarkovtsev"], ["Eiso Kant"]])
(array([1, 0], dtype=int32), array([0.75      , 0.57142857]))

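The first element of the returned tuple is the index mapping: an array of the same length as the first list, whose i-th value is the index of the matched identity in the second list. The second element holds the corresponding confidence values, from 0 to 1. In the example above, the first identity matches "VMarkovtsev" (index 1) with confidence 0.75, and the second matches "Warren" (index 0) with confidence about 0.57.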
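One way to consume that result is to pair the identities up and drop weak matches. The helper below is a hypothetical sketch, not part of the package; it works on plain lists as well as on the returned arrays, and the `threshold` parameter is an assumption added for illustration.

```python
def pair_matches(left, right, mapping, confidence, threshold=0.0):
    """Return (left name, right name, confidence) for matches at or above threshold.

    The first name of each identity is used as its label.
    """
    pairs = []
    for i, (j, score) in enumerate(zip(mapping, confidence)):
        if score >= threshold:
            pairs.append((left[i][0], right[j][0], float(score)))
    return pairs
```

With the lists and output from the example, `pair_matches(left, right, [1, 0], [0.75, 0.57142857], threshold=0.6)` keeps only the "Vadim Markovtsev" and "VMarkovtsev" pair.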

Installation

pip3 install names-matcher

Command line interface

Given one identity per line in two files, print the matches to standard output:

python3 -m names_matcher path/to/file/1 path/to/file/2

Each identity is several names merged with |, for example:

Vadim Markovtsev|vmarkovtsev|vadim
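The same file format can be read into the list-of-lists shape used in the Python example. This is a hypothetical helper, not the package's own parser; stripping whitespace and skipping blank lines are assumptions that the command line tool may not share.

```python
def parse_identities(lines):
    """Turn lines such as 'Vadim Markovtsev|vmarkovtsev|vadim' into lists of names."""
    identities = []
    for line in lines:
        names = [name.strip() for name in line.strip().split("|") if name.strip()]
        if names:  # skip blank lines
            identities.append(names)
    return identities
```

It accepts any iterable of strings, so an open file object can be passed directly.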

Contributing

Contributions are very welcome and desired! Please follow the code of conduct and read the contribution guidelines.

License

Apache-2.0, see LICENSE.
