names-matcher
Fuzzily biject people's names between two lists.
Let's define an identity as a series of names belonging to the same person. The algorithm is:
- Parse, normalize, and split the names in each identity. The result is a set of strings per identity.
- Define the similarity between identities as max(ratio, token_set_ratio), where ratio and token_set_ratio are inspired by the string comparison functions from FuzzyWuzzy.
- Construct the distance matrix between the identities in the two specified lists.
- Solve the Linear Assignment Problem (LAP) on that matrix.
Our LAP solver scales up to thousands of identities.
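The steps above can be sketched with the standard library alone. This is a rough illustration, not the package's implementation: the helper names (normalize, token_set_ratio, similarity, match) are hypothetical, difflib.SequenceMatcher stands in for the FuzzyWuzzy-inspired ratios, and the LAP is solved by exhaustive search, which only works for a handful of identities. A real solver, for example scipy.optimize.linear_sum_assignment, would take its place. The confidence values it produces need not agree with the library's.

```python
import itertools
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(identity):
    """Turn a list of raw names into a set of lowercase ASCII tokens."""
    tokens = set()
    for name in identity:
        # Strip accents, lowercase, then split on anything that is not a letter or digit.
        plain = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
        tokens.update(t for t in re.split(r"[^a-z0-9]+", plain.lower()) if t)
    return tokens


def ratio(a, b):
    """Similarity of two strings in [0, 1] (difflib stand-in, an assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def token_set_ratio(tokens_a, tokens_b):
    """Simplified token-set similarity: shared tokens compared against each side."""
    if not tokens_a or not tokens_b:
        return 0.0
    common = " ".join(sorted(tokens_a & tokens_b))
    full_a = (common + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    full_b = (common + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    return max(ratio(common, full_a), ratio(common, full_b), ratio(full_a, full_b))


def similarity(tokens_a, tokens_b):
    """max(ratio, token_set_ratio) over two normalized identities."""
    joined_a = " ".join(sorted(tokens_a))
    joined_b = " ".join(sorted(tokens_b))
    return max(ratio(joined_a, joined_b), token_set_ratio(tokens_a, tokens_b))


def match(first, second):
    """Assign each identity in `first` to a distinct identity in `second`.

    Brute-force LAP: assumes len(first) <= len(second) and tiny inputs.
    """
    sets_a = [normalize(identity) for identity in first]
    sets_b = [normalize(identity) for identity in second]
    # Distance matrix: rows are identities in `first`, columns in `second`.
    dist = [[1.0 - similarity(a, b) for b in sets_b] for a in sets_a]
    best = min(
        itertools.permutations(range(len(second)), len(first)),
        key=lambda cols: sum(dist[row][col] for row, col in enumerate(cols)),
    )
    confidence = [1.0 - dist[row][col] for row, col in enumerate(best)]
    return list(best), confidence


mapping, confidence = match(
    [["Vadim Markovtsev", "vmarkovtsev"], ["Long, Waren", "warenlg"]],
    [["Warren"], ["VMarkovtsev"], ["Eiso Kant"]],
)
print(mapping, confidence)
```

Exhaustive search grows factorially with the list size, which is why a dedicated LAP solver is needed to reach thousands of identities.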
Example:
>>> from names_matcher import NamesMatcher
>>> NamesMatcher()([["Vadim Markovtsev", "vmarkovtsev"], ["Long, Waren", "warenlg"]],
...                [["Warren"], ["VMarkovtsev"], ["Eiso Kant"]])
(array([1, 0], dtype=int32), array([0.75 , 0.57142857]))
The first element of the returned tuple is the index mapping: it has the same length as the first sequence, and each value is an index into the second sequence. The second element holds the corresponding confidence values, from 0 to 1.
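To make the result easier to read, the two arrays can be turned back into pairs of names. The sketch below hard-codes the values from the example output above as plain lists, so it runs without the package installed. The pair_up helper and its threshold parameter are hypothetical and not part of names-matcher.

```python
first = [["Vadim Markovtsev", "vmarkovtsev"], ["Long, Waren", "warenlg"]]
second = [["Warren"], ["VMarkovtsev"], ["Eiso Kant"]]

# Values copied from the example output above, as plain Python lists.
mapping = [1, 0]
confidence = [0.75, 0.57142857]


def pair_up(first, second, mapping, confidence, threshold=0.0):
    """Return (name in first, name in second, confidence) for each match.

    Each identity is represented by its first name; matches whose confidence
    is below `threshold` (a hypothetical cutoff) are dropped.
    """
    pairs = []
    for i, (j, conf) in enumerate(zip(mapping, confidence)):
        if conf >= threshold:
            pairs.append((first[i][0], second[j][0], conf))
    return pairs


pairs = pair_up(first, second, mapping, confidence)
for left, right, conf in pairs:
    print(f"{left} -> {right} ({conf:.2f})")
```

Here mapping[0] == 1 pairs the first identity of the first list with the second identity of the second list, and "Eiso Kant" stays unmatched because the first list is shorter.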
Installation
pip3 install names-matcher
Command line interface
Given one identity per line in two files, print the matches to standard output:
python3 -m names_matcher path/to/file/1 path/to/file/2
Each identity is several names joined with |, for example:
Vadim Markovtsev|vmarkovtsev|vadim
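The same file format can be read into the list-of-lists shape used in the Python example. This is a minimal sketch with a hypothetical read_identities helper. Trimming whitespace and skipping blank lines are assumptions here, not documented behavior of the command line interface.

```python
import io


def read_identities(stream):
    """Parse one identity per line, with names separated by '|'."""
    identities = []
    for line in stream:
        line = line.strip()
        if line:  # assumption: blank lines carry no identity
            identities.append([name.strip() for name in line.split("|")])
    return identities


# An in-memory stand-in for a file opened with open(path, encoding="utf-8").
sample = io.StringIO("Vadim Markovtsev|vmarkovtsev|vadim\nEiso Kant\n")
identities = read_identities(sample)
print(identities)
```

Two lists parsed this way could then be passed to NamesMatcher() as in the example earlier.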
Contributing
Contributions are very welcome and desired! Please follow the code of conduct and read the contribution guidelines.
License
Apache-2.0, see LICENSE.