
# tkitSimhash

Project description

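Remove duplicates (duplicate content filtering)

tkitSimhash zh

As a rule of thumb, two documents can be judged similar when the Hamming distance between their fingerprints is less than 3. The book 《数学之美》 (The Beauty of Mathematics) describes this algorithm in detail in its discussion of information fingerprints.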
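To make the rule of thumb concrete, here is a minimal sketch of a 64-bit SimHash fingerprint and the Hamming-distance check. It is not tkitSimhash's implementation: the names `simhash64`, `hamming_distance`, and `is_near_duplicate`, the word-level tokenization, the equal token weights, and the use of MD5 as the token hash are all assumptions made for illustration. `CODE_A` and `CODE_B` are the two 64-bit codes printed in this README's example output.

```python
import hashlib
import re

FINGERPRINT_BITS = 64  # assumed width; matches the 64-character codes in this README

# The two codes printed in the README's example output.
CODE_A = "1101100011001100101011000101011101011011011101110001111011011101"
CODE_B = "1101100110001100101011000101011100011111011101110001111011011101"


def simhash64(text):
    """Illustrative SimHash (hypothetical helper, not the tkitSimhash algorithm)."""
    # Assumption: lowercase word tokens, each with weight 1.
    tokens = re.findall(r"\w+", text.lower())
    votes = [0] * FINGERPRINT_BITS
    for token in tokens:
        # Use the first 8 bytes of the MD5 digest as a 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(FINGERPRINT_BITS):
            votes[bit] += 1 if (token_hash >> bit) & 1 else -1
    # A fingerprint bit is set when the tokens voted positive for it overall.
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, threshold=3):
    """Apply the 'distance less than 3' rule of thumb (threshold is adjustable)."""
    return hamming_distance(a, b) < threshold


# Distance between the two codes from the README's example output.
print(hamming_distance(int(CODE_A, 2), int(CODE_B, 2)))  # 4

# Identical texts always produce identical fingerprints.
sample = "co-op zombie games have not been picking up the slack"
print(hamming_distance(simhash64(sample), simhash64(sample)))  # 0
```

The two example codes differ in 4 bits, so the strict "less than 3" rule would not flag that pair even though the texts overlap heavily. The threshold is probably worth tuning for your own data.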

from tkitSimhash import simHash
sim=simHash()
text1 = """' , in Valve's absence, the modern slew of co-op zombie games have not been picking up the slack. The recent World War Z was lackluster at best, feeling like a cheap knockoff of a better game. The Vermintide series is much better in the gameplay department, but a fantasy battle against rat-men just isn't the same as fighting against hordes of undead. The Zombies modes in the Call of Duty games do a decent job of scratching the zombie itch, but what we're hoping for is a stand-alone zombie game, not DLC attached to a military shooter.  \nRelated: Screenshots From The New Resident Evil Have Leaked  \nBut now there's the hope that maybe, just maybe, Capcom can pull off a major multiplayer hit that will have players forgetting all about Valve and their long-suspected triskaphobia. Certainly, the Resident Evil name sure has the clout needed to get people to pay attention to the new series.  \n  \nCapcom has been experimenting with multiplayer in its Resident Evil games for years. This dates all the way back to Resident Evil ."""
text2 = """, in Valve's absence, the modern slew of co-op zombie games have not been picking up the slack. The recent World War Z was lackluster at best, feeling like a cheap knockoff of a better game. The Vermintide series is much better in the gameplay department, but a fantasy battle against rat-men just isn't the same as fighting against  of undead. The Zombies modes in the Call of Duty games do a decent job of scratching the zombie itch, but what we're hoping for is a stand-alone zombie game, not DLC attached to a military shooter.  \nRelated: Screenshots From The New Resident Evil Have Leaked  \nBut now there's the hope that maybe, just maybe, Capcom can pull off a major multiplayer hit that will have players forgetting all about Valve and their long-suspected triskaphobia. Certainly, its Resident Evil games for years. This dates all the way back to Resident Evil  """
a = sim.simhash(text1)
b = sim.simhash(text2)

# print(a)
# print(sim.subcode(a))
# print(b)
# print(sim.subcode(b))

# Split each code into subcodes. Similarity only needs to be computed
# when at least one subcode is identical in both documents.
print("Split into subcodes; similarity only needs to be computed when at least one subcode matches")
code_a = sim.autoencode([text1])[0]
print(code_a)
code_b = sim.autoencode([text2])[0]
print(code_b)

# Similarity score and Hamming distance between the two full codes.
print((sim.similarity(code_a['code'], code_b['code']), sim.getdistance(code_a['code'], code_b['code'])))

Output:

Split into subcodes; similarity only needs to be computed when at least one subcode matches
{'subcode': ['1101100011001100', '1010110001010111', '0101101101110111', '0001111011011101'], 'code': '1101100011001100101011000101011101011011011101110001111011011101'}
{'subcode': ['1101100110001100', '1010110001010111', '0001111101110111', '0001111011011101'], 'code': '1101100110001100101011000101011100011111011101110001111011011101'}
(0.999999910089919, 4)
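The subcode check shown in the output can be used as a cheap pre-filter: if two 64-bit codes differ in at most 3 bits and each is split into 4 blocks, at least one block must be identical (pigeonhole principle), so only documents sharing a block need a full distance computation. The sketch below illustrates this with a plain dictionary index. The helper names `split_subcodes`, `build_index`, and `candidates` are hypothetical, and matching subcodes by position is an assumption. How tkitSimhash does this internally is not documented here.

```python
from collections import defaultdict

# The two codes printed in the README's example output.
CODE_A = "1101100011001100101011000101011101011011011101110001111011011101"
CODE_B = "1101100110001100101011000101011100011111011101110001111011011101"


def split_subcodes(code, parts=4):
    """Split a bit string into equally sized blocks (hypothetical helper)."""
    size = len(code) // parts
    return [code[i * size:(i + 1) * size] for i in range(parts)]


def build_index(codes):
    """Map (block position, block bits) to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, code in codes.items():
        for position, sub in enumerate(split_subcodes(code)):
            index[(position, sub)].add(doc_id)
    return index


def candidates(index, code):
    """Return ids of indexed documents sharing at least one block with `code`."""
    found = set()
    for position, sub in enumerate(split_subcodes(code)):
        found |= index.get((position, sub), set())
    return found


index = build_index({"text1": CODE_A})
# text2 shares its second and fourth blocks with text1, so text1 is a candidate.
print(candidates(index, CODE_B))
```

Only the candidates returned this way would then be passed to the full similarity or distance computation.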

Update

0.0.1.6: Fixed dependencies (pytest==7.1.3 and nltk)

0.0.1.5: Fixed dependencies (pytest==7.1.3 and nltk)

0.0.1.4: Changed the word list to text


Download files

Source Distribution: tkitSimhash-0.0.1.9.tar.gz (5.9 kB)

Built Distribution: tkitSimhash-0.0.1.9-py2.py3-none-any.whl (6.3 kB, Python 2 and Python 3)
