New word discovery algorithm

pyUnit-NewWord

Unsupervised training of a word lexicon from raw text
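
The package description does not say how candidate words are scored. A common unsupervised technique for new word discovery ranks each candidate by internal cohesion (how much more often its parts co-occur than chance would predict) and by boundary freedom (the entropy of the characters that appear next to it). The sketch below illustrates that general technique only. Whether pyunit-newword works this way is an assumption, and every function name here is hypothetical.

```python
import math
from collections import Counter


def ngram_counts(text, max_len):
    """Count every substring of 1..max_len characters."""
    counts = Counter()
    for i in range(len(text)):
        for n in range(1, max_len + 1):
            if i + n <= len(text):
                counts[text[i:i + n]] += 1
    return counts


def cohesion(word, counts, total):
    """Minimum pointwise mutual information over all two-way splits of word.

    `total` is used as a rough normalizer for every n-gram length (an approximation).
    """
    p_word = counts[word] / total
    return min(
        math.log(p_word / ((counts[word[:k]] / total) * (counts[word[k:]] / total)))
        for k in range(1, len(word))
    )


def entropy(neighbor_counts):
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(neighbor_counts.values())
    return -sum(c / total * math.log(c / total) for c in neighbor_counts.values())


def boundary_freedom(word, text):
    """Smaller of the left and right neighbor-character entropies."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(word, start + 1)
    if not left or not right:
        return 0.0
    return min(entropy(left), entropy(right))


text = "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮"
counts = ngram_counts(text, max_len=2)
for candidate in ("葡萄", "萄皮"):
    print(candidate,
          round(cohesion(candidate, counts, len(text)), 3),
          round(boundary_freedom(candidate, text), 3))
```

In this toy text, "葡萄" is preceded and followed by several different characters, so its boundary freedom is positive, while "萄皮" is always preceded by "葡" and scores zero. A real word tends to score well on both measures; a fragment of a longer word fails at least one.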

Installation

pip install pyunit-newword

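Notes

The algorithm stores its data in hash dictionaries, which consumes a large amount of memory. About 100 MB of pure Chinese text needs more than 12 GB of RAM; with less, the run time becomes prohibitive.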
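
To give a feel for why dictionary-based counting is memory-hungry, the sketch below counts every substring of up to `max_len` characters in a Python dictionary. It is an illustration under the assumption that candidates are all short substrings of the text. `count_candidates` and `max_len` are hypothetical names, not part of pyunit-newword, and the library's actual storage layout may differ.

```python
import sys
from collections import Counter


def count_candidates(text, max_len=4):
    """Count every substring of 1..max_len characters (hypothetical helper)."""
    counts = Counter()
    for start in range(len(text)):
        for length in range(1, max_len + 1):
            if start + length > len(text):
                break
            counts[text[start:start + length]] += 1
    return counts


sample = "新词发现算法可以发现新词"
counts = count_candidates(sample)
# The number of distinct keys grows with both the text size and max_len.
print(len(counts), "distinct candidates from", len(sample), "characters")
# sys.getsizeof reports only the dictionary's own table, not the key strings it holds.
print(sys.getsizeof(counts), "bytes for the dictionary table alone")
```

Each position in the text contributes up to `max_len` keys, and every key is a separate string object, so the dictionary can end up many times larger than the input.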

Training code (the input text must be UTF-8 encoded)

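from pyunit_newword import NewWords

if __name__ == '__main__':
    # Filter thresholds as given in the original example
    nw = NewWords(filter_cond=10, filter_free=2)
    # Path to the UTF-8 training text (here, the crawled Weibo data)
    nw.add_text(r'C:\Users\Administrator\Desktop\微博数据.txt')
    nw.analysis_data()
    # Save the results, one entry per line
    with open('分析结果.txt', 'w', encoding='utf-8') as f:
        for word in nw.get_words():
            print(word)
            f.write(word[0] + '\n')  # write the first field of each result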
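
Because the training text must be UTF-8, a file saved in another encoding has to be converted first. Here is a minimal sketch using only the standard library. `to_utf8`, the temporary file names, and the assumed source encoding `gb18030` are illustrative choices, not something the package provides.

```python
import os
import tempfile


def to_utf8(src_path, dst_path, src_encoding="gb18030", chunk_chars=1 << 20):
    """Re-encode a text file to UTF-8, streaming in chunks to limit memory use."""
    with open(src_path, "r", encoding=src_encoding) as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)


# Demonstration on a small temporary file created on the fly.
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "weibo_gbk.txt")
dst_file = os.path.join(workdir, "weibo_utf8.txt")
with open(src_file, "w", encoding="gb18030") as f:
    f.write("微博数据示例")
to_utf8(src_file, dst_file)
with open(dst_file, encoding="utf-8") as f:
    converted = f.read()
print(converted)
```

The converted file's path can then be passed to `nw.add_text(...)` as in the training code.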

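Screenshot of part of the crawled Weibo data (about 100 MB of plain text)

Results after training on the Weibo data

5 words

Video of the words obtained after training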


Built Distribution

pyunit_newword-2018.2.28-py3-none-any.whl (4.3 kB)
