charset-mnbvc

This project quickly detects the encoding of large numbers of text files, in support of data cleaning for the mnbvc corpus project.

Project description

This project quickly detects the encoding of large numbers of text files, in support of data cleaning for the mnbvc corpus project.

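Implementation

Read the first 100 characters of each file (the length is configurable).
Try to decode them with five encodings: utf_8, utf_16, gb18030, gb2312, big5.
Run a regular-expression match for Chinese strings and common Chinese characters against each decoded result; an encoding whose result matches is taken to fit the file.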
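
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the function name `guess_encoding`, the regular expression, the try-in-listed-order rule, and the use of `errors="ignore"` are all assumptions made for the example, and the sample is taken in bytes rather than characters for simplicity. Only the five encoding names and the `unknow` label come from this README.

```python
import re

# The five candidate encodings listed in this README, tried in the listed order (assumed).
CANDIDATE_ENCODINGS = ["utf_8", "utf_16", "gb18030", "gb2312", "big5"]

# Hypothetical pattern: a run of at least two CJK Unified Ideographs.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]{2,}")


def guess_encoding(data: bytes, sample_size: int = 100) -> str:
    """Return the first candidate whose decoded sample contains Chinese text."""
    sample = data[:sample_size]
    for encoding in CANDIDATE_ENCODINGS:
        try:
            # errors="ignore" drops undecodable bytes, such as a multi-byte
            # character cut in half at the sample boundary.
            text = sample.decode(encoding, errors="ignore")
        except UnicodeError:
            continue
        if CHINESE_RUN.search(text):
            return encoding
    # Same label as the "unknow" entries in this README's test output.
    return "unknow"


print(guess_encoding("中文编码检测".encode("utf_8")))
print(guess_encoding("中文编码检测".encode("gb18030")))
```

Trial decoding of this kind has known weak spots. For example, pairs of plain ASCII bytes can decode as CJK code points under utf_16, so a real detector would likely need extra checks beyond this sketch.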

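Usage

chinese_charset_detect.py -i inputDirectory, where inputDirectory is the directory to scan.
The dist directory contains an executable for macOS. No Windows build has been packaged yet; help compiling one would be welcome.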

Calling the module from Python

Get the encoding of every file in a folder

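from charset_mnbvc import api

# Directory to scan; replace with your own path.
folder_path = "tests"

file_count, results = api.from_dir(
    folder_path=folder_path,
)

for result in results:
    # result[0] is the file name (文件名), result[1] is the detected encoding (编码).
    print(f"文件名: {result[0]}, 编码: {result[1]}")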
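
When scanning a large directory it can help to summarize the results rather than read them line by line. The sketch below assumes only what the example above shows, namely that each result can be indexed as `result[0]` (file name) and `result[1]` (encoding). The `summarize` helper is hypothetical and not part of charset_mnbvc; the sample data is copied from the test output in this README.

```python
from collections import Counter

# Sample (file name, encoding) pairs copied from this README's test output.
results = [
    ("tests/.DS_Store", "unknow"),
    ("tests/fixtures/test4.txt", "gb18030"),
    ("tests/fixtures/1045.txt", "gb18030"),
    ("tests/fixtures/10.txt", "gb18030"),
    ("tests/fixtures/test2.txt", "unknow"),
    ("tests/fixtures/test3.txt", "unknow"),
    ("tests/fixtures/test.txt", "utf_8"),
    ("tests/fixtures/18.txt", "utf_8"),
]


def summarize(results):
    """Count files per encoding and collect the files that were not identified."""
    counts = Counter(result[1] for result in results)
    undetected = [result[0] for result in results if result[1] == "unknow"]
    return counts, undetected


counts, undetected = summarize(results)
for encoding, count in counts.most_common():
    print(f"{encoding}: {count}")
print(f"not identified: {undetected}")
```

The list of unidentified files is a natural starting point for manual inspection during data cleaning.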

Get the encoding of a single file

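from charset_mnbvc import api

file_path = "test.txt"
# get_cn_charset is assumed to live in the imported api module; the original snippet called it unqualified.
coding_name = api.get_cn_charset(file_path)
# Prints the file name (文件名) and the detected encoding (编码).
print(f"文件名: {file_path}, 编码: {coding_name}")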
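
Once an encoding is known, a typical cleaning step is to re-encode the content as UTF-8. The helper below is a hypothetical sketch that uses only the Python standard library; `to_utf8` is not part of charset_mnbvc, and it assumes the detector reports unidentified files with the `unknow` label seen in this README's test output.

```python
from typing import Optional


def to_utf8(data: bytes, encoding: str) -> Optional[bytes]:
    """Re-encode data as UTF-8, or return None when the encoding is unusable."""
    if encoding == "unknow":
        # The detector could not identify the file, so leave it alone.
        return None
    try:
        return data.decode(encoding).encode("utf-8")
    except (UnicodeDecodeError, LookupError):
        # The encoding did not fit the whole content, or Python does not know it.
        return None


converted = to_utf8("中文".encode("gb18030"), "gb18030")
print(converted == "中文".encode("utf-8"))
```

Decoding strictly here is deliberate: detection only looks at the start of a file, so a full strict decode doubles as a check that the guess holds for the rest of the content.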

Example using the executable:

./dist/chinese_charset_detect -i tests
or
python chinese_charset_detect.py -i tests

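Test output (文件名 = file name, 编码 = encoding, 总文件数 = total number of files, 总耗时长 = total elapsed time in seconds):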

文件名: tests/.DS_Store, 编码: unknow
文件名: tests/fixtures/test4.txt, 编码: gb18030
文件名: tests/fixtures/1045.txt, 编码: gb18030
文件名: tests/fixtures/10.txt, 编码: gb18030
文件名: tests/fixtures/test2.txt, 编码: unknow
文件名: tests/fixtures/test3.txt, 编码: unknow
文件名: tests/fixtures/test.txt, 编码: utf_8
文件名: tests/fixtures/18.txt, 编码: utf_8
总文件数: 8
总耗时长: 0.5920612812042236

Complete example script:

import getopt
import sys
import time

from charset_mnbvc import api

USAGE = "chinese_charset_detect.py -i <inputDirectory>  (inputDirectory is the directory to scan)"


def main(argv):
    ifolder_path = ""
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifolder_path="])
    except getopt.GetoptError:
        print(USAGE)
        sys.exit(2)
    for opt, arg in opts:
        if opt == "-h":
            print(USAGE)
            sys.exit()
        elif opt in ("-i", "--ifolder_path"):
            ifolder_path = arg

    # Time the whole scan, including printing the results.
    start = time.time()
    file_count, results = api.from_dir(
        folder_path=ifolder_path,
    )
    for result in results:
        # File name (文件名) and detected encoding (编码).
        print(f"文件名: {result[0]}, 编码: {result[1]}")
    print(f"总文件数: {file_count}")  # total number of files

    end = time.time()
    print(f"总耗时长: {end - start}")  # total elapsed time in seconds


if __name__ == "__main__":
    try:
        main(sys.argv[1:])
    except Exception:
        # Any unexpected failure falls back to the usage message.
        print(USAGE)
        sys.exit(2)

Release history

0.0.15: Feb 18, 2024
0.0.14: Jan 11, 2024
0.0.13: Jan 11, 2024
0.0.12: Sep 5, 2023
0.0.11: Aug 10, 2023
0.0.10: Aug 4, 2023
0.0.9: Jul 28, 2023
0.0.8: Jul 18, 2023
0.0.7: Jun 8, 2023
0.0.6: May 26, 2023
0.0.5: May 10, 2023
0.0.4: Mar 3, 2023
0.0.3: Mar 2, 2023
0.0.2: Feb 9, 2023 (this version)

Download files

Source Distribution

charset_mnbvc-0.0.2.tar.gz (4.7 kB)

Uploaded Feb 9, 2023 Source

Hashes for charset_mnbvc-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`2acb93f42751d2e09b8b5a16bddac46643e825029c06a0d85fbe9a0d00d3afe0`
MD5	`8779610e9b1ea9338fa0a2f50642424e`
BLAKE2b-256	`0e0436f48c367d8cfc3369eac6321f853a56d92c07747415a47cc70de063bb20`