Library for decoding bytes content into Unicode

Unicodec Package Documentation

This package provides functions for:

  • decoding the bytes content of an HTML document into Unicode text
  • detecting the encoding of the bytes content of an HTML document
  • normalizing an encoding name to its canonical form, according to the WHATWG HTML standard

Feel free to give feedback in Telegram groups: @grablab and @grablab_ru.

Installation

pip install -U unicodec

Usage Example #1

Download a web document with urllib and convert its content to Unicode.

from urllib.request import urlopen

from unicodec import decode_content, detect_content_encoding

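# Fetch the page and keep the undecoded response body.
res = urlopen("http://lib.ru")
rawdata = res.read()
# Decode the bytes, passing the Content-Type header along with them.
data = decode_content(rawdata, content_type_header=res.headers["content-type"])
print(data[:70])
# Report which encoding was detected for the same bytes and header.
print(detect_content_encoding(rawdata, res.headers["content-type"]))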

Output:

<html><head><title>Lib.Ru: Библиотека Максима Мошкова</title></head><b
koi8-r
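
For orientation, the snippet below shows what kind of information a Content-Type header carries. It is a minimal sketch that uses only the standard library; the helper name charset_from_content_type is hypothetical and is not part of unicodec, and nothing here is meant to describe how unicodec processes the header internally.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration only; not a unicodec function.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header does not declare one.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=KOI8-R"))
print(charset_from_content_type("text/html"))
```

When the header declares no charset, the encoding has to be worked out from the bytes themselves, which is the case detect_content_encoding is there for.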

Usage Example #2

Download a web document with urllib3 and convert its content to Unicode.

from urllib3 import PoolManager

from unicodec import decode_content, detect_content_encoding

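# Fetch the page; urllib3 exposes the undecoded body as res.data.
res = PoolManager().urlopen("GET", "http://lib.ru")
rawdata = res.data
# Decode the bytes, passing the Content-Type header along with them.
data = decode_content(rawdata, content_type_header=res.headers["content-type"])
print(data[:70])
# Report which encoding was detected for the same bytes and header.
print(detect_content_encoding(rawdata, res.headers["content-type"]))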

Output:

<html><head><title>Lib.Ru: Библиотека Максима Мошкова</title></head><b
koi8-r
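
To see why the detected encoding matters, the sketch below decodes the same KOI8-R bytes twice using only the standard library: once with the right codec and once with cp1251, another single-byte Cyrillic codec. It does not call unicodec; the sample word is taken from the page title above.

```python
# A word from the page title, encoded the way lib.ru serves it.
text = "Библиотека"
raw = text.encode("koi8-r")

# Decoding with the correct codec restores the original text.
right = raw.decode("koi8-r")

# Decoding with a different Cyrillic codec raises no error here,
# but it silently produces different (garbled) characters.
wrong = raw.decode("cp1251")

print(right == text)
print(wrong == text)
```

Because the wrong codec fails silently rather than raising an exception, the encoding has to be known before the bytes are decoded.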

Usage Example #3

Convert encoding names to their canonical form (according to the WHATWG HTML standard).

from unicodec.normalization import normalize_encoding_name

for name in ["iso8859-1", "utf8", "cp1251"]:
    print("{} -> {}".format(name, normalize_encoding_name(name)))

Output:

iso8859-1 -> windows-1252
utf8 -> utf-8
cp1251 -> windows-1251
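
For comparison, the sketch below runs the same three names through Python's own codec registry, using only the standard library. The variable python_names is introduced here for illustration. The result differs from the output above because Python's codec names are not the WHATWG canonical names.

```python
import codecs

# Look up each label in Python's codec registry and record the name
# Python itself reports for the codec.
python_names = {}
for name in ["iso8859-1", "utf8", "cp1251"]:
    python_names[name] = codecs.lookup(name).name
    print("{} -> {}".format(name, python_names[name]))
```

Output:

```
iso8859-1 -> iso8859-1
utf8 -> utf-8
cp1251 -> cp1251
```

Python keeps iso8859-1 and cp1251 under their own names, while the output above maps them to windows-1252 and windows-1251.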

Download files

Source Distribution

unicodec-0.2.0.tar.gz (13.5 kB view details)

Uploaded Source

Built Distribution

unicodec-0.2.0-py2.py3-none-any.whl (10.8 kB view details)

Uploaded Python 2, Python 3

File details

Details for the file unicodec-0.2.0.tar.gz.

File metadata

  • Download URL: unicodec-0.2.0.tar.gz
  • Upload date:
  • Size: 13.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for unicodec-0.2.0.tar.gz
Algorithm Hash digest
SHA256 a59748691835826f5f9915434543f3d96882a8eb70a882ba7f76445de0c96863
MD5 913f51d68c1a7920e24c5b384c970f04
BLAKE2b-256 3223508cd345d57c621da7a19925fdfb7dccabd379c727a4c493f353cd87c77a

File details

Details for the file unicodec-0.2.0-py2.py3-none-any.whl.

File metadata

  • Download URL: unicodec-0.2.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 10.8 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for unicodec-0.2.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 609dae02f66490a53a0ad0f84a340e2563424362c2d247c50e1e65dc5cef4655
MD5 7701cde18144e75532bd349eefb9c2f0
BLAKE2b-256 6e55a9d9d997205cde95f4f38f490011efd354f35a3a06b88e15c29ed597e1f6
