Library for decoding bytes content into Unicode
Unicodec Package Documentation
This package provides functions for:
- decoding the bytes content of an HTML document into Unicode text
- detecting the encoding of the bytes content of an HTML document
- normalizing an encoding name to its canonical form, according to the WHATWG HTML standard
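To make the detection step listed above more concrete, here is a rough, standard-library-only sketch of one common strategy: prefer the charset from the Content-Type header, then look for a `<meta>` charset declaration near the start of the document, then fall back to a default. This is not unicodec's implementation. The helper names (`charset_from_header`, `charset_from_meta`, `sketch_detect_encoding`), the 1024-byte scan window, and the `utf-8` fallback are all assumptions made for illustration, and BOM sniffing is left out for brevity.

```python
import re
from email.message import Message

# Hypothetical helpers for illustration only; not part of unicodec.
# Matches charset=... inside a <meta> tag, with or without quotes.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([a-zA-Z0-9_:.\-]+)""",
    re.IGNORECASE,
)


def charset_from_header(content_type_header):
    """Return the charset parameter of a Content-Type header, or None."""
    if not content_type_header:
        return None
    msg = Message()
    msg["content-type"] = content_type_header
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


def charset_from_meta(rawdata, scan_bytes=1024):
    """Return the charset declared in a <meta> tag near the start, or None."""
    match = META_CHARSET_RE.search(rawdata[:scan_bytes])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


def sketch_detect_encoding(rawdata, content_type_header=None, default="utf-8"):
    """Header first, then <meta>, then the assumed default."""
    return (
        charset_from_header(content_type_header)
        or charset_from_meta(rawdata)
        or default
    )


html = b'<html><head><meta charset="KOI8-R"></head><body></body></html>'
print(sketch_detect_encoding(html))
print(sketch_detect_encoding(html, "text/html; charset=windows-1251"))
```

In this sketch the header wins when both sources declare a charset, which is a design choice of the example and may differ from what the package does.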
Feel free to give feedback in the Telegram groups @grablab and @grablab_ru.
Installation
pip install -U unicodec
Usage Example #1
Download a web document with urllib and decode its content into Unicode.
from urllib.request import urlopen
from unicodec import decode_content, detect_content_encoding
res = urlopen("http://lib.ru")
rawdata = res.read()
data = decode_content(rawdata, content_type_header=res.headers["content-type"])
print(data[:70])
print(detect_content_encoding(rawdata, res.headers["content-type"]))
Output:
<html><head><title>Lib.Ru: Библиотека Максима Мошкова</title></head><b
koi8-r
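The output above reports `koi8-r` as the detected encoding. To illustrate why the right encoding matters, the following offline sketch uses only Python's built-in codecs (no unicodec calls, no network). The sample bytes are produced locally by encoding the page title, which is an assumption standing in for what a KOI8-R server would send.

```python
# Offline illustration using Python's built-in codecs only.
title = "Библиотека Максима Мошкова"
raw = title.encode("koi8-r")  # stand-in for bytes received from the server

# Decoding with the correct encoding restores the original text.
decoded_ok = raw.decode("koi8-r")

# Decoding with a wrong single-byte encoding raises no error,
# but silently produces unreadable text (mojibake).
decoded_wrong = raw.decode("windows-1251")

# Decoding as UTF-8 fails outright, because these bytes are not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(decoded_ok == title)
print(decoded_wrong == title)
print(utf8_failed)
```

A wrong guess can therefore go unnoticed when the candidate is another single-byte encoding, which is why detection from the header and document content is useful before decoding.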
Usage Example #2
Download a web document with urllib3 and decode its content into Unicode.
from urllib3 import PoolManager
from unicodec import decode_content, detect_content_encoding
res = PoolManager().urlopen("GET", "http://lib.ru")
rawdata = res.data
data = decode_content(rawdata, content_type_header=res.headers["content-type"])
print(data[:70])
print(detect_content_encoding(rawdata, res.headers["content-type"]))
Output:
<html><head><title>Lib.Ru: Библиотека Максима Мошкова</title></head><b
koi8-r
Usage Example #3
Convert encoding names to their canonical form (according to the WHATWG HTML standard).
from unicodec.normalization import normalize_encoding_name
for name in ["iso8859-1", "utf8", "cp1251"]:
    print("{} -> {}".format(name, normalize_encoding_name(name)))
Output:
iso8859-1 -> windows-1252
utf8 -> utf-8
cp1251 -> windows-1251
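The mapping shown in the output can be pictured as a label lookup: trim surrounding ASCII whitespace, lowercase the label, then look it up in a table of known aliases. The sketch below is a minimal illustration of that idea and not the package's code. The `LABELS` table is a small hand-picked excerpt, and `sketch_normalize` is a hypothetical name. Returning `None` for unknown labels is also an assumption of this example.

```python
# Hypothetical, heavily shortened alias table for illustration.
# The real WHATWG label list is much longer.
LABELS = {
    "utf8": "utf-8",
    "utf-8": "utf-8",
    "iso8859-1": "windows-1252",
    "latin1": "windows-1252",
    "ascii": "windows-1252",
    "cp1251": "windows-1251",
    "windows-1251": "windows-1251",
    "koi8-r": "koi8-r",
    "koi8_r": "koi8-r",
}


def sketch_normalize(label):
    """Map an encoding label to a canonical name, or None if unknown."""
    # Labels are matched case-insensitively, ignoring surrounding whitespace.
    key = label.strip("\t\n\f\r ").lower()
    return LABELS.get(key)


for name in ["iso8859-1", " UTF8 ", "cp1251", "no-such-encoding"]:
    print("{} -> {}".format(name, sketch_normalize(name)))
```

Note that `iso8859-1` and `ascii` map to `windows-1252`, matching the first line of the output above: web standards treat these labels as aliases of the Windows code page.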
Download files
File details
Details for the file unicodec-0.2.0.tar.gz.
File metadata
- Download URL: unicodec-0.2.0.tar.gz
- Upload date:
- Size: 13.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a59748691835826f5f9915434543f3d96882a8eb70a882ba7f76445de0c96863 |
| MD5 | 913f51d68c1a7920e24c5b384c970f04 |
| BLAKE2b-256 | 3223508cd345d57c621da7a19925fdfb7dccabd379c727a4c493f353cd87c77a |
File details
Details for the file unicodec-0.2.0-py2.py3-none-any.whl.
File metadata
- Download URL: unicodec-0.2.0-py2.py3-none-any.whl
- Upload date:
- Size: 10.8 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 609dae02f66490a53a0ad0f84a340e2563424362c2d247c50e1e65dc5cef4655 |
| MD5 | 7701cde18144e75532bd349eefb9c2f0 |
| BLAKE2b-256 | 6e55a9d9d997205cde95f4f38f490011efd354f35a3a06b88e15c29ed597e1f6 |