A pure Python library to determine Unicode text segmentations

The full documentation, including the package reference, is available at http://uniseg-python.readthedocs.org.

Features

This package provides:

  • Functions to get Unicode Character Database (UCD) properties related to text segmentation.

  • Functions to determine segmentation boundaries of Unicode strings.

  • Classes that help implement Unicode-aware text wrapping in both console (monospace) and graphical (monospace / proportional) font environments.

The supported segmentations are:

code point

A code point is “any value in the Unicode codespace.” It is the basic unit for processing Unicode strings.
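As a small illustration of the idea, the sketch below lists the code points of a string using only the Python standard library. It does not use this package's API, and it assumes Python 3.3 or later, where iterating over a str yields one code point at a time.

```python
# Stdlib-only illustration; this does not use uniseg.
text = "G\u0301"  # "G" followed by U+0301 COMBINING ACUTE ACCENT

# Each item of a Python 3 str is one code point.
code_points = ["U+{:04X}".format(ord(ch)) for ch in text]

print(code_points)  # two code points, although a reader sees one character
print(len(text))    # len() counts code points, not user-perceived characters
```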

grapheme cluster

A grapheme cluster approximately represents a “user-perceived character.” It may consist of one or more Unicode code points. For example, “G” + acute accent is a single user-perceived character.
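The following is a deliberately naive sketch of the concept, not the UAX #29 algorithm and not this package's implementation. The helper name naive_clusters is hypothetical. It only attaches combining marks to the preceding base character, so Hangul syllables, emoji sequences, CR LF pairs and the other rules of the standard are ignored.

```python
import unicodedata


def naive_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplification: only the canonical combining class is checked, so this
    covers far fewer cases than the grapheme cluster rules in UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # A combining mark extends the cluster started by the base character.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "G" + acute accent stays together; "a" forms its own cluster.
print(naive_clusters("G\u0301a"))
```

Even this rough version shows why counting code points and counting user-perceived characters give different answers for the same string.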

word break

Word boundaries are a familiar segmentation in many common text operations, for example as the unit for text highlighting or cursor movement. Note that in some languages, words cannot be determined by spaces or punctuation alone. Languages such as Thai or Japanese require dictionaries to determine appropriate word boundaries. The package provides only a simple word breaking implementation that is based on scripts and does not use any dictionaries, but it also provides ways to customize its default behaviour.
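To see why spaces and punctuation are not enough, consider the stdlib-only sketch below. It is a plain regular-expression tokenizer, not the UAX #29 word breaking rules and not this package's API; the name naive_words is hypothetical. It assumes Python 3, where \w matches Unicode letters by default.

```python
import re

# Runs of word characters, runs of whitespace, or a single other character.
TOKEN = re.compile(r"\w+|\s+|[^\w\s]")


def naive_words(text):
    """Split text into tokens using spaces and punctuation only."""
    return TOKEN.findall(text)


# Works acceptably for space-delimited text.
print(naive_words("Hello, world"))

# Japanese has no spaces between words, so the whole phrase
# comes back as a single token.
print(naive_words("日本語のテキスト"))
```

The second call returns one undivided token, which is the situation where script-based rules or a dictionary become necessary.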

sentence break

Sentence breaks are also common in text processing, but they are more contextual and less formal. The sentence breaking implementation in the package follows the specification in the UAX (Unicode Standard Annex), so it is simple and formal as well. It should still be useful in some cases.

line break

Implementing the line breaking algorithm is one of the key features of this package. Line breaking is important for general text presentation in both CLI and GUI applications.
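As a rough picture of what console wrapping involves, the sketch below measures text in terminal columns and wraps greedily at spaces. It uses only the standard library and is not the UAX #14 algorithm or this package's wrapping classes. The names display_width and naive_wrap are hypothetical, and treating every non-wide character (including combining marks) as one column is a simplifying assumption.

```python
import unicodedata


def display_width(text):
    """Estimate the column count of text on a monospace console.

    East Asian wide ("W") and fullwidth ("F") characters count as 2 columns,
    everything else as 1. Zero-width characters are not handled.
    """
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def naive_wrap(text, columns):
    """Greedy wrap that only breaks at whitespace."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and display_width(candidate) > columns:
            # The word does not fit, so close the current line first.
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


print(display_width("abc"), display_width("日本語"))
print(naive_wrap("aa bb cc", 5))
```

Because this version breaks only at spaces, a long run of text without spaces is never split, which is one of the cases a full line breaking algorithm is meant to handle.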

Requirements

  • Python 2.7 / 3.3 / 3.4

Download

Source / binary distributions (PyPI)

https://pypi.python.org/pypi/uniseg

All sources and build tools etc. (Bitbucket)

https://bitbucket.org/emptypage/uniseg-python

Install

Just type:

% pip install uniseg

or download the archive and:

% python setup.py install

Changes

0.6.4 (2015-02-10)
  • Add the uniseg-dbpath console command, which prints the path of ucd.sqlite3.

  • Include sample scripts under the package’s subdirectory.

0.6.3 (2015-01-25)
  • Python 3.4

  • Support modern setuptools, pip and wheel.

0.6.2 (2013-06-09)
  • Python 3.3

0.6.1 (2013-06-08)
  • Unicode 6.2.0

References

UAX #14: Unicode Line Breaking Algorithm (6.2.0)

http://www.unicode.org/reports/tr14/tr14-30.html

UAX #29: Unicode Text Segmentation (6.2.0)

http://www.unicode.org/reports/tr29/tr29-21.html

License

Copyright (c) 2013 Masaaki Shibata <mshibata@emptypage.jp>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
