textdata

Easily get clean data, direct from Python source

Project description

It’s very common to need to extract data from program source.

The problem is that Python likes to have its text indented, which means that literal data often picks up extra spaces and lines that you really don’t want. This drives many developers to drop in Python list data structures instead, but that’s tedious, more verbose, and often less legible.

textdata lets you specify clean, nicely indented data in your program and still get the data you want, without extra whitespace cluttering things up. It permits the whitespace needed to make the program source look and work right, yet keeps that whitespace out of the resulting data.

Python string methods give easy ways to clean this text up, but it’s no joy reinventing that particular text-cleanup wheel every time you need it, especially since many of the details are nitsy and drag the code down into low-level constructs rather than just “give me the text!” And because the details can be a little tricky and frustrating, it’s better to use well-tested code than to whip up an ad hoc routine each time.

This module helps clean up included text (or text lines) in a simple, reusable way that won’t muck up your programs with extra code, and won’t require constant wheel-reinvention.

Lines

from textdata import lines

data = lines("""
    There was an old woman who lived in a shoe.
    She had so many children, she didn't know what to do;
    She gave them some broth without any bread;
    Then whipped them all soundly and put them to bed.
""")

will result in:

['There was an old woman who lived in a shoe.',
 "She had so many children, she didn't know what to do;",
 'She gave them some broth without any bread;',
 'Then whipped them all soundly and put them to bed.']
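For comparison, here is roughly what the same cleanup looks like when written by hand. This is a minimal sketch that uses only the standard library (textwrap.dedent and string methods), not textdata itself, and it handles just this simple case:

```python
import textwrap

raw = """
    There was an old woman who lived in a shoe.
    She had so many children, she didn't know what to do;
    She gave them some broth without any bread;
    Then whipped them all soundly and put them to bed.
"""

# Remove the common indentation, drop the newlines that the
# triple-quote layout adds at both ends, then split into lines.
data = textwrap.dedent(raw).strip("\n").splitlines()
```

Even this short version has three separate steps to get right, which is the wheel-reinvention the module is meant to spare you.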

Text

textlines is an optional entry point with the same parameters as lines, but it joins the resulting lines into a single string:

data = textlines("""
    There was an old woman who lived in a shoe.
    She had so many children, she didn't know what to do;
    She gave them some broth without any bread;
    Then whipped them all soundly and put them to bed.
""")

Yields:

"There was an old woman who lived in a shoe.\nShe ... to bed."
# where the ... abbreviates exactly the characters you'd expect

API Options

Both lines and textlines provide routinely-needed cleanups:

remove starting and ending blank lines (which are usually due to Python source formatting)

remove blank lines internal to your text block

remove common indentation

strip leading/trailing spaces other than the common prefix (leading spaces removed by request, trailing by default)

join lines together with your choice of separator string

lines(text, noblanks=True, dedent=True, lstrip=False, rstrip=True, join=False)

Returns text as a series of cleaned-up lines.

text is the text to be processed.

noblanks => all blank lines are eliminated, not just starting and ending ones. (default True).

dedent => strip a common prefix (usually whitespace) from each line (default True).

lstrip => strip all left (leading) space from each line (default False). Note that lstrip and dedent are mutually exclusive ways of handling leading space.

rstrip => strip all right (trailing) space from each line (default True).

join => either False (do nothing), True (concatenate lines), or a string that will be used to join the resulting lines (default False)

textlines(text, noblanks=True, dedent=True, lstrip=False, rstrip=True, join='\n')

Does the same helpful cleanups as lines(), but returns result as a single string, with lines separated by newlines (by default) and without a trailing newline.
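To make the option semantics concrete, the sketch below mimics the documented parameters with the standard library. The name clean_lines is hypothetical and this is not textdata’s implementation. It assumes that dedent removes only whitespace indentation and that join=True means concatenation with an empty separator:

```python
import textwrap


def clean_lines(text, noblanks=True, dedent=True, lstrip=False,
                rstrip=True, join=False):
    """Illustrative stand-in for the documented options (not textdata's code)."""
    rows = text.splitlines()
    # Blank lines at the very start and end are always dropped.
    while rows and not rows[0].strip():
        rows.pop(0)
    while rows and not rows[-1].strip():
        rows.pop()
    if noblanks:
        # Also drop blank lines inside the block.
        rows = [r for r in rows if r.strip()]
    if lstrip:
        # lstrip wins over dedent: all leading space goes.
        rows = [r.lstrip() for r in rows]
    elif dedent:
        # Remove only the indentation shared by every non-blank line.
        rows = textwrap.dedent("\n".join(rows)).splitlines()
    if rstrip:
        rows = [r.rstrip() for r in rows]
    if join is False:
        return rows
    separator = "" if join is True else join
    return separator.join(rows)


sample = "\n    alpha\n\n      beta   \n    gamma\n"
as_list = clean_lines(sample)             # ['alpha', '  beta', 'gamma']
as_text = clean_lines(sample, join="\n")  # 'alpha\n  beta\ngamma'
```

Passing join="\n" gives the same shape of result that textlines is described as returning: one string, newline-separated, with no trailing newline.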

Words

Often the data you need to encode is almost, but not quite, a series of words: a list of names, a list of color names, values that are mostly single words but sometimes contain embedded spaces. textdata has you covered:

>>> words(' Billy Bobby "Mr. Smith" "Mrs. Jones"  ')
['Billy', 'Bobby', 'Mr. Smith', 'Mrs. Jones']

Embedded quotes (either single or double) can be used to construct “words” (or phrases) containing whitespace (including tabs and newlines).

words isn’t a full parser, so there are some extreme cases like arbitrarily nested quotations that it can’t handle. It isn’t confused, however, by embedded apostrophes and other common gotchas. For example:

>>> words("don't be blue")
["don't", "be", "blue"]

>>> words(""" "'this'" works '"great"' """)
["'this'", 'works', '"great"']

words is a good choice when you want a compact, friendly, whitespace-delimited data representation, but a few of your entries need more than just str.split().
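One way to picture the quoting behavior is a small regular-expression tokenizer. The sketch below is an assumption about how such splitting could work, not the library’s code; split_words is a hypothetical name. It treats a fully quoted run as one word and anything else as a plain whitespace-delimited token, which is why an apostrophe inside a bare word is left alone:

```python
import re

# Try a double-quoted phrase, then a single-quoted phrase,
# then fall back to a run of non-whitespace characters.
_WORD = re.compile(r'"([^"]*)"|\'([^\']*)\'|(\S+)')


def split_words(text):
    """Split text on whitespace, keeping quoted phrases together."""
    result = []
    for match in _WORD.finditer(text):
        double_quoted, single_quoted, bare = match.groups()
        if double_quoted is not None:
            result.append(double_quoted)
        elif single_quoted is not None:
            result.append(single_quoted)
        else:
            result.append(bare)
    return result


names = split_words(' Billy Bobby "Mr. Smith" "Mrs. Jones"  ')
```

Like the real function, this sketch is not a full parser: nested quotations and a quote character that opens in the middle of a phrase are outside what it handles.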

Comments

If you need to embed more than a few lines of immediate data in your program, you may want some comments to explain what’s going on. textdata routines by default strip out Python-like comments (from # to end of line). So:

exclude = words("""
    __pycache__ *.pyc *.pyo     # compilation artifacts
    .hg* .git*                  # repository artifacts
    .coverage                   # code tool artifacts
    .DS_Store                   # platform artifacts
""")

Yields:

['__pycache__', '*.pyc', '*.pyo', '.hg*', '.git*',
 '.coverage', '.DS_Store']

Which is the same as:

exclude = [
 '__pycache__', '*.pyc', '*.pyo',   # compilation artifacts
 '.hg*', '.git*',                   # repository artifacts
 '.coverage',                       # code tool artifacts
 '.DS_Store'                        # platform artifacts
]

That is the same list, but without all the extra punctuation. If you want to keep the comments, set cstrip=False (though that makes more sense for lines and textlines than for words).
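The effect of comment stripping can be approximated with plain string methods. This is a naive sketch, not textdata’s implementation: strip_comments is a hypothetical helper, and it would also cut at a # that appears inside a quoted value:

```python
def strip_comments(text):
    """Remove everything from '#' to the end of each line."""
    cleaned = []
    for line in text.splitlines():
        # Keep only the part before the first '#', if there is one.
        cleaned.append(line.split("#", 1)[0].rstrip())
    return "\n".join(cleaned)


source = """
    __pycache__ *.pyc *.pyo     # compilation artifacts
    .hg* .git*                  # repository artifacts
    .coverage                   # code tool artifacts
    .DS_Store                   # platform artifacts
"""

# With the comments gone, ordinary whitespace splitting is enough here.
exclude = strip_comments(source).split()
```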

Unicode and Encodings

textdata doesn’t have any unique friction with Unicode characters and encodings. That said, any time you use Unicode characters in Python source files, care is warranted, especially in Python 2!

If your text includes Unicode characters, in Python 2 make sure to mark the string with a “u” prefix: u"★". You can also do this in Python 3.3 and following. Sadly, Python 3.0 through 3.2 did not accept the prefix, making it much harder to maintain a unified source base with them in the mix. (A compatibility function such as six.u from six can help alleviate much, though certainly not all, of the pain.)

It can also be helpful to declare your source encoding: put a specially-formatted comment as the first or second line of the source code:

# -*- coding: <encoding name> -*-

This will usually be # -*- coding: utf-8 -*-, but other encodings are possible. Python 3 defaults to a UTF-8 encoding, but Python 2 assumes ASCII.
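Put together, the top of a source file that follows both recommendations might look like the sketch below. The variable names are only illustrative:

```python
# -*- coding: utf-8 -*-
# The declaration above only takes effect on the first or second line of a file.

star = u"★"        # the u prefix is accepted by Python 2 and by Python 3.3+
rating = star * 3  # a text string of three characters
```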

Notes

Version 1.2 adds comment stripping. Packaging and testing also tweaked.

Version 1.1.5 adds the bdist_wheel packaging format.

Version 1.1.3 switches from BSD to Apache License 2.0 and integrates tox testing with setup.py.

Version 1.1 added the words constructor.

Automated multi-version testing managed with the wonderful pytest, pytest-cov, and tox. Successfully packaged for, and tested against, all late-model versions of Python: 2.6, 2.7, 3.3, 3.4, as well as PyPy 2.5.1 (based on 2.7.9) and PyPy3 2.4.0 (based on 3.2.5). (Module should work on Python 3.2, but it was dropped from the testing matrix due to its age and its lack of a Unicode literal, which makes test specification much more difficult.)

Common line prefix is now computed without considering blank lines, so blank lines need not have any indentation on them just to “make things work.”

The tricky case where all lines have a common prefix, but it’s not entirely composed of whitespace, is now properly handled. This is useful for lines that are already “quoted” such as with leading "|" or ">" symbols (common in Markdown and old-school email usage styles).
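The two notes above can be illustrated with os.path.commonprefix from the standard library. This is a sketch of the idea only, with a hypothetical helper name, and not the code textdata uses:

```python
import os


def strip_common_prefix(rows):
    """Remove the prefix shared by all non-blank rows."""
    # Blank rows are ignored when computing the prefix, so they
    # do not need any indentation of their own.
    nonblank = [row for row in rows if row.strip()]
    prefix = os.path.commonprefix(nonblank)
    return [row[len(prefix):] if row.startswith(prefix) else row
            for row in rows]


quoted = ["> quoted line one", "", ">   indented", "> last"]
unquoted = strip_common_prefix(quoted)
```

One caveat of this naive approach: if every line happens to begin with the same letters, those letters count as a common prefix too, so a real implementation needs more care than this.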

textlines() is now somewhat superfluous, now that lines() has a join kwarg. But you may prefer it for the implicit indication that it’s turning lines into text.

It’s tempting to define a constant such as Dedent that might be the default for the lstrip parameter, instead of having separate dedent and lstrip Booleans. The more I use singleton classes in Python as designated special values, the more useful they seem.
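As a sketch of that idea only: the Dedent constant and the strip_leading function below are hypothetical and are not part of the textdata API described here. A singleton instance serves as a designated special value, so one parameter can express all three behaviors:

```python
import textwrap


class _DedentType(object):
    """Marker type; its single instance means 'remove common indentation'."""

    def __repr__(self):
        return "Dedent"


Dedent = _DedentType()


def strip_leading(rows, lstrip=Dedent):
    """Handle leading space according to one three-way parameter."""
    if lstrip is Dedent:
        # Special value: remove only the indentation all rows share.
        return textwrap.dedent("\n".join(rows)).splitlines()
    if lstrip:
        # True: remove all leading space from every row.
        return [row.lstrip() for row in rows]
    # False: leave leading space untouched.
    return list(rows)


rows = ["  a", "    b"]
default_result = strip_leading(rows)
```

Testing with "is" rather than "==" is what makes the singleton safe to use as a default value.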

Automated multi-version testing managed with pytest and tox. Continuous integration testing with Travis-CI. Packaging linting with pyroma.

Successfully packaged for, and tested against, all late-model versions of Python: 2.6, 2.7, 3.2, 3.3, 3.4, and 3.5 pre-release (3.5.0b3) as well as PyPy 2.6.0 (based on 2.7.9) and PyPy3 2.4.0 (based on 3.2.5).

The author, Jonathan Eunice (@jeunice on Twitter), welcomes your comments and suggestions.

Installation

To install or upgrade to the latest version:

pip install -U textdata

To easy_install under a specific Python version (3.3 in this example):

python3.3 -m easy_install --upgrade textdata

(You may need to prefix these with sudo to authorize installation. In environments without super-user privileges, you may want to use pip’s --user option, to install only for a single user, rather than system-wide.)

Release history

2.4.1

Jan 23, 2019

2.4.0

Dec 21, 2018

2.3.3

Sep 20, 2018

2.3.1

Sep 15, 2018

2.3.0

Sep 15, 2018

2.2.0

Jul 7, 2018

2.1.0

Jul 4, 2018

2.0.1

Jun 4, 2018

1.7.3

Oct 13, 2017

1.7.2

May 30, 2017

1.7.1

Jan 31, 2017

1.7.0

Jan 31, 2017

1.6.2

Jan 23, 2017

1.6.1

Sep 15, 2015

1.6.0

Sep 2, 2015

1.5.1

Sep 2, 2015

1.5.0

Sep 2, 2015

1.4.5

Aug 26, 2015

1.4.4

Aug 26, 2015

1.4.3

Aug 17, 2015

1.4.2

Aug 17, 2015

1.4.1

Aug 16, 2015

1.4.0

Aug 16, 2015

1.3.0

Aug 15, 2015

1.2.3

Aug 6, 2015

1.2.2

Aug 5, 2015

1.2.1

Aug 5, 2015

This version

1.2.0

Aug 5, 2015

1.1.5

Aug 4, 2015

1.1.3

Jul 30, 2015

1.1.2

Jul 28, 2015

1.1.1

Jul 28, 2015

1.1.0

Jul 28, 2015

1.0.8

Jul 23, 2015

1.0.7

Jul 21, 2015

1.0.6

Jul 21, 2015

1.0.5

Jul 21, 2015

1.0.4

Jul 21, 2015

1.0.3

Nov 28, 2014

1.0.2

Aug 16, 2014

1.0.1

Feb 26, 2014

1.0

Feb 26, 2014

Download files

Source Distributions

textdata-1.2.0.zip (20.1 kB)

Uploaded Aug 5, 2015 Source

textdata-1.2.0.tar.gz (10.6 kB)

Uploaded Aug 5, 2015 Source

Built Distribution

textdata-1.2.0-py2.py3-none-any.whl (12.3 kB)

Uploaded Aug 5, 2015 Python 2, Python 3

File details

Details for the file textdata-1.2.0.zip.

File metadata

Download URL: textdata-1.2.0.zip
Upload date: Aug 5, 2015
Size: 20.1 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for textdata-1.2.0.zip
Algorithm	Hash digest
SHA256	`7460a2f94a771a5a43e6ee846292bac267358ba464447308ebb0d97bd2452463`
MD5	`53928745238df72d39d065d80b5074d6`
BLAKE2b-256	`77548c661480069d005dee1da98599c843f0f24f0b98637f84b8e9f15eabb071`

File details

Details for the file textdata-1.2.0.tar.gz.

File metadata

Download URL: textdata-1.2.0.tar.gz
Upload date: Aug 5, 2015
Size: 10.6 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for textdata-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`c03df217aa759b6294ab53844f7405f0a52f059e2e952bf52217ab7c4d552837`
MD5	`9284a31aa454b844d97ca10b0eec36ae`
BLAKE2b-256	`aafdf1c7f001b14bcb586d43e4b62fcda08b708fb83f1c8f86adb65e2141171b`

File details

Details for the file textdata-1.2.0-py2.py3-none-any.whl.

File metadata

Download URL: textdata-1.2.0-py2.py3-none-any.whl
Upload date: Aug 5, 2015
Size: 12.3 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No

File hashes

Hashes for textdata-1.2.0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`1aee1ba15e8b15e3ddc0349b9656eb998b3a6f720267e10f31016fddc1916034`
MD5	`be8abbbf0faaefa733c705f93ca64449`
BLAKE2b-256	`2e77b4f4964ba3a489d31ab7be50d1286b814a1c69b6c50e29e586ff45be384f`
