Lexical analysis functions, tokenisers, transcribers: an arbitrary assortment of lexical and tokenisation functions useful for writing recursive descent parsers, of which I have several. There are also some transcription functions for producing text from various objects, such as `hexify` and `unctrl`.
Latest release 20250914:
- New single_space(s[,sep=]) function to split a string and rejoin, with a single space by default.
- Obsolete the @has_format_attributes decorator, now covered by FormatableMixin automatically.
- New FormatMapping(obj,base_format_mapping[,missing]) class presenting a mapping where some entries are callable, used by FormatableMixin.format_as.
- FStr: rename lc to lc_ to reflect the use of the lc_() function.
- format_as: new optional missing(mapping,key) parameter to supply values for keys not in format_mapping.
- FormatAsError: annotate the error with .key, .format_s, .format_mapping.
- Many small internal changes.
Generally the get_* functions accept a source string and an offset
(usually optional, default 0) and return a token and the new offset,
raising ValueError on failed tokenisation.
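The calling convention above can be illustrated with a minimal sketch. The names `scan_white` and `scan_decimal_value` are hypothetical stand-ins written for this example, not the module's own functions; they only mimic the documented `(token, new_offset)` shape and the `ValueError` on failure.

```python
import string

def scan_white(s, offset=0):
    """Hypothetical scanner: collect leading whitespace (possibly empty)."""
    end = offset
    while end < len(s) and s[end] in string.whitespace:
        end += 1
    return s[offset:end], end

def scan_decimal_value(s, offset=0):
    """Hypothetical scanner: parse decimal digits into an int."""
    end = offset
    while end < len(s) and s[end] in string.digits:
        end += 1
    if end == offset:
        # failed tokenisation is reported with ValueError
        raise ValueError(f"expected decimal digits at offset {offset}")
    return int(s[offset:end]), end

# Each call returns the new offset, which feeds the next call.
text = "  42 rest"
_, offset = scan_white(text)
value, offset = scan_decimal_value(text, offset)
```

Threading the offset through successive calls like this is what makes the style convenient for hand-written recursive descent parsers.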
Short summary:
- `as_lines`: Generator yielding complete lines from arbitrary pieces of text from the iterable of `str` `chunks`.
- `BaseToken`: A mixin for token dataclasses.
- `camelcase`: Convert a snake cased string `snakecased` into camel case.
- `common_prefix`: Return the common prefix of the strings `strs`.
- `common_suffix`: Return the common suffix of the strings `strs`.
- `CoreTokens`: A mixin for token dataclasses whose subclasses include `Identifier`, `NumericValue` and `QuotedString`.
- `cropped`: If the length of `s` exceeds `max_length` (default `32`), replace enough of the tail with `ellipsis` and the last `roffset` (default `1`) characters of `s` to fit in `max_length` characters.
- `cropped_repr`: Compute a cropped `repr()` of `obj`.
- `cutprefix`: Remove `prefix` from the front of `s` if present. Return the suffix if `s.startswith(prefix)`, else `s`. As with `str.startswith`, `prefix` may be a `str` or a `tuple` of `str`. If a tuple, the first matching prefix from the tuple will be removed.
- `cutsuffix`: Remove `suffix` from the end of `s` if present. Return the prefix if `s.endswith(suffix)`, else `s`. As with `str.endswith`, `suffix` may be a `str` or a `tuple` of `str`. If a tuple, the first matching suffix from the tuple will be removed.
- `FFloat`: Formattable `float`.
- `FInt`: Formattable `int`.
- `FNumericMixin`: A `FormatableMixin` subclass.
- `format_as`: Format the string `format_s` using `Formatter.vformat`, return the formatted result. This is a wrapper for `str.format_map` which raises a more informative `FormatAsError` exception on failure.
- `format_attribute`: A decorator to mark a method as available as a format method. This sets `method.is_format_attribute=True`.
- `format_escape`: Escape `{}` characters in a string to protect them from `str.format`.
- `format_recover`: Decorator for `__format__` methods which replaces failed formats with `{self:format_spec}`.
- `FormatableFormatter`: A `string.Formatter` subclass interacting with objects which inherit from `FormatableMixin`.
- `FormatableMixin`: A subclass of `FormatableFormatter` which provides 2 main features: a `__format__` method which parses the `format_spec` string into multiple colon separated terms whose results chain, and a `format_as` method which formats a format string using `str.format_map` with a suitable mapping derived from the instance via its `format_kwargs` method (whose default is to return the instance itself).
- `FormatAsError`: Subclass of `LookupError` for use by `format_as`.
- `FormatMapping`: A `Mapping` subclass based on an object and a mapping, intended for use by the `FormatableMixin.format_as` method. The mapping maps field names to values, where the values may be callables accepting an object. Fetching a value from the mapping will call `value(obj)` if the value is callable. Some additional field names are provided if not already present in the mapping: `self`, the object.
- `FStr`: A `str` subclass with the `FormatableMixin` methods, particularly its `__format__` method which uses `str` method names as valid formats.
- `get_chars`: Scan the string `s` for characters in `gochars` starting at `offset`. Return `(match,new_offset)`.
- `get_decimal`: Scan the string `s` for decimal characters starting at `offset` (default `0`). Return `(dec_string,new_offset)`.
- `get_decimal_or_float_value`: Fetch a decimal or basic float (nnn.nnn) value from the str `s` at `offset` (default `0`). Return `(value,new_offset)`.
- `get_decimal_value`: Scan the string `s` for a decimal value starting at `offset` (default `0`). Return `(value,new_offset)`.
- `get_delimited`: Collect text from the string `s` from position `offset` up to the first occurrence of delimiter `delim`; return the text excluding the delimiter and the offset after the delimiter.
- `get_dotted_identifier`: Scan the string `s` for a dotted identifier (by default an ASCII letter or underscore followed by letters, digits or underscores) with optional trailing dot and another dotted identifier, starting at `offset` (default `0`). Return `(match,new_offset)`.
- `get_envvar`: Parse a simple environment variable reference to $varname or $x where "x" is a special character.
- `get_hexadecimal`: Scan the string `s` for hexadecimal characters starting at `offset` (default `0`). Return `(hex_string,new_offset)`.
- `get_hexadecimal_value`: Scan the string `s` for a hexadecimal value starting at `offset` (default `0`). Return `(value,new_offset)`.
- `get_identifier`: Scan the string `s` for an identifier (by default an ASCII letter or underscore followed by letters, digits or underscores) starting at `offset` (default `0`). Return `(match,new_offset)`.
- `get_ini_clause_entryname`: Parse a `[clausename]entryname` string from `s` at `offset` (default `0`). Return `(clausename,entryname,new_offset)`.
- `get_ini_clausename`: Parse a `[clausename]` string from `s` at `offset` (default `0`). Return `(clausename,new_offset)`.
- `get_nonwhite`: Scan the string `s` for characters not in `string.whitespace` starting at `offset` (default `0`). Return `(match,new_offset)`.
- `get_other_chars`: Scan the string `s` for characters not in `stopchars` starting at `offset` (default `0`). Return `(match,new_offset)`.
- `get_prefix_n`: Strip a leading `prefix` and numeric value `n` from the string `s` starting at `offset` (default `0`). Return the matched prefix, the numeric value and the new offset. Returns `(None,None,offset)` on no match.
- `get_qstr`: Get quoted text with slosh escapes and optional environment substitution.
- `get_qstr_or_identifier`: Parse a double quoted string or an identifier.
- `get_sloshed_text`: Collect slosh escaped text from the string `s` from position `offset` (default `0`) and return the decoded unicode string and the offset of the completed parse.
- `get_suffix_part`: Strip a trailing "part N" suffix from the string `s`. Return the matched suffix and the part number. Return `(None,None)` on no match.
- `get_tokens`: Parse the string `s` from position `offset` using the supplied tokeniser functions `getters`. Return the list of tokens matched and the final offset.
- `get_uc_identifier`: Scan the string `s` for an identifier as for `get_identifier`, but require the letters to be uppercase.
- `get_white`: Scan the string `s` for characters in `string.whitespace` starting at `offset` (default `0`). Return `(match,new_offset)`.
- `has_format_attributes`: OBSOLETE version of has_format_attributes, suggestion: @has_format_attributes is no longer needed.
- `hexify`: A flavour of `binascii.hexlify` returning a `str`.
- `htmlify`: Convert a string for safe transcription in HTML.
- `htmlquote`: Quote a string for use in HTML.
- `Identifier`: A dotted identifier.
- `indent`: Return the `paragraph` indented by `line_indent` (default `" "`).
- `is_dotted_identifier`: Test if the string `s` is an identifier from position `offset` onward.
- `is_identifier`: Test if the string `s` is an identifier from position `offset` (default `0`) onward.
- `is_uc_identifier`: Test if the string `s` is an uppercase identifier from position `offset` (default `0`) onward.
- `isUC_`: Check that a string matches the regular expression `^[A-Z][A-Z_0-9]*$`.
- `jsquote`: Quote a string for use in JavaScript.
- `lc_`: Return `value.lower()` with `'-'` translated into `'_'` and `' '` translated into `'-'`.
- `match_tokens`: Wrapper for `get_tokens` which catches `ValueError` exceptions and returns `(None,offset)`.
- `NumericValue`: An `int` or `float` literal.
- `parseUC_sAttr`: Take an attribute name `attr` and return `(key,is_plural)`.
- `phpquote`: Quote a string for use in PHP code.
- `printt`: A wrapper for `tabulate()` to print the results. Each positional argument is a table row.
- `QuotedString`: A double quoted string.
- `r`: Like `typed_str` but using `repr` instead of `str`. This is available as both `typed_repr` and `r`.
- `s`: Return "`type(obj).__name__`:`str(obj)`" for some object `obj`. This is available as both `typed_str` and `s`.
- `single_space`: Return the string `s` stripped and with internal whitespace replaced by `sep` (default `" "`).
- `skipwhite`: Convenience routine for skipping past whitespace; returns the offset of the next nonwhitespace character.
- `slosh_mapper`: Return a string to replace backslash-`c`, or `None`.
- `slosh_quote`: Quote a string `raw_s` with quote character `q`.
- `snakecase`: Convert a camel cased string `camelcased` into snake case.
- `split_remote_path`: OBSOLETE version of split_remote_path, suggestion: cs.fs.RemotePath.from_str.
- `stripped_dedent`: Slightly smarter dedent which ignores a string's opening indent.
- `strlist`: Convert an iterable to strings and join with `sep` (default `', '`).
- `tabpadding`: Compute some spaces to use as tab padding at an offset.
- `tabulate`: A generator yielding lines of values from `rows` aligned in columns.
- `texthexify`: Transcribe the bytes `bs` to text using compact text runs for some common text values.
- `titleify_lc`: Translate `'-'` into `' '` and `'_'` into `'-'`, then titlecase.
- `typed_repr`: Like `typed_str` but using `repr` instead of `str`. This is available as both `typed_repr` and `r`.
- `typed_str`: Return "`type(obj).__name__`:`str(obj)`" for some object `obj`. This is available as both `typed_str` and `s`.
- `unctrl`: Return the string `s` with `TAB`s expanded and control characters replaced with printable representations.
- `untexthexify`: Decode a textual representation of binary data into binary data.
Module contents:
-
as_lines(chunks, partials=None): Generator yielding complete lines from arbitrary pieces of text from the iterable of `str` `chunks`.
After completion, any remaining newline-free chunks remain in the `partials` list; they will be unavailable to the caller unless the list is presupplied.
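The behaviour described for `as_lines` can be approximated as follows. This is a sketch written from the description alone, with the hypothetical name `lines_from_chunks`; it is not the module's code, and details such as how the real function handles the `partials` list internally are assumptions.

```python
def lines_from_chunks(chunks, partials=None):
    """Hypothetical sketch: yield complete lines assembled from str chunks."""
    if partials is None:
        partials = []
    for chunk in chunks:
        pos = 0
        while True:
            nl = chunk.find("\n", pos)
            if nl < 0:
                break
            # complete a line from the saved pieces plus this piece
            partials.append(chunk[pos:nl + 1])
            yield "".join(partials)
            # clear in place so a presupplied list stays in sync
            partials.clear()
            pos = nl + 1
        if pos < len(chunk):
            # keep the newline-free tail for the next chunk
            partials.append(chunk[pos:])

leftovers = []
lines = list(lines_from_chunks(["ab", "c\nde", "f\ngh"], leftovers))
```

Passing in `leftovers` is what lets the caller see the trailing unterminated text after the generator finishes.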
-
class BaseToken(cs.deco.Promotable): A mixin for token dataclasses.
Presently I use this in `cs.app.tagger.rules` and `cs.app.pilfer.parse`.
BaseToken.__post_init__(self):
An omitted offset,end_offset means the token is the whole source_text.
BaseToken.from_str(text: str) -> 'BaseToken':
Parse text as a token of type cls, return the token.
Raises SyntaxError on a parse failure.
This is a wrapper for the parse class method.
BaseToken.matched_text:
The text from self.source_text which matches this token.
BaseToken.parse(text: str, offset: int = 0, *, skip=False) -> Tuple[ForwardRef('BaseToken'), int]:
Parse a token from text at offset (default 0).
Return a BaseToken subclass instance.
Raise SyntaxError if no subclass parses it.
Raise EOFError if at the end of the text,
checked after any whitespace if skip is true.
The returned token's .end_offset is the next parse point.
This base class method attempts the .parse method of all
the public subclasses.
Parameters:
- `text`: the text being parsed
- `offset`: the offset within the `text` of the parse cursor
- `skip`: if true (default `False`), skip any leading whitespace before matching
BaseToken.scan(text: str, offset: int = 0, *, skip=True) -> Iterable[ForwardRef('BaseToken')]:
Scan text, parsing tokens using BaseToken.parse and yielding them.
Parameters are as for BaseToken.parse except as follows:
- encountering end of text ends the iteration instead of raising `EOFError`
- `skip` defaults to `True` to allow whitespace between tokens
BaseToken.token_classes():
Return the BaseToken subclasses to consider when parsing a token stream.
-
camelcase(snakecased: str, first_letter_only: bool = False) -> str: Convert a snake cased string `snakecased` into camel case.
Parameters:
- `snakecased`: the snake case string to convert
- `first_letter_only`: optional flag (default `False`); if true then just ensure that the first character of a word is uppercased, otherwise use `str.title`
Example:
    >>> camelcase('abc_def')
    'abcDef'
    >>> camelcase('ABc_def')
    'abcDef'
    >>> camelcase('abc_dEf')
    'abcDef'
    >>> camelcase('abc_dEf', first_letter_only=True)
    'abcDEf'
-
common_prefix(*strs): Return the common prefix of the strings `strs`.
Examples:
    >>> common_prefix('abc', 'def')
    ''
    >>> common_prefix('abc', 'abd')
    'ab'
    >>> common_prefix('abc', 'abcdef')
    'abc'
    >>> common_prefix('abc', 'abcdef', 'abz')
    'ab'
    >>> # contrast with cs.fileutils.common_path_prefix
    >>> common_prefix('abc/def', 'abc/def1', 'abc/def2')
    'abc/def'
-
common_suffix(*strs): Return the common suffix of the strings `strs`.
-
class CoreTokens(BaseToken): A mixin for token dataclasses whose subclasses include `Identifier`, `NumericValue` and `QuotedString`.
-
cropped(s: str, max_length: int = 32, roffset: int = 1, ellipsis: str = '...'): If the length of `s` exceeds `max_length` (default `32`), replace enough of the tail with `ellipsis` and the last `roffset` (default `1`) characters of `s` to fit in `max_length` characters.
-
cropped_repr(obj, roffset=1, max_length=32, inner_max_length=None): Compute a cropped `repr()` of `obj`.
Parameters:
- `obj`: the object to represent
- `max_length`: the maximum length of the representation, default `32`
- `inner_max_length`: the maximum length of the representations of members of `obj`, default `max_length//2`
- `roffset`: the number of trailing characters to preserve, default `1`
-
cutprefix(s, prefix): Remove `prefix` from the front of `s` if present. Return the suffix if `s.startswith(prefix)`, else `s`. As with `str.startswith`, `prefix` may be a `str` or a `tuple` of `str`. If a tuple, the first matching prefix from the tuple will be removed.
Example:
    >>> abc_def = 'abc.def'
    >>> cutprefix(abc_def, 'abc.')
    'def'
    >>> cutprefix(abc_def, 'zzz.')
    'abc.def'
    >>> cutprefix(abc_def, '.zzz') is abc_def
    True
    >>> cutprefix('this_that', ('this', 'thusly'))
    '_that'
    >>> cutprefix('thusly_that', ('this', 'thusly'))
    '_that'
-
cutsuffix(s, suffix): Remove `suffix` from the end of `s` if present. Return the prefix if `s.endswith(suffix)`, else `s`. As with `str.endswith`, `suffix` may be a `str` or a `tuple` of `str`. If a tuple, the first matching suffix from the tuple will be removed.
Example:
    >>> abc_def = 'abc.def'
    >>> cutsuffix(abc_def, '.def')
    'abc'
    >>> cutsuffix(abc_def, '.zzz')
    'abc.def'
    >>> cutsuffix(abc_def, '.zzz') is abc_def
    True
    >>> cutsuffix('this_that', ('that', 'tother'))
    'this_'
    >>> cutsuffix('this_tother', ('that', 'tother'))
    'this_'
-
class FFloat(FNumericMixin, builtins.float): Formattable `float`.
-
class FNumericMixin(FormatableMixin): A `FormatableMixin` subclass.
FNumericMixin.localtime(self):
Treat this as a UNIX timestamp and return a localtime datetime.
FNumericMixin.utctime(self):
Treat this as a UNIX timestamp and return a UTC datetime.
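As a rough illustration of the two timestamp helpers, here is a sketch using a hypothetical `TimestampFloat` class. Whether the real methods return naive or timezone-aware `datetime` objects is not stated above, so the choices below (naive for local time, aware for UTC) are assumptions.

```python
from datetime import datetime, timezone

class TimestampFloat(float):
    """Hypothetical float subclass treating its value as a UNIX timestamp."""

    def localtime(self):
        # assumption: a naive datetime in the local timezone
        return datetime.fromtimestamp(self)

    def utctime(self):
        # assumption: a timezone-aware datetime in UTC
        return datetime.fromtimestamp(self, tz=timezone.utc)

one_day = TimestampFloat(86400.0)
utc_dt = one_day.utctime()
local_dt = one_day.localtime()
```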
-
format_as(format_s: str, format_mapping, formatter=None, *, error_sep=None, missing: Optional[Callable[[Mapping, Any], Any]] = None, strict=False): Format the string `format_s` using `Formatter.vformat`, return the formatted result. This is a wrapper for `str.format_map` which raises a more informative `FormatAsError` exception on failure.
Parameters:
- `format_s`: the format string to use as the template
- `format_mapping`: the mapping of available replacement fields
- `formatter`: an optional `string.Formatter`-like instance with a `.vformat(format_string,args,kwargs)` method, usually a subclass of `string.Formatter`; if not specified then `FormatableFormatter` is used
- `error_sep`: optional separator for the multipart error message, default from `FormatAsError.DEFAULT_SEPARATOR`: `'; '`
- `missing`: an optional callable to turn a key missing from `format_mapping` into a value to interpolate
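The idea of wrapping `str.format_map` with a `missing` fallback and a more informative error can be sketched with the standard library alone. The names `FormatLookupError` and `format_with` are hypothetical; this is a simplified illustration under those assumptions, not the module's `format_as` or `FormatAsError`.

```python
class FormatLookupError(LookupError):
    """Hypothetical error carrying the context of a failed format."""

    def __init__(self, key, format_s, format_mapping):
        super().__init__(f"no value for {key!r} while formatting {format_s!r}")
        self.key = key
        self.format_s = format_s
        self.format_mapping = format_mapping

def format_with(format_s, format_mapping, missing=None):
    """Format format_s from format_mapping, consulting missing(mapping, key)."""

    class _Lookup(dict):
        # dict subclasses get __missing__ called for absent keys
        def __missing__(self, key):
            if missing is None:
                raise KeyError(key)
            return missing(format_mapping, key)

    try:
        return format_s.format_map(_Lookup(format_mapping))
    except KeyError as e:
        raise FormatLookupError(e.args[0], format_s, format_mapping) from e

filled = format_with("{a}-{b}", {"a": 1}, missing=lambda mapping, key: key.upper())
```

Attaching the key, format string and mapping to the exception mirrors the kind of annotation the release notes describe for `FormatAsError`.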
-
format_attribute(method): A decorator to mark a method as available as a format method. This sets `method.is_format_attribute=True`.
For example, the `FormatableMixin.json` method is defined like this:
    @format_attribute
    def json(self):
        return self.FORMAT_JSON_ENCODER.encode(self)
which allows a `FormatableMixin` subclass instance to be used in a format string like this:
    {instance:json}
to insert a JSON transcription of the instance.
It is recommended that methods marked with `@format_attribute` have no side effects and do not modify state, as they are intended for use in ad hoc format strings supplied by an end user.
-
format_escape(s): Escape `{}` characters in a string to protect them from `str.format`.
-
format_recover(*da, **dkw): Decorator for `__format__` methods which replaces failed formats with `{self:format_spec}`.
-
class FormatableFormatter(string.Formatter): A `string.Formatter` subclass interacting with objects which inherit from `FormatableMixin`.
FormatableFormatter.format_field(value, format_spec: str):
Format a value using format_field, returning an FStr
(a str subclass with additional format_spec features).
We actually recognise colon separated chains of formats
and apply each format to the previously converted value.
The final result is promoted to an FStr before return.
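The chaining of colon separated formats can be pictured with a small standalone sketch. The function `format_chain` is hypothetical and much simpler than the class described here: it assumes each term is either the name of a zero-argument `str` method or an ordinary format specification, and it does not produce an `FStr`.

```python
def format_chain(value, format_spec):
    """Hypothetical sketch: apply each colon separated term to the prior result."""
    for term in format_spec.split(":"):
        if not term:
            continue
        # assumption: identifier-like terms name zero-argument str methods
        method = getattr(str, term, None) if term.isidentifier() else None
        if callable(method):
            value = method(str(value))
        else:
            # otherwise treat the term as a normal format specification
            value = format(value, term)
    return value

hex_upper = format_chain(255, "x:upper")
tidy = format_chain("  Mixed Case ", "strip:lower")
```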
FormatableFormatter.get_arg_name(field_name):
The default initial arg_name is an identifier.
Returns (prefix,offset), and ('',0) if there is no arg_name.
FormatableFormatter.get_field(self, field_name, args, kwargs):
Get the object referenced by the field text field_name.
Raises KeyError for an unknown field_name.
FormatableFormatter.get_format_subspecs(format_spec):
Parse a format_spec as a sequence of colon separated components,
return a list of the components.
FormatableFormatter.get_subfield(value, subfield_text: str):
Resolve value against subfield_text,
the remaining field text after the term which resolved to value.
For example, a format {name.blah[0]}
has the field text name.blah[0].
A get_field implementation might initially
resolve name to some value,
leaving .blah[0] as the subfield_text.
This method supports taking that value
and resolving it against the remaining text .blah[0].
For generality, if subfield_text is the empty string
value is returned unchanged.
FormatableFormatter.get_value(self, arg_name, args, kwargs):
Get the object with index arg_name.
This default implementation returns (kwargs[arg_name],arg_name).
-
class FormatableMixin(FormatableFormatter): A subclass of `FormatableFormatter` which provides 2 main features:
- a `__format__` method which parses the `format_spec` string into multiple colon separated terms whose results chain
- a `format_as` method which formats a format string using `str.format_map` with a suitable mapping derived from the instance via its `format_kwargs` method (whose default is to return the instance itself)
The `format_as` method is like an inside out `str.format` or `object.__format__` method.
The `str.format` method is designed for formatting a string from a variety of other objects supplied in the keyword arguments.
The `object.__format__` method is for filling out a single `str.format` replacement field from a single object.
By contrast, `format_as` is designed to fill out an entire format string from the current object.
For example, the `cs.tagset.TagSet` class subclasses `FormatableMixin` to provide a `format_as` method whose replacement fields are derived from the tags in the tag set.
Subclasses wanting to provide additional `format_spec` terms should:
- override `FormatableFormatter.format_field1` to implement terms with no colons, letting `format_field` do the split into terms
- override `FormatableFormatter.get_format_subspecs` to implement the parse of `format_spec` into a sequence of terms. This might recognise a special additional syntax and quietly fall back to `super().get_format_subspecs` if that is not present.
FormatableMixin.__format__(self, format_spec):
Format self according to format_spec.
This implementation calls self.format_field.
As such, a format_spec is considered
a sequence of colon separated terms.
Classes wanting to implement additional format string syntaxes should either:
- override `FormatableFormatter.format_field1` to implement terms with no colons, letting `format_field1` do the split into terms
- override `FormatableFormatter.get_format_subspecs` to implement the term parse.
The default implementation of __format1__ just calls super().__format__.
Implementations providing specialised formats
should implement them in __format1__
with fallback to super().__format1__.
FormatableMixin.__init_subclass__(**kw):
Prefill the cls.format_attributes mapping from the
superclass and any format attributes of cls.
FormatableMixin.convert_field(self, value, conversion):
The default converter for fields calls Formatter.convert_field.
This is a tiny shim to transmute the '' conversion to None
which is what Formatter.convert_field expects.
FormatableMixin.convert_via_method_or_attr(self, format_spec) -> Tuple[Any, int]:
Apply a method or attribute name based conversion to self
where format_spec starts with a method or attribute name.
Return (converted,offset)
being the converted value and the offset after the method name.
Note that if there is not a leading identifier on format_spec
then this method returns (self,0).
The converted value is obtained from getattr(self,name);
if this raises an AttributeError a second attempt is made with
getattr(FStr(self),attr) if self is not already an FStr
(this provides the common utility methods on other types).
If the value is callable but does not have a true
.is_format_attribute a TypeError is raised, otherwise
the value is called to complete the conversion.
(The .is_format_attribute is usually set by decorating a
method with the @format_attribute decorator.)
The motivating example was a PurePosixPath,
which does not JSON transcribe;
this tweak supports both
posixpath:basename via the pathlib stuff
and posixpath:json via FStr
even though a PurePosixPath does not subclass FStr.
FormatableMixin.format_as(self, format_s: str, *, error_sep: Optional[str] = None, missing: Optional[Callable[[Mapping, Any], Any]] = None, strict=True, **format_kwargs_kw):
Return the string format_s formatted using the mapping
returned by self.format_kwargs(**format_kwargs_kw).
If a class using the mixin has no format_kwargs() method
to provide a mapping for str.format_map
then the instance itself is used as the mapping.
FormatableMixin.json(self):
The value transcribed as compact JSON.
-
class FormatAsError(builtins.LookupError): Subclass of `LookupError` for use by `format_as`.
-
class FormatMapping(collections.abc.Mapping): A `Mapping` subclass based on an object and a mapping, intended for use by the `FormatableMixin.format_as` method. The mapping maps field names to values, where the values may be callables accepting an object. Fetching a value from the mapping will call `value(obj)` if the value is callable. Some additional field names are provided if not already present in the mapping:
- `self`: the object
FormatMapping.__getitem__(self, field_name: str):
Fetch the value for field_name.
If the value is callable, call value(self.obj) to get the value.
FormatMapping.items(self):
Proxy .items via self.mapping.
FormatMapping.keys(self):
Proxy .keys via self.mapping.
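A mapping whose callable values are applied to an object on lookup can be sketched as below. `CallableValueMapping` is a hypothetical name and the class is a simplified stand-in for the behaviour described, not the module's `FormatMapping`; in particular it ignores the `missing` parameter mentioned in the release notes.

```python
from collections.abc import Mapping

class CallableValueMapping(Mapping):
    """Hypothetical sketch: values which are callable are called with obj."""

    def __init__(self, obj, mapping):
        self.obj = obj
        self.mapping = mapping

    def __getitem__(self, field_name):
        try:
            value = self.mapping[field_name]
        except KeyError:
            # extra field name supplied when absent from the mapping
            if field_name == "self":
                return self.obj
            raise
        return value(self.obj) if callable(value) else value

    def __iter__(self):
        return iter(self.mapping)

    def __len__(self):
        return len(self.mapping)

fields = CallableValueMapping("Hello", {"n": len, "greeting": "hi"})
message = "{greeting} {self} {n}".format_map(fields)
```

Deferring the call until lookup means only the fields a format string actually uses are computed.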
-
class FStr(FormatableMixin, builtins.str): A `str` subclass with the `FormatableMixin` methods, particularly its `__format__` method which uses `str` method names as valid formats.
It also has a bunch of utility methods which are available as `:`method in format strings.
FStr.basename(self):
Treat as a filesystem path and return the basename.
FStr.dirname(self):
Treat as a filesystem path and return the dirname.
FStr.f(self):
Parse self as a float.
FStr.i(self, base=10):
Parse self as an int.
FStr.lc_(self):
Lowercase using lc_().
FStr.path(self):
Convert to a native filesystem pathlib.Path.
FStr.posix_path(self):
Convert to a Posix filesystem pathlib.Path.
FStr.windows_path(self):
Convert to a Windows filesystem pathlib.Path.
-
get_chars(s: str, offset: int, gochars: str) -> Tuple[str, int]: Scan the string `s` for characters in `gochars` starting at `offset`. Return `(match,new_offset)`.
`gochars` may also be a callable, in which case a character `ch` is accepted if `gochars(ch)` is true.
-
get_decimal(s: str, offset: int = 0) -> Tuple[str, int]: Scan the string `s` for decimal characters starting at `offset` (default `0`). Return `(dec_string,new_offset)`.
-
get_decimal_or_float_value(s: int, offset: int = 0) -> Tuple[str, int]: Fetch a decimal or basic float (nnn.nnn) value from the str `s` at `offset` (default `0`). Return `(value,new_offset)`.
-
get_decimal_value(s: str, offset: int = 0) -> Tuple[str, int]: Scan the string `s` for a decimal value starting at `offset` (default `0`). Return `(value,new_offset)`.
-
get_delimited(s, offset, delim): Collect text from the string `s` from position `offset` up to the first occurrence of delimiter `delim`; return the text excluding the delimiter and the offset after the delimiter.
-
get_dotted_identifier(s: str, offset: int = 0, **kw) -> Tuple[str, int]: Scan the string `s` for a dotted identifier (by default an ASCII letter or underscore followed by letters, digits or underscores) with optional trailing dot and another dotted identifier, starting at `offset` (default `0`). Return `(match,new_offset)`.
Note: the empty string and an unchanged offset will be returned if there is no leading letter/underscore.
Keyword arguments are passed to `get_identifier` (used for each component of the dotted identifier).
-
get_envvar(s, offset=0, environ=None, default=None, specials=None): Parse a simple environment variable reference to $varname or $x where "x" is a special character.
Parameters:
- `s`: the string with the variable reference
- `offset`: the starting point for the reference
- `default`: default value for missing environment variables; if `None` (the default) a `ValueError` is raised
- `environ`: the environment mapping, default `os.environ`
- `specials`: the mapping of special single character variables
-
get_hexadecimal(s: str, offset: int = 0) -> Tuple[str, int]: Scan the string `s` for hexadecimal characters starting at `offset` (default `0`). Return `(hex_string,new_offset)`.
-
get_hexadecimal_value(s: str, offset: int = 0) -> Tuple[str, int]: Scan the string `s` for a hexadecimal value starting at `offset` (default `0`). Return `(value,new_offset)`.
-
get_identifier(s: str, offset: int = 0, *, alpha: str = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', number: str = '0123456789', extras: str = '_') -> Tuple[str, int]: Scan the string `s` for an identifier (by default an ASCII letter or underscore followed by letters, digits or underscores) starting at `offset` (default `0`). Return `(match,new_offset)`.
Note: the empty string and an unchanged offset will be returned if there is no leading letter/underscore.
Parameters:
- `s`: the string to scan
- `offset`: the starting offset, default `0`.
- `alpha`: the characters considered alphabetic, default `string.ascii_letters`.
- `number`: the characters considered numeric, default `string.digits`.
- `extras`: extra characters considered part of an identifier, default `'_'`.
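The effect of the `alpha`, `number` and `extras` parameters can be seen in a small sketch. `scan_identifier` is a hypothetical reimplementation written from the description above, so its handling of edge cases may differ from `get_identifier`.

```python
import string

def scan_identifier(s, offset=0, *, alpha=string.ascii_letters,
                    number=string.digits, extras="_"):
    """Hypothetical sketch: scan an identifier, returning (match, new_offset)."""
    # the first character must be alphabetic or one of the extras
    if offset >= len(s) or s[offset] not in alpha + extras:
        return "", offset
    allowed = alpha + number + extras
    end = offset + 1
    while end < len(s) and s[end] in allowed:
        end += 1
    return s[offset:end], end

plain = scan_identifier("foo_1.bar")
no_match = scan_identifier("9abc")
upper_only = scan_identifier("FOO_bar", alpha=string.ascii_uppercase)
with_dash = scan_identifier("a-b", extras="_-")
```

Note that a failed match is reported by the empty string and an unchanged offset rather than by an exception, as the note above describes.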
-
get_ini_clause_entryname(s: str, offset: int = 0) -> Tuple[str, str, int]: Parse a `[clausename]entryname` string from `s` at `offset` (default `0`). Return `(clausename,entryname,new_offset)`.
-
get_ini_clausename(s: str, offset: int = 0) -> Tuple[str, int]: Parse a `[clausename]` string from `s` at `offset` (default `0`). Return `(clausename,new_offset)`.
-
get_nonwhite(s: str, offset: int = 0) -> Tuple[str, int]: Scan the string `s` for characters not in `string.whitespace` starting at `offset` (default `0`). Return `(match,new_offset)`.
-
get_other_chars(s: str, offset: int = 0, stopchars: Optional[str] = None) -> Tuple[str, int]: Scan the string `s` for characters not in `stopchars` starting at `offset` (default `0`). Return `(match,new_offset)`.
-
get_prefix_n(s: str, prefix: str, n: Optional[int] = None, *, offset: int = 0) -> str: Strip a leading `prefix` and numeric value `n` from the string `s` starting at `offset` (default `0`). Return the matched prefix, the numeric value and the new offset. Returns `(None,None,offset)` on no match.
Parameters:
- `s`: the string to parse
- `prefix`: the prefix string which must appear at `offset`, or an object with a `match(str,offset)` method such as an `re.Pattern` regexp instance
- `n`: optional integer value; if omitted any value will be accepted, otherwise the numeric part must match `n`
If `prefix` is a `str`, the "matched prefix" return value is `prefix`. Otherwise the "matched prefix" return value is the result of the `prefix.match(s,offset)` call. The result must also support a `.end()` method returning the offset in `s` beyond the match, used to locate the following numeric portion.
Examples:
    >>> import re
    >>> get_prefix_n('s03e01--', 's')
    ('s', 3, 3)
    >>> get_prefix_n('s03e01--', 's', 3)
    ('s', 3, 3)
    >>> get_prefix_n('s03e01--', 's', 4)
    (None, None, 0)
    >>> get_prefix_n('s03e01--', re.compile('[es]',re.I))
    (<re.Match object; span=(0, 1), match='s'>, 3, 3)
    >>> get_prefix_n('s03e01--', re.compile('[es]',re.I), offset=3)
    (<re.Match object; span=(3, 4), match='e'>, 1, 6)
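The two kinds of `prefix` can be handled along the following lines. `prefix_n` is a hypothetical sketch reconstructed from the description and the examples above; it is not the module's implementation.

```python
import re

def prefix_n(s, prefix, n=None, *, offset=0):
    """Hypothetical sketch: match a prefix followed by a decimal number."""
    if isinstance(prefix, str):
        if not s.startswith(prefix, offset):
            return None, None, offset
        matched = prefix
        pos = offset + len(prefix)
    else:
        # anything with a .match(s, offset) method, such as a compiled regexp
        matched = prefix.match(s, offset)
        if matched is None:
            return None, None, offset
        pos = matched.end()
    end = pos
    while end < len(s) and s[end] in "0123456789":
        end += 1
    if end == pos:
        return None, None, offset
    value = int(s[pos:end])
    if n is not None and value != n:
        return None, None, offset
    return matched, value, end

season = prefix_n("s03e01--", "s")
episode = prefix_n("s03e01--", re.compile("[es]", re.I), offset=3)
```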
-
get_qstr(s, offset=0, q='"', environ=None, default=None, env_specials=None): Get quoted text with slosh escapes and optional environment substitution.
Parameters:
- `s`: the string containing the quoted text.
- `offset`: the starting point, default `0`.
- `q`: the quote character, default `'"'`. If `q` is `None`, do not expect the string to be delimited by quote marks.
- `environ`: if not `None`, also parse and expand `$`envvar references.
- `default`: passed to `get_envvar`
-
get_qstr_or_identifier(s, offset): Parse a double quoted string or an identifier.
-
get_sloshed_text(s, delim, offset=0, slosh='\\', mapper=slosh_mapper, specials=None): Collect slosh escaped text from the string `s` from position `offset` (default `0`) and return the decoded unicode string and the offset of the completed parse.
Parameters:
- `delim`: end of string delimiter, such as a single or double quote.
- `offset`: starting offset within `s`, default `0`.
- `slosh`: escape character, default a slosh (`'\\'`).
- `mapper`: a mapping function which accepts a single character and returns a replacement string or `None`; this is used to replace things such as '\t' or '\n'. The default is the `slosh_mapper` function, whose default mapping is `SLOSH_CHARMAP`.
- `specials`: a mapping of other special character sequences and parse functions for gathering them up. When one of the special character sequences is found in the string, the parse function is called to parse at that point. The parse functions accept `s` and the offset of the special character. They return the decoded string and the offset past the parse.
The escape character `slosh` introduces an encoding of some replacement text whose value depends on the following character. If the following character is:
- the escape character `slosh`, insert the escape character.
- the string delimiter `delim`, insert the delimiter.
- the character 'x', insert the character with code from the following 2 hexadecimal digits.
- the character 'u', insert the character with code from the following 4 hexadecimal digits.
- the character 'U', insert the character with code from the following 8 hexadecimal digits.
- a character from the keys of `mapper`
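The escape rules listed above can be sketched as a small decoder. `decode_sloshed` and its `SIMPLE_ESCAPES` table are hypothetical; the sketch assumes that the parse consumes the closing delimiter, that unknown escapes are left untouched, and that a missing delimiter raises `ValueError`, none of which is confirmed by the description. It also omits the `specials` mechanism.

```python
import string

HEX_LENGTHS = {"x": 2, "u": 4, "U": 8}
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "r": "\r"}  # assumed sample mapping

def decode_sloshed(s, delim, offset=0, slosh="\\"):
    """Hypothetical sketch: decode escapes up to delim, return (text, offset)."""
    out = []
    while offset < len(s):
        ch = s[offset]
        offset += 1
        if ch == delim:
            return "".join(out), offset
        if ch != slosh:
            out.append(ch)
            continue
        if offset >= len(s):
            raise ValueError("incomplete escape at end of string")
        esc = s[offset]
        offset += 1
        if esc == slosh or esc == delim:
            out.append(esc)
        elif esc in HEX_LENGTHS:
            ndigits = HEX_LENGTHS[esc]
            hexpart = s[offset:offset + ndigits]
            if len(hexpart) != ndigits or any(
                c not in string.hexdigits for c in hexpart
            ):
                raise ValueError(f"expected {ndigits} hexadecimal digits")
            out.append(chr(int(hexpart, 16)))
            offset += ndigits
        elif esc in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[esc])
        else:
            # assumption: pass unrecognised escapes through unchanged
            out.append(slosh + esc)
    raise ValueError("missing closing delimiter")

decoded, end = decode_sloshed(r'a\tb\x41\"c" rest', '"')
```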
-
get_suffix_part(s: str, *, keywords: Iterable[str] = ('part',), numeral_map: Optional[Mapping[str, int]] = None) -> Union[Tuple[str, int], Tuple[NoneType, NoneType]]: Strip a trailing "part N" suffix from the string `s`. Return the matched suffix and the part number. Return `(None,None)` on no match.
Parameters:
- `s`: the string
- `keywords`: an iterable of `str` to match, or a single `str`; default `'part'`
- `numeral_map`: an optional mapping of numeral names to numeric values; default `NUMERAL_NAMES['en']`, the English numerals
Example:
    >>> get_suffix_part('s09e10 - A New World: Part One')
    (': Part One', 1)
-
get_tokens(s, offset, getters): Parse the string `s` from position `offset` using the supplied tokeniser functions `getters`. Return the list of tokens matched and the final offset.
Parameters:
- `s`: the string to parse.
- `offset`: the starting position for the parse.
- `getters`: an iterable of tokeniser specifications.
Each tokeniser specification `getter` is either:
- a callable expecting `(s,offset)` and returning `(token,new_offset)`
- a literal string, to be matched exactly
- a `tuple` or `list` with values `(func,args,kwargs)`; call `func(s,offset,*args,**kwargs)`
- an object with a `.match` method such as a regex; call `getter.match(s,offset)` and return a match object with a `.end()` method returning the offset of the end of the match
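The four kinds of tokeniser specification suggest a dispatch loop like the one below. `run_getters` and the helper `take_while` are hypothetical names for this sketch, which is written from the list above and is not the module's `get_tokens`; in particular, raising `ValueError` for a failed literal or regexp match is an assumption based on the general `get_*` convention.

```python
import re

def take_while(s, offset, allowed):
    """Hypothetical helper: collect characters of s which are in allowed."""
    end = offset
    while end < len(s) and s[end] in allowed:
        end += 1
    if end == offset:
        raise ValueError(f"no character from {allowed!r} at offset {offset}")
    return s[offset:end], end

def run_getters(s, offset, getters):
    """Hypothetical sketch: apply each getter in turn, collecting tokens."""
    tokens = []
    for getter in getters:
        if isinstance(getter, str):
            # a literal string, matched exactly
            if not s.startswith(getter, offset):
                raise ValueError(f"expected {getter!r} at offset {offset}")
            token, offset = getter, offset + len(getter)
        elif isinstance(getter, (tuple, list)):
            func, args, kwargs = getter
            token, offset = func(s, offset, *args, **kwargs)
        elif callable(getter):
            token, offset = getter(s, offset)
        elif hasattr(getter, "match"):
            # the token is the match object itself
            token = getter.match(s, offset)
            if token is None:
                raise ValueError(f"no match at offset {offset}")
            offset = token.end()
        else:
            raise TypeError(f"unsupported getter {getter!r}")
        tokens.append(token)
    return tokens, offset

tokens, end = run_getters(
    "width=42;",
    0,
    [
        (take_while, ("abcdefghijklmnopqrstuvwxyz",), {}),
        "=",
        lambda s, offset: take_while(s, offset, "0123456789"),
        re.compile(r"\s*;"),
    ],
)
```

A `match_tokens` style wrapper would then only need to catch `ValueError` around this call and return `(None, offset)`.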
-
get_uc_identifier(s: str, offset: int = 0, number: str = '0123456789', extras: str = '_') -> Tuple[str, int]: Scan the string `s` for an identifier as for `get_identifier`, but require the letters to be uppercase.
-
get_white(s: str, offset: int = 0) -> Tuple[str, int]: Scan the string `s` for characters in `string.whitespace` starting at `offset` (default `0`). Return `(match,new_offset)`.
-
has_format_attributes(*da, **dkw): OBSOLETE version of has_format_attributes, suggestion: @has_format_attributes is no longer needed.
An obsolete class decorator formerly used to walk the class for `@formatmethod` methods. This is now done by `FormatableMixin.__init_subclass__`.
-
hexify(bs: bytes) -> str: A flavour of `binascii.hexlify` returning a `str`.
-
htmlify(s: str, nbsp: bool = False) -> str: Convert a string for safe transcription in HTML.
Parameters:
- `s`: the string
- `nbsp`: replaces spaces with `"&nbsp;"` to prevent word folding, default `False`.
Identifier.parse(text: str, offset: int = 0, *, skip=False) -> Tuple[str, cs.lex.CoreTokens, int]:
Parse a dotted identifier from `text`.
-
indent(paragraph: str, line_indent: str = ' ') -> str: Return the `paragraph` indented by `line_indent` (default `" "`).
-
is_dotted_identifier(s: str, offset: int = 0, **kw) -> bool: Test if the string `s` is an identifier from position `offset` onward.
-
is_identifier(s: str, offset: int = 0, **kw) -> bool: Test if the string `s` is an identifier from position `offset` (default `0`) onward.
-
is_uc_identifier(s: str, offset: int = 0, **kw) -> bool: Test if the string `s` is an uppercase identifier from position `offset` (default `0`) onward.
-
isUC_(s): Check that a string matches the regular expression `^[A-Z][A-Z_0-9]*$`.
-
jsquote(s: str) -> str: Quote a string for use in JavaScript.
-
lc_(value: str) -> str: Return `value.lower()` with `'-'` translated into `'_'` and `' '` translated into `'-'`.
I use this to construct lowercase filenames containing a readable transcription of a title string.
See also `titleify_lc()`, an imperfect reversal of this.
-
match_tokens(s, offset, getters): Wrapper for `get_tokens` which catches `ValueError` exceptions and returns `(None,offset)`.
NumericValue.parse(text: str, offset: int = 0, *, skip=False) -> 'NumericValue':
Parse a Python style int or float.
-
parseUC_sAttr(attr): Take an attribute name `attr` and return `(key,is_plural)`.
Examples:
- `'FOO'` returns `('FOO',False)`.
- `'FOOs'` or `'FOOes'` returns `('FOO',True)`.
Otherwise return `(None,False)`.
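The rule in these examples can be sketched with the regular expression quoted for `isUC_`. The function below is a hypothetical stand-in written for illustration (`parse_uc_s_attr` and `UC_RE` are not names from this module), and it assumes that the part before the plural suffix must itself be an uppercase name.

```python
import re

# The pattern documented for isUC_: an uppercase letter followed by
# uppercase letters, digits or underscores.
UC_RE = re.compile(r"^[A-Z][A-Z_0-9]*$")


def parse_uc_s_attr(attr):
    """Return (key, is_plural) for an attribute name, else (None, False)."""
    if UC_RE.match(attr):
        # A plain uppercase name is the singular form.
        return attr, False
    for suffix in ("es", "s"):
        if attr.endswith(suffix):
            key = attr[:-len(suffix)]
            if UC_RE.match(key):
                # An uppercase name plus a lowercase plural suffix.
                return key, True
    return None, False


singular = parse_uc_s_attr("FOO")
plural = parse_uc_s_attr("FOOes")
```

This kind of parse is handy in a `__getattr__` method which offers both `obj.FOO` (a single value) and `obj.FOOs` (all values) for the same underlying key.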
-
phpquote(s: str) -> str: Quote a string for use in PHP code.
-
printt(*table, file=None, flush=False, indent='', print_func=None, **tabulate_kw): A wrapper for `tabulate()` to print the results. Each positional argument is a table row.
Parameters:
- `file`: optional output file, passed to `print_func`
- `flush`: optional flush flag, passed to `print_func`
- `indent`: optional leading indent for the output lines
- `print_func`: optional `print()` function, default `builtins.print`
Other keyword arguments are passed to `tabulate()`.
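The relationship between `printt` and `tabulate` can be sketched as a line generator plus a thin printing wrapper. Both functions below are simplified stand-ins with hypothetical names: `tabulate_sketch` only pads cells into columns (no multiline cells, no pretty printing, and the trailing whitespace stripping is a choice made here), and `printt_sketch` just forwards the parameters described above.

```python
import builtins


def tabulate_sketch(*rows, sep="  "):
    """Yield one line per row with cells padded into aligned columns."""
    if not rows:
        return
    str_rows = [[str(cell) for cell in row] for row in rows]
    ncols = max(len(row) for row in str_rows)
    # Each column is as wide as its widest cell.
    widths = [
        max((len(row[i]) for row in str_rows if i < len(row)), default=0)
        for i in range(ncols)
    ]
    for row in str_rows:
        padded = (cell.ljust(width) for cell, width in zip(row, widths))
        yield sep.join(padded).rstrip()


def printt_sketch(*table, file=None, flush=False, indent="",
                  print_func=None, **tabulate_kw):
    """Print each tabulated line, prefixed by `indent`."""
    if print_func is None:
        print_func = builtins.print
    for line in tabulate_sketch(*table, **tabulate_kw):
        print_func(indent + line, file=file, flush=flush)


# Capture the output with a custom print function instead of printing.
captured = []
printt_sketch(
    ["a", "b"],
    ["ccc", "d"],
    indent="> ",
    print_func=lambda line, **kw: captured.append(line),
)
```

Keeping the formatting in a generator and the output in a wrapper means the same table lines can be printed, logged or collected without duplicating the alignment logic.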
QuotedString.parse(text: str, offset: int = 0, *, skip=False) -> 'QuotedString':
Parse a double quoted string from text.
-
r(obj: Any, max_length: Optional[int] = None, *, use_cls: bool = False) -> str: Like `typed_str` but using `repr` instead of `str`. This is available as both `typed_repr` and `r`.
-
s(obj: Any, use_cls: bool = False, use_repr: bool = False, max_length: int = 32) -> str: Return "type(obj).__name__:str(obj)" for some object `obj`. This is available as both `typed_str` and `s`.
Parameters:
- `use_cls`: default `False`; if true, use `str(type(obj))` instead of `type(obj).__name__`
- `use_repr`: default `False`; if true, use `repr(obj)` instead of `str(obj)`
I use this a lot when debugging. Example:
    from cs.lex import typed_str as s
    ......
    X("foo = %s", s(foo))
-
single_space(s: str, *, sep=' ') -> str: Return the string `s` stripped and with internal whitespace replaced by `sep` (default `" "`).
-
skipwhite(s: str, offset: int = 0) -> Tuple[str, int]: Convenience routine for skipping past whitespace; returns the offset of the next nonwhitespace character.
-
slosh_mapper(c, charmap=None): Return a string to replace backslash-`c`, or `None`.
-
slosh_quote(raw_s: str, q: str): Quote a string `raw_s` with quote character `q`.
-
snakecase(camelcased: str) -> str: Convert a camel cased string `camelcased` into snake case.
Parameters:
- `camelcased`: the camel case string to convert
- `first_letter_only`: optional flag (default `False`); if true then just ensure that the first character of a word is uppercased, otherwise use `str.title`
Example:
    >>> snakecase('abcDef')
    'abc_def'
    >>> snakecase('abcDEf')
    'abc_def'
    >>> snakecase('AbcDef')
    'abc_def'
-
split_remote_path(remotepath: str) -> Tuple[Optional[str], str]: OBSOLETE version of split_remote_path, suggestion: cs.fs.RemotePath.from_str
Split a path with an optional leading `[user@]rhost:` prefix into the prefix and the remaining path. `None` is returned for the prefix if there is none. This is useful for things like `rsync` targets etc.
OBSOLETE, use `cs.fs.RemotePath.from_str` instead.
-
stripped_dedent(s: str, post_indent: str = '', sub_indent: str = '') -> str: Slightly smarter dedent which ignores a string's opening indent.
Algorithm: strip the supplied string `s`, pull off the leading line, dedent the rest, put back the leading line.
This is a lot like the `inspect.cleandoc()` function.
This supports my preferred docstring layout, where the opening line of text is on the same line as the opening quote.
The optional `post_indent` parameter may be used to indent the dedented text before return.
The optional `sub_indent` parameter may be used to indent the second and following lines of the dedented text before return.
Examples:
    >>> def func(s):
    ...     """ Slightly smarter dedent which ignores a string's opening indent.
    ...         Strip the supplied string `s`. Pull off the leading line.
    ...         Dedent the rest. Put back the leading line.
    ...     """
    ...     pass
    ...
    >>> from cs.lex import stripped_dedent
    >>> print(stripped_dedent(func.__doc__))
    Slightly smarter dedent which ignores a string's opening indent.
    Strip the supplied string `s`. Pull off the leading line.
    Dedent the rest. Put back the leading line.
    >>> print(stripped_dedent(func.__doc__, sub_indent=' '))
    Slightly smarter dedent which ignores a string's opening indent.
     Strip the supplied string `s`. Pull off the leading line.
     Dedent the rest. Put back the leading line.
    >>> print(stripped_dedent(func.__doc__, post_indent=' '))
     Slightly smarter dedent which ignores a string's opening indent.
     Strip the supplied string `s`. Pull off the leading line.
     Dedent the rest. Put back the leading line.
    >>> print(stripped_dedent(func.__doc__, post_indent=' ', sub_indent='| '))
     Slightly smarter dedent which ignores a string's opening indent.
     | Strip the supplied string `s`. Pull off the leading line.
     | Dedent the rest. Put back the leading line.
-
strlist(ary: Iterable, sep: str = ', ') -> str: Convert an iterable to strings and join with `sep` (default `', '`).
-
tabpadding(padlen: int, tabsize: int = 8, offset: int = 0) -> str: Compute some spaces to use as tab padding at an offset.
-
tabulate(*rows, sep='  ', ppcls=None): A generator yielding lines of values from `rows` aligned in columns.
Each row in `rows` is a list of strings. Non-`str` objects are promoted to `str` via `pprint.pformat`. If the strings contain newlines they will be split into subrows.
Example:
    >>> for row in tabulate(
    ...     ['one col'],
    ...     ['three', 'column', 'row'],
    ...     ['row3', 'multi\nline\ntext', 'goes\nhere', 'and\nhere'],
    ...     ['two', 'cols'],
    ... ):
    ...     print(row)
    ...
    one col
    three    column  row
    row3     multi   goes  and
             line    here  here
             text
    two      cols
    >>>
-
texthexify(bs: bytes, shiftin: str = '[', shiftout: str = ']', whitelist: Optional[str] = None) -> str: Transcribe the bytes `bs` to text using compact text runs for some common text values.
This can be reversed with the `untexthexify` function.
This is an ad hoc format devised to be compact but also to expose "text" embedded within to the eye. The original use case was transcribing a binary directory entry format, where the filename parts would be somewhat visible in the transcription.
The output is a string of hexadecimal digits for the encoded bytes except for runs of values from the whitelist, which are enclosed in the shiftin and shiftout markers and transcribed as is. The default whitelist is values of the ASCII letters, the decimal digits and the punctuation characters '_-+.,'. The default shiftin and shiftout markers are '[' and ']'.
Strings produced by either `hexify` or `texthexify` may be freely concatenated and decoded with `untexthexify`.
Example:
    >>> texthexify(b'&^%&^%abcdefghi)(*)(*')
    '265e25265e25[abcdefghi]29282a29282a'
Parameters:
- `bs`: the bytes to transcribe
- `shiftin`: Optional. The marker string used to indicate a shift to direct textual transcription of the bytes, default `'['`.
- `shiftout`: Optional. The marker string used to indicate a shift from text mode back into hexadecimal transcription, default `']'`.
- `whitelist`: an optional bytes or string object indicating byte values which may be represented directly in text; the default value is the ASCII letters, the decimal digits and the punctuation characters `'_-+.,'`.
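The encoding described above amounts to splitting the bytes into alternating runs of whitelisted and other values. Here is a minimal sketch of that idea under stated assumptions: `texthexify_sketch` is a hypothetical name, and it brackets every whitelisted run regardless of length, which the real function may not do.

```python
import string
from binascii import hexlify

# The documented default whitelist: ASCII letters, digits and '_-+.,'.
DEFAULT_WHITELIST = (string.ascii_letters + string.digits + "_-+.,").encode("ascii")


def texthexify_sketch(bs, shiftin="[", shiftout="]", whitelist=DEFAULT_WHITELIST):
    """Transcribe `bs` as hex, with whitelisted runs shown as bracketed text."""
    if isinstance(whitelist, str):
        whitelist = whitelist.encode("ascii")
    chunks = []
    offset = 0
    while offset < len(bs):
        start = offset
        in_text = bs[offset] in whitelist
        # Extend the run while the bytes stay on the same side of the whitelist.
        while offset < len(bs) and (bs[offset] in whitelist) == in_text:
            offset += 1
        run = bs[start:offset]
        if in_text:
            chunks.append(shiftin + run.decode("ascii") + shiftout)
        else:
            chunks.append(hexlify(run).decode("ascii"))
    return "".join(chunks)


encoded = texthexify_sketch(b"&^%&^%abcdefghi)(*)(*")
```

Because the default markers are not themselves in the whitelist, a `[` in the output always starts a text run, which is what makes the format reversible.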
-
titleify_lc(value_lc: str) -> str: Translate `'-'` into `' '` and `'_'` into `'-'`, then titlecase the result.
See also `lc_()`, which this reverses imperfectly.
-
typed_repr(obj: Any, max_length: Optional[int] = None, *, use_cls: bool = False) -> str: Like `typed_str` but using `repr` instead of `str`. This is available as both `typed_repr` and `r`.
-
typed_str(obj: Any, use_cls: bool = False, use_repr: bool = False, max_length: int = 32) -> str: Return "type(obj).__name__:str(obj)" for some object `obj`. This is available as both `typed_str` and `s`.
Parameters:
- `use_cls`: default `False`; if true, use `str(type(obj))` instead of `type(obj).__name__`
- `use_repr`: default `False`; if true, use `repr(obj)` instead of `str(obj)`
I use this a lot when debugging. Example:
    from cs.lex import typed_str as s
    ......
    X("foo = %s", s(foo))
-
unctrl(s: str, tabsize: int = 8) -> str: Return the string `s` with `TAB`s expanded and control characters replaced with printable representations.
-
untexthexify(s: str, shiftin: str = '[', shiftout: str = ']') -> str: Decode a textual representation of binary data into binary data.
This is the reverse of the `texthexify` function.
Outside of the `shiftin`/`shiftout` markers the binary data are represented as hexadecimal. Within the markers the bytes have the values of the ordinals of the characters.
Example:
    >>> untexthexify('265e25265e25[abcdefghi]29282a29282a')
    b'&^%&^%abcdefghi)(*)(*'
Parameters:
- `s`: the string containing the text representation.
- `shiftin`: Optional. The marker string commencing a sequence of direct text transcription, default `'['`.
- `shiftout`: Optional. The marker string ending a sequence of direct text transcription, default `']'`.
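Decoding can be sketched by alternately consuming a hexadecimal section and a marked text section. This is an illustrative stand-in only: `untexthexify_sketch` is a hypothetical name, and the choice to raise `ValueError` for an unterminated text run is an assumption.

```python
from binascii import unhexlify


def untexthexify_sketch(s, shiftin="[", shiftout="]"):
    """Decode hex digits with bracketed text runs back into bytes."""
    chunks = []
    while s:
        # Everything before the next shiftin marker is hexadecimal.
        hexpart, found, rest = s.partition(shiftin)
        chunks.append(unhexlify(hexpart))
        if not found:
            break
        # Everything up to the shiftout marker is direct text.
        text, found, s = rest.partition(shiftout)
        if not found:
            raise ValueError("missing closing %r marker" % (shiftout,))
        # Each character becomes the byte with the character's ordinal.
        chunks.append(text.encode("latin-1"))
    return b"".join(chunks)


decoded_bytes = untexthexify_sketch("265e25265e25[abcdefghi]29282a29282a")
```

Since plain hexadecimal is valid input with no markers at all, output from a pure hex transcription and output with text runs can be concatenated and decoded in one pass.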
Release Log
Release 20250914:
- New single_space(s[,sep=]) function to split a string and rejoin, with a single space by default.
- Obsolete the @has_format_attributes decorator, now covered by FormatableMixin automatically.
- New FormatMapping(obj,base_format_mapping[,missing]) class presenting a mapping where some entries are callable, used by FormatableMixin.format_as.
- FStr: rename lc to lc_ to reflect the use of the lc_() function.
- format_as: new optional `missing(mapping,key)` parameter to supply values for keys not in `format_mapping`.
- FormatAsError: annotate the error with .key, .format_s, .format_mapping.
- Many small internal changes.
Release 20250724: tabulate: new optional ppcls argument to supply a custom PrettyPrinter class for formatting, use a date and datetime aware class by default.
Release 20250428:
- cutprefix,cutsuffix: also accept a tuple of str like str.startswith and str.endswith.
- typed_str: use cropped_repr() instead of repr().
Release 20250414: Obsolete split_remote_path(), supplanted by cs.fs.RemotePath.from_str().
Release 20250323:
- tabulate: format nonstr using pformat.
- New printt() wrapper for tabulate() which prints the table.
Release 20250103:
- Move Identifier, NumericValue, QuotedString in from cs.app.tagger.rules.
- BaseToken: expose the parsing subclass selection as a `.token_classes` class method.
Release 20241207: tabulate: split cells containing newlines over multiple output rows.
Release 20241122: tabulate: make the default separator two spaces instead of one, immediate return if no rows (avoids max() of empty list).
Release 20241119.1: Add PyPI classifier, in part to test an updated release script.
Release 20241119: stripped_dedent: new optional sub_indent parameter for indenting the second and following lines, handy for usage messages.
Release 20241109:
- stripped_dedent: new optional post_indent parameter to indent the dedented text.
- New tabulate(*rows) generator function yielding lines of padded columns.
Release 20240630: New indent(paragraph,line_indent=" ") function.
Release 20240519: New get_suffix_part() to extract things like ": Part One" from something such as a TV episode name.
Release 20240316: Fixed release upload artifacts.
Release 20240211: New split_remote_path() function to recognise [[user@]host]:path.
Release 20231018: New is_uc_identifier function.
Release 20230401: Import update.
Release 20230217.1: Fix package requirements.
Release 20230217:
- New get_prefix_n function to parse a numeric value preceded by a prefix.
- Drop strip_prefix_n, get_prefix_n is more general and I had not got around to using strip_prefix_n yet - when I did, I ended up writing get_prefix_n.
Release 20230210:
- @has_format_attributes: new optional inherit parameter to inherit superclass (or other) format attributes, default False.
- New FNumericMixin, FFloat, FInt FormatableMixin subclasses like FStr - they add .localtime and .utctime formattable attributes.
Release 20220918: typed_str(): crop the value part, default max_length=32, bugfix message cropping.
Release 20220626:
- Remove dependency on cs.py3, we've been Python 2 incompatible for a while.
- FormatableFormatter.format_field: promote None to FStr(None).
Release 20220227:
- typed_str,typed_repr: make max_length the first optional positional parameter, make other parameters keyword only.
- New camelcase() and snakecase() functions.
Release 20211208: Docstring updates.
Release 20210913:
- FormatableFormatter.FORMAT_RE_ARG_NAME_s: strings commencing with digits now match \d+(.\d+)[a-z]+, eg "02d".
- Alias typed_str as s and typed_repr as r.
- FormatableFormatter: new .format_mode thread local state object initially with strict=False, used to control whether unknown fields leave a placeholder or raise KeyError.
- FormatableFormatter.format_field: assorted fixes.
Release 20210906:
New strip_prefix_n() function to strip a leading prefix and numeric value n from the start of a string.
Release 20210717:
- Many many changes to FormatableMixin, FormatableFormatter and friends around supporting {foo|conv1|con2|...} instead of {foo!conv}. Still in flux.
- New typed_repr like typed_str but using repr.
Release 20210306:
- New cropped() function to crop strings.
- Rework cropped_repr() to do the repr() itself, and to crop the interiors of tuples and lists.
- cropped_repr: new inner_max_length for cropping the members of collections.
- cropped_repr: special case for length=1 tuples.
- New typed_str(o) function returning `type(o).__name__:str(o)` in the default case, useful for debugging.
Release 20201228: Minor doc updates.
Release 20200914:
- Hide terribly special purpose lastlinelen() in cs.hier under a private name.
- New common_prefix and common_suffix function to compare strings.
Release 20200718: get_chars: accept a callable for gochars, indicating a per character test function.
Release 20200613: cropped_repr: replace hardwired 29 with computed length
Release 20200517:
- New get_ini_clausename to parse "[clausename]".
- New get_ini_clause_entryname parsing "[clausename]entryname".
- New cropped_repr for returning a shortened repr()+"..." if the length exceeds a threshold.
- New format_escape function to double {} characters to survive str.format.
Release 20200318:
- New lc_() function to lowercase and dash a string, new titleify_lc() to mostly reverse lc_().
- New format_as function, FormatableMixin and related FormatAsError.
Release 20200229: New cutprefix and cutsuffix functions.
Release 20190812: Fix bad slosh escapes in strings.
Release 20190220: New function get_qstr_or_identifier.
Release 20181108: new function get_decimal_or_float_value to read a decimal or basic float
Release 20180815: No semantic changes; update some docstrings and clean some lint, fix a unit test.
Release 20180810:
- New get_decimal_value and get_hexadecimal_value functions.
- New stripped_dedent function, a slightly smarter textwrap.dedent.
Release 20171231: New function get_decimal. Drop unused function dict2js.
Release 20170904: Python 2/3 ports, move rfc2047 into new cs.rfc2047 module.
Release 20160828:
- Use "install_requires" instead of "requires" in DISTINFO.
- Discard str1(), pointless optimisation.
- unrfc2047: map _ to SPACE, improve exception handling.
- Add phpquote: quote a string for use in PHP code; add docstring to jsquote.
- Add is_identifier test.
- Add get_dotted_identifier.
- Add is_dotted_identifier.
- Add get_hexadecimal.
- Add skipwhite, convenience wrapper for get_white returning just the next offset.
- Assorted bugfixes and improvements.
Release 20150120: cs.lex: texthexify: backport to python 2 using cs.py3 bytes type
Release 20150118: metadata updates
Release 20150116: PyPI metadata and slight code cleanup.
File details
Details for the file cs_lex-20250914.tar.gz.
File metadata
- Download URL: cs_lex-20250914.tar.gz
- Upload date:
- Size: 49.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 976ed9aee01602ecf90f17b42a6a2b3bc12e13f7b743b906f22a5c3adea3e0f2 |
| MD5 | 497ad3c4b9809be2b6cb1ce87070f1c6 |
| BLAKE2b-256 | 3019bb94306096753b56db950bddfaf316e6fcb3af97c6ca4ed0918aa58c10ea |
File details
Details for the file cs_lex-20250914-py2.py3-none-any.whl.
File metadata
- Download URL: cs_lex-20250914-py2.py3-none-any.whl
- Upload date:
- Size: 37.8 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4728197ca27962e4525ae2f2dd32e6a3717ce6e475087c8ba7e3077e1186d40e |
| MD5 | 570ca914323b531010f1559bb2df0c5c |
| BLAKE2b-256 | fddde6bdcfd2f6e66af0b847e63d3e83e6574eb971cd98245d6c18c412e6192d |