Alternate regular expression module, to replace re.
Project description
For testing and comparison with the current ‘re’ module, the new implementation is provided as a module called ‘regex’.
Flags
There are 2 kinds of flag: scoped and global. Scoped flags can apply to only part of a pattern and can be turned on or off; global flags apply to the entire pattern and can only be turned on.
The scoped flags are: IGNORECASE, MULTILINE, DOTALL, VERBOSE.
The global flags are: ASCII, LOCALE, UNICODE, ZEROWIDTH.
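As a minimal sketch of what "scoped" means in practice, the session below turns IGNORECASE on for one group and, separately, off for one group. It uses the (?flags-flags:…) syntax listed under "Scoped flags". The fallback import is an assumption made so the sketch runs without 'regex' installed: the standard 're' module in Python 3.6 and later accepts the same scoped-flag syntax.

```python
try:
    import regex  # the module described here
except ImportError:
    import re as regex  # assumption: Python 3.6+, where re accepts (?i:...) and (?-i:...)

text = "ABCdef abcDEF abcdef"

# IGNORECASE is turned on only inside the group; 'def' stays case-sensitive.
scoped_on = regex.findall(r"(?i:abc)def", text)
print(scoped_on)

# IGNORECASE is on for the whole pattern but turned off inside the group,
# so 'abc' must be lower case while 'def' may be in any case.
scoped_off = regex.findall(r"(?-i:abc)def", text, regex.IGNORECASE)
print(scoped_off)
```

The first call should find 'ABCdef' and 'abcdef'; the second should find 'abcDEF' and 'abcdef'.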
Additional features
- Atomic grouping (issue #433030)
(?>…)
If the following pattern subsequently fails, then the subpattern as a whole will fail.
- Possessive quantifiers.
(?:…)?+ (?:…)*+ (?:…)++ (?:…){min,max}+
The subpattern is matched up to ‘max’ times. If the following pattern subsequently fails, then all of the repeated subpatterns will fail as a whole. For example, “(?:…)++” is equivalent to “(?>(?:…)+)”.
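The difference from an ordinary greedy repeat can be seen in a small sketch. The pattern and input are illustrative only. The fallback import is an assumption: the standard 're' module accepts atomic groups and possessive quantifiers from Python 3.11 onwards.

```python
try:
    import regex  # the module described here
except ImportError:
    import re as regex  # assumption: Python 3.11+, where re accepts (?>...) and ++

text = "aaab"

# Greedy: a+ first takes all three 'a's, then gives one back so 'ab' can match.
greedy = regex.match(r"a+ab", text)

# Possessive: a++ keeps all three 'a's, so the following 'ab' cannot match.
possessive = regex.match(r"a++ab", text)

# The equivalent atomic group behaves the same way.
atomic = regex.match(r"(?>a+)ab", text)

print(greedy.group(0))
print(possessive, atomic)
```

Here the greedy pattern matches the whole string, while the possessive and atomic forms return None because the repeated subpattern fails as a whole.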
- Scoped flags (issue #433028)
(?flags-flags:…)
The flags will apply only to the subpattern. Flags can be turned on or off.
- Inline flags (#433024, #433027)
(?flags-flags)
The flags will apply to the end of the group or pattern. Flags can be turned on or off.
- Repeated repeats (#2537)
A regex like r'((x|y+)*)*' will be accepted and will work correctly, and should also complete more quickly.
- Definition of ‘word’ character (#1693050)
The definition of a ‘word’ character has been expanded for Unicode. This applies to \w, \W, \b and \B.
- Groups in lookahead and lookbehind (#814253)
Groups and group references are permitted in both lookahead and lookbehind.
- Variable-length lookbehind
A lookbehind can match a variable-length string.
- Correct handling of charset with ignore case flag (#3511)
Ranges within charsets are handled correctly when the ignore-case flag is turned on.
- Unmatched group in replacement (#1519638)
An unmatched group is treated as an empty string in a replacement template.
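A short sketch of this behaviour, using an illustrative pattern in which only one of the two groups can take part in any given match. The fallback import is an assumption: the standard 're' module treats unmatched groups the same way from Python 3.5 onwards.

```python
try:
    import regex  # the module described here
except ImportError:
    import re as regex  # assumption: Python 3.5+, where re also substitutes ''

# For each match, either group 1 or group 2 is unmatched; the unmatched one
# contributes an empty string to the replacement instead of raising an error.
result = regex.sub(r"(a)|(b)", r"[\1\2]", "ab")
print(result)
```

The result should be '[a][b]'.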
- ‘Pathological’ patterns (#1566086, #1662581, #1448325, #1721518, #1297193)
‘Pathological’ patterns should complete more quickly.
- Flags argument for regex.split, regex.sub and regex.subn (#3482)
regex.split, regex.sub and regex.subn support a ‘flags’ argument.
- ‘Overlapped’ argument for regex.findall and regex.finditer
regex.findall and regex.finditer support an ‘overlapped’ argument which permits overlapped matches.
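A minimal sketch of the difference, assuming the argument is passed by keyword as overlapped=True. The standard 're' module has no such argument, so this example needs 'regex' itself.

```python
import regex  # third-party module described here; re has no 'overlapped' argument

text = "abcd"

# Default: each search resumes where the previous match ended.
plain = regex.findall(r"\w{2}", text)

# Overlapped: each search resumes one character after the previous match started.
overlapped = regex.findall(r"\w{2}", text, overlapped=True)

print(plain)
print(overlapped)
```

The default call should return ['ab', 'cd']; the overlapped call should also return 'bc'.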
- Unicode escapes (#3665)
The Unicode escapes \uxxxx and \Uxxxxxxxx are supported.
- Large patterns (#1160)
Patterns can be much larger.
- Zero-width match with regex.finditer (#1647489)
regex.finditer behaves correctly when it encounters a zero-width match.
- Zero-width split with regex.split (#3262)
regex.split can split at a zero-width match if the ZEROWIDTH flag is turned on. When the flag is turned off, the current behaviour is unchanged because the BDFL thinks that some existing software might depend on it.
- Splititer
regex.splititer has been added. It’s a generator equivalent of regex.split.
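A small sketch of how the two relate, using an illustrative separator pattern. It assumes regex.splititer takes the same pattern and string arguments as regex.split and yields the same pieces one at a time.

```python
import regex  # third-party module described here; re has no splititer

text = "a, b,c"

# regex.split builds the whole list at once.
eager = regex.split(r",\s*", text)

# regex.splititer yields the same pieces lazily, one per iteration.
lazy = []
for piece in regex.splititer(r",\s*", text):
    lazy.append(piece)

print(eager)
print(lazy)
```

Both lists should contain 'a', 'b' and 'c'.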
- Subscripting for groups
A match object accepts access to the captured groups via subscripting and slicing:
>>> m = regex.search(r"(?<before>.*?)(?<num>\d+)(?<after>.*)", "pqr123stu")
>>> print(m["before"])
pqr
>>> print(m["num"])
123
>>> print(m["after"])
stu
>>> print(len(m))
4
>>> print(m[:])
('pqr123stu', 'pqr', '123', 'stu')
- Named groups
Named groups can be named with (?<name>…) as well as the current (?P<name>…).
- Group references
Groups can be referenced within a pattern with \g<name>. This also allows there to be more than 99 groups.
- Named characters
\N{name}
Named characters are supported.
- Unicode properties
\p{name} \P{name}
Unicode properties are supported. \p{name} matches a character which has property ‘name’ and \P{name} matches a character which doesn’t have property ‘name’.
- Posix character classes
[[:alpha:]]
Posix character classes are supported.
- Search anchor
\G
A search anchor has been added. It matches at the position where each search started/continued and can be used for contiguous matches or in negative variable-length lookbehinds to limit how far back the lookbehind goes:
>>> regex.findall(r"\w{2}", "abcd ef")
['ab', 'cd', 'ef']
>>> regex.findall(r"\G\w{2}", "abcd ef")
['ab', 'cd']
The search starts at position 0 and matches 2 letters ‘ab’. The search continues at position 2 and matches 2 letters ‘cd’. The search continues at position 4 and fails to match any letters. The anchor stops the search start position from being advanced, so there are no more results.