Introduction to Regular Expressions (12): Miscellaneous

hj170520 posted on 2020-5-26 22:12

Introductory code (12):

using dict            replacement string based on the matched text as dictionary key
                      ex: re.sub(r'pat', lambda m: d.get(m[0], default), s)
re.subn()             gives tuple of modified string and number of substitutions
\G                    regex module, restricts matching from start of string like \A
                      continues matching from end of match as new anchor until it fails
                      ex: regex.findall(r'\G\d+-?', '12-34 42') gives ['12-', '34']
subexpression call    regex module, helps to define recursive matching
                      ex: r'\((?:[^()]++|(?0))++\)' matches nested sets of parentheses
[[:digit:]]           regex module, named character set for \d
[[:^digit:]]          to indicate \D
                      See regular-expressions: POSIX Bracket for full list
(?V1)                 inline flag to enable version 1 for regex module
                      regex.DEFAULT_VERSION=regex.VERSION1 can also be used
                      (?V0) or regex.VERSION0 to get back default version
set operations        V1 enables this feature for character classes, nested [] allowed
                      ||  union
                      ~~  symmetric difference
                      &&  intersection
                      --  difference
                      ex: (?V1)[[:punct:]--[.!?]] punctuation except . ! and ?
pat(*SKIP)(*F)        regex module, ignore text matched by pat
                      ex: "[^"]++"(*SKIP)(*F)|, will match , but not inside
                      double quoted pairs

import re
import regex

'''Using dict'''
# one to one mappings
d = { '1': 'one', '2': 'two', '4': 'four' }
print(re.sub(r'[124]', lambda m: d[m[0]], '9234012'))
# result: '9two3four0onetwo'

# if the matched text doesn't exist as a key, default value will be used
print(re.sub(r'\d', lambda m: d.get(m[0], 'X'), '9234012'))
# result: 'XtwoXfourXonetwo'

# For swapping two or more portions without using intermediate result, using a dict is recommended.

swap = { 'cat': 'tiger', 'tiger': 'cat' }
words = 'cat tiger dog tiger cat'

# replace word if it exists as key, else leave it as is
print(re.sub(r'\w+', lambda m: swap.get(m[0], m[0]), words))
# result: 'tiger cat dog cat tiger'

# or, build the alternation list manually for simple cases
print(re.sub(r'cat|tiger', lambda m: swap[m[0]], words))
# result: 'tiger cat dog cat tiger'

# For dicts that have many entries and are likely to change during development,
# building the alternation list manually is not a good choice.
# Also, recall that as per precedence rules, the longest string should come first.

d = { 'hand': 1, 'handy': 2, 'handful': 3, 'a^b': 4 }

# take care of metacharacter escaping first
words = [re.escape(k) for k in d.keys()]
# build alternation list
# add anchors and flags as needed to construct the final RE
print('|'.join(sorted(words, key=len, reverse=True)))
# result: 'handful|handy|hand|a\\^b' (print shows a single backslash)
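To round off the steps above, here is a minimal sketch of plugging the generated alternation into an actual substitution. The sample input string and the names `keys`, `pat` and `s` are assumptions made for illustration only.

```python
import re

d = {'hand': 1, 'handy': 2, 'handful': 3, 'a^b': 4}

# escape metacharacters in every key, then sort so longer keys come first
keys = sorted((re.escape(k) for k in d), key=len, reverse=True)
pat = re.compile('|'.join(keys))

# hypothetical sample input: each matched key is replaced by its dict value
s = 'handful hand pin handy (a^b)'
result = pat.sub(lambda m: str(d[m[0]]), s)
print(result)
# result: '3 1 pin 2 (4)'
```

Because the dict values are integers here, they are converted with str() before being returned from the replacement function.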

'''re.subn
The re.subn function returns a tuple of modified string after substitution and number of substitutions made.
This can be used to perform conditional operations based on whether the substitution was successful.
Or, the value of count itself may be needed for solving the given problem.
'''
word = 'coffining'
# recursively delete 'fin'
while True:
    word, cnt = re.subn(r'fin', r'', word)
    if cnt == 0:
        break

print(word)
# result: 'cog'
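The other use mentioned above, branching on whether any substitution happened, could look like the following sketch. The sample lines and the `report` list are hypothetical.

```python
import re

# hypothetical sample lines
lines = ['error code 42 at line 7', 'all good']

report = []
for line in lines:
    # subn returns (new_string, number_of_substitutions)
    new_line, cnt = re.subn(r'\d+', 'N', line)
    if cnt:
        report.append(f'{cnt} replaced: {new_line}')
    else:
        report.append(f'unchanged: {line}')

print(report)
# result: ['2 replaced: error code N at line N', 'unchanged: all good']
```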

# Here's an example that won't work if greedy quantifier is used instead of possessive quantifier.

row = '421,foo,2425,42,5,foo,6,6,42'

# lookarounds used to ensure start/end of column matching
# possessive quantifier used to ensure partial column is not captured
# if a column has same text as another column, the latter column is deleted
while True:
    row, cnt = regex.subn(r'(?<=\A|,)([^,]++).*\K,\1(?=,|\Z)', r'', row)
    if cnt == 0:
        break

print(row)
# result: '421,foo,2425,42,5,6'

r'''\G anchor
The \G anchor (provided by the regex module) restricts matching to the start of the string, like the \A anchor. In addition, after a match is done, the end of that match becomes the new anchor location. This process repeats until the given RE fails to match (assuming multiple matches with sub, findall, etc.).
'''
# all non-whitespace characters from start of string
print(regex.findall(r'\G\S', '123-87-593 42 foo'))
# result: ['1', '2', '3', '-', '8', '7', '-', '5', '9', '3']
print(regex.sub(r'\G\S', r'*', '123-87-593 42 foo'))
# result: '********** 42 foo'

# all digits and optional hyphen combo from start of string
print(regex.findall(r'\G\d+-?', '123-87-593 42 foo'))
# result: ['123-', '87-', '593']
print(regex.sub(r'\G(\d+)(-?)', r'(\1)\2', '123-87-593 42 foo'))
# result: '(123)-(87)-(593) 42 foo'

# all word characters from start of string
# only if it is followed by word character
print(regex.findall(r'\G\w(?=\w)', 'cat12 bat pin'))
# result: ['c', 'a', 't', '1']
print(regex.sub(r'\G\w(?=\w)', r'\g<0>:', 'cat12 bat pin'))
# result: 'c:a:t:1:2 bat pin'

# all lowercase alphabets or space from start of string
print(regex.sub(r'\G[a-z ]', r'(\g<0>)', 'par tar-den hen-food mood'))
# result: '(p)(a)(r)( )(t)(a)(r)-den hen-food mood'
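To make the difference from \A concrete, this small sketch runs the same digits-and-hyphen RE three ways on one input. It assumes the third-party regex module is installed; the variable names are illustrative.

```python
import regex

s = '123-87-593 42 foo'

# \A only anchors at the very start, so at most one match is possible
only_first = regex.findall(r'\A\d+-?', s)
# \G re-anchors at the end of each match and stops at the first gap
chained = regex.findall(r'\G\d+-?', s)
# no anchor at all: '42' after the space is picked up too
unanchored = regex.findall(r'\d+-?', s)

print(only_first)   # result: ['123-']
print(chained)      # result: ['123-', '87-', '593']
print(unanchored)   # result: ['123-', '87-', '593', '42']
```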

'''Recursive matching
First up, a RE to match a set of parentheses that is not nested (termed as level-one RE for reference).
'''
# note the use of possessive quantifier
eqn0 = 'a + (b * c) - (d / e)'
print(regex.findall(r'\([^()]++\)', eqn0))
# result: ['(b * c)', '(d / e)']

eqn1 = '((f+x)^y-42)*((3-g)^z+2)'
print(regex.findall(r'\([^()]++\)', eqn1))
# result: ['(f+x)', '(3-g)']

# Next, matching a set of parentheses which may optionally contain any number of non-nested sets of parentheses
# (termed as level-two RE for reference).
# See debuggex for a railroad diagram, notice the recursive nature of this RE.

eqn1 = '((f+x)^y-42)*((3-g)^z+2)'
# note the use of non-capturing group
print(regex.findall(r'\((?:[^()]++|\([^()]++\))++\)', eqn1))
# result: ['((f+x)^y-42)', '((3-g)^z+2)']

eqn2 = 'a + (b) + ((c)) + (((d)))'
print(regex.findall(r'\((?:[^()]++|\([^()]++\))++\)', eqn2))
# result: ['(b)', '((c))', '((d))']

# That looks very cryptic. Better to use regex.X flag for clarity as well as for comparing against the recursive version.
# Breaking down the RE, you can see ( and ) have to be matched literally.
# Inside that, valid string is made up of either non-parentheses characters or a non-nested parentheses sequence (level-one RE).

lvl2 = regex.compile(r'''
      \(          #literal (
         (?:       #start of non-capturing group
         [^()]++    #non-parentheses characters
         |          #OR
         \([^()]++\)#level-one RE
         )++       #end of non-capturing group, 1 or more times
      \)          #literal )
      ''', flags=regex.X)

print(lvl2.findall(eqn1))
# result: ['((f+x)^y-42)', '((3-g)^z+2)']

print(lvl2.findall(eqn2))
# result: ['(b)', '((c))', '((d))']

# To recursively match any number of nested sets of parentheses, use a capture group and call it within the capture group itself.
# Since entire RE needs to be called here, you can use the default zeroth capture group (this also helps to avoid having to use finditer).
# Comparing with level-two RE, the only change is that (?0) is used instead of the level-one RE in the second alternation.

lvln = regex.compile(r'''
      \(       #literal (
         (?:    #start of non-capturing group
         [^()]++ #non-parentheses characters
         |       #OR
         (?0)    #recursive call
         )++    #end of non-capturing group, 1 or more times
      \)       #literal )
      ''', flags=regex.X)

print(lvln.findall(eqn0))
# result: ['(b * c)', '(d / e)']

print(lvln.findall(eqn1))
# result: ['((f+x)^y-42)', '((3-g)^z+2)']

print(lvln.findall(eqn2))
# result: ['(b)', '((c))', '(((d)))']

eqn3 = '(3+a) * ((r-2)*(t+2)/6) + 42 * (a(b(c(d(e)))))'
print(lvln.findall(eqn3))
# result: ['(3+a)', '((r-2)*(t+2)/6)', '(a(b(c(d(e)))))']
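As a cross-check on what the recursive RE extracts, the same outermost balanced groups can be collected with a plain depth counter. Note that this is a different technique (no regular expressions at all), and the helper name `outer_groups` is hypothetical. It is only a sketch: unlike the RE, it also accepts empty parentheses () and it behaves differently on unbalanced input.

```python
def outer_groups(s):
    """Collect outermost balanced parenthesized groups using a depth counter."""
    groups, depth, start = [], 0, None
    for i, ch in enumerate(s):
        if ch == '(':
            if depth == 0:
                start = i          # remember where an outermost group begins
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
            if depth == 0:         # outermost group just closed
                groups.append(s[start:i + 1])
    return groups

eqn2 = 'a + (b) + ((c)) + (((d)))'
eqn3 = '(3+a) * ((r-2)*(t+2)/6) + 42 * (a(b(c(d(e)))))'
print(outer_groups(eqn2))
# result: ['(b)', '((c))', '(((d)))']
print(outer_groups(eqn3))
# result: ['(3+a)', '((r-2)*(t+2)/6)', '(a(b(c(d(e)))))']
```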

'''Named character sets
A named character set is defined by a name enclosed between [: and :] and has to be used within a character class [], along with any other characters as needed. Using [:^ instead of [: will negate the named character set.
'''
# similar to: r'\d+' or r'[0-9]+'
print(regex.split(r'[[:digit:]]+', 'Sample123string42with777numbers'))
# result: ['Sample', 'string', 'with', 'numbers']
# similar to: r'[a-zA-Z]+'
print(regex.sub(r'[[:alpha:]]+', r':', 'Sample123string42with777numbers'))
# result: ':123:42:777:'

# similar to: r'[\w\s]+'
print(regex.findall(r'[[:word:][:space:]]+', 'tea sea-pit sit-lean\tbean'))
# result: ['tea sea', 'pit sit', 'lean\tbean']
# similar to: r'\S+'
print(regex.findall(r'[[:^space:]]+', 'tea sea-pit sit-lean\tbean'))
# result: ['tea', 'sea-pit', 'sit-lean', 'bean']

# words not surrounded by punctuation characters
print(regex.findall(r'(?<![[:punct:]])\b\w+\b(?![[:punct:]])', 'tie. ink eat;'))
# result: ['ink']

'''Character class set operations
Set operations can be applied inside character class between sets.
Mostly used to get intersection or difference between two sets, where one/both of them is a character range or predefined character set.
To aid in such definitions, you can use [] in nested fashion. The four operators, in increasing order of precedence, are:

|| union
~~ symmetric difference
&& intersection
-- difference
'''
# [^aeiou] will match any non-vowel character
# which means space is also a valid character to be matched
print(re.findall(r'\b[^aeiou]+\b', 'tryst glyph pity why'))
# result: ['tryst glyph ', ' why']
# intersection or difference can be used here
# to get a positive definition of characters to match
print(regex.findall(r'(?V1)\b[a-z&&[^aeiou]]+\b', 'tryst glyph pity why'))
# result: ['tryst', 'glyph', 'why']

# [[a-l]~~[g-z]] is same as [a-fm-z]
print(regex.findall(r'(?V1)\b[[a-l]~~[g-z]]+\b', 'gets eat top sigh'))
# result: ['eat', 'top']

# remove all punctuation characters except . ! and ?
para = '"Hi", there! How *are* you? All fine here.'
print(regex.sub(r'(?V1)[[:punct:]--[.!?]]+', r'', para))
# result: 'Hi there! How are you? All fine here.'
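The difference operator can express the non-vowel example as well. The sketch below compares it with a hand-written consonant range using the standard re module, to show what the set operation saves you from spelling out. It assumes the third-party regex module is installed; the variable names are illustrative.

```python
import re
import regex

words = 'tryst glyph pity why'

# difference: lowercase letters minus vowels (needs version 1 behaviour)
diff = regex.findall(r'(?V1)\b[[a-z]--[aeiou]]+\b', words)
# the same set written out by hand as ranges, usable with plain re
manual = re.findall(r'\b[b-df-hj-np-tv-z]+\b', words)

print(diff)     # result: ['tryst', 'glyph', 'why']
print(manual)   # result: ['tryst', 'glyph', 'why']
```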

'''Skipping matches
Sometimes, you want to change or extract all matches except particular matches. Usually, there are common characteristics between the two types of matches that makes it hard or impossible to define RE only for the required matches. For example, changing field values unless it is a particular name, or perhaps don't touch double quoted values and so on. To use the skipping feature, define the matches to be ignored suffixed by (*SKIP)(*FAIL) and then define the matches required as part of alternation. (*F) can also be used instead of (*FAIL).
'''
# change lowercase words other than imp or rat
words = 'tiger imp goat eagle rat'
print(regex.sub(r'\b(?:imp|rat)\b(*SKIP)(*F)|[a-z]++', r'(\g<0>)', words))
# result: '(tiger) imp (goat) (eagle) rat'

# change all commas other than those inside double quotes
row = '1,"cat,12",nice,two,"dog,5"'
print(regex.sub(r'"[^"]++"(*SKIP)(*F)|,', r'|', row))
# result: '1|"cat,12"|nice|two|"dog,5"'
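If only the standard re module is available, a similar effect can be sketched by matching the text to be skipped as the first alternative and returning it unchanged from a replacement function. This is a swapped-in idiom rather than the (*SKIP)(*F) feature itself, and the variable `result` is illustrative.

```python
import re

row = '1,"cat,12",nice,two,"dog,5"'

# quoted fields are matched first, so their commas are consumed by that match;
# the function returns quoted fields as is and replaces bare commas with |
result = re.sub(r'"[^"]+"|,', lambda m: '|' if m[0] == ',' else m[0], row)
print(result)
# result: '1|"cat,12"|nice|two|"dog,5"'
```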

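hshcompass posted on 2020-5-27 08:00

Do we have to pass an English exam as well?

hj170520 posted on 2020-5-26 22:24

Floor reserved for exercises

haliluyadada posted on 2020-5-27 07:30

Great, bookmarked.

sxlcity posted on 2020-5-27 07:56

Thanks for sharing, learned a lot.

hxw0204 posted on 2020-5-27 08:41

Replying to show support.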

吾爱破解 - 52pojie.cn's Archiver