Regular expressions come up constantly when crawling data. If you are not familiar with Python’s re module, the many functions it offers are easy to confuse, so today we will review the module.
Before studying the module itself, let’s see what the official documentation says about it.
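One simple way to read that documentation from inside Python is the module docstring, which is the text that help(re) shows first. This is only a sketch of one possible approach; the variable name doc is made up for the illustration.

```python
import re

# The module docstring is the text that help(re) displays at the top.
doc = re.__doc__

# Print only the first few hundred characters; the full text is long.
print(doc[:300])
```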
Introduction to Regular Expressions
A regular expression is a pattern for operating on strings: a predefined combination of ordinary and special characters forms a “rule string” that expresses a filtering logic. Regular expressions are a very powerful tool for matching strings; most programming languages support them, and Python is no exception.
The following table lists the special elements of regular expression pattern syntax. If a pattern is used together with optional flags, the meaning of some of these elements changes.
Greedy and non-greedy modes for quantifiers
Regular expressions are typically used to find matching strings in text. Quantifiers in Python are greedy by default: they always try to match as many characters as possible. For extraction we generally want non-greedy patterns instead. Before explaining the concept, here is an example: we are looking for an anchor tag in a piece of HTML text.
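As a small illustration, the search could look like this. The HTML string and the pattern are made up for the example, not taken from a real page.

```python
import re

# Hypothetical HTML snippet containing a single anchor tag.
html = 'Hello <a href="https://example.com" title="Example">Example</a>'

# .* is greedy, but with only one anchor tag the result is what we want.
m = re.search(r'<a.*>.*</a>', html)
print(m.group())
# <a href="https://example.com" title="Example">Example</a>
```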
Let’s change the input and add a second anchor tag.
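A sketch of the same greedy pattern applied to text with two anchor tags; the input string is again an invented example.

```python
import re

# Hypothetical HTML snippet with two anchor tags.
html = ('Hello <a href="https://example.com">One</a> and '
        '<a href="https://example.org">Two</a>')

# Same greedy pattern as before.
matches = re.findall(r'<a.*>.*</a>', html)
print(len(matches))  # 1
print(matches[0])    # everything from the first <a to the last </a>
```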
This time the pattern matches the first open tag and the last closed tag and everything in between, making one match instead of two separate matches. This is because the default matching mode is “greedy”.
In greedy mode, quantifiers (like * and +) match as many characters as possible. Adding a question mark after the quantifier (.*?) makes it “non-greedy”, so it matches as few characters as possible.
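With the non-greedy form, the two tags come back as separate matches. The input string below is an invented example.

```python
import re

# Hypothetical HTML snippet with two anchor tags.
html = ('Hello <a href="https://example.com">One</a> and '
        '<a href="https://example.org">Two</a>')

# .*? stops at the first position where the rest of the pattern can match.
matches = re.findall(r'<a.*?>.*?</a>', html)
print(matches)
# ['<a href="https://example.com">One</a>', '<a href="https://example.org">Two</a>']
```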
Backslash problems
As in most programming languages, regular expressions use “\” as the escape character, which can cause backslash problems. To match a literal “\” in the text, you need four backslashes, “\\\\”, in an ordinary string literal: the programming language first turns each pair into a single backslash, leaving two backslashes, which the regular expression engine then interprets as one literal backslash.
Python’s raw strings solve this problem nicely: the regular expression in this example can be written as r"\\". Similarly, “\\d”, which matches a digit, can be written as r"\d".
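The two spellings can be compared directly. This is a minimal sketch; the sample text is made up for the example.

```python
import re

text = r'C:\temp'  # hypothetical text containing one literal backslash

# Ordinary string literal: four backslashes become two characters,
# which the regex engine reads as one literal backslash.
plain_len = len('\\\\')          # 2
m1 = re.search('\\\\', text)

# Raw string literal: what you type is what the regex engine receives.
m2 = re.search(r'\\', text)

print(m1.span(), m2.span())      # (2, 3) (2, 3)

# The same applies to \d.
digits = re.findall(r'\d', 'a1b22')
print(digits)                    # ['1', '2', '2']
```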
Common methods in Python Re
re.compile(pattern, flags=0)
A Pattern object (shown as _sre.SRE_Pattern in older Python versions and as re.Pattern in current ones) is a compiled regular expression; the methods it provides are used to match it against text. Pattern cannot be instantiated directly and must be constructed with re.compile().
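A minimal sketch of compiling once and reusing the object; the pattern and the sample sentence are chosen only for illustration.

```python
import re

# Compile the pattern once; the resulting object can be reused many times.
pattern = re.compile(r'\d+')
print(type(pattern))

sentence = 'order 66 shipped in 3 days'
first = pattern.search(sentence)
print(first.group())               # 66

all_numbers = pattern.findall(sentence)
print(all_numbers)                 # ['66', '3']
```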
The flags argument of re.compile sets the matching mode, which modifies how the regular expression works. Multiple flags can be combined with the bitwise OR operator; for example, re.I | re.M sets both the I and M flags. The available flags are:
- I (full name: IGNORECASE): makes matching case-insensitive, both for literal letters and for letters in character classes.
- L (full name: LOCALE): makes the predefined character classes \w, \W, \b, \B, \s and \S depend on the current locale. (Not commonly used.)
- M (full name: MULTILINE): multi-line mode; changes the behavior of ‘^’ and ‘$’.
- S (full name: DOTALL): dot-matches-all mode; changes the behavior of ‘.’ so that it matches every character, including newlines.
- X (full name: VERBOSE): verbose mode; the regular expression can span multiple lines, whitespace is ignored, and comments are allowed.
- U (full name: UNICODE): makes \w, \W, \b, \B, \d, \D, \s and \S depend on the Unicode character properties.
Each flag is also an integer constant; adding the numbers of distinct flags gives the same result as OR-ing them:
- I = IGNORECASE = 2
- L = LOCALE = 4
- M = MULTILINE = 8
- S = DOTALL = 16
- U = UNICODE = 32
- X = VERBOSE = 64
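The numeric values can be checked directly. In this sketch the pattern and sample text are invented; the flag constants are the real ones from the re module.

```python
import re

# Combining flags with | yields the sum of their distinct values.
flags = re.I | re.M
print(int(flags))                       # 10, i.e. 2 + 8

text = 'Python\npython\nPYTHON'
pattern = re.compile(r'^python', flags)
by_name = pattern.findall(text)
print(by_name)                          # ['Python', 'python', 'PYTHON']

# Passing the plain number has the same effect, but is harder to read.
by_number = re.compile(r'^python', 10).findall(text)
print(by_number == by_name)             # True
```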
Details:
- L: Locales are a feature of the C library that helps with programs that must take different languages into account. For example, suppose you are working with French text and want \w+ to match words, but \w only matches the character class [A-Za-z0-9_] and does not match “é” or “ç”. If your system is configured properly and the locale is set to French, a C function tells the program that “é” should also be considered a letter. Compiling a regular expression with the LOCALE flag produces an object that uses these C functions to handle \w; it is slower, but \w+ then matches French words as you would expect. (In Python 3 this concerns bytes patterns only: str patterns are Unicode-aware by default, so \w already matches “é”.)
- M: Normally “^” matches only at the beginning of the string, and “$” matches only at the end of the string and immediately before a newline (if any) at the end of the string. When this flag is specified, “^” matches at the beginning of the string and at the beginning of each line within it, immediately after each newline. Likewise, “$” matches at the end of the string and at the end of each line.
- X: This flag lets you write regular expressions that are easier to read by giving you more flexible formatting. When it is specified, whitespace in the pattern is ignored unless it is inside a character class or preceded by a backslash, which lets you organize and indent the pattern clearly. It also lets you put comments in the pattern that the engine ignores; a comment starts with “#”, provided the “#” is not inside a character class or preceded by a backslash.
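The effect of the M and X flags can be sketched as follows; the sample strings and the number pattern are made up for the example.

```python
import re

text = 'first line\nsecond line\n'

# Without re.M, ^ matches only at the very start of the string.
start_default = re.findall(r'^\w+', text)          # ['first']
start_multi = re.findall(r'^\w+', text, re.M)      # ['first', 'second']

# Without re.M, $ matches only at the end (just before the trailing newline).
end_default = re.findall(r'line$', text)           # ['line']
end_multi = re.findall(r'line$', text, re.M)       # ['line', 'line']

# With re.X, whitespace and # comments inside the pattern are ignored.
number = re.compile(r"""
    (\d+)   # integer part
    \.      # literal decimal point
    (\d+)   # fractional part
""", re.X)
parts = number.search('pi is about 3.14').groups()
print(parts)                                       # ('3', '14')
```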
re.template(pattern, flags=0)
Compiles a pattern in “template” form. I have not used it and could not find further details; it was never documented, and it has been removed as of Python 3.13.
re.escape(pattern)
A utility function that escapes every character in a string that could be interpreted as a regular expression operator. Use it when the string is long and contains many special characters and you do not want to type a pile of backslashes, or when the string comes from the user (for example, read with input(), or raw_input() in Python 2) and is to be used as part of a regular expression.
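A minimal sketch of the difference escaping makes; the “user input” and the text being searched are invented for the example.

```python
import re

user_input = 'www.python.org'   # hypothetical string supplied by a user
escaped = re.escape(user_input)
print(escaped)                  # www\.python\.org

text = 'visit wwwXpythonYorg or www.python.org'

# Unescaped, each '.' matches any character, so both candidates match.
loose = re.findall(user_input, text)
print(len(loose))               # 2

# Escaped, only the literal string matches.
strict = re.findall(escaped, text)
print(strict)                   # ['www.python.org']
```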
re.purge()
Clears the regular expression cache.
re.search(pattern, string, flags=0)
re.search scans the string for the pattern and stops at the first match, returning a Match object (shown as _sre.SRE_Match in older Python versions), or None if nothing in the string matches.
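For instance, with sample strings chosen only for illustration:

```python
import re

# Only the first run of digits is reported, even though there are two.
found = re.search(r'\d+', 'abc 123 def 456')
print(found)             # a match object describing the first hit
print(found.group())     # 123

# No match at all gives None, so check before calling methods on the result.
missing = re.search(r'\d+', 'no digits here')
print(missing)           # None
```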
How do we get at the contents of the Match object?
Match Object
The Match object is the result of a match and carries a lot of information about it, available through the read-only attributes and methods that Match provides.
Attributes.
- string: The text used in the match.
- re: The Pattern object used for the match.
- pos: the index in the text at which the regular expression started searching.
- endpos: the index in the text at which the regular expression stopped searching.
- lastindex: the integer index of the last matched capturing group; None if no group was matched.
- lastgroup: the name of the last matched capturing group; None if that group has no name or no group was matched.
Methods.
- group([group1, …]): returns the string captured by one or more groups; when several arguments are given, the result is a tuple. A group can be referred to by number or by name; number 0 stands for the entire matched substring, and group() with no arguments returns group(0). Groups that captured nothing return None, and groups that captured several times return the last captured substring.
- groups([default]): returns the strings captured by all groups as a tuple, equivalent to calling group(1, 2, …, last). default is the value used for groups that captured nothing; it defaults to None.
- groupdict([default]): returns a dictionary whose keys are the names of the named groups and whose values are the substrings they captured; unnamed groups are not included. default has the same meaning as above.
- start([group]): returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.
- end([group]): returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.
- span([group]): returns (start(group), end(group)).
- expand(template): substitutes the captured groups into template and returns the result. Groups are referenced in template with \id, \g<id> or \g<name>; \0 cannot be used, so write \g<0> for the whole match. \id is equivalent to \g<id>, but \10 is read as the 10th group, so to express \1 followed by the character ‘0’ you must write \g<1>0.
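The attributes and methods above can be tried on one small match. This is a sketch; the pattern, the group name sign and the input string are chosen purely for illustration.

```python
import re

m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

# Attributes
print(m.string)               # hello world!
print(m.pos, m.endpos)        # 0 12
print(m.lastindex)            # 3
print(m.lastgroup)            # sign

# Methods
print(m.group())              # hello world!
print(m.group(1, 2))          # ('hello', 'world')
print(m.groups())             # ('hello', 'world', '!')
print(m.groupdict())          # {'sign': '!'}
print(m.start(2), m.end(2))   # 6 11
print(m.span(2))              # (6, 11)
print(m.expand(r'\2 \1\3'))   # world hello!
```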
re.match(pattern, string, flags=0)
Checks whether the regular expression matches at the beginning of the string. Returns a Match object, or None if there is no match. match() is very similar to search(); the difference is that match() only tests for a match at the start of the string, while search() scans the whole string for one.
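The difference shows up as soon as the pattern occurs somewhere other than the start. The sample text is made up for the example.

```python
import re

text = 'say hello'

# match() only looks at the start of the string.
at_start = re.match(r'hello', text)
print(at_start)                  # None

# search() scans forward until it finds a match.
anywhere = re.search(r'hello', text)
print(anywhere.span())           # (4, 9)

# match() succeeds when the pattern really is at the start.
prefix = re.match(r'say', text)
print(prefix.group())            # say
```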
re.findall(pattern, string, flags=0)
Finds all substrings matched by the RE and returns them as a list, in the order they are found scanning from left to right. If there is no match, an empty list is returned.
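A short sketch with invented sample strings. Note that when the pattern contains capturing groups, findall returns the groups rather than the whole match.

```python
import re

numbers = re.findall(r'\d+', 'one1two22three333')
print(numbers)        # ['1', '22', '333']

# With capturing groups, each list item is a tuple of the groups.
pairs = re.findall(r'(\w+)=(\d+)', 'a=1, b=22')
print(pairs)          # [('a', '1'), ('b', '22')]

nothing = re.findall(r'\d+', 'no digits')
print(nothing)        # []
```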
re.finditer(pattern, string, flags=0)
Finds all substrings matched by the RE and returns an iterator that yields a Match object for each one, in the order they are found scanning from left to right. If there is no match, the iterator is empty.
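Because each item is a Match object, positions are available as well as text. A minimal sketch with an invented input string:

```python
import re

hits = []
for m in re.finditer(r'\d+', 'one1two22three333'):
    # Each m is a match object, so group() and span() are both available.
    hits.append((m.group(), m.span()))

print(hits)
# [('1', (3, 4)), ('22', (7, 9)), ('333', (14, 17))]

# No matches: the iterator simply yields nothing.
empty = list(re.finditer(r'\d+', 'no digits'))
print(empty)   # []
```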
re.split(pattern, string, maxsplit=0, flags=0)
Splits the string wherever the regular expression matches. If the regular expression contains capturing parentheses, the matched separators are also included in the returned list. maxsplit is the maximum number of splits: maxsplit=1 splits once; the default, 0, places no limit on the number of splits.
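The three behaviours can be sketched side by side; the input string is made up for the example.

```python
import re

text = 'one1two22three333four'

plain = re.split(r'\d+', text)
print(plain)          # ['one', 'two', 'three', 'four']

# Capturing parentheses keep the separators in the result.
with_seps = re.split(r'(\d+)', text)
print(with_seps)      # ['one', '1', 'two', '22', 'three', '333', 'four']

# maxsplit=1 splits only at the first match.
once = re.split(r'\d+', text, maxsplit=1)
print(once)           # ['one', 'two22three333four']
```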
re.sub(pattern, repl, string, count=0, flags=0)
Finds all substrings matched by the RE and replaces them with repl. The optional parameter count is the maximum number of replacements to make and must be a non-negative integer; the default value, 0, replaces all matches. If there are no matches, the string is returned unchanged.
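A minimal sketch of these cases with invented sample strings. The last line also shows that repl may refer back to captured groups, in the same way as the expand template described earlier.

```python
import re

replaced_all = re.sub(r'\d+', '#', 'a1b22c333')
print(replaced_all)      # a#b#c#

# count limits the number of replacements.
replaced_two = re.sub(r'\d+', '#', 'a1b22c333', count=2)
print(replaced_two)      # a#b#c333

# No match: the string comes back unchanged.
unchanged = re.sub(r'\d+', '#', 'abc')
print(unchanged)         # abc

# Group references in repl swap the two words.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')
print(swapped)           # world hello
```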
re.subn(pattern, repl, string, count=0, flags=0)
Works the same as the re.sub method, but returns a two-tuple containing the new string and the number of times the replacement was performed.
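For example, with an invented input string:

```python
import re

# The result is a (new_string, number_of_replacements) tuple.
new_text, n = re.subn(r'\d+', '#', 'a1b22c333')
print(new_text, n)       # a#b#c# 3

# With no match, the count is 0 and the string is unchanged.
same_text, zero = re.subn(r'\d+', '#', 'abc')
print(same_text, zero)   # abc 0
```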