Python源码阅读–Re正则模块

背景：使用python正则表达式时对这种简洁的代码表示惊叹。因此阅读了相关的源码，python源码是3.8.2版本
python正则的使用，不作赘述。如下链接讲的比较详细。网上资料也比较多
Python正则表达式指南

下面开始对re源码的阅读分析。

import enum
import sre_compile
import sre_parse
import functools
try:
    import _locale
except ImportError:
    _locale = None
 
__all__ = [
    "match", "fullmatch", "search", "sub", "subn", "split",
    "findall", "finditer", "compile", "purge", "template", "escape",
    "error", "A", "I", "L", "M", "S", "X", "U",
    "ASCII", "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL", "VERBOSE",
    "UNICODE",
]
 
__version__ = "2.2.1"

首先必要模块的导入，其中这个_locale找不到相关代码，直接查看该模块的注释信息发现是Support for POSIX locales，也就是当前系统所使用的语言环境。__all__就是所有支持的正则匹配方法以及flags标志。

class RegexFlag(enum.IntFlag):
    ASCII = sre_compile.SRE_FLAG_ASCII 
    IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE 
    LOCALE = sre_compile.SRE_FLAG_LOCALE 
    UNICODE = sre_compile.SRE_FLAG_UNICODE 
    MULTILINE = sre_compile.SRE_FLAG_MULTILINE 
    DOTALL = sre_compile.SRE_FLAG_DOTALL
    VERBOSE = sre_compile.SRE_FLAG_VERBOSE
    A = ASCII
    I = IGNORECASE
    L = LOCALE
    U = UNICODE
    M = MULTILINE
    S = DOTALL
    X = VERBOSE
    TEMPLATE = sre_compile.SRE_FLAG_TEMPLATE 
    T = TEMPLATE
    DEBUG = sre_compile.SRE_FLAG_DEBUG 
globals().update(RegexFlag.__members__)
 
error = sre_compile.error

re模块主要定义了两个类，第一个就是RegexFlag，它是继承的枚举类型。该类是对sre_compile模块中各种标志位的封装，然后进行了简化，在我们使用正则标志位，例如re.ASCII 和re.A等价，用A可代替ASCII，一个是缩写，一个是完整写法。globals()是一个全局变量字典，通过update方法合并__members__。不过注意__members__这个魔法方法只是在enum枚举类中定义的，不是通用的。sre_compile.error是一个异常类，继承的Exception类。

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

def fullmatch(pattern, string, flags=0):
    """Try to apply the pattern to all of the string, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the Match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl.  number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the Match object and must
    return a replacement string to be used."""
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list."""
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

def finditer(pattern, string, flags=0):
    """Return an iterator over all non-overlapping matches in the
    string.  For each match, the iterator returns a Match object.

    Empty matches are included in the result."""
    return _compile(pattern, flags).finditer(string)

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a Pattern object."
    return _compile(pattern, flags)

def purge():
    "Clear the regular expression caches"
    _cache.clear()
    _compile_repl.cache_clear()

def template(pattern, flags=0):
    "Compile a template pattern, returning a Pattern object"
    return _compile(pattern, flags|T)

接下来就是re模块封装的各种正则匹配方法。

他们实际是_compile(pattern, flag).对应的方法(string)，purge用于清除正则表达式缓存。虽然使用 compile 函数生成的 Pattern 对象的一系列方法跟 re 模块的多数函数是对应的，但在使用上有细微差别。

这里先说一下正则的一般使用步骤（不限于Python语言）：

使用 compile 函数将正则表达式的字符串形式编译为一个 Pattern 对象
通过 Pattern 对象提供的一系列方法对文本进行匹配查找，获得匹配结果（一个 Match 对象）
最后使用 Match 对象提供的属性和方法获得信息，根据需要进行其他的操作

import re
s = 'ABC12DEF3GH6.!'

pattern = re.compile('\d+')
a = pattern.findall(s)
print(a)
['12', '3', '6']
[Finished in 0.1s]

大多数语言一种通用的正则用法。阅读源码后发现常用的正则表达式方法，都已经自带了compile。

所以在Python中使用 re 模块有两种方式：
1、使用 re.compile 函数生成一个 Pattern 对象，然后使用 Pattern 对象的一系列方法对文本进行匹配查找
2、直接使用 re.match, re.search 和 re.findall 等函数直接对文本匹配查找

# SPECIAL_CHARS
# closing ')', '}' and ']'
# '-' (a range in character set)
# '&', '~', (extended character set operations)
# '#' (comment) and WHITESPACE (ignored) in verbose mode
_special_chars_map = {i: '\\' + chr(i) for i in b'()[]{}?*+-|^$\\.&~# \t\n\r\v\f'}

def escape(pattern):
    """
    Escape special characters in a string.
    """
    if isinstance(pattern, str):
        return pattern.translate(_special_chars_map)
    else:
        pattern = str(pattern, 'latin1')
        return pattern.translate(_special_chars_map).encode('latin1')

def translate(pat):
    """Translate a shell PATTERN to a regular expression.

    There is no way to quote meta-characters.
    """

    i, n = 0, len(pat)
    res = ''
    while i < n:
        c = pat[i]
        i = i+1
        if c == '*':
            res = res + '.*'
        elif c == '?':
            res = res + '.'
        elif c == '[':
            j = i
            if j < n and pat[j] == '!':
                j = j+1
            if j < n and pat[j] == ']':
                j = j+1
            while j < n and pat[j] != ']':
                j = j+1
            if j >= n:
                res = res + '\\['
            else:
                stuff = pat[i:j]
                if '--' not in stuff:
                    stuff = stuff.replace('\\', r'\\')
                else:
                    chunks = []
                    k = i+2 if pat[i] == '!' else i+1
                    while True:
                        k = pat.find('-', k, j)
                        if k < 0:
                            break
                        chunks.append(pat[i:k])
                        i = k+1
                        k = k+3
                    chunks.append(pat[i:j])
                    # Escape backslashes and hyphens for set difference (--).
                    # Hyphens that create ranges shouldn't be escaped.
                    stuff = '-'.join(s.replace('\\', r'\\').replace('-', r'\-')
                                     for s in chunks)
                # Escape set operations (&&, ~~ and ||).
                stuff = re.sub(r'([&~|])', r'\\\1', stuff)
                i = j+1
                if stuff[0] == '!':
                    stuff = '^' + stuff[1:]
                elif stuff[0] in ('^', '['):
                    stuff = '\\' + stuff
                res = '%s[%s]' % (res, stuff)
        else:
            res = res + re.escape(c)
    return r'(?s:%s)\Z' % res

_special_chars_map定义了所有字符，由于Python3严格区分文本（str）和二进制数据（bytes），文本总是unicode，用str类型，二进制数据则用bytes类型表示。即非str类型转换为latin-1格式

escape 这个函数就是对字符串中所有可能被解释为正则运算符的字符进行转义的应用函数。如果字符串很长且包含很多特殊符号，例如”-,*,/”，通过这个方法将特殊字符前加上反斜杠”\”。

根据代码可知，先判断传入的字符是否是str。再把字符串转成列表进行遍历判断字符是否在集合中来转义。如果是str类型是直接转成列表进行替换

a = re.escape('E:Python-3.8/test.py')
print(a)

for var in b'abcd':
	print(var, type(var))
E:Python\-3\.8/test\.py
97 <class 'int'>
98 <class 'int'>
99 <class 'int'>
100 <class 'int'>
[Finished in 0.2s]

在translate中，依据正则的优先级对str进行字符串的处理，将单串指令切割成相对独立的小串

接下来就是比较关键的_compile函数了。

_cache = {}  # ordered!

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    if isinstance(flags, RegexFlag):
        flags = flags.value
    try:
        return _cache[type(pattern), pattern, flags]
    except KeyError:
        pass
    if isinstance(pattern, Pattern):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            # Drop the oldest item
            try:
                del _cache[next(iter(_cache))]
            except (StopIteration, RuntimeError, KeyError):
                pass
        _cache[type(pattern), pattern, flags] = p
    return p

_MAXCACHE是定义的一个最大缓存。同一个正则表达式，同一个flag，那么调用两次_compile时，第二次会直接读取缓存。但是当缓存大于512时就会清理缓存。传入正则的时候，Python会先查找缓存。因此，反复调用也不会存在影响效率

@functools.lru_cache(_MAXCACHE)
def _compile_repl(repl, pattern):
    return sre_parse.parse_template(repl, pattern)
 
def _expand(pattern, match, template):
    template = sre_parse.parse_template(template, pattern)
    return sre_parse.expand_template(template, match)
 
def _subx(pattern, template):
    template = _compile_repl(template, pattern)
    if not template[0] and len(template[1]) == 1:
        return template[1][0]
    def filter(match, template=template):
        return sre_parse.expand_template(template, match)
    return filter

这里用到了functools.lru_cache装饰器。functools.lru_cache的作用主要是用来做缓存，他能把相对耗时的函数结果进行保存，避免传入相同的参数重复计算。同时，缓存并不会无限增长，不用的缓存会被释放。不过这几个都是内置函数，供其他函数使用的

import copyreg
 
def _pickle(p):
    return _compile, (p.pattern, p.flags)
 
copyreg.pickle(_pattern_type, _pickle, _compile)

copyreg 模块提供了可在封存特定对象时使用的一种定义函数方式。用 copyreg 实现可靠的 pickle 操作，大家都知道pickle模块是用于对相关对象执行序列化和反序列化操作。如果用法比较复杂，那么 pickle 模块的功能也许就会出问题。可以把内置的 copyreg 模块同 pickle 结合起来使用，以便为旧数据添加缺失的属性值、进行类的版本管理等。这里注释也写了，就是自我注册。

# --------------------------------------------------------------------
# experimental stuff (see python-dev discussions for details)

class Scanner:
    def __init__(self, lexicon, flags=0):
        from sre_constants import BRANCH, SUBPATTERN
        if isinstance(flags, RegexFlag):
            flags = flags.value
        self.lexicon = lexicon
        # combine phrases into a compound pattern
        p = []
        s = sre_parse.State()
        s.flags = flags
        for phrase, action in lexicon:
            gid = s.opengroup()
            p.append(sre_parse.SubPattern(s, [
                (SUBPATTERN, (gid, 0, 0, sre_parse.parse(phrase, flags))),
                ]))
            s.closegroup(gid, p[-1])
        p = sre_parse.SubPattern(s, [(BRANCH, (None, p))])
        self.scanner = sre_compile.compile(p)
    def scan(self, string):
        result = []
        append = result.append
        match = self.scanner.scanner(string).match
        i = 0
        while True:
            m = match()
            if not m:
                break
            j = m.end()
            if i == j:
                break
            action = self.lexicon[m.lastindex-1][1]
            if callable(action):
                self.match = m
                action = action(self, m.group())
            if action is not None:
                append(action)
            i = j
        return result, string[i:]

scanner是内置的SRE模式对象的一个属性，引擎通过扫描器，在找到一个匹配后继续找下一个。它基于SRE模式scanner构造，提供了一些更高一层的接口。

re模块中的scanner对于提升「不匹配」的速度并没有多少帮助：基于SRE的基础类型。它的工作方式是接受一个正则表达式的列表和一个回调元组。对于每个匹配调用回调函数然后以此构造一个结果列表。

暂无评论

发表评论取消回复