1. Basics

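Strings are the data structure you deal with most often when programming, and regular expressions are a powerful tool for matching them. The idea is to use a descriptive language to define a rule for strings, then check whether a given string matches that rule.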

2. Going further

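[0-9a-zA-Z_] : matches a single digit, letter, or underscore;
[0-9a-zA-Z_]+ : matches a string made of one or more digits, letters, or underscores;
[a-zA-Z_][0-9a-zA-Z_]* : matches a string that starts with a letter or underscore, followed by zero or more digits, letters, or underscores, i.e. a legal (ASCII) Python variable name;
(P|p)ython : matches Python or python
^ : start anchor, matches the beginning of the string (or of each line with re.MULTILINE)
$ : end anchor, matches the end of the string (or of each line with re.MULTILINE)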
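As a quick check of the patterns listed above, here is a minimal sketch using re.fullmatch. The helper name is_identifier is hypothetical, and the pattern is only an ASCII approximation of a Python variable name: it does not reject keywords such as "for".

```python
import re

# ASCII-only approximation of a Python variable name (an assumption of this
# sketch): one letter or underscore, then zero or more letters, digits, or
# underscores.
IDENTIFIER = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]*')

def is_identifier(text):
    """Return True if the whole string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None

print(is_identifier('_count1'))  # True
print(is_identifier('1count'))   # False: starts with a digit
print(re.fullmatch(r'(P|p)ython', 'python') is not None)  # True
```

re.fullmatch is used here so that the pattern has to cover the entire string without writing ^ and $ by hand.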

3. The re module

Python's built-in re module provides all of the regular expression functionality.

>>> import re
>>> re.match(r'^\d{3}\-\d{3,8}', '010-12345')  # on success, returns a Match object
<re.Match object; span=(0, 9), match='010-12345'>
>>> re.match(r'^\d{3}\-\d{3,8}', '010 12345')  # on failure, returns None
>>>

Common usage

test = 'string entered by the user'
if re.match(r'your regular expression', test):
    print('ok')
else:
    print('failed')
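One detail worth illustrating for this kind of validation: re.match only requires the pattern to match at the start of the string, so trailing text is accepted unless the pattern ends with $. A minimal sketch, assuming a hypothetical helper is_valid_phone that uses re.fullmatch to require the whole input to match:

```python
import re

def is_valid_phone(text):
    # re.fullmatch requires the entire string to match, not only a prefix.
    return re.fullmatch(r'\d{3}-\d{3,8}', text) is not None

print(is_valid_phone('010-12345'))      # True
print(is_valid_phone('010-12345 ext'))  # False: trailing text is rejected

# For comparison, re.match without a trailing $ accepts the same input,
# because it only needs a match at the beginning.
print(re.match(r'^\d{3}-\d{3,8}', '010-12345 ext') is not None)  # True
```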

4. Splitting strings

>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']
>>> re.split(r'\s+', 'a b  c')
['a', 'b', 'c']
>>> re.split(r'[\s\,]+', 'a,b c  d')
['a', 'b', 'c', 'd']
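The same idea extends to more separators. The sketch below assumes a hypothetical helper split_fields that also treats semicolons as separators and drops the empty strings re.split produces when the text starts or ends with a separator.

```python
import re

def split_fields(text):
    # Split on runs of whitespace, commas, or semicolons, then drop the
    # empty strings that appear when the text starts or ends with a separator.
    return [part for part in re.split(r'[\s,;]+', text) if part]

print(split_fields('a,b;; c  d'))  # ['a', 'b', 'c', 'd']
print(split_fields(' a, b ,'))     # ['a', 'b']
```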

5. Groups

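Regular expressions can also extract substrings. Parentheses () mark each group to be extracted. group(0), or group() with no argument, always returns the entire match; group(1), group(2), ... return the individual groups.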

# Extract the area code and the local number
>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m
<re.Match object; span=(0, 9), match='010-12345'>
>>> m.group(0)
'010-12345'
>>> m.group(1)
'010'
>>> m.group(2)
'12345'
>>> m.group()
'010-12345'
>>> m.groups()
('010', '12345')
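Groups are handy for turning text into structured values. As a sketch under stated assumptions, the hypothetical helper parse_time below splits an 'HH:MM:SS' string into three integers. It only checks the shape of the string and does not validate ranges (for example, it would accept 99:99:99).

```python
import re

# Hypothetical pattern for this sketch: hours, minutes, seconds as three groups.
TIME = re.compile(r'^(\d{1,2}):(\d{2}):(\d{2})$')

def parse_time(text):
    m = TIME.match(text)
    if m is None:
        return None  # the text does not have the expected shape
    # m.groups() returns the three captured substrings as a tuple of str.
    hours, minutes, seconds = (int(g) for g in m.groups())
    return hours, minutes, seconds

print(parse_time('19:05:30'))  # (19, 5, 30)
print(parse_time('19-05-30'))  # None
```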

6. Greedy matching

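By default, regular expressions match greedily: each quantifier consumes as many characters as it can. In the first example, \d+ takes the whole string, including the trailing zeros, so 0* can only match the empty string.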

>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')
>>> re.match(r'^(\d+?)(0*)$', '102300').groups()  # a '?' after the quantifier makes it non-greedy
('1023', '00')
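The difference is easier to see on text with repeated delimiters. The following sketch uses a made-up sample string (not from the examples above) and compares a greedy and a non-greedy quantifier with re.search.

```python
import re

text = '<b>bold</b> and <i>italic</i>'  # sample input chosen for this sketch

# Greedy: .+ runs to the end of the string, then backtracks to the last '>'.
greedy = re.search(r'<.+>', text).group()

# Non-greedy: .+? stops at the first '>' that lets the match succeed.
lazy = re.search(r'<.+?>', text).group()

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```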

7. Compilation

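When the same regular expression is used many times, compile it once with re.compile and reuse the resulting pattern object, which is more efficient.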

>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')  # compile once
>>> re_telephone.match('010-12345').groups()
('010', '12345')
>>> re_telephone.match('010 12345').groups()  # no match: match() returns None, so .groups() raises
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
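To avoid the AttributeError shown above, check the result of match() before calling groups(). A minimal sketch, assuming a hypothetical helper split_phone that returns None for input that does not match:

```python
import re

re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')  # compiled once, reused below

def split_phone(text):
    m = re_telephone.match(text)
    # A Match object is truthy; None signals that the input did not match.
    return m.groups() if m else None

print(split_phone('010-12345'))  # ('010', '12345')
print(split_phone('010 12345'))  # None
```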