Python standard library: re (regular expressions)

2021-03-27

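This article explains re, the regular expression module in the Python standard library.

Overview

The re module provides the following regular expression functions.

re.findall Get a list of the strings that match
re.split Split a string
re.sub Get a string with the matching parts replaced
re.subn Get a string with the matching parts replaced, together with the number of replacements
re.search Get the match object for the first location where the regular expression matches
re.match Get the match object when the regular expression matches at the beginning of the string
re.fullmatch Get the match object when the entire string matches the regular expression
re.finditer Get an iterator of match objects, one per match
re.compile Compile a regular expression, which is useful when it is used repeatedly

The results of these functions are returned as lists, strings, tuples, or iterators. search, match, and fullmatch return a match object, and finditer returns an iterator of match objects.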

findall(pattern, string, flags=0)

Returns a list of the strings that match the regular expression.

import re

match_list = re.findall(r'\d+','2020/07/01')
match_list # => ['2020', '07', '01']
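
One detail worth knowing is that the shape of the findall result depends on how many capture groups the pattern contains. The following is a minimal sketch; the input string '10-20 30-40' is an illustrative value, not one taken from the examples above.

```python
import re

text = '10-20 30-40'  # illustrative input

# No group: each element is the whole match.
whole = re.findall(r'\d+-\d+', text)     # => ['10-20', '30-40']

# One group: each element is the text of that group only.
firsts = re.findall(r'(\d+)-\d+', text)  # => ['10', '30']

# Two or more groups: each element is a tuple of the groups.
pairs = re.findall(r'(\d+)-(\d+)', text) # => [('10', '20'), ('30', '40')]
```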

split(pattern, string, maxsplit=0, flags=0)

Splits a string by the regular expression.

split_list = re.split(r',|\|','abc,1,2,3|xyz,4,5,6')
split_list # => ['abc', '1', '2', '3', 'xyz', '4', '5', '6']

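When maxsplit is specified, at most that many splits are performed, and the remainder of the string is returned as the final element of the list.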

split_list = re.split(r',|\|','abc,1,2,3|xyz,4,5,6',maxsplit=5)
split_list # => ['abc', '1', '2', '3', 'xyz', '4,5,6']
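
If the delimiters themselves are needed, wrapping the pattern in a capture group keeps them in the result. A minimal sketch with an illustrative input string:

```python
import re

# Without a group, the delimiters are dropped.
plain = re.split(r'[,|]', 'abc,1|xyz')        # => ['abc', '1', 'xyz']

# With a capture group, each delimiter appears between the pieces.
with_seps = re.split(r'([,|])', 'abc,1|xyz')  # => ['abc', ',', '1', '|', 'xyz']
```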

sub(pattern, repl, string, count=0, flags=0)

Replaces the strings that match the regular expression.

re.sub(r'-', '/', '2021-01-01') #=> '2021/01/01'

Groups captured with parentheses can be referenced in repl as \number.

re.sub(r'(\d\d\d\d)-(\d\d)-(\d\d)', r'\1年\2月\3日', '2021-01-01 12:13:14') 
#=> '2021年01月01日 12:13:14'

Named groups can be referenced by name as \g<name>.

re.sub(r'(?P<year>\d\d\d\d)-(?P<month>\d\d)-(?P<day>\d\d)', r'\g<year>年\g<month>月\g<day>日', '2021-01-01 12:13:14')
#=> '2021年01月01日 12:13:14'

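count specifies the maximum number of replacements. Pass it as a keyword argument; passing count positionally is deprecated as of Python 3.13.

re.sub(r'\|', ',', '1|2|3|5|7|11|13', count=3) #=> '1,2,3,5|7|11|13'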

When a function is given as repl, it is called with the match object for each match, and its return value is used as the replacement string.

def repl_function(m):
    return m.group(1) + '.' + m.group(2) + '.' + m.group(3)

re.sub(r'(\d\d\d\d)-(\d\d)-(\d\d)', repl_function, '2021-01-01 12:13:14')
# => '2021.01.01 12:13:14'
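
A replacement function is most useful when the new text has to be computed from the match. The sketch below doubles every number in a string; the function name double and the input are illustrative assumptions.

```python
import re

def double(m):
    # m.group() is the matched digits; the return value must be a string.
    return str(int(m.group()) * 2)

doubled = re.sub(r'\d+', double, 'a1 b20 c300')
# => 'a2 b40 c600'
```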

subn(pattern, repl, string, count=0, flags=0)

subn returns a tuple of the replaced string and the number of replacements made.

re.subn(r'-', '/', '2021-01-01') #=> ('2021/01/01', 2)

re.subn(r'(\d\d\d\d)-(\d\d)-(\d\d)', r'\1年\2月\3日', '2021-01-01 12:13:14')
#=> ('2021年01月01日 12:13:14', 1)

re.subn(r'\|', ',', '1|2|3|5|7|11|13', count=3) #=> ('1,2,3,5|7|11|13', 3)
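
The count returned by subn is a convenient way to tell whether anything was replaced at all. A small sketch, where the helper name replace_hyphens is hypothetical:

```python
import re

def replace_hyphens(text):
    """Return the converted text and whether any replacement happened."""
    new_text, n = re.subn(r'-', '/', text)
    return new_text, n > 0

changed = replace_hyphens('2021-01-01')   # => ('2021/01/01', True)
unchanged = replace_hyphens('20210101')   # => ('20210101', False)
```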

search(pattern, string, flags=0)

Scans the string and returns the match object for the first location where the regular expression matches. Returns None if there is no match.

m = re.search('xy','aaxaxyaxya')
if m is not None:
    print('matched')

m = re.search('XY','aaxaxyaxya')
if m is None:
    print('not matched')

The match object holds information such as the capture groups of the regular expression.

match = re.search(r'(\d\d\d\d)-(\d\d)-(\d\d)', '2021-01-23')

# group() with no argument, or with 0, returns the entire match
match.group() # => '2021-01-23'
match.group(0) # => '2021-01-23'

# group indexes 1 and above return the captured groups
match.group(1) # => '2021'
match.group(2) # => '01'
match.group(3) # => '23'

# groups can also be accessed by subscript
match[0] # => '2021-01-23'
match[1] # => '2021'
match[2] # => '01'
match[3] # => '23'

# get all groups as a tuple
match.groups() # => ('2021', '01', '23')

# start position of the match
match.start() # => 0
match.start(0) # => 0
match.start(1) # => 0
match.start(2) # => 5
match.start(3) # => 8

# end position of the match (exclusive)
match.end() # => 10
match.end(0) # => 10
match.end(1) # => 4
match.end(2) # => 7
match.end(3) # => 10

# (start, end) positions of the match
match.span() # => (0, 10)
match.span(0) # => (0, 10)
match.span(1) # => (0, 4)
match.span(2) # => (5, 7)
match.span(3) # => (8, 10)

# build a string by substituting the captured groups into a template
match.expand('year=\\1, month=\\2, day=\\3') # => 'year=2021, month=01, day=23'

When groups are named, they can be referenced by name.

match = re.search(r'(?P<year>\d\d\d\d)-(?P<month>\d\d)-(?P<day>\d\d)', '2021-01-23')

match.group('year') # => '2021'
match.group('month') # => '01'
match.group('day') # => '23'

match['year'] # => '2021'
match['month'] # => '01'
match['day'] # => '23'

match.start('year') # => 0
match.start('month') # => 5
match.start('day') # => 8

match.end('year') # => 4
match.end('month') # => 7
match.end('day') # => 10

match.span('year') # => (0, 4)
match.span('month') # => (5, 7)
match.span('day') # => (8, 10)

match.expand('year=\\g<year>, month=\\g<month>, day=\\g<day>') # => 'year=2021, month=01, day=23'
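
With named groups, all captures can also be fetched at once as a dictionary with groupdict(). A minimal sketch using the same date pattern, with the variable names date_match and fields chosen for illustration:

```python
import re

date_match = re.search(
    r'(?P<year>\d\d\d\d)-(?P<month>\d\d)-(?P<day>\d\d)', '2021-01-23')

# Keys are the group names, values are the captured strings.
fields = date_match.groupdict()
# => {'year': '2021', 'month': '01', 'day': '23'}
```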

match(pattern, string, flags=0)

Unlike search, match only matches at the beginning of the string. When it matches, it likewise returns a match object.

m = re.match('xy','aaxaxyaa')
if m is None:
    print('not matched')

m = re.match('aax','aaxaxyaa')
if m is not None:
    print('matched')
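
To make the difference concrete, the sketch below runs the same pattern through both functions on the same string. The variable names are illustrative.

```python
import re

text = 'aaxaxyaa'

# match only tries position 0, so 'xy' is not found.
at_start = re.match('xy', text)       # => None

# search scans forward and finds 'xy' at index 4.
anywhere = re.search('xy', text)
anywhere.span()                       # => (4, 6)

# search with a leading ^ behaves like match here.
anchored = re.search('^xy', text)     # => None
```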

fullmatch(pattern, string, flags=0)

fullmatch returns a match object only when the entire string matches.

match = re.fullmatch(r'\d\d\d\d-\d\d-\d\d', '2021-01-23')
if match is not None:
    print('matched')

match = re.fullmatch(r'\d\d\d\d', '2021-01-23')
if match is None:
    print('not matched')
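
fullmatch is a natural fit for input validation. One reason to prefer it over match with a trailing $ is that $ also matches just before a final newline. A minimal sketch with illustrative inputs:

```python
import re

# $ matches before the trailing newline, so this still matches.
loose = re.match(r'\d{4}$', '2021\n')

# fullmatch requires the whole string, newline included, to match.
strict = re.fullmatch(r'\d{4}', '2021\n')   # => None
exact = re.fullmatch(r'\d{4}', '2021')
```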

finditer(pattern, string, flags=0)

Returns an iterator of match objects, one for each match in the string.

match_it = re.finditer(r'\d+','2020/07/01')
for m in match_it:
    print(m)
    # => <re.Match object; span=(0, 4), match='2020'>
    # => <re.Match object; span=(5, 7), match='07'>
    # => <re.Match object; span=(8, 10), match='01'>
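
Because each item is a match object, finditer gives access to positions as well as text, which findall does not. A short sketch that collects both:

```python
import re

# Pair each matched string with its (start, end) span.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', '2020/07/01')]
# => [('2020', (0, 4)), ('07', (5, 7)), ('01', (8, 10))]
```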

compile(pattern, flags=0)

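When the same regular expression is used repeatedly, compiling it in advance can make processing faster in some cases. Note that the module-level functions also cache recently compiled patterns, so the difference is often small.

The regular expression object returned by compile provides methods that correspond to the functions above, called without the pattern argument.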

items = ['2021-01-23', '2021-02-14', '2021-03-21']

reg = re.compile(r'(\d\d\d\d)-(\d\d)-(\d\d)')

for item in items:
    match = reg.search(item)
    print(match.groups())
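
When the input may contain items that do not match, search returns None and calling groups() on it raises AttributeError. The following sketch adds a guard; the name DATE_RE and the non-matching item are illustrative assumptions.

```python
import re

DATE_RE = re.compile(r'(\d\d\d\d)-(\d\d)-(\d\d)')

items = ['2021-01-23', 'not a date', '2021-03-21']

parsed = []
for item in items:
    m = DATE_RE.search(item)
    if m is None:
        # Skip items that do not contain a date.
        continue
    parsed.append(m.groups())

# parsed => [('2021', '01', '23'), ('2021', '03', '21')]
```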

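Regular expression options for flags

Passing the following values as the flags argument of each function changes how the regular expression behaves. Multiple flags can be combined with the bitwise OR operator |.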

bin(re.A)          # => '0b100000000'  ASCII-only matching
bin(re.ASCII)      # => '0b100000000'  ASCII-only matching
bin(re.DEBUG)      # =>  '0b10000000'  print debug information
bin(re.X)          # =>   '0b1000000'  verbose (more readable) regular expressions
bin(re.VERBOSE)    # =>   '0b1000000'  verbose (more readable) regular expressions
bin(re.S)          # =>     '0b10000'  make . also match newlines
bin(re.DOTALL)     # =>     '0b10000'  make . also match newlines
bin(re.M)          # =>      '0b1000'  make ^ and $ match at each line
bin(re.MULTILINE)  # =>      '0b1000'  make ^ and $ match at each line
bin(re.L)          # =>       '0b100'  make case-insensitive matching locale-dependent (bytes patterns only)
bin(re.LOCALE)     # =>       '0b100'  make case-insensitive matching locale-dependent (bytes patterns only)
bin(re.I)          # =>        '0b10'  ignore case
bin(re.IGNORECASE) # =>        '0b10'  ignore case
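
As an illustration of combining flags with |, the sketch below needs both re.I and re.S before the pattern matches. The input text is an illustrative value.

```python
import re

text = 'BEGIN\nbody\nend'

# No flags: the case differs, so there is no match.
no_flags = re.search(r'begin.*end', text)

# re.I alone: the case is ignored, but . still stops at the newline.
only_i = re.search(r'begin.*end', text, flags=re.I)

# re.I | re.S: . also matches newlines, so the whole text matches.
both = re.search(r'begin.*end', text, flags=re.I | re.S)
both.group()  # => 'BEGIN\nbody\nend'
```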

Example of the multiline option (re.MULTILINE, re.M)

lines = '''\
"abc",12345,1.23
"xyz",10000,2.34
"zzz",11223,3.12
'''

match_it = re.finditer(r'^"?(\w+)"?\s*,\s*(\d+),\s*([\.\d]+)$', lines)
list(match_it) # => []

match_it = re.finditer(r'^"?(\w+)"?\s*,\s*(\d+),\s*([\.\d]+)$', lines , flags = re.M)
list(match_it)
# =>[<re.Match object; span=(0, 16), match='"abc",12345,1.23'>,
#    <re.Match object; span=(17, 33), match='"xyz",10000,2.34'>,
#    <re.Match object; span=(34, 50), match='"zzz",11223,3.12'>]

Example of the verbose option (re.VERBOSE, re.X)

lines = '''\
"abc",12345,1.23
"xyz",10000,2.34
"zzz",11223,3.12
'''

match_it = re.finditer(r'''
^
"?(\w+)"? \s*  # first field
,\s* (\d+)     # second field
,\s* ([\.\d]+) # third field
$
''', lines , flags = re.M | re.X)
list(match_it)
# =>[<re.Match object; span=(0, 16), match='"abc",12345,1.23'>,
#    <re.Match object; span=(17, 33), match='"xyz",10000,2.34'>,
#    <re.Match object; span=(34, 50), match='"zzz",11223,3.12'>]

Example of case-insensitive matching (re.IGNORECASE, re.I)

m = re.search('XY', 'aaxaxyaxya')
if m is None:
    print('not matched')

m = re.search('XY', 'aaxaxyaxya', flags = re.I)
if m is not None:
    print('matched')
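
The same option can also be written inside the pattern as the inline flag (?i), or fixed once on a compiled pattern. A minimal sketch, with variable names chosen for illustration:

```python
import re

text = 'aaxaxyaxya'

# Inline flag at the start of the pattern.
inline = re.search(r'(?i)XY', text)
inline.span()  # => (4, 6)

# Flag passed to compile; later method calls take no flags argument.
pattern = re.compile('XY', flags=re.I)
compiled = pattern.search(text)
compiled.group()  # => 'xy'
```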