Python Learning Diary EP 7 - Regular Expressions

2021-08-18

2022-03-14

Python Learning Diary

Reference book: Python自動化的樂趣: 搞定重複瑣碎&單調無聊的工作

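📝 Regular Expressions

A regular expression (regex) describes a text pattern that can be used to search for and match text.

Regular expressions can save a lot of time, for software users as well as for programmers.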

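📓 Creating a Regex object

All of Python's regular expression functions live in the re module.

import re

Calling the compile() function with a string that holds the regular expression returns a Regex pattern object, or Regex object for short.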

import re

# Without a raw string, every backslash has to be escaped
phoneNumRegex = re.compile('\\d\\d-\\d\\d\\d\\d\\d\\d\\d\\d')

# With a raw string
phoneNumRegex = re.compile(r'\d\d-\d\d\d\d\d\d\d\d')

print(type(phoneNumRegex))

:::info
💡 Hint: Writing regular expressions as raw strings is recommended because it keeps them more concise.
:::
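As a quick check of the hint above, the sketch below compares the two spellings of the phone pattern. The variable names `escaped` and `raw` are only illustrative; the point is that both literals produce the same pattern text, so the raw string is purely a notational convenience.

```python
import re

# Both literals describe the same pattern text; only the notation differs.
escaped = '\\d\\d-\\d\\d\\d\\d\\d\\d\\d\\d'
raw = r'\d\d-\d\d\d\d\d\d\d\d'
print(escaped == raw)  # True

# Either literal therefore compiles to a pattern that matches the same text.
print(re.compile(raw).search('Phone: 09-12345678').group())  # 09-12345678
```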

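📓 Matching with a Regex object

A Regex object's search() method takes the string to search and looks for text that matches the regular expression. It returns None if nothing matches, and a Match object otherwise.

A Match object has a group() method that returns the text that actually matched.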

import re

# With a raw string
phoneNumRegex = re.compile(r'\d\d-\d\d\d\d\d\d\d\d')

# A match is found
result = phoneNumRegex.search('Phone: 09-12345678')
print('Phone number found: '+result.group())

# No match is found, so search() returns None
result = phoneNumRegex.search('Phone: 09-1234567')
print('Phone number found: '+str(result))

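📓 Recap

1. Import the regular expression module with import re.
2. Call compile() to create a Regex object (a raw string is recommended).
3. Call the Regex object's search() method with the string to search; it returns a Match object or None.
4. Call the Match object's group() method to get the matched text.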
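The recap steps can be combined into one small helper. This is only a sketch: the function name `find_phone` is made up for illustration, and the explicit None check is one possible way to handle a failed search before calling group().

```python
import re

def find_phone(text):
    """Return the first phone-like substring in text, or None if absent."""
    phone_regex = re.compile(r'\d\d-\d\d\d\d\d\d\d\d')
    match = phone_regex.search(text)
    # search() returns None on failure, so guard before calling group().
    if match is None:
        return None
    return match.group()

print(find_phone('Phone: 09-12345678'))  # 09-12345678
print(find_phone('Phone: 09-1234567'))   # None
```

Calling group() directly on a None result would raise an AttributeError, which is why the guard comes first.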

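📝 Putting regular expressions to work

📓 Grouping with parentheses

The first pair of parentheses in a regular expression string is group 1, the second pair is group 2, and so on. Passing an integer to group() returns the text matched by that group.

Passing 0 to group(), or passing nothing, returns the entire matched text.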

import re

# With a raw string
phoneNumRegex = re.compile(r'(\d\d)-(\d\d\d\d\d\d\d\d)')

# A match is found
result = phoneNumRegex.search('Phone: 09-12345678')

print('Phone number found: '+result.group())

print('Phone number found: '+result.group(0))

print('Phone number found: '+result.group(1))

print('Phone number found: '+result.group(2))

To get all the groups at once, call the groups() method, which returns them as a tuple.

import re

# With a raw string
phoneNumRegex = re.compile(r'(\d\d)-(\d\d\d\d\d\d\d\d)')

# A match is found
result = phoneNumRegex.search('Phone: 09-12345678')

print(result.groups())

Because parentheses have a special meaning in regular expressions, escape them with a backslash to match literal parentheses: \( and \).

import re

# With a raw string
phoneNumRegex = re.compile(r'(\(\d\d\))-(\d\d\d\d\d\d\d\d)')

# A match is found
result = phoneNumRegex.search('Phone: (09)-12345678')

print('Phone number found: '+result.group())

print('Phone number found: '+result.group(1))

print('Phone number found: '+result.group(2))

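📓 Matching alternatives with the pipe

To match one of several expressions, separate them with the | character. For example, the regular expression r'Bat|Spider' matches either 'Bat' or 'Spider'. When both occur in the searched string, the Match object holds whichever occurs first.

Because the pipe has a special meaning in regular expressions, escape it to match a literal pipe: \|.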

import re

textRegex = re.compile(r'Bat|Spider')
result = textRegex.search('Bat and Spider')
print(result.group())

textRegex = re.compile(r'Bat|Spider')
result = textRegex.search('Spider and Bat')  # order swapped
print(result.group())

import re

textRegex = re.compile(r'(One|Two|Three) Fish')
result = textRegex.search('Lura has One Fish.')
print(result.group(1))
print(result.group())

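📓 Optional matching (?)

When part of the pattern is optional, put a question mark (?) after the group. The pattern then matches whether or not that text is present.

In other words, the question mark means the preceding group may appear zero times or once.

Because ? has a special meaning in regular expressions, escape it to match a literal question mark: \?.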

import re

textRegex = re.compile(r'(One|Two|Three)? Fish')
result = textRegex.search(' Fish.')
print(result.group())

result = textRegex.search('Lura has One Fish.')
print(result.group())
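One detail worth checking with an optional group is what group(1) returns when the group did not take part in the match. The sketch below reuses the pattern from above; the variable names and the 'unknown' fallback are illustrative choices, not part of the original example.

```python
import re

fish_regex = re.compile(r'(One|Two|Three)? Fish')

with_count = fish_regex.search('Lura has One Fish.')
without_count = fish_regex.search(' Fish.')

print(with_count.group(1))     # One
print(without_count.group(1))  # None

# Guard against None before using the optional group's text.
count = without_count.group(1) or 'unknown'
print(count)  # unknown
```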

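📓 Matching zero or more times (*)

The group before the asterisk may appear any number of times, including zero.

Because * has a special meaning in regular expressions, escape it to match a literal asterisk: \*.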

import re

textRegex = re.compile(r'(One|Two|Three)* Fish')
result = textRegex.search(' Fish.')
print(result.group())

result = textRegex.search('Lura has OneOneOne Fish.')
print(result.group())

result = textRegex.search('Lura has OneTwoOne Fish.')
print(result.group())

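📓 Matching one or more times (+)

The group before the plus sign must appear at least once.

Because + has a special meaning in regular expressions, escape it to match a literal plus sign: \+.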

import re

textRegex = re.compile(r'(One|Two|Three)+ Fish')
result = textRegex.search(' Fish.')
print(result == None)

result = textRegex.search('Lura has OneOneOne Fish.')
print(result.group())

result = textRegex.search('Lura has OneTwoOne Fish.')
print(result.group())

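📓 Matching a specific number of times ({ })

Braces placed after a group specify how many times the group may repeat.

For example:

(Hello){3} matches exactly 3 repetitions of the group
(Hello){3,5} matches 3 to 5 repetitions
(Hello){3,} matches 3 or more repetitions
(Hello){,5} matches 0 to 5 repetitions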

import re

textRegex = re.compile(r'(Big){3} Fish')
result = textRegex.search('BigBigBig Fish.')
print(result.group())

result = textRegex.search('BigBig Fish.')
print(result == None) # return True
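The four brace forms listed above can be compared side by side. This sketch uses re.fullmatch(), which the diary has not introduced, so that the whole string has to fit the pattern; the helper name `repeats_ok` is hypothetical.

```python
import re

def repeats_ok(pattern, count):
    """Return True if 'Hello' repeated count times matches pattern in full."""
    # fullmatch() succeeds only when the whole string fits the pattern,
    # so extra repetitions are not silently ignored as they would be with search().
    return re.fullmatch(pattern, 'Hello' * count) is not None

print([n for n in range(8) if repeats_ok(r'(Hello){3}', n)])    # [3]
print([n for n in range(8) if repeats_ok(r'(Hello){3,5}', n)])  # [3, 4, 5]
print([n for n in range(8) if repeats_ok(r'(Hello){3,}', n)])   # [3, 4, 5, 6, 7]
print([n for n in range(8) if repeats_ok(r'(Hello){,5}', n)])   # [0, 1, 2, 3, 4, 5]
```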

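📓 Greedy and non-greedy matching

By default, (Hello){3,5} matches greedily: Python returns the longest text that fits, so when Hello appears 5 times in a row the match covers all 5 and not just 3 or 4. To get the shortest possible match instead (non-greedy matching), add a question mark after the closing brace.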

import re

textRegex = re.compile(r'(Hello){3,5}')
result = textRegex.search('HelloHelloHelloHelloHello')
print(result.group())

textRegex = re.compile(r'(Hello){3,5}?')
result = textRegex.search('HelloHelloHelloHelloHello')
print(result.group())

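📓 The findall() method

Regex objects also have a findall() method. Unlike search(), which returns a Match object for the first match only, findall() returns a list of strings containing every match in the searched text.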

import re

# With a raw string
phoneNumRegex = re.compile(r'\(\d\d\)-\d\d\d\d\d\d\d\d')

# A match is found
result = phoneNumRegex.findall('Phone: (09)-12345678')

print(result)

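If the regular expression contains two or more groups, findall() returns a list of tuples instead, with one string per group in each tuple. (With a single group, it returns a list of that group's strings.)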

import re

# With a raw string
phoneNumRegex = re.compile(r'(\(\d\d\))-(\d\d\d\d\d\d\d\d)')

# A match is found
result = phoneNumRegex.findall('Phone: (09)-12345678')

# findall() returns a list, which has no group() method, so print it directly
print(result)
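The list-of-tuples result is easier to see when the text contains more than one match. The sample text and the names `phone_regex` and `pairs` below are made up for illustration.

```python
import re

phone_regex = re.compile(r'(\(\d\d\))-(\d\d\d\d\d\d\d\d)')
text = 'Home: (02)-12345678, Mobile: (09)-87654321'

# With two groups, each match becomes a (group 1, group 2) tuple.
pairs = phone_regex.findall(text)
print(pairs)  # [('(02)', '12345678'), ('(09)', '87654321')]

# The tuples can be unpacked directly in a loop.
for area, number in pairs:
    print(area, number)
```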

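📝 Character classes

Shorthand character classes help keep regular expressions short.

| Character class | Matches |
| --- | --- |
| `\d` | Any digit from 0 to 9 |
| `\D` | Any character that is not a digit from 0 to 9 |
| `\w` | Any letter, digit, or the underscore character |
| `\W` | Any character that is not a letter, digit, or underscore |
| `\s` | A space, tab, or newline |
| `\S` | Any character that is not a space, tab, or newline |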
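A short sketch of the shorthand classes in use; the sample strings and variable names are invented for illustration.

```python
import re

text = 'Order 66: 3 cats, 12 dogs'

# \d+ finds every run of digits.
numbers = re.compile(r'\d+').findall(text)
print(numbers)  # ['66', '3', '12']

# Digits, one whitespace character, then a run of word characters.
counted = re.compile(r'\d+\s\w+').findall(text)
print(counted)  # ['3 cats', '12 dogs']

# \D+ finds every run of non-digit characters.
non_digits = re.compile(r'\D+').findall('a1b22c')
print(non_digits)  # ['a', 'b', 'c']
```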

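📝 Creating your own character classes

Square brackets define a custom character class, and a hyphen (-) inside them specifies a range of letters or digits.

For example, [a-zA-Z0-9] matches any single uppercase letter, lowercase letter, or digit.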

import re

textRegex = re.compile(r'[a-zA-Z0-9]')

result = textRegex.search('Where can I find the answer for the world?') 

print(result.group())

result = textRegex.findall('Where can I find the answer for the world?')

print(result)

Placing a caret (^) right after the opening bracket negates the character class, so it matches any character that is not in the class.

import re

textRegex = re.compile(r'[^a-zA-Z0-9]')

result = textRegex.findall('Where can I find the answer for the world?')

print(result)

📝 The `^` and `$` characters

A ^ at the start of a regular expression means the match must begin at the start of the searched string.

import re

textRegex = re.compile(r'^Hello')

result = textRegex.search('Hello, Kevin.')

print(result.group())

result = textRegex.search('hello, Kevin.')
print(result == None)

A $ at the end of a regular expression means the match must end at the end of the searched string.

import re

textRegex = re.compile(r'\d$')

result = textRegex.search('3.141592653')

print(result.group())

result = textRegex.search('3.141592653s')
print(result == None)

Using both together means the entire string must match the pattern, from start to end.

import re

textRegex = re.compile(r'^\d\.\d+\w$')

result = textRegex.search('3.141592653s')

print(result.group())

result = textRegex.search('3.141592653 ')
print(result == None)

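📝 The wildcard character

In a regular expression the period (.) is called the wildcard; it matches any character except a newline.

A period matches exactly one character.

Because . has a special meaning in regular expressions, escape it to match a literal period: \..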

import re 

textRegex = re.compile(r'.at')
result = textRegex.findall('The cat in the hat sat on the flat mat.')
print(result)

📓 Matching everything with `.*`

.* is greedy by default, meaning it matches the longest possible text. Use .*? to switch to non-greedy matching.

import re 

textRegex = re.compile(r'First: (.*) Second: (.*)')
result = textRegex.search('First: Fried Chicken Second: Burger')
print(result.group())

import re 

textRegex = re.compile(r'<.*>')
result = textRegex.search('<Harry Potter> Rowling, J. K.>')
print(result.group())

textRegex = re.compile(r'<.*?>')
result = textRegex.search('<Harry Potter> Rowling, J. K.>')
print(result.group())

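📓 Matching newlines with the `.` character

.* matches everything except newline characters. To include newlines as well, pass re.DOTALL as the second argument to compile(); the period then matches every character.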

import re 

textRegex = re.compile('.*')
result = textRegex.search('First: Fried Chicken\nSecond: Burger')
print(result.group())

textRegex = re.compile('.*',re.DOTALL)
result = textRegex.search('First: Fried Chicken\nSecond: Burger')
print(result.group())

📝 Case-insensitive matching

If only the letters matter and not their case, pass re.IGNORECASE or re.I to compile().

import re 

textRegex = re.compile(r'Hello',re.I)
result = textRegex.search('HeLLO')
print(result.group())

result = textRegex.search('hElLo')
print(result.group())

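📝 Replacing matched text

Regular expressions can not only find text but also replace it, using the Regex object's sub() method.

sub() takes two arguments: the first is the replacement text, and the second is the string to search.

sub() returns the complete string with the replacements applied.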

import re 

textRegex = re.compile(r'Hello')
result = textRegex.sub('Hey','Hello, Kevin')
print(result)
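One behavior the example above does not show is that sub() replaces every match, not just the first. A minimal sketch, with `digit_regex` and the sample strings chosen only for illustration:

```python
import re

digit_regex = re.compile(r'\d')

# Every match is replaced, not only the first one.
masked = digit_regex.sub('*', 'Phone: 09-12345678')
print(masked)  # Phone: **-********

# When nothing matches, the original string comes back unchanged.
unchanged = digit_regex.sub('*', 'no digits here')
print(unchanged)  # no digits here
```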

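📝 Managing complex regular expressions

When a regular expression is long, complicated, and hard to read, consider adding whitespace and comments to it. Passing re.VERBOSE to compile() makes the whitespace and comments inside the pattern be ignored.

Triple quotes (''') create a multiline string, so the pattern can be spread over several lines.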

import re 

# The pattern written on a single line
textRegex = re.compile(r'(\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?')

# The same pattern in verbose mode, with whitespace and comments
textRegex = re.compile(r'''
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension
''',re.VERBOSE)
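To check that a verbose pattern behaves like its one-line form, both can be run against the same text. The sketch below uses a shortened phone pattern without the extension part, and the sample strings are invented.

```python
import re

compact = re.compile(r'(\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}')
verbose = re.compile(r'''
    (\d{3}|\(\d{3}\))?   # optional area code, with or without parentheses
    (\s|-|\.)?           # optional separator
    \d{3}                # first 3 digits
    (\s|-|\.)            # separator
    \d{4}                # last 4 digits
''', re.VERBOSE)

samples = ['Call 415-555-1234 today', 'Office: (415) 555-9999']
found = []
for sample in samples:
    a = compact.search(sample).group()
    b = verbose.search(sample).group()
    found.append((a, a == b))
    print(a, a == b)
```

Since literal whitespace in a verbose pattern is ignored, a space in the text has to be matched with \s or an escaped space, as the separator lines do here.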

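📝 Passing multiple options to `compile()`

compile() accepts only a single value as its second argument, but several options can be combined into one value with the pipe character (|), which here is Python's bitwise OR operator.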


import re 

textRegex = re.compile(r'Hello',re.I | re.DOTALL | re.VERBOSE)
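The effect of combining options is easier to see with a pattern that actually needs all three. This is an illustrative sketch; the pattern, the sample text, and the variable names are not from the original diary.

```python
import re

text = 'HELLO\nworld'

pattern = r'''
    hello   # matched case-insensitively because of re.I
    .       # matches the newline because of re.DOTALL
    world
'''

combined = re.compile(pattern, re.I | re.DOTALL | re.VERBOSE)
print(combined.search(text).group() == text)  # True

# Without re.DOTALL the period cannot match the newline, so there is no match.
without_dotall = re.compile(pattern, re.I | re.VERBOSE)
print(without_dotall.search(text))  # None
```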
