python - How to write a grammar for this in pyparsing: match a set of words but not containing a given pattern

Question

I am new to Python and pyparsing. I need to accomplish the following.

A sample line of my text looks like this:

12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009

I need to extract the item description and the period.

from pyparsing import Combine, Word, alphas, alphanums, nums

tok_date_in_ddmmmyyyy = Combine(Word(nums,min=1,max=2)+ " " + Word(alphas, exact=3) + " " + Word(nums,exact=4))
tok_period = Combine((tok_date_in_ddmmmyyyy + " to " + tok_date_in_ddmmmyyyy)|tok_date_in_ddmmmyyyy)

tok_desc =  Word(alphanums+"-()") but stop before tok_period

How do I do this?

score 5 · Accepted Answer

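I would suggest looking at SkipTo as the most appropriate pyparsing class, since you have a good definition of the unwanted text but will accept pretty much anything before it. Here are a couple of ways to use SkipTo: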

text = """\
12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009"""

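# using tok_period as defined in the OP
from pyparsing import SkipTo

# parse each line separately
for tx in text.splitlines():
    print(SkipTo(tok_period).parseString(tx)[0])

# or have pyparsing search through the whole input string using searchString;
# appending tok_period consumes the period and keeps each result flat:
# tokens[0] is the skipped description, tokens[1] is the period
for tokens in (SkipTo(tok_period) + tok_period).searchString(text):
    print(tokens[0])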

Both for loops print the following:

12 items - Ironing Service    
Washing service (3 Shirt)
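
The question asks for both the description and the period, while the loops above print only the description. Here is a minimal sketch of one way to get both, assuming pyparsing is installed. The names line_expr and extract, and the results names description and period, are illustrative and not part of the answer:

```python
from pyparsing import Combine, SkipTo, Word, alphas, nums

# Date and period tokens as defined in the question.
tok_date_in_ddmmmyyyy = Combine(
    Word(nums, min=1, max=2) + " " + Word(alphas, exact=3) + " " + Word(nums, exact=4))
tok_period = Combine(
    (tok_date_in_ddmmmyyyy + " to " + tok_date_in_ddmmmyyyy) | tok_date_in_ddmmmyyyy)

# Everything before the period is the description; the period follows it.
# (line_expr and the results names are hypothetical, chosen for this sketch.)
line_expr = SkipTo(tok_period)("description") + tok_period("period")


def extract(line):
    """Return (description, period) for one input line."""
    result = line_expr.parseString(line)
    # SkipTo keeps the whitespace before the period, so strip it here.
    return result["description"].strip(), result["period"]


for tx in ["12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009",
           "Washing service (3 Shirt)  23 Mar 2009"]:
    print(extract(tx))
```

A line with no date makes parseString raise a ParseException, so callers that expect such lines would need to catch it.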

score 3 · Accepted Answer

MK Saravanan, this particular parsing problem can also be handled with good ol' re:

import re

text='''
12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009
This line does not match
'''

date_pat=re.compile(
    r'(\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?)')
for line in text.splitlines():
    if line:
        try:
            # str.strip replaces Python 2's string.strip
            description,period=map(str.strip,date_pat.split(line)[:2])
            print((description,period))
        except ValueError:
            # The line does not match
            pass

This yields:

# ('12 items - Ironing Service', '11 Mar 2009 to 10 Apr 2009')
# ('Washing service (3 Shirt)', '23 Mar 2009')

The workhorse here is, of course, the re pattern. Let's break it down:

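\d{1,2}\s+[a-zA-Z]{3}\s+\d{4} is the regular expression for the date, equivalent to tok_date_in_ddmmmyyyy. \d{1,2} matches one or two digits, \s+ matches one or more whitespace characters, [a-zA-Z]{3} matches three letters, and so on.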
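
As a quick check of the date sub-pattern on its own, the following sketch tries it against a few strings with re.fullmatch. The name date_only and the sample strings are illustrative, not from the answer:

```python
import re

# Date part of the answer's pattern, on its own (date_only is an illustrative name).
date_only = re.compile(r'\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}')

samples = ['11 Mar 2009', '3 Apr 2009', '123 Mar 2009', '11 March 2009']

# fullmatch succeeds only if the whole string is a date in this format.
results = {s: bool(date_only.fullmatch(s)) for s in samples}
for s, ok in results.items():
    print(s, '->', ok)
# 11 Mar 2009 -> True
# 3 Apr 2009 -> True
# 123 Mar 2009 -> False
# 11 March 2009 -> False
```

Note that [a-zA-Z]{3} accepts any three letters, so the pattern checks the shape of a date rather than whether the month name is real.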

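(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})? is a regular expression wrapped in (?:...), which marks it as a non-capturing group. No group (e.g. match.group(2)) is assigned to this part of the pattern. This matters because date_pat.split() returns a list in which each captured group is a member. By suppressing the inner group, we keep the whole period 11 Mar 2009 to 10 Apr 2009 together as one item. The trailing question mark says that this part may occur zero times or once, which lets the regular expression match both 23 Mar 2009 and 11 Mar 2009 to 10 Apr 2009.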
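
To see why the non-capturing group matters for split(), here is a small sketch comparing the answer's pattern with a hypothetical variant whose inner group captures. The names non_capturing and capturing and the shortened sample line are illustrative:

```python
import re

line = 'Ironing 11 Mar 2009 to 10 Apr 2009'
date = r'\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}'

# Inner group is non-capturing, as in the answer: split() returns one item for the period.
non_capturing = re.compile(r'(' + date + r'(?:\s+to\s+' + date + r')?)')
# Hypothetical variant with a capturing inner group: split() returns an extra item.
capturing = re.compile(r'(' + date + r'(\s+to\s+' + date + r')?)')

print(non_capturing.split(line))
# ['Ironing ', '11 Mar 2009 to 10 Apr 2009', '']
print(capturing.split(line))
# ['Ironing ', '11 Mar 2009 to 10 Apr 2009', ' to 10 Apr 2009', '']
```

With the capturing variant, taking the first two items would still work here, but the list length now depends on the number of groups, which makes the unpacking more fragile.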

text.splitlines() splits the text on \n.

date_pat.split('12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009')

splits the string on the date_pat regular expression. The match itself is included in the returned list, so we get:

['12 items - Ironing Service ', '11 Mar 2009 to 10 Apr 2009', '']

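map(str.strip,date_pat.split(line)[:2]) tidies up the result by stripping surrounding whitespace (str.strip is the Python 3 equivalent of Python 2's string.strip).

If line does not match date_pat, then date_pat.split(line) returns [line,], so

description,period=map(str.strip,date_pat.split(line)[:2])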

raises a ValueError, because a list with only one element cannot be unpacked into a 2-tuple. We catch this exception and simply move on to the next line.
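
If you would rather not rely on the exception, one alternative is to use search() and test for a match explicitly. This is a sketch of a different technique from the split() approach above, and split_line is a hypothetical helper name:

```python
import re

date_pat = re.compile(
    r'(\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?)')


def split_line(line):
    """Return (description, period), or None if the line contains no date."""
    m = date_pat.search(line)
    if m is None:
        return None
    # Everything before the match is the description; group(1) is the period.
    return line[:m.start()].strip(), m.group(1)


print(split_line('12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009'))
# ('12 items - Ironing Service', '11 Mar 2009 to 10 Apr 2009')
print(split_line('This line does not match'))
# None
```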
