python - Elegant structured text file parsing

Question

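I need to parse a transcript of a live chat conversation. My first thought on seeing the file was to throw regular expressions at the problem, but I was wondering what other approaches people have used.

I put "elegant" in the title because I have previously found that this type of task risks becoming hard to maintain when it relies on regular expressions alone.

The transcripts are generated by www.providesupport.com and emailed to an account; I then extract the plain-text transcript attachment from the email.

The reason for parsing the file is to extract the conversation text for later use, and also to identify the visitor's and operator's names so that the information can be made available via a CRM.

Here is an example of a transcript file: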

Chat Transcript

Visitor: Random Website Visitor 
Operator: Milton
Company: Initech
Started: 16 Oct 2008 9:13:58
Finished: 16 Oct 2008 9:45:44

Random Website Visitor: Where do i get the cover sheet for the TPS report?
* There are no operators available at the moment. If you would like to leave a message, please type it in the input field below and click "Send" button
* Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.
Milton: Y-- Excuse me. You-- I believe you have my stapler?
Random Website Visitor: I really just need the cover sheet, okay?
Milton: it's not okay because if they take my stapler then I'll, I'll, I'll set the building on fire...
Random Website Visitor: oh i found it, thanks anyway.
* Random Website Visitor is now off-line and may not reply. Currently in room: Milton.
Milton: Well, Ok. But… that's the last straw.
* Milton has left the conversation. Currently in room:  room is empty.

Visitor Details
---------------
Your Name: Random Website Visitor
Your Question: Where do i get the cover sheet for the TPS report?
IP Address: 255.255.255.255
Host Name: 255.255.255.255
Referrer: Unknown
Browser/OS: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)

score 12 · Accepted Answer

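No, and in fact, for the specific type of task you describe, I doubt there is a "cleaner" way to do it than regular expressions. Your files appear to have embedded line breaks, so what we typically do here is make the line the unit of decomposition and apply per-line regexes. Meanwhile, you create a small state machine and use regex matches to trigger transitions in that state machine. This way you know where you are in the file and what kind of character data to expect. Also, consider using named capture groups and loading the regexes from an external file. That way, if the format of the transcript changes, you only need to tweak a regex rather than write new parse-specific code.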
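The approach above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the state names, the PATTERNS table and the parse_transcript function are all made up for the example, and the patterns are guessed from the single sample transcript in the question. The table is kept separate from the code so that it could just as well be loaded from an external file.

```python
import re

# Hypothetical pattern table (names are illustrative). It could equally be
# loaded from an external file so that a format change only means editing patterns.
PATTERNS = {
    "title": re.compile(r"^Chat Transcript$"),
    "footer_title": re.compile(r"^Visitor Details$"),
    "server": re.compile(r"^\* (?P<text>.*)$"),
    "field": re.compile(r"^(?P<name>[^:]+):\s*(?P<value>.*?)\s*$"),
}


def parse_transcript(lines):
    """Walk the lines once; regex matches and blank lines trigger state changes."""
    state = "start"
    header, messages, footer = {}, [], {}
    for raw in lines:
        line = raw.rstrip("\n")
        if state == "start":
            if PATTERNS["title"].match(line):
                state = "header"
        elif state == "header":
            m = PATTERNS["field"].match(line)
            if m:
                header[m.group("name")] = m.group("value")
            elif not line.strip() and header:
                # A blank line after at least one header field ends the header.
                state = "conversation"
        elif state == "conversation":
            if PATTERNS["footer_title"].match(line):
                state = "footer"
                continue
            m = PATTERNS["server"].match(line)
            if m:
                messages.append(("Server", m.group("text")))
                continue
            m = PATTERNS["field"].match(line)
            speakers = (header.get("Visitor"), header.get("Operator"))
            # Only the visitor and operator named in the header may speak;
            # any other line in this section is ignored by this sketch.
            if m and m.group("name") in speakers:
                messages.append((m.group("name"), m.group("value")))
        elif state == "footer":
            m = PATTERNS["field"].match(line)
            if m:
                footer[m.group("name")] = m.group("value")
    return header, messages, footer
```

Since a file object iterates over its lines, something like parse_transcript(open('chat.log')) should return the header fields, the list of (speaker, text) pairs and the footer fields. Lines in the conversation that match nothing (for example a continuation of a multi-line message) are silently dropped here, which a real parser would probably want to handle.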

score 11 · Accepted Answer

With Perl, you can use Parse::RecDescent.

It is simple, and your grammar will be maintainable later on.

score 8 · Accepted Answer

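You might want to consider a full parser generator.

Regular expressions are good for searching text for small substrings, but they are woefully underpowered if you are really interested in parsing the entire file into meaningful data.

They are especially insufficient when the context of the substring matters.

Most people throw regexes at everything because that is what they know. They have never learned any parser-generating tools, and they end up hand-coding much of the production rule composition and semantic action handling that a parser generator gives you for free.

Regular expressions are great and all, but if you need a parser they are no substitute.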
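To make "production rule composition" and "semantic action handling" a bit more concrete, here is a rough hand-rolled sketch in plain Python. It is not a parser generator and not any particular library's API; the helpers token, seq, alt, many and action are invented for this illustration, and the small grammar only covers the conversation part of the sample transcript. A real parser generator would supply this plumbing (plus error reporting and much more) for you.

```python
import re


def token(pattern):
    """Rule that matches a regex at the current position and yields group 1."""
    rx = re.compile(pattern)

    def rule(text, pos):
        m = rx.match(text, pos)
        return (m.group(1), m.end()) if m else None
    return rule


def seq(*rules):
    """Rule that applies several rules one after another."""
    def rule(text, pos):
        values = []
        for r in rules:
            result = r(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return rule


def alt(*rules):
    """Rule that returns the result of the first alternative that matches."""
    def rule(text, pos):
        for r in rules:
            result = r(text, pos)
            if result is not None:
                return result
        return None
    return rule


def many(r):
    """Rule that applies r repeatedly until it fails or stops consuming input."""
    def rule(text, pos):
        values = []
        while True:
            result = r(text, pos)
            if result is None or result[1] == pos:
                return values, pos
            value, pos = result
            values.append(value)
    return rule


def action(r, fn):
    """Attach a semantic action: transform the value produced by rule r."""
    def rule(text, pos):
        result = r(text, pos)
        if result is None:
            return None
        value, pos = result
        return fn(value), pos
    return rule


# Production rules for the conversation section (guessed from the sample).
server = action(token(r"\* (.*)\n"), lambda text: ("Server", text))
speaker = token(r"([^:\n*][^:\n]*):[ \t]*")
said = token(r"(.*)\n")
message = action(seq(speaker, said), tuple)
conversation = many(alt(server, message))
```

Calling conversation(text, 0) returns a pair of the parsed (speaker, text) tuples and the position where parsing stopped. The point is only that rules are built out of other rules, and that each rule carries its own action for turning matched text into data.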

score 6 · Accepted Answer

Here are two parsers based on the lepl parser generator library. They both produce the same result.

from pprint import pprint
from lepl import AnyBut, Drop, Eos, Newline, Separator, SkipTo, Space

# field = name , ":" , value
name, value = AnyBut(':\n')[1:,...], AnyBut('\n')[::'n',...]    
with Separator(~Space()[:]):
    field = name & Drop(':') & value & ~(Newline() | Eos()) > tuple

header_start   = SkipTo('Chat Transcript' & Newline()[2])
header         = ~header_start & field[1:] > dict
server_message = Drop('* ') & AnyBut('\n')[:,...] & ~Newline() > 'Server'
conversation   = (server_message | field)[1:] > list
footer_start   = 'Visitor Details' & Newline() & '-'*15 & Newline()
footer         = ~footer_start & field[1:] > dict
chat_log       = header & ~Newline() & conversation & ~Newline() & footer

pprint(chat_log.parse_file(open('chat.log')))

A stricter parser:

from functools import reduce  # built in on Python 2, must be imported on Python 3
from pprint import pprint
from lepl import And, Drop, Newline, Or, Regexp, SkipTo

def Field(name, value=Regexp(r'\s*(.*?)\s*?\n')):
    """'name , ":" , value' matcher"""
    return name & Drop(':') & value > tuple

Fields = lambda names: reduce(And, map(Field, names))

header_start   = SkipTo(Regexp(r'^Chat Transcript$') & Newline()[2])
header_fields  = Fields("Visitor Operator Company Started Finished".split())
server_message = Regexp(r'^\* (.*?)\n') > 'Server'
footer_fields  = Fields(("Your Name, Your Question, IP Address, "
                         "Host Name, Referrer, Browser/OS").split(', '))

with open('chat.log') as f:
    # parse header to find Visitor and Operator's names
    headers, = (~header_start & header_fields > dict).parse_file(f)
    # only Visitor, Operator and Server may take part in the conversation
    message = reduce(Or, [Field(headers[name])
                          for name in "Visitor Operator".split()])
    conversation = (message | server_message)[1:]
    messages, footers = ((conversation > list)
                         & Drop('\nVisitor Details\n---------------\n')
                         & (footer_fields > dict)).parse_file(f)

pprint((headers, messages, footers))

Output:

({'Company': 'Initech',
  'Finished': '16 Oct 2008 9:45:44',
  'Operator': 'Milton',
  'Started': '16 Oct 2008 9:13:58',
  'Visitor': 'Random Website Visitor'},
 [('Random Website Visitor',
   'Where do i get the cover sheet for the TPS report?'),
  ('Server',
   'There are no operators available at the moment. If you would like to leave a message, please type it in the input field below and click "Send" button'),
  ('Server',
   'Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.'),
  ('Milton', 'Y-- Excuse me. You-- I believe you have my stapler?'),
  ('Random Website Visitor', 'I really just need the cover sheet, okay?'),
  ('Milton',
   "it's not okay because if they take my stapler then I'll, I'll, I'll set the building on fire..."),
  ('Random Website Visitor', 'oh i found it, thanks anyway.'),
  ('Server',
   'Random Website Visitor is now off-line and may not reply. Currently in room: Milton.'),
  ('Milton', "Well, Ok. But… that's the last straw."),
  ('Server',
   'Milton has left the conversation. Currently in room:  room is empty.')],
 {'Browser/OS': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)',
  'Host Name': '255.255.255.255',
  'IP Address': '255.255.255.255',
  'Referrer': 'Unknown',
  'Your Name': 'Random Website Visitor',
  'Your Question': 'Where do i get the cover sheet for the TPS report?'})

score 5 · Accepted Answer

Build a parser? I can't tell whether your data is regular enough for that, but it might be worth looking into.

score 4 · Accepted Answer

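Using multiline, commented regexes can mitigate the maintenance problem somewhat. Try to avoid the one-line super regex!

Also, consider breaking the regex into individual tasks, one for each "thing" you want to extract. For example: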

visitor = text.find(/Visitor:(.*)/)
operator = text.find(/Operator:(.*)/)
body = text.find(/whatever....)

rather than

text.match(/Visitor:(.*)\nOperator:(.*)...whatever to giant regex/m) do
  visitor = $1
  operator = $2
  etc.
end

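Then it will be easy to change how any particular item is parsed. As for parsing a file that contains many "chat blocks", use a single simple regex that matches one chat block, iterate over the text, and pass each match to your group of other matchers.

This will obviously affect performance, but unless you process enormous files I wouldn't worry.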
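In Python, the same idea might look something like the sketch below, using re.VERBOSE for multiline, commented patterns. The pattern names and the first_group and parse_blocks helpers are hypothetical, and the assumption that every block starts with a "Chat Transcript" line is taken from the sample transcript only.

```python
import re

FLAGS = re.VERBOSE | re.MULTILINE

# One small, commented pattern per item of interest.
VISITOR = re.compile(r"""
    ^Visitor:        # field label at the start of a line
    [ \t]*           # optional padding after the colon
    (.*?)            # the visitor's name
    [ \t]*$          # ignore trailing whitespace
""", FLAGS)

OPERATOR = re.compile(r"""
    ^Operator:       # field label at the start of a line
    [ \t]*           # optional padding after the colon
    (.*?)            # the operator's name
    [ \t]*$          # ignore trailing whitespace
""", FLAGS)

# A single simple pattern that matches one whole chat block.
CHAT_BLOCK = re.compile(r"""
    ^Chat[ ]Transcript$              # assumed: every block starts with this title
    .*?                              # the block body, as short as possible
    (?=^Chat[ ]Transcript$ | \Z)     # stop before the next title or at end of text
""", FLAGS | re.DOTALL)


def first_group(pattern, text):
    """Return the first captured group of the first match, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None


def parse_blocks(text):
    """Split the text into chat blocks and run the small matchers on each one."""
    results = []
    for block_match in CHAT_BLOCK.finditer(text):
        block = block_match.group(0)
        results.append({
            "visitor": first_group(VISITOR, block),
            "operator": first_group(OPERATOR, block),
        })
    return results
```

Each matcher can then be changed on its own, and adding a new field is just one more small pattern. Note that this sketch would be confused by a chat message that consists of exactly the text "Chat Transcript" on its own line.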

score 2 · Accepted Answer

Consider using Ragel: https://www.colm.net/open-source/ragel/

It is what powers Mongrel under the hood. Parsing a string multiple times will slow things down dramatically.

score 2 · Accepted Answer

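I have used Paul McGuire's pyParsing class library, and I continue to be impressed by it: it is well documented, easy to get started with, and its rules are easy to tweak and maintain. By the way, the rules are expressed in your Python code. It certainly appears that the log file is regular enough to parse each line as a stand-alone unit.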

score 0 · Accepted Answer

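Just a quick post; I have only glanced at your transcript example, but I recently had to look into text parsing as well and hoped to avoid going the route of hand-rolled parsing. I came across Ragel, which I have only just started to get my head around, but it looks pretty useful.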
