python - 正規表現を使用した文の分割

Question

テキスト (SMS) メッセージがほとんどなく、区切り文字としてピリオド ('.') を使用してセグメント化したいと考えています。次の種類のメッセージを処理できません。Python で正規表現を使用してこれらのメッセージをセグメント化するにはどうすればよいですか。

セグメンテーション前:

「ハイパー カウント 16.8mmol/l.plz レビュー b4 午後 5 時。お知らせします。ありがとうございます」
'ベッド数 8.担当者にお知らせください.tq'

セグメンテーション後:

'hyper count 16.8mmol/l' 'plz review b4 5pm' 'だけお知らせします' 'Thank u'
'ベッド数 8' '担当者にお知らせください' 'tq'

各行は個別のメッセージです

更新しました：

私は自然言語処理を行っていますが、同じように扱っ'16.8mmmol/l'ても大丈夫だと感じています。私には80％の精度で十分ですが、できるだけ 'no of beds 8.2 cups of tea.'減らしたいと思っています。False Positive

score 5 · Accepted Answer

数週間前、私は文字列内の数値を表すすべての文字列をキャッチする正規表現を検索しました。数値が書かれている形式に関係なく、科学表記法であっても、コンマを含むインドの数値であっても、このスレッドを参照してください。

この正規表現を次のコードで使用して、問題を解決します。

他の回答とは対照的に、私のソリューションでは「8.」のドットです。は、ドットの後に数字がない float として読み取ることができるため、分割を行う必要があるドットとは見なされません。

import re

regx = re.compile('(?<![\d.])(?!\.\.)'
                  '(?<![\d.][eE][+-])(?<![\d.][eE])(?<!\d[.,])'
                  '' #---------------------------------
                  '([+-]?)'
                  '(?![\d,]*?\.[\d,]*?\.[\d,]*?)'
                  '(?:0|,(?=0)|(?<!\d),)*'
                  '(?:'
                  '((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  '|\.(0)'
                  '|((?<!\.)\.\d+?)'
                  '|([\d,]+\.\d+?))'
                  '0*'
                  '' #---------------------------------
                  '(?:'
                  '([eE][+-]?)(?:0|,(?=0))*'
                  '(?:'
                  '(?!0+(?=\D|\Z))((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  '|((?<!\.)\.(?!0+(?=\D|\Z))\d+?)'
                  '|([\d,]+\.(?!0+(?=\D|\Z))\d+?))'
                  '0*'
                  ')?'
                  '' #---------------------------------
                  '(?![.,]?\d)')



simpler_regex = re.compile('(?<![\d.])0*(?:'
                           '(\d+)\.?|\.(0)'
                           '|(\.\d+?)|(\d+\.\d+?)'
                           ')0*(?![\d.])')


def split_outnumb(string, regx=regx, a=0):
    excluded_pos = [x for mat in regx.finditer(string) for x in range(*mat.span()) if string[x]=='.']
    li = []
    for xdot in (x for x,c in enumerate(string) if c=='.' and x not in excluded_pos):
        li.append(string[a:xdot])
        a = xdot + 1
    li.append(string[a:])
    return li





for sentence in ('hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u',
                 'no of beds 8.please inform person in-charge.tq',
                 'no of beds 8.2 cups of tea.tarabada',
                 'this number .977 is a float',
                 'numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation',
                 'an indian number 12,45,782.258 in this.sentence and 45,78,325. is another',
                 'no dot in this sentence',
                 ''):
    print 'sentence         =',sentence
    print 'splitted eyquem  =',split_outnumb(sentence)
    print 'splitted eyqu 2  =',split_outnumb(sentence,regx=simpler_regex)
    print 'splitted gurney  =',re.split(r"\.(?!\d)", sentence)
    print 'splitted stema   =',re.split('(?<!\d)\.|\.(?!\d)',sentence)
    print

結果

sentence         = hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u
splitted eyquem  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted eyqu 2  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted gurney  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted stema   = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']

sentence         = no of beds 8.please inform person in-charge.tq
splitted eyquem  = ['no of beds 8.please inform person in-charge', 'tq']
splitted eyqu 2  = ['no of beds 8.please inform person in-charge', 'tq']
splitted gurney  = ['no of beds 8', 'please inform person in-charge', 'tq']
splitted stema   = ['no of beds 8', 'please inform person in-charge', 'tq']

sentence         = no of beds 8.2 cups of tea.tarabada
splitted eyquem  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted eyqu 2  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted gurney  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted stema   = ['no of beds 8.2 cups of tea', 'tarabada']

sentence         = this number .977 is a float
splitted eyquem  = ['this number .977 is a float']
splitted eyqu 2  = ['this number .977 is a float']
splitted gurney  = ['this number .977 is a float']
splitted stema   = ['this number ', '977 is a float']

sentence         = numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation
splitted eyquem  = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted eyqu 2  = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted gurney  = ['numbers 214.21E+45 , 478945', 'E-201 and .12478E+02 are in scientific', 'notation']
splitted stema   = ['numbers 214.21E+45 , 478945', 'E-201 and ', '12478E+02 are in scientific', 'notation']

sentence         = an indian number 12,45,782.258 in this.sentence and 45,78,325. is another
splitted eyquem  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted eyqu 2  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted gurney  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']
splitted stema   = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']

sentence         = no dot in this sentence
splitted eyquem  = ['no dot in this sentence']
splitted eyqu 2  = ['no dot in this sentence']
splitted gurney  = ['no dot in this sentence']
splitted stema   = ['no dot in this sentence']

sentence         = 
splitted eyquem  = ['']
splitted eyqu 2  = ['']
splitted gurney  = ['']
splitted stema   = ['']

編集1

このスレッドの私の投稿から、数値を検出するsimpler_regexを追加しました

インドの数字と科学表記法で数字を検出しませんが、実際には同じ結果が得られます

score 2 · Accepted Answer

負の先読みアサーションを使用して「。」と一致させることができます。数字が続かない場合は、次のように使用re.splitします。

>>> import re
>>> splitter = r"\.(?!\d)"
>>> s = 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
>>> re.split(splitter, s)
['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
>>> s = 'no of beds 8.please inform person in-charge.tq'
>>> re.split(splitter, s)
['no of beds 8', 'please inform person in-charge', 'tq']

score 1 · Accepted Answer

どうですか

re.split('(?<!\d)\.|\.(?!\d)', 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u')

ルックアラウンドにより、一方または他方の側が数字ではないことが保証されます。したがって、これは16.8ケースもカバーします。両側に数字がある場合、この式は分割されません。

score 0 · Accepted Answer

それはあなたの正確な文に依存しますが、試すことができます：

.*?[a-zA-Z0-9]\.(?!\d)

それが機能するかどうかを確認してください。これにより引用符が保持されますが、必要に応じて削除できます。

score -1 · Accepted Answer

-1

"...".split(".")

split特定の文字で文字列を区切る Python 組み込み関数です。

于 2011-07-19T10:21:01.383 に答える

python - 正規表現を使用した文の分割

5 に答える 5

編集1

Related

Reference