python - How do I omit duplicates with pyparsing?

Question

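OK, I finally have a grammar that captures all of my test cases, but I am getting a duplicate (case 3) and a false positive (case 6, "PATTERN 5"). Here are my test cases and my desired output.

I am still fairly new to Python (though able to teach my kids! scary!), so I am sure there is an obvious way to solve this. I am not even sure it is a pyparsing issue. Right now my output looks like this: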

['01/01/01','S01-12345','20/111-22-1001',['GLEASON', ['5', '+', '4'], '=', '9']]
['02/02/02','S02-1234','20/111-22-1002',['GLEASON', 'SCORE', ':', ['3', '+', '3'], '=', '6']]
['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'GRADE', ['4', '+', '3'], '=', '7']]
['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'SCORE', ':', '7', '=', ['4', '+', '3']]]
['04/17/04','S04-123','30/111-22-1004',['GLEASON', 'SCORE', ':', ['3', '+', '4', '-', '7']]]
['05/28/05','S05-1234','20/111-22-1005',['GLEASON', 'SCORE', '7', '[', ['3', '+', '4'], ']']]
['06/18/06','S06-10686','20/111-22-1006',['GLEASON', ['4', '+', '3']]]
['06/18/06','S06-10686','20/111-22-1006',['GLEASON', 'PATTERN', '5']]
['07/22/07','S07-2749','20/111-22-1007',['GLEASON', 'SCORE', '6', '(', ['3', '+', '3'], ')']]

Here is the grammar:

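# names used by the grammar below
from pyparsing import (Word, nums, Combine, Optional, Group, oneOf,
                       operatorPrecedence, opAssoc)

num = Word(nums)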
arith_expr = operatorPrecedence(num,
    [
    (oneOf('-'), 1, opAssoc.RIGHT),
    (oneOf('* /'), 2, opAssoc.LEFT),
    (oneOf('+ -'), 2, opAssoc.LEFT),
    ])
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
score = (Optional(oneOf('( [')) +
         arith_expr('lhs') +
         Optional(oneOf(') ]')) +
         Optional(oneOf('= -')) +
         Optional(oneOf('( [')) +
         Optional(arith_expr('rhs')) +
         Optional(oneOf(') ]')))
gleason = Group("GLEASON" + Optional("SCORE") + Optional("GRADE") + Optional("PATTERN") + Optional(":") + score)
patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
partMatch = patientData("patientData") | gleason("gleason")

And the output function:

lastPatientData = None 
for match in partMatch.searchString(TEXT):
    if match.patientData:
        lastPatientData = match
    elif match.gleason:
        if lastPatientData is None:
            print("bad!")
            continue
        # getParts()
        FOUT.write( "['{0.accDate}','{0.accNum}','{0.patientNum}',{1}]\n".format(lastPatientData.patientData, match.gleason))

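As you can see, the output is not as good as it looks. I am just writing to a file and faking some of the syntax. I have been struggling with how to get hold of pyparsing's intermediate results so that I can work with them. Should I just write this out and run a second script that finds the duplicates?

Update, based on Paul McGuire's answer. The output of this function gets me down to one line per entry, but now I am losing parts of the score. Each Gleason score is, intellectually, of the form primary + secondary = total. This is headed for a database, so pri, sec, and tot are separate PostgreSQL columns (or, for the parser's output, comma-separated values).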

accumPatientData = None
for match in partMatch.searchString(TEXT):
    if match.patientData:
        if accumPatientData is not None:
            # this is a new patient data, print out the accumulated
            # Gleason scores for the previous one
             writeOut(accumPatientData)
        accumPatientData = (match.patientData, [])
    elif match.gleason:
        accumPatientData[1].append(match.gleason)
if accumPatientData is not None:
    writeOut(accumPatientData)

So the output looks like this:

01/01/01,S01-12345,20/111-22-1001,9
02/02/02,S02-1234,20/111-22-1002,6
03/02/03,S03-1234,31/111-22-1003,7,4+3
04/17/04,S04-123,30/111-22-1004,
05/28/05,S05-1234,20/111-22-1005,3+4
06/18/06,S06-10686,20/111-22-1006,,
07/22/07,S07-2749,20/111-22-1007,3+3

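I would like to get back in there and grab some of those lost elements, rearrange them, work out the ones that are missing, and put them all back in. Something like this pseudocode: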

def diceGleason(glrhs,gllhs):
    if glrhs.len() == 0:
        pri = gllhs[0]
        sec = gllhs[2]
        tot = pri + sec
        return [pri, sec, tot]
    elif glrhs.len() == 1:
        pri = gllhs[0]
        sec = gllhs[2]
        tot = glrhs
        return [pri, sec, tot]
    else:
        pri = glrhs[0]
        sec = glrhs[2]
        tot = gllhs
        return [pri, sec, tot]

Update 2: OK, Paul is awesome, but I am dumb. Having tried exactly what he said, I have tried several ways of getting pri, sec, and tot, but I keep failing. I keep getting errors like this:

Traceback (most recent call last):
  File "Stage1.py", line 81, in <module>
    writeOut(accumPatientData)
  File "Stage1.py", line 47, in writeOut
    FOUT.write( "{0.accDate},{0.accNum},{0.patientNum},{1.pri},{1.sec},{1.tot}\n".format( pd, gleasonList))
AttributeError: 'list' object has no attribute 'pri'

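These AttributeErrors are what I keep getting. Clearly I do not understand what is happening in between (Paul, I have the book, I swear it is open in front of me, and I still do not get it). Here is my script. Is something in the wrong place? Am I calling the results wrong?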

score 2 · Accepted Answer

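I did not make a single change to your parser, but I did make a few changes to your post-parsing code.

The problem is that you print out the current patient data every time you see a Gleason score, and some of your patient data records contain more than one Gleason score entry. If I understand what you are trying to do, here is the pseudocode I would follow: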

accumulator = None
foreach match in (patientDataExpr | gleasonScoreExpr).searchString(source):

    if it's a patientDataExpr:
        if accumulator is not None:
            # we are starting a new patient data record, print out the previous one
            print out accumulated data
        initialize new accumulator with current match and empty list for gleason data

    else if it's a gleasonScoreExpr:
        add this expression into the current accumulator

# done with the for loop, do one last printout of the accumulated data
if accumulator is not None:
    print out accumulated data

This converts to Python pretty easily:

def printOut(patientDataTuple):
    pd,gleasonList = patientDataTuple
    print( "['{0.accDate}','{0.accNum}','{0.patientNum}',{1}]".format(
        pd, ','.join(''.join(gl.rhs) for gl in gleasonList)))

accumPatientData = None
for match in partMatch.searchString(TEXT):
    if match.patientData:
        if accumPatientData is not None:
            # this is a new patient data, print out the accumulated 
            # Gleason scores for the previous one
            printOut(accumPatientData)

        # start accumulating for a new patient data entry
        accumPatientData = (match.patientData, [])

    elif match.gleason:
        accumPatientData[1].append(match.gleason)
    #~ print match.dump()

if accumPatientData is not None:
    printOut(accumPatientData)

I do not think this dumps out the Gleason data correctly, but I think you can tune it from here.

EDIT:

You can attach diceGleason as a parse action on gleason to get this behavior:
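To see the accumulate-then-flush pattern on its own, here is a minimal sketch that leaves pyparsing out entirely. The function name group_by_patient and the ("patient", ...) / ("gleason", ...) event tuples are hypothetical stand-ins for the matches returned by searchString, not part of the original code. Unlike the loop above, this sketch also skips a score that shows up before any patient record, which is an added assumption about the desired behavior.

```python
def group_by_patient(events):
    """Yield (patient, scores) exactly once per patient record.

    `events` is a hypothetical stream of ("patient", value) and
    ("gleason", value) tuples standing in for the parser's matches.
    """
    current = None  # (patient, [scores]) currently being accumulated
    for kind, value in events:
        if kind == "patient":
            if current is not None:
                yield current  # a new patient starts: flush the previous one
            current = (value, [])
        elif kind == "gleason":
            if current is None:
                continue  # score seen before any patient record: skip it
            current[1].append(value)
    if current is not None:
        yield current  # flush the last patient after the loop ends


# made-up sample events: the first patient has two score entries
events = [
    ("patient", "S03-1234"),
    ("gleason", "4+3=7"),
    ("gleason", "7=4+3"),
    ("patient", "S04-123"),
    ("gleason", "3+4-7"),
]
rows = list(group_by_patient(events))
```

Each patient comes out once, with all of its scores collected in a list, so the duplicate line for case 3 turns into a single row carrying two scores.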

def diceGleasonParseAction(tokens):
    def diceGleason(glrhs,gllhs):
        if len(glrhs) == 0:
            pri = gllhs[0]
            sec = gllhs[2]
            #~ tot = pri + sec
            tot = str(int(pri)+int(sec))
            return [pri, sec, tot]
        elif len(glrhs) == 1:
            pri = gllhs[0]
            sec = gllhs[2]
            tot = glrhs
            return [pri, sec, tot]
        else:
            pri = glrhs[0]
            sec = glrhs[2]
            tot = gllhs
            return [pri, sec, tot]

    pri,sec,tot = diceGleason(tokens.gleason.rhs, tokens.gleason.lhs)

    # assign results names for later use
    tokens.gleason['pri'] = pri
    tokens.gleason['sec'] = sec
    tokens.gleason['tot'] = tot

gleason.setParseAction(diceGleasonParseAction)

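There was one typo, where you sum pri and sec to get tot: since these are all strings, you were adding '3' and '4' and getting '34'. Converting to ints to do the addition was all that was needed. Otherwise, I left diceGleason as it was, inside diceGleasonParseAction, to keep your logic for inferring pri, sec, and tot separate from the mechanics of decorating the parsed tokens with new results names. Since the parse action does not return anything new, the tokens are updated in place and are then carried along to be used later in your output method.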
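The string-versus-int point is easy to check outside the parser. The following is only a sketch on plain lists of string tokens, not on real ParseResults: dice_gleason and format_rows are hypothetical names, and it assumes that each side of a score arrives as a list such as ["4", "+", "3"] or ["7"], with an empty list when the right-hand side is absent. The traceback in the question shows a format string with {1.pri} being applied to the whole list of scores, so format_rows formats one row per score instead.

```python
def dice_gleason(lhs, rhs):
    """Return (pri, sec, tot) as strings from the two sides of a score.

    Assumption: each side is a list of string tokens, e.g. ["4", "+", "3"]
    or ["7"]; an absent right-hand side is an empty list.
    """
    if len(rhs) == 0:
        # only "pri + sec" was given: compute the total as ints, not strings
        pri, sec = lhs[0], lhs[2]
        tot = str(int(pri) + int(sec))
    elif len(rhs) == 1:
        # "pri + sec = tot"
        pri, sec = lhs[0], lhs[2]
        tot = rhs[0]
    else:
        # "tot = pri + sec"
        pri, sec = rhs[0], rhs[2]
        tot = lhs[0]
    return pri, sec, tot


def format_rows(patient_id, sides):
    """Build one CSV-style line per (lhs, rhs) pair, never one per list."""
    return ["{0},{1},{2},{3}".format(patient_id, *dice_gleason(lhs, rhs))
            for lhs, rhs in sides]


concatenated = "3" + "4"             # string concatenation, not a sum
summed = str(int("3") + int("4"))    # convert first, then add

lines = format_rows("S03-1234", [(["4", "+", "3"], ["7"]),
                                 (["7"], ["4", "+", "3"])])
```

With these inputs both ways of writing the score reduce to the same pri, sec, tot triple, which makes duplicates straightforward to spot before writing rows to the database.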
