python - Capture the last occurrence of a tag

Question

My text has the following form:

<Story>
 <Sentence id="1"> some text </Sentence>   
 <Sentence id="2"> some text </Sentence>   
 <Sentence id="3"> some text </Sentence>

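My task is to insert the closing tag </Story> after the last </Sentence>. In the text, every </Sentence> is followed by three spaces. I tried to capture the last </Sentence> with the regular expression </Sentence>(?!.*<Sentence) and also used re.DOTALL, but it is not working.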

The actual code used is:
line = re.sub(re.compile('</Sentence>(?!.*<Sentence)',re.DOTALL),'</Sentence></Story>',line)

Please help. Thanks.

score 3 · Accepted Answer

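Is it your own code that generates the whole file? If so, generate it with an XML library and every tag will be nested correctly. If not, fix the code that generates it so that it produces valid XML.

Regular expressions and XML don't work well together.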
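
As a minimal sketch of the "generate it with an XML library" suggestion, the following uses the standard library's xml.etree.ElementTree. The helper name build_story and the sample sentence texts are assumptions for illustration; the question does not show how the file is actually produced.

```python
import xml.etree.ElementTree as ET

def build_story(sentences):
    """Build a <Story> element holding one <Sentence> per input string.

    Hypothetical helper: the real generator is not shown in the question.
    """
    story = ET.Element("Story")
    for index, text in enumerate(sentences, start=1):
        sentence = ET.SubElement(story, "Sentence", id=str(index))
        sentence.text = text
    return story

story = build_story([" some text ", " some text ", " some text "])
# Serializing the tree closes every tag, so </Story> cannot be forgotten.
xml_text = ET.tostring(story, encoding="unicode")
print(xml_text)
```

Because the closing tags come from the serializer rather than from string concatenation, there is nothing left to patch with a regular expression afterwards.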

score 1 · Accepted Answer

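To do this you should use a parser such as BeautifulSoup. BeautifulSoup tries to parse even badly malformed HTML/XML and make sense of it. Your code would look something like this (I am assuming there are some tags before and after the broken Story tag; if not, you would follow the advice in David's comment):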

from BeautifulSoup import BeautifulStoneSoup

html = '''
<Document>
<PrevTag></PrevTag>
<Story>
 <Sentence id="1"> some text </Sentence>   
 <Sentence id="2"> some text </Sentence>   
 <Sentence id="3"> some text </Sentence>
<EndTag></EndTag>
</Document> 
'''
# Parse the document:
soup = BeautifulStoneSoup(html)

Let's see how BeautifulSoup parsed it:

print soup.prettify()

#<document>
# <prevtag>
# </prevtag>
# <story>
#  <sentence id="1">
#   some text
#  </sentence>
#  <sentence id="2">
#   some text
#  </sentence>
#  <sentence id="3">
#   some text
#  </sentence>
#  <endtag>
#  </endtag>
# </story>
#</document>

Note that BeautifulSoup closed Story just before closing the tag that encloses it (Document). So the closing tag has to be moved up next to the last sentence.

# Find the last sentence:
last_sentence = soup.findAll('sentence')[-1]

# Find the Story tag:
story = soup.find('story')

# Move all tags after the last sentence outside the Story tag:
sib = last_sentence.nextSibling
while sib:
    story.parent.append(sib.extract())
    sib = last_sentence.nextSibling

print soup.prettify()

#<document>
# <prevtag>
# </prevtag>
# <story>
#  <sentence id="1">
#   some text
#  </sentence>
#  <sentence id="2">
#   some text
#  </sentence>
#  <sentence id="3">
#   some text
#  </sentence>
# </story>
# <endtag>
# </endtag>
#</document>

The final result should be exactly what you wanted. Note that this code assumes there is only one Story in the document; if there are more, it will need some small changes. Good luck!

score 0 · Accepted Answer

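Why not match all three (or however many) <Sentence> elements and plug them back in with group references?

import re

# \g<0> is the whole match; \1 is the newline captured by the last repetition.
line = re.sub(r'(?:(\r?\n) *<Sentence.*?</Sentence> *)+',
              r'\g<0>\1</Story>',
              line)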
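
For reference, Python's re.sub writes the whole-match backreference as \g<0> (not $0) and numbered groups as \1. Below is a small sketch of this substitution run against a made-up input modeled on the question; the variable names sample, pattern and result are assumptions for illustration.

```python
import re

# Hypothetical input modeled on the question:
# every </Sentence> is followed by three spaces, and </Story> is missing.
sample = (
    '<Story>\n'
    ' <Sentence id="1"> some text </Sentence>   \n'
    ' <Sentence id="2"> some text </Sentence>   \n'
    ' <Sentence id="3"> some text </Sentence>   \n'
)

# One match spans the whole run of consecutive <Sentence> lines;
# group 1 keeps the newline style seen before the last of them.
pattern = re.compile(r'(?:(\r?\n) *<Sentence.*?</Sentence> *)+')

# Re-emit the run unchanged, then add a newline and the closing tag.
result = pattern.sub(r'\g<0>\1</Story>', sample)
print(result)
```

This relies on each Sentence element sitting on its own line, since .*? does not cross newlines without re.DOTALL.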

score 0 · Accepted Answer

If all you need is to find the last occurrence of the tag, you can do this:

import re

reSentenceClose = re.compile('</Sentence> *')
match = None
for match in reSentenceClose.finditer(your_text):
    pass  # after the loop, match holds the last occurrence (or None)

if match:  # it was found
    print(match.end())  # index in your_text just past the last match
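
Building on that, here is a hedged sketch of how the position of the last match could be used to insert the missing tag. The function name close_story_after_last_sentence and the sample text are assumptions, not part of the original answer.

```python
import re

def close_story_after_last_sentence(text):
    """Return text with </Story> inserted right after the last </Sentence>.

    Trailing spaces after </Sentence> are treated as part of the match,
    so the tag lands after them. Text without any match is returned as is.
    """
    last = None
    for last in re.finditer(r'</Sentence> *', text):
        pass  # keep only the final match
    if last is None:
        return text
    return text[:last.end()] + '</Story>' + text[last.end():]

# Hypothetical input modeled on the question.
sample = (
    '<Story>\n'
    ' <Sentence id="1"> some text </Sentence>   \n'
    ' <Sentence id="2"> some text </Sentence>   \n'
)
fixed = close_story_after_last_sentence(sample)
print(fixed)
```

Slicing at match.end() avoids a second regular expression pass; dropping the " *" from the pattern would place the tag directly after </Sentence> instead.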
