python - How to search for blocks of text in multiple files and write those blocks to another file

Question

Here is an example input file:

<html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
    HERE IS A LOT OF TEXT, THAT IS NOT INTERESTING 
    <br>
      <div id="text"><div id="text-interesting1">11/222-AA</div>
      <h2>This is the title</h2>

            <P>Here is some multiline desc-<br>
                 cription about what is <br><br>
                 going on here
      </div>

      <div id="text2"><div id="text-interesting2">IV-VI</div>
        <br>
        <h1> Some really interesting text</h1>
</body>
</html>

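Now I want to grep several blocks out of this file: for example, the text between <div id="text-interesting1"> and </div>, then between <P> and </div>, then between <div id="text-interesting2"> and </div>, and so on. The important point is that there are several values I want to extract.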

I want to write these values to a file, for example comma-separated. How can I do that?

Based on the example Luke provided, I wrote the following:

import os, re
path = 'C:/Temp/Folder1/allTexts'
listing = os.listdir(path)
for infile in listing:
    text = open(path + '/' + infile).read()
    match = re.search('<div id="text-interesting1">', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</div>', text).start()
    print (text[start:end])


    match = re.search('<h2>', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</h2>', text).start()
    print (text[start:end])


    match = re.search('<P>', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</div>', text).start()
    print (text[start:end])


    match = re.search('<div id="text-interesting2">', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</div>', text).start()
    print (text[start:end])


    match = re.search('<h1>', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</h1>', text).start()
    print (text[start:end])

    print ('--------------------------------------')

The output is:

11/222-AA
This is the title


 Some really interesting text
--------------------------------------
22/4444-AA
22222 This is the title2


22222222222222222222222
--------------------------------------
33/4444-AA
3333 This is the title3


333333333333333333333333
--------------------------------------

Why does part of it not work?

score 1 · Accepted Answer

Here is a start:

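import os
import re

path = 'C:/Temp/Folder1/allTexts'
for infile in os.listdir(path):
    # Read the whole file into memory
    with open(os.path.join(path, infile)) as f:
        text = f.read()
    # Skip files that do not contain the first marker
    match = re.search('<div id="text-interesting1">', text)
    if match is None:
        continue
    start = match.start()
    # The block ends where the second marker begins
    end_match = re.search('<div id="text-interesting2">', text)
    if end_match is None:
        continue
    end = end_match.start()
    print(text[start:end])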
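The blank lines in the question's output come from `re.search('</div>', text)`, which always returns the first `</div>` in the file. For the later blocks that position lies before `start`, so `text[start:end]` is an empty string. A minimal sketch of one way around this, searching for the closing marker only after the opening one and then writing the values as comma-separated rows, might look like the following. The names `extract_between`, `extract_row`, `rows_to_csv`, `MARKERS` and `SAMPLE` are made up for illustration, and the sample text is a shortened copy of the question's input.

```python
import csv
import io

def extract_between(text, start_marker, end_marker):
    """Return the text between the markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    # Search for the end marker only AFTER the start marker; searching from
    # the top of the file can find an earlier '</div>' and give ''.
    end = text.find(end_marker, start)
    if end == -1:
        return None
    # Collapse runs of whitespace so each value fits on one CSV line.
    return " ".join(text[start:end].split())

def extract_row(text, marker_pairs):
    """Extract one value per (start, end) pair; missing blocks become ''."""
    return [extract_between(text, s, e) or "" for s, e in marker_pairs]

def rows_to_csv(rows, out=None):
    """Write rows as CSV to `out` (any text file object) and return it."""
    out = io.StringIO() if out is None else out
    csv.writer(out).writerows(rows)
    return out

# Hypothetical marker list, mirroring the blocks named in the question.
MARKERS = [
    ('<div id="text-interesting1">', '</div>'),
    ('<h2>', '</h2>'),
    ('<P>', '</div>'),
    ('<div id="text-interesting2">', '</div>'),
    ('<h1>', '</h1>'),
]

SAMPLE = (
    '<div id="text"><div id="text-interesting1">11/222-AA</div>\n'
    '<h2>This is the title</h2>\n'
    '<P>Here is some multiline desc-<br>\n cription about what is <br><br>\n'
    ' going on here\n</div>\n'
    '<div id="text2"><div id="text-interesting2">IV-VI</div>\n'
    '<h1> Some really interesting text</h1>\n'
)

row = extract_row(SAMPLE, MARKERS)
print(rows_to_csv([row]).getvalue())
```

To write to a real file instead of an in-memory buffer, collect one row per input file and pass an open file, for example `with open('out.csv', 'w', newline='') as f: rows_to_csv(rows, f)` (the file name is only an example). The `<br>` tags inside the `<P>` block are kept as-is here; stripping them would need an extra step.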

score 0 · Accepted Answer

Another strategy is to parse the file as XML. Strict XML requires matching tags, consistent case, and so on, so you will need to clean up the file first. Here is an example:

import sys
from io import StringIO
from xml.etree import ElementTree

# Parse the XML document read from standard input
tree = ElementTree.ElementTree()
tree.parse(StringIO(sys.stdin.read()))
print("All tags:")
# Walk every element in document order
for e in tree.iter():
    print(e.tag)
    print(e.text)
print("Only div:")
# Tag names are qualified with the XHTML namespace
for i in tree.find("{http://www.w3.org/1999/xhtml}body").findall("{http://www.w3.org/1999/xhtml}div"):
    print(i.text)

Run it against a slightly modified version of the file:

<html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
HERE IS A LOT OF TEXT, THAT IS NOT INTERESTING 
<br></br>
<div id="text"><div id="text-interesting1">11/222-AA</div>
  <h2>This is the title</h2>
  <p>Here is some multiline desc-<br></br>
      cription about what is <br></br><br></br>
    going on here</p>
</div>
  <div id="text-interesting2">IV-VI</div>
      <br></br>
    <h1> Some really interesting text</h1>
</body>
</html>

Example output:

> cat file.xml | ./tb.py 
All tags:
{http://www.w3.org/1999/xhtml}html


{http://www.w3.org/1999/xhtml}head


{http://www.w3.org/1999/xhtml}body

HERE IS A LOT OF TEXT, THAT IS NOT INTERESTING 

{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}div
None
{http://www.w3.org/1999/xhtml}div
11/222-AA
{http://www.w3.org/1999/xhtml}h2
This is the title
{http://www.w3.org/1999/xhtml}p
Here is some multiline desc-
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}div
IV-VI
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}h1
Some really interesting text
Only div:
None
IV-VI
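To pick out only the blocks asked about rather than dumping every tag, ElementTree's limited XPath support can select an element by its `id` attribute, and `itertext()` gathers the text that is split across `<br>` children. The following is a rough sketch under the same assumption as above, namely that the file has already been cleaned up into well-formed XML. The helper names `text_of` and `div_text_by_id` and the shortened `SAMPLE` are invented for illustration.

```python
from xml.etree import ElementTree

XHTML = "{http://www.w3.org/1999/xhtml}"

def text_of(elem):
    """Join all text inside elem (including text after <br/>) on one line."""
    if elem is None:
        return None
    return " ".join("".join(elem.itertext()).split())

def div_text_by_id(root, elem_id):
    """Find the first <div> with the given id anywhere below root."""
    return text_of(root.find(f".//{XHTML}div[@id='{elem_id}']"))

# Shortened, well-formed version of the cleaned-up file.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div id="text"><div id="text-interesting1">11/222-AA</div>
  <h2>This is the title</h2>
  <p>Here is some multiline desc-<br></br>
      cription about what is <br></br><br></br>
    going on here</p>
</div>
<div id="text-interesting2">IV-VI</div>
</body></html>"""

root = ElementTree.fromstring(SAMPLE)
values = [
    div_text_by_id(root, "text-interesting1"),
    text_of(root.find(f".//{XHTML}h2")),
    text_of(root.find(f".//{XHTML}p")),
    div_text_by_id(root, "text-interesting2"),
]
print(values)
```

Each helper returns `None` when the element is absent, so a caller can decide whether to skip the file or write an empty field.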

However, a lot of real-world HTML is hard to parse as strict XML, so this approach can be difficult to put into practice.
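When cleaning the files up is not practical, one possible alternative (not part of the answer above) is the standard library's `html.parser.HTMLParser`, which tolerates unclosed tags and mixed case. The sketch below is only an illustration: the class name `BlockCollector`, the helper `collect` and the rule "collect text from a wanted start tag until the next end tag other than `br`" are assumptions chosen to fit the sample input, not a general HTML extraction method.

```python
from html.parser import HTMLParser

class BlockCollector(HTMLParser):
    """Collect text following wanted tags/ids until the next end tag."""

    def __init__(self, wanted_ids, wanted_tags):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.wanted_tags = set(wanted_tags)   # lower-case tag names
        self.current = None                   # key currently being collected
        self.found = {}

    def handle_starttag(self, tag, attrs):
        elem_id = dict(attrs).get("id")
        if elem_id in self.wanted_ids:
            key = elem_id
        elif tag in self.wanted_tags:
            key = tag
        else:
            return  # e.g. <br>: keep collecting the current block
        self.current = key
        self.found[key] = []

    def handle_endtag(self, tag):
        if tag != "br":          # a stray </br> should not end a block
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.found[self.current].append(data)

def collect(html, wanted_ids, wanted_tags):
    parser = BlockCollector(wanted_ids, wanted_tags)
    parser.feed(html)
    parser.close()
    # Collapse whitespace so every value is a single line.
    return {k: " ".join("".join(v).split()) for k, v in parser.found.items()}

SAMPLE = """<body>
  <div id="text"><div id="text-interesting1">11/222-AA</div>
  <h2>This is the title</h2>
  <P>Here is some multiline desc-<br>
     cription about what is <br><br>
     going on here
  </div>
  <div id="text2"><div id="text-interesting2">IV-VI</div>
  <h1> Some really interesting text</h1>
</body>"""

result = collect(SAMPLE, ["text-interesting1", "text-interesting2"],
                 ["h1", "h2", "p"])
print(result)
```

`HTMLParser` lower-cases tag names, which is why the `<P>` block is requested as `"p"`.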
