
I have a file that is divided into segments by blank lines ("\n"), and the number of lines per segment is unknown. A sample of the file looks like this:

800004
The London and North-Western's Euston Station was first, but at the eastern end of Euston Road the Great Northern constructed their King's Cross terminal. 
Initially the Midland Railway ran into King's Cross but a quarrel over access led them to construct next door to King's Cross their St Pancras terminal, which was topped by a statue of Britannia, a <tag "510285">calculated</> snook-cocking exercise because Britannia was the company emblem of the Midland's hated rival, the London and North-Western. 

800005
GROWTH in Malaysia's gross domestic product this year is expected to be 8.5 per cent.
Nearly two percentage points higher than the Treasury's estimate, Bank Negara, the central bank, reported yesterday. 
Last year's growth, <tag "510270">calculated</> by the bank, was 8.7 per cent, compared with 7.6 per cent by the Treasury.   

800006
He was a Catholic. 
When he visited the Pope, even then, he couldn't help <tag "510270">calculating</> the Pope's worldly riches (life-proprietor of the Sistine Chapel, landlord of the Vatican and contents &ellip. ). 

Is there a simple way to get the segments from the text file?

This is how I have been doing it:

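doc = []
segments = []
with open(trainfile) as f:
    for line in f:
        if line == "\n":
            doc.append(segments)
            segments = []
        else:
            segments.append(line.strip())
# keep the last segment when the file does not end with a blank line
if segments:
    doc.append(segments)

for i in doc:
    print(i)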

3 Answers


Use a generator function:

def per_section(it):
    section = []
    for line in it:
        if line.strip():
            section.append(line)
        else:
            yield ''.join(section)
            section = []
    # yield any remaining lines as a section too
    if section:
        yield ''.join(section)

This yields each section, delimited by blank lines, as a single string:

with open(sectionedfile, 'r') as inputfile:
    for section in per_section(inputfile):
        print(section)
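
If each segment is wanted as a list of stripped lines (the shape the question's own loop builds) rather than one joined string, the same idea can be sketched with itertools.groupby. The function name segments_as_lists and the sample list are made up for illustration; this is one possible variant, not part of the answer above. Runs of several blank lines are collapsed, so no empty segments are produced.

```python
from itertools import groupby


def segments_as_lists(lines):
    """Yield each run of consecutive non-blank lines as a list of stripped strings."""
    # groupby splits the input into alternating runs of blank and non-blank lines
    for has_text, group in groupby(lines, key=lambda line: bool(line.strip())):
        if has_text:
            yield [line.strip() for line in group]


# Hypothetical sample standing in for an open file object
sample = ["800004\n", "first line \n", "\n", "\n", "800005\n", "second line\n"]
result = list(segments_as_lists(sample))
```

Because an open file is itself an iterable of lines, it can be passed in place of sample without reading the whole file into memory.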
answered 2013-06-05T13:22:52.933

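If the file is not large, you can also use str.split to split it on '\n\n'.

If the file is huge, use the method suggested by @Martijn Pieters instead, since reading the file in one go loads all of it into memory.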

with open('abc') as f:
    data = f.read()
segments = data.split('\n\n')

for x in segments:
    print('--->', x)

Output:

---> 800004
The London and North-Western's Euston Station was first, but at the eastern end of Euston Road the Great Northern constructed their King's Cross terminal. 
Initially the Midland Railway ran into King's Cross but a quarrel over access led them to construct next door to King's Cross their St Pancras terminal, which was topped by a statue of Britannia, a <tag "510285">calculated</> snook-cocking exercise because Britannia was the company emblem of the Midland's hated rival, the London and North-Western. 
---> 800005
GROWTH in Malaysia's gross domestic product this year is expected to be 8.5 per cent.
Nearly two percentage points higher than the Treasury's estimate, Bank Negara, the central bank, reported yesterday. 
Last year's growth, <tag "510270">calculated</> by the bank, was 8.7 per cent, compared with 7.6 per cent by the Treasury.   
---> 800006
He was a Catholic. 
When he visited the Pope, even then, he couldn't help <tag "510270">calculating</> the Pope's worldly riches (life-proprietor of the Sistine Chapel, landlord of the Vatican and contents &ellip. ). 
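
Splitting on the exact string '\n\n' assumes exactly one empty line between segments. If the separator may be several blank lines, or lines containing only spaces, a regular-expression split is a possible alternative. The sketch below is a variant under that assumption, and split_segments and the text string are hypothetical names used only for illustration.

```python
import re


def split_segments(text):
    """Split text on runs of blank (or whitespace-only) lines."""
    # A separator is a newline, optional whitespace, then another newline
    parts = re.split(r'\n\s*\n', text.strip())
    # Drop the single empty string produced when the input is empty
    return [part for part in parts if part]


# Hypothetical input with a double blank line and a whitespace-only line
text = "800004\nline a \nline b\n\n\n800005\nline c\n \n800006\nline d\n"
segments = split_segments(text)
```

Like the str.split version, this still needs the whole file contents in memory.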
answered 2013-06-05T13:27:29.020