
I'm trying to open the latest Japanese Wikipedia database for reading in Python 3.3.1 on Linux, but am getting a Segmentation fault (core dumped) error with this short program:

with open("jawiki-latest-pages-articles.xml") as f:
    text = f.read()

The file itself is quite large:

-rw-r--r-- 1 fredrick users 7368183805 May 17 20:19 jawiki-latest-pages-articles.xml

So it seems like there is an upper limit to just how long a string I can store. What's the best way to tackle this situation?

My end goal is to count the most common characters in the file, sort of like a modern version of Jack Halpern's "Most Commonly Used Kanji in Newspapers". :)


2 Answers


In case anyone is interested, here is the program I ended up using.

from collections import Counter

counter = Counter()

with open("jawiki-latest-pages-articles.xml") as f:
    # Iterate line by line so the whole 7 GB file is never held in memory at once.
    for progress, line in enumerate(f, 1):
        # Counter.update() on a string counts each character in it.
        counter.update(line)
        if not progress % 10000:
            print("Processing line {0}..., number {1}".format(line[:10], progress))

# Write one "character<TAB>count" pair per line.
with open("output.txt", "w") as output:
    for k, v in counter.items():
        print("{0}\t{1}".format(k, v), file=output)
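
Since the end goal is the most common characters, the output can also be written already sorted by frequency using Counter.most_common(). A minimal sketch, reusing the counter built above (the file name output_sorted.txt is just an example):

# Write characters from most to least frequent, reusing `counter` from the loop above.
with open("output_sorted.txt", "w") as output:
    for char, count in counter.most_common():
        print("{0}\t{1}".format(char, count), file=output)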
answered 2013-05-18T03:59:18.983