I'm trying to open the latest Japanese Wikipedia database dump for reading in Python 3.3.1 on Linux, but I get a Segmentation fault (core dumped) with this short program:
    with open("jawiki-latest-pages-articles.xml") as f:
        text = f.read()
The file itself is quite large:
-rw-r--r-- 1 fredrick users 7368183805 May 17 20:19 jawiki-latest-pages-articles.xml
So it seems there is an upper limit on how long a string I can hold in memory. What's the best way to tackle this?
My end goal is to count the most common characters in the file, sort of like a modern version of Jack Halpern's "Most Commonly Used Kanji in Newspapers". :)
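For what it's worth, here's a rough sketch of the chunked approach I'm considering instead of reading everything at once. The 1 MiB chunk size and the explicit UTF-8 encoding are my own guesses, not anything I've verified against the dump:

    from collections import Counter

    counts = Counter()
    # Read the dump in fixed-size chunks instead of all at once,
    # so memory use stays bounded no matter how big the file is.
    with open("jawiki-latest-pages-articles.xml", encoding="utf-8") as f:
        while True:
            chunk = f.read(1024 * 1024)  # up to 1 MiB of text per read
            if not chunk:
                break
            counts.update(chunk)  # count each character in the chunk

    print(counts.most_common(50))

Is something like this the right way to go, or is there a better-suited approach for a ~7 GB file?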