eclipse - Nutch does not retrieve UTF-8 characters

Question

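I use Nutch to crawl pages, and I save the content to a separate file from within the Fetcher class before indexing, so I do not use -readseg to retrieve the content from the index files. However, special characters such as "ü" and "ç" are saved as "?".

I have done everything recommended on the Nutch wiki page. I also changed the encoding attribute of the tag I edited to UTF-8, but it still does not work. I came across some recommendations about changing the language settings in system files. I am working on Ubuntu 11.10.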

score 1 · Accepted Answer

There are 3 possibilities that I can think of:

1. Nutch works fine and your code writes the content to files correctly, but your environment (terminal/editor) is not displaying the characters properly.
2. Your code for writing out the content (crawled by Nutch) is not taking care of UTF-8 encoding.
3. Nutch is not handling UTF-8 encoding correctly.

I once crawled pages containing Chinese characters with Nutch and saw garbage characters in the readseg output (this was with Nutch 1.0). Later, after I installed some language plugins and tweaked the terminal settings, I could see the characters. So I think #3 is unlikely, and you should focus on #1 and #2.
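To tell #1 and #2 apart, it can help to look at the bytes in the file rather than at what the terminal shows. The following is only a minimal sketch in plain Java, not Nutch code: the class name EncodingCheck and the helpers writeText and hexDump are made up for illustration. It writes the same two characters once with an explicit UTF-8 charset and once with US-ASCII, then prints the bytes as hex. A "?" that is really in the file shows up as the byte 3F, which points to the writing code (#2); if the file holds proper UTF-8 bytes, the problem is more likely the display (#1).

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodingCheck {

    // Writes text with an explicitly chosen charset instead of the platform default.
    static void writeText(Path file, String text, Charset charset) throws IOException {
        try (Writer w = new OutputStreamWriter(Files.newOutputStream(file), charset)) {
            w.write(text);
        }
    }

    // Renders the raw bytes of a file as hex, so the result does not depend
    // on how the terminal or editor decodes them.
    static String hexDump(Path file) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (byte b : Files.readAllBytes(file)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String sample = "\u00FC\u00E7"; // the characters ü and ç
        Path good = Files.createTempFile("utf8-", ".txt");
        Path bad = Files.createTempFile("ascii-", ".txt");

        writeText(good, sample, StandardCharsets.UTF_8);
        writeText(bad, sample, StandardCharsets.US_ASCII);

        System.out.println("UTF-8:    " + hexDump(good)); // C3 BC C3 A7
        System.out.println("US-ASCII: " + hexDump(bad));  // 3F 3F, i.e. "??"
    }
}
```

If your own writer is created without a charset argument, it uses the platform default, which may not be UTF-8 on your machine; passing the charset explicitly, as in writeText, removes that dependency.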

score 0 · Accepted Answer

Following your advice, I decided to modify the Fetcher class and add support for saving the content directly to a MySQL database. It works much better and faster.

score 0 · Accepted Answer

I think I have solved the encoding problem. See the code below.

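import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.util.HashMap;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.segment.SegmentReader;

// Enable every segment part when reading.
boolean co = true;
boolean fe = true;
boolean ge = true;
boolean pa = true;
boolean pd = true;
boolean pt = true;
SegmentReader segmentReader2 = new SegmentReader(crwlNutchCommon.nutch_conf, co, fe, ge, pa, pd, pt);
HashMap<String, List<Writable>> hm = new HashMap<String, List<Writable>>();
// Dump the records for one URL as UTF-8 text; hm receives the records themselves.
segmentReader2.get(path, new Text("some_url"),
        new OutputStreamWriter(new FileOutputStream("somefile1"), "UTF-8"), hm);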

The encoding of the file somefile1 is probably wrong, but read on.

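import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;

// Write the first content record in serialized form, without going through a Writer.
File file = new File("somefile2");
try (DataOutputStream dos = new DataOutputStream(new FileOutputStream(file))) {
    hm.get("co").get(0).write(dos); // this is the important line
}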

And it works! somefile2 comes out "raw", with no encoding changes, but it has some extra data at the beginning and the end. I think it can be parsed out by analyzing the "Content.java" source file.
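The reason the second file keeps its characters is probably that the bytes go straight to the stream and no charset conversion happens on the way. Here is a small sketch of that idea in plain Java, independent of Nutch: the class RawCopy, the helper writeRaw and the sample bytes are all made up for illustration, and the byte array is only a stand-in for whatever page bytes you pull out of the content record.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class RawCopy {

    // Writes the bytes exactly as given; no charset is involved,
    // so nothing can be re-encoded or replaced by "?".
    static void writeRaw(byte[] pageBytes, Path out) throws IOException {
        Files.write(out, pageBytes);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for the fetched page bytes.
        byte[] pageBytes = "<p>\u00FC\u00E7</p>".getBytes(StandardCharsets.UTF_8);

        Path out = Files.createTempFile("raw-", ".html");
        writeRaw(pageBytes, out);

        // Reading the file back gives the same bytes that went in.
        byte[] readBack = Files.readAllBytes(out);
        System.out.println("identical: " + Arrays.equals(pageBytes, readBack));

        // Decoding is a separate step, and the charset has to match the page.
        System.out.println(new String(readBack, StandardCharsets.UTF_8));
    }
}
```

A charset only comes into play when the bytes are turned into a String, so keeping the data as bytes until then avoids the "?" replacement altogether.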
