java - Foreign-language characters in Android and Java

Question

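I am trying to download and parse a web page that contains foreign-language (Chinese) characters. I don't know whether I should be using "utf-8" or some other encoding, but none of the ones I tried seems to work. I used getUrlContent() from the Wiktionary sample code.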

public void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.main);
    mText = (TextView) findViewById(R.id.textview1);
    huaren.prepareUserAgent(this);
    String test = new String("fail");

    try {
        test = getUrlContent("http://huaren.us/");
    } catch (ApiException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    byte[] b = new byte[100000];

    try {
        b = test.getBytes("utf-8");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    char[] charArr = (new String(b)).toCharArray();
    CharSequence seq = java.nio.CharBuffer.wrap(charArr); 

    mText.setText(charArr, 0, 1000);//.setText(seq);
}

protected static synchronized String getUrlContent(String url) throws ApiException {
    if (sUserAgent == null) {
        throw new ApiException("User-Agent string must be prepared");
    }

    // Create client and set our specific user-agent string
    HttpClient client = new DefaultHttpClient();
    HttpGet request = new HttpGet(url);
    request.setHeader("User-Agent", sUserAgent);

    try {
        HttpResponse response = client.execute(request);

        // Check if server response is valid
        StatusLine status = response.getStatusLine();
        if (status.getStatusCode() != HTTP_STATUS_OK) {
            throw new ApiException("Invalid response from server: " +
                    status.toString());
        }

        // Pull content stream from response
        HttpEntity entity = response.getEntity();
        InputStream inputStream = entity.getContent();

        ByteArrayOutputStream content = new ByteArrayOutputStream();

        // Read response into a buffered stream
        int readBytes = 0;
        while ((readBytes = inputStream.read(sBuffer)) != -1) {
            content.write(sBuffer, 0, readBytes);
        }

        // Return result from buffered stream
        return new String(content.toByteArray(), "utf-8");
    } catch (IOException e) {
        throw new ApiException("Problem communicating with API", e);
    }
}

score 2 · Accepted Answer

The character set is defined in the page itself:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
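To illustrate why decoding this page as UTF-8 fails, here is a minimal sketch. The class name `DecodeDemo` and the helper `decode` are hypothetical, and the example assumes the runtime ships the GB2312 charset. The four bytes are the GB2312 encoding of the two Chinese characters U+4E2D U+6587.

```java
import java.nio.charset.Charset;

class DecodeDemo {
    // Decodes raw response bytes with the named charset (hypothetical helper).
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two Chinese characters as they would arrive from a GB2312 page.
        byte[] raw = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        // Decoding with the charset the page declares recovers the text.
        System.out.println(decode(raw, "GB2312"));

        // Decoding the same bytes as UTF-8 yields replacement characters,
        // because the byte sequence is not valid UTF-8.
        System.out.println(decode(raw, "UTF-8"));
    }
}
```

Once the bytes have been decoded with the wrong charset, re-encoding the resulting String (as the `getBytes("utf-8")` call in the question does) cannot bring the original characters back.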

In general, there are three ways to specify the encoding of an HTML page served over HTTP:

The Content-Type header in the HTTP response

Content-Type: text/html; charset=utf-8

The encoding pseudo-attribute in the XML declaration

<?xml version="1.0" encoding="utf-8" ?>

A meta tag in the head element

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

For more information, see Character encodings.

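So try evaluating each of these possible declarations to find the right encoding. You could start parsing the page as UTF-8 and, if you come across a Content-Type meta tag declaration, restart with the encoding it names.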
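The approach of checking each declaration in turn could be sketched roughly as follows. Everything here is illustrative: the class `CharsetSniffer`, its method names, the 4096-byte scan window, the header-then-XML-then-meta order, and the UTF-8 fallback are all assumptions, not part of the Wiktionary sample. The beginning of the body is first decoded as ISO-8859-1, which maps every byte to one character, so the ASCII markup can be searched before the real encoding is known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // charset=... inside a Content-Type header value
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    // encoding="..." inside an XML declaration
    private static final Pattern XML_ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    // charset=... inside a meta tag
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if none is found.
    static String findCharsetName(String contentType, byte[] body) {
        if (contentType != null) {
            Matcher header = HEADER_CHARSET.matcher(contentType);
            if (header.find()) return header.group(1);
        }
        // Scan only the start of the document; ISO-8859-1 never fails to decode.
        int length = Math.min(body.length, 4096);
        String head = new String(body, 0, length, StandardCharsets.ISO_8859_1);
        Matcher xml = XML_ENCODING.matcher(head);
        if (xml.find()) return xml.group(1);
        Matcher meta = META_CHARSET.matcher(head);
        if (meta.find()) return meta.group(1);
        return null;
    }

    // Decodes the body with the declared charset, falling back to UTF-8.
    static String decode(String contentType, byte[] body) {
        Charset charset = StandardCharsets.UTF_8; // assumed fallback
        String name = findCharsetName(contentType, body);
        try {
            if (name != null && Charset.isSupported(name)) {
                charset = Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Malformed charset name in the page; keep the fallback.
        }
        return new String(body, charset);
    }
}
```

In the question's getUrlContent(), something like this would take the place of the hard-coded `new String(content.toByteArray(), "utf-8")`, given the response's Content-Type header value and the collected bytes.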

score 1 · Accepted Answer

Try the GuessEncoding library. It is not 100% bulletproof, but it helps in many cases.
