javascript - Extract a substring by UTF-8 byte positions

Question

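I have a string, plus a start and a length with which to extract a substring. Both values (start and length) are byte offsets into the original UTF-8 string.

However, there is a problem:

The start and length are in bytes, so I cannot use "substring". The UTF-8 string contains several multi-byte characters. Is there a hyper-efficient way to do this? (I don't need to decode the bytes...)

Example: var orig = '你好吗？'

To extract the second character (好), the s,e might be 3,3. I'm looking for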

var result = orig.substringBytes(3,3);

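Help!

Update #1: In C/C++ I would just cast it to a byte array, but I'm not sure whether there is an equivalent in JavaScript. BTW, yes, we could parse it into a byte array and parse it back to a string, but it seems there should be a quick way to cut it at the right place. Imagine 'orig' is 1000000 characters, with s = 6 bytes and l = 3 bytes.

Update #2: Thanks to zerkms' helpful redirect, I ended up with the following. It does NOT work right: it handles multi-byte characters correctly but gets single-byte characters wrong.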

function substrBytes(str, start, length)
{
    var ch, startIx = 0, endIx = 0, re = '';
    for (var i = 0; 0 < str.length; i++)
    {
        startIx = endIx++;

        ch = str.charCodeAt(i);
        do {
            ch = ch >> 8;   // a better way may exist to measure ch len
            endIx++;
        }
        while (ch);

        if (endIx > start + length)
        {
            return re;
        }
        else if (startIx >= start)
        {
            re += str[i];
        }
    }
}

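Update #3: I don't think shifting the char code really works. I'm reading two bytes when the correct answer is three... somehow I always forget this. The code point is the same in UTF-8 and UTF-16, but the number of bytes used to encode it depends on the encoding!!! So this is not the right way to do it.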

score 11 · Accepted Answer

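I had a fun time fiddling with this. Hope this helps.

Because JavaScript does not allow direct byte access on a string, the only way to find the start position is a forward scan.

> Update #3: I don't think shifting the char code really works. I'm reading two bytes when the correct answer is three... somehow I always forget this. The code point is the same in UTF-8 and UTF-16, but the number of bytes used to encode it depends on the encoding!!! So this is not the right way to do it.

This is not correct. Actually, there are no UTF-8 strings in JavaScript. According to the ECMAScript 262 (ECMA-262) specification, all strings, regardless of the input encoding, must be stored internally as UTF-16 ("[sequence of] 16-bit unsigned integers").

Considering this, the 8-bit shift is correct (but unnecessary).

The assumption that your character is stored as a 3-byte sequence is wrong.
In fact, every element of a JS (ECMA-262) string is a 16-bit (2-byte) code unit; only characters outside the Basic Multilingual Plane occupy two of them.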
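To make the difference between code units and UTF-8 bytes concrete, here is a small sketch. It assumes an environment that provides the standard TextEncoder (modern browsers, Node.js); the helper name utf8Length is made up for illustration and is not part of the answer's code.

```javascript
// Hypothetical helper: number of bytes a string occupies once encoded as UTF-8.
function utf8Length(s) {
  return new TextEncoder().encode(s).length;
}

// 'a', 'ü' (U+00FC) and '你' (U+4F60), written as escapes to avoid source-encoding issues.
var samples = ['a', '\u00FC', '\u4F60'];

var report = samples.map(function (c) {
  return {
    char: c,
    codeUnits: c.length,      // UTF-16 code units held by the JS string
    code: c.charCodeAt(0),    // numeric value of the first code unit
    utf8Bytes: utf8Length(c)  // bytes in the UTF-8 encoding
  };
});

console.log(report);
```

Each of the three samples is a single code unit in the string, yet they need 1, 2 and 3 bytes respectively in UTF-8, which is why byte offsets cannot be used as string indexes directly.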

The byte-length problem can be worked around by manually converting multi-byte characters to UTF-8, as the code below shows.

See the details explained in my code example.

function encode_utf8( s )
{
  return unescape( encodeURIComponent( s ) );
}

function substr_utf8_bytes(str, startInBytes, lengthInBytes) {

   /* this function scans a multibyte string and returns a substring. 
    * arguments are start position and length, both defined in bytes.
    * 
    * this is tricky, because javascript only allows character level 
    * and not byte level access on strings. Also, all strings are stored
    * in utf-16 internally - so we need to convert characters to utf-8
    * to detect their length in utf-8 encoding.
    *
    * the startInBytes and lengthInBytes parameters are based on byte 
    * positions in a utf-8 encoded string.
    * in utf-8, for example: 
    *       "a" is 1 byte, 
    *       "ü" is 2 bytes, 
    *  and  "你" is 3 bytes.
    *
    * NOTE:
    * according to ECMAScript 262 all strings are stored as a sequence
    * of 16-bit characters. so we need an encode_utf8() function to safely
    * detect the length our character would have in a utf8 representation.
    * 
    * http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf
    * see "4.3.16 String Value":
    * > Although each value usually represents a single 16-bit unit of 
    * > UTF-16 text, the language does not place any restrictions or 
    * > requirements on the values except that they be 16-bit unsigned 
    * > integers.
    */

    var resultStr = '';
    var startInChars = 0;
    var bytePos, ch, end, n; // declared here so they do not leak as implicit globals

    // scan string forward to find index of first character
    // (convert start position in byte to start position in characters)

    for (bytePos = 0; bytePos < startInBytes; startInChars++) {

        // get numeric code of character (is >= 128 for multibyte character)
        // and increase "bytePos" for each byte of the character sequence

        ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str[startInChars]).length;
    }

    // now that we have the position of the starting character,
    // we can build the resulting substring

    // as we don't know the end position in chars yet, we start with a mix of
    // chars and bytes. we decrease "end" by the byte count of each selected 
    // character to end up in the right position
    end = startInChars + lengthInBytes - 1;

    for (n = startInChars; startInChars <= end; n++) {
        // get numeric code of character (is >= 128 for multibyte character)
        // and decrease "end" for each byte of the character sequence
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str[n]).length;

        resultStr += str[n];
    }

    return resultStr;
}

var orig = 'abc你好吗？';

alert('res: ' + substr_utf8_bytes(orig, 0, 2)); // alerts: "ab"
alert('res: ' + substr_utf8_bytes(orig, 2, 1)); // alerts: "c"
alert('res: ' + substr_utf8_bytes(orig, 3, 3)); // alerts: "你"
alert('res: ' + substr_utf8_bytes(orig, 6, 6)); // alerts: "好吗"
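As a variation on the same forward scan, the UTF-8 length of each character can also be derived from its code point by range checks, without encoding anything. The following is only a sketch under stated assumptions: it relies on String.prototype.codePointAt (ES2015), the names utf8LenAt and substrUtf8BytesScan are hypothetical, and unlike the function above it drops a character that only partly fits into the requested byte range.

```javascript
// Hypothetical helper: UTF-8 byte length of the code point that starts at index i.
function utf8LenAt(str, i) {
  var cp = str.codePointAt(i);
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  return 4; // code points above U+FFFF (stored as a surrogate pair)
}

// Assumption: start and length are byte offsets into the UTF-8 form of str.
function substrUtf8BytesScan(str, startInBytes, lengthInBytes) {
  var i = 0, bytePos = 0, len;

  // Skip forward to the first character that begins at or after startInBytes.
  while (i < str.length && bytePos < startInBytes) {
    len = utf8LenAt(str, i);
    bytePos += len;
    i += (len === 4) ? 2 : 1; // a 4-byte character occupies two code units
  }

  var startIx = i;
  var endByte = startInBytes + lengthInBytes;

  // Take whole characters for as long as they fit entirely before endByte.
  while (i < str.length) {
    len = utf8LenAt(str, i);
    if (bytePos + len > endByte) break;
    bytePos += len;
    i += (len === 4) ? 2 : 1;
  }
  return str.slice(startIx, i);
}

var sample = 'abc你好吗？';
console.log(substrUtf8BytesScan(sample, 0, 2)); // "ab"
console.log(substrUtf8BytesScan(sample, 3, 3)); // "你"
console.log(substrUtf8BytesScan(sample, 6, 6)); // "好吗"
```

Working on code points rather than single code units also keeps surrogate pairs together, whereas passing half of a pair to encodeURIComponent throws a URIError.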

score 8 · Accepted Answer

@Kaii's answer is almost correct, but it has a bug: it fails to handle characters whose Unicode code points are from 128 to 255. Here is the revised version (it just changes 256 to 128):

function encode_utf8( s )
{
  return unescape( encodeURIComponent( s ) );
}

function substr_utf8_bytes(str, startInBytes, lengthInBytes) {

   /* this function scans a multibyte string and returns a substring. 
    * arguments are start position and length, both defined in bytes.
    * 
    * this is tricky, because javascript only allows character level 
    * and not byte level access on strings. Also, all strings are stored
    * in utf-16 internally - so we need to convert characters to utf-8
    * to detect their length in utf-8 encoding.
    *
    * the startInBytes and lengthInBytes parameters are based on byte 
    * positions in a utf-8 encoded string.
    * in utf-8, for example: 
    *       "a" is 1 byte, 
    *       "ü" is 2 bytes, 
    *  and  "你" is 3 bytes.
    *
    * NOTE:
    * according to ECMAScript 262 all strings are stored as a sequence
    * of 16-bit characters. so we need an encode_utf8() function to safely
    * detect the length our character would have in a utf8 representation.
    * 
    * http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf
    * see "4.3.16 String Value":
    * > Although each value usually represents a single 16-bit unit of 
    * > UTF-16 text, the language does not place any restrictions or 
    * > requirements on the values except that they be 16-bit unsigned 
    * > integers.
    */

    var resultStr = '';
    var startInChars = 0;
    var bytePos, ch, end, n; // declared here so they do not leak as implicit globals

    // scan string forward to find index of first character
    // (convert start position in byte to start position in characters)

    for (bytePos = 0; bytePos < startInBytes; startInChars++) {

        // get numeric code of character (is >= 128 for multibyte character)
        // and increase "bytePos" for each byte of the character sequence

        ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str[startInChars]).length;
    }

    // now that we have the position of the starting character,
    // we can build the resulting substring

    // as we don't know the end position in chars yet, we start with a mix of
    // chars and bytes. we decrease "end" by the byte count of each selected 
    // character to end up in the right position
    end = startInChars + lengthInBytes - 1;

    for (n = startInChars; startInChars <= end; n++) {
        // get numeric code of character (is >= 128 for multibyte character)
        // and decrease "end" for each byte of the character sequence
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str[n]).length;

        resultStr += str[n];
    }

    return resultStr;
}

var orig = 'abc你好吗？©';

alert('res: ' + substr_utf8_bytes(orig, 0, 2)); // alerts: "ab"
alert('res: ' + substr_utf8_bytes(orig, 2, 1)); // alerts: "c"
alert('res: ' + substr_utf8_bytes(orig, 3, 3)); // alerts: "你"
alert('res: ' + substr_utf8_bytes(orig, 6, 6)); // alerts: "好吗"
alert('res: ' + substr_utf8_bytes(orig, 15, 2)); // alerts: "©"
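The 128 threshold can be checked directly by measuring characters on either side of the UTF-8 length boundaries. This sketch redefines the same encode_utf8 helper so that it runs on its own; the variable names boundaries and byteCounts are just for illustration.

```javascript
// Same helper as above: yields a "binary string" with one character per UTF-8 byte.
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

// Code unit values just below and at each UTF-8 length boundary.
var boundaries = [0x7F, 0x80, 0xFF, 0x100, 0x7FF, 0x800];

var byteCounts = boundaries.map(function (code) {
  return encode_utf8(String.fromCharCode(code)).length;
});

console.log(byteCounts.join(',')); // 1,2,2,2,2,3
```

Everything from 0x80 (128) upward already needs two bytes, so a shortcut that treats all values below 256 as single-byte miscounts characters such as "©" and "ü".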

By the way, this is a bug fix and should be useful to anyone who has the same problem. Why did reviewers reject my suggested edit for changing "too much" or being "too minor"? @Adam Eberlin @Kjuly @Jasonw

score 2 · Accepted Answer

function substrBytes(str, start, length)
{
    // Node.js only: Buffer.from replaces the deprecated "new Buffer(str)".
    var buf = Buffer.from(str, 'utf8');
    return buf.subarray(start, start + length).toString('utf8');
}
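Outside Node.js the same idea can be sketched with the standard TextEncoder and TextDecoder, assuming the environment provides them (modern browsers and Node.js do). The function name substrBytesPortable is hypothetical.

```javascript
// Assumption: start and length are byte offsets into the UTF-8 form of str.
function substrBytesPortable(str, start, length) {
  var bytes = new TextEncoder().encode(str);          // Uint8Array of UTF-8 bytes
  var slice = bytes.subarray(start, start + length);  // a view on the same memory, no copy
  return new TextDecoder('utf-8').decode(slice);
}

console.log(substrBytesPortable('abc你好吗？', 3, 3)); // "你"
console.log(substrBytesPortable('abc你好吗？', 6, 6)); // "好吗"
```

Like the Buffer version, this encodes the whole string first, which is the cost the question hoped to avoid for very long inputs. If the byte range cuts through a multi-byte character, the decoder substitutes U+FFFD replacement characters by default instead of throwing.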


score 0 · Accepted Answer

System.ArraySegment is useful, but you have to construct it with an array input and an offset, and then use its indexer.
