c# - Searching for multibyte strings using RegEx

Question

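I am working with an HTML document using the WebBrowser control. I need to write a utility that searches for a word and highlights it in the browser. It works fine when the string is in English, but it does not seem to work for strings in other languages such as Korean.

The scenario in which the code below works is:

Suppose the user has selected the word "Example" on a web page. I now need to highlight this word and all of its occurrences. I also need to calculate the byteOffset (the code snippet does only that).

For English the code below works correctly, but for languages such as Korean it does not work at all.

Execution never enters the foreach loop: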

foreach (Match m in reg.Matches(this._documentContent))

Here _documentContent contains the web page source as a string, and OccurenceNo is the occurrence number of the selected word in the document.

Here is the code; strTemp contains the Korean string.

string strTemp = myRange.text;
string strExp = @">(([^<])*?)" + strTemp + "(([^<])*?)<";

int intCount = 0;
Regex reg = new Regex(strExp);
Regex reg1 = new Regex(strTemp);
foreach (Match m in reg.Matches(this._documentContent))
{ 
    string strMatch = m.Value;
    foreach (Match m2 in reg.Matches(strMatch))
    { 
        intCount += 1;
        if (intCount==OccurenceNo)
        {
            int intCharOffset = m.Index + m2.Index;
            System.Text.UTF8Encoding d = new System.Text.UTF8Encoding(); 
            int intByteOffset = d.GetBytes( _documentContent.Substring(1, intCharOffset)).Length;
        }
    }
}

score 0 · Accepted Answer

If the code works for English words but returns no results for Korean, it is probably a culture issue, so try setting RegexOptions to CultureInvariant.

Regex reg = new Regex(strExp, RegexOptions.CultureInvariant);
Regex reg1 = new Regex(strTemp, RegexOptions.CultureInvariant);
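For reference, here is a minimal sketch of how the suggestion might be applied to the whole loop from the question. It rests on several assumptions: the class and method names (HighlightOffsets, FindByteOffset) are hypothetical, the inner loop is assumed to be meant for the plain-word regex (reg1 in the question), the selected text is passed through Regex.Escape so that any regex metacharacters in it are matched literally, and the prefix is taken from index 0. As far as I know, CultureInvariant mainly influences case-insensitive comparisons, so it may not be the whole story; treat this as a starting point for debugging rather than a confirmed fix.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class HighlightOffsets
{
    // Returns the UTF-8 byte offset of the n-th (1-based) occurrence of
    // selectedText found inside text nodes (between '>' and '<'),
    // or -1 when there is no such occurrence.
    public static int FindByteOffset(string documentContent, string selectedText, int occurrenceNo)
    {
        // Assumption: the selection should be matched literally.
        string word = Regex.Escape(selectedText);

        var outer = new Regex(">[^<]*?" + word + "[^<]*?<", RegexOptions.CultureInvariant);
        var inner = new Regex(word, RegexOptions.CultureInvariant);

        int count = 0;
        foreach (Match m in outer.Matches(documentContent))
        {
            // Count every occurrence of the word inside this text node.
            foreach (Match m2 in inner.Matches(m.Value))
            {
                count++;
                if (count == occurrenceNo)
                {
                    int charOffset = m.Index + m2.Index;
                    // Bytes needed to encode everything before the match.
                    return Encoding.UTF8.GetByteCount(documentContent.Substring(0, charOffset));
                }
            }
        }
        return -1;
    }
}
```

With the sample input "<p>한국어 예제 예제</p>" and the selection "예제", the first occurrence sits at character offset 7 but at UTF-8 byte offset 13, because each precomposed Hangul syllable takes three bytes. If the loop is still never entered, it may be worth checking whether _documentContent really contains the same characters as myRange.text (for example, HTML entities or a different Unicode normalization form in the page source).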

score 0 · Accepted Answer

I use the following regex code for Korean.

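// Extension methods must be declared in a static class; the class name here is illustrative.
public static class KoreanCharExtensions
{
    private static readonly Regex regexKorean = new Regex(@"[가-힣]");
    public static bool IsKorean(this char s)
    {
        return regexKorean.IsMatch(s.ToString());
    }
}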

if (someText.Any(z => z.IsKorean()))
{
    DoSomething();
}
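The character class [가-힣] spans the precomposed Hangul Syllables block, U+AC00 through U+D7A3, so the same check could also be written as a plain range comparison without a regex. The sketch below shows that variant; the names HangulCheck, IsHangulSyllable and ContainsHangul are hypothetical and not part of the code above.

```csharp
using System;
using System.Linq;

public static class HangulCheck
{
    // True for precomposed Hangul syllables (U+AC00 '가' .. U+D7A3 '힣').
    public static bool IsHangulSyllable(char c)
    {
        return c >= '\uAC00' && c <= '\uD7A3';
    }

    // True when at least one character of the text is a Hangul syllable.
    public static bool ContainsHangul(string text)
    {
        return text.Any(IsHangulSyllable);
    }
}
```

Like the regex version, this only recognizes precomposed syllables; text stored as decomposed jamo (U+1100 and up) would not be detected by either check.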
