python - Find the length of a sentence with English words and Chinese characters

Question

A sentence may contain non-English characters, such as Chinese:

你好,hello world

The expected length is 5 (2 Chinese characters, 2 English words, and 1 comma).

score 2 · Accepted Answer

You can use the fact that most Chinese characters fall in the Unicode range 0x4e00 - 0x9fcc.

import re

s = '你好 hello, world'

# First find all 'normal' (ASCII) words and punctuation.
# '[\x21-\x2f]' includes most punctuation; change it to ',' if you only need to match a comma.
# re.ASCII keeps \w from matching Chinese characters, which are counted separately below.
count = len(re.findall(r'\w+|[\x21-\x2f]', s, flags=re.ASCII))

# Then count every Chinese character as one unit.
for ch in s:
    # see https://stackoverflow.com/a/11415841/1248554 for additional ranges if needed
    if 0x4e00 <= ord(ch) <= 0x9fcc:
        count += 1

print(count)
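
As an alternative sketch, the word pass and the character pass can be folded into a single regular expression that yields one token per Chinese character, ASCII word, or ASCII punctuation mark. The names `TOKEN_RE` and `tokenize` are hypothetical, and the sketch assumes Python 3 and the same 0x4e00 - 0x9fcc range; it is not part of the original answer.

```python
import re

# One alternative per token type (all chosen for illustration):
#   [\u4e00-\u9fcc]  a single Chinese character in the range mentioned above
#   [A-Za-z0-9_]+    a run of ASCII word characters, i.e. one English word
#   [\x21-\x2f]      one ASCII punctuation mark such as ',' or '!'
TOKEN_RE = re.compile(r'[\u4e00-\u9fcc]|[A-Za-z0-9_]+|[\x21-\x2f]')


def tokenize(sentence):
    """Return the tokens found in `sentence`, in order of appearance."""
    return TOKEN_RE.findall(sentence)


tokens = tokenize('你好,hello world')
print(tokens)       # ['你', '好', ',', 'hello', 'world']
print(len(tokens))  # 5
```

Returning the tokens rather than only a count makes it easier to check what was actually matched when the pattern needs tuning.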

score 0 · Accepted Answer

There is a problem with the logic here.

你好
,

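These are all characters, not words. For the Chinese characters, you will probably need to do something with a regex.

The problem is that a Chinese character can be part of a word or a word by itself.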

大好

To a regex, is that one word or two? Each character is a word on its own, but together they also form a single word.
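
A small sketch of that ambiguity, assuming Python 3 and using the range 0x4e00 - 0x9fcc as a rough stand-in for "Chinese character" (an assumption made for illustration): the same string counts as one word or two depending on which pattern you choose. A regex has no dictionary, so neither result is more correct than the other.

```python
import re

text = '大好'

# Treat every run of adjacent Chinese characters as a single word.
as_runs = re.findall(r'[\u4e00-\u9fcc]+', text)

# Treat every Chinese character as its own word.
as_chars = re.findall(r'[\u4e00-\u9fcc]', text)

print(len(as_runs))   # 1
print(len(as_chars))  # 2
```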

hello world

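If you count this by splitting on spaces, you get 2 words, but the same approach may not work for the Chinese part, which has no spaces between words.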

I think the only way to make this work for "words" is to handle the Chinese and the English separately.
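
A minimal sketch of handling the two separately, assuming Python 3, that each Chinese character counts as one unit, and that an English word is a run of ASCII letters. The names `count_parts`, `CHINESE_RE`, `ENGLISH_RE` and `PUNCT_RE` are hypothetical, and the ranges are illustrative rather than complete.

```python
import re

CHINESE_RE = re.compile(r'[\u4e00-\u9fcc]')  # one Chinese character
ENGLISH_RE = re.compile(r'[A-Za-z]+')        # one English word
PUNCT_RE = re.compile(r'[\x21-\x2f]')        # one ASCII punctuation mark


def count_parts(sentence):
    """Count each category on its own and return the counts in a dict."""
    return {
        'chinese': len(CHINESE_RE.findall(sentence)),
        'english': len(ENGLISH_RE.findall(sentence)),
        'punctuation': len(PUNCT_RE.findall(sentence)),
    }


counts = count_parts('你好,hello world')
print(counts)                # {'chinese': 2, 'english': 2, 'punctuation': 1}
print(sum(counts.values()))  # 5
```

Keeping the categories apart means the Chinese rule can later be swapped for a real word segmenter without touching the English rule. Note that with these patterns a contraction such as "don't" would be split into two words plus an apostrophe.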
