
This topic is covered for text-based emoticons in link1, link2, and link3. However, I want to do something slightly different from matching simple emoticons. I'm sorting through tweets that contain emoticon icons. The following Unicode document contains exactly those kinds of emoticons: pdf

I'd like to take strings of English words that also contain these emoticons from the pdf, so that I can compare the number of emoticons to the number of words.

The direction I was heading in doesn't seem to be the best option, so I'm looking for help. As you can see in the script below, I was planning to work from the command line:

$ cat <file containing the strings with emoticons> | ./emo.py

emo.py pseudo-script:

import re
import sys

for row in sys.stdin:
    print row.decode('utf-8').encode("ascii","replace")
    #insert regex to find the emoticons
    if match:
       #do some counting using .split(" ")
       #print the counting

The problem I'm running into is the decoding/encoding. I haven't found the right way to encode/decode the strings so that the icons can be found correctly. Here is an example of a string I want to search to find the word and emoticon counts:

"Smiley emoticon rocks!😀 I like you.😁"

The challenge: can you write a script that counts the number of words and emoticons in this string? Note that both emoticons sit right next to a word, with no space in between.


4 Answers


First, there is no need to encode here at all. You've got a Unicode string, and the re engine can handle Unicode, so just use it.

A character class can include a range of characters, by specifying the first and last with a hyphen in between. And you can specify Unicode characters that you don't know how to type with \U escape sequences. So:

import re

s = u"Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
count = len(re.findall(u'[\U0001f600-\U0001f650]', s))

Or, if the string is big enough that building up the whole findall list seems wasteful:

emoticons = re.finditer(u'[\U0001f600-\U0001f650]', s)
count = sum(1 for _ in emoticons)

You can count the words separately:

wordcount = len(s.split())

If you want to do it all at once, you can use an alternation group:

word_and_emoticon_count = len(re.findall(u'\w+|[\U0001f600-\U0001f650]', s))
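As a quick sanity check, here is that approach end to end on the question's sample string (shown here on Python 3, where every build is wide and the `u` prefixes are unnecessary; the variable names are mine):

```python
import re

s = "Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"

# 2 emoticons in the U+1F600..U+1F650 range
emoticons = len(re.findall('[\U0001f600-\U0001f650]', s))

# 6 words: Smiley, emoticon, rocks, I, like, you
words = len(re.findall(r'\w+', s))

# 8 tokens total, via the alternation group
total = len(re.findall('\\w+|[\U0001f600-\U0001f650]', s))

print(emoticons, words, total)
```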

As @strangefeatures points out, Python versions before 3.3 allowed "narrow Unicode" builds, and most CPython Windows builds, for example, are narrow. In a narrow build, characters can only be in the range U+0000 to U+FFFF. There's no way to search for characters above that range, but that's OK, because they don't exist to be searched for; if you get an "invalid range" error compiling the regexp, you can just assume those characters don't exist.

Except, of course, that there's a good chance that wherever you're getting your actual strings from, they're UTF-16-BE or UTF-16-LE, so the characters do exist, they're just encoded into surrogate pairs. And you want to match those surrogate pairs, right? So you need to translate your search into a surrogate-pair search. That is, convert your high and low code points into surrogate pair code units, then (in Python terms) search for:

(lead == low_lead and lead != high_lead and low_trail <= trail <= 0xDFFF or
 lead == high_lead and lead != low_lead and 0xDC00 <= trail <= high_trail or
 low_lead < lead < high_lead and 0xDC00 <= trail <= 0xDFFF)

You can leave off the second condition in the last case if you're not worried about accepting bogus UTF-16.
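That condition can be spelled out as a small helper (the function names are mine, not from the answer; like the condition above, it assumes low_lead != high_lead):

```python
def to_surrogate_pair(cp):
    """Split an astral code point (U+10000..U+10FFFF) into UTF-16 lead/trail code units."""
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

def pair_in_range(lead, trail, low, high):
    """True if the surrogate pair (lead, trail) encodes a code point in [low, high]."""
    low_lead, low_trail = to_surrogate_pair(low)
    high_lead, high_trail = to_surrogate_pair(high)
    return (lead == low_lead and lead != high_lead and low_trail <= trail <= 0xDFFF or
            lead == high_lead and lead != low_lead and 0xDC00 <= trail <= high_trail or
            low_lead < lead < high_lead and 0xDC00 <= trail <= 0xDFFF)

# U+1F600 encodes as the pair (0xD83D, 0xDE00), which falls inside [U+1E050, U+1FBBF]
```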

If it's not obvious how that translates into regexp, here's an example for the range [\U0001e050-\U0001fbbf] in UTF-16-BE:

(\ud838[\udc50-\udfff])|([\ud839-\ud83d].)|(\ud83e[\udc00-\udfbf])

Of course, if your range is small enough that low_lead == high_lead, this gets simpler. For example, the original question's range can be searched with:

\ud83d[\ude00-\ude50]
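You can sanity-check that pattern even on a wide build: Python 3 strings may contain lone surrogate code points (via chr), and re has understood \uXXXX escapes in patterns since 3.3. A sketch, not part of the original answer:

```python
import re

# U+1F600 stored as two UTF-16 code units, the way a narrow build would see it
units = chr(0xD83D) + chr(0xDE00)   # lead surrogate + trail surrogate

# the pattern matches the surrogate pair...
assert re.search(r'\ud83d[\ude00-\ude50]', units)

# ...but not the single astral code point a wide build stores
assert re.search(r'\ud83d[\ude00-\ude50]', '\U0001f600') is None
```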

One last trick, if you don't actually know whether you're going to get UTF-16-LE or UTF-16-BE (and the BOM is far away from the data you're searching): Because no surrogate lead or trail code unit is valid as a standalone character or as the other end of a pair, you can just search in both directions:

(\ud838[\udc50-\udfff])|([\ud839-\ud83d][\udc00-\udfff])|(\ud83e[\udc00-\udfbf])|
([\udc50-\udfff]\ud838)|([\udc00-\udfff][\ud839-\ud83d])|([\udc00-\udfbf]\ud83e)
Answered 2013-10-03T01:52:32.247