python - Is there an elegant way to get hashtags out of a string in Python?

Question

I'm looking for a clean way to get a set (list, array, whatever) of the words starting with # inside a given string.

In C#, I would write

var hashtags = input
    .Split (' ')
    .Where (s => s[0] == '#')
    .Select (s => s.Substring (1))
    .Distinct ();

What would be comparatively elegant code to do this in Python?

Edit

Sample input: "Hey guys! #stackoverflow really #rocks #rocks #announcement"
Expected output: ["stackoverflow", "rocks", "announcement"]

score 24 · Accepted Answer

Building on @inspectorG4dget's answer, if you don't want duplicates you can use a set comprehension instead of a list comprehension.

>>> tags="Hey guys! #stackoverflow really #rocks #rocks #announcement"
>>> {tag.strip("#") for tag in tags.split() if tag.startswith("#")}
set(['announcement', 'rocks', 'stackoverflow'])

Note that the { } set comprehension syntax only works from Python 2.7 onward.
If you are working with an older version, feed the output of a list comprehension ([ ]) to the set function, as suggested by @Bertrand.
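
If the order of first appearance matters, as the expected output in the question suggests, a set will not preserve it. Below is a minimal sketch of an order-preserving variant. The helper name unique_hashtags is hypothetical and not taken from any answer here.

```python
def unique_hashtags(text):
    """Return hashtags in first-seen order, without duplicates (hypothetical helper)."""
    seen = set()      # tags already emitted, for O(1) membership checks
    result = []       # tags in the order they first appear
    for word in text.split():
        if word.startswith("#"):
            tag = word[1:]
            # skip a bare "#" (empty tag) and anything already seen
            if tag and tag not in seen:
                seen.add(tag)
                result.append(tag)
    return result


print(unique_hashtags("Hey guys! #stackoverflow really #rocks #rocks #announcement"))
# ['stackoverflow', 'rocks', 'announcement']
```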

score 15 · Accepted Answer

[i[1:] for i in line.split() if i.startswith("#")]

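This version does not choke on empty strings (I have read such concerns in the comments) or on strings that are only "#". Also, as in Bertrand Marron's code, it is better to convert this into a set as follows (to avoid duplicates and for O(1) lookup time):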

set([i[1:] for i in line.split() if i.startswith("#")])
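
For what it's worth, slicing with [1:] and calling strip("#") are not interchangeable, and a lone "#" token still produces an empty string in the result. Here is a small sketch on a made-up input line. The extra length check is an assumption about what one usually wants, not part of the answer above.

```python
line = "#a# # ##b plain"   # made-up input with awkward tokens

# [1:] removes exactly one leading character
sliced = [w[1:] for w in line.split() if w.startswith("#")]
# strip("#") removes every '#' from both ends
stripped = [w.strip("#") for w in line.split() if w.startswith("#")]
# assumption: drop tokens that are only "#"
non_empty = [w[1:] for w in line.split() if w.startswith("#") and len(w) > 1]

print(sliced)     # ['a#', '', '#b']
print(stripped)   # ['a', '', 'b']
print(non_empty)  # ['a#', '#b']
```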

score 10 · Accepted Answer

The findall method of regular expression objects can get them all at once:

>>> import re
>>> s = "this #is a #string with several #hashtags"
>>> pat = re.compile(r"#(\w+)")
>>> pat.findall(s)
['is', 'string', 'hashtags']
>>>
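
One thing to be aware of: this pattern also matches a # in the middle of a token, such as a URL fragment. Here is a hedged sketch, assuming you only want tags that start a whitespace-separated token. The (?<!\S) lookbehind and the sample string are illustrative additions, not part of the answer above.

```python
import re

s = "see http://example.org/#top and #real, ok"   # illustrative input

anywhere = re.compile(r"#(\w+)")            # '#' anywhere in the text
token_start = re.compile(r"(?<!\S)#(\w+)")  # '#' not preceded by a non-space character

print(anywhere.findall(s))     # ['top', 'real']
print(token_start.findall(s))  # ['real']
```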

score 8 · Accepted Answer

I'd say

hashtags = [word[1:] for word in input.split() if word[0] == '#']

Edit: this will create a set without duplicates.

set(hashtags)

score 1 · Accepted Answer

Another option is a regex:

import re

inputLine = "Hey guys! #stackoverflow really #rocks #rocks #announcement"

re.findall(r'(?i)\#\w+', inputLine) # will include #
re.findall(r'(?i)(?<=\#)\w+', inputLine) # will not include #

score 1 · Accepted Answer

There are some problems with the answers presented here.

1. {tag.strip("#") for tag in tags.split() if tag.startswith("#")}

   [i[1:] for i in line.split() if i.startswith("#")]

   won't work if you have a hashtag like '#one#two#'

2. re.compile(r"#(\w+)") won't work for many Unicode languages (even using re.UNICODE)

I have seen more ways to extract hashtags, but none of them handles every case.

So I wrote some small Python code to handle most of the cases. It works for me.

def get_hashtagslist(string):
    ret = []
    s=''
    hashtag = False
    for char in string:
        if char=='#':
            hashtag = True
            if s:
                ret.append(s)
                s=''           
            continue

        # take only the prefix of the hashtag if it contains one of these chars (e.g. for '#happy,but i..' it takes only 'happy')
        if hashtag and char in [' ','.',',','(',')',':','{','}']:
            # a terminator always ends the tag, so a bare '#' followed by one yields nothing
            if s:
                ret.append(s)
                s=''
            hashtag=False

        if hashtag:
            s+=char

    if s:
        ret.append(s)

    return set(ret)
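
For comparison, roughly the same terminator logic can be written as a single regex with a negated character class. This is only a sketch and not the code above: the name get_hashtags_regex is hypothetical, and \s treats any whitespace (tabs and newlines too) as a terminator, which is broader than the single space in the list above.

```python
import re

# A tag runs until whitespace, another '#', or one of the punctuation
# characters used as terminators in the function above.
HASHTAG_RE = re.compile(r"#([^\s#.,():{}]+)")


def get_hashtags_regex(text):
    """Return the set of hashtags found in text (hypothetical helper)."""
    return set(HASHTAG_RE.findall(text))


print(sorted(get_hashtags_regex("#one#two# and #happy,but i..")))
# ['happy', 'one', 'two']
```

Because the character class excludes terminators rather than listing allowed characters, it does not rely on \w and so should not depend on how \w treats non-ASCII letters.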
