python - Parsing sentences (or other long strings) in Python (ProblemSetQuestion): how do I proceed?

Question

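Okay, I searched the website for parsing long strings (or sentences, if you will) in Python and couldn't find anything. If there is a previously answered question of the same nature, please redirect me to it! Anyway, hello! I'm a beginner programmer (teaching myself Python from the internet) and I'm looking for a solution to a (seemingly simple) problem. If you have any input on this problem, feel free to answer. It would really help if you explained your solution or coding example in a little detail. My only idea for solving this so far is to use ASCII values to remove all the punctuation with if statements, and then split the remaining text on the leftover spaces while appending to a list. To save your time, and so that I learn something new, I'd rather not see the longest expression statement ever! Also, since this is a function that returns a list, there is no need to convert (back) to another data type such as a string or a dictionary. Thanks in advance for any help you can give!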

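Without further ado, here is the question:

Parsing a string

Write a function that takes a string as input and returns a list of all the words in the string. You should replace dashes with spaces and remove all punctuation.

Examples (calls):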

    >>> parse("Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.") 
    [Listen, strange, women, lyin, in, ponds, distributin, swords, is, no, basis, for, a, system, of, government, Supreme, executive, power, derives, from, a, mandate, from, the, masses, not, from, some, farcical, aquatic, ceremony]
    >>> parse("What... is the air-speed velocity of an unladen swallow?") 
    [What, is, the, air, speed, velocity, of, an, unladen, swallow]

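Sorry for running on with the length of the code! Anyway, I think you get the idea of what needs to be done from the question itself. Any suggestions or unique and effective solutions are more than welcome! -Winkleson

P.S.: Apologies for the run-on sentences and the "wall of text". I'm a bit talkative... Anyway, thanks for the help!

Note that the output is not a list! Also, no extra symbols may be included in the answer! Please keep that in mind! Thanks again for your help! Sorry for the inconvenience, but the author of the problem got the answer wrong.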

score 3 · Accepted Answer

This is very easy with the Natural Language Toolkit (nltk).

import nltk, string
text = "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."

tokens = nltk.word_tokenize(text)

# remove punctuation
tokens = [word.replace("-"," ") for word in tokens if word not in string.punctuation]

In use:

>>> text = "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."
>>> tokens = nltk.word_tokenize(text)
>>> tokens = [word.replace("-"," ") for word in tokens if word not in string.punctuation]
>>> tokens
['Listen', 'strange', 'women', 'lyin', 'in', 'ponds', 'distributin', 'swords', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'masses', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony']

The desired output is apparently rather unclear, but if you are looking for a string version of that output, you can take the tokens variable and do this:

print('[' + ', '.join(tokens) + ']')

which gives:

>>> print('[' + ', '.join(tokens) + ']')
[Listen, strange, women, lyin, in, ponds, distributin, swords, is, no, basis, for, a, system, of, government., Supreme, executive, power, derives, from, a, mandate, from, the, masses, not, from, some, farcical, aquatic, ceremony]
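
If the bracketed, quote-free string is what the problem set expects, the join can be wrapped in a small helper. This is only a sketch: the name `format_words` is made up for illustration and is not part of nltk or of the original question.

```python
def format_words(words):
    """Join words into a bracketed, comma-separated string without quotes."""
    # ", ".join puts a comma and a space between consecutive words.
    return "[" + ", ".join(words) + "]"


print(format_words(["What", "is", "the", "air", "speed"]))
# prints: [What, is, the, air, speed]
```

It accepts any list of strings, so it works the same whichever tokenizing approach produced the words.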

Your "wall of text" makes it hard to understand what you want.

score 2 · Accepted Answer

In [133]: punc = set('.,<>!@#$%^&*()-_+=]}{[\\|')

In [134]: [''.join(char for char in word if char not in punc) for word in "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.".split()]
Out[134]: 
['Listen',
 'strange',
 'women',
 "lyin'",
 'in',
 'ponds',
 "distributin'",
 'swords',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'masses',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony']
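
A variation on the same idea, as a minimal sketch: it assumes Python 3 (for `str.maketrans`) and uses `string.punctuation` from the standard library instead of a hand-written set, so apostrophes and question marks are removed as well. Dashes are turned into spaces first, as the problem statement asks. The function name `parse` comes from the question; everything else here is an assumption.

```python
import string


def parse(text):
    """Return the words in text, treating dashes as spaces and dropping punctuation."""
    # Replace dashes first so that "air-speed" becomes two words.
    text = text.replace("-", " ")
    # Translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    # split() with no arguments splits on any run of whitespace.
    return text.translate(table).split()


print(parse("What... is the air-speed velocity of an unladen swallow?"))
# prints: ['What', 'is', 'the', 'air', 'speed', 'velocity', 'of', 'an', 'unladen', 'swallow']
```

Only ASCII punctuation is removed; typographic quotes or other Unicode punctuation would be left in place.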

score 1 · Accepted Answer

I would recommend using a regular expression like this:

import re

re.findall(r'[a-zA-Z]+',input_string)

Or, if you are going to run it on several strings, compile the regular expression first:

regexp = re.compile(r'[a-zA-Z]+')
regexp.findall(input_string)

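Basically, this asks for every run of consecutive letters, with each run returned as one word. If you want to keep contractions, for example, just add ' to the character class (and use double quotes around the pattern so the apostrophe does not end the string):

re.findall(r"[a-zA-Z']+", input_string)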
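
As a rough sketch, the regex approach can be wrapped into the `parse` function the question asks for. The names `WORD_RE`, `CONTRACTION_RE` and `parse_keep_apostrophes` are made up for illustration. Because a dash is not a letter, "air-speed" comes out as two words without any extra step.

```python
import re

# Hypothetical names; the patterns are compiled once and reused across calls.
WORD_RE = re.compile(r"[a-zA-Z]+")
CONTRACTION_RE = re.compile(r"[a-zA-Z']+")


def parse(text):
    """Return every run of ASCII letters in text as a list of words."""
    return WORD_RE.findall(text)


def parse_keep_apostrophes(text):
    """Same as parse, but apostrophes count as part of a word."""
    return CONTRACTION_RE.findall(text)


print(parse("What... is the air-speed velocity of an unladen swallow?"))
# prints: ['What', 'is', 'the', 'air', 'speed', 'velocity', 'of', 'an', 'unladen', 'swallow']
print(parse_keep_apostrophes("lyin' in ponds"))
# prints: ["lyin'", 'in', 'ponds']
```

Both patterns match ASCII letters only, so digits and accented characters are dropped; that is fine for the examples in the question but may not suit other text.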
