python - Extract all the words that fit within a given number of characters

Question

I want to take a block of text and extract as many whole words as possible from a given number of characters. What tools or libraries can I use to do this?

For example, given this block of text:

Have you managed to get your hands on Nikon's elusive D4 full-frame DSLR? 
It should be smooth sailing from here, with the occasional firmware update being 
your only critical acquisition going forward. D4 firmware 1.02 brings a handful of 
minor fixes, but if you're in need of any of the enhancements listed below, it's 
surely a must have:

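If I assign that to a string and then do string = string[0:100], I get the first 100 characters, but the word "sailing" is cut off to "sailin". I would like the text to be cut right before or after the space that precedes "sailing".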
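For reference, the problem can be reproduced with a short sketch like the one below. The variable names are only illustrative, and the sample is joined into a single line here, which is an assumption about how the text is stored.

```python
# Sample text from the question, joined into one line (an assumption).
text = (
    "Have you managed to get your hands on Nikon's elusive D4 full-frame DSLR? "
    "It should be smooth sailing from here, with the occasional firmware update being "
    "your only critical acquisition going forward."
)

# A plain slice only counts characters, so it can stop in the middle of a word.
clipped = text[:100]
print(clipped)  # the last word is cut off: "... smooth sailin"
```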

score 3 · Accepted Answer

Using a regular expression:

>>> import re
>>> re.match(r'(.{,100})\W', text).group(1)
"Have you managed to get your hands on Nikon's elusive D4 full-frame DSLR? It should be smooth"

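This approach breaks at any non-word character between words (punctuation as well as spaces), and the captured group holds 100 characters or fewer.

For short strings, the following regex works better, because it also accepts the end of the string as a boundary: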

re.match(r'(.{,100})(\W|$)', text).group(1)
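As a rough sketch, the regex can be wrapped in a small helper. The function name `truncate_at_boundary` and the `limit` parameter are hypothetical, and the `re.DOTALL` flag is an addition not in the answer above: without it `.` does not match newlines, so text that spans several lines would never be matched past its first line.

```python
import re

def truncate_at_boundary(text, limit=100):
    """Return the longest prefix of at most `limit` characters that is
    followed by a non-word character or by the end of the text."""
    pattern = r'(.{0,%d})(?:\W|$)' % limit
    # re.DOTALL (an assumption, see above) lets "." match newlines as well.
    match = re.match(pattern, text, flags=re.DOTALL)
    # No match means the text starts with one word longer than `limit`.
    return match.group(1) if match else ""

sample = (
    "Have you managed to get your hands on Nikon's elusive D4 full-frame DSLR? "
    "It should be smooth sailing from here, with the occasional firmware update"
)
print(truncate_at_boundary(sample))
```

Returning an empty string when nothing matches is one possible choice; raising an error or falling back to a hard cut would be equally reasonable depending on the use case.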

score 1 · Accepted Answer

If you really only want to break the string on whitespace, use this:

my_string = my_string[:100].rsplit(None, 1)[0]

Keep in mind, though, that in practice you may want to break on more than just whitespace.
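One thing to watch for: the one-liner always drops the last whitespace-separated piece of the slice, even when the string is shorter than 100 characters or the cut happens to land on a word boundary, and it raises an IndexError on an empty string. A hedged sketch that guards against those cases might look like this (the name `truncate_on_whitespace` and the `limit` parameter are made up for illustration):

```python
def truncate_on_whitespace(text, limit=100):
    """Cut `text` to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text  # nothing to cut
    head = text[:limit]
    # The cut is already clean if it lands on whitespace or right before it.
    if head[-1:].isspace() or text[limit].isspace():
        return head.rstrip()
    # Otherwise drop the partial last word, as in the one-liner above.
    parts = head.rsplit(None, 1)
    return parts[0] if len(parts) == 2 else ""

print(truncate_on_whitespace("hello world", 8))
```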

score 0 · Accepted Answer

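This cuts the string at the last space within the first 100 characters, if there is one; otherwise it keeps the first 100 characters as they are.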

lastSpace = string[:100].rfind(' ')
string = string[:lastSpace] if (lastSpace != -1) else string[:100]
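Note that the snippet also trims the last word of a string that is already shorter than 100 characters. A possible variant, sketched under the assumption that short strings should be returned unchanged (the function name and `limit` parameter are hypothetical), uses the optional start and end arguments of `str.rfind`:

```python
def truncate_at_last_space(text, limit=100):
    """Cut `text` at the last space that keeps it within `limit` characters."""
    if len(text) <= limit:
        return text  # already short enough, keep every word
    # Search one position past the limit so a space right after the
    # limit-th character counts as a clean cut.
    last_space = text.rfind(' ', 0, limit + 1)
    # Fall back to a hard cut when there is no space at all.
    return text[:last_space] if last_space != -1 else text[:limit]

print(truncate_at_last_space("hello world again", 13))
```

Unlike the regex approach, this only treats the space character as a boundary, so tabs, newlines, and punctuation are ignored.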
