python - Function that returns the sentences fitting within a given number of characters

Question

Suppose I have the following paragraph:

"This is the first sentence. This is the second sentence? This is the third
 sentence!"

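I need to write a function that returns only as many whole sentences as fit within a given number of characters. If the limit is shorter than the first sentence, it should return the first sentence cut off at that many characters.

For example: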

>>> reduce_paragraph(100)
"This is the first sentence. This is the second sentence? This is the third
 sentence!"

>>> reduce_paragraph(80)
"This is the first sentence. This is the second sentence?"

>>> reduce_paragraph(50)
"This is the first sentence."

>>> reduce_paragraph(5)
"This "

I started with something like this, but I can't figure out how to finish it:

endsentence = ".?!"
sentences = itertools.groupby(text, lambda x: any(x.endswith(punct) for punct in endsentence))
for number,(truth, sentence) in enumerate(sentences):
    if truth:
        first_sentence = previous+''.join(sentence).replace('\n',' ')
    previous = ''.join(sentence)

score 6 · Accepted Answer

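Sentence processing is very difficult because of the syntactic structure of English. As someone has already pointed out, issues such as abbreviations will cause endless headaches even with the best regular expression.

You should consider the Natural Language Toolkit (NLTK), specifically its punkt module. It is a sentence tokenizer and will do the heavy lifting for you.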
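
To illustrate the abbreviation problem, here is a minimal sketch that splits at whitespace following ".", "?" or "!". The sample text and the naive_split helper are made up for this example; the point is only that "Dr." and "p.m." are mistaken for sentence ends.

```python
import re

def naive_split(text):
    # Hypothetical helper: split at whitespace that follows ".", "?" or "!".
    return re.split(r"(?<=[.?!])\s+", text)

sample = "Dr. Smith arrived at 5 p.m. sharp. He sat down."
for piece in naive_split(sample):
    print(repr(piece))
# 'Dr.'
# 'Smith arrived at 5 p.m.'
# 'sharp.'
# 'He sat down.'
```

The text contains two sentences, but the naive split reports four pieces.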

score 2 · Accepted Answer

Here is how to truncate the paragraph using the punkt module mentioned by @BigHandsome:

from nltk.tokenize.punkt import PunktSentenceTokenizer

def truncate_paragraph(text, maxnchars,
                       tokenize=PunktSentenceTokenizer().span_tokenize):
    """Truncate the text to at most maxnchars number of characters.

    The result contains only full sentences unless maxnchars is less
    than the first sentence length.
    """
    sentence_boundaries = tokenize(text)
    last = None
    for start_unused, end in sentence_boundaries:
        if end > maxnchars:
            break
        last = end
    return text[:last] if last is not None else text[:maxnchars]

Example

text = ("This is the first sentence. This is the second sentence? "
        "This is the third\n sentence!")
for limit in [100, 80, 50, 5]:
    print(truncate_paragraph(text, limit))

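Output

This is the first sentence. This is the second sentence? This is the third
 sentence!
This is the first sentence. This is the second sentence?
This is the first sentence.
This 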
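
The function relies only on the tokenizer yielding (start, end) character offsets, so the same logic can be tried without NLTK installed. The sketch below swaps in a hypothetical regex_span_tokenize (not part of NLTK, and far less robust than punkt) purely to show that contract; truncate_by_spans is likewise an illustrative name.

```python
import re

def regex_span_tokenize(text):
    """Hypothetical stand-in for a span tokenizer: yields (start, end) offsets.

    A "sentence" here is any run of text ending in ".", "?" or "!".
    """
    for match in re.finditer(r"\S[^.?!]*[.?!]+", text):
        yield match.start(), match.end()

def truncate_by_spans(text, maxnchars, tokenize=regex_span_tokenize):
    last = None
    for _start, end in tokenize(text):
        if end > maxnchars:
            break
        last = end  # end offset of the last sentence that still fits
    # Fall back to a hard cut when not even the first sentence fits.
    return text[:last] if last is not None else text[:maxnchars]

text = ("This is the first sentence. This is the second sentence? "
        "This is the third\n sentence!")
print(repr(truncate_by_spans(text, 80)))
print(repr(truncate_by_spans(text, 5)))
```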

score 0 · Accepted Answer

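You can break this problem down into simpler steps:

Split the given paragraph into sentences.
Work out how many sentences can be joined while staying within the character limit.
If at least one sentence fits, join those sentences.
If the first sentence is too long, truncate the first sentence.

Sample code (untested):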

    import re

    def reduce_paragraph(para, max_len):
        # Split into list of sentences
        # A sentence is a sequence of characters ending with ".", "?", or "!".
        # (Splitting on this zero-width lookbehind requires Python 3.7+.)
        sentences = re.split(r"(?<=[.?!])", para)

        # Figure out how many sentences we can have and stay under max_len
        num_sentences = 0
        total_len = 0
        for s in sentences:
            total_len += len(s)
            if total_len > max_len:
                break
            num_sentences += 1

        if num_sentences > 0:
            # We can fit at least one sentence, so return whole sentences
            return ''.join(sentences[:num_sentences])
        else:
            # Return a truncated first sentence
            return sentences[0][:max_len]
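
One caveat: re.split only splits on zero-width matches such as this lookbehind from Python 3.7 onward. A sketch of the same steps using re.findall, which does not depend on that behavior, might look like the following. The names split_sentences and reduce_paragraph_findall are made up here, and the splitting rule is just as naive about abbreviations.

```python
import re

def split_sentences(para):
    # A chunk ends with a run of ".", "?" or "!"; trailing text without
    # terminal punctuation is kept as a final chunk.
    return re.findall(r"[^.?!]*[.?!]+|[^.?!]+$", para)

def reduce_paragraph_findall(para, max_len):
    end = 0  # length of the longest prefix made of whole sentences
    for sentence in split_sentences(para):
        if end + len(sentence) > max_len:
            break
        end += len(sentence)
    # Fall back to a hard cut when not even one sentence fits.
    return para[:end] if end > 0 else para[:max_len]

text = ("This is the first sentence. This is the second sentence? "
        "This is the third\n sentence!")
for limit in [100, 80, 50, 5]:
    print(repr(reduce_paragraph_findall(text, limit)))
```

Because the chunks cover the paragraph from left to right without gaps, slicing the original string at the accumulated length gives the same result as joining the chunks.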

score 0 · Accepted Answer

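If you ignore the natural-language issues (that is, you just want an algorithm that returns the complete chunks delimited by ".?!" that fit within the first k characters), the following basic approach works: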

def sentences_upto(paragraph, k):
    sentences = []
    current_sentence = ""
    stop_chars = ".?!"
    # Only the first k characters can contribute to the result.
    for c in paragraph[:k]:
        current_sentence += c
        if c in stop_chars:
            # A sentence is complete: store it and start a new one.
            sentences.append(current_sentence)
            current_sentence = ""
    return sentences

Your itertools solution can be completed like this:

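import itertools

def sentences_upto_2(paragraph, size):
    stop_chars = ".?!"
    # groupby alternates between runs of ordinary characters (key False)
    # and runs of terminal punctuation (key True).
    groups = itertools.groupby(paragraph, lambda c: c in stop_chars)
    body = ""
    for is_stop, chars in groups:
        chunk = "".join(chars)
        size -= len(chunk)
        if size < 0:
            return
        if is_stop:
            # Punctuation closes the sentence collected so far.
            yield body + chunk
            body = ""
        else:
            body = chunk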
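
It may help to see what groupby produces here. Grouping the characters by "is this a terminal punctuation mark" yields alternating runs of ordinary text and punctuation, which is why the loop has to treat the two kinds of group differently. A small illustration with a made-up string:

```python
import itertools

stop_chars = ".?!"
text = "Hi there. Really?! Yes"

# Each group is (key, run of consecutive characters sharing that key).
runs = [(is_stop, "".join(chars))
        for is_stop, chars in itertools.groupby(text, lambda c: c in stop_chars)]

for run in runs:
    print(run)
# (False, 'Hi there')
# (True, '.')
# (False, ' Really')
# (True, '?!')
# (False, ' Yes')
```

Note that consecutive punctuation such as "?!" arrives as a single group, and trailing text without punctuation arrives as a final False group.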
