php - Truncate a multibyte string to n characters

Question

I am trying to get this method in a string filter working:

public function truncate($string, $chars = 50, $terminator = ' …');

I would expect this

$in  = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890";
$out = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …";

and also this

$in  = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ";
$out = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …";

That is, $chars characters minus the length of the $terminator string.

In addition, the filter is supposed to cut at the first word boundary below the $chars limit:

$in  = "Answer to the Ultimate Question of Life, the Universe, and Everything.";
$out = "Answer to the Ultimate Question of Life, the …";

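I am pretty sure this should work with these steps:

Subtract the number of characters in the terminator from the maximum number of characters
Check that the string is longer than the calculated limit, or return it unaltered
Find the last space character in the string below the calculated limit to get the word boundary
Cut the string at the last space, or at the calculated limit if no space is found
Append the terminator to the string
Return the string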
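
As a rough sketch, the steps above could be put together as follows. The function name truncateBySteps is made up for illustration, and the code assumes UTF-8 input and a $chars value greater than the terminator length.

```php
// Sketch only: truncateBySteps is a hypothetical name, and UTF-8 input
// with $chars greater than the terminator length is assumed.
function truncateBySteps(string $string, int $chars = 50, string $terminator = ' …'): string
{
    $encoding = 'UTF-8';

    // Step 1: subtract the terminator length from the maximum length.
    $limit = $chars - mb_strlen($terminator, $encoding);

    // Step 2: return the string unaltered if it is not longer than the limit.
    if (mb_strlen($string, $encoding) <= $limit) {
        return $string;
    }

    // Step 3: find the last space within the first $limit + 1 characters.
    // A space sitting exactly at index $limit still keeps $limit characters.
    $head  = mb_substr($string, 0, $limit + 1, $encoding);
    $space = mb_strrpos($head, ' ', 0, $encoding);

    // Step 4: cut at that space, or at the limit if there is no space.
    $cut = ($space === false) ? $limit : $space;

    // Steps 5 and 6: append the terminator and return the result.
    return mb_substr($string, 0, $cut, $encoding) . $terminator;
}

echo truncateBySteps("Answer to the Ultimate Question of Life, the Universe, and Everything.");
// Answer to the Ultimate Question of Life, the …
```

Taken literally, step 2 compares against the calculated limit, so a string of 49 or 50 characters is still shortened. Comparing against $chars instead would leave such strings untouched.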

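However, I have now tried various combinations of str* and mb_* functions, and all of them yielded wrong results. This can't be that difficult, so I am obviously missing something. Could someone share a working implementation or point me to a resource where I can finally understand how to do it?

Thanks

P.S. Yes, I did check https://stackoverflow.com/search?q=truncate+string+php beforehand :)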

score 5 · Accepted Answer

I just found out that PHP already has a multibyte truncate:

mb_strimwidth: gets a truncated string with the specified width

It does not obey word boundaries, though. Still handy!
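
A small illustration of how mb_strimwidth behaves, as far as I understand it: the width argument counts display columns rather than characters, and it includes the trim marker. The sample strings below are only for demonstration.

```php
// The width argument includes the trim marker, so the result is 10 columns wide.
$ascii = mb_strimwidth("Hello World", 0, 10, "...", 'UTF-8');
echo $ascii, "\n"; // Hello W...

// Width is not the same as character count: CJK characters occupy two columns each.
$cjk = "日本語";
echo mb_strlen($cjk, 'UTF-8'), "\n";   // 3
echo mb_strwidth($cjk, 'UTF-8'), "\n"; // 6
```

For text made of narrow characters, width and character count coincide, which is why it works for the samples in the question.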

score 3 · Accepted Answer

Try this:

function truncate($string, $chars = 50, $terminator = ' …') {
    $cutPos = $chars - mb_strlen($terminator);
    $boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
    return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
}

You need to make sure that the internal encoding is set properly, though.
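
To illustrate why the encoding matters, here is a minimal sketch. It assumes the source file itself is saved as UTF-8; the sample string is taken from the question's data.

```php
$sample = "âãäå"; // assumes this source file is saved as UTF-8

echo strlen($sample), "\n";                  // 8 (bytes)
echo mb_strlen($sample, 'UTF-8'), "\n";      // 4 (characters)
echo mb_strlen($sample, 'ISO-8859-1'), "\n"; // 8 (each byte counted as a character)

// Setting the default once lets the mb_* calls omit the encoding argument.
mb_internal_encoding('UTF-8');
echo mb_strlen($sample), "\n";               // 4
```

Passing the encoding explicitly to each mb_* call is an alternative to changing the global default.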

score 0 · Accepted Answer

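I usually don't like to code an entire answer to a question like this. But I just woke up, and I thought your question might put me in a good mood to go program for the rest of the day.

I didn't try to run this, but it should work, or at least get you 90% of the way there.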

function truncate( $string, $chars = 50, $terminate = ' ...' )
{
    $chars -= mb_strlen($terminate);
    if ( $chars <= 0 )
        return $terminate;

    $string = mb_substr($string, 0, $chars);
    $space = mb_strrpos($string, ' ');

    if ($space < mb_strlen($string) / 2)
        return $string . $terminate;
    else
        return mb_substr($string, 0, $space) . $terminate;
}

score 0 · Accepted Answer

tl;dr:

Sufficiently short strings should not have the ellipsis appended.
Newline characters should also qualify as breaking points.
Regex, once broken down and explained, is not that scary.

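I think there are a few important things to point out regarding this question and the current batch of answers. I'll demo a comparison of the existing answers and my regex answer, based on Gordon's sample data and some additional cases, to expose some differing results.

First, to clarify the quality of the input values: Gordon states that the function must be multibyte safe and must respect word boundaries. The sample data does not expose the desired treatment of non-space, non-word characters (e.g. punctuation) when determining the truncation position, so we must assume that targeting whitespace characters is sufficient. Truncated strings do not usually need to worry about respecting punctuation.

Second, there are fairly common cases where an ellipsis needs to be applied to a large body of text that contains newline characters.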
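
To make the newline point concrete, here is a small sketch comparing a search for a literal space with a search for any whitespace character. The 49-character window is an assumption chosen to match a limit of 48 plus one character.

```php
$text = "A single line of text to be followed by another\nline of text";
$head = mb_substr($text, 0, 49, 'UTF-8');

// Searching for a literal space skips the newline after "another".
$lastSpace = mb_strrpos($head, ' ', 0, 'UTF-8');

// Greedy match up to the last whitespace character of any kind (\s includes "\n").
preg_match('~.*\s~us', $head, $m);
$lastWhitespace = mb_strlen($m[0], 'UTF-8') - 1;

echo mb_substr($text, 0, $lastSpace, 'UTF-8'), "\n";      // A single line of text to be followed by
echo mb_substr($text, 0, $lastWhitespace, 'UTF-8'), "\n"; // A single line of text to be followed by another
```

The space-only search drops the word "another" even though it fits within the limit.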

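Third, let's arbitrarily agree on some basic standardization of the data, such as:

Strings are already trimmed of all leading/trailing whitespace characters
The value of $chars will always be greater than the mb_strlen() of $terminator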
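
If these two assumptions cannot be taken for granted, they could be enforced before truncating. The helper below is purely hypothetical (its name and the choice of exception are mine) and assumes UTF-8.

```php
// Hypothetical helper (not part of any answer here): enforces both assumptions up front.
function prepareForTruncate(string $string, int $chars, string $terminator): string
{
    if ($chars <= mb_strlen($terminator, 'UTF-8')) {
        throw new InvalidArgumentException('$chars must be greater than the terminator length.');
    }

    // trim() strips ASCII whitespace (spaces, tabs, newlines) from both ends.
    return trim($string);
}

echo prepareForTruncate("  Hello worldly world\n", 50, ' …'); // Hello worldly world
```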

(Demo)

Functions:

function truncateGumbo($string, $chars = 50, $terminator = ' …') {
    $cutPos = $chars - mb_strlen($terminator);
    $boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
    return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
}

function truncateGordon($string, $chars = 50, $terminator = ' …') {
    return mb_strimwidth($string, 0, $chars, $terminator);
}

function truncateSoapBox($string, $chars = 50, $terminate = ' …')
{
    $chars -= mb_strlen($terminate);
    if ( $chars <= 0 )
        return $terminate;

    $string = mb_substr($string, 0, $chars);
    $space = mb_strrpos($string, ' ');

    if ($space < mb_strlen($string) / 2)
        return $string . $terminate;
    else
        return mb_substr($string, 0, $space) . $terminate;
}

function truncateMickmackusa($string, $max = 50, $terminator = ' …') {
    $trunc = $max - mb_strlen($terminator, 'UTF-8');
    return preg_replace("~(?=.{{$max}})(?:\S{{$trunc}}|.{0,$trunc}(?=\s))\K.+~us", $terminator, $string);
}

テストケース：

$tests = [
    [
        'testCase' => "Answer to the Ultimate Question of Life, the Universe, and Everything.",
        // 50th char ---------------------------------------------------^
        'expected' => "Answer to the Ultimate Question of Life, the …",
    ],
    [
        'testCase' => "A single line of text to be followed by another\nline of text",
        // 50th char ----------------------------------------------------^
        'expected' => "A single line of text to be followed by another …",
    ],
    [
        'testCase' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ",
        // 50th char ---------------------------------------------------^
        'expected' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …",
    ],
    [
        'testCase' => "123456789 123456789 123456789 123456789 123456789",
        // 50th char doesn't exist -------------------------------------^
        'expected' => "123456789 123456789 123456789 123456789 123456789",
    ],
    [
        'testCase' => "Hello worldly world",
        // 50th char doesn't exist -------------------------------------^
        'expected' => "Hello worldly world",
    ],
    [
        'testCase' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890",
        // 50th char ---------------------------------------------------^
        'expected' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …",
    ],
];

Execution:

foreach ($tests as ['testCase' => $testCase, 'expected' => $expected]) {
    echo "\tSample Input:\t\t$testCase\n";
    echo "\n\ttruncateGumbo:\t\t" , truncateGumbo($testCase);
    echo "\n\ttruncateGordon:\t\t" , truncateGordon($testCase);
    echo "\n\ttruncateSoapBox:\t" , truncateSoapBox($testCase);
    echo "\n\ttruncateMickmackusa:\t" , truncateMickmackusa($testCase);
    echo "\n\tExpected Result:\t{$expected}";
    echo "\n-----------------------------------------------------\n";
}

Output:

    Sample Input:           Answer to the Ultimate Question of Life, the Universe, and Everything.

    truncateGumbo:          Answer to the Ultimate Question of Life, the …
    truncateGordon:         Answer to the Ultimate Question of Life, the Uni …
    truncateSoapBox:        Answer to the Ultimate Question of Life, the …
    truncateMickmackusa:    Answer to the Ultimate Question of Life, the …
    Expected Result:        Answer to the Ultimate Question of Life, the …
-----------------------------------------------------
    Sample Input:           A single line of text to be followed by another
line of text

    truncateGumbo:          A single line of text to be followed by …
    truncateGordon:         A single line of text to be followed by another
 …
    truncateSoapBox:        A single line of text to be followed by …
    truncateMickmackusa:    A single line of text to be followed by another …
    Expected Result:        A single line of text to be followed by another …
-----------------------------------------------------
    Sample Input:           âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ

    truncateGumbo:          âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
    truncateGordon:         âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
    truncateSoapBox:        âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
    truncateMickmackusa:    âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
    Expected Result:        âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
-----------------------------------------------------
    Sample Input:           123456789 123456789 123456789 123456789 123456789

    truncateGumbo:          123456789 123456789 123456789 123456789 12345678 …
    truncateGordon:         123456789 123456789 123456789 123456789 123456789
    truncateSoapBox:        123456789 123456789 123456789 123456789 …
    truncateMickmackusa:    123456789 123456789 123456789 123456789 123456789
    Expected Result:        123456789 123456789 123456789 123456789 123456789
-----------------------------------------------------
    Sample Input:           Hello worldly world

    truncateGumbo:          
Warning: mb_strpos(): Offset not contained in string in /in/ibFH5 on line 4
Hello worldly world …
    truncateGordon:         Hello worldly world
    truncateSoapBox:        Hello worldly …
    truncateMickmackusa:    Hello worldly world
    Expected Result:        Hello worldly world
-----------------------------------------------------
    Sample Input:           abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890

    truncateGumbo:          abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateGordon:         abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateSoapBox:        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateMickmackusa:    abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    Expected Result:        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
-----------------------------------------------------

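Explanation of my pattern:

It looks unsightly, but most of the garbled-looking pattern syntax comes from inserting the numeric values as dynamic quantifiers.

I could also have written it as:

'~(?=.{' . $max . '})(?:\S{' . $trunc . '}|.{0,' . $trunc . '}(?=\s))\K.+~us'

For simplicity, I'll replace $trunc with 48 and $max with 50.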

~                 #opening pattern delimiter
(?=.{50})         #lookahead to ensure that the string has a minimum of 50 characters
(?:               #start of non-capturing group -- to maintain pattern logic only
  \S{48}          #the string starts with at least 48 non-white-space characters
  |               #or
  .{0,48}(?=\s)   #the string starts with up to 48 characters followed by a whitespace
)                 #end of non-capturing group
\K                #restart the fullstring match (aka "forget" the previously matched characters)
.+                #match the remaining characters (these characters will be replaced)
~                 #closing pattern delimiter
us                #pattern modifiers: unicode/multibyte flag & dot matches newlines flag
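
With the numbers written out literally, the pattern can be tried in isolation. This is just a sketch using two of the sample strings from above.

```php
// $trunc = 48 and $max = 50 written out literally; single quotes keep the backslashes intact.
$pattern = '~(?=.{50})(?:\S{48}|.{0,48}(?=\s))\K.+~us';

$long  = "Answer to the Ultimate Question of Life, the Universe, and Everything.";
$short = "Hello worldly world";

echo preg_replace($pattern, ' …', $long), "\n";  // Answer to the Ultimate Question of Life, the …
echo preg_replace($pattern, ' …', $short), "\n"; // Hello worldly world
```

The short string is returned unchanged because the leading lookahead cannot find 50 characters, so nothing matches and nothing is replaced.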
