ruby - Ruby: Limiting a UTF-8 string by byte length

Question

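This RabbitMQ page states:

Queue names may be up to 255 bytes of UTF-8 characters.

In ruby (1.9.3), how would I truncate a UTF-8 string by byte count without cutting in the middle of a character? The resulting string should be the longest valid UTF-8 string that fits within the byte limit.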

score 20 · Accepted Answer

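For Rails >= 3.0, there is the ActiveSupport::Multibyte::Chars#limit method.

From the API docs:

- (Object) limit(limit)

Limits the byte size of the string to a number of bytes without breaking characters. Usable when the storage for a string is limited for some reason.

Example: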

'こんにちは'.mb_chars.limit(7).to_s # => "こん"

score 10 · Accepted Answer

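bytesize gives the length of the string in bytes, whereas (as long as the string's encoding is set correctly) operations such as slice won't mangle the string.

A simple approach is to iterate through the string:

s.each_char.each_with_object('') do |char, result|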
  if result.bytesize + char.bytesize > 255
    break result
  else
    result << char
  end
end

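If you wanted to be crafty, you could copy the first 63 characters directly, since any Unicode character is at most 4 bytes in UTF-8 (63 × 4 = 252 bytes, which always fits in 255).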
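A minimal sketch of that shortcut, assuming a valid UTF-8 input and a non-negative limit. The method name limit_bytesize_fast_start is made up for illustration; it copies the first max_bytesize / 4 characters in one step and only then falls back to the character-by-character loop:

```ruby
# Hypothetical helper: copy the guaranteed-safe prefix, then add characters
# one at a time until the next one would exceed the byte limit.
def limit_bytesize_fast_start(str, max_bytesize)
  raise ArgumentError, "max_bytesize must be >= 0" if max_bytesize < 0
  return str if str.bytesize <= max_bytesize

  # Every UTF-8 character is at most 4 bytes, so this many characters
  # can never exceed max_bytesize.
  safe_chars = max_bytesize / 4
  result = str[0, safe_chars]

  str[safe_chars..-1].each_char do |char|
    break if result.bytesize + char.bytesize > max_bytesize
    result << char
  end
  result
end
```

With a 255-byte limit the loop then only has to examine the characters after the first 63.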

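Note that this is still not perfect. For example, the last 3 bytes of your string might be the character "e" (1 byte) followed by a combining acute accent (U+0301, 2 bytes). Slicing off the last 2 bytes produces a string that is still valid UTF-8, but what the user sees changes from "é" to "e", which can change the meaning of the text. This is probably not a big deal if you are just naming RabbitMQ queues, but it can matter in other situations. For example, in a French newsletter headline, "Un policier tué" means "A policeman was killed", whereas "Un policier tue" means "A policeman kills".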
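One way to avoid splitting a base letter from its combining mark is to truncate by grapheme cluster instead of by code point. The following is only a sketch: it relies on String#each_grapheme_cluster, which exists in Ruby 2.5 and later (not in 1.9.3), and the method name limit_bytesize_by_grapheme is made up for illustration.

```ruby
# Hypothetical helper: append whole grapheme clusters (e.g. "e" plus its
# combining accent) until the next cluster would exceed the byte limit.
def limit_bytesize_by_grapheme(str, max_bytesize)
  result = String.new(encoding: str.encoding)
  str.each_grapheme_cluster do |cluster|
    break if result.bytesize + cluster.bytesize > max_bytesize
    result << cluster
  end
  result
end
```

With the decomposed input "abc\u0065\u0301d", a limit of 4 or 5 bytes yields "abc" rather than "abce", because the "e" and its accent are kept or dropped together.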

score 5 · Accepted Answer

I think I found something that works.

def limit_bytesize(str, size)
  str.encoding.name == 'UTF-8' or raise ArgumentError, "str must have UTF-8 encoding"

  # Change to canonical unicode form (compose any decomposed characters).
  # Works only if you're using active_support
  str = str.mb_chars.compose.to_s if str.respond_to?(:mb_chars)

  # Start with a string of the correct byte size, but
  # with a possibly incomplete char at the end.
  new_str = str.byteslice(0, size)

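  # We need to force_encoding from utf-8 to utf-8 so ruby will re-validate
  # (idea from halfelf). Stop once the string is empty so that new_str[-1]
  # is never nil (e.g. when size is smaller than the first character).
  until new_str.empty? || new_str[-1].force_encoding('utf-8').valid_encoding?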
    # remove the invalid char
    new_str = new_str.slice(0..-2)
  end
  new_str
end

Usage:

>> limit_bytesize("abc\u2014d", 4)
=> "abc"
>> limit_bytesize("abc\u2014d", 5)
=> "abc"
>> limit_bytesize("abc\u2014d", 6)
=> "abc—"
>> limit_bytesize("abc\u2014d", 7)
=> "abc—d"

Update...

Behavior with decomposed characters, without active_support:

>> limit_bytesize("abc\u0065\u0301d", 4)
=> "abce"
>> limit_bytesize("abc\u0065\u0301d", 5)
=> "abce"
>> limit_bytesize("abc\u0065\u0301d", 6)
=> "abcé"
>> limit_bytesize("abc\u0065\u0301d", 7)
=> "abcéd"

Behavior with decomposed characters, with active_support:

>> limit_bytesize("abc\u0065\u0301d", 4)
=> "abc"
>> limit_bytesize("abc\u0065\u0301d", 5)
=> "abcé"
>> limit_bytesize("abc\u0065\u0301d", 6)
=> "abcéd"

score 1 · Accepted Answer

How about this:

s = "δogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδog"
count = 0
while true
  more_truncate = "a" + (255-count).to_s
  s2 = s.unpack(more_truncate)[0]
  s2.force_encoding 'utf-8'

  if s2[-1].valid_encoding?
    break
  else
    count += 1
  end
end

s2.force_encoding 'utf-8'
puts s2

score 1 · Accepted Answer

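Rails 6 provides String#truncate_bytes, which behaves like truncate but takes a byte size instead of a character count. And of course, it returns a valid string (it does not blindly cut in the middle of a multibyte character).

Taken from the docs: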

>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".size
=> 20
>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".bytesize
=> 80
>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".truncate_bytes(20)
=> "🔪🔪🔪🔪…"

score 0 · Accepted Answer

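Without Rails

Fredrick Cheung's answer is an excellent O(n) starting point that inspired this solution, which needs only O(log n) probes (each probe still slices and measures a prefix).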

def limit_bytesize(str, max_bytesize)
  return str unless str.bytesize > max_bytesize

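  # Find the smallest index whose prefix exceeds max_bytesize, then keep
  # only the characters before it. An exclusive range avoids str[0..-1]
  # (the whole string) when even the first character is too large.
  just_over = (0...str.size).bsearch { |l| str[0..l].bytesize > max_bytesize }
  str[0...just_over]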
end

I believe this also gets the automatic max_bytesize / 4 speed-up mentioned in that answer, since bsearch starts in the middle.
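For comparison, here is a sketch of a different technique that avoids searching altogether: in UTF-8, continuation bytes always have the bit pattern 10xxxxxx, so one can start at the byte limit and step back (at most three bytes) until the cut lands on a character boundary. The method name limit_bytesize_by_boundary is made up for illustration, and the sketch assumes the input is valid UTF-8.

```ruby
# Hypothetical helper: back up from the byte limit to the nearest
# character boundary by skipping UTF-8 continuation bytes.
def limit_bytesize_by_boundary(str, max_bytesize)
  unless str.encoding == Encoding::UTF_8 && str.valid_encoding?
    raise ArgumentError, "str must be valid UTF-8"
  end
  return str if str.bytesize <= max_bytesize
  return str.byteslice(0, 0) if max_bytesize <= 0

  cut = max_bytesize
  # str.getbyte(cut) is the first byte that would be dropped. If it is a
  # continuation byte (0b10xxxxxx), the cut is inside a character.
  cut -= 1 while cut > 0 && (str.getbyte(cut) & 0xC0) == 0x80
  str.byteslice(0, cut)
end
```

Like the other code-point-based answers, this keeps the string valid but can still separate a base letter from a combining mark.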

score 0 · Accepted Answer

Ruby's String#byteslice can be used with a range. I would recommend trying the following:

string.byteslice(0...max_bytesize)

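The three dots make the range exclusive, so the end index max_bytesize itself is left out and the result is at most max_bytesize bytes long. Note that byteslice works on raw bytes, so it can cut a multibyte character in half; check the result with valid_encoding? before using it.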
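Because byteslice operates on bytes, the slice can end with an incomplete character. One possible way to clean that up, sketched here under the assumption that the input is valid UTF-8 and that Ruby 2.1 or later is available (for String#scrub), is to drop the broken tail after slicing. The method name limit_bytesize_with_scrub is made up for illustration.

```ruby
# Hypothetical helper: slice by bytes, then remove the incomplete
# trailing byte sequence (if any) with String#scrub.
def limit_bytesize_with_scrub(str, max_bytesize)
  str.byteslice(0, max_bytesize).scrub("")
end
```

Keep in mind that scrub("") removes every invalid byte sequence in the slice, so an input that was already invalid would also lose bytes elsewhere.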
