string - Loading a file in Ruby that uses two different encodings

Question

I have a large file with two different encodings. The "main" file is UTF-8, but some characters, such as <80> (€ in isoxxx) and <9F> (ß in isoxxx), are in ISO-8859-1 encoding. I can replace the invalid characters with this:

 string.encode("iso8859-1", "utf-8", {:invalid => :replace, :replace => "-"}).encode("utf-8")

The problem is that I need these wrongly encoded characters, so replacing them with "-" does not help. How can I fix the wrongly encoded characters in the document with Ruby?

Edit: I tried the :fallback option, but without success (nothing was replaced):

 string.encode("iso8859-1", "utf-8",
     :fallback => {"\x80" => "123"}
 )

score 1 · Accepted Answer

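I used the following code (Ruby 1.8.7). It tests each byte >= 128 (that is, each non-ASCII byte) to see whether it is the start of a valid UTF-8 sequence. If it is not, the byte is assumed to be ISO-8859-1 and is converted to UTF-8.

Because your file is large, this procedure may be very slow.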

class String
  # Ensures each char in the final string is utf-8-compliant.
  # based on http://php.net/manual/en/function.utf8-encode.php#39986
  def utf8
    ret = ''

    # scan the string
    # I'd use self.each_byte do |b|, but I'll need to change i
    a = self.unpack('C*')
    i = 0
    l = a.length
    while i < l
      b = a[i]
      i += 1

      # if it's ascii, don't do anything.
      if b < 0x80
        ret += b.chr
        next
      end

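      # check whether it's the beginning of a valid utf-8 sequence;
      # n ends up as the number of continuation bytes expected
      m = [0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe]
      n = 0
      # stop at m.length so that m[n] is never nil
      n += 1 until n >= m.length || (b & m[n]) == m[n-1]

      # if not (no mask matched), convert it to utf-8
      if n >= m.length
        ret += [b].pack('U')
        next
      end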

      # if yes, check if the rest of the sequence is utf8, too
      r = [b]
      u = false

      # n bytes matching 10bbbbbb follow?
      n.times do
        if i < l
          r << a[i]
          u = (a[i] & 0xc0) == 0x80
          i += 1
        else
          u = false
        end
        break unless u
      end

      # if not, converts it!
      ret += r.pack(u ? 'C*' : 'U*')
    end

    ret
  end

  def utf8!
    replace utf8
  end
end

# let s be the string containing your file.
s2 = s.utf8

# or
s.utf8!
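
On Ruby 2.1 or later, the same idea (keep valid UTF-8 sequences, reinterpret everything else as a single-byte encoding) can be sketched with the built-in String#scrub. This is only an illustration and not part of the answer above: the helper name repair_mixed_utf8 and its fallback parameter are made up, and it assumes that every invalid byte really is a character in the fallback encoding.

```ruby
# Hypothetical helper (not from the original answer).
# Assumption: any byte that is not part of a valid UTF-8 sequence
# is a single-byte character in +fallback+.
def repair_mixed_utf8(str, fallback = Encoding::ISO_8859_1)
  # Tag a copy as UTF-8 so that scrub can locate the invalid bytes.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  utf8.scrub do |bad_bytes|
    # bad_bytes is the invalid byte run; transcode it from the
    # fallback encoding instead of replacing it with a placeholder.
    bad_bytes.encode(Encoding::UTF_8, fallback)
  end
end

# valid UTF-8 "ã", ASCII "b", then a lone ISO-8859-1 "ã" byte
mixed = "\xC3\xA3b\xE3".b
fixed = repair_mixed_utf8(mixed)
puts fixed
```

In ISO-8859-1 the byte 0x80 is a control character; the euro sign sits at 0x80 in Windows-1252. If that is where the stray bytes come from, passing Encoding::Windows_1252 as the fallback may be the better guess. Note that a few Windows-1252 bytes are undefined and would make encode raise.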

score 1 · Accepted Answer

Here is a much faster version of the previous code, compatible with Ruby 1.8 and 1.9.

A regular expression identifies the invalid UTF-8 characters, and only those are converted.

class String

  # Regexp for invalid UTF8 chars.
  # $1 will be valid utf8 sequence;
  # $3 will be the invalid utf8 char.
  INVALID_UTF8 = Regexp.new(
    '(([\xc0-\xdf][\x80-\xbf]{1}|' +
    '[\xe0-\xef][\x80-\xbf]{2}|' +
    '[\xf0-\xf7][\x80-\xbf]{3}|' +
    '[\xf8-\xfb][\x80-\xbf]{4}|' +
    '[\xfc-\xfd][\x80-\xbf]{5})*)' +
    '([\x80-\xff]?)', nil, 'n')

  if RUBY_VERSION >= '1.9'
    # ensure each char is utf8, assuming that
    # bad characters are in the +encoding+ encoding
    def utf8_ignore!(encoding)

      # avoid bad characters errors and encoding incompatibilities
      force_encoding('ascii-8bit')

      # encode only invalid utf8 chars within string
      gsub!(INVALID_UTF8) do |s|
        $1 + $3.force_encoding(encoding).encode('utf-8').force_encoding('ascii-8bit')
      end

      # final string is in utf-8
      force_encoding('utf-8')
    end

  else
    require 'iconv'

    # ensure each char is utf8, assuming that
    # bad characters are in the +encoding+ encoding
    def utf8_ignore!(encoding)

      # encode only invalid utf8 chars within string
      gsub!(INVALID_UTF8) do |s|
        $1 + Iconv.conv('utf-8', encoding, $3)
      end

    end
  end

end

# "\xe3" = "ã" in iso-8859-1
# mix valid utf8 chars with an invalid one, which is in iso-8859-1
a = "ãb\xe3"

a.utf8_ignore!('iso-8859-1')

puts a   #=> ãbã
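
For newer Ruby versions (iconv left the standard library in Ruby 2.0, and recent releases no longer accept the three-argument form of Regexp.new, as far as I know), the regular-expression approach can be sketched as follows. This is an assumption-laden rewrite, not the author's code: the names STRAY_HIGH_BYTE and utf8_repair are invented, the pattern only recognises 2- to 4-byte sequences, and like the original it does not reject overlong forms.

```ruby
# Hypothetical names; a sketch for Ruby 2.0+.
# The /n flag makes the pattern match raw bytes.
STRAY_HIGH_BYTE = /
  ((?:[\xc0-\xdf][\x80-\xbf]
    |[\xe0-\xef][\x80-\xbf]{2}
    |[\xf0-\xf7][\x80-\xbf]{3})*)   # group 1: run of well-formed sequences
  ([\x80-\xff]?)                    # group 2: stray high byte, if any
/nx

# Returns a new UTF-8 string. Assumes every stray high byte is a
# single-byte character in +fallback+.
def utf8_repair(str, fallback = 'ISO-8859-1')
  binary = str.b   # binary copy, so the byte-wise regexp applies
  repaired = binary.gsub(STRAY_HIGH_BYTE) do
    valid_run = Regexp.last_match(1)
    stray     = Regexp.last_match(2)
    # Transcode only the stray byte, then go back to binary so the
    # concatenation does not mix incompatible encodings.
    valid_run + stray.encode('UTF-8', fallback).b
  end
  repaired.force_encoding('UTF-8')
end

b = utf8_repair("\xC3\xA3b\xE3")
puts b
```

Unlike utf8_ignore!, this sketch returns a new string instead of modifying the receiver, which also keeps it usable with frozen string literals.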

score 0 · Accepted Answer

Are you looking for something like this?

http://jalada.co.uk/2011/12/07/solving-latin1-and-utf8-errors-for-good-in-ruby.html
