emacs - How to fix files with mixed encodings?

Question

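Given a file corrupted by mixed encodings (e.g. utf-8 and latin-1), how can I configure Emacs to "project" all its symbols onto a single encoding (e.g. utf-8) when saving the file?

I wrote the following function to automate part of the cleaning. To improve it, I would guess that the information needed to map the symbol "é" in one encoding to "é" in utf-8 can be found somewhere (or that somebody has already written such a function).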

  (defun jyby/cleanToUTF ()
    "Cleaning to UTF"
    (interactive)
    (progn
         (save-excursion (replace-regexp "अ" ""))
         (save-excursion (replace-regexp "आ" ""))
         (save-excursion (replace-regexp "ॆ" ""))
       )
  )

  (global-unset-key [f11])
  (global-set-key [f11] 'jyby/cleanToUTF)

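I have many files "corrupted" by mixed encodings (due to copy-pasting from a browser with ill-configured fonts), which results in the error below. I sometimes clean them manually, either by searching for each problematic symbol and replacing it with "" or the appropriate character, or, more quickly, by specifying "utf-8-unix" as the encoding (in which case the same message appears the next time I edit and save the file). This has become a real problem: in any such corrupted file, each accented character is replaced by a sequence that doubles in size at every save, which ends up doubling the size of the file. I am using GNU Emacs 24.2.1.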

These default coding systems were tried to encode text
in the buffer `test_accents.org':
(utf-8-unix (30 . 4194182) (33 . 4194182) (34 . 4194182) (37
. 4194182) (40 . 4194181) (41 . 4194182) (42 . 4194182) (45
. 4194182) (48 . 4194182) (49 . 4194182) (52 . 4194182))
However, each of them encountered characters it couldn't encode:
utf-8-unix cannot encode these:           ...

Click on a character (or switch to this window by `C-x o'
and select the characters by RET) to jump to the place it appears,
where `C-u C-x =' will give information about it.

Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
to remove or modify the problematic characters,
or specify any other coding system (and risk losing
the problematic characters).

raw-text emacs-mule no-conversion

score 2 · Accepted Answer

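I've struggled with this in emacs many times. When I have a file that got messed up, e.g. in raw-text-unix mode, and I save it as utf-8, emacs complains even about text that is already clean utf-8. I haven't found a way to make it complain only about the non-utf-8 parts.

I did find a reasonable semi-automated approach using recode: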

f=mixed-file
# Read from stdin so recode acts as a filter instead of overwriting $f in place.
recode -f ..utf-8 < "$f" > /tmp/recode.out
diff "$f" /tmp/recode.out | cat -vt

# manually fix lines of text that can't be converted to utf-8 in $f,
# and re-run recode and diff until the output diff is empty.

One tool I found helpful along the way is http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=342+200+224&mode=obytes.

After that, simply re-opening the file in emacs is enough for it to be recognized as clean Unicode.
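
The "manually fix lines" step is easier when you know which lines are the offending ones. Below is a minimal sketch of one way to list them, assuming `iconv` is installed. The function name `list_non_utf8_lines` is made up for illustration and is not part of recode or Emacs. It feeds each line through a UTF-8 to UTF-8 conversion and reports the line number whenever `iconv` rejects it.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): print the line numbers
# of the file given as $1 whose bytes are not valid UTF-8.
list_non_utf8_lines() (
    # Run in a subshell with the C locale so the shell handles plain bytes.
    LC_ALL=C; export LC_ALL
    n=0
    # "|| [ -n "$line" ]" also processes a last line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        # iconv exits non-zero when the input is not valid UTF-8.
        if ! printf '%s' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            printf '%s\n' "$n"
        fi
    done < "$1"
)
```

Calling `list_non_utf8_lines mixed-file` prints one line number per invalid line, which you can then jump to in emacs with `M-g g`. Files containing NUL bytes are not handled by this sketch.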

score 1 · Accepted Answer

Here's something to get you started:

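;; Written against Emacs 24: `cl' provides `loop', `incf' and `decf';
;; `eieio' provides `defclass', `defmethod' and `with-slots'.
(require 'cl)
(require 'eieio)

(put 'eof-error 'error-conditions '(error eof-error))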
(put 'eof-error 'error-message "End of stream")
(put 'bad-byte 'error-conditions '(error bad-byte))
(put 'bad-byte 'error-message "Not a UTF-8 byte")

(defclass stream ()
  ((bytes :initarg :bytes :accessor bytes-of)
   (position :initform 0 :accessor position-of)))

(defun logbitp (byte bit) (not (zerop (logand byte (ash 1 bit)))))

(defmethod read-byte ((this stream) &optional eof-error eof)
  (with-slots (bytes position) this
    (if (< position (length bytes))
        (prog1 (aref bytes position) (incf position))
      (if eof-error (signal eof-error (list position)) eof))))

(defmethod unread-byte ((this stream))
  (when (> (position-of this) 0) (decf (position-of this))))

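(defun read-utf8-char (stream)
  "Read one UTF-8 sequence from STREAM and return its code point.
Signal `bad-byte' on a byte that cannot occur at that position."
  (let ((byte (read-byte stream 'eof-error)))
    (if (not (logbitp byte 7)) byte          ; 0xxxxxxx: plain ASCII
      (let ((numbytes                        ; continuation bytes expected
             (cond
              ((not (logbitp byte 6)) nil)   ; 10xxxxxx cannot start a character
              ((not (logbitp byte 5))
               (setf byte (logand #2r11111 byte)) 1)
              ((not (logbitp byte 4))
               (setf byte (logand #2r1111 byte)) 2)
              ((not (logbitp byte 3))
               (setf byte (logand #2r111 byte)) 3))))
        ;; Signal only when the lead byte is invalid, not after every character.
        (if (null numbytes)
            (signal 'bad-byte (list byte))
          (dotimes (b numbytes byte)
            (let ((next-byte (read-byte stream 'eof-error)))
              (if (and (logbitp next-byte 7) (not (logbitp next-byte 6)))
                  (setf byte (logior (ash byte 6) (logand next-byte #2r111111)))
                (signal 'bad-byte (list next-byte))))))))))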

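(defun load-corrupt-file (file)
  "Return the contents of FILE as a string, decoding it as UTF-8 by hand."
  (interactive "fFile to load: ")
  (with-temp-buffer
    (set-buffer-multibyte nil)          ; keep the raw bytes undecoded
    (insert-file-contents-literally file)
    ;; Take the bytes while the buffer is still unibyte.
    (let ((stream (make-instance 'stream :bytes (buffer-string))))
      (with-output-to-string
        (loop for next-char =
              (condition-case err
                  (read-utf8-char stream)
                ;; Placeholder: report the byte and emit U+FFFD in its place.
                ;; Replace this with your own repair logic.
                (bad-byte (message "Fix this byte %d" (cadr err)) #xFFFD)
                (eof-error nil))
              while next-char
              do (write-char next-char))))))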

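What this code does: it loads the file with no conversion and tries to read it as if it were encoded as UTF-8. As soon as it encounters a byte that doesn't appear to belong to UTF-8, it signals an error, which you need to handle somehow (that's where the "Fix this byte" message is). But you will have to be inventive about how to fix it...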
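
One possible way to "handle" the bad bytes, sketched here in shell rather than Emacs Lisp, is to fall back to Latin-1 whenever a line fails to decode as UTF-8. This is only a sketch under two assumptions: `iconv` is installed, and the stray bytes really are ISO-8859-1 (as in the utf-8/latin-1 mix the question mentions). The function name `to_utf8_with_latin1_fallback` is hypothetical. It also works per line, not per byte as the code above does, so a line that mixes valid UTF-8 with Latin-1 bytes would have its UTF-8 part double-encoded.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): write the file given as
# $1 to stdout as UTF-8, one line at a time.
to_utf8_with_latin1_fallback() (
    # Run in a subshell with the C locale so the shell handles plain bytes.
    LC_ALL=C; export LC_ALL
    while IFS= read -r line || [ -n "$line" ]; do
        if printf '%s' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            # Already valid UTF-8: copy the line through unchanged.
            printf '%s\n' "$line"
        else
            # Assumption: anything that is not valid UTF-8 is ISO-8859-1.
            printf '%s\n' "$line" | iconv -f ISO-8859-1 -t UTF-8
        fi
    done < "$1"
)
```

A typical use would be `to_utf8_with_latin1_fallback mixed-file > fixed-file`, followed by a diff against the original to review what changed. The output always ends with a newline, even if the input did not.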
