ruby - Desperately trying to remove this diabolical Excel-generated special character from a CSV in Ruby

Question

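My computer has no idea what this character is. It came from Excel.

In Excel it was a weird space, but now it is literally represented by multiple symbols. My computer has no idea what it is.

This character is represented in Excel as Ê (in the csv; in the xls it is some kind of space). OS X's TextEdit treats it as this big long space of " ". Ruby's CSV parser fails when it tries to parse the file as normal utf-8, and I have to add :encoding => "windows-1251:utf-8" to parse it, in which case Ruby turns the character into a "K". This K appears in my CSV in groups of 9, 12, 15 and 18 (KKKKKKKKK, etc.) and cannot be removed via gsub(/K/) (groups of K, /KKKKKKKKK/, etc., cannot be removed either)! I have also used the open source tool CSVfix, but its "remove leading and trailing spaces" command did not affect the Ks.

I tried using sed as suggested in "Remove non-ascii characters from csv", but got errors like

sed: 1: "output.csv": invalid command code o

when running something like

sed -i 's/[\d128-\d255]//' input.csv

on a Mac.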

score 0 · Accepted Answer

**self-answers (different account, same person)

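1st solution attempt:

# Nine Cyrillic К characters (U+041A), not Latin K's
evil_string_from_csv_cell = "\u041A" * 9
encoding_opts = {
  :invalid => :replace, :undef => :replace,
  :replace => '', :universal_newline => true }
# The double splat passes the hash as keyword options (needed on Ruby 3+)
evil_string_from_csv_cell.encode Encoding.find('ASCII'), **encoding_opts
#=> ""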

2nd solution attempt:

Don't use 'windows-1251:utf-8' for the encoding; use 'iso-8859-1' instead, which turns those (Cyrillic) К's into '\xCA', which can then be removed with

string.gsub!(/\xCA/, '')

** I have not solved this problem yet.
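For reference, here is a minimal sketch of the ISO-8859-1 idea. It is a variation on the attempt above, not the exact code: it assumes the raw bytes are labelled as ISO-8859-1 and then transcoded to UTF-8, so that byte 0xCA becomes the ordinary character Ê (U+00CA) and can be deleted as a normal string. The sample data and variable names are hypothetical.

```ruby
# Hypothetical sample: raw bytes as they might come out of the Excel export,
# with byte 0xCA standing in for the "weird space".
raw = "MetatagA\xCA\xCA\xCA,value\n".b

# Label the bytes as ISO-8859-1, then transcode to UTF-8.
# In ISO-8859-1, byte 0xCA is the letter Ê (U+00CA).
text = raw.force_encoding("ISO-8859-1").encode("UTF-8")

# Ê is now an ordinary UTF-8 character, so it can be deleted directly.
cleaned = text.delete("\u00CA")

puts cleaned
```

Note that this also deletes any legitimate Ê in the data, so it only fits files where that character never appears on purpose.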

3rd solution attempt:

Trying to match a run of К's as if they were actual Latin K's is foolish. Copy and paste in the actual Cyrillic К and see how that works. Here is the character; notice the little curl on the end:

К

Ruby displays it slightly bolder than a normal K.
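A small sketch of why gsub(/K/) has no effect: under windows-1251, byte 0xCA decodes to the Cyrillic letter К (U+041A), which is a different code point from the Latin K (U+004B). The cell contents below are made up for illustration.

```ruby
# Byte 0xCA read as windows-1251 and transcoded to UTF-8 is Cyrillic К.
decoded = "\xCA".b.force_encoding("Windows-1251").encode("UTF-8")

latin_k    = "K"       # U+004B LATIN CAPITAL LETTER K
cyrillic_k = "\u041A"  # U+041A CYRILLIC CAPITAL LETTER KA

# Hypothetical cell: real text followed by nine Cyrillic К's.
cell = "MetatagA" + cyrillic_k * 9

# A Latin K pattern matches nothing, so the cell comes back unchanged.
unchanged = cell.gsub(/K/, "")

# A pattern written with the Cyrillic code point removes the run.
cleaned = cell.gsub(/\u041A+/, "")

puts decoded == cyrillic_k   # the decoded byte is the Cyrillic letter
puts latin_k == cyrillic_k   # the two letters are not equal
puts cleaned
```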

4th solution/strategy attempt (success):

Use regular expressions to capture the characters you care about. So long as you can decode the weird spaces (or whatever they are) into something, you can then ignore them using regular expressions.
Also try to take advantage of any spatial (matrix-like) patterns among the document types.
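One way this strategy might look, as a sketch only: it assumes the padding characters are the only non-ASCII content in the file, which may not hold for other data. The row below is hypothetical.

```ruby
# Hypothetical parsed row: the first cell is padded with Cyrillic К's.
row = ["MetatagA" + "\u041A" * 9, "value"]

# Capture only the leading ASCII part of the cell and ignore the padding.
tag = row[0][/\A[[:ascii:]]+/]

# Or strip every run of non-ASCII characters from each cell.
stripped = row.map { |cell| cell.gsub(/[^[:ascii:]]+/, "") }

puts tag
puts stripped.inspect
```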

score 0 · Accepted Answer

I couldn't get sed to work, but I was eventually able to do this in Vim:

vim myhorriblefile.csv

# Once vim is open (the % applies the substitution to every line):
:%s/Ê/ /g
:wq

# Done!

As a generalized function for reuse, this looks like:

clean_weird_character () {
  vim "$1" -c ":%s/Ê/ /g" -c "wq"
}

score 0 · Accepted Answer

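The answer to this problem is:

A.) This is a very hard problem. So far, nobody knows how to "physically" remove the Cyrillic К's.

But

B.) Since a csv file is just strings separated by unescaped commas, matching strings with regular expressions works for lookups alone, as long as the encoding doesn't break the program.

So, to read the file: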

f = File.open(File.join(Rails.root, 'lib', 'assets', 'repo', name), :encoding => "windows-1251:utf-8")
parsed = CSV.parse(f)

Then find the specific row via a regex literal string match (which overlooks the Cyrillic К's):

parsed.each_with_index do |p, i|   # here, p[0] is the metatag column
  @specific_metatag_row = i if p[0] =~ /MetatagA/
end

score 0 · Accepted Answer

Parse the csv like this to remove the "evil" characters:

.encode!("ISO-8859-1", :invalid => :replace)
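A hedged sketch of how such a call might be applied to a single cell. It assumes the cell was already decoded to UTF-8 and contains Cyrillic К's. Note that :invalid only covers malformed byte sequences; characters with no ISO-8859-1 equivalent fall under :undef, so this sketch adds :undef and an empty :replace string, which are not in the line above.

```ruby
# Hypothetical cell, already decoded to UTF-8, padded with Cyrillic К's.
cell = "MetatagA" + "\u041A" * 3

# К has no ISO-8859-1 equivalent, so :undef => :replace with an empty
# replacement string drops it during transcoding.
latin1 = cell.encode("ISO-8859-1",
                     :invalid => :replace, :undef => :replace, :replace => "")

# Transcode back to UTF-8 for further processing.
cleaned = latin1.encode("UTF-8")

puts cleaned
```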
