php - How do I remove duplicate rows from a CSV file?

Question

Is there a simple way to find and remove duplicate rows from a CSV file?

Sample test.csv file:

row1 test tyy......
row2 tesg ghh
row2 tesg ghh
row2 tesg ghh
....
row3 tesg ghh
row3 tesg ghh
...
row4 tesg ghh

Expected result:

row1 test tyy......
row2 tesg ghh
....
row3 tesg ghh
...
row4 tesg ghh

Where should I start to accomplish this in PHP?

score 12 · Accepted Answer

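A straightforward approach is to read the file line by line and keep track of every line you have already seen. If the current line has already been seen, skip it.

Something like the following (untested) code may work: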

<?php
// array to hold all "seen" lines
$lines = array();

// open the csv file
if (($handle = fopen("test.csv", "r")) !== false) {
    // read each line into an array
    while (($data = fgetcsv($handle, 8192, ",")) !== false) {
        // build a "line" from the parsed data
        $line = join(",", $data);

        // if the line has been seen, skip it
        if (isset($lines[$line])) continue;

        // save the line
        $lines[$line] = true;
    }
    fclose($handle);
}

// build the new content-data
$contents = '';
foreach ($lines as $line => $bool) $contents .= $line . "\r\n";

// save it to a new file
file_put_contents("test_unique.csv", $contents);
?>

This code uses fgetcsv() with a comma as the column delimiter (based on the sample data in the question's comments).

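Storing every seen line, as above, guarantees that all duplicate lines in the file are removed, regardless of whether they directly follow one another. If duplicates always appear back-to-back, a simpler (and more memory-conscious) approach is to store only the last seen line and compare it against the current one.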
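The back-to-back variant could be sketched roughly as follows. This is a minimal sketch, not part of the original answer: the function name removeAdjacentDuplicates() and its parameters are hypothetical, and it assumes each CSV record sits on a single physical line (no quoted fields containing line breaks), since it compares raw text lines rather than parsed fields.

```php
<?php
// Hypothetical helper (not from the original answer): copies $inPath to
// $outPath, dropping any line identical to the line directly before it.
// Returns the number of lines written.
function removeAdjacentDuplicates(string $inPath, string $outPath): int
{
    $in = fopen($inPath, "r");
    if ($in === false) {
        throw new RuntimeException("Cannot open $inPath for reading");
    }
    $out = fopen($outPath, "w");
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("Cannot open $outPath for writing");
    }

    $previous = null; // the only line kept in memory
    $written = 0;

    while (($line = fgets($in)) !== false) {
        // compare without the line ending, so a missing final newline
        // does not make the last line look different
        $current = rtrim($line, "\r\n");
        if ($current === $previous) {
            continue; // same as the line before it: skip
        }
        fwrite($out, $current . "\n");
        $previous = $current;
        $written++;
    }

    fclose($in);
    fclose($out);
    return $written;
}
```

Calling removeAdjacentDuplicates("test.csv", "test_unique.csv") would then write the filtered copy. Memory use stays constant because only one previous line is remembered, but duplicates that are separated by other rows are not caught.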

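UPDATE (duplicate rows by SKU column rather than the full line)
Based on the sample data provided in the comments, the "duplicate rows" are not actually equal (they are similar, but differ in quite a few columns). The similarity between them comes down to a single column, sku.

Below is an extended version of the code above. This block parses the first line of the CSV file (the column list) to determine which column contains the sku code. From there, it keeps a unique list of the SKU codes it has seen, and if the current row has a "new" code, it writes that row to the new "unique" file using fputcsv().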

<?php
// array to hold all unique SKU codes
$skus = array();

// index of the `sku` column
$skuIndex = -1;

// open the "save-file"
if (($saveHandle = fopen("test_unique.csv", "w")) !== false) {
    // open the csv file
    if (($readHandle = fopen("test.csv", "r")) !== false) {
        // read each line into an array
        while (($data = fgetcsv($readHandle, 8192, ",")) !== false) {
            if ($skuIndex == -1) {
                // we need to determine what column the "sku" is; this will identify
                // the "unique" rows
                foreach ($data as $index => $column) {
                    if ($column == 'sku') {
                        $skuIndex = $index;
                        break;
                    }
                }
                if ($skuIndex == -1) {
                    echo "Couldn't determine the SKU-column.";
                    die();
                }
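                // write the header line to the file, then move on to the data
                // rows (without this, the header would be written a second time below)
                fputcsv($saveHandle, $data);
                continue;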
            }

            // if the sku has been seen, skip it
            if (isset($skus[$data[$skuIndex]])) continue;
            $skus[$data[$skuIndex]] = true;

            // write this line to the file
            fputcsv($saveHandle, $data);
        }
        fclose($readHandle);
    }
    fclose($saveHandle);
}
?>

Overall, this approach is much more memory-friendly, because it does not need to keep a copy of every line in memory (only the SKU codes).

score 0 · Accepted Answer

One-line solution:

file_put_contents('newdata.csv', array_unique(file('data.csv')));
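One thing to watch with the one-liner: file() keeps each line's trailing newline, so a final line with no newline at the end of the file will not compare equal to an otherwise identical earlier line. A minimal sketch of a variant that ignores line endings is below. The function name uniqueLines() is hypothetical, and like the one-liner it loads the whole file into memory and compares raw lines rather than parsed CSV fields.

```php
<?php
// Hypothetical helper (not from the original answer): whole-file
// de-duplication that compares lines without their line endings.
function uniqueLines(string $inPath, string $outPath): void
{
    // FILE_IGNORE_NEW_LINES drops the newline at the end of each element
    $lines = file($inPath, FILE_IGNORE_NEW_LINES);
    if ($lines === false) {
        throw new RuntimeException("Cannot read $inPath");
    }

    // also drop a stray carriage return left over from Windows line endings
    $lines = array_map(function ($line) {
        return rtrim($line, "\r");
    }, $lines);

    // array_unique() keeps the first occurrence of each line, in order
    $unique = array_unique($lines);

    // write the lines back with a single, consistent line ending
    $contents = $unique ? implode("\n", $unique) . "\n" : "";
    file_put_contents($outPath, $contents);
}
```

For example, uniqueLines('data.csv', 'newdata.csv') would produce the same result as the one-liner for well-formed input, while also treating an unterminated last line as a duplicate when it matches an earlier one.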

answered 2020-09-04T10:25:01.713
