bash - Diffing strings rather than lines

Question

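I feel I should be able to do this in my sleep, but suppose I have two text files, each containing a single column of Apache module names in no particular order. One file has 46 strings, each unique within that file. The other has 67 lines and 67 strings, again each unique within the file. The two files have many strings in common.

I need to find the names of the Apache modules that are in the second, longer file but not in the first, shorter one.

I want to do this by searching for and comparing strings. Line numbers, order, and position are completely irrelevant. I just want to know which modules, listed only in the longer file, need to be installed.

By default, uniq, comm, and diff all want to work with lines and line numbers. I don't want a side-by-side comparison; I just want a list.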

score 2 · Accepted Answer

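Split your strings into lines, then sort and de-duplicate them before using comm for the analysis. (See BashFAQ #36.)

As an example, I'll assume you want to compare the LoadModule directives in two Apache configuration files.

file1: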

...other stuff...
LoadModule foo modules/foo.so
LoadModule bar modules/bar.so
LoadModule baz modules/baz.so
...other stuff...

file2:

...other stuff...
LoadModule foo modules/foo.so
...other stuff...

So, to do this:

comm -2 -3 \
  <(gawk '/LoadModule/ { print $2 }' file1 | sort -u) \
  <(gawk '/LoadModule/ { print $2 }' file2 | sort -u)

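...this suppresses lines found only in file2 (-2) and lines found in both files (-3), leaving the module names that appear only in file1 (the longer file in this example), and yields the following output: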

bar
baz
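
For the plain one-name-per-line lists described in the question, the same idea works without the gawk step. The following is a minimal sketch under a few assumptions: the helper name only_in_second and the sample file names are hypothetical, and bash is required for the process substitution. Here -1 suppresses lines unique to the first input and -3 suppresses lines common to both, so only names found solely in the second input remain.

```bash
#!/bin/bash
# only_in_second FILE_A FILE_B   (hypothetical helper name)
# Print lines that appear in FILE_B but not in FILE_A,
# regardless of order or duplicates in either file.
only_in_second() {
    # comm requires sorted input; sort -u also drops duplicate names.
    comm -13 <(sort -u -- "$1") <(sort -u -- "$2")
}

# Demo with throwaway sample data (file names are illustrative only).
tmpdir=$(mktemp -d)
printf '%s\n' foo bar         > "$tmpdir/short.txt"
printf '%s\n' baz foo qux bar > "$tmpdir/long.txt"
only_in_second "$tmpdir/short.txt" "$tmpdir/long.txt"   # prints baz, then qux
rm -rf "$tmpdir"
```

Passing the shorter file first keeps the argument order the same as in the question.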

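For anyone looking at this question with a more interesting use case in mind: unfortunately, while GNU sort's -z flag can handle NUL delimiters (allowing comparison of strings that contain newlines), comm historically could not, although newer GNU coreutils releases add a -z option to comm as well. You can, however, write your own comm-like implementation in a shell that supports NUL delimiters, as in the following example.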

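#!/bin/bash
# Takes two files as $1 and $2. Both must be NUL-delimited and sorted
# (for example with sort -z). Prints the entries common to both inputs,
# NUL-delimited, much like comm -12 does for lines.
exec 3<"$1" 4<"$2"

# Prime the comparison with the first entry of the second file.
IFS='' read -r -u 4 -d ''; input_two="$REPLY"

while IFS='' read -r -u 3 -d '' ; do
    input_one="$REPLY"
    # Advance the second file until its entry is no longer less than the first's.
    while [[ $input_two < $input_one ]] ; do
        IFS='' read -r -u 4 -d '' || exit 0
        input_two="$REPLY"
    done
    if [[ $input_two = "$input_one" ]] ; then
        printf '%s\0' "$input_two"
    fi
done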
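
The script above prints the entries common to both inputs. For the case in the question (entries present in one input only), a variant could look like the sketch below. The function name nul_only_in_first is hypothetical. It assumes that every record is NUL-terminated and that both inputs are sorted in the same collation order that [[ < ]] uses, which is why the demo forces LC_ALL=C.

```bash
#!/bin/bash
# nul_only_in_first FILE1 FILE2   (hypothetical helper name)
# Print, NUL-delimited, the entries of FILE1 that do not occur in FILE2.
# Both files must be NUL-delimited and sorted in the current locale.
nul_only_in_first() {
    local one two have_two=1
    {
        # Prime the comparison with the first entry of FILE2, if any.
        IFS= read -r -u 4 -d '' two || have_two=0
        while IFS= read -r -u 3 -d '' one; do
            # Skip FILE2 entries that sort before the current FILE1 entry.
            while (( have_two )) && [[ $two < $one ]]; do
                IFS= read -r -u 4 -d '' two || have_two=0
            done
            # Emit the entry unless FILE2 currently holds the same string.
            if (( ! have_two )) || [[ $two != "$one" ]]; then
                printf '%s\0' "$one"
            fi
        done
    } 3<"$1" 4<"$2"
}

# Demo: force the C locale so sort -z and [[ < ]] agree on ordering.
export LC_ALL=C
tmpdir=$(mktemp -d)
printf '%s\0' foo bar baz | sort -z > "$tmpdir/one"
printf '%s\0' foo         | sort -z > "$tmpdir/two"
nul_only_in_first "$tmpdir/one" "$tmpdir/two" | tr '\0' '\n'   # prints bar, then baz
rm -rf "$tmpdir"
```

The tr step is only there to make the demo output readable; a real consumer would keep the NUL delimiters so that entries containing newlines stay intact.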

score 1 · Accepted Answer

I would run a little bash script like this (differ.bash):

#!/bin/bash
f1=$1; # longer file
f2=$2; # shorter file

for item in `cat $f1`
do
    match=0
    for other in `cat $f2`
    do
        if [ "$item" == "$other" ]
        then
            match=1
            break
        fi
    done
    if [ $match != 1 ]
    then
        echo $item
    fi
done

exit 0

Run it like so:

$ ./differ.bash file1 file2

Basically, I am just setting up a double for loop with the longer file on the outer loop and the shorter file on the inner loop. That way each item in the longer list gets compared with the items in the shorter list. This allows us to find all the items that don't match something in the smaller list.
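
The same whole-line membership test can also be sketched with grep, assuming one name per line. In this sketch -F treats every line of the pattern file as a fixed string, -x requires a match against the whole line, -f reads the patterns from a file, and -v inverts the match. The helper name not_in_shorter is hypothetical; its argument order mirrors differ.bash.

```bash
#!/bin/bash
# not_in_shorter LONGER SHORTER   (hypothetical helper; same argument order as differ.bash)
# Print each line of LONGER that does not appear as a whole line in SHORTER.
not_in_shorter() {
    grep -Fxv -f "$2" -- "$1"
}

# Demo with throwaway sample data (file names are illustrative only).
tmpdir=$(mktemp -d)
printf '%s\n' mod_a mod_b mod_c mod_d > "$tmpdir/longer"
printf '%s\n' mod_c mod_a             > "$tmpdir/shorter"
not_in_shorter "$tmpdir/longer" "$tmpdir/shorter"   # prints mod_b, then mod_d
rm -rf "$tmpdir"
```

Unlike a sort-based approach, this keeps the lines in the order in which they appear in the longer file. Note that grep exits with status 1 when it prints nothing, which matters under set -e.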

Edit: I have tried to address Charles' first comment with this updated script:

#!/bin/bash
f1=$1; # longer file
f2=$2; # shorter file

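# Load the shorter file into an array, one entry per line.
others=()
while IFS= read -r item
do
    others+=( "$item" )
done < "$f2"

# Print each line of the longer file that matches no entry in the array.
while IFS= read -r item
do
    match=0
    for other in "${others[@]}"
    do
        if [ "$item" == "$other" ]
        then
            match=1
            break
        fi
    done
    if [ "$match" != 1 ]
    then
        printf '%s\n' "$item"
    fi
done < "$f1"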

exit 0
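
If the lists grow, the inner loop can be replaced by a lookup table. The following is a minimal sketch, assuming bash 4 or later (for associative arrays) and simple names without unusual characters such as module names; the helper name only_in_longer is hypothetical and is not part of the scripts above.

```bash
#!/bin/bash
# only_in_longer LONGER SHORTER   (hypothetical helper; needs bash 4+)
# Print each non-empty line of LONGER that does not occur in SHORTER.
only_in_longer() {
    local -A seen=()
    local item
    # Record every non-empty line of the shorter file as a key.
    while IFS= read -r item; do
        if [[ -n $item ]]; then
            seen[$item]=1
        fi
    done < "$2"
    # Print lines of the longer file whose key was never recorded.
    while IFS= read -r item; do
        if [[ -n $item && -z ${seen[$item]+set} ]]; then
            printf '%s\n' "$item"
        fi
    done < "$1"
}

# Demo with throwaway sample data (file names are illustrative only).
tmpdir=$(mktemp -d)
printf '%s\n' mod_a mod_b mod_c mod_d > "$tmpdir/longer"
printf '%s\n' mod_c mod_a             > "$tmpdir/shorter"
only_in_longer "$tmpdir/longer" "$tmpdir/shorter"   # prints mod_b, then mod_d
rm -rf "$tmpdir"
```

Each file is read once, so the work grows with the combined size of the two lists rather than with their product.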
