perl - Match a path name in an input file that uses double-zero-byte separators

Question

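I am improving a script that I wrote last year to list duplicate files (if you follow the link, look at the second script).

The record separator of its output, duplicated.log, is the zero byte (\0) instead of the newline (\n). Example: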

$> tr '\0' '\n' < duplicated.log
         12      dir1/index.htm
         12      dir2/index.htm
         12      dir3/index.htm
         12      dir4/index.htm
         12      dir5/index.htm

         32      dir6/video.m4v
         32      dir7/video.m4v

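(In this example, the five files dir1/index.htm, ... dir5/index.htm have the same md5sum and are 12 bytes in size. The other two files, dir6/video.m4v and dir7/video.m4v, have the same md5sum and their content size (du) is 32 bytes.)

Because each line ends with a zero byte (\0) instead of a newline (\n), a blank line is represented as two successive zero bytes (\0\0).

I use the zero byte as the line separator because a path name may itself contain newline characters.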
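
To make the layout concrete, here is a minimal sketch that writes two groups in this format and inspects them. The helper name make_sample is hypothetical, and the tab between the size column and the path is an assumption about the real file.

```shell
#!/bin/bash
# make_sample: hypothetical helper that emits two groups of duplicates
# in the layout described above. The tab between size and path is an assumption.
make_sample() {
  printf '%11s\t%s\0' 12 dir1/index.htm 12 dir2/index.htm   # first group
  printf '\0'                                               # empty line: group separator
  printf '%11s\t%s\0' 32 dir6/video.m4v 32 dir7/video.m4v   # second group
}

# Human-readable view, one record per line
make_sample | tr '\0' '\n'

# Count the zero bytes: 4 record terminators + 1 group separator
make_sample | tr -cd '\0' | wc -c
```

The two groups are separated only by the extra zero byte, which is why \0\0 can serve as a group separator.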

But I am facing this problem: how can I "grep" all the duplicates of a given file from duplicated.log?
(For example, how can I retrieve the duplicates of dir1/index.htm?)

What I want:

$> ./youranswer.sh  "dir1/index.htm"  < duplicated.log | tr '\0' '\n'
         12      dir1/index.htm 
         12      dir2/index.htm 
         12      dir3/index.htm 
         12      dir4/index.htm 
         12      dir5/index.htm 
$> ./youranswer.sh  "dir4/index.htm"  < duplicated.log | tr '\0' '\n'
         12      dir1/index.htm 
         12      dir2/index.htm 
         12      dir3/index.htm 
         12      dir4/index.htm 
         12      dir5/index.htm 
$> ./youranswer.sh  "dir7/video.m4v"  < duplicated.log | tr '\0' '\n'
         32      dir6/video.m4v 
         32      dir7/video.m4v

I was thinking of something like:

awk 'BEGIN { RS="\0\0" } #input record separator is double zero byte 
     /filepath/ { print $0 }' duplicated.log

...but filepath can contain the slash symbol / and many other symbols (quotes, newlines, and so on).

I may have to use perl to deal with this situation...

I am open to any suggestions, questions, and other ideas...

score 1 · Accepted Answer

You are almost there: use the matching operator ~:

awk -v RS='\0\0' -v pattern="dir1/index.htm" '$0~pattern' duplicated.log

score 0 · Accepted Answer

I have realized that I can use the md5sum instead of the path name, because the new version of my script keeps the md5sum information.

This is the new format I am currently using:

$> tr '\0' '\n' < duplicated.log
     12      89e8a208e5f06c65e6448ddeb40ad879 dir1/index.htm 
     12      89e8a208e5f06c65e6448ddeb40ad879 dir2/index.htm 
     12      89e8a208e5f06c65e6448ddeb40ad879 dir3/index.htm 
     12      89e8a208e5f06c65e6448ddeb40ad879 dir4/index.htm 
     12      89e8a208e5f06c65e6448ddeb40ad879 dir5/index.htm 

     32      fc191f86efabfca83a94d33aad2f87b4 dir6/video.m4v 
     32      fc191f86efabfca83a94d33aad2f87b4 dir7/video.m4v

gawk and nawk give the desired result:

$> awk 'BEGIN { RS="\0\0" } 
   /89e8a208e5f06c65e6448ddeb40ad879/ { print $0 }' duplicated.log | 
   tr '\0' '\n'
     12      89e8a208e5f06c65e6448ddeb40ad879 dir1/index.htm 
     12      89e8a208e5f06c65e6448ddeb40ad879 dir2/index.htm 
     12      89e8a208e5f06c65e6448ddeb40ad879 dir3/index.htm 
     12      89e8a208e5f06c65e6448ddeb40ad879 dir4/index.htm 
     12      89e8a208e5f06c65e6448ddeb40ad879 dir5/index.htm

But I am still open to your answers :-)
(this current answer is just a workaround)

For the curious, below is the new (horrible) script that is still under construction...

#!/bin/bash

fifo=$(mktemp -u) 
fif2=$(mktemp -u)
dups=$(mktemp -u)
dirs=$(mktemp -u)
menu=$(mktemp -u)
numb=$(mktemp -u)
list=$(mktemp -u)

mkfifo $fifo $fif2


# run processing in background
find . -type f -printf '%11s %P\0' |  #print size and filename
tee $fifo |                           #write in fifo for dialog progressbox
grep -vzZ '^          0 ' |           #ignore empty files
LC_ALL=C sort -z |                    #sort by size
uniq -Dzw11 |                         #keep files having same size
while IFS= read -r -d '' line
do                                    #for each file compute md5sum
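  sum=$(md5sum < "${line:12}")        #read via stdin so the file name is never word-split or escaped
  printf '%s \t %s %s \0' "${line:0:11}" "${sum%% *}" "${line:12}"
                                      #file size + md5sum + file name + null terminated instead of '\n'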
done |                                #keep the duplicates (same md5sum)
tee $fif2 |
uniq -zs12 -w34 --all-repeated=separate |  #skip the size column, compare tab + space + 32-char md5 only
tee $dups  |
#xargs -d '\n' du -sb 2<&- |          #retrieve size of each file
gawk '
function tgmkb(size) { 
  if(size<1024) return int(size)    ; size/=1024; 
  if(size<1024) return int(size) "K"; size/=1024;
  if(size<1024) return int(size) "M"; size/=1024;
  if(size<1024) return int(size) "G"; size/=1024;
                return int(size) "T"; }
function dirname (path)
      { if(sub(/\/[^\/]*$/, "", path)) return path; else return "."; }
BEGIN { RS=ORS="\0" }
!/^$/ { sz=substr($0,1,11); name=substr($0,48); dir=dirname(name); sizes[dir]+=sz; files[dir]++ }
END   { for(dir in sizes) print tgmkb(sizes[dir]) "\t(" files[dir] "\tfiles)\t" dir }' |
LC_ALL=C sort -zrshk1 > $dirs &
pid=$!


tr '\0' '\n' <$fifo |
dialog --title "Collecting files having same size..."    --no-shadow --no-lines --progressbox $(tput lines) $(tput cols)


tr '\0' '\n' <$fif2 |
dialog --title "Computing MD5 sum" --no-shadow --no-lines --progressbox $(tput lines) $(tput cols)


wait $pid
DUPLICATES=$( grep -zac -v '^$' $dups) #total number of files concerned
UNIQUES=$(    grep -zac    '^$' $dups) #number of files, if all redundant are removed
DIRECTORIES=$(grep -zac     .   $dirs) #number of directories concerned
lins=$(tput lines)
cols=$(tput cols)
cat > $menu <<EOF
--no-shadow 
--no-lines 
--hline "After selection of the directory, you will choose the redundant files you want to remove"
--menu  "There are $DUPLICATES duplicated files within $DIRECTORIES directories.\nThese duplicated files represent $UNIQUES unique files.\nChoose directory to proceed redundant file removal:"
$lins 
$cols
$DIRECTORIES
EOF
tr '\n"' "_'" < $dirs |
gawk 'BEGIN { RS="\0" } { print FNR " \"" $0 "\" " }' >> $menu

dialog --file $menu 2> $numb
[[ $? -eq 1 ]] && exit
set -x
dir=$( grep -zam"$(< $numb)" . $dirs | tac -s'\0' | grep -zam1 . | cut -f4- )
md5=$( grep -zam"$(< $numb)" . $dirs | tac -s'\0' | grep -zam1 . | cut -f2  )

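grep -za "$dir/[^/]*$" "$dups" |      #whole records (no -o) so the fixed offsets below hold
while IFS= read -r -d '' line
do
  file="${line:47}"
  md5="${line:14:32}"                 #checksum column of this record
  awk 'BEGIN { RS="\0\0" } '"/$md5/"' { print $0 }' "$dups" >> $list   #read $dups, not the loop's stdin
done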

echo -e "
fifo $fifo \t dups $dups \t menu $menu
fif2 $fif2 \t dirs $dirs \t numb $numb \t list $list"

#rm -f $fifo $fif2 $dups $dirs $menu $numb
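
The script above creates its temporary names with mktemp -u, which only prints a name without creating anything, and its final rm is commented out. One possible alternative, shown here only as a sketch, is to keep every FIFO and scratch file in a single private directory and remove it with a trap. The variable workdir is hypothetical.

```shell
#!/bin/bash
# Hypothetical variant of the temporary-file setup: one private directory
# holds every FIFO and scratch file, and a trap removes it when the script exits.
workdir=$(mktemp -d) || exit 1
trap 'rm -rf -- "$workdir"' EXIT

fifo=$workdir/fifo
fif2=$workdir/fif2
dups=$workdir/dups
dirs=$workdir/dirs

# The FIFOs are created inside the private directory
mkfifo -- "$fifo" "$fif2"
```

With this layout a single rm -rf in the trap replaces the list of individual file names.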
