go - Why is the md5 hash of the tar-part of a tar.gz via TeeReader wrong?

Question

I was just experimenting with archive/tar and compress/gzip, for automated processing of some backups I have.

My problem hereby is: I have various .tar files and .tar.gz files floating around, and thus I want to extract the hash (md5) of the .tar.gz file, and the hash (md5) of the .tar file as well, ideally in one run.

The example code I have so far, works perfectly fine for the hashes of the files in the .tar.gz as well for the .gz, but the hash for the .tar is wrong and I can't find out what the problem is.

I looked at the tar/reader.go file and I saw that there is some skipping in there, yet I thought everything should run over the io.Reader interface and thus the TeeReader should still catch all the bytes.

package main

import (
    "archive/tar"
    "compress/gzip"
    "crypto/md5"
    "fmt"
    "io"
    "os"
)

func main() {
    tgz, _ := os.Open("tb.tar.gz")
    gzMd5 := md5.New()
    gz, _ := gzip.NewReader(io.TeeReader(tgz, gzMd5))
    tarMd5 := md5.New()
    tr := tar.NewReader(io.TeeReader(gz, tarMd5))
    for {
        fileMd5 := md5.New()
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        io.Copy(fileMd5, tr)
        fmt.Printf("%x  %s\n", fileMd5.Sum(nil), hdr.Name)
    }
    fmt.Printf("%x  tb.tar\n", tarMd5.Sum(nil))
    fmt.Printf("%x  tb.tar.gz\n", gzMd5.Sum(nil))
}

Now for the following example:

$ echo "a" > a.txt
$ echo "b" > b.txt
$ tar cf tb.tar a.txt b.txt 
$ gzip -c tb.tar > tb.tar.gz
$ md5sum a.txt b.txt tb.tar tb.tar.gz

60b725f10c9c85c70d97880dfe8191b3  a.txt
3b5d5c3712955042212316173ccf37be  b.txt
501352dcd8fbd0b8e3e887f7dafd9392  tb.tar
90d6ba204493d8e54d3b3b155bb7f370  tb.tar.gz

On Linux Mint 14 (based on Ubuntu 12.04) with go 1.02 from the Ubuntu repositories the result for my go program is:

$ go run tarmd5.go 
60b725f10c9c85c70d97880dfe8191b3  a.txt
3b5d5c3712955042212316173ccf37be  b.txt
a26ddab1c324780ccb5199ef4dc38691  tb.tar
90d6ba204493d8e54d3b3b155bb7f370  tb.tar.gz

So all hashes except for tb.tar are as expected. (Of course if you retry that example your .tar and .tar.gz will be different from this, because of different timestamps)

Any hint about how to get it work would be greatly appreciated, I really would prefer to have it in 1 run though (with the TeeReaders).

score 5 · Accepted Answer

この問題は、tarがリーダーからすべてのバイトを読み取るわけではないために発生します。各ファイルをハッシュした後、すべてのバイトが読み取られてハッシュされるように、リーダーを空にする必要があります。私が通常これを行う方法は、io.Copy()EOFまで読み取るために使用することです。

package main

import (
    "archive/tar"
    "compress/gzip"
    "crypto/md5"
    "fmt"
    "io"
    "io/ioutil"
    "os"
)

func main() {
    tgz, _ := os.Open("tb.tar.gz")
    gzMd5 := md5.New()
    gz, _ := gzip.NewReader(io.TeeReader(tgz, gzMd5))
    tarMd5 := md5.New()
    tee := io.TeeReader(gz, tarMd5) // need the reader later
    tr := tar.NewReader(tee)
    for {
        fileMd5 := md5.New()
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        io.Copy(fileMd5, tr)
        fmt.Printf("%x  %s\n", fileMd5.Sum(nil), hdr.Name)
    }
    io.Copy(ioutil.Discard, tee) // read unused portions of the tar file
    fmt.Printf("%x  tb.tar\n", tarMd5.Sum(nil))
    fmt.Printf("%x  tb.tar.gz\n", gzMd5.Sum(nil))
}

もう1つのオプションはio.Copy(tarMd5, gz)、tarMd5.Sum（）呼び出しの直前に追加することです。1行ではなく4行を追加/変更する必要がある場合でも、最初の方法の方が明確だと思います。

go - Why is the md5 hash of the tar-part of a tar.gz via TeeReader wrong?

1 に答える 1

Related

Reference