war - How do I archive and retrieve a large HTML dataset?

Question

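I'm a beginner and plan to take part in a contest this weekend. The problem is about archiving and retrieving a large HTML dataset, and I don't know how to approach it. A friend suggested using web archives and Common Crawl. Please suggest how to convert an HTML dataset into a web archive and how to index it. Thanks in advance.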

score 0 · Accepted Answer

The WARC format is a widely used standard and definitely a good choice for archiving web pages. A WARC file also contains the HTTP headers, so you need a crawler to create one. If the HTML pages are provided as a collection of files, you would need to crawl the file system (possibly via a local HTTP server) to get the content into a WARC file.
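To make the record layout concrete, here is a minimal sketch in Python that writes local HTML files into a gzipped WARC using only the standard library. Note that it does not crawl through a local HTTP server as suggested above: it synthesizes a plain "200 OK" HTTP response around each file, so the HTTP headers are made up. The function names, the `base_uri` parameter, and the `http://localhost:8000` prefix are assumptions for illustration only.

```python
import gzip
import uuid
from datetime import datetime, timezone
from pathlib import Path


def build_warc_record(target_uri: str, html: bytes) -> bytes:
    """Build one uncompressed WARC/1.0 'response' record for an HTML payload."""
    # Synthetic HTTP response: the file was never fetched over HTTP,
    # so these headers are invented for illustration.
    http_head = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html\r\n"
        f"Content-Length: {len(html)}\r\n"
        "\r\n"
    ).encode("ascii")
    http_block = http_head + html

    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    warc_headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {timestamp}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_block)}",
    ]
    head = "\r\n".join(warc_headers).encode("utf-8") + b"\r\n\r\n"
    # Each record ends with two CRLF pairs.
    return head + http_block + b"\r\n\r\n"


def html_dir_to_warc(html_dir: str, warc_path: str, base_uri: str) -> int:
    """Write every *.html file under html_dir into warc_path; return the record count."""
    count = 0
    with open(warc_path, "wb") as out:
        for path in sorted(Path(html_dir).rglob("*.html")):
            rel = path.relative_to(html_dir).as_posix()
            uri = base_uri.rstrip("/") + "/" + rel
            record = build_warc_record(uri, path.read_bytes())
            # One gzip member per record, the usual .warc.gz convention.
            out.write(gzip.compress(record))
            count += 1
    return count
```

A call such as `html_dir_to_warc("pages", "pages.warc.gz", "http://localhost:8000")` would then produce a single archive. For a contest or anything larger, an established crawler or WARC library is likely the safer route, since it records the real request and response headers.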

Everything else depends on the concrete task. There are many tools and libraries:

- to crawl and export the content as WARC: the simplest is wget --warc-file, but there are many more
- to read WARC files and process the content.
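As a rough illustration of the reading side, the sketch below iterates over the records of a `.warc.gz` file with the Python standard library and collects the target URIs of response records, which is a possible starting point for an index. The function names are hypothetical, and the parser is deliberately simplified (for example, it ignores folded header lines); a dedicated WARC library is likely the better choice for real data.

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator, List, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (warc_headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")

        # Read "Name: value" header lines up to the empty line.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline().rstrip(b"\r\n")
            if not line:
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The block is exactly Content-Length bytes long.
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


def list_response_uris(warc_gz_path: str) -> List[str]:
    """Return the WARC-Target-URI of every 'response' record in a .warc.gz file."""
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(warc_gz_path, "rb") as stream:
        return [
            headers.get("WARC-Target-URI", "")
            for headers, _ in iter_warc_records(stream)
            if headers.get("WARC-Type") == "response"
        ]
```

For retrieval, one option is to also record the byte offset of each record alongside its URI, so that a single page can be read back without scanning the whole archive.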

See The WARC Ecosystem for a collection of tools. If you just need a serious WARC file to start with, fetch one from Common Crawl, e.g., https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-30/segments/1469257824853.47/warc/CC-MAIN-20160723071024-00101-ip-10-185-27-174.ec2.internal.warc.gz
