java - JavaでMHTML（.mht）ファイルを読み取ったり解析したりする方法

Question

次のような既知のドキュメントファイルのほとんどのコンテンツをマイニングする必要があります。

pdf
html
doc/docxなど

これらのファイル形式のほとんどについて、私が使用することを計画しています。

http://tika.apache.org/

ただし、現時点でTikaはMHTML（* .mht）ファイルをサポートしていません。（http://en.wikipedia.org/wiki/MHTML ）C＃（ http://www.codeproject.com/KB/）にはいくつかの例があります。 files / MhtBuilder.aspx）が、Javaでは見つかりませんでした。

* .mhtファイルを7Zipで開こうとしましたが、失敗しました... WinZipはファイルを画像とテキスト（CSS、HTML、スクリプト）にテキストおよびバイナリファイルとして解凍できましたが...

MSDNページ（http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content）およびcode project前述のページによると...mhtファイルはGZip圧縮を使用します... 。

Javaで解凍しようとすると、次の例外が発生します。java.uti.zip.GZIPInputStream

java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:16)

そしてとjava.util.zip.ZipFile

 java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(Unknown Source)
at java.util.zip.ZipFile.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:21)

解凍する方法を提案してください。

ありがとう....

score 13 · Accepted Answer

率直に言って、私は近い将来に解決策を期待していなかったので、あきらめようとしていましたが、このページでつまずいたことがあります。

http://en.wikipedia.org/wiki/MIME#Multipart_messages

http://msdn.microsoft.com/en-us/library/ms527355%28EXCHG.10%29.aspx

ただし、一見したところあまりキャッチーではありません。しかし、注意深く見ると手がかりが得られます。これを読んだ後、IEを起動し、ランダムにページを*.mhtファイルとして保存し始めました。行ごとに行かせてください...

しかし、私の最終的な目標はコンテンツを分離/抽出して解析することであったことを事前に説明しておきます...それは保存中に選択するかhtmlによって異なるため、ソリューション自体は完全ではありません。しかし、それはマイナーな問題で個々のファイルを抽出しますが...character setencoding

これがファイルを解析/解凍しようとしている人に役立つことを願ってい*.mht/MHTMLます:)

=======説明========**mhtファイルから取得**

From: "Saved by Windows Internet Explorer 7"

ファイルの保存に使用するソフトウェアです

Subject: Google
Date: Tue, 13 Jul 2010 21:23:03 +0530
MIME-Version: 1.0

件名、日付、mimeバージョン…メール形式によく似ています

  Content-Type: multipart/related;
type="text/html";

これは、それがmultipart文書であることを私たちに告げる部分です。マルチパートドキュメントには、1つ以上の異なるデータセットが1つの本文に結合されており、multipartコンテンツタイプフィールドがエンティティのヘッダーに表示される必要があります。ここでは、タイプをとして表示することもできます"text/html"。

boundary="----=_NextPart_000_0007_01CB22D1.93BBD1A0"

これらすべての中で、これが最も重要な部分です。これは、2つの異なる部分（html、images、css、scriptなど）を分割する一意の区切り文字です。これを手に入れると、すべてが簡単になります...今、私はドキュメントを繰り返し処理し、さまざまなセクションを見つけて、それらをContent-Transfer-Encoding（base64、quoted-printableなど）に従って保存する必要があります...。。。

サンプル

 ------=_NextPart_000_0007_01CB22D1.93BBD1A0
 Content-Type: text/html;
 charset="utf-8"
 Content-Transfer-Encoding: quoted-printable
 Content-Location: http://www.google.com/webhp?sourceid=navclient&ie=UTF-8

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" =
.
.
.

**Javaコード**

定数を定義するためのインターフェース。

public interface IConstants 
{
    public String BOUNDARY = "boundary";
    public String CHAR_SET = "charset";
    public String CONTENT_TYPE = "Content-Type";
    public String CONTENT_TRANSFER_ENCODING = "Content-Transfer-Encoding";
    public String CONTENT_LOCATION = "Content-Location";

    public String UTF8_BOM = "=EF=BB=BF";

    public String UTF16_BOM1 = "=FF=FE";
    public String UTF16_BOM2 = "=FE=FF";
}

メインのパーサークラス...

/**
 * This program and the accompanying materials are made available under the terms of the Eclipse Public License v1.0
 * which accompanies this distribution, and is available at
 * http://www.eclipse.org/legal/epl-v10.html
 */
package com.test.mht.core;

import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.OutputStreamWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import sun.misc.BASE64Decoder;

/**
 * File to parse and decompose *.mts file in its constituting parts.
 * @author Manish Shukla 
 */

public class MHTParser implements IConstants
{
    private File mhtFile;
    private File outputFolder;

    public MHTParser(File mhtFile, File outputFolder) {
        this.mhtFile = mhtFile;
        this.outputFolder = outputFolder;
    }

    /**
     * @throws Exception
     */
    public void decompress() throws Exception
    {
        BufferedReader reader = null;

        String type = "";
        String encoding = "";
        String location = "";
        String filename = "";
        String charset = "utf-8";
        StringBuilder buffer = null;

        try
        {
            reader = new BufferedReader(new FileReader(mhtFile));

            final String boundary = getBoundary(reader);
            if(boundary == null)
                throw new Exception("Failed to find document 'boundary'... Aborting");

            String line = null;
            int i = 1;
            while((line = reader.readLine()) != null)
            {
                String temp = line.trim();
                if(temp.contains(boundary)) 
                {
                    if(buffer != null) {
                        writeBufferContentToFile(buffer,encoding,filename,charset);
                        buffer = null;
                    }

                    buffer = new StringBuilder();
                }else if(temp.startsWith(CONTENT_TYPE)) {
                    type = getType(temp);
                }else if(temp.startsWith(CHAR_SET)) {
                    charset = getCharSet(temp);
                }else if(temp.startsWith(CONTENT_TRANSFER_ENCODING)) {
                    encoding = getEncoding(temp);
                }else if(temp.startsWith(CONTENT_LOCATION)) {
                    location = temp.substring(temp.indexOf(":")+1).trim();
                    i++;
                    filename = getFileName(location,type);
                }else {
                    if(buffer != null) {
                        buffer.append(line + "\n");
                    }
                }
            }

        }finally 
        {
            if(null != reader)
                reader.close();
        }

    }

    private String getCharSet(String temp) 
    {
        String t = temp.split("=")[1].trim();
        return t.substring(1, t.length()-1);
    }

    /**
     * Save the file as per character set and encoding 
     */
    private void writeBufferContentToFile(StringBuilder buffer,String encoding, String filename, String charset) 
    throws Exception
    {

        if(!outputFolder.exists())
            outputFolder.mkdirs();

        byte[] content = null; 

        boolean text = true;

        if(encoding.equalsIgnoreCase("base64")){
            content = getBase64EncodedString(buffer);
            text = false;
        }else if(encoding.equalsIgnoreCase("quoted-printable")) {
            content = getQuotedPrintableString(buffer);         
        }
        else
            content = buffer.toString().getBytes();

        if(!text)
        {
            BufferedOutputStream bos = null;
            try
            {
                bos = new BufferedOutputStream(new FileOutputStream(filename));
                bos.write(content);
                bos.flush();
            }finally {
                bos.close();
            }
        }else 
        {
            BufferedWriter bw = null;
            try
            {
                bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(filename), charset));
                bw.write(new String(content));
                bw.flush();
            }finally {
                bw.close();
            }
        }
    }

    /**
     * When the save the *.mts file with 'utf-8' encoding then it appends '=EF=BB=BF'</br>
     * @see http://en.wikipedia.org/wiki/Byte_order_mark
     */
    private byte[] getQuotedPrintableString(StringBuilder buffer) 
    {
        //Set<String> uniqueHex = new HashSet<String>();
        //final Pattern p = Pattern.compile("(=\\p{XDigit}{2})*");

        String temp = buffer.toString().replaceAll(UTF8_BOM, "").replaceAll("=\n", "");

        //Matcher m = p.matcher(temp);
        //while(m.find()) {
        //  uniqueHex.add(m.group());
        //}

        //System.out.println(uniqueHex);

        //for (String hex : uniqueHex) {
            //temp = temp.replaceAll(hex, getASCIIValue(hex.substring(1)));
        //}     

        return temp.getBytes();
    }

    /*private String getASCIIValue(String hex) {
        return ""+(char)Integer.parseInt(hex, 16);
    }*/
    /**
     * Although system dependent..it works well
     */
    private byte[] getBase64EncodedString(StringBuilder buffer) throws Exception {
        return new BASE64Decoder().decodeBuffer(buffer.toString());
    }

    /**
     * Tries to get a qualified file name. If the name is not apparent it tries to guess it from the URL.
     * Otherwise it returns 'unknown.<type>'
     */
    private String getFileName(String location, String type) 
    {
        final Pattern p = Pattern.compile("(\\w|_|-)+\\.\\w+");
        String ext = "";
        String name = "";
        if(type.toLowerCase().endsWith("jpeg"))
            ext = "jpg";
        else
            ext = type.split("/")[1];

        if(location.endsWith("/")) {
            name = "main";
        }else
        {
            name = location.substring(location.lastIndexOf("/") + 1);

            Matcher m = p.matcher(name);
            String fname = "";
            while(m.find()) {
                fname = m.group();
            }

            if(fname.trim().length() == 0)
                name = "unknown";
            else
                return getUniqueName(fname.substring(0,fname.indexOf(".")), fname.substring(fname.indexOf(".") + 1, fname.length()));
        }
        return getUniqueName(name,ext);
    }

    /**
     * Returns a qualified unique output file path for the parsed path.</br>
     * In case the file already exist it appends a numarical value a continues
     */
    private String getUniqueName(String name,String ext)
    {
        int i = 1;
        File file = new File(outputFolder,name + "." + ext);
        if(file.exists())
        {
            while(true)
            {
                file = new File(outputFolder, name + i + "." + ext);
                if(!file.exists())
                    return file.getAbsolutePath();
                i++;
            }
        }

        return file.getAbsolutePath();
    }

    private String getType(String line) {
        return splitUsingColonSpace(line);
    }

    private String getEncoding(String line){
        return splitUsingColonSpace(line);
    }

    private String splitUsingColonSpace(String line) {
        return line.split(":\\s*")[1].replaceAll(";", "");
    }

    /**
     * Gives you the boundary string
     */
    private String getBoundary(BufferedReader reader) throws Exception 
    {
        String line = null;

        while((line = reader.readLine()) != null)
        {
            line = line.trim();
            if(line.startsWith(BOUNDARY)) {
                return line.substring(line.indexOf("\"") + 1, line.lastIndexOf("\""));
            }
        }

        return null;
    }
}

よろしく、

score 2 · Accepted Answer

自分でやる必要はありません。

依存関係あり

<dependency>
    <groupId>org.apache.james</groupId>
    <artifactId>apache-mime4j</artifactId>
    <version>0.7.2</version>
</dependency>

mhtファイルをロールします

public static void main(String[] args)
{
    MessageTree.main(new String[]{"YOU MHT FILE PATH"});
}

MessageTree意思

/**
 * Displays a parsed Message in a window. The window will be divided into
 * two panels. The left panel displays the Message tree. Clicking on a
 * node in the tree shows information on that node in the right panel.
 *
 * Some of this code have been copied from the Java tutorial's JTree section.
 */

次に、それを調べることができます。

;-)

score 2 · Accepted Answer

JavaMailAPIを使用したよりコンパクトなコード

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.net.URL;
import java.util.Properties;

import javax.mail.BodyPart;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;
import javax.mail.internet.MimeMultipart;

import org.apache.commons.io.IOUtils;

public class MhtParser {

    private File mhtFile;
    private File outputFolder;

    public MhtParser(File mhtFile, File outputFolder) {
        this.mhtFile = mhtFile;
        this.outputFolder = outputFolder;
    }

    public void decompress() throws Exception {
        MimeMessage message = 
            new MimeMessage(
                    Session.getDefaultInstance(new Properties(), null),
                    new FileInputStream(mhtFile));

        if (message.getContent() instanceof MimeMultipart) {
            outputFolder.mkdir();
            MimeMultipart mimeMultipart = (MimeMultipart) message.getContent();

            for (int i = 0; i < mimeMultipart.getCount(); i++) {
                BodyPart bodyPart = mimeMultipart.getBodyPart(i);
                String fileName = bodyPart.getFileName();

                if (fileName == null) {
                    String[] locationHeader = bodyPart.getHeader("Content-Location");
                    if (locationHeader != null && locationHeader.length > 0) {
                        fileName = 
                            new File(new URL(locationHeader[0]).getFile()).getName();
                    }
                }

                if (fileName != null) {
                    FileOutputStream out = 
                        new FileOutputStream(new File(outputFolder, fileName));

                    IOUtils.copy(bodyPart.getInputStream(), out);
                    out.flush();
                    out.close();
                }
            }
        }
    }
}

score 0 · Accepted Answer

Uはhttp://www.chilkatsoft.com/mht-features.aspを試すことができ、パック/アンパックでき、後で通常のファイルとして処理できます。ダウンロードリンクは次のとおりです。http：//www.chilkatsoft.com/java.asp

score 0 · Accepted Answer

私はhttp://jtidy.sourceforge.netを使用してmhtファイルを解析/読み取り/インデックス付けしました（ただし、圧縮ファイルではなく通常のファイルとして）

score 0 · Accepted Answer

パーティーに遅れましたが、これに遭遇した他の人のために@wenerの答えを拡大しています。

Apache Mime4Jライブラリは、 EMLまたはMHTML処理のための最も簡単にアクセスできるソリューションを備えているようで、自分で作成するよりもはるかに簡単です。

以下の私のプロトタイプ' parseMhtToFile'関数は、Cognosアクティブレポート' mht'ファイルからhtmlファイルやその他のアーティファクトをリッピングしますが、他の目的に合わせて調整することもできます。

これはGroovyで記述されており、Apache Mime4Jの「コア」および「dom」jar（現在は0.7.2）が必要です。

import org.apache.james.mime4j.dom.Message
import org.apache.james.mime4j.dom.Multipart
import org.apache.james.mime4j.dom.field.ContentTypeField
import org.apache.james.mime4j.message.DefaultMessageBuilder
import org.apache.james.mime4j.stream.MimeConfig

/**
 * Use Mime4J MessageBuilder to parse an mhtml file (assumes multipart) into
 * separate html files.
 * Files will be written to outDir (or parent) as baseName + partIdx + ext.
 */
void parseMhtToFile(File mhtFile, File outDir = null) {
    if (!outDir) {outDir = mhtFile.parentFile }
    // File baseName will be used in generating new filenames
    def mhtBaseName = mhtFile.name.replaceFirst(~/\.[^\.]+$/, '')

    // -- Set up Mime parser, using Default Message Builder
    MimeConfig parserConfig  = new MimeConfig();
    parserConfig.setMaxHeaderLen(-1); // The default is a mere 10k
    parserConfig.setMaxLineLen(-1); // The default is only 1000 characters.
    parserConfig.setMaxHeaderCount(-1); // Disable the check for header count.
    DefaultMessageBuilder builder = new DefaultMessageBuilder();
    builder.setMimeEntityConfig(parserConfig);

    // -- Parse the MHT stream data into a Message object
    println "Parsing ${mhtFile}...";
    InputStream mhtStream = mhtFile.newInputStream()
    Message message = builder.parseMessage(mhtStream);

    // -- Process the resulting body parts, writing to file
    assert message.getBody() instanceof Multipart
    Multipart multipart = (Multipart) message.getBody();
    def parts = multipart.getBodyParts();
    parts.eachWithIndex { p, i ->
        ContentTypeField cType = p.header.getField('content-type')
        println "${p.class.simpleName}\t${i}\t${cType.mimeType}"

        // Assume mime sub-type is a "good enough" file-name extension 
        // e.g. text/html = html, image/png = png, application/json = json
        String partFileName = "${mhtBaseName}_${i}.${cType.subType}"
        File partFile = new File(outDir, partFileName)

        // Write part body stream to file
        println "Writing ${partFile}...";
        if (partFile.exists()) partFile.delete();
        InputStream partStream = p.body.inputStream;
        partFile.append(partStream);
    }
}

使用法は単純です：

File mhtFile = new File('<path>', 'Report-en-au.mht')
parseMhtToFile(mhtFile)
println 'Done.'

出力は次のとおりです。

Parsing <path>\Report-en-au.mht...
BodyPart    0   text/html
Writing <path>\Report-en-au_0.html...
BodyPart    1   image/png
Writing <path>\Report-en-au_1.png...
Done.

その他の改善点についての考え：

'text' mimeパーツの場合、OPが要求したテキストマイニングに適している可能性のあるのReader代わりにアクセスできます。Stream
生成されたファイル名拡張子については、mimeサブタイプが適切であるとは想定せずに、別のライブラリを使用して適切な拡張子を検索します。
シングルボディ（非マルチパート）および再帰マルチパートmhtmlファイルおよびその他の複雑さを処理します。これらには、カスタムコンテンツハンドラー実装を備えたMimeStreamParserが必要になる場合があります。

java - JavaでMHTML（.mht）ファイルを読み取ったり解析したりする方法

6 に答える 6

Related

Reference