apache-tika - Tika: extracting individual items from a compound document

Question

Question: Suppose I have an email message with an attachment (say, a JPEG). How can I parse the message (without using the Tika facade class) and get back the individual parts: (a) the email's text content and (b) the email's attachment?

Configuration: Tika 1.2, Java 1.7

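Details: I have been able to parse email messages in the basic email message format. After parsing, though, I need to know (a) the email's text content and (b) the content of each attachment. These items will be stored in my database, essentially as a parent email with child attachments.

What I can't figure out is how to "get back" the individual parts, and how to tell that the parent email has attachments so that I can save each attachment separately, linked to the email that references it. I think this is essentially similar to extracting the contents of a ZipFile.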

Code example:

private Message processDocument(String fullfilepath) {
    try {
        File filename = new File(fullfilepath);
        return this.processDocument(filename);
    } catch (NullPointerException npe) {
        // new File(null) throws NullPointerException
        Message error = new Message(false);
        error.appendErrorMessage("The file name was null.");
        return error;
    }
}

private Message processDocument(File filename) {
    InputStream stream = null;
    try {
        stream = new FileInputStream(filename);
    } catch (FileNotFoundException fnfe) {
        // TODO Auto-generated catch block
        fnfe.printStackTrace();
        System.out.println("FileNotFoundException");
        return diag;
    }

    int writelimit = -1; // -1 disables the write limit
    ContentHandler texthandler = new BodyContentHandler(writelimit);
    this.safehandlerbodytext = new SafeContentHandler(texthandler);
    this.meta = new Metadata();
    ParseContext context = new ParseContext();

    AutoDetectParser autodetectparser = new AutoDetectParser();

    try {
        autodetectparser.parse(
            stream,
            texthandler,
            meta,
            context);

        this.documenttype = meta.get("Content-Type");

        diag.setSuccessful(true);

    } catch (IOException ioe) {
        // if the document stream could not be read
        System.out.println("TikaTextExtractorHelper IOException " + ioe.getMessage());
        //FIXME -- add real handling

    } catch (SAXException se) {
        // if the SAX events could not be processed
        System.out.println("TikaTextExtractorHelper SAXException " + se.getMessage());
        //FIXME -- add real handling

    } catch (TikaException te) {
        // if the document could not be parsed
        System.out.println("TikaTextExtractorHelper TikaException " + te.getMessage());
        System.out.println("Exception Filename = " + filename.getName());
        //FIXME -- add real handling

    }

    // diag is assumed to be a Message field of the enclosing class
    return diag;
}

score 1 · Accepted Answer

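When Tika hits an embedded document, it goes to the ParseContext to see whether a recursing parser has been supplied. If one has, Tika uses it to process the embedded resource. If not, the embedded resource is skipped.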
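
To make that lookup-or-skip behaviour concrete, here is a toy model in plain Java. It is not Tika's code: `ContextLookupSketch`, `Context`, `EmbeddedHandler`, and `processEmbedded` are all hypothetical names invented for illustration, and the real ParseContext and parser classes differ in detail. It only shows the general idea of a type-keyed context whose entry decides whether embedded resources are handled or skipped.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types are hypothetical and are NOT Tika classes.
class ContextLookupSketch {

    /** Hypothetical stand-in for whatever handles an embedded resource. */
    interface EmbeddedHandler {
        void handle(String resourceName);
    }

    /** Minimal type-keyed context: at most one value is stored per class key. */
    static class Context {
        private final Map<Class<?>, Object> entries = new HashMap<Class<?>, Object>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast(null) returns null, so a missing entry yields null
            return key.cast(entries.get(key));
        }
    }

    /** Returns how many embedded resources were handled rather than skipped. */
    static int processEmbedded(List<String> embeddedNames, Context context) {
        EmbeddedHandler handler = context.get(EmbeddedHandler.class);
        if (handler == null) {
            return 0; // nothing registered: every embedded resource is skipped
        }
        for (String name : embeddedNames) {
            handler.handle(name);
        }
        return embeddedNames.size();
    }
}
```

With an empty `Context`, `processEmbedded` handles nothing. Once a handler is registered under `EmbeddedHandler.class`, every embedded name is passed to it.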

So what you probably want to do is something like this:

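import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.tika.io.IOUtils; // Commons IO's IOUtils.copy works the same way
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;

public static class HandleEmbeddedParser extends AbstractParser {
   // Temp files written for each embedded document seen so far
   public List<File> found = new ArrayList<File>();

   public Set<MediaType> getSupportedTypes(ParseContext context) {
       // Return what you want to handle
       Set<MediaType> types = new HashSet<MediaType>();
       types.add(MediaType.application("pdf"));
       types.add(MediaType.application("zip"));
       return types;
   }

   public void parse(
        InputStream stream, ContentHandler handler,
        Metadata metadata, ParseContext context
   ) throws IOException {
       // Do something with the child documents
       // eg save to disk
       File f = File.createTempFile("tika", "tmp");
       found.add(f);

       FileOutputStream fout = new FileOutputStream(f);
       try {
           IOUtils.copy(stream, fout);
       } finally {
           fout.close();
       }
   }
}

ParseContext context = new ParseContext();
context.set(Parser.class, new HandleEmbeddedParser());
parser.parse(....);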
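
The handler above collects its temp files in one flat list, while the question asks for a parent email with child attachments. One possible way to keep that association is sketched below using only the Java 7 standard library. `AttachmentStoreSketch`, `saveChild`, `childrenOf`, and the `parentId` string are hypothetical names, not part of Tika, and the in-memory map is only a stand-in for the database mentioned in the question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: remembers which saved attachment
// belongs to which parent message.
class AttachmentStoreSketch {
    private final Map<String, List<Path>> childrenByParent =
            new LinkedHashMap<String, List<Path>>();

    /** Copies one attachment stream to a temp file and records it under the parent's id. */
    Path saveChild(String parentId, InputStream attachment) throws IOException {
        Path target = Files.createTempFile("tika-attachment-", ".tmp");
        // createTempFile already created the file, so the copy must replace it
        Files.copy(attachment, target, StandardCopyOption.REPLACE_EXISTING);

        List<Path> children = childrenByParent.get(parentId);
        if (children == null) {
            children = new ArrayList<Path>();
            childrenByParent.put(parentId, children);
        }
        children.add(target);
        return target;
    }

    /** Returns the saved attachments for a parent, or an empty list if it has none. */
    List<Path> childrenOf(String parentId) {
        List<Path> children = childrenByParent.get(parentId);
        return children == null
                ? Collections.<Path>emptyList()
                : Collections.unmodifiableList(children);
    }
}
```

Under these assumptions, `HandleEmbeddedParser.parse` could call `saveChild(parentId, stream)` in place of the manual copy, and a non-empty `childrenOf(parentId)` after parsing would indicate that the email had attachments.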
