xpath - Tika, JAXP, or both

Question

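To better understand my dilemma, please see the background thread ;)

As mentioned in the thread above, I decided to use Tika so that I have a generic interface for parsing documents and extracting their content. To do this, I decided to convert each document to XML/HTML using an appropriate ContentHandler.

Below is a sample of the output.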

    File type is application/vnd.openxmlformats-officedocument.wordprocessingml.document
    Handler <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="cp:revision" content="2" />
    <meta name="meta:last-author" content="ogilvie.f" />
    <meta name="Last-Author" content="ogilvie.f" />
    <meta name="meta:save-date" content="2012-04-24T15:24:00Z" />
    <meta name="Application-Name" content="Microsoft Office Word" />
    <meta name="Author" content="ogilvie.f" />
    <meta name="dcterms:created" content="2012-04-24T15:24:00Z" />
    <meta name="Application-Version" content="12.0000" />
    <meta name="Character-Count-With-Spaces" content="21667" />
    <meta name="date" content="2012-04-24T15:24:00Z" />
    <meta name="extended-properties:Template" content="Normal" />
    <meta name="meta:line-count" content="153" />
    <meta name="creator" content="ogilvie.f" />
    <meta name="publisher" content="Procter &amp; Gamble" />
    <meta name="Word-Count" content="3240" />
    <meta name="meta:paragraph-count" content="43" />
    <meta name="Creation-Date" content="2012-04-24T15:24:00Z" />
    <meta name="extended-properties:AppVersion" content="12.0000" />
    <meta name="meta:author" content="ogilvie.f" />
    <meta name="Line-Count" content="153" />
    <meta name="extended-properties:Application" content="Microsoft Office Word" />
    <meta name="Paragraph-Count" content="43" />
    <meta name="Last-Save-Date" content="2012-04-24T15:24:00Z" />
    <meta name="Last-Printed" content="2012-03-29T15:06:00Z" />
    <meta name="Revision-Number" content="2" />
    <meta name="meta:print-date" content="2012-03-29T15:06:00Z" />
    <meta name="meta:creation-date" content="2012-04-24T15:24:00Z" />
    <meta name="dcterms:modified" content="2012-04-24T15:24:00Z" />
    <meta name="Template" content="Normal" />
    <meta name="Page-Count" content="15" />
    <meta name="meta:character-count" content="18470" />
    <meta name="dc:creator" content="ogilvie.f" />
    <meta name="meta:word-count" content="3240" />
    <meta name="extended-properties:Company" content="Procter &amp; Gamble" />
    <meta name="Last-Modified" content="2012-04-24T15:24:00Z" />
    <meta name="custom:ContentTypeId" content="0x010100832DCE57D1DD144A851051A25C75E147" />
    <meta name="modified" content="2012-04-24T15:24:00Z" />
    <meta name="xmpTPg:NPages" content="15" />
    <meta name="dc:publisher" content="Procter &amp; Gamble" />
    <meta name="Character Count" content="18470" />
    <meta name="meta:page-count" content="15" />
    <meta name="meta:character-count-with-spaces" content="21667" />
    <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
    <title></title>
    </head>
    <body><p class="body_Text"><b>CONFIDENTIAL</b></p>
    <table><tbody><tr>  <td><p>principle</p>
    </td>   <td><p>optimum</p>
    </td>   <td><p>rationale</p>
    </td></tr>
    <tr>    <td><p>Number of  suppliers</p>
    </td>   <td><p class="list_Paragraph">2-3 per plant</p>
    <p class="list_Paragraph">&gt;80% with 5 per region/country cluster</p>
    </td>   <td><p class="list_Paragraph">Competition is local</p>
    <p class="list_Paragraph">Scale the spend with central accounts</p>
    </td></tr>
    <tr>    <td><p>Global/local suppliers</p>
    </td>   <td><p>Regional is sufficient</p>
    </td>   <td><p class="list_Paragraph">No advantage to global as scale is regional only and there is limited IP to transfer.</p>
    <p class="list_Paragraph">Larger regional suppliers can consolidate local single-plant suppliers to make it efficient for us. They also bring capital for machinery upgrading and scale for paper source.</p>
    </td></tr>
    <tr>    <td><p>Approach to suppliers</p>
    </td>   <td><p>collaborative</p>
    </td>   <td><p>Competition to drive price is clear; preferential and value-add deals require collaboration</p>
    </td></tr>
    <tr>    <td><p>Make v buy</p>
    </td>   <td><p>buy</p>
    </td>   <td><p>Multiple suppliers; commoditised technologies</p>
    </td></tr>
    <tr>    <td><p>Distance of suppliers to plant</p>
    </td>   <td><p class="list_Paragraph">Max 300km for boxes (300miles in NA); up to 1000km for paper reels.</p>
    <p class="list_Paragraph">Can be longer for specialist print grades or to countries with no high quality local supply</p>
    </td>   <td><p class="list_Paragraph">Economic max as high volume product (air in the fluting)</p>
    <p class="list_Paragraph">Need recent built paper machines to produce paper strong enough to run on high-speed corrugators</p>
    </td></tr>
    <tr>    <td><p>Type of suppliers</p>
    </td>   <td><p class="list_Paragraph">Integrated with containerboard making</p>
    <p />
    <p class="list_Paragraph">Corrugators on-site</p>
    </td>   <td><p class="list_Paragraph">To assure supply and avoid being leveraged by paper making scale</p>
    <p class="list_Paragraph">Cost structure not competitive if have to buy in board (shipping air)</p>
    </td></tr>
    <tr>    <td><p>Purchase of feedstocks</p>
    </td>   <td><p>Not if integrated suppliers</p>
    </td>   <td><p>Integrated suppliers have 20x our scale</p>
    </td></tr>
    <tr>    <td><p>Length and nature of contracts</p>
    </td>   <td><p>Multiple year (2-3), but with fixed glidepath pricing/value every year</p>
    </td>   <td><p>Significant effort for Purchases to re-enquire annually. High number of specs and low resources mean long time to qualify relative to additional value if only 12 month allocation.</p>
    </td></tr>
    <tr>    <td><p>Specifications</p>
    </td>   <td><p class="list_Paragraph">Standard board weights</p>
    <p />
    <p />
    <p class="list_Paragraph">Tailored box sizes</p>
    </td>   <td><p class="list_Paragraph">Paper scale much higher so uneconomic to make tailored weight</p>
    <p class="list_Paragraph">Maximising pallet fit delivers better savings and stronger pallet (less transport damages) than scale savings of standard box size.</p>
    </td></tr>
    <tr>    <td><p>Terms</p>
    </td>   <td><p>Standard, including payment terms</p>
    </td>   <td><p>High degree of competition, no specialist investment. Paper making has good cash-flow, so no need for shorter payment terms.</p>
    </td></tr>
    </tbody></table>
    <p>date</p>
    </td></tr>
    </tbody></table>
    <p />
    <p />
    <p>1</p>
    <p class="footer" />
    </body></html>
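
For context, here is a minimal sketch of how a ContentHandler can turn SAX events into markup like the output above. It uses only the JDK: a `TransformerHandler` is itself a `ContentHandler` that serializes whatever events it receives. The class name `SaxToXhtml`, the method `render`, and the hand-written events are assumptions made for illustration. With Tika, the same handler would instead be passed to a parser's `parse(InputStream, ContentHandler, Metadata, ParseContext)` call, which emits the events itself.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class SaxToXhtml {

    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Serializes a single XHTML paragraph from SAX events (hypothetical helper). */
    public static String render(String text) throws Exception {
        // The JDK's default TransformerFactory also implements SAXTransformerFactory.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // Assumption: these events stand in for what a document parser would emit.
        handler.startDocument();
        handler.startPrefixMapping("", XHTML_NS);
        handler.startElement(XHTML_NS, "p", "p", new AttributesImpl());
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML_NS, "p", "p");
        handler.endPrefixMapping("");
        handler.endDocument();

        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Special characters are escaped by the serializer, as in the meta tags above.
        System.out.println(render("Procter & Gamble"));
    }
}
```

Because the handler writes to a `StringWriter`, the result is an ordinary string that can be handed to any XML tool afterwards.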

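The challenge starts when I want to extract, for example, a single element from the handler output. It was suggested that I use XPath and get the table via a regular expression. I understand the concept, but I could not get it to work with Tika as described here.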
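
A minimal sketch of the XPath step using plain JAXP, assuming the handler output is first captured as a well-formed XHTML string. The excerpt above looks trimmed (it has unmatched closing tags), so the example uses a small fragment in the same shape. The class name `XhtmlTableXPath`, the method `tableRows`, and the prefix `h` are assumptions made for illustration. Since the output declares the XHTML default namespace, the XPath expressions need a prefix bound to that namespace; an unprefixed `//table` would match nothing on a namespace-aware document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlTableXPath {

    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Binds the arbitrary prefix "h" to the XHTML namespace for XPath lookups. */
    private static final NamespaceContext XHTML_CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String namespaceURI) {
            return XHTML_NS.equals(namespaceURI) ? "h" : null;
        }
        @Override public Iterator<String> getPrefixes(String namespaceURI) {
            String prefix = getPrefix(namespaceURI);
            List<String> prefixes = (prefix == null)
                    ? Collections.<String>emptyList()
                    : Collections.singletonList(prefix);
            return prefixes.iterator();
        }
    };

    /** Returns the text of every table cell, grouped by row. */
    public static List<List<String>> tableRows(String xhtml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(XHTML_CONTEXT);

        NodeList rows = (NodeList) xpath.evaluate(
                "//h:table//h:tr", doc, XPathConstants.NODESET);
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            NodeList cells = (NodeList) xpath.evaluate(
                    "h:td", rows.item(i), XPathConstants.NODESET);
            List<String> row = new ArrayList<>();
            for (int j = 0; j < cells.getLength(); j++) {
                // normalize-space collapses the line breaks between <p> elements.
                row.add(xpath.evaluate("normalize-space(.)", cells.item(j)));
            }
            result.add(row);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title></title></head>"
                + "<body><table><tbody>"
                + "<tr><td><p>principle</p></td><td><p>optimum</p></td><td><p>rationale</p></td></tr>"
                + "<tr><td><p>Make v buy</p></td><td><p>buy</p></td>"
                + "<td><p>Multiple suppliers; commoditised technologies</p></td></tr>"
                + "</tbody></table></body></html>";
        for (List<String> row : tableRows(xhtml)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Nothing here depends on Tika. It only needs the XHTML as a string, so the two can be combined: Tika produces the markup and JAXP queries it.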

After reading threads like this one, I am wondering whether to drop Tika entirely and use JAXP, or to use a combination of the two (?).

Can someone point out where my assumptions or direction are wrong, and how I should proceed?
