java - XPathExpression to select from tesseract hOCR XML output

Question

I have a file that looks roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta name='ocr-system' content='tesseract 3.02' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "D:\DPC2\converted\60\60.tiff"; bbox 0 0 2479 3508; ppageno 0'>
       <!-- LOTS OF CONTENT -->
  </div>
 </body>
</html>

I am then using JDOM 2.x with the following XPath query:

//htmlFile is an input variable of type java.nio.Path
Document document = xmlBuilder.build(htmlFile.toFile());

XPathFactory factory = XPathFactory.instance();
XPathExpression<Element> xpePages = 
    factory.compile("//html/body/div[@class='ocr_page']", Filters.element());
List<Element> pages = xpePages.evaluate(document);

However, it does not find any elements. What is wrong with the query?

score 2 · Accepted Answer

<html xmlns="http://www.w3.org/1999/xhtml"

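means that elements such as html are in the namespace http://www.w3.org/1999/xhtml, so an unprefixed name like html in the XPath (which refers to no namespace) does not match them.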

You have a few options:

Register a NamespaceContext (roughly a namespace manager), which looks rather cumbersome in this technology stack: https://stackoverflow.com/a/6390494/314291
Or use a namespace-agnostic XPath:

 //*[local-name()='html' and namespace-uri()='http://www.w3.org/1999/xhtml']
 /*[local-name()='body' and namespace-uri()='http://www.w3.org/1999/xhtml']
 /* ... etc.
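As a rough illustration of the namespace-agnostic form, here is a minimal sketch that evaluates such an expression with the JDK's built-in javax.xml.xpath API rather than JDOM (that swap is my assumption, made only to keep the example dependency-free). The class name LocalNameXPathSketch and the trimmed SAMPLE document are hypothetical stand-ins for the real hOCR file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LocalNameXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical, trimmed-down stand-in for the hOCR file (no DOCTYPE, so no DTD lookup).
    static final String SAMPLE =
        "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
      + "<div class='ocr_page' id='page_1'/>"
      + "<div class='other'/>"
      + "</body></html>";

    // Counts ocr_page divs using only local-name() and namespace-uri() tests.
    static int countPages(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed so the DOM records namespace URIs
            Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            // XPath 1.0 compares with a single '='.
            String expr =
                "//*[local-name()='html' and namespace-uri()='" + XHTML_NS + "']"
              + "/*[local-name()='body' and namespace-uri()='" + XHTML_NS + "']"
              + "/*[local-name()='div' and namespace-uri()='" + XHTML_NS + "']"
              + "[@class='ocr_page']";

            NodeList pages = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
            return pages.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countPages(SAMPLE)); // prints 1
    }
}
```

The same expression string should be usable with JDOM's factory.compile(..., Filters.element()), since it contains no prefixes that would need to be bound.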

If you are certain that there are no namespace collisions among the elements, you can simply use local-name() on its own:

//*[local-name()='html']/*[local-name()='body']/* ...
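For comparison, the first option (binding a prefix to the XHTML namespace) can be sketched as follows. This is only an illustration under assumptions: it uses the JDK's javax.xml.xpath API instead of JDOM, and the class name PrefixBindingSketch, the prefix h, and the SAMPLE document are all hypothetical choices of mine.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PrefixBindingSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical, trimmed-down stand-in for the hOCR file.
    static final String SAMPLE =
        "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
      + "<div class='ocr_page' id='page_1'/>"
      + "</body></html>";

    // Minimal NamespaceContext that only knows the (arbitrary) prefix "h".
    static final NamespaceContext XHTML_CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String namespaceURI) {
            return XHTML_NS.equals(namespaceURI) ? "h" : null;
        }
        @Override public Iterator<String> getPrefixes(String namespaceURI) {
            return XHTML_NS.equals(namespaceURI)
                ? Collections.singletonList("h").iterator()
                : Collections.<String>emptyIterator();
        }
    };

    // Returns how many nodes the expression selects in the given XML.
    static int count(String xml, String expr) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(XHTML_CONTEXT);
            NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed names refer to "no namespace", so nothing matches.
        System.out.println(count(SAMPLE, "//html/body/div[@class='ocr_page']"));       // prints 0
        // With the prefix bound to the XHTML namespace the div is found.
        System.out.println(count(SAMPLE, "//h:html/h:body/h:div[@class='ocr_page']")); // prints 1
    }
}
```

In JDOM 2.x itself, as far as I know, XPathFactory.compile has an overload that takes a variables map and Namespace arguments, so something like factory.compile("//h:html/h:body/h:div[@class='ocr_page']", Filters.element(), null, Namespace.getNamespace("h", "http://www.w3.org/1999/xhtml")) may be less cumbersome than a hand-written NamespaceContext; treat that call as an untested suggestion.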
