python - What is a strategy for converting hOCR output to a string (for regex purposes)?

Question

I'm using Pytesseract and want to convert its hOCR output to a string. Of course, Pytesseract already implements such a function, but I'd like to learn more about the possible strategies for doing this. Thanks.

from pytesseract import image_to_pdf_or_hocr
hocr_output = image_to_pdf_or_hocr(image, extension='hocr')

score 0 · Accepted Answer

hOCR is a kind of XML, so you can use an XML parser.

But first, you need to convert Tesseract's binary output (bytes) to str.

from pytesseract import image_to_pdf_or_hocr

hocr_output = image_to_pdf_or_hocr(image, extension='hocr')
hocr = hocr_output.decode('utf-8')

Now you can parse it with xml.etree.

import xml.etree.ElementTree as ET

root = ET.fromstring(hocr)
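
If you only want the recognized words rather than all the text in the document, one option is to walk the parsed tree. hOCR is XHTML, so each tag carries the XHTML namespace, and Tesseract typically marks each word with class ocrx_word. The following is a minimal sketch under those assumptions: the sample document is hand-written for illustration (it is not real Tesseract output), and words_from_hocr is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace prefix that ElementTree puts in front of every XHTML tag name
XHTML = '{http://www.w3.org/1999/xhtml}'

# Hand-written stand-in for hOCR output (assumption: real output
# carries more attributes, such as bounding boxes in "title")
SAMPLE_HOCR = '''<html xmlns="http://www.w3.org/1999/xhtml">
 <head><title>sample</title></head>
 <body>
  <div class='ocr_page' id='page_1'>
   <span class='ocr_line' id='line_1_1'>
    <span class='ocrx_word' id='word_1_1'>Invoice</span>
    <span class='ocrx_word' id='word_1_2'><strong>No.</strong></span>
    <span class='ocrx_word' id='word_1_3'>12345</span>
   </span>
  </div>
 </body>
</html>'''


def words_from_hocr(hocr):
    """Return the text of every span whose class is 'ocrx_word'."""
    root = ET.fromstring(hocr)
    words = []
    for span in root.iter(XHTML + 'span'):
        if span.get('class') == 'ocrx_word':
            # itertext() also picks up text nested in <strong> or <em>
            words.append(''.join(span.itertext()))
    return words


words = words_from_hocr(SAMPLE_HOCR)
```

Filtering on the class attribute skips the page title and the whitespace between tags, which otherwise end up in the extracted text.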

xml.etree provides a text iterator whose results can be joined into a single string.

text = ''.join(root.itertext())
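
Note that the joined string keeps the indentation and line breaks that sit between tags, as well as any text in the head. For regex work it may help to collapse the whitespace first. Here is a minimal sketch, again using a hand-written sample rather than real Tesseract output; the helper name hocr_to_text and the pattern are illustrative assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written stand-in for hOCR output (assumption, not real Tesseract output)
SAMPLE_HOCR = '''<html xmlns="http://www.w3.org/1999/xhtml">
 <head><title></title></head>
 <body>
  <div class='ocr_page'>
   <span class='ocr_line'>
    <span class='ocrx_word'>Total:</span>
    <span class='ocrx_word'>42.50</span>
   </span>
  </div>
 </body>
</html>'''


def hocr_to_text(hocr):
    """Join all text nodes, then collapse whitespace runs to single spaces."""
    root = ET.fromstring(hocr)
    raw = ''.join(root.itertext())
    # split() with no arguments drops newlines and indentation between tags
    return ' '.join(raw.split())


text = hocr_to_text(SAMPLE_HOCR)

# Example pattern: capture the amount that follows "Total:"
match = re.search(r'Total:\s*(\d+\.\d{2})', text)
amount = match.group(1) if match else None
```

Collapsing whitespace loses the line structure, so if your patterns depend on line breaks, you would need to join per line element instead.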
