search - How can I search PDF documents / a PDX catalog with PowerShell?

Question

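I have a vendor that supplies its documentation library as a set of PDF files (plus a few CHM files) and also includes a .PDX catalog.

I would like to write a PowerShell script as a front end (either using PowerShell forms or hosting PowerShell in asp.net).

I am at an early stage and have worked out how to pull the document information out of a PDF stream (the xmpmeta XML metadata block near the end of the PDF file, one of the few plain-text streams in the file). It looks like this: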

    <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:08:04 
       "><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><rdf:Description rdf:about="
" xmlns:pdf="http://ns.adobe.com/pdf/1.3/"><pdf:Producer>GPL Ghostscript 8.64</pdf:Producer><pdf:Keywo
rds>86000056-413</pdf:Keywords></rdf:Description><rdf:Description rdf:about="" xmlns:xmp="http://ns.ad
obe.com/xap/1.0/"><xmp:ModifyDate>2011-03-03T17:38:34-05:00</xmp:ModifyDate><xmp:CreateDate>2011-01-28
T23:12:07+05:30</xmp:CreateDate><xmp:CreatorTool>PScript5.dll Version 5.2</xmp:CreatorTool><xmp:Metada
taDate>2011-03-03T17:38:34-05:00</xmp:MetadataDate></rdf:Description><rdf:Description rdf:about="" xml
ns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"><xmpMM:DocumentID>6cb2263d-2d61-11e0-0000-1390d57dcfcb</xmp
MM:DocumentID><xmpMM:InstanceID>uuid:1a0e68ba-14ad-4a03-b7a1-0a0e127b8753</xmpMM:InstanceID></rdf:Desc
ription><rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:format>applicati
on/pdf</dc:format><dc:title><rdf:Alt><rdf:li xml:lang="x-default">I/O Subsystem Programming Guide</rdf
:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>Unisys Information Development</rdf:li></rdf:Seq
></dc:creator><dc:description><rdf:Alt><rdf:li xml:lang="x-default">ClearPath MCP 13.1,Application Dev
elopment,Administration,ClearPath MCP</rdf:li></rdf:Alt></dc:description></rdf:Description></rdf:RDF><
/x:xmpmeta>

I use the following code (PowerShell v3; in v2 you have to select and expand the properties instead: [string]$title = ($rdf.GetElementsByTagName('dc:title')| Select -expand Alt|Select -expand li)."#text"):

    $file = ".\Downloads\68698703-007\PDF\86000056-413.pdf"

    # 0-based index of the line where the xmpmeta block starts
    # (Select-String reports 1-based line numbers)
    [int]$startln = (Select-String -Pattern '^<x:' -Path $file).LineNumber - 1

    # 0-based index of the line where the xmpmeta block ends
    [int]$endln = (Select-String -Pattern '^</x:' -Path $file).LineNumber - 1

    # grab the xmpmeta lines and cast them to XML
    [xml]$xmp = (Get-Content $file)[$startln..$endln]
    [xml]$rdf = $xmp.xmpmeta.InnerXml

    # get title/creator/description element text
    # (dc:title and dc:description wrap their text in rdf:Alt, dc:creator uses rdf:Seq)
    [string]$title = $rdf.GetElementsByTagName('dc:title').Alt.li."#text"
    [string]$creator = $rdf.GetElementsByTagName('dc:creator').Seq.li
    [string]$description = $rdf.GetElementsByTagName('dc:description').Alt.li."#text"

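This matters a great deal, because the file names have the form 12345678-123.pdf, while the real title, the document category and so on live in the metadata itself.

So I can build a list of documents (showing proper titles rather than the actual file names) and launch them from it. What I would also like is to use the PDX file to search across all the documents, but it is anything but plain text!

I suppose I could use any of a number of tools to convert each PDF to text, search it, repeat for every document, and return the results per document.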
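The convert-then-search loop described above could be sketched roughly as follows. This is only an illustration: the function name Search-PdfText is hypothetical, and the default converter assumes that the pdftotext command-line tool (from Xpdf or Poppler) is installed and on the PATH, which the question does not mention. The converter is passed in as a script block so that any other tool can be swapped in.

```powershell
# Hypothetical helper: convert each PDF to text and report documents that contain a term.
function Search-PdfText {
    param(
        [Parameter(Mandatory)] [string[]] $Path,
        [Parameter(Mandatory)] [string]   $Pattern,
        # Assumption: pdftotext is on the PATH; "-" sends the text to stdout.
        [scriptblock] $ToText = { param($p) & pdftotext -q $p - }
    )
    foreach ($pdf in $Path) {
        # Literal (non-regex) match against every line of the extracted text
        $hits = @(& $ToText $pdf | Select-String -Pattern $Pattern -SimpleMatch)
        if ($hits.Count -gt 0) {
            # One result object per document that has at least one hit
            [pscustomobject]@{
                Path  = $pdf
                Hits  = $hits.Count
                First = $hits[0].Line.Trim()
            }
        }
    }
}

# Example (commented out because it needs pdftotext and real files):
# Get-ChildItem 'P:\Doc Library\PDF' -Filter *.pdf |
#     ForEach-Object { Search-PdfText -Path $_.FullName -Pattern 'trim' }
```

This runs the converter once per document on every search, so it would probably be slow for a large library; caching the extracted text would be an obvious refinement.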

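But it seems to me that Adobe Reader already does this. Can I launch AcroRd32.exe with a switch that starts a search, pass a search term to the AcroRd32 program, or use an Adobe search API from within PowerShell?

Does anyone have concrete ideas on automating the loading of a .PDX in Adobe Reader and starting a search, or on using Adobe's API from PowerShell?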

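Edit:
I can now launch Acrobat from the command line with a search (and I can mimic this in PowerShell), but the search only works when it targets a PDF, not the PDX catalog. Both commands bring up the search pane, but the search field is filled in and the search is executed only for the PDF document.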

    C:\Program Files (x86)\Adobe\Reader 10.0\Reader>AcroRd32.exe /A "search=trim" "P:\Doc Library\PDF\00_home.pdx"

or

    C:\Program Files (x86)\Adobe\Reader 10.0\Reader>AcroRd32.exe /A "search=trim" "P:\Doc Library\PDF\86000056-413.pdf"
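Mimicking those command lines in PowerShell might look something like the sketch below. The helper names Get-ReaderSearchArguments and Start-ReaderSearch are hypothetical, and the install path is simply the one shown in the prompt above. As noted, this would be expected to pre-fill the search only for a PDF, not for the PDX catalog.

```powershell
# Install path copied from the command lines above; it differs between Reader versions.
$ReaderExe = 'C:\Program Files (x86)\Adobe\Reader 10.0\Reader\AcroRd32.exe'

# Hypothetical helper: builds the same three arguments as the command lines above.
function Get-ReaderSearchArguments {
    param(
        [Parameter(Mandatory)] [string] $Term,
        [Parameter(Mandatory)] [string] $Document
    )
    # Each argument is wrapped in quotes so that spaces in the search term or
    # the document path survive Start-Process joining the arguments with spaces.
    '/A', "`"search=$Term`"", "`"$Document`""
}

# Hypothetical helper: launches Reader with the search term passed via /A.
function Start-ReaderSearch {
    param(
        [Parameter(Mandatory)] [string] $Term,
        [Parameter(Mandatory)] [string] $Document
    )
    $argList = Get-ReaderSearchArguments -Term $Term -Document $Document
    Start-Process -FilePath $ReaderExe -ArgumentList $argList
}

# Example (commented out so that loading the script does not launch Reader):
# Start-ReaderSearch -Term 'trim' -Document 'P:\Doc Library\PDF\86000056-413.pdf'
```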

Regards, Graham

score 0 · Accepted Answer

This is an old post, but be aware that the kind of searching you are doing is potentially dangerous, and that there is a better way to find the XMP metadata in a PDF file. XMP was designed specifically to be "findable" by a text search. To that end, each packet is wrapped in well-defined begin and end markers, which are there precisely so that you can extract the XMP data without having to parse the PDF format (or any other format the XMP metadata blob might be embedded in).

You can download the XMP specification here: http://www.adobe.com/devnet/xmp.html. Part 1 describes XMP packets and explains how a text scanner can locate a packet more accurately.
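As a rough illustration of that kind of packet scan, the sketch below looks for the <?xpacket begin=...?> header and the <?xpacket end=...?> trailer that the specification defines, instead of relying on line positions. The function name Get-XmpPacket is hypothetical, and the code assumes the packet is stored uncompressed and encoded as UTF-8. The file is decoded as Latin-1 first only because that maps every byte to exactly one character, so the binary parts of the PDF cannot corrupt the scan.

```powershell
# Hypothetical helper: returns the body of the first XMP packet found in a byte array.
function Get-XmpPacket {
    param([Parameter(Mandatory)] [byte[]] $Bytes)

    # ISO-8859-1 (code page 28591) round-trips every byte value unchanged
    $latin1 = [System.Text.Encoding]::GetEncoding(28591)
    $raw    = $latin1.GetString($Bytes)

    # Packet header, body (captured), packet trailer; (?s) lets . match newlines
    $rx = [regex]'(?s)<\?xpacket\s+begin=.*?\?>(.*?)<\?xpacket\s+end=.*?\?>'
    $m  = $rx.Match($raw)
    if (-not $m.Success) { return $null }

    # Assumption: the packet itself is UTF-8, so re-decode just the body
    $body = [System.Text.Encoding]::UTF8.GetString($latin1.GetBytes($m.Groups[1].Value))
    $body.Trim()
}

# Example (commented out because it needs a real file):
# $bytes = [System.IO.File]::ReadAllBytes((Resolve-Path $file).Path)
# [xml]$xmp = Get-XmpPacket -Bytes $bytes
```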

Finally, PDF has an additional quirk: it can be incrementally updated. This can leave multiple XMP packets in the file (where the last packet is normally the correct one). More annoyingly, when a PDF is exported from applications like InDesign, images in the PDF (and other objects) might also have their own "object" XMP attached to them.
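One way to allow for multiple packets is to collect every packet in the file and then take the last one, as in the sketch below. The function name Get-XmpPacketBodies is hypothetical, and "last packet wins" is only the rule of thumb mentioned above: if the file also carries object-level XMP, the last packet may belong to an image rather than to the document, so the result would still need checking.

```powershell
# Hypothetical helper: returns the body of every XMP packet found in a string, in file order.
function Get-XmpPacketBodies {
    param([Parameter(Mandatory)] [string] $Text)

    # Packet header, body (captured), packet trailer; (?s) lets . match newlines
    $rx = [regex]'(?s)<\?xpacket\s+begin=.*?\?>(.*?)<\?xpacket\s+end=.*?\?>'
    foreach ($m in $rx.Matches($Text)) {
        $m.Groups[1].Value.Trim()
    }
}

# Example (commented out because it needs a real file):
# $latin1  = [System.Text.Encoding]::GetEncoding(28591)   # one byte per character
# $raw     = $latin1.GetString([System.IO.File]::ReadAllBytes((Resolve-Path $file).Path))
# $packets = @(Get-XmpPacketBodies -Text $raw)
# $last    = $packets[-1]                                  # heuristic: last packet
```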

So consider where your files come from, how many strange things you might encounter, and which of them you want to provide for. In any case, reading the XMP specification is not a bad idea.
