javascript - How can I extract images from a Word document using JavaScript?

Question

I'm trying to extract images from a Word document using ActiveXObject in JavaScript (IE only).

I haven't been able to find an API reference for the Word object, only a few hints scattered around the internet.

var filename = 'path/to/word/doc.docx'
var word = new ActiveXObject('Word.Application')
var doc = word.Documents.Open(filename)
// Grab the document content (a Range object; its Text property holds the plain text)
var docText = doc.Content

How can I access the images in the Word document using something like doc.Content?

Also, if anyone has a definitive source for the API (preferably from Microsoft), that would be very helpful.

score 0 · Accepted Answer

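After a few weeks of research, I found that the easiest way to extract the images is the SaveAs function that is part of the Word ActiveXObject. When the file is saved as an HTML document, Word creates a folder containing the images.

From there, you can fetch the HTML file with XMLHttp and create new IMG tags that the browser can display (I'm using IE (9), since ActiveXObject only works in Internet Explorer).

Let's start with the SaveAs part: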

// Define the path to the file
var filepath = 'path/to/the/word/doc.docx'
// Make a new ActiveXWord application
var word = new ActiveXObject('Word.Application')
// Open the document
var doc = word.Documents.Open(filepath)
// Save the DOCX as an HTML file (the 8 specifies you want to save it as an HTML document)
doc.SaveAs(filepath + '.htm', 8)

This creates a folder in the same directory that contains the image files.
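The SaveAs step can also be wrapped in a small function, sketched below. The names saveDocAsHtml and WD_FORMAT_HTML are hypothetical, wordApp is assumed to be a Word.Application object, and the "_files" folder name is inferred only from the example src shown in this answer, so treat it as an assumption rather than documented behavior.

```javascript
// 8 is the WdSaveFormat value wdFormatHTML (the literal used in the snippet above)
var WD_FORMAT_HTML = 8

// Hypothetical helper: wordApp is assumed to be a Word.Application object,
// e.g. new ActiveXObject('Word.Application') in Internet Explorer
function saveDocAsHtml(wordApp, filepath) {
    var htmlPath = filepath + '.htm'
    var doc = wordApp.Documents.Open(filepath)
    try {
        doc.SaveAs(htmlPath, WD_FORMAT_HTML)
    } finally {
        // 0 = wdDoNotSaveChanges; close the document so it does not stay open in Word
        doc.Close(0)
    }
    return {
        htmlPath: htmlPath,
        // Assumption: folder is named like the example src in this answer (doc.docx_files)
        imagesFolder: filepath + '_files'
    }
}
```

In IE this would be called as saveDocAsHtml(new ActiveXObject('Word.Application'), filepath). When processing many documents, it is probably worth reusing one Word instance and calling its Quit() method at the end.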

Note: In Word HTML, images are stored in <v:imagedata> tags, which sit inside <v:shape> tags. For example:

<v:shape style="width: 241.5pt; height: 71.25pt;">
     <v:imagedata src="path/to/the/word/doc.docx_files/image001.png">
         ...
     </v:imagedata>
</v:shape>

I've removed the extraneous attributes and tags that Word saves.
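If you only need the image paths and not the DOM, a rough alternative is to pull the src values straight out of the HTML text. This is a sketch, not what I actually used: findImageSources is a hypothetical name, and the regular expression assumes the src attribute is quoted and that the markup looks like the example above.

```javascript
// Hypothetical helper: collect the src of every <v:imagedata> tag in a Word HTML string
function findImageSources(htmlText) {
    // Assumes src is quoted with either double or single quotes
    var pattern = /<v:imagedata\b[^>]*?\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)')/gi
    var sources = []
    var match
    while ((match = pattern.exec(htmlText)) !== null) {
        // Only one of the two capture groups is filled, depending on the quote style
        sources.push(match[1] || match[2] || '')
    }
    return sources
}
```

For the snippet above this would return a one-element array holding the image001.png path.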

To access the HTML from JavaScript, use the XMLHttpRequest object.

 var xmlhttp = new XMLHttpRequest()
 var html_text = ""

Because I'm accessing hundreds of Word documents, I found it best to define the XMLHttp onreadystatechange callback before sending the call.

// Define the onreadystatechange callback function
xmlhttp.onreadystatechange = function() {
    // Check to make sure the response has fully loaded
    if (xmlhttp.readyState==4 && xmlhttp.status==200) {
        // Grab the response text
        var html_text=xmlhttp.responseText
        // Load the HTML into the innerHTML of a DIV to add the HTML to the DOM
        document.getElementById('doc_html').innerHTML=html_text.replace("<html>", "").replace("</html>","")
        // Define a new array of all HTML elements with the "v:imagedata" tag
        var images =document.getElementById('doc_html').getElementsByTagName("v:imagedata")
        // Loop through each image
        for(var j=0;j<images.length;j++) {
            // Grab the source attribute to get the image name
            var src = images[j].getAttribute('src')
            // Check to make sure the image has a 'src' attribute
            if(src!=undefined) {
                ...

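Because of the way IE escapes HTML attributes when the HTML is loaded into the innerHTML of the doc_html div, I had a lot of trouble getting the correct src attribute, so the example below uses a pseudo-path and src.split('/')[1] to get the image name (this will not work if the path contains more than one forward slash!):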

                ...
                images[j].setAttribute('src', '/path/to/the/folder/containing/the/images/'+src.split('/')[1])
                ...
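The single-slash limitation can be avoided by taking everything after the last path separator instead of using split('/')[1]. A minimal sketch, with imageFileName as a hypothetical helper name; it treats both forward slashes and backslashes as separators, which is an assumption about what IE may hand back:

```javascript
// Hypothetical helper: return the file name portion of an image src,
// regardless of how many path segments come before it
function imageFileName(src) {
    // lastIndexOf returns -1 when the separator is absent, so a bare file name is returned unchanged
    var lastSlash = Math.max(src.lastIndexOf('/'), src.lastIndexOf('\\'))
    return src.substring(lastSlash + 1)
}
```

In the line above, src.split('/')[1] would then become imageFileName(src).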

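Now we use the parent's parent (the parent is the v:shape object, and its parent happens to be a p object) to add a new img tag to the HTML div. We take the src attribute from the image and the style information from the v:shape element, and append the new img tag to the innerHTML.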

                ...
                images[j].parentElement.parentElement.innerHTML+="<img src='"+images[j].getAttribute('src')+"' style='"+images[j].parentElement.getAttribute('style')+"'>"

            }
        }       
    }
}
// Read the HTML Document using XMLHttpRequest
xmlhttp.open("POST", filepath + '.htm', false)
xmlhttp.send()

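It's a little specific to my case, but with the method above I was able to add img tags to the HTML in the places where the images appeared in the original document.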
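One caveat with the string concatenation used for the img tag: it produces broken markup if src or style contains a quote, and it writes style='null' when the v:shape has no style attribute. A possible way around that is sketched here; escapeAttribute and buildImgTag are hypothetical helpers, not part of the code I ran.

```javascript
// Hypothetical helper: escape a value so it is safe inside a double-quoted HTML attribute
function escapeAttribute(value) {
    return String(value)
        .replace(/&/g, '&amp;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
}

// Hypothetical helper: build the img tag, leaving out style when it is missing
function buildImgTag(src, style) {
    var tag = '<img src="' + escapeAttribute(src) + '"'
    if (style) {
        tag += ' style="' + escapeAttribute(style) + '"'
    }
    return tag + '>'
}
```

The innerHTML line would then append buildImgTag(images[j].getAttribute('src'), images[j].parentElement.getAttribute('style')) instead of the hand-built string.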
