python - .doc to PDF using Python

Question

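I have been tasked with converting a large number of .doc files to .pdf, and the only way my supervisor wants me to do this is through MS Word 2010. I should be able to automate this with Python COM automation. The only problem is that I don't know how or where to start. I tried searching for tutorials but could not find any (there may be some, but I don't know what I'm looking for).

Right now I'm reading through this. I don't know how useful it is going to be.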

score 91 · Accepted Answer

A simple example using comtypes, converting a single file, with the input and output filenames given as command-line arguments:

import sys
import os
import comtypes.client

wdFormatPDF = 17

in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])

word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()

You could also use pywin32, which would be the same except for:

import win32com.client

and then:

word = win32com.client.Dispatch('Word.Application')
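
If `SaveAs` raises part-way through, the document can be left open inside a hidden Word process. A minimal sketch of the same single-file conversion with cleanup in `try`/`finally` follows. `convert_one` is a hypothetical helper name, and its `word` argument is assumed to be whatever `CreateObject` or `Dispatch` returned above.

```python
import os

wdFormatPDF = 17  # same value as in the example above


def convert_one(word, in_file, out_file):
    """Convert one document using an already-created Word application object.

    `word` is assumed to be the object returned by
    comtypes.client.CreateObject('Word.Application') or
    win32com.client.Dispatch('Word.Application').
    """
    # use absolute paths, as the example above does
    in_file = os.path.abspath(in_file)
    out_file = os.path.abspath(out_file)

    doc = word.Documents.Open(in_file)
    try:
        doc.SaveAs(out_file, FileFormat=wdFormatPDF)
    finally:
        # close the document even if SaveAs raised
        doc.Close()
    return out_file
```

The caller would still be responsible for `word.Quit()`, ideally in its own `finally` block.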

score 30 · Accepted Answer

You can use the docx2pdf Python package to bulk convert docx to pdf. It can be used as both a CLI and a Python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.

from docx2pdf import convert

convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")

pip install docx2pdf
docx2pdf input.docx output.pdf

Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf

score 17 · Accepted Answer

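I worked on this problem for half a day, so I think I should share some of my experience with it. Steven's answer is right, but it fails on my computer. There are two key points to fix it.

(1) When the 'Word.Application' object is created for the first time, it (the Word app) has to be made visible before any document is opened. (Actually, I can't explain why this works myself. If I don't do this on my computer, the program crashes when I try to open a document in invisible mode, and the 'Word.Application' object is then deleted by the OS.)

(2) After doing (1), the program sometimes works well but often still fails. The crash error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM server may not be able to respond that quickly. So I add a delay before trying to open a document.

After these two steps, the program works perfectly with no more failures. The demo code is below. If you run into the same problems, try following these two steps. Hope it helps.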

    import os
    import comtypes.client
    import time


    wdFormatPDF = 17


    # absolute path is needed
    # be careful about the slash '\', use '\\' or '/' or raw string r"..."
    in_file=r'absolute path of input docx file 1'
    out_file=r'absolute path of output pdf file 1'

    in_file2=r'absolute path of input docx file 2'
    out_file2=r'absolute path of output pdf file 2'

    # print out filenames
    print(in_file)
    print(out_file)
    print(in_file2)
    print(out_file2)


    # create COM object
    word = comtypes.client.CreateObject('Word.Application')
    # key point 1: make word visible before open a new document
    word.Visible = True
    # key point 2: wait for the COM Server to prepare well.
    time.sleep(3)

    # convert docx file 1 to pdf file 1
    doc=word.Documents.Open(in_file) # open docx file 1
    doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
    doc.Close() # close docx file 1
    word.Visible = False
    # convert docx file 2 to pdf file 2
    doc = word.Documents.Open(in_file2) # open docx file 2
    doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
    doc.Close() # close docx file 2   
    word.Quit() # close Word Application
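
A fixed `time.sleep(3)` may be longer than needed on some machines and too short on others. One alternative, offered here as an untested assumption rather than something this answer verified against Word, is to retry the call a few times when it raises. `retry_call` is a hypothetical helper; which exception types to retry is left to the caller.

```python
import time


def retry_call(func, *args, retries=5, delay=1.0, exceptions=(Exception,), **kwargs):
    """Call func(*args, **kwargs), waiting `delay` seconds between failed attempts.

    Re-raises the last exception if every attempt fails.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except exceptions:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay)
```

With comtypes, a call might look like `doc = retry_call(word.Documents.Open, in_file, exceptions=(comtypes.COMError,))`.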

score 13 · Accepted Answer

I have tested many solutions, but none of them works efficiently on Linux distributions.

I recommend this solution:

import sys
import subprocess
import re


def convert_to(folder, source, timeout=None):
    args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]

    process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
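    filename = re.search('-> (.*?) using filter', process.stdout.decode())

    # no match means LibreOffice did not report a converted file
    if filename is None:
        raise RuntimeError(process.stdout.decode() or process.stderr.decode())
    return filename.group(1)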


def libreoffice_exec():
    # TODO: Provide support for more platforms
    if sys.platform == 'darwin':
        return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
    return 'libreoffice'

And you call the function:

result = convert_to('TEMP Directory',  'Your File', timeout=15)
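
The regular expression in `convert_to` relies on the line that LibreOffice prints when a conversion succeeds. The sketch below isolates that parsing step so it can be checked without LibreOffice installed. `parse_converted_path` is a hypothetical helper, and the sample line is an assumed example of that output format with made-up paths, not captured from a real run.

```python
import re


def parse_converted_path(stdout_text):
    """Return the output path reported by soffice, or None if it is absent."""
    match = re.search(r'-> (.*?) using filter', stdout_text)
    return match.group(1) if match else None


# assumed example of a success line; the paths are made up
sample = 'convert /tmp/in/report.docx -> /tmp/out/report.pdf using filter : writer_pdf_Export'
print(parse_converted_path(sample))

# any text without the "-> ... using filter" pattern yields None
print(parse_converted_path('Error: source file could not be loaded'))
```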

All resources:

https://michalzalecki.com/converting-docx-to-pdf-using-python/

score 7 · Accepted Answer

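You can use unoconv (written in Python) with OpenOffice running as a headless daemon. http://dag.wiee.rs/home-made/unoconv/

It works very nicely for doc, docx, ppt, pptx, xls and xlsx. It is very useful if you need to convert documents, or save/convert them to certain formats, on a server.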

score 7 · Accepted Answer

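As an alternative to the SaveAs function, you can also use ExportAsFixedFormat, which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.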

doc.ExportAsFixedFormat(
    OutputFileName=pdf_file,
    ExportFormat=17,        # 17 = PDF output, 18 = XPS output
    OpenAfterExport=False,
    OptimizeFor=0,          # 0 = Print (higher res), 1 = Screen (lower res)
    CreateBookmarks=1,      # 0 = no bookmarks, 1 = heading bookmarks only, 2 = bookmarks match Word bookmarks
    DocStructureTags=True,
)

The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'
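
The numeric arguments are easier to read with names. The sketch below gives the values from the comments above the names used by Word's `WdExportFormat`, `WdExportOptimizeFor` and `WdExportCreateBookmarks` enumerations, and builds the keyword arguments for the call. `pdf_export_kwargs` is a hypothetical helper, not part of any library.

```python
# values as listed in the comments of the call above
wdExportFormatPDF = 17
wdExportFormatXPS = 18
wdExportOptimizeForPrint = 0
wdExportOptimizeForOnScreen = 1
wdExportCreateNoBookmarks = 0
wdExportCreateHeadingBookmarks = 1
wdExportCreateWordBookmarks = 2


def pdf_export_kwargs(pdf_file, for_screen=False,
                      bookmarks=wdExportCreateHeadingBookmarks):
    """Build keyword arguments for doc.ExportAsFixedFormat (PDF output)."""
    if for_screen:
        optimize_for = wdExportOptimizeForOnScreen
    else:
        optimize_for = wdExportOptimizeForPrint
    return {
        'OutputFileName': pdf_file,
        'ExportFormat': wdExportFormatPDF,
        'OpenAfterExport': False,
        'OptimizeFor': optimize_for,
        'CreateBookmarks': bookmarks,
        'DocStructureTags': True,
    }
```

The call then becomes `doc.ExportAsFixedFormat(**pdf_export_kwargs(pdf_file))`.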

score 4 · Accepted Answer

It is worth noting that Steven's answer works, but if you use a for loop to export multiple files, make sure to place the ClientObject or Dispatch statement before the loop (it only needs to be created once). See my problem: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs
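
A minimal sketch of that layout: the application object is created once before the loop and quit once after it. `convert_all` and its `create_word` parameter are hypothetical names; `create_word` is assumed to be a zero-argument callable such as `lambda: win32com.client.Dispatch('Word.Application')`.

```python
import os

wdFormatPDF = 17


def convert_all(create_word, in_files):
    """Convert every file in in_files to PDF using a single Word instance."""
    word = create_word()  # created once, outside the loop
    out_files = []
    try:
        for in_file in in_files:
            in_file = os.path.abspath(in_file)
            # write the PDF next to the source file
            out_file = os.path.splitext(in_file)[0] + '.pdf'
            doc = word.Documents.Open(in_file)
            doc.SaveAs(out_file, FileFormat=wdFormatPDF)
            doc.Close()
            out_files.append(out_file)
    finally:
        word.Quit()  # quit once, after the last document
    return out_files
```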

score 2 · Accepted Answer

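If you don't mind using PowerShell, have a look at this Hey, Scripting Guy! article. The code presented there could be adapted to use the wdFormatPDF enumeration value of WdSaveFormat (see here). This blog article presents a different implementation of the same idea.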

score 2 · Accepted Answer

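I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing, which were usually an order of magnitude bigger than expected. After looking into how to disable the dialogs when using a virtual PDF printer, I came across Bullzip PDF Printer and was rather impressed with its features. It has now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.

The COM API can be found here, and a list of the usable settings can be found here. The settings are written to a "runonce" file, which is used for one print job only and then removed automatically. When printing multiple PDFs, make sure one print job completes before starting another so that the settings are applied correctly to each file.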

import os, re, time, datetime, win32com.client

def print_to_Bullzip(file):
    util = win32com.client.Dispatch("Bullzip.PDFUtil")
    settings = win32com.client.Dispatch("Bullzip.PDFSettings")
    settings.PrinterName = util.DefaultPrinterName      # make sure we're controlling the right PDF printer

    outputFile = re.sub(r"\.[^.]+$", ".pdf", file)
    statusFile = re.sub(r"\.[^.]+$", ".status", file)

    settings.SetValue("Output", outputFile)
    settings.SetValue("ConfirmOverwrite", "no")
    settings.SetValue("ShowSaveAS", "never")
    settings.SetValue("ShowSettings", "never")
    settings.SetValue("ShowPDF", "no")
    settings.SetValue("ShowProgress", "no")
    settings.SetValue("ShowProgressFinished", "no")     # disable balloon tip
    settings.SetValue("StatusFile", statusFile)         # created after print job
    settings.WriteSettings(True)                        # write settings to the runonce.ini
    util.PrintFile(file, util.DefaultPrinterName)       # send to Bullzip virtual printer

    # wait until print job completes before continuing
    # otherwise settings for the next job may not be used
    timestamp = datetime.datetime.now()
    while( (datetime.datetime.now() - timestamp).seconds < 10):
        if os.path.exists(statusFile) and os.path.isfile(statusFile):
            error = util.ReadIniString(statusFile, "Status", "Errors", '')
            if error != "0":
                raise IOError("PDF was created with errors")
            os.remove(statusFile)
            return
        time.sleep(0.1)
    raise IOError("PDF creation timed out")

score 1 · Accepted Answer

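You should start by investigating so-called virtual PDF print drivers. As soon as you find one, you should be able to write a batch file that prints your DOC files to PDF files. You can probably do this in Python too (set up the printer driver output and issue the document/print command in MS Word; the latter can be done from the command line, AFAIR).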

score 0 · Accepted Answer

I was working with this solution, but I needed to search for all .docx, .dotm, .docm, .odt, .doc or .rtf files and then convert them all to .pdf (Python 3.7.5). Hope it works...

import os
import win32com.client

wdFormatPDF = 17
# extensions that Word is asked to open and save as PDF
extensions = (".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm")

# create the Word COM object once and reuse it for every file
word = win32com.client.Dispatch('Word.Application')
word.Visible = False

try:
    for root, dirs, files in os.walk(r'your directory here'):
        for f in files:
            if not f.endswith(extensions):
                continue
            print(f)
            in_file = os.path.abspath(os.path.join(root, f))
            # same name as the source file, with a .pdf extension
            out_file = os.path.splitext(in_file)[0] + '.pdf'
            try:
                doc = word.Documents.Open(in_file)
                doc.SaveAs(out_file, FileFormat=wdFormatPDF)
                doc.Close()
                print('done')
                # the original file is deleted after a successful conversion
                os.remove(in_file)
            except Exception:
                # skip documents that cannot be opened or converted
                print('could not open')
finally:
    # quit Word only after the last document has been processed
    word.Quit()

The try/except is there for the documents that could not be read, so that the code does not exit until the last document has been processed.

score 0 · Accepted Answer

I have also modified it for ppt support. My solution supports all of the extensions specified below.

word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]

My solution: Github link

I modified the code from Docx2PDF
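
As a rough illustration, and not the linked implementation, the sketch below shows how two extension lists like these could decide which Office application opens a file. `com_app_for` is a hypothetical helper; it only returns the COM ProgID and performs no conversion.

```python
import os

word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]


def com_app_for(filename):
    """Return the COM ProgID that would open this file, or None if unsupported."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in word_extensions:
        return 'Word.Application'
    if ext in ppt_extensions:
        return 'PowerPoint.Application'
    return None
```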

score -8 · Accepted Answer

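I would suggest ignoring your supervisor and using OpenOffice, which has a Python API. OpenOffice has built-in support for Python, and someone has created a library specifically for this purpose (PyODConverter).

If he isn't happy with the output, tell him it could take you weeks to do it with Word.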
