
Extracting text content from a Word document with PHP

I created a new Word document in Microsoft Word for Mac 2011. Edit: I also created the same document in Microsoft Word on Windows 7 and tested with that.

The content of the document is:

The quick brown fox jumps over the lazy dog

I saved it to disk as a Word 97-2004 document (.doc).

I am using phpoffice/phpword and this code to extract the text:

<?php

$source = "word.doc";

$phpWord = \PhpOffice\PhpWord\IOFactory::load($source, 'MsDoc');

$text = '';

$sections = $phpWord->getSections();

foreach ($sections as $s) {
    $els = $s->getElements();
    foreach ($els as $e) {
        if (get_class($e) === 'PhpOffice\PhpWord\Element\Text') {
            $text .= $e->getText();
        } elseif (get_class($e) === 'PhpOffice\PhpWord\Section\TextBreak') {
            $text .= " \n";
        } else {
            throw new Exception('Unknown class type ' . get_class($e));
        }
    }
}

print $text;

The output of this code is only part of the text:

The quick brown fox j

Is there a problem with my code, or is this some kind of compatibility issue?

Edit:

If I add var_dump($els); just before foreach ($els as $e) {, the output is:

array(1) {
  [0]=>
  object(PhpOffice\PhpWord\Element\Text)#1265 (14) {
    ["text":protected]=>
    string(21) "The quick brown fox j"
    ["fontStyle":protected]=>
    object(PhpOffice\PhpWord\Style\Font)#1267 (25) {
      ["aliases":protected]=>
      array(1) {
        ["line-height"]=>
        string(10) "lineHeight"
      }
      ["type":"PhpOffice\PhpWord\Style\Font":private]=>
      string(4) "text"
      ["name":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["hint":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["size":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["color":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["bold":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["italic":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["underline":"PhpOffice\PhpWord\Style\Font":private]=>
      string(4) "none"
      ["superScript":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["subScript":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["strikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["doubleStrikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["smallCaps":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["allCaps":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["fgColor":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["scale":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["spacing":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["kerning":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["paragraph":"PhpOffice\PhpWord\Style\Font":private]=>
      object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
        ["aliases":protected]=>
        array(1) {
          ["line-height"]=>
          string(10) "lineHeight"
        }
        ["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        string(6) "Normal"
        ["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        string(0) ""
        ["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(true)
        ["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        int(0)
        ["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        array(0) {
        }
        ["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["borderTopSize":protected]=>
        NULL
        ["borderTopColor":protected]=>
        NULL
        ["borderLeftSize":protected]=>
        NULL
        ["borderLeftColor":protected]=>
        NULL
        ["borderRightSize":protected]=>
        NULL
        ["borderRightColor":protected]=>
        NULL
        ["borderBottomSize":protected]=>
        NULL
        ["borderBottomColor":protected]=>
        NULL
        ["styleName":protected]=>
        NULL
        ["index":protected]=>
        NULL
        ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
        bool(false)
      }
      ["shading":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["rtl":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["styleName":protected]=>
      NULL
      ["index":protected]=>
      NULL
      ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
      bool(false)
    }
    ["paragraphStyle":protected]=>
    object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
      ["aliases":protected]=>
      array(1) {
        ["line-height"]=>
        string(10) "lineHeight"
      }
      ["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      string(6) "Normal"
      ["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      string(0) ""
      ["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(true)
      ["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(false)
      ["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(false)
      ["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(false)
      ["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      int(0)
      ["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      array(0) {
      }
      ["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["borderTopSize":protected]=>
      NULL
      ["borderTopColor":protected]=>
      NULL
      ["borderLeftSize":protected]=>
      NULL
      ["borderLeftColor":protected]=>
      NULL
      ["borderRightSize":protected]=>
      NULL
      ["borderRightColor":protected]=>
      NULL
      ["borderBottomSize":protected]=>
      NULL
      ["borderBottomColor":protected]=>
      NULL
      ["styleName":protected]=>
      NULL
      ["index":protected]=>
      NULL
      ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
      bool(false)
    }
    ["phpWord":protected]=>
    object(PhpOffice\PhpWord\PhpWord)#1247 (3) {
      ["sections":"PhpOffice\PhpWord\PhpWord":private]=>
      array(1) {
        [0]=>
        object(PhpOffice\PhpWord\Element\Section)#1261 (16) {
          ["container":protected]=>
          string(7) "Section"
          ["style":"PhpOffice\PhpWord\Element\Section":private]=>
          object(PhpOffice\PhpWord\Style\Section)#1262 (28) {
            ["orientation":"PhpOffice\PhpWord\Style\Section":private]=>
            string(8) "portrait"
            ["paper":"PhpOffice\PhpWord\Style\Section":private]=>
            object(PhpOffice\PhpWord\Style\Paper)#1263 (8) {
              ["sizes":"PhpOffice\PhpWord\Style\Paper":private]=>
              array(6) {
                ["A3"]=>
                array(3) {
                  [0]=>
                  int(297)
                  [1]=>
                  int(420)
                  [2]=>
                  string(2) "mm"
                }
                ["A4"]=>
                array(3) {
                  [0]=>
                  int(210)
                  [1]=>
                  int(297)
                  [2]=>
                  string(2) "mm"
                }
                ["A5"]=>
                array(3) {
                  [0]=>
                  int(148)
                  [1]=>
                  int(210)
                  [2]=>
                  string(2) "mm"
                }
                ["Folio"]=>
                array(3) {
                  [0]=>
                  float(8.5)
                  [1]=>
                  int(13)
                  [2]=>
                  string(2) "in"
                }
                ["Legal"]=>
                array(3) {
                  [0]=>
                  float(8.5)
                  [1]=>
                  int(14)
                  [2]=>
                  string(2) "in"
                }
                ["Letter"]=>
                array(3) {
                  [0]=>
                  float(8.5)
                  [1]=>
                  int(11)
                  [2]=>
                  string(2) "in"
                }
              }
              ["size":"PhpOffice\PhpWord\Style\Paper":private]=>
              string(2) "A4"
              ["width":"PhpOffice\PhpWord\Style\Paper":private]=>
              int(11870)
              ["height":"PhpOffice\PhpWord\Style\Paper":private]=>
              int(16787)
              ["styleName":protected]=>
              NULL
              ["index":protected]=>
              NULL
              ["aliases":protected]=>
              array(0) {
              }
              ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
              bool(false)
            }
            ["pageSizeW":"PhpOffice\PhpWord\Style\Section":private]=>
            int(11906)
            ["pageSizeH":"PhpOffice\PhpWord\Style\Section":private]=>
            int(16838)
            ["marginTop":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["marginLeft":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["marginRight":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["marginBottom":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["gutter":"PhpOffice\PhpWord\Style\Section":private]=>
            int(0)
            ["headerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
            int(720)
            ["footerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
            int(720)
            ["pageNumberingStart":"PhpOffice\PhpWord\Style\Section":private]=>
            NULL
            ["colsNum":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1)
            ["colsSpace":"PhpOffice\PhpWord\Style\Section":private]=>
            int(720)
            ["breakType":"PhpOffice\PhpWord\Style\Section":private]=>
            NULL
            ["lineNumbering":"PhpOffice\PhpWord\Style\Section":private]=>
            NULL
            ["borderTopSize":protected]=>
            NULL
            ["borderTopColor":protected]=>
            NULL
            ["borderLeftSize":protected]=>
            NULL
            ["borderLeftColor":protected]=>
            NULL
            ["borderRightSize":protected]=>
            NULL
            ["borderRightColor":protected]=>
            NULL
            ["borderBottomSize":protected]=>
            NULL
            ["borderBottomColor":protected]=>
            NULL
            ["styleName":protected]=>
            NULL
            ["index":protected]=>
            NULL
            ["aliases":protected]=>
            array(0) {
            }
            ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
            bool(false)
          }
          ["headers":"PhpOffice\PhpWord\Element\Section":private]=>
          array(0) {
          }
          ["footers":"PhpOffice\PhpWord\Element\Section":private]=>
          array(0) {
          }
          ["elements":protected]=>
          array(1) {
            [0]=>
            *RECURSION*
          }
          ["phpWord":protected]=>
          *RECURSION*
          ["sectionId":protected]=>
          int(1)
          ["docPart":protected]=>
          string(7) "Section"
          ["docPartId":protected]=>
          int(1)
          ["elementIndex":protected]=>
          int(1)
          ["elementId":protected]=>
          NULL
          ["relationId":protected]=>
          NULL
          ["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
          int(0)
          ["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
          NULL
          ["mediaRelation":protected]=>
          bool(false)
          ["collectionRelation":protected]=>
          bool(false)
        }
      }
      ["collections":"PhpOffice\PhpWord\PhpWord":private]=>
      array(5) {
        ["Bookmarks"]=>
        object(PhpOffice\PhpWord\Collection\Bookmarks)#1248 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Titles"]=>
        object(PhpOffice\PhpWord\Collection\Titles)#1249 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Footnotes"]=>
        object(PhpOffice\PhpWord\Collection\Footnotes)#1250 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Endnotes"]=>
        object(PhpOffice\PhpWord\Collection\Endnotes)#1251 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Charts"]=>
        object(PhpOffice\PhpWord\Collection\Charts)#1252 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
      }
      ["metadata":"PhpOffice\PhpWord\PhpWord":private]=>
      array(3) {
        ["DocInfo"]=>
        object(PhpOffice\PhpWord\Metadata\DocInfo)#1253 (12) {
          ["creator":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["lastModifiedBy":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["created":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          int(1483515248)
          ["modified":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          int(1483515248)
          ["title":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["description":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["subject":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["keywords":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["category":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["company":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["manager":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["customProperties":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          array(0) {
          }
        }
        ["Protection"]=>
        object(PhpOffice\PhpWord\Metadata\Protection)#1254 (1) {
          ["editing":"PhpOffice\PhpWord\Metadata\Protection":private]=>
          NULL
        }
        ["Compatibility"]=>
        object(PhpOffice\PhpWord\Metadata\Compatibility)#1255 (1) {
          ["ooxmlVersion":"PhpOffice\PhpWord\Metadata\Compatibility":private]=>
          int(12)
        }
      }
    }
    ["sectionId":protected]=>
    NULL
    ["docPart":protected]=>
    string(7) "Section"
    ["docPartId":protected]=>
    int(1)
    ["elementIndex":protected]=>
    int(1)
    ["elementId":protected]=>
    string(6) "5d531b"
    ["relationId":protected]=>
    NULL
    ["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
    int(0)
    ["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
    string(7) "Section"
    ["mediaRelation":protected]=>
    bool(false)
    ["collectionRelation":protected]=>
    bool(false)
  }
}

3 Answers

4

Try creating a reader first:

$source = "word.doc";
// create your reader object
$phpWordReader = \PhpOffice\PhpWord\IOFactory::createReader('MsDoc');
// read source
if ($phpWordReader->canRead($source)) {
    $phpWord = $phpWordReader->load($source);
    // ... rest of your code
}

This answer is based on this example and the API documentation.

Answered 2017-01-05T10:18:57.043
4

Instead of checking the class of each element, you can check whether the element has a getText method:

$sections = $phpWord->getSections();

foreach ($sections as $s) {
    $els = $s->getElements();
    /** @var ElementTest $e */
    foreach ($els as $e) {
        $class = get_class($e);
        if (method_exists($class, 'getText')) {
            $text .= $e->getText();
        } else {
            $text .= "\n";
        }
    }
}
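
One limitation of the loop above is that any element without getText() contributes only a newline, so text held inside nested container elements would be skipped. A minimal sketch of a recursive variant follows, assuming containers expose getElements() and leaf elements expose getText(); the function name collectText is hypothetical and not part of PHPWord.

```php
<?php

/**
 * Recursively collect text from an element tree.
 * Assumption: containers expose getElements(), leaf elements expose getText().
 * collectText is a hypothetical helper, not a PHPWord function.
 */
function collectText($element): string
{
    $text = '';
    if (method_exists($element, 'getElements')) {
        // Container: descend into each child element.
        foreach ($element->getElements() as $child) {
            $text .= collectText($child);
        }
    } elseif (method_exists($element, 'getText')) {
        // Leaf: append the text only if it really is a string.
        $value = $element->getText();
        if (is_string($value)) {
            $text .= $value;
        }
    } else {
        // Anything else (for example a break) becomes a newline.
        $text .= "\n";
    }
    return $text;
}
```

With a loaded document it could be used as foreach ($phpWord->getSections() as $s) { $text .= collectText($s); }. This only changes how the element tree is walked; it cannot recover text that the reader itself failed to load.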

Answered 2019-05-29T16:23:21.247
2

You can use catdoc (http://www.wagner.pp.ru/~vitus/software/catdoc/) to extract text from Word documents.

You can install it on Ubuntu with:

sudo apt-get install catdoc

Once catdoc is working on your system, you can call it from PHP using shell_exec():

<?php

$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');

print $text;

?>

Replace (fullpath) with the actual paths to catdoc and to the Word document.
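
If the document path comes from user input or may contain spaces, it is safer to quote each argument before passing it to the shell. A minimal sketch using PHP's escapeshellarg() follows; the helper names buildCommand and docToText are hypothetical, and the paths remain placeholders.

```php
<?php

/**
 * Build a shell command line with every part quoted by escapeshellarg().
 * buildCommand is a hypothetical helper name.
 */
function buildCommand(string $binary, string ...$args): string
{
    return implode(' ', array_map('escapeshellarg', array_merge([$binary], $args)));
}

/**
 * Run catdoc on a .doc file and return its output, or null on failure.
 * docToText is a hypothetical helper name.
 */
function docToText(string $catdoc, string $doc): ?string
{
    $output = shell_exec(buildCommand($catdoc, $doc));
    return is_string($output) ? $output : null;
}
```

It would be called as docToText('/(fullpath)/catdoc', '/(fullpath)/word.doc'), with (fullpath) replaced as described above.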

Edit (added):

If you can save the file as .docx instead of .doc, it gets a little easier: you can use unzip instead of catdoc.

Simply replace:

$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');

with:

$text = shell_exec("/(fullpath)/unzip -p /(fullpath)/word.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'");

The same technique works with most other command-line document-to-text converters: just replace the command inside shell_exec() with one that works on your system. For other Unix/Linux alternatives, see "How to extract just plain text from .doc & .docx files? (unix)".

For other PHP alternatives, see "How to extract text from word file .doc,docx,.xlsx,.pptx php".
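
For .docx files specifically, the unzip-and-strip-tags idea can also be done without leaving PHP. The following is a minimal sketch, assuming the zip extension (ZipArchive) is installed; the function name docxToText is hypothetical, and the tag stripping is as crude as the sed expression above.

```php
<?php

/**
 * Extract plain text from a .docx file without calling external tools.
 * Assumption: the zip extension (ZipArchive) is available.
 * docxToText is a hypothetical helper name.
 */
function docxToText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // missing file or not a readable zip archive
    }
    // The main document body of a .docx lives in word/document.xml.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null;
    }
    // End each paragraph with a newline before dropping the tags.
    $xml = str_replace('</w:p>', "</w:p>\n", $xml);
    // Strip the markup, then decode XML entities such as &amp;.
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

It would be called as docxToText('/(fullpath)/word.docx') and returns null when the file cannot be opened. Unlike the sed version, it keeps paragraph breaks and non-ASCII characters.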

Answered 2017-01-05T15:50:45.763