html - Converting HTML to XML

Question

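I have hundreds of HTML files that need to be converted to XML. We use these HTML files to serve content for our applications, but we now have to serve that content as XML.

The HTML files contain tables, divs, images, and p, b, or strong tags, among others.

I googled and found some applications, but I have not managed to get it working yet.

Could you suggest a way to convert the contents of these files to XML?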

score 19 · Accepted Answer

I was successful using the tidy command-line utility. On Linux I installed it quickly with apt-get install tidy. Then the command:

tidy -q -asxml --numeric-entities yes source.html >file.xml

gave me an XML file that I could process with an XSLT processor. However, I had to set up the XHTML 1 DTDs correctly.

This is their homepage: html-tidy.org (and the legacy one: HTML Tidy)
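Before handing tidy's output to an XSLT processor, it can help to confirm that each generated file really is well-formed XML. The following is a minimal sketch in PHP, assuming the DOM extension is available; the helper name isWellFormedXml is hypothetical and not part of tidy or of this answer. Because the command above uses --numeric-entities yes, the output should not contain named entities that need a DTD to resolve.

```php
<?php
// Hypothetical helper (not part of tidy): returns true when $xml parses as well-formed XML.
function isWellFormedXml(string $xml): bool
{
    if ($xml === '') {
        return false; // loadXML() rejects an empty string, so handle it up front
    }
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $ok = $dom->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

var_dump(isWellFormedXml('<html><body><p>ok</p></body></html>')); // bool(true)
var_dump(isWellFormedXml('<p>test<font>two</p>'));                // bool(false)
```

For a real file, something like isWellFormedXml(file_get_contents('file.xml')) would apply the same check to tidy's output.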

score 3 · Accepted Answer

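I found a way to convert (even bad) HTML into well-formed XML. I started out by basing this on the DOM loadHTML function. Over time, however, several issues came up, so I optimized it and added patches to correct the side effects.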

  // note: tryToXml(), htmlToXml() and reworkDom() are methods of the same class
  function tryToXml($dom,$content) {
    if(!$content) return false;

    // well-formed XML content can be loaded directly as an XML node tree:
    // appendXML() adds an XML string straight into the fragment
    $fragment = $dom->createDocumentFragment();

    // appendXML() fails on an XML declaration, so strip it manually when present
    if( substr( $content,0, 5) == '<?xml' ) {
      $content = substr($content,strpos($content,'>')+1);
      if( strpos($content,'<') ) {
        $content = substr($content,strpos($content,'<'));
      }
    }

    // if appendXML() does not work, fall back to htmlToXml() below to repair messy HTML
    if(!@$fragment->appendXML( $content )) {
      return $this->htmlToXml($dom,$content);
    }

    return $fragment;
  }



  // convert content into XML
  // $dom is only needed to create the nodes that will be returned
  function htmlToXml($dom, $content, $needEncoding=false, $bodyOnly=true) {

    // no XML when the HTML is empty
    if(!$content) return false;

    // real content, which may need an encoding hint
    if( $needEncoding ) {
      // no need to convert the character encoding: loadHTML() respects the charset in a Content-Type meta tag
      // ($this->encoding is a class property holding the source charset name)
      $content =  '<meta http-equiv="Content-Type" content="text/html;charset='.$this->encoding.'">' . $content;
    }

    // parse the content into a separate DOM
    $domInject = new DOMDocument("1.0", "UTF-8");
    $domInject->preserveWhiteSpace = false;
    $domInject->formatOutput = true;

    // parse as HTML
    try {
      @$domInject->loadHTML( $content );
    } catch(Exception $e){
      // do nothing and continue; warnings are normal on messy HTML content
    }
    // to check the encoding: echo $domInject->encoding
    $this->reworkDom( $domInject );

    if( $bodyOnly ) {
      $fragment = $dom->createDocumentFragment();

      // retrieve the nodes within /html/body
      foreach( $domInject->documentElement->childNodes as $elementLevel1 ) {
        if( $elementLevel1->nodeName == 'body' and $elementLevel1->nodeType == XML_ELEMENT_NODE ) {
          foreach( $elementLevel1->childNodes as $elementInject ) {
            $fragment->appendChild( $dom->importNode($elementInject, true) );
          }
        }
      }
    } else {
      $fragment = $dom->importNode($domInject->documentElement, true);
    }

    return $fragment;
  }



  protected function reworkDom( $node, $level = 0 ) {

    // start with the first child node
    $nodeChild = $node->firstChild;

    while ( $nodeChild )  {
      // remember the next sibling first, because the current child may be removed
      $nodeNextChild = $nodeChild->nextSibling;

      switch ( $nodeChild->nodeType ) {
        case XML_ELEMENT_NODE:
          // recurse into child element nodes
          $this->reworkDom( $nodeChild, $level + 1);
          break;
        case XML_TEXT_NODE:
        case XML_CDATA_SECTION_NODE:
          // do nothing with text and CDATA
          break;
        case XML_COMMENT_NODE:
          // replace '-' in comments so that they follow the W3C rule (no '--' inside an XML comment)
          $nodeChild->nodeValue = str_replace("-","_",$nodeChild->nodeValue);
          break;
        case XML_DOCUMENT_TYPE_NODE:  // 10: needs to be removed
        case XML_PI_NODE: // 7: remove processing instructions
          $node->removeChild( $nodeChild );
          break;
        case XML_DOCUMENT_NODE:
        case XML_HTML_DOCUMENT_NODE:
          // should not appear, as a document node is always the root; listed for completeness,
          // and falls through to the exception below
        default:
          throw new Exception("Engine: reworkDom type not declared [".$nodeChild->nodeType. "]");
      }
      $nodeChild = $nodeNextChild;
    }
  }

This also lets you add several pieces of HTML into a single XML document, which is what I needed it for. In general it can be used like this:

    // assumption: $converter is an instance of the class that holds the methods above
    $c = '<p>test<font>two</p>';
    $dom = new DOMDocument('1.0', 'UTF-8');

    $n = $dom->appendChild($dom->createElement('info')); // make a root element

    if( $valueXml = $converter->tryToXml($dom, $c) ) {
      $n->appendChild($valueXml);
    }
    echo '<pre>'. htmlentities($dom->saveXml($n)). '</pre>';

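In this example, '<p>test<font>two</p>' is output as well-formed XML: '<info><p>test<font>two</font></p></info>'. The info root tag is added because it also makes it possible to convert input such as '<p>one</p><p>two</p>', which is not XML on its own since it has no single root element. However, if your HTML is certain to have exactly one root element, the extra <info> root tag can be skipped.

With this you can get really nice XML out of unstructured and even corrupted HTML.

I hope this is reasonably clear and helps other people use it.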
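The core idea of the answer above (parse with loadHTML, copy the body's children under a single root, serialize with saveXML) can be sketched as one standalone function. This is a simplified illustration, not the answer's class: the function name htmlBodyToXml is hypothetical, and it leaves out the encoding hint and the comment and processing-instruction cleanup that reworkDom performs.

```php
<?php
// Hypothetical standalone helper: wraps the repaired body content of $html in one root element.
function htmlBodyToXml(string $html, string $rootName = 'info'): string
{
    $target = new DOMDocument('1.0', 'UTF-8');
    $root = $target->appendChild($target->createElement($rootName));

    if (trim($html) !== '') {
        // Broken markup triggers libxml warnings; collect them internally instead.
        $previous = libxml_use_internal_errors(true);
        $source = new DOMDocument();
        $source->loadHTML($html); // the HTML parser closes unclosed tags such as <font>
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        // Copy every child of <body> into the target document under the root element.
        $body = $source->getElementsByTagName('body')->item(0);
        if ($body !== null) {
            foreach ($body->childNodes as $child) {
                $root->appendChild($target->importNode($child, true));
            }
        }
    }

    // Serialize only the root element, without the XML declaration.
    return $target->saveXML($root);
}

echo htmlBodyToXml('<p>test<font>two</p>'), "\n";
echo htmlBodyToXml('<p>one</p><p>two</p>'), "\n";
```

Keeping the extra root element means input with several top-level elements still serializes as a single well-formed XML element.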

score 1 · Accepted Answer

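Keep in mind that HTML and XML are two distinct concepts in the tree of markup languages. You cannot exactly replace HTML with XML. XML can be viewed as a generalized form of HTML, but even that is imprecise. You mainly use HTML to display data, and XML to carry (or store) data.

This link is helpful: How to read HTML as XML?

More details here - Difference between HTML and XML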
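One concrete way to see that HTML cannot simply be treated as XML is a void element such as br: it is valid HTML without a closing tag, but an XML parser rejects it. The snippet below is a small illustrative sketch using PHP's DOM extension; the variable names are made up for this example.

```php
<?php
$html = '<p>line<br>break</p>'; // valid HTML: <br> has no closing tag

// Collect parser errors internally instead of emitting warnings.
$previous = libxml_use_internal_errors(true);

// Parsing the same string as XML fails, because <br> is never closed.
$asXml = new DOMDocument();
$parsedAsXml = $asXml->loadXML($html);

// Parsing it as HTML succeeds; the serializer then decides how <br> is written.
$asHtml = new DOMDocument();
$asHtml->loadHTML($html);
$p = $asHtml->getElementsByTagName('p')->item(0);
$htmlOut = $asHtml->saveHTML($p); // HTML serialization keeps <br>
$xmlOut  = $asHtml->saveXML($p);  // XML serialization writes <br/>

libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($parsedAsXml);
echo $htmlOut, "\n", $xmlOut, "\n";
```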
