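I am running etree.HTML( data ) as shown below on a wide variety of data contents. For certain data contents, however, lxml.etree.HTML does not parse the input; it goes into an infinite loop and consumes 100% CPU.

Does anyone know exactly what in the data below could be causing this? More importantly, how can I keep this from happening across countless random, broken data inputs?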

EDIT: This turned out to be a bug in libxml2 (the C library underneath lxml), version 2.7.8 and below (at least). After updating to libxml2 2.9.0, the bug is gone.

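EDIT: I know the script below is itself an infinite loop, but that is not the bad behavior I am seeing. With healthy data content it runs fine (as an infinite loop). With unhealthy data content like the one below, the loop stalls, RAM starts filling up, and once it is full, all CPUs go into the WAIT state. See this question for the original debugging.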

#!/usr/bin/python
# -*- coding: utf-8 -*-
#

import sys
from lxml import etree



data = '''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<meta charset="UTF-8">
    <title>The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked -- Grub Street New York</title>

    <link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="http://feedproxy.google.com/nymag/grubstreet" />



    <meta name="Headline" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
    <meta name="keywords" content="april bloomfield, el gordo, frank bruni, gordon ramsay, lawsuits, lists, marcus samuelsson, mario batali, shitlist, spotted pig, sued" />

    <meta name="description" content="Racism, fat-shaming, and vegetarian trickery." />

    <meta name="Byline" content="Sierra Tishgart" />
    <meta name="Type_of_Feature" content="" />
    <meta name="Issue_Date" content="March  8, 2013 12:50 PM" />
    <meta name="related_stories" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
    <meta name="document_type" content="Blog" />
    <meta name="category" content="Lists" />

    <link rel="image_src" href="http://pixel.nymag.com/imgs/daily/grub/2013/03/08/08-gorgon-ramsay.o.jpg/a_146x97.jpg" />
    <link rel="canonical" href="http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html" id="canonical" />

    <script>
        var canonicalUrl = "http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html";
    </script>



    <meta name="content.tags.primary" content=";network - Grub Street,;city - New York City,;tag - lists" />
    <meta name="content.tags" content=";tag - april bloomfield,;tag - el gordo,;tag - frank bruni,;tag - gordon ramsay,;tag - lawsuits,;tag - marcus samuelsson,;tag - mario batali,;tag - shitlist,;tag - spotted pig,;tag - sued" />
    <meta name="content.hierarchy" content="New York City:Grub Street" />
    <meta name="content.type" content="Blog" />
    <meta name="content.subtype" content="Blog Entry" />    


    <meta property="fb:app_id" content="206283005644" />
    <meta property="og:title" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
    <meta property="og:description" content="Racism, fat-shaming, and vegetarian trickery." /> 
    <meta property="og:image" content="http://pixel.nymag.com/imgs/daily/grub/2013/03/08/08-gorgon-ramsay.o.jpg/a_146x97.jpg"/>
    <meta property="og:url" content="http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html" />
    <meta property="og:type" content="article" />
    <meta property="og:site_name" content="Grub Street New York" />





    <meta name="viewport" content="width=1020">

<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/grubstreet-core.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/section/daily/slideshow.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/echo.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/loginRegister.css" media="all" />
<link rel="stylesheet" href="http://cache.nymag.com/css/screen/advertising.css" media="all" />
<link rel="shortcut icon" href="http://images.nymag.com/gfx/grubst/favicon.ico" />

<style type="text/css">
#adsplashtop,#pushdown {padding:5px 5px;}
#pushdown {border-top:1px solid #737373}
</style>











    <!--[if IE 6]>
    <link rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/win-ie6.css" type="text/css" media="screen, projection" />
<![endif]-->

<!--[if IE 7]>
    <link rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/win-ie7.css" type="text/css" media="screen, projection" />
<![endif]-->




    <script type="text/javascript">
        var NYM = {};
        NYM.config = {};
        NYM.config.membership = {
            "service":"nym"
        };
        NYM.config.advertising = {
            "sitename":"nym.grubstreet"
        };

    </script>




<script type="text/javascript">
    var date = 'March 12, 2013 12:42:38';
    var currDate=new Date(date);
    var GRUBST = {};
    if (!NYM) {  
        var NYM = {};
        NYM.config = {};
        NYM.config.membership = {
            "service":"nym"
        };
        NYM.config.advertising = {
             "sitename":"nym.grubstreet"
        };
    }
</script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/modernizr-1.7.min.js"></script>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/jquery-ui-1.8.2.custom.min.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/ad_manager.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/js/2/global.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/skinTakeover.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/grubstreet-controls.js"></scr
'''







n = 0
while True:
    n += 1

    tree = etree.HTML( data )
    m = tree.xpath("//meta[@property]")

    print '-', n 
    for i in m:
        print n 
        #print (i.attrib['property'], i.attrib['content'])

To quickly check the versions, you can use:

import sys
from lxml import etree

print("%-20s: %s" % ('Python',           sys.version_info))
print("%-20s: %s" % ('lxml.etree',       etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used',      etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled',  etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used',     etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

I have:

OS                  : Ubuntu 12.10 (AWS)
Python              : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree          : (3, 1, 0, 0)
libxml used         : (2, 7, 8)
libxml compiled     : (2, 7, 8)
libxslt used        : (1, 1, 26)
libxslt compiled    : (1, 1, 26)
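
Given the edit above (libxml2 2.7.8 and below affected, 2.9.0 fine), a startup check along these lines might help catch an affected install early. This is only a sketch: the function name `libxml_looks_affected` and the `last_known_bad` default are my own choices, and the thread does not establish the status of versions between 2.7.8 and 2.9.0. It is written for Python 3.

```python
import sys


def libxml_looks_affected(version, last_known_bad=(2, 7, 8)):
    """Return True if `version` is at or below the last version reported bad.

    `version` is a tuple such as etree.LIBXML_VERSION. The default cutoff
    comes from this thread only; versions between 2.7.8 and 2.9.0 are
    untested here, so treat a False result for them with caution.
    """
    return tuple(version) <= tuple(last_known_bad)


if __name__ == "__main__":
    try:
        from lxml import etree
    except ImportError:
        etree = None  # lxml not installed; nothing to check

    if etree is not None and libxml_looks_affected(etree.LIBXML_VERSION):
        sys.stderr.write(
            "warning: libxml2 %r may hang on truncated HTML\n"
            % (etree.LIBXML_VERSION,)
        )
```

With the versions listed above, `libxml_looks_affected((2, 7, 8))` is true, so the warning would fire on this machine.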
2 Answers

This has nothing to do with lxml.html. Check:

import lxml.html

tree = lxml.html.fromstring( data )
print tree
# <Element html at 0x1bb5530>
print tree.xpath("//meta[@property]")
# []

Instead, look at this part, where you effectively have an infinite loop:

n = 0
while True:
    n += 1
    m = [] # never mind if you get results or not - looks like you don't though

    for i in m:
        print n
answered 2013-03-12T16:58:00.147

Here is a way to parse partial HTML using lxml. It seems to avoid the hang that appears to occur with libxml (2, 7, 8) and earlier versions.

    parser = LH.HTMLParser()
    parser.feed(data)
    root = parser.close()
    m = root.xpath('//meta[@property]')

import sys
import lxml.html as LH
import lxml.etree as ET

data = '''
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="ie6"> <![endif]-->
<!--[if IE 7]>    <html class="ie7"> <![endif]-->
<!--[if IE 8]>    <html class="ie8"> <![endif]-->
<!--[if gt IE 8]><!--> <html> <!--<![endif]-->
<head profile="http://gmpg.org/xfn/11">
 <meta charset="UTF-8">
 <title>
     Erased US data shows 1 in 4 missiles in Afghan airstrikes now fired by drone: The Bureau of Investigative Journalism  </title>

 <meta name="description" content="Drone data has been wiped from the Air Force website.">

 <meta name="generator" content="Magicalia 2010" />
 <meta name="google-site-verification" content="bGFVI6kAZGjMNNiS6LGvBDWSGydwyWQI3gogCD4xP50" />

 <link href="http://cdn-images.mailchimp.com/embedcode/slim-081711.css" rel="stylesheet" type="text/css">
 <link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/css/screen.css" type="text/css" media="screen, projection" />
 <link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/css/print.css" type="text/css" media="print" />
 <link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/style.css?3" type="text/css" media="screen, projection" />

 <!--[if IE]>
   <link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/css/lib/ie.css" type="text/css" media="screen, projection" />
 <![endif]-->

 <!--[if lt IE 7]>
   <script defer type="text/javascript" src="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/pngfix.js"></script>
 <![endif]-->

 <!--[if gte IE 5.5]>
   <script language="javaScript" src="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/dhtml.js" type="text/javaScript"></script>
 <![endif]-->

 <link rel="alternate" type="application/rss+xml" title="The Bureau of Investigative Journalism RSS Feed" href="http://www.thebureauinvestigates.com/feed/" />
 <link rel="pingback" href="http://www.thebureauinvestigates.com/xmlrpc.php" />

 <link rel="alternate" type="application/rss+xml" title="The Bureau of Investigative Journalism &raquo; Erased US data shows 1 in 4 missiles in Afghan airstrikes now fired by drone Comments Feed" href="http://www.thebureauinvestigates.com/2013/03/12/erased-us-data-shows-1-in-4-missiles-in-afghan-airstrikes-now-fired-by-drone/feed/" />
<link rel='stylesheet' id='mailchimp-css'  href='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/lib/mailchimp.dev.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='donate-css'  href='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/lib/donate.dev.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='tubepress-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/tubepress/src/main/web/css/tubepress.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='NextGEN-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/nextgen-gallery/css/nggallery.css?ver=1.0.0' type='text/css' media='screen' />
<link rel='stylesheet' id='shutter-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/nextgen-gallery/shutter/shutter-reloaded.css?ver=1.3.4' type='text/css' media='screen' />
<link rel='stylesheet' id='stbCSS-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/wp-special-textboxes/css/wp-special-textboxes.css.php?ver=4.3.72' type='text/css' media='all' />
<link rel='stylesheet' id='grid-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/big-brother/css/grid.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='reveal-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/big-brother/css/reveal.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='app-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/big-brother/css/app.css?ver=3.5.1' type='text/css' media='all' />
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-includes/js/jquery/jquery.js?ver=1.8.3'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/plugins/tubepress/src/main/web/js/tubepress.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/jquery.cycle.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/search.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/nav/superfish.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/nav/supersubs.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/home.js?ver=3.5.1'></sc
'''

if __name__ == '__main__':

    print("%-20s: %s" % ('Python',           sys.version_info))
    print("%-20s: %s" % ('lxml.etree',       ET.LXML_VERSION))
    print("%-20s: %s" % ('libxml used',      ET.LIBXML_VERSION))
    print("%-20s: %s" % ('libxml compiled',  ET.LIBXML_COMPILED_VERSION))
    print("%-20s: %s" % ('libxslt used',     ET.LIBXSLT_VERSION))
    print("%-20s: %s" % ('libxslt compiled', ET.LIBXSLT_COMPILED_VERSION))

    n = 0
    while True:
        n += 1
        print '-', n
        parser = LH.HTMLParser()
        parser.feed(data)
        root = parser.close()
        m = root.xpath('//meta[@property]')
        for i in m:
            print(n)

yields

% test.py
Python              : sys.version_info(major=2, minor=7, micro=2, releaselevel='final', serial=0)
lxml.etree          : (2, 3, 0, 0)
libxml used         : (2, 7, 8)
libxml compiled     : (2, 7, 8)
libxslt used        : (1, 1, 26)
libxslt compiled    : (1, 1, 26)
- 1
- 2
- 3
- 4
- 5
...
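
Since the feed-parser approach only seems to avoid the hang, a more defensive option for arbitrary broken input is to do the parsing in a child process and kill it if it takes too long. The following is a rough sketch under my own assumptions, not something tested against the bug itself: `run_with_timeout` and `CHILD_SCRIPT` are hypothetical names, the 5 second limit is arbitrary, and it targets Python 3.7+ (for `capture_output`).

```python
import subprocess
import sys

# Hypothetical child script: reads HTML from stdin and prints the
# "property" attribute of every matching <meta> element, one per line.
CHILD_SCRIPT = r"""
import sys
from lxml import etree
tree = etree.HTML(sys.stdin.read())
for meta in tree.xpath("//meta[@property]"):
    print(meta.get("property"))
"""


def run_with_timeout(script, data, timeout=5.0):
    """Run `script` in a fresh interpreter with `data` on stdin.

    Returns the child's stdout as text, or None if the child timed out
    or exited with a non-zero status. subprocess.run() kills the child
    itself when the timeout expires, so a hung parser does not linger.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            input=data,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout
```

You would call it as `run_with_timeout(CHILD_SCRIPT, data)` and treat a `None` result as "this document could not be parsed safely". Starting an interpreter per document is slow, so this suits a crawler that can afford it; note it bounds time only, not the memory the child may use before it is killed.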
answered 2013-03-12T16:55:39.230