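I am calling etree.HTML( data ) as shown below on lots of different content data. With one specific content data, however, lxml.etree.HTML does not parse it; instead it goes into an infinite loop and consumes 100% CPU.
Does anyone know precisely what in the data below could be causing this? More importantly, how do I make sure this does not happen with countless random, broken data?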
EDIT: This turns out to be a bug in libxml2 version 2.7.8 and below (at least). After updating to libxml2 2.9.0, the bug is gone.
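EDIT: I know the script below is itself an infinite loop, but that is not the bad behaviour I am seeing. With sane content in data, it runs fine (as an infinite loop). With unsound content in data like the one below, the loop stops, RAM starts to fill up, and once it is full, all CPUs go into WAIT state. See this question for the original debugging.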
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
import sys
from lxml import etree
data = '''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<meta charset="UTF-8">
<title>The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked -- Grub Street New York</title>
<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="http://feedproxy.google.com/nymag/grubstreet" />
<meta name="Headline" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
<meta name="keywords" content="april bloomfield, el gordo, frank bruni, gordon ramsay, lawsuits, lists, marcus samuelsson, mario batali, shitlist, spotted pig, sued" />
<meta name="description" content="Racism, fat-shaming, and vegetarian trickery." />
<meta name="Byline" content="Sierra Tishgart" />
<meta name="Type_of_Feature" content="" />
<meta name="Issue_Date" content="March 8, 2013 12:50 PM" />
<meta name="related_stories" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
<meta name="document_type" content="Blog" />
<meta name="category" content="Lists" />
<link rel="image_src" href="http://pixel.nymag.com/imgs/daily/grub/2013/03/08/08-gorgon-ramsay.o.jpg/a_146x97.jpg" />
<link rel="canonical" href="http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html" id="canonical" />
<script>
var canonicalUrl = "http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html";
</script>
<meta name="content.tags.primary" content=";network - Grub Street,;city - New York City,;tag - lists" />
<meta name="content.tags" content=";tag - april bloomfield,;tag - el gordo,;tag - frank bruni,;tag - gordon ramsay,;tag - lawsuits,;tag - marcus samuelsson,;tag - mario batali,;tag - shitlist,;tag - spotted pig,;tag - sued" />
<meta name="content.hierarchy" content="New York City:Grub Street" />
<meta name="content.type" content="Blog" />
<meta name="content.subtype" content="Blog Entry" />
<meta property="fb:app_id" content="206283005644" />
<meta property="og:title" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
<meta property="og:description" content="Racism, fat-shaming, and vegetarian trickery." />
<meta property="og:image" content="http://pixel.nymag.com/imgs/daily/grub/2013/03/08/08-gorgon-ramsay.o.jpg/a_146x97.jpg"/>
<meta property="og:url" content="http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html" />
<meta property="og:type" content="article" />
<meta property="og:site_name" content="Grub Street New York" />
<meta name="viewport" content="width=1020">
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/grubstreet-core.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/section/daily/slideshow.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/echo.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/loginRegister.css" media="all" />
<link rel="stylesheet" href="http://cache.nymag.com/css/screen/advertising.css" media="all" />
<link rel="shortcut icon" href="http://images.nymag.com/gfx/grubst/favicon.ico" />
<style type="text/css">
#adsplashtop,#pushdown {padding:5px 5px;}
#pushdown {border-top:1px solid #737373}
</style>
<!--[if IE 6]>
<link rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/win-ie6.css" type="text/css" media="screen, projection" />
<![endif]-->
<!--[if IE 7]>
<link rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/win-ie7.css" type="text/css" media="screen, projection" />
<![endif]-->
<script type="text/javascript">
var NYM = {};
NYM.config = {};
NYM.config.membership = {
"service":"nym"
};
NYM.config.advertising = {
"sitename":"nym.grubstreet"
};
</script>
<script type="text/javascript">
var date = 'March 12, 2013 12:42:38';
var currDate=new Date(date);
var GRUBST = {};
if (!NYM) {
var NYM = {};
NYM.config = {};
NYM.config.membership = {
"service":"nym"
};
NYM.config.advertising = {
"sitename":"nym.grubstreet"
};
}
</script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/modernizr-1.7.min.js"></script>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/jquery-ui-1.8.2.custom.min.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/ad_manager.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/js/2/global.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/skinTakeover.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/grubstreet-controls.js"></scr
'''
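n = 0
# Deliberately loops forever: each pass re-parses the same data.
while True:
    n += 1
    tree = etree.HTML( data )
    # Select every <meta> element that has a "property" attribute.
    m = tree.xpath("//meta[@property]")
    print("- %d" % n)
    for i in m:
        print(n)
        #print (i.attrib['property'], i.attrib['content'])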
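As for keeping arbitrary broken input from hanging the whole program, one possible guard is to run the parse in a child interpreter with a timeout. The following is only a minimal sketch under stated assumptions: it needs Python 3.5+ for subprocess.run (the script above targets Python 2.7), and the names PARSE_SCRIPT and run_guarded are made up for illustration. It does not fix the underlying bug and does not cap memory use; it only bounds how long a stuck parse can run.

```python
import subprocess
import sys

# Hypothetical child program: reads HTML bytes from stdin, parses them with
# lxml, and prints the "property" attribute of each matching <meta> element.
PARSE_SCRIPT = r'''
import sys
from lxml import etree
data = sys.stdin.buffer.read()
tree = etree.HTML(data)
for meta in tree.xpath("//meta[@property]"):
    print(meta.get("property"))
'''


def run_guarded(script, data, timeout=5.0):
    """Run `script` in a separate Python process, feeding `data` on stdin.

    Returns the child's stdout as text, or None if the child timed out
    or exited with a non-zero status.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            input=data.encode("utf-8"),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising.
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout.decode("utf-8", "replace")
```

With the data string from the script above, the call would look like run_guarded(PARSE_SCRIPT, data, timeout=5.0); a None result means that input should be skipped or logged rather than parsed in-process.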
To quickly print the version information, you can use:
import sys
from lxml import etree
print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
I have:
OS : Ubuntu 12.10 (AWS)
Python : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (3, 1, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
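
The output above shows libxml2 2.7.8, which matches the affected version named in the first EDIT, and 2.9.0 was the version where the problem went away. A startup check along these lines could flag an affected install. This is only a sketch: the helper name libxml_at_least is made up, and whether releases between 2.7.8 and 2.9.0 are affected is not known from this post, so the 2.9.0 threshold is simply the one version reported as good.

```python
import warnings


def libxml_at_least(version, minimum=(2, 9, 0)):
    """Return True if a libxml2 version tuple is at or above `minimum`."""
    return tuple(version) >= tuple(minimum)


try:
    from lxml import etree
except ImportError:
    etree = None  # lxml is not installed; nothing to check

# etree.LIBXML_VERSION is the runtime libxml2 version, e.g. (2, 7, 8).
if etree is not None and not libxml_at_least(etree.LIBXML_VERSION):
    warnings.warn(
        "libxml2 %r is older than 2.9.0; malformed HTML may hang the parser"
        % (etree.LIBXML_VERSION,)
    )
```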