I have some strings that I'm pulling from an IMDb awards page:
<table><tr><td><big>Academy Awards, USA</big> </td> </tr> <tr> <th>Year</th><th>Result</th><th>Award</th><th>Category/Recipient(s)</th> </tr> <tr> <td rowspan="11" align="center" valign="middle"> 1978 </td> <td rowspan="7" align="center" valign="middle"><b>Won</b></td> <td rowspan="6" align="center" valign="middle">Oscar</td> <td valign="top"> Best Art Direction-Set Decoration John Barry Norman Reynolds Leslie Dilley Roger Christian <small> </small> </td> </tr> <tr> <td valign="top"> Best Costume Design John Mollo <small> </small> </td> </tr> <tr> <td valign="top"> Best Effects, Visual Effects John Stears John Dykstra Richard Edlund Grant McCune Robert Blalack <small> </small> </td> </tr> <tr> <td valign="top"> Best Film Editing Paul Hirsch Marcia Lucas Richard Chew <small> </small> </td> </tr> <tr> <td valign="top"> Best Music, Original Score John Williams <small> </small> </td> </tr> <tr> <td valign="top"> Best Sound Don MacDougall Ray West Bob Minkler Derek Ball <small> Derek Ball was not present at the awards ceremony. </small> </td> </tr> <tr> <td rowspan="1" align="center" valign="middle">Special Achievement Award</td> <td valign="top"> Ben Burtt (as Benjamin Burtt Jr.) <small> For sound effects. (For the creation of the alien, creature and robot voices.) </small> </td> </tr> <tr> <td rowspan="4" align="center" valign="middle"><b>Nominated</b></td> <td rowspan="4" align="center" valign="middle">Oscar</td> <td valign="top"> Best Actor in a Supporting Role Alec Guinness <small> </small> </td> </tr> <tr> <td valign="top"> Best Director George Lucas <small> </small> </td> </tr> <tr> <td valign="top"> Best Picture Gary Kurtz <small> </small> </td> </tr> <tr> <td valign="top"> Best Writing, Screenplay Written Directly for the Screen George Lucas <small> </small> </td> </tr> <tr> </tr></table>
I'd like to pull the headers (Year, Result, Award, and Category/Recipient) into a list, and then pull each column into its own list. For example, using the Academy Awards table (site for reference: http://www.imdb.com/title/tt0076759/awards):
Columns = {"Year", "Result", "Award", "Category/Recipient"}
Years = {"1978", "1978", "1978", "1978", "1978", "1978", "1978"}
Results = {"Oscar", "Oscar", "Oscar", "Oscar", "Oscar", "Oscar", "Special Achievement Award"}
Categories/Recipients = {"Best Art Direction-Set Decoration (John Barry, Norman Reynolds, Leslie Dilley, Roger Christian)", "Best Costume Design (John Mollo)", "Best Effects, Visual Effects (John Stears, John Dykstra, Richard Edlund, Grant McCune, Robert Blalack)", "Best Film Editing (Paul Hirsch, Marcia Lucas, Richard Chew)", "Best Music, Original Score (John Williams)", "Best Sound (Don MacDougall, Ray West, Bob Minkler, Derek Ball)", "(Ben Burtt (as Benjamin Burtt Jr.))"}
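As you can see, I've stripped the extra whitespace from the table and wrapped all of the names in parentheses. There are tags around all of the names, but I removed them (I can leave them in if that makes it easier to put the names in parentheses). Also, apart from the Columns list, every list has the same number of items.
Here is my current script, so you can see how I'm manipulating the data so far: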
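To make the parenthesised format concrete, here is a rough sketch of the per-cell transformation I have in mind, assuming the tags around the names are kept as `<a>` elements. The sample cell markup, the `/name/nm0000000/` hrefs, and the helper name `format_category` are made up for illustration, and I use the standard library's `xml.etree.ElementTree` here rather than lxml.

```python
import xml.etree.ElementTree as ET


def format_category(cell_xml):
    """Turn one category cell into 'Category (Name, Name, ...)'.

    Assumes the cell is well-formed XML and each name sits in its own <a>.
    """
    cell = ET.fromstring(cell_xml)
    # Text before the first child tag is the category title (may be empty).
    category = " ".join((cell.text or "").split())
    names = []
    for link in cell.findall("a"):
        # Keep any text trailing the link, e.g. "(as Benjamin Burtt Jr.)".
        label = "".join(link.itertext()) + " " + (link.tail or "")
        label = " ".join(label.split())
        if label:
            names.append(label)
    if not names:
        return category
    joined = "(" + ", ".join(names) + ")"
    return category + " " + joined if category else joined


# Hypothetical cells; the hrefs are placeholders, not real IMDb ids.
editing = ('<td> Best Film Editing <a href="/name/nm0000000/">Paul Hirsch</a> '
           '<a href="/name/nm0000000/">Marcia Lucas</a> '
           '<a href="/name/nm0000000/">Richard Chew</a> <small> </small> </td>')
special = ('<td> <a href="/name/nm0000000/">Ben Burtt</a> '
           '(as Benjamin Burtt Jr.) <small> For sound effects. </small> </td>')
```

Calling `format_category(editing)` should give the "Best Film Editing (...)" string from my list above, and `format_category(special)` the parenthesised Ben Burtt entry. The `<small>` note is ignored because only `<a>` children are read.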
import shutil
import urllib2
import re
from lxml import etree
award_usock = urllib2.urlopen('http://www.imdb.com/title/tt0076759' + '/awards')
award_html = award_usock.read()
award_usock.close()
if "<big>" in award_html:
    for a_show in re.finditer('<big>',award_html):
        award_show_full_end = award_html.find('<td colspan="4"> </td>',a_show.end())
        award_show_full = award_html[a_show.start():award_show_full_end]
        award_show_full = award_show_full.replace('\n','')
        # award_show_full = award_show_full.replace(' ','')
        award_show_full = award_show_full.replace('</a>','')
        award_show_full = award_show_full.replace('<br />','')
        award_show_full = re.sub('<a href="/name/[^>]*>', '', award_show_full)
        award_show_full = re.sub('<a href="/title/[^>]*>', '', award_show_full)
        for a_s_title in re.finditer('<a href="',award_show_full):
            award_title_loc = award_show_full.find('<a href="')
            award_title_end = award_show_full.find('">',award_title_loc+10)
            award_title_del = award_show_full[award_title_loc:award_title_end+2]
            award_show_full = award_show_full.replace(award_title_del,'')
        award_show_full = '<table><tr><td>' + award_show_full.replace('<br>','') + '</tr></table>'
        award_show_loc = award_html.find('>',a_show.end())
        award_show_end = award_html.find('</a></big>',a_show.end())
        award_show = award_html[award_show_loc+1:award_show_end]
        award_show_table = etree.XML(award_show_full)
        award_show_rows = iter(award_show_table)
        award_show_headers = [award_show_col.text for award_show_col in next(award_show_rows)]
        for award_show_row in award_show_rows:
            award_show_values = [award_show_col.text for award_show_col in award_show_row]
            print dict(zip(award_show_headers,award_show_values))
However, this produces the following output:
{None: 'Year'}
{None: ' 1978 '}
{None: ' Best Costume Design John Mollo '}
{None: ' Best Effects, Visual Effects John Stears John Dykstra Richard Edlund Grant McCune Robert Blalack '}
{None: ' Best Film Editing Paul Hirsch Marcia Lucas Richard Chew '}
{None: ' Best Music, Original Score John Williams '}
{None: ' Best Sound Don MacDougall Ray West Bob Minkler Derek Ball '}
{None: 'Special Achievement Award'}
{None: None}
{None: ' Best Director George Lucas '}
{None: ' Best Picture Gary Kurtz '}
{None: ' Best Writing, Screenplay Written Directly for the Screen George Lucas '}
{}
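As far as I can tell, two things are going on in that output. `.text` only returns the text directly inside an element before its first child tag, so a cell like `<td><big>...</big></td>` or `<td><b>Won</b></td>` gives `None`. And because of the `rowspan` attributes, later rows contain fewer `<td>` cells than there are columns. Below is a rough sketch of a direction that might handle both. It is only an illustration under assumptions: it uses the standard library's `xml.etree.ElementTree` instead of lxml, runs on a cut-down copy of the table, and `cell_text` and `expand_table` are helper names I made up.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the Academy Awards table (same tag layout, fewer rows).
SAMPLE = (
    '<table><tr><td><big>Academy Awards, USA</big> </td> </tr>'
    '<tr> <th>Year</th><th>Result</th><th>Award</th>'
    '<th>Category/Recipient(s)</th> </tr>'
    '<tr> <td rowspan="4"> 1978 </td> <td rowspan="3"><b>Won</b></td>'
    ' <td rowspan="2">Oscar</td>'
    ' <td> Best Costume Design John Mollo <small> </small> </td> </tr>'
    '<tr> <td> Best Music, Original Score John Williams <small> </small> </td> </tr>'
    '<tr> <td rowspan="1">Special Achievement Award</td>'
    ' <td> Ben Burtt (as Benjamin Burtt Jr.) <small> For sound effects. </small> </td> </tr>'
    '<tr> <td rowspan="1"><b>Nominated</b></td> <td rowspan="1">Oscar</td>'
    ' <td> Best Director George Lucas <small> </small> </td> </tr>'
    '<tr> </tr></table>'
)


def cell_text(cell):
    """Collect a cell's text, including nested tags, but skip <small> notes."""
    parts = [cell.text or ""]
    for child in cell:
        if child.tag != "small":
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())


def expand_table(table_xml):
    """Return (headers, columns) with each rowspan cell repeated per row."""
    table = ET.fromstring(table_xml)
    headers, records = None, []
    pending = {}  # column index -> [text, number of later rows it still covers]
    for row in table.findall("tr"):
        ths = row.findall("th")
        if ths:
            headers = [cell_text(th) for th in ths]
            continue
        tds = row.findall("td")
        if headers is None or not tds:
            continue  # title row above the header, or an empty trailing <tr>
        cells = iter(tds)
        values = []
        for col in range(len(headers)):
            if col in pending:
                # A cell from an earlier row still spans this column.
                values.append(pending[col][0])
                pending[col][1] -= 1
                if pending[col][1] == 0:
                    del pending[col]
                continue
            cell = next(cells, None)
            text = cell_text(cell) if cell is not None else ""
            span = int(cell.get("rowspan", 1)) if cell is not None else 1
            if span > 1:
                pending[col] = [text, span - 1]
            values.append(text)
        records.append(values)
    columns = dict((h, [r[i] for r in records])
                   for i, h in enumerate(headers or []))
    return headers, columns


headers, columns = expand_table(SAMPLE)
```

With this, `columns["Year"]`, `columns["Result"]`, `columns["Award"]` and `columns["Category/Recipient(s)"]` all come out the same length, one entry per award row. It does not put the names in parentheses, since the tags around the names are already gone in this table.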