
I have some strings that I'm pulling from an IMDb awards page:

<table><tr><td><big>Academy Awards, USA</big>        </td>      </tr>      <tr>        <th>Year</th><th>Result</th><th>Award</th><th>Category/Recipient(s)</th>      </tr>            <tr>        <td rowspan="11" align="center" valign="middle">          1978         </td>                          <td rowspan="7" align="center" valign="middle"><b>Won</b></td>                                                      <td rowspan="6" align="center" valign="middle">Oscar</td>                                      <td valign="top">          Best Art Direction-Set Decoration                                John Barry                                              Norman Reynolds                                              Leslie Dilley                                              Roger Christian                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Costume Design                                John Mollo                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Effects, Visual Effects                                John Stears                                              John Dykstra                                              Richard Edlund                                              Grant McCune                                              Robert Blalack                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Film Editing                                
Paul Hirsch                                              Marcia Lucas                                              Richard Chew                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Music, Original Score                                John Williams                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Sound                                Don MacDougall                                              Ray West                                              Bob Minkler                                              Derek Ball                                                      <small>                                                                      Derek Ball was not present at the awards ceremony.          </small>        </td>      </tr>                                                      <tr>                                            <td rowspan="1" align="center" valign="middle">Special Achievement Award</td>                                      <td valign="top">                                          Ben Burtt             (as Benjamin Burtt Jr.)                                          <small>                                                                      For sound effects. (For the creation of the alien, creature and robot voices.)          
</small>        </td>      </tr>                                                            <tr>                  <td rowspan="4" align="center" valign="middle"><b>Nominated</b></td>                                                      <td rowspan="4" align="center" valign="middle">Oscar</td>                                      <td valign="top">          Best Actor in a Supporting Role                                Alec Guinness                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Director                                George Lucas                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Picture                                Gary Kurtz                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Writing, Screenplay Written Directly for the Screen                                George Lucas                                                      <small>                                                                                </small>        </td>      </tr>                                                  <tr>        </tr></table>

I want to pull the headers (Year, Result, Award, and Category/Recipient) into a list, and then pull each column into its own list. For example, using the Academy Awards table (website for reference: http://www.imdb.com/title/tt0076759/awards):

Columns = {"Year", "Result", "Award", "Category/Recipient"}
Years = {"1978", "1978", "1978", "1978", "1978", "1978", "1978"}
Results = {"Won", "Won", "Won", "Won", "Won", "Won", "Won"}
Awards = {"Oscar", "Oscar", "Oscar", "Oscar", "Oscar", "Oscar", "Special Achievement Award"}
Categories/Recipients = {"Best Art Direction-Set Decoration (John Barry, Norman Reynolds, Leslie Dilley, Roger Christian)", "Best Costume Design (John Mollo)", "Best Effects, Visual Effects (John Stears, John Dykstra, Richard Edlund, Grant McCune, Robert Blalack)", "Best Film Editing (Paul Hirsch, Marcia Lucas, Richard Chew)", "Best Music, Original Score (John Williams)", "Best Sound (Don MacDougall, Ray West, Bob Minkler, Derek Ball)", "(Ben Burtt (as Benjamin Burtt Jr.))"}

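As you can see, I removed the unnecessary spaces from the table and put all the names in parentheses. There are tags around all the names, but I removed them (they can stay if that makes it easier to put the names in parentheses). Also, apart from the Columns list, every list has the same number of items.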

Here is my current script, so you can see how I'm already handling it:

import shutil
import urllib2
import re
from lxml import etree

award_usock = urllib2.urlopen('http://www.imdb.com/title/tt0076759' + '/awards')
award_html = award_usock.read()
award_usock.close()
if "<big>" in award_html:
    for a_show in re.finditer('<big>',award_html):
        award_show_full_end = award_html.find('<td colspan="4">&nbsp;</td>',a_show.end())
        award_show_full = award_html[a_show.start():award_show_full_end]
        award_show_full = award_show_full.replace('\n','')
        # award_show_full = award_show_full.replace('  ','')
        award_show_full = award_show_full.replace('</a>','')
        award_show_full = award_show_full.replace('<br />','')
        award_show_full = re.sub('<a href="/name/[^>]*>',  '', award_show_full)
        award_show_full = re.sub('<a href="/title/[^>]*>',  '', award_show_full)
        for a_s_title in re.finditer('<a href="',award_show_full):
            award_title_loc = award_show_full.find('<a href="')
            award_title_end = award_show_full.find('">',award_title_loc+10)
            award_title_del = award_show_full[award_title_loc:award_title_end+2]
            award_show_full = award_show_full.replace(award_title_del,'')
        award_show_full = '<table><tr><td>' + award_show_full.replace('<br>','') + '</tr></table>'
        award_show_loc = award_html.find('>',a_show.end())
        award_show_end = award_html.find('</a></big>',a_show.end())
        award_show = award_html[award_show_loc+1:award_show_end]
        award_show_table = etree.XML(award_show_full)
        award_show_rows = iter(award_show_table)
        award_show_headers = [award_show_col.text for award_show_col in next(award_show_rows)]
        for award_show_row in award_show_rows:
            award_show_values = [award_show_col.text for award_show_col in award_show_row]
            print dict(zip(award_show_headers,award_show_values))

However, this produces the following result:

{None: 'Year'}
{None: '          1978         '}
{None: '          Best Costume Design                                John Mollo                                                      '}
{None: '          Best Effects, Visual Effects                                John Stears                                              John Dykstra                                              Richard Edlund                                              Grant McCune                                              Robert Blalack                                                      '}
{None: '          Best Film Editing                                Paul Hirsch                                              Marcia Lucas                                              Richard Chew                                                      '}
{None: '          Best Music, Original Score                                John Williams                                                      '}
{None: '          Best Sound                                Don MacDougall                                              Ray West                                              Bob Minkler                                              Derek Ball                                                      '}
{None: 'Special Achievement Award'}
{None: None}
{None: '          Best Director                                George Lucas                                                      '}
{None: '          Best Picture                                Gary Kurtz                                                      '}
{None: '          Best Writing, Screenplay Written Directly for the Screen                                George Lucas                                                      '}
{}

2 Answers


It's not a good idea to parse HTML using regular expressions. Use an HTML parser instead, such as Beautiful Soup.
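
To make the parser-based approach concrete, here is a minimal sketch that uses only the standard library's html.parser as a stand-in for Beautiful Soup, so it runs without extra packages. It is written for Python 3, whereas the script in the question is Python 2. SAMPLE is a hypothetical, shortened table in the same shape as the one in the question, and TableParser, expand_rowspans, tidy and table_columns are illustrative names, not part of any library. The sketch assumes the first row holds the headers and that names inside a cell are separated by runs of two or more spaces, as in the stripped HTML shown in the question.

```python
import re
from html.parser import HTMLParser

# Hypothetical, shortened sample shaped like the table in the question.
SAMPLE = """
<table>
<tr><th>Year</th><th>Result</th><th>Award</th><th>Category/Recipient(s)</th></tr>
<tr><td rowspan="3">  1978  </td><td rowspan="2"><b>Won</b></td>
    <td rowspan="2">Oscar</td>
    <td>  Best Costume Design        John Mollo        <small>   </small></td></tr>
<tr><td>  Best Film Editing        Paul Hirsch        Marcia Lucas        Richard Chew    <small>  </small></td></tr>
<tr><td rowspan="1"><b>Nominated</b></td><td rowspan="1">Oscar</td>
    <td>  Best Director        George Lucas      <small>  </small></td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect every <tr> as a list of (text, rowspan) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # [text parts, rowspan] of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = [[], int(dict(attrs).get("rowspan") or 1)]

    def handle_data(self, data):
        # Text of nested tags such as <b> or <small> lands here too.
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(("".join(self._cell[0]), self._cell[1]))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def expand_rowspans(rows, width):
    """Repeat rowspan cells downward so every row has `width` columns."""
    pending = {}  # column index -> (text, rows still to fill)
    table = []
    for cells in rows:
        cells = iter(cells)
        row = []
        for col in range(width):
            if col in pending:
                text, left = pending.pop(col)
                if left > 1:
                    pending[col] = (text, left - 1)
            else:
                text, span = next(cells, ("", 1))
                if span > 1:
                    pending[col] = (text, span - 1)
            row.append(text)
        table.append(row)
    return table


def tidy(text):
    """Trim a cell; turn 'Category   Name   Name' into 'Category (Name, Name)'."""
    parts = re.split(r"\s{2,}", text.strip())
    if len(parts) > 1:
        return "%s (%s)" % (parts[0], ", ".join(parts[1:]))
    return parts[0]


def table_columns(html):
    """Return the header list and a dict mapping each header to its column."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    headers = [tidy(text) for text, _ in header]
    table = expand_rowspans(body, len(headers))
    columns = {name: [tidy(row[i]) for row in table]
               for i, name in enumerate(headers)}
    return headers, columns


headers, columns = table_columns(SAMPLE)
print(headers)
for name in headers:
    print(name, columns[name])
```

Expanding the rowspan cells is what gives every column list the same number of items. The real page has a title row before the headers and rows without a category (the Special Achievement Award row), so those cases would need extra handling that this sketch leaves out.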

answered 2013-08-02T01:49:07.313