php - Regex to parse an IMDb page and get the names

Question

I am bad at regex, and I have looked everywhere I could. Please help me parse this page ( http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year,desc&ref_=nv_ch_osc_3 ) to get the names of the movies. P.S. A regex-for-dummies explanation would work too.

score 3 · Accepted Answer

Short answer

This is almost the same problem as in your previous question, and the answer is the same... although the regex has changed.

#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s

https://stackoverflow.com/a/19600974/2573622

Expanded answer

About regex

For more detail, see the following link:

http://www.regular-expressions.info/

Click [Tutorial] in the top menu bar and you will find explanations for nearly everything about regex.

Building the regex

First, you need to get the relevant HTML (for one film) from the page...

<td class="number">RANK.</td>
  <td class="image">
    <a href="/title/tt000000/" title="FILM TITLE (YEAR)"><img src="http://imdb.com/path-to-image.jpg" height="74" width="54" alt="FILM TITLE (YEAR)" title="FILM TITLE (YEAR)"></a>
  </td>
  <td class="title">
    

<span class="wlb_wrapper" data-tconst="tt000000" data-size="small" data-caller-name="search"></span>

    <a href="/title/tt000000/">FILM TITLE</a>

Next, strip out the noise and the parts that vary from film to film...

<td class="number">RANK.</td>.*?<a href="/title/tt\d+/">FILM TITLE</a>

Next, add capture groups...

<td class="number">(RANK).</td>.*?<a href="/title/tt\d+/">(FILM TITLE)</a>

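Finally, replace the placeholders with patterns (\d+ for the rank, .*? for the title) and wrap everything in # delimiters. That's it: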

 #<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s

The s modifier after the closing pattern delimiter makes the regex engine match . against newlines as well.

With code

Same as the previous answer (using the modified regex)
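To see what the s modifier changes, here is a minimal sketch that runs the pattern against a hard-coded fragment shaped like the HTML above. The film title, the title ID, and the variable names ($html, $pattern) are invented for illustration; only the pattern itself comes from the answer.

```php
<?php
// Hypothetical fragment shaped like one row of the list (all values invented).
$html = <<<'HTML'
<td class="number">1.</td>
  <td class="image">
    <a href="/title/tt0000001/" title="Some Film (2012)"><img src="x.jpg" alt="Some Film (2012)"></a>
  </td>
  <td class="title">
    <a href="/title/tt0000001/">Some Film</a>
HTML;

// The pattern from the answer, without any modifier.
$pattern = '#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#';

// Without "s", "." stops at line breaks, so ".*?" cannot reach the title link.
$withoutS = preg_match($pattern, $html, $m1);

// With "s", "." also matches newlines and the pattern can span several lines.
$withS = preg_match($pattern . 's', $html, $m2);

var_dump($withoutS);                       // int(0)
var_dump($withS);                          // int(1)
echo $m2[1], ' => ', $m2[2], "\n";         // 1 => Some Film
```

The image link is skipped because its href is followed by a title attribute rather than the closing > that the pattern requires.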

$page = file_get_contents('http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year,desc&ref_=nv_ch_osc_3');

preg_match_all('#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s', $page, $matches);


$filmList = array_combine($matches[1], $matches[2]);

Then you can do:

echo $filmList[1];

/**
Output:

Argo

*/

echo array_search("The Artist", $filmList);

/**
Output:

2

*/
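Because the live page can change or be unreachable, the same steps can be tried offline. The following is a minimal sketch, assuming two hard-coded rows in the layout shown earlier; the contents of $page and the title IDs are invented for illustration, and the rank/title pairs simply mirror the outputs above.

```php
<?php
// Stand-in for the downloaded page: two rows in the assumed layout (IDs invented).
$page = <<<'HTML'
<td class="number">1.</td>
  <td class="title">
    <a href="/title/tt0000001/">Argo</a>
<td class="number">2.</td>
  <td class="title">
    <a href="/title/tt0000002/">The Artist</a>
HTML;

$count = preg_match_all(
    '#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s',
    $page,
    $matches
);

// $matches[0] holds the full matches, $matches[1] the ranks, $matches[2] the titles.
// array_combine() pairs them up as rank => title; fall back to an empty list on no match.
$filmList = $count ? array_combine($matches[1], $matches[2]) : [];

print_r($filmList);
echo array_search('The Artist', $filmList), "\n"; // 2
```

When fetching the real page, file_get_contents returns false if the download fails, so it is worth checking for that before running the match.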

http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://php.net/file_get_contents
http://php.net/preg_match_all
http://php.net/array_combine
http://php.net/array_search

score 0 · Accepted Answer

If you aren't sure which backslashes are needed and which aren't:

href=\"\/title\/tt.*height=\"74\" width=\"54\" alt=\"([^"]*)\"

The useful result is in \1 or $1.
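In PHP, \1 and $1 are replacement references for preg_replace; with preg_match_all the same capture group is available as $matches[1]. A minimal sketch, assuming the image tag sits on a single line as in the HTML quoted in the other answer (the sample line and the $html variable are invented for illustration):

```php
<?php
// Hypothetical single line in the shape of the image cell (all values invented).
$html = '<a href="/title/tt0000001/" title="Some Film (2012)">'
      . '<img src="x.jpg" height="74" width="54" alt="Some Film (2012)" title="Some Film (2012)"></a>';

// The pattern from this answer, wrapped in "/" delimiters (hence the escaped slashes).
$pattern = '/href=\"\/title\/tt.*height=\"74\" width=\"54\" alt=\"([^"]*)\"/';

preg_match_all($pattern, $html, $matches);

// Group 1 is the alt text of the image.
print_r($matches[1]);
```

Note that this captures the alt attribute, so the year in parentheses comes along with the title.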
