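I started using Goutte to pull the information I need from a site. It is great and saves me a lot of time and effort. However, I sometimes run into anomalies and cannot work out what causes them. The page I am scraping right now is this one: C0305
I crawled the site and followed its links. Every page on the site has its own table of this kind. The table looks fairly simple; here is part of the HTML: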
<table align="center" border="1" cellpadding="4" cellspacing="1" width="927">
<tbody><tr bgcolor="#3399FF">
<th align="center">ENTRY</th>
<th align="center">COMPOUND</th>
<th align="center">HERB</th>
<th align="center">TARGET ID</th>
<th align="center">TARGET NAME</th>
<th align="center">TARGET TYPE</th>
</tr>
</tbody><tbody id="table2">
<tr>
<td align="center"><a href="detail.jsp?compoundid=C0305&pid=T0595">HIT001882</a></td>
<td align="center">linalool</td>
<td align="center">Ardisiae japonicae(ai di cha|矮地茶); Fructus Artemiiae argyi(ai shi|艾实); Folium Artemisiae Argyi(ai ye|艾叶); (bai cong|白葱); Dracoccephalum heterophyllum Benth(bai hua xia ku cao|白花夏枯草); Flos Micheliae albae(bai lan hua|白兰花); bai siu zi; Caulis Perillae frutescentis(bai su geng|白苏梗); Folium Perillae frutescentis(bai su ye|白苏叶); Magnolia denudata(bai yu lan|白玉兰); Fructus Litseae(bi chen jia|毕澄茄); Radix Bupleuri chinensis(chai hu|柴胡); Pericarpium Citri Reticulatae(chen pi|陈皮); Radix albiflorae(chou jie cao gen|臭节草根); Ligusticum brachylobum Franch(chuan fang feng|川防风); Radix chuanxiong; Rhizoma Chuanxiong(chuan xiong|川芎); chun sha hua; Abies nephrolepis(cou leng shan|臭冷杉); Basho(da ba jiao|大芭蕉); Herba Elsholtziae penduliflorae(da hei tou cao|大黑头草); Mosla dianthera (Ham.) Maxim(da ye xiang ru|大叶香薷); (di feng|地枫)</td>
<td align="center"><a href="Protein_ID.jsp?pid=T0595&protein=Katanin p60 ATPase-containing subunit A1">T0595</a></td>
<td align="center">Katanin p60 ATPase-containing subunit A1</td>
<td align="center">Direct Target</td>
</tr>
<tr>
<td align="center"><a href="detail.jsp?compoundid=C0305&pid=T0596">HIT001883</a></td>
<td align="center">linalool</td>
<td align="center">Ardisiae japonicae(ai di cha|矮地茶); Fructus Artemiiae argyi(ai shi|艾实); Folium Artemisiae Argyi(ai ye|艾叶); (bai cong|白葱); Dracoccephalum heterophyllum Benth(bai hua xia ku cao|白花夏枯草); Flos Micheliae albae(bai lan hua|白兰花); bai siu zi; Caulis Perillae frutescentis(bai su geng|白苏梗); Folium Perillae frutescentis(bai su ye|白苏叶); Magnolia denudata(bai yu lan|白玉兰); Fructus Litseae(bi chen jia|毕澄茄); Radix Bupleuri chinensis(chai hu|柴胡); Pericarpium Citri Reticulatae(chen pi|陈皮); Radix albiflorae(chou jie cao gen|臭节草根); Ligusticum brachylobum Franch(chuan fang feng|川防风); Radix chuanxiong; Rhizoma Chuanxiong(chuan xiong|川芎); chun sha hua; Abies nephrolepis(cou leng shan|臭冷杉); Basho(da ba jiao|大芭蕉); Herba Elsholtziae penduliflorae(da hei tou cao|大黑头草); Mosla dianthera (Ham.) Maxim(da ye xiang ru|大叶香薷); (di feng|地枫)</td>
<td align="center"><a href="Protein_ID.jsp?pid=T0596&protein=Adenosine receptor A2a">T0596</a></td>
<td align="center">Adenosine receptor A2a</td>
<td align="center">Direct Target</td>
</tr>
<tr>
<td align="center"><a href="detail.jsp?compoundid=C0305&pid=T0040">HIT001885</a></td>
<td align="center">linalool</td>
<td align="center">Ardisiae japonicae(ai di cha|矮地茶); Fructus Artemiiae argyi(ai shi|艾实); Folium Artemisiae Argyi(ai ye|艾叶); (bai cong|白葱); Dracoccephalum heterophyllum Benth(bai hua xia ku cao|白花夏枯草); Flos Micheliae albae(bai lan hua|白兰花); bai siu zi; Caulis Perillae frutescentis(bai su geng|白苏梗); Folium Perillae frutescentis(bai su ye|白苏叶); Magnolia denudata(bai yu lan|白玉兰); Fructus Litseae(bi chen jia|毕澄茄); Radix Bupleuri chinensis(chai hu|柴胡); Pericarpium Citri Reticulatae(chen pi|陈皮); Radix albiflorae(chou jie cao gen|臭节草根); Ligusticum brachylobum Franch(chuan fang feng|川防风); Radix chuanxiong; Rhizoma Chuanxiong(chuan xiong|川芎); chun sha hua; Abies nephrolepis(cou leng shan|臭冷杉); Basho(da ba jiao|大芭蕉); Herba Elsholtziae penduliflorae(da hei tou cao|大黑头草); Mosla dianthera (Ham.) Maxim(da ye xiang ru|大叶香薷); (di feng|地枫)</td>
<td align="center"><a href="Protein_ID.jsp?pid=T0040&protein=Nitric oxide synthase, inducible">T0040</a></td>
<td align="center">Nitric oxide synthase, inducible</td>
<td align="center">Direct Target</td>
</tr>
<tr>
<td align="center"><a href="detail.jsp?compoundid=C0305&pid=T0054">HIT001884</a></td>
<td align="center">linalool</td>
<td align="center">Ardisiae japonicae(ai di cha|矮地茶); Fructus Artemiiae argyi(ai shi|艾实); Folium Artemisiae Argyi(ai ye|艾叶); (bai cong|白葱); Dracoccephalum heterophyllum Benth(bai hua xia ku cao|白花夏枯草); Flos Micheliae albae(bai lan hua|白兰花); bai siu zi; Caulis Perillae frutescentis(bai su geng|白苏梗); Folium Perillae frutescentis(bai su ye|白苏叶); Magnolia denudata(bai yu lan|白玉兰); Fructus Litseae(bi chen jia|毕澄茄); Radix Bupleuri chinensis(chai hu|柴胡); Pericarpium Citri Reticulatae(chen pi|陈皮); Radix albiflorae(chou jie cao gen|臭节草根); Ligusticum brachylobum Franch(chuan fang feng|川防风); Radix chuanxiong; Rhizoma Chuanxiong(chuan xiong|川芎); chun sha hua; Abies nephrolepis(cou leng shan|臭冷杉); Basho(da ba jiao|大芭蕉); Herba Elsholtziae penduliflorae(da hei tou cao|大黑头草); Mosla dianthera (Ham.) Maxim(da ye xiang ru|大叶香薷); (di feng|地枫)</td>
<td align="center"><a href="Protein_ID.jsp?pid=T0054&protein=Prostaglandin G/H synthase 2">T0054</a></td>
<td align="center">Prostaglandin G/H synthase 2</td>
<td align="center">Indirect Target</td>
</tr>
</tbody>
</table>
So the first thing I did was select the <tbody>, since all the data I need is in there:
$tbody = $compoundPage->filter('tbody#table2');
When I run the following, everything looks fine:
> exit(dump($compoundPage->filter('tbody#table2')->html()));
Here is what I get:
<tr>\n
<td align="center"><a href="detail.jsp?compoundid=C0305&pid=T0595">HIT001882</a></td>\r\n
<td align="center">linalool</td>\r\n
<td align="center">Ardisiae japonicae(ai di cha|矮地茶); Fructus Artemiiae argyi(ai shi|艾实); Folium Artemisiae Argyi(ai ye|艾叶); (bai cong|白葱); Dracoccephalum heterophyllum Benth(bai hua xia ku cao|白花夏枯草); Flos Micheliae albae(bai lan hua|白兰花); bai siu zi; Caulis Perillae frutescentis(bai su geng|白苏梗); Folium Perillae frutescentis(bai su ye|白苏叶); Magnolia denudata(bai yu lan|白玉兰); Fructus Litseae(bi chen jia|毕澄茄); Radix Bupleuri chinensis(chai hu|柴胡); Pericarpium Citri Reticulatae(chen pi|陈皮); Radix albiflorae(chou jie cao gen|臭节草根); Ligusticum brachylobum Franch(chuan fang feng|川防风); Radix chuanxiong; Rhizoma Chuanxiong(chuan xiong|川芎); chun sha hua; Abies nephrolepis(cou leng shan|臭冷杉); Basho(da ba jiao|大芭蕉); Herba Elsholtziae penduliflorae(da hei tou cao|大黑头草); Mosla dianthera (Ham.) Maxim(da ye xiang ru|大叶香薷); (di feng|地枫)</td>\r\n
<td align="center"><a href="Protein_ID.jsp?pid=T0595&protein=Katanin%20p60%20ATPase-containing%20subunit%20A1">T0595</a></td>\r\n
<td align="center">Katanin p60 ATPase-containing subunit A1</td>\r\n
<td align="center">Direct Target</td>\r\n
</tr>\n
<tr>\n
<td align="center"><a href="detail.jsp?compoundid=C0305&pid=T0596">HIT001883</a></td>\r\n
<td align="center">linalool</td>\r\n
<td align="center">Ardisiae japonicae(ai di cha|矮地茶); Fructus Artemiiae Maxim(da ye xiang ru|大叶香薷); (di feng|地枫)</td>\r\n
<td align="center"><a href="Protein_ID.jsp?pid=T0596&protein=Adenosine%20receptor%20A2a">T0596</a></td>\r\n
<td align="center">Adenosine receptor A2a</td>\r\n
<td align="center">Direct Target</td>\r\n
</tr>\n
<tr>\n
<td align="center"><a href="detail.jsp?compoundid=C0305&pid=T0040">HIT001885</a></td>\r\n
<td align="center">linalool</td>\r\n
<td align="center">Ardisiae japonicae(ai di cha|矮地茶); Fructus Artemiiae argyi(ai shi|艾实); Folium Artemisiae Argyi(ai ye|艾叶); (bai cong|白葱); Dracoccephalum heterophyllum Radix Bupleuri chinensis(</td>\r\n
<td align="center"><a href="Protein_ID.jsp?pid=T0040&protein=Nitric%20oxide%20synthase,%20inducible">T0040</a></td>\r\n
<td align="center">Nitric oxide synthase, inducible</td>\r\n
<td align="center">Direct Target</td>\r\n
</tr>\n
<tr>\n
<td align="center"><a href="detail.jsp?compoundid=C0305&pid=T0054">HIT001884</a></td>\r\n
<td align="center">linalool</td>\r\n
<td align="center">Ardisiae japonicae</td>\r\n
<td align="center"><a href="Protein_ID.jsp?pid=T0054&protein=Prostaglandin%20G/H%20synthase%202">T0054</a></td>\r\n
<td align="center">Prostaglandin G/H synthase 2</td>\r\n
<td align="center">Indirect Target</td>\r\n
</tr>
I can see 4 rows.
But now I need to iterate over the rows, so I have to go one step further:
$tbody = $compoundPage->filter('tbody#table2 > tr');
When I run: exit(dump($compoundPage->filter('tbody#table2 > tr')->html()));
it prints only one row:
<td align="center"><a href="detail.jsp?compoundid=C0305&pid=T0595">HIT001882</a></td>\r\n
<td align="center">linalool</td>\r\n
<td align="center">Ardisiae japonicae(ai di cha|矮地茶);</td>\r\n
<td align="center"><a href="Protein_ID.jsp?pid=T0595&protein=Katanin%20p60%20ATPase-containing%20subunit%20A1">T0595</a></td>\r\n
<td align="center">Katanin p60 ATPase-containing subunit A1</td>\r\n
<td align="center">Direct Target</td>\r\n
But when I count the rows like this:
exit(dump($compoundPage->filter('tbody#table2 > tr')->count()));
it says there are 4 rows.
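As far as I can tell from the Symfony DomCrawler documentation, html() returns the markup of the first node in a selection only, while count() reports every matched node. That would be consistent with seeing one row printed and a count of 4. The sketch below illustrates the same distinction with PHP's built-in DOM extension instead of Goutte, so that it runs on its own. The four-row table is a shortened stand-in for the page above, and the variable names are made up for the example.

```php
<?php
// Shortened stand-in for the table above: four rows, most cells omitted.
$html = <<<'HTML'
<table>
<tbody id="table2">
<tr><td>HIT001882</td><td>linalool</td></tr>
<tr><td>HIT001883</td><td>linalool</td></tr>
<tr><td>HIT001885</td><td>linalool</td></tr>
<tr><td>HIT001884</td><td>linalool</td></tr>
</tbody>
</table>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// XPath counterpart of the CSS selector 'tbody#table2 > tr'.
$rows = $xpath->query('//tbody[@id="table2"]/tr');

// The node list holds every matched row...
$rowCount = $rows->length;

// ...but serialising a single item yields the markup of that one row only.
$firstOnly = $doc->saveHTML($rows->item(0));

// To see all rows, each node has to be serialised in turn.
$allRows = [];
foreach ($rows as $row) {
    $allRows[] = $doc->saveHTML($row);
}
```

If that reading is right, the single row in the html() output is not by itself a sign that the selector matched the wrong nodes.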
The problem appears when I iterate over the rows like this:
$data = $rows->each(function($row,$i) use ($client) {
    $tds = $row->filter('td');
    $keys = array('entry','compound','herb','target_id','target_name','target_type');
    foreach($keys as $td_i => $key) {
        $data[$key] = $tds->eq($i)->text();
    }
});
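For comparison, here is a minimal sketch of the per-row extraction this loop is aiming for, written with PHP's built-in DOM extension rather than Goutte so that it is self-contained. The two-row table is a shortened stand-in for the page above, and `$cellIndex` and `$record` are names invented for the sketch. It differs from the loop above in two ways that may be worth checking, though I cannot say whether either explains the behaviour described here: cells are looked up by their position within the row rather than by the row index `$i`, and each row's record is collected into the result.

```php
<?php
// Two-row stand-in for the table above; herb lists shortened for brevity.
$html = <<<'HTML'
<table>
<tbody id="table2">
<tr>
<td><a href="detail.jsp?compoundid=C0305&amp;pid=T0595">HIT001882</a></td>
<td>linalool</td>
<td>Ardisiae japonicae</td>
<td>T0595</td>
<td>Katanin p60 ATPase-containing subunit A1</td>
<td>Direct Target</td>
</tr>
<tr>
<td><a href="detail.jsp?compoundid=C0305&amp;pid=T0054">HIT001884</a></td>
<td>linalool</td>
<td>Ardisiae japonicae</td>
<td>T0054</td>
<td>Prostaglandin G/H synthase 2</td>
<td>Indirect Target</td>
</tr>
</tbody>
</table>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$keys = ['entry', 'compound', 'herb', 'target_id', 'target_name', 'target_type'];
$data = [];

foreach ($xpath->query('//tbody[@id="table2"]/tr') as $row) {
    // './td' is evaluated relative to $row, so only this row's cells match.
    $cells = $xpath->query('./td', $row);

    $record = [];
    foreach ($keys as $cellIndex => $key) {
        // Index by the cell's position in the row, not by the row's position.
        $cell = $cells->item($cellIndex);
        $record[$key] = $cell === null ? null : trim($cell->textContent);
    }
    $data[] = $record;
}
```

Counting `$cells->length` inside the outer loop is a quick way to confirm that each row really contributes 6 cells rather than all 24.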
First, when I count the td elements, it reports only one. And when I actually try to do anything with the td nodes, I get the following error.
If I do this:
exit(dump($row->filter('td')));
inside the loop, I can see that there are 24 nodes:
Crawler {#1121 ▼
#uri: "http://lifecenter.sgst.cn/hit/search.jsp?key1='a'&key2='b'&key3='c'&key4='d'&key5=C0305"
-defaultNamespacePrefix: "default"
-namespaces: []
-baseHref: "http://lifecenter.sgst.cn/hit/search.jsp?key1='a'&key2='b'&key3='c'&key4='d'&key5=C0305"
-document: DOMDocument {#52 ▶}
-nodes: array:24 [▼
0 => DOMElement {#1123 ▶}
1 => DOMElement {#1124 ▶}
2 => DOMElement {#1125 ▶}
3 => DOMElement {#1126 ▶}
4 => DOMElement {#1127 ▶}
5 => DOMElement {#1128 ▶}
6 => DOMElement {#1129 ▶}
7 => DOMElement {#1130 ▶}
8 => DOMElement {#1131 ▶}
9 => DOMElement {#1132 ▶}
... (omitted the rest because I have to do this indentation manually, and it's so time-consuming)
]
-isHtml: true