python - Extracting selected columns from a table using BeautifulSoup

Question

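I'm trying to extract the first and third columns of this data table using BeautifulSoup. Looking at the HTML, the first column is marked up with <th> tags, while the other column I'm interested in uses <td> tags. Either way, all I've managed to get is a list of the column's cells with their tags still attached. I just want the text.

table is already a list, so I can't call findAll(text=True) on it. I don't know how to get the first column as a list in any other form.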

from BeautifulSoup import BeautifulSoup
from sys import argv
import re

filename = argv[1] #get HTML file as a string
html_doc = ''.join(open(filename,'r').readlines())
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody.th.findAll('th') #The relevant table is the first one

print table

score 38 · Accepted Answer

You can try this code:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())

for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column

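As you can see, the code connects to the URL and fetches the HTML. BeautifulSoup finds the first table, then the loop goes through every 'tr' and, for each row, selects the first column (the 'th') and the third column (a 'td').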
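Since the question asks for plain text rather than tagged cells, the same row-by-row walk can be sketched with the newer bs4 package and get_text(). This is only a minimal sketch: SAMPLE_HTML and first_and_third_columns are hypothetical names, the inline table stands in for the real page, and the cell indices simply mirror the ones used in the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table standing in for the real page (an assumption).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Alabama</th><td>1.0</td><td>2.0</td><td>3.0</td></tr>
    <tr><th>Alaska</th><td>4.0</td><td>5.0</td><td>6.0</td></tr>
  </tbody>
</table>
"""


def first_and_third_columns(html):
    """Return a (th text, td[2] text) tuple for each row of the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    if table is None:
        return rows
    for tr in table.find_all("tr"):
        header = tr.find("th")
        cells = tr.find_all("td")
        # Skip rows without a <th> or with fewer than three <td> cells.
        if header is None or len(cells) < 3:
            continue
        # get_text(strip=True) returns the text only, without the tags.
        rows.append((header.get_text(strip=True), cells[2].get_text(strip=True)))
    return rows


print(first_and_third_columns(SAMPLE_HTML))
```

Unlike .contents, which returns a list of child nodes, get_text() returns a single string per cell, so the result is a plain list of string pairs.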

score 3 · Accepted Answer

In addition to @jonhkr's answer, I thought I'd post another solution I came up with.

#!/usr/bin/python

from BeautifulSoup import BeautifulSoup
from sys import argv

filename = argv[1]
#get HTML file as a string
html_doc = ''.join(open(filename,'r').readlines())
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody

data = map(lambda x: (x.findAll(text=True)[1],x.findAll(text=True)[5]),table.findAll('tr'))
print data

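Unlike jonhkr's answer, which fetches the web page directly, mine assumes the page has been saved to your computer and its filename is passed as a command-line argument. For example: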

python file.py table.html
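For readers on Python 3, the saved-file approach might look roughly like the sketch below. It assumes the bs4 package and a UTF-8 file. columns_from_file is a hypothetical helper name, and it selects cells by tag rather than by the position of text nodes, which is a swapped-in technique and not the method used in the answer above.

```python
from bs4 import BeautifulSoup


def columns_from_file(path):
    """Return two lists: the <th> texts and the td[2] texts of the first table."""
    # The with-block closes the file even if parsing raises an error.
    with open(path, encoding="utf-8") as handle:
        soup = BeautifulSoup(handle, "html.parser")

    names, values = [], []
    table = soup.find("table")
    if table is None:
        return names, values

    for tr in table.find_all("tr"):
        header = tr.find("th")
        cells = tr.find_all("td")
        # Rows that lack a <th> or a third <td> are ignored.
        if header is None or len(cells) < 3:
            continue
        names.append(header.get_text(strip=True))
        values.append(cells[2].get_text(strip=True))
    return names, values
```

It could then be called as names, values = columns_from_file("table.html"), or with argv[1] in place of the literal filename to match the command line shown above.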

score 0 · Accepted Answer

You can also try this code:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm")
soup = BeautifulSoup(page.content, 'html.parser')

for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print(first_column, third_column)
