python - Python beginner: read elements from one file and use them to modify another file

Question

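I am an economist with no programming background. I am trying to learn Python because I was told it is very powerful for parsing data from websites. At the moment I am stuck on the following code, and I would be very grateful for any suggestions.

First, I wrote code to parse the data from this table: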

http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146

The code I wrote is the following:

#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os

def extract(soup):
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)



outfile = open("milano.txt", "w")
br = Browser()
br.set_handle_robots(False)
url = "http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
outfile.close()

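The code reads the table, keeps only the information I need, and creates a txt file. It is quite rudimentary, but it does the job.

My problem starts now. The URL posted above is just one of about 200 URLs from which I need to parse data. All the URLs differ in only two elements. Using the previous URL: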

http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146

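the two elements that uniquely identify the page are MILANO (the name of the city) and 15146 (a bureaucratic code).

What I wanted to do was, first, to create a file with two columns:

in the first, the names of the cities I need;
in the second, the bureaucratic codes.

Then I wanted to write a loop in Python that reads each line of this file, changes the URL in my code accordingly, and performs the parsing task separately for each city.

Do you have any suggestions on how to proceed? Thanks in advance for any help and suggestions!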

[Update]

Thanks to everyone for the helpful suggestions. Given my knowledge of Python, I found Thomas K's answer the easiest to implement. However, I still have a problem. I modified the code as follows:

#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os
import csv

def extract(soup):
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)

citylist = csv.reader(open("citycodes.csv", "rU"), dialect = csv.excel)
for city in citylist:
    outfile = open("%s.txt", "w") % city
    br = Browser()
    br.set_handle_robots(False)
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s" % city
    page1 = br.open(url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)
    extract(soup1)
    outfile.close()

where citycodes.csv is in the following format:

MILANO;12345
MODENA;67891

I get the following error:

Traceback (most recent call last):
File "modena2.py", line 25, in <module>
 outfile = open("%s.txt", "w") % city
TypeError: unsupported operand type(s) for %: 'file' and 'list'

Thanks again!

score 1 · Accepted Answer

A small thing you need to fix:

This:

for city in citylist:
    outfile = open("%s.txt", "w") % city
#                                 ^^^^^^

should be this:

for city in citylist:
    outfile = open("%s.txt" % city, "w")
#                           ^^^^^^
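
Two further details may matter once the filename is fixed. The sample citycodes.csv in the question separates fields with a semicolon, whereas csv.excel splits on commas, and each row from csv.reader is a list, while % needs a tuple to fill two placeholders. The following is a minimal sketch in Python 3 syntax (the question's code is Python 2), assuming the semicolon-separated layout shown in the question. The helper names read_city_rows and plan_jobs are made up for illustration, and an in-memory string stands in for the real file, so nothing is downloaded.

```python
import csv
import io

BASE_URL = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s"

def read_city_rows(handle):
    """Yield (city, code) tuples from a semicolon-separated file object."""
    for row in csv.reader(handle, delimiter=";"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        yield row[0].strip(), row[1].strip()

def plan_jobs(handle):
    """Return a list of (output filename, url) pairs, one per city."""
    jobs = []
    for city, code in read_city_rows(handle):
        filename = "%s.txt" % city     # format the name first, then open() it
        url = BASE_URL % (city, code)  # a tuple fills both placeholders
        jobs.append((filename, url))
    return jobs

# Stand-in for open("citycodes.csv", newline=""); values are the question's samples
sample = io.StringIO("MILANO;12345\nMODENA;67891\n")
jobs = plan_jobs(sample)
for filename, url in jobs:
    print(filename, url)
```

Each (filename, url) pair could then be handed to the existing download and extract steps, opening the file just before extract is called and closing it right after.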

score 0 · Accepted Answer

Quick and dirty:

import csv
citylist = csv.reader(open("citylist.csv"))
for city in citylist:
    # csv.reader yields lists; % needs a tuple to fill two placeholders
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s" % tuple(city)
    # open the page and extract the information

This assumes you have a csv file that looks like this:

MILANO,15146
ROMA,12345

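As Ignacio mentioned, there are more powerful tools, such as urllib.urlencode(), but they are probably overkill for this.

P.S. Congratulations: you have already done the hard part, scraping the data out of the HTML. Looping over a list is the easy part.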

score 0 · Accepted Answer

Just scratching the basics...

#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os

outfile = open("milano.txt", "w")

def extract(soup):
    global outfile
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)



br = Browser()
br.set_handle_robots(False)

# fill in your (city, code) pairs here, for example:
ListOfCityCodePairs = [('MILANO', 15146)]

for (city, code) in ListOfCityCodePairs:
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%d" % (city, code)
    page1 = br.open(url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)
    extract(soup1)

outfile.close()
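
The script above sends every city to the single file milano.txt through a global. If one file per city is wanted, as in the question, one option is to pass the file object in as an argument instead. Below is a minimal Python 3 sketch of just the writing step. write_records is a hypothetical helper, and the rows are made-up placeholders, not real data from the site.

```python
import io

def write_records(records, outfile):
    """Write each record as one pipe-separated line to the given file object."""
    for record in records:
        outfile.write("|".join(record) + "\n")

# Placeholder rows standing in for what extract() would pull from one page
fake_rows = [
    ("2009", "a", "b", "c", "d"),
    ("2010", "e", "f", "g", "h"),
]

# io.StringIO stands in for open("%s.txt" % city, "w")
buffer = io.StringIO()
write_records(fake_rows, buffer)
print(buffer.getvalue(), end="")
```

With this shape, the loop over cities can open a fresh file for each city, call the writer, and close the file, without relying on a module-level variable.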

score 0 · Accepted Answer

If the file is in CSV format, you can read it with csv. Then use urllib.urlencode() to generate the query string and urlparse.urlunparse() to generate the full URL.
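
As a rough illustration of that approach: in Python 3 both functions live in urllib.parse (the names in the answer are the Python 2 locations). The sketch below assumes the host, path, and parameter names from the question's URL. build_url is a made-up helper, and the second city and its code are placeholders chosen only to show how a space is escaped.

```python
from urllib.parse import urlencode, urlunparse

def build_url(city, code):
    """Assemble the page URL from its parts, escaping the query values."""
    query = urlencode({"comune": city, "cod_istat": code})
    # urlunparse takes (scheme, netloc, path, params, query, fragment)
    return urlunparse(("http", "www.webifel.it", "/sifl/Tavola07.asp", "", query, ""))

url = build_url("MILANO", "15146")
spaced = build_url("CITY NAME", "00000")  # placeholder city and code
print(url)
print(spaced)
```

The main gain over plain % formatting is that names containing spaces or other special characters are escaped automatically. Whether the site expects exactly this escaping is not something the thread confirms.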

score 0 · Accepted Answer

You don't need to create a separate file. Instead, use a Python dictionary that maps city -> code.

See: http://docs.python.org/tutorial/datastructures.html#dictionaries
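
A small sketch of the dictionary idea, in Python 3 syntax. CITY_CODES is a hypothetical name. The MILANO code comes from the question's URL, and the MODENA value is the sample from the question's citycodes.csv, which may not be the real code.

```python
BASE_URL = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s"

# city -> code mapping kept in the script instead of a separate file
CITY_CODES = {
    "MILANO": "15146",  # from the URL in the question
    "MODENA": "67891",  # sample value from the question, possibly not the real code
}

# Build one URL per city
urls = {city: BASE_URL % (city, code) for city, code in CITY_CODES.items()}

for city in sorted(urls):
    print(city, urls[city])
```

For about 200 cities, keeping the list in a file is likely easier to maintain, but the dictionary keeps everything in one script while experimenting.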
