python - Writing BeautifulSoup output scraped from a blog RSS feed to .txt files in Python

Question

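Apologies in advance for the long block of code below. I'm new to BeautifulSoup, but I found some useful tutorials that use it to scrape blog RSS feeds. Full disclosure: this code is adapted from this video tutorial, which was very helpful in getting me started: http://www.youtube.com/watch?v=Ap_DlSrT-iE.

Here's my problem. The video does a great job of showing how to print the relevant content to the console. I need to write each article's text out to a separate .txt file and save it to a directory (right now I'm just trying to save to my Desktop). I know the problem lies in the scope of the two for loops near the end of the code (I've commented this so people can find it quickly; it starts at the last comment), but I can't seem to figure it out on my own.

What the program currently does is take the text from the last article it read and write it out to as many .txt files as the variable listIterator indicates. So in this case I believe 20 .txt files are written, but they all contain the text of the last article that was looped over. What I want the program to do is loop over each article and write each article's text to its own .txt file. Sorry for the verbosity; any insight would be appreciated.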

from urllib import urlopen
from bs4 import BeautifulSoup
import re

# Read in webpage.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

# On RSS Feed site, find tags for title of articles and 
# tags for article links to be downloaded.

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

# Find the tags listed in variables above in the articles.
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create a list that is the length of the number of links
# from the RSS feed page. Use this to iterate over each article,
# read it in, and find relevant text or <p> tags.
listIterator = []
listIterator[:] = range(len(findPatTitle))

for i in listIterator:
    # Print each title to console to ensure program is working. 
    print findPatTitle[i]

    # Read in the linked-to article.
    articlePage = urlopen(findPatLink[i]).read()

    # Find the beginning and end of articles using tags listed below.
    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    # Define article variable that will contain all the content between the 
    # beginning of the article to the end as indicated by variables above.
    article = articlePage[divBegin:divEnd]

    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(article)

    # Compile list of all <p> tags for each article and store in paragList
    paragList = soup.findAll('p')

    # Create empty string to eventually convert items in paragList to string to 
    # be written to .txt files.
    para_string = ''

    # Here's where I'm lost and have some sort of scope issue with my for-loops.
    for i in paragList:
        para_string = para_string + str(i)
        newlist = range(len(findPatTitle))
        for i in newlist:
            ofile = open(str(listIterator[i])+'.txt', 'w')
            ofile.write(para_string)
            ofile.close()

score 3 · Accepted Answer

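It only looks as though the last article is being written because every article is written to the same 20 files, over and over, each one overwriting the one before. Take a look at this part: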

for i in paragList:
    para_string = para_string + str(i)
    newlist = range(len(findPatTitle))
    for i in newlist:
        ofile = open(str(listIterator[i])+'.txt', 'w')
        ofile.write(para_string)
        ofile.close()

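On every iteration you write para_string to the same 20 files again, so each pass overwrites the previous one and only the last article's text survives. What you need to do instead is append each article's para_string to a separate list, say paraStringList, and then write each item of that list to its own file, like so: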

for i, var in enumerate(paraStringList):  # Enumerate creates a tuple
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
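
To see the difference between the two loop structures without touching the network, here is a minimal Python 3 sketch. The article texts are made-up placeholders, the helper names are hypothetical, and a temporary directory stands in for the Desktop; none of these come from the original program.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the text scraped from each article.
articles = ["first article", "second article", "third article"]

def write_buggy(out_dir):
    # Mirrors the question's nesting: for every article, rewrite ALL files.
    for text in articles:
        for n in range(len(articles)):
            (out_dir / "{0}.txt".format(n)).write_text(text)

def write_fixed(out_dir):
    # Collect the strings first, then write each one to its own file.
    para_string_list = list(articles)
    for n, text in enumerate(para_string_list):
        (out_dir / "{0}.txt".format(n)).write_text(text)

def read_all(out_dir):
    # Read the numbered files back in order.
    return [(out_dir / "{0}.txt".format(n)).read_text()
            for n in range(len(articles))]

with tempfile.TemporaryDirectory() as tmp:
    buggy_dir = Path(tmp) / "buggy"
    fixed_dir = Path(tmp) / "fixed"
    buggy_dir.mkdir()
    fixed_dir.mkdir()
    write_buggy(buggy_dir)
    write_fixed(fixed_dir)
    buggy_result = read_all(buggy_dir)
    fixed_result = read_all(fixed_dir)

print(buggy_result)  # every file holds the last article
print(fixed_result)  # one article per file
```

The nested version opens every file in 'w' mode once per article, so the last article wins everywhere. The enumerate version opens each file exactly once.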

The enumerate loop that writes the files has to sit outside your main loop, for i in listIterator: (...). Here is a working version of the program:

from urllib import urlopen
from bs4 import BeautifulSoup
import re


webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

findPatTitle = re.findall(patFinderTitle, webpage)[0:4]
findPatLink = re.findall(patFinderLink, webpage)[0:4]

listIterator = []
listIterator[:] = range(len(findPatTitle))
paraStringList = []

for i in listIterator:

    print findPatTitle[i]

    articlePage = urlopen(findPatLink[i]).read()

    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    article = articlePage[divBegin:divEnd]

    soup = BeautifulSoup(article)

    paragList = soup.findAll('p')

    para_string = ''

    # Use a distinct loop variable so the outer index i is not shadowed.
    for p in paragList:
        para_string += str(p)

    paraStringList.append(para_string)

for i, var in enumerate(paraStringList):
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
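
The program above is written for Python 2. If you run it under Python 3, the fetching and regex steps need small changes: urlopen lives in urllib.request, read() returns bytes, and print is a function. The sketch below covers only those steps and leaves the BeautifulSoup parsing and file writing as they are. The sample feed, the helper names read_page and find_titles_and_links, and the UTF-8 decoding are assumptions for illustration, and a local file:// URL replaces the real feed so that it runs offline.

```python
import re
import tempfile
from pathlib import Path
from urllib.request import urlopen  # Python 3 location of urlopen

# Same patterns as in the program above.
patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

def read_page(url):
    # In Python 3, read() returns bytes; decode before using str patterns.
    # UTF-8 is an assumption about the page's encoding.
    with urlopen(url) as response:
        return response.read().decode('utf-8')

def find_titles_and_links(feed_text):
    # Return the list of titles and the list of links found in the feed.
    titles = re.findall(patFinderTitle, feed_text)
    links = re.findall(patFinderLink, feed_text)
    return titles, links

# Made-up feed saved to a temporary file, so no network access is needed.
sample_feed = (
    '<title>First post</title>\n'
    '<link rel="alternate" href="http://example.com/first"/>\n'
    '<title>Second post</title>\n'
    '<link rel="alternate" href="http://example.com/second"/>\n'
)

with tempfile.TemporaryDirectory() as tmp:
    feed_path = Path(tmp) / 'feed.xml'
    feed_path.write_text(sample_feed, encoding='utf-8')
    titles, links = find_titles_and_links(read_page(feed_path.as_uri()))

for title, link in zip(titles, links):
    print(title, link)  # print is a function in Python 3
```

With the real feed you would pass its URL to read_page instead of the temporary file's URI.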
