python - Get all links from HTML in Python and display only the links

Question

I am trying to get the title of a web page using the following statement:

titl1 = re.findall(r'<title>(.*?)</title>',the_webpage)

When I use it, I get ['random webpage example1']. How do I remove the quotes and brackets?

I am also using the following to grab a set of links that change every hour (which is why I need the wildcard): links = re.findall(r'(file=(.*?).mp3)',the_webpage).

I get

[('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
  'http://media.kickstatic.com/kickapps/images/3380/audios/944521'), 
 ('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
  'http://media.kickstatic.com/kickapps/images/3380/audios/944521'), 
 ('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
  'http://media.kickstatic.com/kickapps/images/3380/audios/944521')]

How do I get the mp3 links without the file= prefix?

I would also like to download the mp3 files and name them after the website title, so that they show up as

random webpage example1.mp3

How can I do this? I am still learning Python and regular expressions, and this has me a bit stumped.

score 0 · Accepted Answer

For part 1, at least, you can do:

>>> mytitle = titl1[0]
>>> print mytitle
random webpage example1

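re.findall returns a list of the matching strings, so you only need to take the first item of the list.

Similarly, for part 2, re.findall returns a list with tuples inside, because the pattern has two capturing groups. You can do: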

>>> download_links = [href for (discard, href) in links]
>>> print download_links
['http://media.kickstatic.com/kickapps/images/3380/audios/944521', 'http://media.kickstatic.com/kickapps/images/3380/audios/944521', 'http://media.kickstatic.com/kickapps/images/3380/audios/944521']
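The shape of the result depends on how many capturing groups the pattern has. Here is a minimal sketch of the difference, assuming a made-up `sample` string with placeholder URLs in place of the real page. It is written for Python 3, unlike the Python 2 session above.

```python
import re

# Hypothetical stand-in for the downloaded page source.
sample = ('a file=http://example.com/audios/1.mp3 '
          'b file=http://example.com/audios/2.mp3')

# Two capturing groups: findall returns a list of (outer, inner) tuples.
pairs = re.findall(r'(file=(.*?)\.mp3)', sample)

# One capturing group: findall returns a list of plain strings.
stems = re.findall(r'file=(.*?)\.mp3', sample)

# Keep only the second element of each tuple, as in the comprehension above.
from_pairs = [href for (_discard, href) in pairs]

print(pairs)
print(stems)
```

Either way the captured text stops before `.mp3`, so the extension has to be appended again if full links are wanted.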

To download the files, use urllib2 (at least on Python 2.x; I am not sure about Python 3.x). See this question for details.
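On Python 3 the `urllib2` module no longer exists, and `urlopen` lives in `urllib.request`. Here is a minimal sketch of a download step under that assumption. The helper name `download` is hypothetical and not part of any library.

```python
import shutil
import urllib.request


def download(url, filename):
    """Save the resource at `url` to `filename` (hypothetical helper)."""
    # urlopen returns a file-like response; copy it to disk in chunks
    # so a large mp3 is not held in memory all at once.
    with urllib.request.urlopen(url) as response, open(filename, 'wb') as out:
        shutil.copyfileobj(response, out)
```

It would be called once per link, for example `download(link, 'random webpage example1.mp3')`.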

score 0 · Accepted Answer

Code:

#!/usr/bin/env python

import re,urllib,urllib2

Url = "http://www.ihiphopmusic.com/music/rick-ross-sixteen-feat-andre-3000"
print Url
print 'test .............'
req = urllib2.Request(Url)
print "1"
response = urllib2.urlopen(req)
print "2"
the_webpage = response.read()
print "3"
titl1 = re.findall(r'<title>(.*?)</title>',the_webpage)
print "4"
a2 = [x +'.mp3' for x in re.findall(r'file=(.*?)\.mp3',the_webpage)]
print "5"
a2 = [x[0][5:] for x in a2]
print "6"
ti = titl1[0]
print ti
print "7"
print a2
print "8"

print "9"
#print the_page
print "10"

req=urllib2.Request(a2)
print "11"
temp_file=open(ti)
print "12"
buffer=urllib2.urlopen(req).read()
print "13"
temp_file.write(buff)
print "14"
temp_file.close()
print "15"
print "16"

Result

http://www.ihiphopmusic.com/music/rick-ross-sixteen-feat-andre-3000
test .............
1
2
3
4
5
6
Rick Ross - Sixteen (feat. Andre 3000)
7
['', '', '']
8
9
10
Traceback (most recent call last):
  File "grub.py", line 29, in <module>
    req=urllib2.Request(a2)
  File "/usr/lib/python2.7/urllib2.py", line 198, in __init__
    self.__original = unwrap(url)
  File "/usr/lib/python2.7/urllib.py", line 1056, in unwrap
    url = url.strip()
AttributeError: 'list' object has no attribute 'strip'
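This is a reading of the output above, not something stated in the post. After the first comprehension `a2` already holds plain strings, so `x[0][5:]` takes the first character of each URL and slices it from index 5, which leaves `''`. `urllib2.Request` then receives the whole list instead of a single URL string, hence the `AttributeError`. Further down, `open(ti)` opens the file for reading rather than writing, and `buff` is never defined. A small Python 3 sketch with made-up URLs reproduces the empty strings without network access:

```python
import re

# Hypothetical page fragment standing in for the_webpage.
the_webpage = ('file=http://example.com/a.mp3&x=1 '
               'file=http://example.com/b.mp3')

a2 = [x + '.mp3' for x in re.findall(r'file=(.*?)\.mp3', the_webpage)]

# x is a string here, so x[0] is its first character and x[0][5:] is ''.
broken = [x[0][5:] for x in a2]

print(a2)
print(broken)

# A request takes one URL at a time, so iterate over the list
# instead of passing the list itself.
for url in a2:
    print('would request:', url)
```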

score 0 · Accepted Answer

The first part, titl1 = re.findall(r'<title>(.*?)</title>',the_webpage), returns a list, and a list is printed with brackets and quotes. So try print titl1[0] if you are sure there will always be exactly one match. (You could also try re.search instead.)
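The `re.search` route could look like the sketch below (Python 3). The function name `page_title` is hypothetical, and the `re.DOTALL` and `re.IGNORECASE` flags are assumptions added to tolerate a title that spans lines or an upper-case `<TITLE>` tag.

```python
import re


def page_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = re.search(r'<title>(.*?)</title>', html,
                      re.DOTALL | re.IGNORECASE)
    # search returns a match object (or None), not a list, so there are
    # no brackets or quotes to strip.
    return match.group(1).strip() if match else None


print(page_title('<html><title>random webpage example1</title></html>'))
```

Unlike indexing into the `findall` result, this returns `None` rather than raising `IndexError` when the page has no title.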

For the second part, change the re pattern from "(file=(.*?)\.mp3)" to "file=(.*?)\.mp3". You will then get only the 'http://linkInThisPart/path/etc/etc' part, to which you need to add the .mp3 extension.

That is:

audio_links = [x +'.mp3' for x in re.findall(r'file=(.*?)\.mp3',web_page)]

To download the file, look into urllib and urllib2.

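import urllib2

url = 'http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3'
req = urllib2.Request(url)
# Open in binary write mode so the mp3 bytes are saved unchanged.
temp_file = open('random webpage example1.mp3', 'wb')
# Read the whole response body into memory, then write it out.
data = urllib2.urlopen(req).read()
temp_file.write(data)
temp_file.close()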
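To name the downloads after the page title, the title and link extraction can be combined into one step that decides on file names before anything is fetched. This is a rough Python 3 sketch. The function `plan_downloads`, the fallback name `untitled`, the character replacement, and the numbering scheme for multiple links are all assumptions, not something from the answers above.

```python
import re


def plan_downloads(html):
    """Return (url, filename) pairs for every distinct mp3 link in html."""
    title_match = re.search(r'<title>(.*?)</title>', html, re.DOTALL)
    title = title_match.group(1).strip() if title_match else 'untitled'
    # Replace characters that are awkward in file names (an assumption;
    # adjust for your platform).
    safe_title = re.sub(r'[\\/:*?"<>|]', '_', title)

    urls = [stem + '.mp3' for stem in re.findall(r'file=(.*?)\.mp3', html)]
    unique_urls = list(dict.fromkeys(urls))  # drop repeats, keep order

    plan = []
    for index, url in enumerate(unique_urls, start=1):
        # Number the files only when there is more than one link.
        suffix = '' if len(unique_urls) == 1 else ' %d' % index
        plan.append((url, safe_title + suffix + '.mp3'))
    return plan
```

Each resulting pair can then be handed to whatever download call is in use. The page in the question lists the same link three times, so removing duplicates first avoids fetching the same file repeatedly.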
