html - Remove all whitespace from an HTML string

Translated from: https://stackoverflow.com/questions/42545589 2017-03-02T02:09:33.493


I'm trying to write code that removes all whitespace and space characters and then counts the three most frequent alphanumeric characters on a page. My question has two parts.

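1) The method I'm using to split doesn't seem to work, and I can't tell why. As far as I understand, splitting and then joining should remove all whitespace and spaces from the HTML source, but it doesn't (see the first value returned in the Amazon example below).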

2) I'm not very familiar with the most_common operation. When I tested the code against "http://amazon.com", I got the following output:

The top 3 occuring alphanumeric characters in the html of http://amazon.com 
:  [(u' ', 258), (u'a', 126), (u'e', 126)]

What does the u in the returned most_common(3) values mean?
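For context on the second question: in Python 2 the u prefix marks a unicode string literal, and response.text from requests is a unicode string, so each character in the printed list is shown with that prefix. A minimal sketch, written so that it also runs under Python 3 (where the prefix is accepted but no longer displayed); the sample string is made up for illustration and is not the Amazon page:

```python
from __future__ import print_function

import collections

sample = u"banana"  # hypothetical sample text, not real page content
counts = collections.Counter(sample).most_common(2)

# Each entry is a (character, count) tuple. The u prefix is only part of how
# Python 2 displays a unicode string inside a list; it is not part of the data.
top_char, top_count = counts[0]
print(top_char, top_count)
```

Printing an individual character, as in the last line, shows it without the prefix.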

My current code:

import requests
import collections


url = raw_input("please enter the url of the desired website (include http://): ")

response = requests.get(url)
responseString = response.text

print responseString

topThreeAlphaString = " ".join(filter(None, responseString.split()))

lineNumber = 0

for line in topThreeAlphaString:
    line = line.strip()
    lineNumber += 1

topThreeAlpha = collections.Counter(topThreeAlphaString).most_common(3)

print "The top 3 occuring alphanumeric characters in the html of", url,": ", topThreeAlpha
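
A likely cause of the first symptom: " ".join(...) puts a single space back between every token that split() produced, so the space character is still the most common one. Below is a minimal sketch of one way to drop the whitespace and count only alphanumeric characters, assuming Python 3 syntax rather than the Python 2 used above. The helper name top_alnum and the sample HTML are hypothetical.

```python
import collections


def top_alnum(text, n=3):
    """Return the n most common alphanumeric characters in text."""
    # split() with no arguments splits on any run of whitespace; joining
    # with an empty string removes it instead of re-inserting spaces.
    compact = "".join(text.split())
    # Keep only letters and digits so markup such as < > / is not counted.
    alnum_only = (ch for ch in compact if ch.isalnum())
    return collections.Counter(alnum_only).most_common(n)


sample_html = "<p>aaaa  bbb\n\tc</p>"  # made-up input for illustration
print(top_alnum(sample_html))
```

With a real page, the string passed to top_alnum would be response.text. Note that this still counts characters inside tag names and attributes, since it works on the raw HTML source.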
