python - Load a JSON file, fix the HTML, and load it into BeautifulSoup

Question

I'm trying to process a JSON file with BeautifulSoup, but I can't figure out how to do it...

Below is a copy of the JSON. I'm trying to go through each id in the JSON and extract specific bits of data... Would anyone suggest a different route?

{
    "line_type":"Test",
    "title":"Test Test Test",
    "timestamp":"201310200000",
    "line": [
                                        { 
            "id":10,
            "text": "<h1 id=\"r021\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":9,
            "text": "<h1 id=\"r023\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":8,
            "text": "<h1 id=\"r024\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":7,
            "text": "<h1 id=\"r026\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":6,
            "text": "<h1 id=\"r027\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":5,
            "text": "<h1 id=\"r028\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":4,
            "text": "<h1 id=\"r029\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":3,
            "text": "<h1 id=\"r031\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":2,
            "text": "<h1 id=\"r032\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":1,
            "text": "<h1 id=\"r035\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                }                     ]
}

Thanks in advance - Hyflex

score 3 · Accepted Answer

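I'm fairly sure this is what you are looking for. For each line, it loads the "text" attribute into BeautifulSoup and pulls out every attribute you might need. You can generalize it to whatever behavior you want; it should be fairly readable.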

import json
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
myjson = r"""{
    "line_type":"Test",
    "title":"Test Test Test",
    "timestamp":"201310200000",
    "line": [
                                        { 
            "id":10,
            "text": "<h1 id=\"r021\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":9,
            "text": "<h1 id=\"r023\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":8,
            "text": "<h1 id=\"r024\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":7,
            "text": "<h1 id=\"r026\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":6,
            "text": "<h1 id=\"r027\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":5,
            "text": "<h1 id=\"r028\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":4,
            "text": "<h1 id=\"r029\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":3,
            "text": "<h1 id=\"r031\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":2,
            "text": "<h1 id=\"r032\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":1,
            "text": "<h1 id=\"r035\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                }                     ]
}"""

data = json.loads(myjson)

for entry in data['line']:
    soup = BeautifulSoup(entry['text'])
    # print(soup.prettify())
    # Get the H1 ID
    print(soup.findAll('h1')[0]['id'])
    # Get the text
    print(soup.findAll('h1')[0].contents[0].strip())
    # Get the <a> href
    print(soup.findAll('a')[0]['href'])
    # Get the <a> class
    print(soup.findAll('a')[0]['class'])
    # Get the <a> text
    print(soup.findAll('a')[0].contents[0].strip())
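
If you want the values collected rather than printed, the same idea can be wrapped in a small function. This is only a sketch: `extract_fields`, `RAW_JSON`, and the dictionary keys are names made up for this example, the JSON is shortened to two entries, and it assumes bs4 with Python's built-in `html.parser`.

```python
import json
from bs4 import BeautifulSoup

# Shortened copy of the question's JSON (two entries instead of ten).
RAW_JSON = r"""{
    "line": [
        {"id": 10, "text": "<h1 id=\"r021\">\n  Titles here  <\/h3>\n\n  <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n"},
        {"id": 9, "text": "<h1 id=\"r023\">\n  Titles here  <\/h3>\n\n  <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n"}
    ]
}"""


def extract_fields(entry):
    """Return the interesting parts of one entry of the "line" array as a dict."""
    # Naming the parser explicitly keeps the result the same on every machine.
    soup = BeautifulSoup(entry["text"], "html.parser")
    heading = soup.find("h1")
    link = soup.find("a")
    return {
        "id": entry["id"],
        "heading_id": heading["id"] if heading else None,
        # First non-empty piece of text inside the heading.
        "heading_text": next(heading.stripped_strings, None) if heading else None,
        "link_href": link["href"] if link else None,
        # bs4 treats class as multi-valued, so this is a list of class names.
        "link_class": link.get("class") if link else None,
        "link_text": link.get_text(strip=True) if link else None,
    }


records = [extract_fields(entry) for entry in json.loads(RAW_JSON)["line"]]
for record in records:
    print(record)
```

Using `find` and checking for `None` avoids an IndexError when an entry has no `<h1>` or `<a>`.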

score 2 · Accepted Answer

You can't process JSON data with BeautifulSoup. You can use the json module like this:

import json
from pprint import pprint

json_data = r"""
{
    "line_type":"Test",
    "title":"Test Test Test",
    "timestamp":"201310200000",
    "line": [
                                        {
            "id":10,
            "text": "<h1 id=\"r021\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":9,
            "text": "<h1 id=\"r023\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":8,
            "text": "<h1 id=\"r024\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":7,
            "text": "<h1 id=\"r026\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":6,
            "text": "<h1 id=\"r027\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":5,
            "text": "<h1 id=\"r028\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":4,
            "text": "<h1 id=\"r029\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":3,
            "text": "<h1 id=\"r031\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":2,
            "text": "<h1 id=\"r032\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":1,
            "text": "<h1 id=\"r035\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                }                     ]
}
"""

s = json.loads(json_data)

# Print the HTML stored in each entry's "text" field
for entry in s['line']:
    pprint(entry['text'])

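You are probably getting a ValueError because you forgot to put the r prefix before the string declaration. Without it, Python processes escapes such as \" and \n itself, so the json module receives unescaped quotes and literal newlines instead of the backslash sequences.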
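
To see the difference the r prefix makes, here is a minimal sketch using a one-field stand-in for the JSON above. The variable names are illustrative only.

```python
import json

# Raw string: the backslashes reach json.loads untouched, so \" is a JSON escape.
raw_text = r'{"text": "<h1 id=\"r021\">Title"}'

# Normal string: Python turns \" into a bare " first,
# which ends the JSON string value too early.
cooked_text = '{"text": "<h1 id=\"r021\">Title"}'

parsed = json.loads(raw_text)
print(parsed["text"])

error = None
try:
    json.loads(cooked_text)
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    error = exc
    print("ValueError:", exc)
```

This only matters when the JSON is pasted into Python source as a literal. Data read from a file or an HTTP response is not affected by string-literal escaping.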

You can also use BeautifulSoup for this.

# Imports
import json
from bs4 import BeautifulSoup

# json_data is the raw JSON string defined in the previous snippet
s = json.loads(json_data)
list_of_html_in_json = [entry['text'] for entry in s['line']]
soup = BeautifulSoup(" ".join(list_of_html_in_json), "html.parser")
print(soup.find_all("h1", {"id": "r035"}))  # Example

Because this uses an external library (bs4), I can't show an online version of the code. But I assure you I have tried and tested it.
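
Joining all the fragments into one document loses track of which JSON entry a tag came from. As an alternative sketch, each fragment can be parsed on its own. `find_entry_by_heading_id` is a hypothetical helper written for this example, the JSON is cut down to two entries, and `html.parser` is an assumed parser choice.

```python
import json
from bs4 import BeautifulSoup

# Two entries from the question's JSON, with the whitespace trimmed.
RAW_JSON = r"""{"line": [
    {"id": 2, "text": "<h1 id=\"r032\">Titles here<\/h3><a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>"},
    {"id": 1, "text": "<h1 id=\"r035\">Titles here<\/h3><a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>"}
]}"""


def find_entry_by_heading_id(entries, heading_id):
    """Return the first entry whose HTML has an <h1> with this id, or None."""
    for entry in entries:
        soup = BeautifulSoup(entry["text"], "html.parser")
        if soup.find("h1", id=heading_id) is not None:
            return entry
    return None


entries = json.loads(RAW_JSON)["line"]
match = find_entry_by_heading_id(entries, "r035")
print(match["id"] if match else "not found")
```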

score 1 · Accepted Answer

Just my attempt:

import requests
import json
from bs4 import BeautifulSoup

# Use the requests library to get the JSON data
response = requests.request("GET", "http://www.websitehere.com/")  # Make sure you include the http part
# Decode the response body as JSON
JSONDATA = response.json()

# Cycle through each `line` in the JSON
for line in JSONDATA['line']:
    # Load the HTML fragment into BeautifulSoup
    soup = BeautifulSoup(line['text'], "html.parser")
    # Print tidy HTML
    print(soup.prettify())

Hope that helps :)
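
If the JSON is in a local file rather than behind a URL, `json.load` can read it directly from the open file. A small sketch, assuming the file has the same shape as the JSON in the question; `load_html_fragments` and the temporary sample file exist only for this demonstration.

```python
import json
import os
import tempfile


def load_html_fragments(path):
    """Read a JSON file shaped like the question's and return its HTML strings."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)  # json.load parses from a file object
    return [entry["text"] for entry in data["line"]]


# Demo only: write a tiny stand-in file, read it back, then clean up.
sample = {"line": [{"id": 1, "text": "<h1 id=\"r035\">Titles here</h1>"}]}
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    json.dump(sample, tmp)
    sample_path = tmp.name

fragments = load_html_fragments(sample_path)
os.remove(sample_path)
print(fragments)
```

Each returned string can then be passed to BeautifulSoup exactly as in the loop above.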
