プロジェクトに Scrapy を使用しています。このプロジェクトでは、xml から情報を抽出しています。
xml ドキュメントで、for ループを実装したいフォーマット:
<relatedPersonsList>
<relatedPersonInfo>...</relatedPersonInfo>
<relatedPersonInfo>
<relatedPersonName>
<firstName>Mark</firstName>
<middleName>E.</middleName>
<lastName>Lucas</lastName>
</relatedPersonName>
<relatedPersonAddress>
<street1>1 IMATION WAY</street1>
<city>OAKDALE</city>
<stateOrCountry>MN</stateOrCountry>
<stateOrCountryDescription>MINNESOTA</stateOrCountryDescription>
<zipCode>55128</zipCode>
</relatedPersonAddress>
<relatedPersonRelationshipList>
<relationship>Executive Officer</relationship>
<relationship>Director</relationship>
</relatedPersonRelationshipList>
<relationshipClarification/>
</relatedPersonInfo>
<relatedPersonInfo>...</relatedPersonInfo>
<relatedPersonInfo>...</relatedPersonInfo>
<relatedPersonInfo>...</relatedPersonInfo>
<relatedPersonInfo>...</relatedPersonInfo>
<relatedPersonInfo>...</relatedPersonInfo>
</relatedPersonsList>
でわかるように、<relatedPersonsList>
複数の を使用できます<relatedPersonInfo>
。また、for ループを作成しようとしても、最初の人の情報しか得られません。
これは私の実際のコードです:
for person in xxs.select('./relatedPersonsList/relatedPersonInfo'):
item = Myform() #even if get rid of it I get the same result
item["firstName"] = person.select('./relatedPersonName/firstName/text()').extract()[0]
item["middleName"] = person.select('./relatedPersonName/middleName/text()')
if item["middleName"]:
item["middleName"] = item["middleName"].extract()[0]
else:
item["middleName"] = "NA"
スパイダーで使用したコードは次のとおりです。
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import XmlXPathSelector
from scrapy.http import Request
import urlparse
from formds.items import SecformD
class SecDform(CrawlSpider):
name = "DFORM"
allowed_domain = ["http://www..gov"]
start_urls = [
""
]
rules = (
Rule(
SgmlLinkExtractor(restrict_xpaths=["/html/body/div/table/tr/td[3]/a[2]"]),
callback='parse_formd',
#follow= True no need of follow thing
),
Rule(
SgmlLinkExtractor(restrict_xpaths=('/html/body/div/center[1]/a[contains(., "[NEXT]")]')),
follow=True
),
)
def parse_formd(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//*[@id="formDiv"]/div/table/tr[3]/td[3]/a/@href').extract()
for site in sites:
yield Request(url=urlparse.urljoin(response.url, site), callback=self.parse_xml_document)
def parse_xml_document(self, response):
xxs = XmlXPathSelector(response)
item = SecformD()
item["stateOrCountryDescription"] = xxs.select('./primaryIssuer/issuerAddress/stateOrCountryDescription/text()').extract()[0]
item["zipCode"] = xxs.select('./primaryIssuer/issuerAddress/zipCode/text()').extract()[0]
item["issuerPhoneNumber"] = xxs.select('./primaryIssuer/issuerPhoneNumber/text()').extract()[0]
for person in xxs.select('./relatedPersonsList//relatedPersonInfo'):
#item = SecDform()
item["firstName"] = person.select('./relatedPersonName/firstName/text()').extract()[0]
item["middleName"] = person.select('./relatedPersonName/middleName/text()')
if item["middleName"]:
item["middleName"] = item["middleName"].extract()[0]
else:
item["middleName"] = "NA"
return item
次のコマンドを使用して、情報を .json ファイルに抽出します。