
I need to parse URLs like the one below with Scrapy (ads from a real estate agent):

http://ws.seloger.com/search.xml?idq=?&cp=72&idqfix=1&pxmin=30000&pxmax=60000&idtt=2&SEARCHpg=1&getDtCreationMax=1&tri=d_dt_crea

The response from the server is limited to 200 results, whatever Min/Max price you use in the URL (see pxmin / pxmax in the URL).

Therefore, I would like to use a function that generates URLs for start_urls with the right price bands, so that no single request goes over 200 search results and the URLs together cover a price range of, say, [0:1000000].

The function would do the following (a rough sketch follows the list):

  • Take the first URL
  • Check the number of results (the "nbTrouvees" tag in the XML response)
  • Adjust the price band if results > 200, or add the URL to the start_urls list if < 200
  • Increment the price band until it reaches the price of 1,000,000
  • Return the final start_urls list, which will cover all properties for a given region
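
Something along these lines is what I have in mind (only a sketch, written for Python 2.7 and the old Scrapy selectors; the 50,000 starting band, the halving strategy and the XPath to "nbTrouvees" are my own assumptions):

    import urllib2
    from scrapy.http import TextResponse
    from scrapy.selector import XmlXPathSelector

    # URL template from above, with pxmin/pxmax left as placeholders
    BASE_URL = ('http://ws.seloger.com/search.xml?idq=?&cp=72&idqfix=1'
                '&pxmin=%d&pxmax=%d&idtt=2&SEARCHpg=1'
                '&getDtCreationMax=1&tri=d_dt_crea')

    def count_results(pxmin, pxmax):
        """Fetch one search page and read the nbTrouvees tag."""
        url = BASE_URL % (pxmin, pxmax)
        body = urllib2.urlopen(url).read()
        response = TextResponse(url=url, body=body, encoding='utf-8')
        xxs = XmlXPathSelector(response)
        return int(xxs.select('//nbTrouvees/text()').extract()[0])

    def build_start_urls(price_max=1000000, step=50000):
        """Walk the price axis, shrinking each band until it has <= 200 hits."""
        start_urls = []
        pxmin = 0
        while pxmin < price_max:
            pxmax = min(pxmin + step, price_max)
            # Halve the band while the server reports more than 200 results.
            while count_results(pxmin, pxmax) > 200 and pxmax > pxmin + 1:
                pxmax = pxmin + (pxmax - pxmin) // 2
            start_urls.append(BASE_URL % (pxmin, pxmax))
            pxmin = pxmax
        return start_urls

The resulting list would then be handed to the spider, e.g. start_urls = build_start_urls().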

This obviously means numerous requests to the server just to find the right price ranges, plus all the requests generated by the spider for the final scraping.

1) My first question therefore is: is there a better way to tackle this, in your view?

2) My second question: I have tried to retrieve the content of one of these pages with Scrapy, just to see how I could parse the "nbTrouvees" tag without using a spider, but I am stuck.

I tried using the TextResponse method but got nothing in return. I then tried the below, but it fails because the method body_as_unicode does not exist on a Response object.

>>>link = 'http://ws.seloger.com/search.xml?idq=1244,1290,1247&ci=830137&idqfix=1&pxmin=30000&pxmax=60000&idtt=2&SEARCHpg=1&getDtCreationMax=1&tri=d_dt_crea'

>>>xxs = XmlXPathSelector(Response(link))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Gilles/workspace/Immo-Lab/lib/python2.7/site-         packages/scrapy/selector/lxmlsel.py", line 31, in __init__
    _root = LxmlDocument(response, self._parser)
  File "/Users/Gilles/workspace/Immo-Lab/lib/python2.7/site-    packages/scrapy/selector/lxmldocument.py", line 27, in __new__
    cache[parser] = _factory(response, parser)
  File "/Users/Gilles/workspace/Immo-Lab/lib/python2.7/site-    packages/scrapy/selector/lxmldocument.py", line 13, in _factory
    body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
AttributeError: 'Response' object has no attribute 'body_as_unicode'

Any idea? (FYI, it works with my spider.)
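
For what it's worth, the flow I am trying to reproduce outside a spider looks roughly like this (a sketch only; I download the XML myself and wrap it in a TextResponse so that body_as_unicode() exists, but I am not sure these are the right classes/arguments):

    import urllib2
    from scrapy.http import TextResponse
    from scrapy.selector import XmlXPathSelector

    link = ('http://ws.seloger.com/search.xml?idq=1244,1290,1247&ci=830137'
            '&idqfix=1&pxmin=30000&pxmax=60000&idtt=2&SEARCHpg=1'
            '&getDtCreationMax=1&tri=d_dt_crea')

    # A bare Response has no body_as_unicode(); a TextResponse built from a
    # real body should.
    body = urllib2.urlopen(link).read()
    response = TextResponse(url=link, body=body, encoding='utf-8')

    xxs = XmlXPathSelector(response)
    print xxs.select('//nbTrouvees/text()').extract()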

Thank you, Gilles

