I am trying to write a parser (it parses the streets and houses for each postal code) with EventMachine and em-synchrony. The website I want to parse has a nested structure: for each postal code there are many pages of streets, linked through pagination. So the algorithm is pretty simple:
- for each postal code
  - visit the postal code index page
  - parse the index page
  - parse the pagination
  - for each pagination page
    - visit that page
    - parse that page
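The steps above can be sketched in plain, synchronous Ruby with no EventMachine involved, just to pin down the nesting. Everything here is hypothetical: `fake_site` is an in-memory stand-in for the real website, `fetch` stands in for an HTTP request plus Nokogiri parsing, and `crawl` is only an illustrative name.

```ruby
# Hypothetical stand-in for the website: postal code => list of pages,
# where each page is a list of street names. Page 1 is the index page.
def fake_site
  {
    "10001" => [["Main St", "Oak Ave"], ["Pine Rd"]],
    "10002" => [["Elm St"]]
  }
end

# Hypothetical fetch: returns the streets found on one page plus the total
# number of pages, as if both had been parsed out of the HTML.
def fetch(postal_code, page = 1)
  pages = fake_site.fetch(postal_code)
  { streets: pages[page - 1], total_pages: pages.size }
end

# Walks the nested structure: index page first, then any remaining pages.
def crawl(postal_codes)
  postal_codes.each_with_object({}) do |code, result|
    index = fetch(code)                      # visit and parse the index page
    streets = index[:streets].dup
    (2..index[:total_pages]).each do |page|  # empty range when there is one page
      streets.concat(fetch(code, page)[:streets])
    end
    result[code] = streets
  end
end

streets_by_code = crawl(fake_site.keys)
```

The asynchronous version has the same shape; the open question is how to run the inner loop without blocking the reactor.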
Here is an example of such a parser (it works):
    require "nokogiri"
    require "em-synchrony"
    require "em-synchrony/em-http"

    # Builds the URL of the index page, or of a given pagination page.
    def url(page = nil)
      url = "http://gistflow.com/all"
      url << "?page=#{page}" if page
      url
    end

    EM.synchrony do
      concurrency = 2
      # here [1] is the array of index pages; for this template let it be just [1]
      results = EM::Synchrony::Iterator.new([1], concurrency).map do |index, iter|
        index_page = EM::HttpRequest.new(url).aget
        index_page.callback do
          # here we do some parsing and find out whether the index page
          # has pagination. The worst case is that it has pagination
          pages = [2, 3, 4, 5]
          unless pages.empty?
            # here we need to parse all pages
            # with urls like url(page)
            # how can I do it more efficiently?
          end
          iter.return "SUCC #{index}"
        end
        index_page.errback do
          iter.return "ERR #{index}"
        end
      end
      p results
      EM.stop
    end
So the trick is inside this block:
    unless pages.empty?
      # here we need to parse all pages
      # with urls like url(page)
      # how can I do it more efficiently?
    end
How can I implement nested EM HTTP calls inside a synchrony iterator loop?
I have tried different approaches, but each time I either got an error like "couldn't yield from root fiber" or the errback block was called.
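For reference, the "root fiber" kind of error can be reproduced without EventMachine at all. The sketch below uses only Ruby's core `Fiber` class. It assumes, and this is only an assumption about the failing attempts, that the code ended up calling `Fiber.yield` outside any `Fiber.new` block, for example from a callback invoked directly by the reactor. `yield_from_root` is a hypothetical helper name, and the exact error message differs between Ruby versions.

```ruby
# Fiber.yield only works inside a fiber that was started with resume.
# Called at the top level (the root fiber), it raises FiberError.
def yield_from_root
  Fiber.yield :paused
  :no_error
rescue FiberError => e
  e.class
end

root_result = yield_from_root

# Inside its own fiber, the same call simply suspends the block and hands
# control back to whoever called resume.
fiber = Fiber.new do
  Fiber.yield :paused
  :finished
end

first_resume  = fiber.resume  # runs until Fiber.yield
second_resume = fiber.resume  # continues to the end of the block
```

If the assumption holds, running the nested requests inside their own `Fiber.new { ... }.resume` would be one thing to try.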