Web scraping relies on the HTML structure of a page, so it can never be completely stable. When the HTML structure changes, the scraper may break. Keep this in mind while reading this article: by the time you read it, the CSS selectors used here may already be outdated.
Scraping transforms the massive amount of unstructured HTML on the web into structured data. A good scraper retrieves the required data much faster than a human can. In the previous article, we built a simple asynchronous web scraper. It accepts an array of URLs and makes asynchronous requests to them. When the responses arrive, it parses the data out of them. Asynchronous requests increase the speed of scraping: instead of waiting for all requests to be executed one by one, we run them all at once and, as a result, wait only for the slowest one.
It is very convenient to have a single HTTP client which can send as many concurrent HTTP requests as you want. But at the same time, a badly behaved scraper that performs hundreds of concurrent requests per second can hurt the performance of the site being scraped. Since scrapers don't drive any human traffic to the site and merely affect its performance, some sites don't like them and try to block their access. The easiest way to avoid being blocked is to crawl nicely and throttle the scraping speed by limiting the number of concurrent requests. The faster you scrape, the worse it is for everybody. A scraper should behave like a human and pace its requests accordingly.
A good solution for throttling requests is a simple queue. Let's say that we are going to scrape 100 pages but want to send only 10 requests at a time. To achieve this, we put all the requests in the queue and start the first 10. Each time a request completes, we take a new one out of the queue.
Queue Of Concurrent Requests
For a simple task like web scraping, powerful tools such as RabbitMQ, ZeroMQ, or Kafka would be overkill. Actually, all our scraper needs is a simple in-memory queue, and the ReactPHP ecosystem already has a solution for it: clue/mq-react, a library written by Christian Lück. Let's figure out how we can use it to throttle multiple HTTP requests.
First things first we should install the library:
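The install command is not included here; it would be the usual Composer invocation (the version constraint is an assumption and may differ):

```bash
composer require clue/mq-react:^1.0
```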
Here is the problem we need to solve: create a queue of HTTP requests and execute only a certain number of them at a time.
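At this point the original snippet performing the task without a queue is not shown; a minimal sketch of what it might look like, assuming clue/reactphp-buzz's `Browser` as the HTTP client (as in the previous article) and placeholder IMDB URLs:

```php
<?php

use Clue\React\Buzz\Browser;
use Psr\Http\Message\ResponseInterface;

require __DIR__ . '/vendor/autoload.php';

$loop = React\EventLoop\Factory::create();
$browser = new Browser($loop);

$urls = [
    'http://www.imdb.com/title/tt1270797/',
    'http://www.imdb.com/title/tt2527336/',
    // ... potentially hundreds more
];

// Without a queue: every request starts immediately, no throttling.
foreach ($urls as $url) {
    $browser->get($url)->then(function (ResponseInterface $response) use ($url) {
        echo $url . ': ' . $response->getStatusCode() . PHP_EOL;
    });
}

$loop->run();
```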
Now, let's perform the same task, but with the queue. First of all, we need to instantiate a queue (create an instance of Clue\React\Mq\Queue). It allows us to concurrently execute the same handler (a callback that returns a promise) with different (or the same) arguments:
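The original snippet is not reproduced here; a sketch based on the clue/mq-react API, again assuming clue/reactphp-buzz's `Browser`:

```php
<?php

use Clue\React\Buzz\Browser;
use Clue\React\Mq\Queue;

require __DIR__ . '/vendor/autoload.php';

$loop = React\EventLoop\Factory::create();
$browser = new Browser($loop);

// At most 2 handlers run concurrently; the second argument (null)
// means there is no upper limit on the number of queued jobs.
$queue = new Queue(2, null, function ($url) use ($browser) {
    return $browser->get($url);
});
```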
In the snippet above we create a queue which allows only two handlers to execute at a time. Each handler is a callback that accepts a $url argument and returns a promise via $browser->get($url). This $queue instance can then be used to queue the requests:
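A sketch of how the queue might be invoked (the URLs are placeholders):

```php
$urls = [
    'http://www.imdb.com/title/tt1270797/',
    'http://www.imdb.com/title/tt2527336/',
];

// $queue($url) queues a $browser->get($url) call; at most
// two of them will be in flight at any moment.
foreach ($urls as $url) {
    $queue($url)->then(
        function (Psr\Http\Message\ResponseInterface $response) use ($url) {
            echo $url . ': ' . $response->getStatusCode() . PHP_EOL;
        }
    );
}

$loop->run();
```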
In the snippet above the $queue instance is called as a function. Instances of Clue\React\Mq\Queue are invokable and accept any number of arguments, all of which are passed to the handler wrapped by the queue. Consider calling $queue($url) as placing a $browser->get($url) call into the queue. From this moment the queue controls the number of concurrent requests. In our queue instantiation we set $concurrency to 2, meaning at most two concurrent requests at a time. While two requests are being executed, the others wait in the queue. Once one of the requests completes (the promise from $browser->get($url) resolves), a new request starts.
Scraper With Queue
The Scraper, via its scrape($urls) method, accepts an array of IMDB URLs and sends asynchronous requests to these pages. When the responses arrive, the extractFromHtml($html) method scrapes the data out of them. The following code can be used to scrape data about two movies and then print this data to the screen:
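The original usage snippet is missing here; a sketch under the assumption that the Scraper from the previous article is constructed with a Browser and exposes a getter for the collected data (the getMovieData() name and the exact URLs are hypothetical):

```php
<?php

use Clue\React\Buzz\Browser;

require __DIR__ . '/vendor/autoload.php';

$loop = React\EventLoop\Factory::create();

$scraper = new Scraper(new Browser($loop));
$scraper->scrape([
    'http://www.imdb.com/title/tt1270797/',
    'http://www.imdb.com/title/tt2527336/',
]);

$loop->run();
print_r($scraper->getMovieData());
```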
If you want a more detailed explanation of building this scraper read the previous post “Fast Web Scraping With ReactPHP”.
From the previous section we already know that queuing the requests consists of two steps:
- instantiate a queue providing a concurrency limit and a handler
- add asynchronous calls to the queue
To integrate a queue with the Scraper class, we need to update its scrape(array $urls, $timeout = 5) method, which sends the asynchronous requests. First, we accept a new argument for the concurrency limit and then instantiate a queue, providing this limit and a handler:
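The updated method is not shown in this copy; a sketch of how it might begin, assuming the client is stored in $this->client as in the previous article:

```php
public function scrape(array $urls, $timeout = 5, $concurrency = 10)
{
    // The handler is the same asynchronous GET call as before,
    // now wrapped by the queue with the given concurrency limit.
    $queue = new \Clue\React\Mq\Queue($concurrency, null, function ($url) {
        return $this->client->get($url);
    });

    // ... the requests are queued below ...
}
```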
As a handler we use a $this->client->get($url) call, which makes an asynchronous request to the specified URL and returns a promise. Once the request is done and the response is received, the promise fulfills with this response.
The next step is to invoke the queue with the specified URLs. The $queue variable is now a placeholder for the $this->client->get($url) call, except that the call is taken from the queue. So we can simply replace this call with $queue($url):
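A sketch of the replacement inside scrape(), assuming the response handling from the previous article (the extractFromHtml() call on the response body):

```php
foreach ($urls as $url) {
    // Previously: $promise = $this->client->get($url);
    $promise = $queue($url);

    $promise->then(function (\Psr\Http\Message\ResponseInterface $response) {
        $this->extractFromHtml((string) $response->getBody());
    });
}
```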
And we are done. All the concurrency-limiting logic is hidden from us and handled by the queue. Now, to scrape the pages with at most 10 concurrent requests, we call the Scraper like this:
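A sketch of the final call (the timeout and concurrency values follow the example in the text):

```php
$scraper = new Scraper(new Clue\React\Buzz\Browser($loop));
$scraper->scrape($urls, 5, 10); // 5-second timeout, 10 concurrent requests
```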
scrape() accepts an array of URLs to scrape, then a timeout for each request, and, as the last argument, a concurrency limit.
It is good practice to throttle concurrent requests, to avoid firing hundreds of them at once and risking being blocked by the site. In this article I've given a quick overview of how you can use a lightweight in-memory queue together with an HTTP client to limit the number of concurrent requests.
You can find examples from this article on GitHub.
This article is a part of the ReactPHP Series.