Web scraping relies on the HTML structure of the page, so it can never be completely stable. When the HTML structure changes, the scraper may break. Keep this in mind when reading this article: by the time you read it, the CSS selectors used here may be outdated.
In the previous article, we created a scraper that parses movie data from IMDB. We also used a simple in-memory queue to avoid sending hundreds or thousands of concurrent requests and getting blocked as a result. But what if you are already blocked? The site that you are scraping has added your IP to its blacklist, and you don’t know whether the block is temporary or permanent.
Such issues can be resolved with a proxy server. Using proxies and rotating IP addresses makes it harder to detect you as a scraper. The idea behind rotating IP addresses while scraping is to make your scraper look like real users accessing the website from multiple locations. If you implement it right, you drastically reduce the chances of being blocked.
In this article, I will show you how to send concurrent HTTP requests with ReactPHP using a proxy server. We will play around with some concurrent HTTP requests and then we will come back to the scraper, which we have written before. We will update the scraper to use a proxy server for performing requests.
How to send requests through a proxy in ReactPHP
To send concurrent HTTP requests we will use the clue/reactphp-buzz package. Install it with Composer by running composer require clue/buzz-react.
Now, let’s write a simple asynchronous HTTP request:
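The following is a minimal sketch of such a request. It assumes the 2.x line of the package and Composer's autoloader at vendor/autoload.php; adjust the path to your project layout.

```php
<?php
// Assumption: dependencies were installed with Composer into ./vendor.
require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use Psr\Http\Message\ResponseInterface;
use React\EventLoop\Factory;

$loop = Factory::create();
$client = new Browser($loop);

// get() returns a promise; the callback runs once the response has arrived.
$client->get('http://google.com')->then(
    function (ResponseInterface $response) {
        echo (string) $response->getBody();
    }
);

// The event loop drives the pending request to completion.
$loop->run();
```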
We create an instance of Clue\React\Buzz\Browser, which is an asynchronous HTTP client. Then we request the Google web page via the get($url) method. get($url) returns a promise, which resolves with an instance of Psr\Http\Message\ResponseInterface. The snippet above requests http://google.com and then prints its HTML.
For a more detailed explanation of working with this asynchronous HTTP client check this post.
Browser is very flexible. You can specify different connection settings, like DNS resolution, TLS parameters, timeouts and of course proxies. All these settings are configured within an instance of Connector. Connector accepts a loop and then a configuration array. So, let’s create one and pass it to our client as a second argument.
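A sketch of this setup could look as follows. It assumes the connector class is React\Socket\Connector and uses its dns option; the DNS address is the one mentioned in this article, so replace it with a resolver you actually use.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use Psr\Http\Message\ResponseInterface;
use React\EventLoop\Factory;
use React\Socket\Connector;

$loop = Factory::create();

// Address as it appears in the article; substitute your own DNS server.
$dnsServer = '22.214.171.124';

// The second constructor argument is the configuration array.
$connector = new Connector($loop, ['dns' => $dnsServer]);

// The connector goes to the HTTP client as its second argument.
$client = new Browser($loop, $connector);

$client->get('http://google.com')->then(
    function (ResponseInterface $response) {
        echo (string) $response->getBody();
    }
);

$loop->run();
```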
This connector tells the client to use 22.214.171.124 for DNS resolution.
Before we can start using a proxy we need to install the clue/reactphp-socks package by running composer require clue/socks-react.
This library provides SOCKS4, SOCKS4a and SOCKS5 proxy client/server implementations for ReactPHP. In our case, we need a client. This client will be used to connect to a proxy server. Then our main HTTP client will use this proxy client to send its requests through the proxy server.
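Creating such a proxy client might look like the sketch below. The variable name $proxy and the SocksClient alias are my own choices for illustration.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Clue\React\Socks\Client as SocksClient;
use React\EventLoop\Factory;
use React\Socket\Connector;

$loop = Factory::create();

// First argument: proxy address. Second: a plain connector with no options.
$proxy = new SocksClient('127.0.0.1:1080', new Connector($loop));
```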
Notice that 127.0.0.1:1080 is just a dummy address. Of course, there is no proxy server running on our machine.
The constructor of the Clue\React\Socks\Client class accepts the address of the proxy server (127.0.0.1:1080) and an instance of Connector. We have already covered Connector above. Here we create an empty connector, with no configuration array.
Clue\React\Socks\Client can be confusing because it looks like one more client in our code. But it is not the same thing as Clue\React\Buzz\Browser: it doesn’t send requests. Consider it a connection, not a client. Its main purpose is to establish a connection to a proxy server. The real client then uses this connection to perform requests.
To use this proxy connection we need to update the connector and pass our proxy client as its tcp option.
The full code now looks like this:
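Putting the pieces together, a sketch of the whole script could be the following. The proxy address is still the dummy one, so the request is not expected to succeed until you replace it with a real proxy.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use Clue\React\Socks\Client as SocksClient;
use Psr\Http\Message\ResponseInterface;
use React\EventLoop\Factory;
use React\Socket\Connector;

$loop = Factory::create();

// Connection to the proxy server (dummy address).
$proxy = new SocksClient('127.0.0.1:1080', new Connector($loop));

// The tcp option makes the HTTP client open its connections via the proxy.
$connector = new Connector($loop, ['tcp' => $proxy]);
$client = new Browser($loop, $connector);

$client->get('http://google.com')->then(
    function (ResponseInterface $response) {
        echo (string) $response->getBody();
    }
);

$loop->run();
```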
Now, the problem is: where to get a real proxy?
Let’s find a proxy
On the Internet, you can find many sites that list free proxies. For example, you can use https://www.socks-proxy.net. Visit it and pick a proxy from the Socks Proxy list.
In this tutorial, I use
By the time you read this article, this particular proxy will probably no longer work. In that case, pick another proxy from the site mentioned above.
Now, the working example looks like this:
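Here is a sketch of such a script with a rejection handler. Reading the proxy address from the first command-line argument is my own assumption, made so that no particular proxy is hard-coded; it falls back to the dummy address.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use Clue\React\Socks\Client as SocksClient;
use Psr\Http\Message\ResponseInterface;
use React\EventLoop\Factory;
use React\Socket\Connector;

// Usage (hypothetical file name): php proxy.php host:port
$proxyAddress = isset($argv[1]) ? $argv[1] : '127.0.0.1:1080';

$loop = Factory::create();
$proxy = new SocksClient($proxyAddress, new Connector($loop));
$client = new Browser($loop, new Connector($loop, ['tcp' => $proxy]));

$client->get('http://google.com')->then(
    // onFulfilled: the request went through the proxy.
    function (ResponseInterface $response) {
        echo (string) $response->getBody();
    },
    // onRejected: dead proxy, refused connection, error status and so on.
    function (Exception $exception) {
        echo 'Request failed: ' . $exception->getMessage() . PHP_EOL;
    }
);

$loop->run();
```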
Notice that I have added an onRejected callback. A proxy server might not work (especially a free one), so it is useful to show an error when a request fails. Run the code with a working proxy and you will see the HTML of the Google main page.
Updating the scraper
As a reminder, here is the consumer code of the scraper from the previous article:
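A sketch of that consumer code is shown below. It assumes that the Scraper class from the previous article is autoloaded, that its constructor takes the browser and the loop, and that it exposes scrape() and getMovieData(); these names and the two IMDB URLs are illustrative assumptions.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use React\EventLoop\Factory;

$loop = Factory::create();
$client = new Browser($loop);

// Scraper is assumed to come from the previous article.
$scraper = new Scraper($client, $loop);

// Two example pages, each request cancelled after 40 seconds.
$scraper->scrape([
    'http://www.imdb.com/title/tt1270797/',
    'http://www.imdb.com/title/tt2527336/',
], 40);

$loop->run();

print_r($scraper->getMovieData());
```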
We create an event loop. Then we create an instance of Clue\React\Buzz\Browser. The scraper uses this instance to perform concurrent requests. We scrape two URLs with a 40-second timeout. As you can see, we don’t even need to touch the scraper’s code. All we need is to update the Browser constructor and provide a Connector configured to use a proxy server. First, create a proxy client with an empty connector. Then create a new connector for Browser with the tcp option set to that proxy client. The last step is to pass this connector to the Browser constructor.
The updated version, which now uses a proxy, looks like this:
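The three steps can be sketched as follows. As before, the Scraper class and its methods are assumed to come from the previous article, and the proxy address is a dummy that you need to replace.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use Clue\React\Socks\Client as SocksClient;
use React\EventLoop\Factory;
use React\Socket\Connector;

$loop = Factory::create();

// Step 1: proxy client with an empty connector.
$proxy = new SocksClient('127.0.0.1:1080', new Connector($loop));

// Step 2: connector for the browser, routed through the proxy client.
$connector = new Connector($loop, ['tcp' => $proxy]);

// Step 3: hand the connector to the browser.
$client = new Browser($loop, $connector);

// The scraper itself is untouched (class assumed from the previous article).
$scraper = new Scraper($client, $loop);
$scraper->scrape([
    'http://www.imdb.com/title/tt1270797/',
    'http://www.imdb.com/title/tt2527336/',
], 40);

$loop->run();

print_r($scraper->getMovieData());
```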
But, as I have mentioned before, proxies might not work. It would be nice to know why we have scraped nothing. So it looks like we still have to update the scraper’s code and add error handling. The part of the scraper that performs HTTP requests looks like this:
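Below is a reduced sketch of that part. The class layout, property names and the extractFromHtml() placeholder are assumptions on my side; the real parsing logic lives in the previous article and is replaced here by a trivial title lookup.

```php
<?php
use Clue\React\Buzz\Browser;
use Psr\Http\Message\ResponseInterface;
use React\EventLoop\LoopInterface;

final class Scraper
{
    private $client;
    private $loop;
    private $scraped = [];

    public function __construct(Browser $client, LoopInterface $loop)
    {
        $this->client = $client;
        $this->loop = $loop;
    }

    public function scrape(array $urls = [], $timeout = 5)
    {
        $this->scraped = [];

        foreach ($urls as $url) {
            // onFulfilled: scrape the response body. No rejection handler yet.
            $promise = $this->client->get($url)->then(
                function (ResponseInterface $response) {
                    $this->scraped[] = $this->extractFromHtml((string) $response->getBody());
                }
            );

            // Cancel the promise, and with it the request, after the timeout.
            $this->loop->addTimer($timeout, function () use ($promise) {
                $promise->cancel();
            });
        }
    }

    public function getMovieData()
    {
        return $this->scraped;
    }

    // Placeholder for the real parsing logic: only grabs the page title.
    private function extractFromHtml($html)
    {
        preg_match('~<title>(.*?)</title>~is', $html, $matches);

        return ['title' => isset($matches[1]) ? trim($matches[1]) : null];
    }
}
```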
The request logic is located inside the scrape() method. We loop through the specified URLs and perform a concurrent request for each of them. Each request returns a promise. As an onFulfilled handler, we provide a closure in which the response body is scraped. Then we set a timer that cancels the promise, and thus the request, when the timeout expires. One thing is missing here: there is no error handling for this promise. When the parsing is done, there is no way to figure out which errors have occurred. It would be nice to have a list of errors, with URLs as keys and the corresponding errors as values.
So, let’s add a new $errors property and a getter for it. Then we update the scrape() method and add a rejection handler for the request promise. When an error occurs, we store it in the $errors property under the corresponding URL. Now we can keep track of all the errors that occur during scraping. Also, don’t forget to reset the $errors property to an empty array before scraping starts. Otherwise, errors from previous runs will pile up. Here is an updated version of scrape():
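A sketch of the class with these changes applied could look like this. The getter name getErrors() is my assumption, the parsing step is again a placeholder, and the error message is stored as a plain string.

```php
<?php
use Clue\React\Buzz\Browser;
use Psr\Http\Message\ResponseInterface;
use React\EventLoop\LoopInterface;

final class Scraper
{
    private $client;
    private $loop;
    private $scraped = [];
    private $errors = [];

    public function __construct(Browser $client, LoopInterface $loop)
    {
        $this->client = $client;
        $this->loop = $loop;
    }

    public function scrape(array $urls = [], $timeout = 5)
    {
        // Reset state so results of earlier runs do not leak into this one.
        $this->scraped = [];
        $this->errors = [];

        foreach ($urls as $url) {
            $promise = $this->client->get($url)->then(
                function (ResponseInterface $response) {
                    $this->scraped[] = $this->extractFromHtml((string) $response->getBody());
                },
                // Rejection handler: remember what went wrong for this URL.
                function (Exception $exception) use ($url) {
                    $this->errors[$url] = $exception->getMessage();
                }
            );

            $this->loop->addTimer($timeout, function () use ($promise) {
                $promise->cancel();
            });
        }
    }

    public function getMovieData()
    {
        return $this->scraped;
    }

    // URL => error message for every failed request.
    public function getErrors()
    {
        return $this->errors;
    }

    // Placeholder for the real parsing logic: only grabs the page title.
    private function extractFromHtml($html)
    {
        preg_match('~<title>(.*?)</title>~is', $html, $matches);

        return ['title' => isset($matches[1]) ? trim($matches[1]) : null];
    }
}
```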
Now, the consumer code can be the following:
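For instance, the consumer code might be sketched like this, assuming a Scraper class that has the getErrors() getter described above and is autoloaded. The proxy address and the URLs are placeholders.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use Clue\React\Socks\Client as SocksClient;
use React\EventLoop\Factory;
use React\Socket\Connector;

$loop = Factory::create();

$proxy = new SocksClient('127.0.0.1:1080', new Connector($loop));
$client = new Browser($loop, new Connector($loop, ['tcp' => $proxy]));

// Scraper with error tracking (assumed to be autoloaded).
$scraper = new Scraper($client, $loop);
$scraper->scrape([
    'http://www.imdb.com/title/tt1270797/',
    'http://www.imdb.com/title/tt2527336/',
], 40);

$loop->run();

// Scraped data first, then the failures keyed by URL.
print_r($scraper->getMovieData());
print_r($scraper->getErrors());
```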
At the end of this snippet, we print both the scraped data and the errors. A list of errors can be very useful. Besides tracking dead proxies, it also lets us detect whether we have been banned.
What if my proxy requires authentication?
All the examples above work fine for free proxies. But when you are serious about scraping, chances are high that you use private proxies. In most cases they require authentication. Providing your credentials is very simple, just update your proxy connection string like this:
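A sketch with placeholder credentials is shown below. The username, password and address are made up for illustration; the user:password@host:port form is the part that matters.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Clue\React\Socks\Client as SocksClient;
use React\EventLoop\Factory;
use React\Socket\Connector;

$loop = Factory::create();

// Credentials go in front of the proxy address, separated by "@".
$proxy = new SocksClient('username:password@127.0.0.1:1080', new Connector($loop));
```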
But keep in mind that if your credentials contain special characters, they should be encoded:
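One way to do the encoding is PHP's built-in rawurlencode(). The helper name proxyUri() and the sample credentials below are hypothetical; the resulting string is what you would pass to the proxy client as its first argument.

```php
<?php
// Builds a proxy connection string with URL-encoded credentials.
function proxyUri($user, $pass, $address)
{
    return rawurlencode($user) . ':' . rawurlencode($pass) . '@' . $address;
}

// ":" and "@" in the credentials would otherwise break the string apart.
echo proxyUri('he:llo', 'p@ss', '127.0.0.1:1080'), PHP_EOL;
// prints he%3Allo:p%40ss@127.0.0.1:1080
```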
You can find examples from this article on GitHub.
This article is a part of the ReactPHP Series.