Web scraping relies on the HTML structure of a page, so it can never be completely stable. When the HTML structure changes, the scraper may break. Keep this in mind while reading: by the time you read this article, the CSS selectors used here may already be outdated.
What is Web Scraping?
Have you ever needed to grab some data from a site that doesn’t provide a public API? To solve this problem we can use web scraping and pull the required information out of the HTML. Of course, we could extract the data manually, but that quickly becomes tedious, so it is more efficient to automate the process with a scraper.
In this tutorial we are going to scrape images of cats from Pexels, a website that provides high-quality, completely free stock photos. Pexels does have a public API, but it is limited to 200 requests per hour.
Making concurrent requests
The main advantage of using asynchronous PHP for web scraping is that we can do a lot of work in less time. Instead of querying each web page one by one and waiting for responses, we can request as many pages as we want at once and start processing the results as soon as they arrive.
Let’s start by pulling in an asynchronous HTTP client called buzz-react, a simple async HTTP client for concurrently processing any number of HTTP requests, built on top of ReactPHP:
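```bash
composer require clue/buzz-react
```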
Now we are ready, so let’s request an image page on Pexels.
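Here is a minimal sketch of such a request (the photo page URL is a hypothetical example):

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use Psr\Http\Message\ResponseInterface;

$loop = React\EventLoop\Factory::create();
$client = new Browser($loop);

// request the page and print its HTML once the response arrives
$client->get('https://www.pexels.com/photo/kitten-cat-rush-lucky-cat-45441/')
    ->then(function (ResponseInterface $response) {
        echo $response->getBody();
    });

$loop->run();
```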
We have created an instance of `Clue\React\Buzz\Browser` and used it as an HTTP client. The code above makes an asynchronous GET request to a web page with an image of kittens. The `$client->get($url)` method returns a promise that resolves with a PSR-7 response object.
The client works asynchronously, which means that we can easily request several pages, and these requests will be performed concurrently.
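For example, here is a sketch that fires two requests at once and handles each response as it arrives (the URLs are hypothetical examples):

```php
$urls = [
    'https://www.pexels.com/photo/kitten-cat-rush-lucky-cat-45441/',
    'https://www.pexels.com/photo/grey-fur-kitten-127028/',
];

// both requests are dispatched immediately; handlers fire as responses arrive
foreach ($urls as $url) {
    $client->get($url)->then(function (ResponseInterface $response) use ($url) {
        echo $url . ': ' . strlen((string) $response->getBody()) . " bytes\n";
    });
}

$loop->run();
```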
The idea here is the following:
- make a request
- get a promise
- add a handler to a promise
- once the promise is resolved, process the response
So, this logic can be extracted into a class, letting us request many URLs and attach the same response handler to all of them. Let’s create a wrapper over the `Browser`. Create a class called `Scraper`.
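A sketch of its content, reconstructed from the description below:

```php
<?php

use Clue\React\Buzz\Browser;
use Psr\Http\Message\ResponseInterface;

final class Scraper
{
    private $client;

    public function __construct(Browser $client)
    {
        $this->client = $client;
    }

    public function scrape(array $urls)
    {
        foreach ($urls as $url) {
            // request each URL concurrently and attach the same handler
            $this->client->get($url)->then(
                function (ResponseInterface $response) {
                    $this->processResponse((string) $response->getBody());
                });
        }
    }

    private function processResponse(string $html)
    {
        // traversing the HTML and downloading images comes next
    }
}
```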
We accept an instance of the `Browser` as a constructor dependency and provide one public method, `scrape(array $urls)`. For each specified URL we make a GET request. Once the response arrives, we call the private method `processResponse(string $html)` with the body of the response. This method will be responsible for traversing the HTML and downloading images. The next step is to inspect the received HTML and extract the images from it.
Crawling the website
At the moment we only get the HTML code of the requested page; now we need to extract the image URL from it. For this, we have to examine the structure of the received HTML. Go to an image page on Pexels, right-click on the image, and select Inspect Element; you will see something like this:
We can see that the `img` tag has the class `image-section__image`. We are going to use this information to extract the tag from the received HTML. The URL of the image is stored in its `src` attribute.
For extracting HTML tags we are going to use the Symfony DomCrawler component. Pull in the required packages:
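```bash
composer require symfony/dom-crawler
composer require symfony/css-selector
```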
The CSS-selector component allows DomCrawler to understand jQuery-like selectors for traversing the DOM. Once everything is installed, open our `Scraper` class and let’s write some code in the `processResponse(string $html)` method. First of all, we need to create an instance of the `Symfony\Component\DomCrawler\Crawler` class; its constructor accepts a string containing the HTML code to traverse.
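A sketch of the beginning of that method:

```php
use Symfony\Component\DomCrawler\Crawler;

// inside the Scraper class
private function processResponse(string $html)
{
    $crawler = new Crawler($html);
}
```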
To find an element by its jQuery-like selector, use the `filter()` method. The `attr($attribute)` method then extracts an attribute of the filtered element.
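In our case, using the class we found above (a one-line sketch):

```php
$imageUrl = $crawler->filter('.image-section__image')->attr('src');
```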
Let’s just print the extracted image URL and check that our scraper works as expected:
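```php
// a sketch: print the URL instead of downloading, to verify the selector works
private function processResponse(string $html)
{
    $crawler = new Crawler($html);
    echo $crawler->filter('.image-section__image')->attr('src') . PHP_EOL;
}
```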
When running this script, it outputs the full URL of the required image. We can then use this URL to download the image: again, we take an instance of the `Browser` and make a GET request.
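A sketch of that request inside the `Scraper`:

```php
$this->client->get($imageUrl)->then(
    function (ResponseInterface $response) {
        // the response body contains the raw image bytes
    });
```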
The response arrives with the contents of the requested image, and now we need to save it to disk. But hold on, don’t reach for `file_put_contents()`: all native PHP functions that work with the file system are blocking. The moment you call `file_put_contents()`, the application stops behaving asynchronously; the flow of control is blocked until the file is saved. ReactPHP has a dedicated package to solve this problem.
Saving files asynchronously
To process files asynchronously in a non-blocking way we need a package called reactphp/filesystem. Go ahead and pull it in:
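```bash
composer require react/filesystem
```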
To start working with the file system, create an instance of the `Filesystem` object and provide it as a dependency to our `Scraper`, along with a directory where all the downloaded images should go. Here is the updated constructor of the `Scraper` class.
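A sketch of it, assuming the `Filesystem` instance is created elsewhere via `Filesystem::create($loop)`:

```php
use Clue\React\Buzz\Browser;
use React\Filesystem\Filesystem;

final class Scraper
{
    private $client;
    private $filesystem;
    private $dir;

    public function __construct(Browser $client, Filesystem $filesystem, string $dir)
    {
        $this->client     = $client;
        $this->filesystem = $filesystem;
        $this->dir        = $dir;
    }

    // ... scrape() and processResponse() stay as before
}
```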
Ok, now we are ready to save files to disk. First of all, we need to extract a filename from each URL. The scraped image URLs, and the filenames we want to derive from them, look something like this (hypothetical examples of the general shape of Pexels image URLs):
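```
https://images.pexels.com/photos/1234/jumping-cute-kitten.jpg?h=650&w=940
-> jumping-cute-kitten.jpg

https://images.pexels.com/photos/5678/sleeping-tabby-cat.jpg?h=650&w=940
-> sleeping-tabby-cat.jpg
```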
Let’s use a regular expression to extract the filenames out of the URLs, then concatenate each name with the target directory to get the full path of the future file on disk.
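A sketch of such a helper (the method name `makePath()` is my own):

```php
private function makePath(string $url): string
{
    // ".../photos/1234/jumping-cute-kitten.jpg?h=650..." -> "jumping-cute-kitten.jpg"
    preg_match('~/photos/\d+/([\w\-.]+)\?~', $url, $matches);

    return $this->dir . DIRECTORY_SEPARATOR . $matches[1];
}
```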
Once we have a path to a file we can use it to create a file object:
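```php
// $path comes from the makePath() helper sketched above
$file = $this->filesystem->file($path);
```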
This object represents the file we are going to work with. Then we call its `putContents($contents)` method and provide the response body as a string:
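```php
// putContents() writes the data without blocking the event loop
$file->putContents((string) $response->getBody());
```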
That’s it. All the asynchronous low-level magic is hidden behind one simple method. Under the hood, it creates a stream in write mode, writes the data to it, and then closes the stream. Here is an updated version of the method that downloads the images.
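A sketch with the download logic pulled into a helper (`downloadImage()` is my own name for it):

```php
private function processResponse(string $html)
{
    $crawler = new Crawler($html);
    $imageUrl = $crawler->filter('.image-section__image')->attr('src');
    $this->downloadImage($imageUrl);
}

private function downloadImage(string $imageUrl)
{
    $this->client->get($imageUrl)->then(
        function (ResponseInterface $response) use ($imageUrl) {
            // build the target path and write the image without blocking
            $file = $this->filesystem->file($this->makePath($imageUrl));
            $file->putContents((string) $response->getBody());
        });
}
```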
Inside the response handler we build the full path to the file, create the file object, and fill it with the response body. The whole scraper is less than 50 lines of code!
Note: create the directory where you want to store the files beforehand. The `putContents()` method only creates the file itself; it doesn’t create missing folders in the specified path.
The scraper is done. Now, open your main script and pass a list of URLs to scrape:
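```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use React\Filesystem\Filesystem;

// a sketch of the main script; the photo page URLs are hypothetical examples
$loop = React\EventLoop\Factory::create();

$scraper = new Scraper(
    new Browser($loop),
    Filesystem::create($loop),
    __DIR__ . '/images' // create this directory beforehand
);

$scraper->scrape([
    'https://www.pexels.com/photo/kitten-cat-rush-lucky-cat-45441/',
    'https://www.pexels.com/photo/grey-fur-kitten-127028/',
    'https://www.pexels.com/photo/adorable-animal-blur-cat-617278/',
    'https://www.pexels.com/photo/cat-whiskers-kitty-tabby-20787/',
    'https://www.pexels.com/photo/silver-tabby-cat-1056251/',
]);

$loop->run();
```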
The snippet above scrapes five URLs and downloads the corresponding images, and all of it happens quickly and asynchronously.
In the previous tutorial, we used ReactPHP to speed up web scraping by querying web pages concurrently. But what if we also need to save files concurrently? In an asynchronous application we cannot use native PHP functions like `file_put_contents()` because they block the flow of execution, so there would be no speed increase when storing images on disk. To process files asynchronously in a non-blocking way, ReactPHP provides the reactphp/filesystem package.
So, in around 50 lines of code we were able to get a web scraper up and running. This was just a tiny example of what you can do. Now that you know the basics of building a scraper, go and try building your own!
You can find examples from this article on GitHub.
This article is a part of the ReactPHP Series.