Splash for scraping JavaScript pages

Splash is a JavaScript rendering service that integrates with Scrapy. Most modern websites load their data through AJAX requests, so the data can take some time to appear even after the DOM has finished loading. Scrapy alone cannot scrape the data from such websites. For this, an additional tool called Splash should be used.

As an example, you can visit this website. When you open it, the data appears 3-4 seconds after the page itself has loaded.

First, install the scrapy-splash package, which connects Scrapy to Splash, using pip: pip install scrapy-splash

Splash runs in Docker, so Docker must be installed first. For the Docker installation, refer to this link.

Installation

Linux + Docker

  1. Install Docker.

  2. Pull the image:

  3. Start the container:

  4. Splash is now available at 0.0.0.0 at ports 8050 (http) and 5023 (telnet).

OS X + Docker

  1. Install Docker for Mac (see https://docs.docker.com/docker-for-mac/).

  2. Pull the image:

  3. Start the container:

  4. Splash is now available at 0.0.0.0 at ports 8050 (http) and 5023 (telnet).
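The pull and start steps above come down to two Docker commands. Here is a minimal sketch that builds them in Python and runs them with subprocess. It assumes the image name scrapinghub/splash used in the Splash documentation and port mappings matching the ports listed above. The helper names splash_docker_commands and start_splash are hypothetical, not part of Splash or Docker.

```python
import subprocess

# Assumption: the public Splash image name used in the Splash documentation.
SPLASH_IMAGE = "scrapinghub/splash"


def splash_docker_commands(http_port=8050, telnet_port=5023, use_sudo=False):
    """Build the 'pull' and 'run' commands as argument lists (hypothetical helper)."""
    prefix = ["sudo"] if use_sudo else []
    pull_cmd = prefix + ["docker", "pull", SPLASH_IMAGE]
    run_cmd = prefix + [
        "docker", "run",
        "-p", f"{http_port}:8050",    # host port -> Splash HTTP port
        "-p", f"{telnet_port}:5023",  # host port -> Splash telnet port
        SPLASH_IMAGE,
    ]
    return pull_cmd, run_cmd


def start_splash(use_sudo=False):
    """Pull the image, then start the container (blocks while it runs)."""
    pull_cmd, run_cmd = splash_docker_commands(use_sudo=use_sudo)
    subprocess.run(pull_cmd, check=True)
    subprocess.run(run_cmd, check=True)
```

On Linux, Docker usually needs root privileges, so start_splash(use_sudo=True) is the likely call there. With Docker for Mac, start_splash() should be enough.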

Once Docker is installed and the Splash container is running, open http://localhost:8050 in your browser. You should see the Splash web interface.
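The same check can be done from Python instead of the browser. This is a small sketch using only the standard library. It assumes Splash is listening on localhost:8050 and relies on the /_ping endpoint of the Splash HTTP API. The function name splash_is_up is hypothetical.

```python
import http.client
import json
import urllib.request


def splash_is_up(base_url="http://localhost:8050", timeout=5):
    """Return True if Splash answers its /_ping endpoint with status 'ok'."""
    try:
        with urllib.request.urlopen(f"{base_url}/_ping", timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections, timeouts and HTTP errors;
        # ValueError covers a reply that is not valid JSON.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

If splash_is_up() returns False, the container is probably not running or the port mapping differs from the one assumed here.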

Once Splash is running, start a Scrapy project (see my other post to get started with Scrapy) and paste the following configuration into the settings.py file:
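A typical configuration, following the scrapy-splash README, looks like the sketch below. SPLASH_URL assumes Splash is reachable on localhost at port 8050, so adjust it if your container runs elsewhere. The HTTPCACHE_STORAGE line only matters if you enable Scrapy's HTTP cache.

```python
# settings.py (additions for scrapy-splash)

# Assumption: Splash runs locally on the default HTTP port.
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep cookies in sync.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoid storing duplicate Splash arguments for every request.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and caching aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```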

These settings connect our Scrapy project to the Splash instance running in Docker. Now let's scrape the website.

The information which we will scrape will be

First, create a spider with the scrapy genspider CLI command (scrapy genspider <name> <domain>).

When using Splash, we use SplashRequest instead of scrapy.Request and pass it a wait argument, which tells Splash how long to wait for the page to load its data before returning the rendered HTML.

Once the settings are written in settings.py, the only difference from a normal spider is how the request is made. You have to use SplashRequest, which is imported from the scrapy_splash module (part of the scrapy-splash package we installed at the beginning). The rest of the code is the same as in a normal Scrapy spider.
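To get a feel for what the wait argument does, it can help to look at the Splash HTTP API directly. A SplashRequest with a wait argument roughly corresponds to a call to Splash's render.html endpoint with url and wait parameters. The sketch below only builds such a URL. The function name splash_render_url and the example page URL are hypothetical, and Splash is assumed to run on localhost:8050.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, wait=3, splash_url="http://localhost:8050"):
    """Build a render.html URL asking Splash to load page_url and wait `wait` seconds."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


# Hypothetical target page; open the resulting URL in a browser to see the rendered HTML.
print(splash_render_url("https://example.com/page", wait=4))
```

Opening the printed URL while Splash is running returns the page's HTML after JavaScript has had the given number of seconds to run.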

In the parse method, we extract all the links and pass each link to the callback function getdata.

In getdata, all the necessary information is extracted from the linked page using Scrapy's XPath selectors.

Let me know in the comment section if you have any questions or issues regarding Splash and Scrapy.

About Author

nyasin585