Splash is a library that integrates JavaScript rendering with Scrapy. Most modern websites load data through AJAX requests, so the content appears only some time after the initial DOM has finished loading. On such websites Scrapy alone cannot scrape the data, and an additional library called Splash should be used.
As an example, you can visit this website. When you open it, the data loads 3-4 seconds after the page itself has loaded, as shown below.
Let’s first install the scrapy-splash package using the pip command
pip install scrapy-splash
Splash runs on Docker, so for the Docker installation refer to this link.
Installation
Linux + Docker
- Install Docker.
- Pull the image:
  $ sudo docker pull scrapinghub/splash
- Start the container:
  $ sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
- Splash is now available at 0.0.0.0 at ports 8050 (http) and 5023 (telnet).

OS X + Docker
- Install Docker for Mac (see https://docs.docker.com/docker-for-mac/).
- Pull the image:
  $ docker pull scrapinghub/splash
- Start the container:
  $ docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
- Splash is available at 0.0.0.0 at ports 8050 (http) and 5023 (telnet).
Once Docker is installed and the container is running, open your localhost server at port 8050 and it will show something like the image below.
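If you prefer the command line, you can also verify that Splash is rendering pages through its standard render.html HTTP endpoint (the target URL below is just the one we will scrape later in this post):

curl 'http://localhost:8050/render.html?url=https://www.truelocal.com.au/search/barbers/vic&wait=5'

If Splash is working, this returns the HTML of the page after the JavaScript has run and the 5-second wait has elapsed.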
Once Splash is configured, start a Scrapy project (see my other post to get started with Scrapy) and paste the following configuration into the settings.py file
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
These settings integrate Scrapy with the Splash instance running in Docker. Now let’s scrape the website
https://www.truelocal.com.au/search/barbers/vic
The fields we will scrape are
'Category', 'Name', 'Phone', 'Street Address', 'Locality', 'Region', 'Postal Code'
First, create a spider using the scrapy genspider CLI command
scrapy genspider app truelocal.com.au
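This generates a skeleton spider in the project’s spiders directory. The exact contents vary slightly by Scrapy version, but it will look roughly like this:

import scrapy


class AppSpider(scrapy.Spider):
    name = 'app'
    allowed_domains = ['truelocal.com.au']
    start_urls = ['http://truelocal.com.au/']

    def parse(self, response):
        pass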
When using Splash we use SplashRequest instead of scrapy.Request and pass a wait argument so the page has time to load its data. The complete code is
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
import csv


class AppSpider(scrapy.Spider):
    name = 'app'

    # Write the CSV header once, when the spider class is loaded
    with open("BarbersData.csv", "a") as f:
        writer = csv.writer(f)
        writer.writerow(['Category', 'Name', 'Phone', 'Street Address', 'Locality', 'Region', 'Postal Code'])

    allowed_domains = ['truelocal.com.au']
    start_urls = ['https://www.truelocal.com.au/search/barbers/vic']
    urls = ['https://www.truelocal.com.au/search/barbers/vic']

    def start_requests(self):
        # Render the search page through Splash and wait 5 seconds for the data to load
        for url in self.urls:
            yield SplashRequest(url, callback=self.parse, args={'wait': '5'})

    def parse(self, response):
        # Follow every listing link on the search results page
        links = response.xpath('.//*[@class="item-title"]/@href').extract()
        for link in links:
            yield SplashRequest(link, callback=self.getdata, args={'wait': '5'})
        # Follow the pagination link, if there is one
        nextlink = response.xpath('.//*[@rel="next"]/@href').extract_first()
        if nextlink:
            yield SplashRequest(nextlink, callback=self.parse, args={'wait': '5'})

    def getdata(self, response):
        # Extract the business details from the rendered listing page
        category = response.xpath('.//div/@ng-click[contains(.,"searchByCategoryName")]/../text()').extract_first()
        name = response.xpath('.//h1/@ng-bind-html[contains(.,"generateName")]/../text()').extract_first()
        phone = response.xpath('.//*[@class="phone ng-scope"]/span/text()').extract_first()
        streetaddress = response.xpath('.//*[@itemprop="streetAddress"]/text()').extract_first()
        locality = response.xpath('.//*[@itemprop="addressLocality"]/text()').extract_first()
        region = response.xpath('.//*[@itemprop="addressRegion"]/text()').extract_first()
        postalcode = response.xpath('.//*[@itemprop="postalCode"]/text()').extract_first()
        # Append one row per business to the CSV file
        with open("BarbersData.csv", "a") as f:
            writer = csv.writer(f)
            writer.writerow([category, name, phone, streetaddress, locality, region, postalcode])
        print([category, name, phone, streetaddress, locality, region, postalcode])
So after all the settings are in settings.py, the only difference is in how the request is made. You have to use SplashRequest, which is imported from the scrapy_splash package we installed at the beginning. The rest of the code is the same as writing a normal Scrapy spider.
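To make the difference concrete, here is a minimal sketch of the same request written both ways (using the URL from this post):

# A normal Scrapy request: Splash is not involved, so the JavaScript never runs
yield scrapy.Request('https://www.truelocal.com.au/search/barbers/vic', callback=self.parse)

# The Splash version: the page is rendered in Splash and we wait 5 seconds for the AJAX data
yield SplashRequest('https://www.truelocal.com.au/search/barbers/vic', callback=self.parse, args={'wait': '5'})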
In the parse method we extract all the listing links and pass each link to the callback function getdata.
def parse(self, response):
    links = response.xpath('.//*[@class="item-title"]/@href').extract()
    for link in links:
        yield SplashRequest(link, callback=self.getdata, args={'wait': '5'})
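This assumes the href values are absolute URLs. If the site returns relative links instead, you can join them against the current page before yielding; a minimal sketch:

def parse(self, response):
    for link in response.xpath('.//*[@class="item-title"]/@href').extract():
        # urljoin turns a relative href into an absolute URL; absolute hrefs pass through unchanged
        yield SplashRequest(response.urljoin(link), callback=self.getdata, args={'wait': '5'})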
After following each link, all the necessary information is extracted from the detail page using Scrapy’s XPath selectors.
def getdata(self, response):
    category = response.xpath('.//div/@ng-click[contains(.,"searchByCategoryName")]/../text()').extract_first()
    name = response.xpath('.//h1/@ng-bind-html[contains(.,"generateName")]/../text()').extract_first()
    phone = response.xpath('.//*[@class="phone ng-scope"]/span/text()').extract_first()
    streetaddress = response.xpath('.//*[@itemprop="streetAddress"]/text()').extract_first()
    locality = response.xpath('.//*[@itemprop="addressLocality"]/text()').extract_first()
    region = response.xpath('.//*[@itemprop="addressRegion"]/text()').extract_first()
    postalcode = response.xpath('.//*[@itemprop="postalCode"]/text()').extract_first()
    with open("BarbersData.csv", "a") as f:
        writer = csv.writer(f)
        writer.writerow([category, name, phone, streetaddress, locality, region, postalcode])
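Writing the CSV by hand works, but Scrapy can also handle the export for you. As a sketch of that alternative, getdata could simply yield a dict with the same fields:

def getdata(self, response):
    # Yield the scraped fields instead of writing the file manually
    yield {
        'Category': response.xpath('.//div/@ng-click[contains(.,"searchByCategoryName")]/../text()').extract_first(),
        'Name': response.xpath('.//h1/@ng-bind-html[contains(.,"generateName")]/../text()').extract_first(),
        'Phone': response.xpath('.//*[@class="phone ng-scope"]/span/text()').extract_first(),
        'Street Address': response.xpath('.//*[@itemprop="streetAddress"]/text()').extract_first(),
        'Locality': response.xpath('.//*[@itemprop="addressLocality"]/text()').extract_first(),
        'Region': response.xpath('.//*[@itemprop="addressRegion"]/text()').extract_first(),
        'Postal Code': response.xpath('.//*[@itemprop="postalCode"]/text()').extract_first(),
    }

You would then run the spider with scrapy crawl app -o BarbersData.csv and Scrapy’s feed export writes the rows (and the header) for you.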
Let me know in the comments section if you have any confusion or issues regarding Splash and Scrapy.