Scrapy is one of the most popular and powerful Python frameworks for web scraping and data mining. Unlike web scraping modules and libraries such as requests and Beautiful Soup, it comes with its own ecosystem. With Scrapy, concurrent requests can be sent to the server/website, so one request does not wait for the previous one to complete, which makes it much faster than other web scraping tools. Furthermore, it can be customized to our needs by implementing various middlewares, which I will cover in other posts. For now, let's get introduced to Scrapy.
1. Installation
For Windows

- Download Miniconda for Windows
- Open a command prompt and install Scrapy using:

```
conda install -c conda-forge scrapy
```
For Linux/macOS

- Using the pip command will be enough:

```
sudo pip install scrapy
```

- Make sure to use pip3 if you are using Python 3
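Either way, the install can be verified from the terminal; `scrapy version` prints the installed version:

```
scrapy version
```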
2. Starting a project

Scrapy gives a lot of options when starting a project: it generates a project skeleton with sensible default settings and configuration.

```
scrapy startproject myproject
```
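This produces a directory layout roughly like the following (from Scrapy's default project template):

```
myproject/
    scrapy.cfg            # deploy configuration
    myproject/
        __init__.py
        items.py          # item definitions
        middlewares.py    # custom middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spiders live here
            __init__.py
```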
3. Starting a spider
Once the project is created, a spider can be generated with the Scrapy CLI using the genspider subcommand:

```
scrapy genspider <spidername> <spider domain>
```
The spider can either be integrated with the project or be run as a standalone single web spider.
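For example, `scrapy genspider quotes quotes.toscrape.com` generates a spider skeleton roughly like this (the name and domain here are just for illustration):

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # parsing logic for each downloaded page goes here
        pass
```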
4. Running the spider
If the spider is integrated with the project (not a standalone single spider), it can be executed using the following command:

```
scrapy crawl <spidername> -o <output file name and extension>
```

For the standalone single spider:

```
scrapy runspider <python file name>
```
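For instance, with the quotes spider sketched above, the two styles would look like this; the output extension (.json, .csv, .xml) tells Scrapy which feed format to export:

```
scrapy crawl quotes -o quotes.json
scrapy runspider quotes_spider.py -o quotes.csv
```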
5. Getting started with XPath

XPath was developed for XML files to identify and navigate the nodes of an XML document. With Scrapy we can use XPath on HTML files as well. XPath gives us a lot of flexibility and helps us maintain much cleaner code. The best part comes when we use class names and ids in the XPath: since most websites do not change the class name or id of a particular DOM element, chances are our script will keep working irrespective of changes to the website.
Extracting the text of a particular id/class name:

```python
response.xpath('.//*[@class="classname"]/text()').extract_first()
response.xpath('.//*[@id="idname"]/text()').extract_first()
```
Extracting link information from a link tag:

```python
response.xpath('.//a/@href').extract_first()
```
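Put together, a parse method using these selectors might look like this sketch (the class name is a placeholder for whatever the target page actually uses):

```python
def parse(self, response):
    # extract_first() returns the first match, or None if nothing matches
    title = response.xpath('.//*[@class="title"]/text()').extract_first()
    link = response.xpath('.//a/@href').extract_first()
    yield {"title": title, "link": link}
```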
There are more advanced concepts and usages of XPath, which will be covered in another post; this is just a surface-level understanding.
6. Calling another function in Scrapy

Instead of a return statement we use yield in Scrapy, with a callback parameter to call another function:

```python
def parse(self, response):
    # url would be a link extracted from the current page
    yield scrapy.Request(url, callback=self.anotherfunction)

def anotherfunction(self, response):
    pass
```
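As a fuller sketch (the URL and selectors are illustrative, written against the books.toscrape.com demo site), a spider can follow each detail link with one callback and pagination with another:

```python
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        # hand each book detail page to a second callback
        for href in response.xpath('.//h3/a/@href').extract():
            yield response.follow(href, callback=self.parse_book)
        # follow the "next" pagination link back into parse itself
        next_page = response.xpath('.//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response):
        yield {
            "title": response.xpath('.//h1/text()').extract_first(),
            "price": response.xpath('.//*[@class="price_color"]/text()').extract_first(),
        }
```

Because parse yields requests instead of returning, Scrapy can keep scheduling them concurrently, which is exactly where the speed mentioned in the introduction comes from.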