Getting started with scrapy

Scrapy is one of the most popular and powerful Python frameworks for web scraping and data mining. Unlike web scraping libraries such as requests and Beautiful Soup, it comes with its own ecosystem. Scrapy sends concurrent requests to the server/website, so one request does not wait for the previous one to complete, which makes it much faster than most other web scraping tools. It can also be customized to our needs by implementing various middlewares, which will be covered in other posts. For now, let's get introduced to Scrapy.

1. Installation

For windows

  • Download Miniconda for Windows
  • Open a command prompt and install Scrapy using conda

For linux/mac

  • Using the pip command is enough
  • Make sure to use pip3 if you are on Python 3

 

2. Starting a project

Scrapy gives us a lot of options when starting a project: it generates a full project skeleton along with a rich set of settings and configuration.

 

3. Starting a spider

Once the project is created, a spider can be generated with the Scrapy CLI using the genspider command.

The spider can be integrated with the project, or it can be run as a standalone single-file spider.

 

4. Running the spider

If the spider is integrated with a project (not a standalone spider), it can be executed with the crawl command.

For a standalone single spider, use the runspider command instead.
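With runspider no project is needed; Scrapy executes a single spider file directly. A sketch, where quotes_spider.py is an example filename holding a scrapy.Spider subclass:

```shell
# Save a minimal spider to a file (example name and markup):
cat > quotes_spider.py <<'EOF'
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for text in response.xpath('//span[@class="text"]/text()').getall():
            yield {"text": text}
EOF

# Point runspider at the file itself; no project required
scrapy runspider quotes_spider.py
```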

 

5. Getting started with XPath

XPath was developed for XML, to identify and navigate the nodes of an XML document. With Scrapy we can use XPath on HTML as well. XPath gives us a lot of flexibility and helps keep the code much cleaner. It is at its best when we select by class name and id: since most websites rarely change the class name or id of a particular DOM element, our script is likely to keep working even when the rest of the page changes.

Extracting the text of an element by its id/class name

Extracting link information from an anchor tag

There are more advanced concepts and uses of XPath, which will be covered in another post; this is just a surface-level introduction.

 

6. Calling another function in Scrapy

Instead of a return statement we use yield in Scrapy, and a callback parameter on the request is used to call another function with the response.


About Author

nyasin585
