scrapy startproject <project_name>
This creates a file structure like so:
├── scrapy.cfg
└── <project_name>
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
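For example, assuming a hypothetical project name of bookscraper:

scrapy startproject bookscraper
cd bookscraper

Subsequent scrapy commands should be run from inside this folder (the one containing scrapy.cfg).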
Scrapy provides several different spider types. The most common is the generic Spider, which takes a list of start_urls and scrapes each one with a parse method. To create a new generic spider, run:
scrapy genspider <name_of_spider> <website>
A new spider has now been added to your spiders folder, and it should look like this:
import scrapy

class NAMEOFSPIDERSpider(scrapy.Spider):
    name = 'NAMEOFSPIDER'
    allowed_domains = ['website']
    start_urls = ['website']

    def parse(self, response):
        pass
This spider class contains:
- name: the name used to identify the spider when running it
- allowed_domains: restricts the crawl to pages on the <website> domain. This prevents the spider from going off and scraping lots of other websites. This is optional.
- start_urls: the list of URLs the spider starts scraping from

To start using this spider we will have to:
- change start_urls to the URL we want to scrape
- write the parse function (a filled-in sketch follows the run command below)

You run a spider with:
scrapy crawl <name_of_spider>
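As a concrete illustration, here is a minimal sketch of the spider filled in against the quotes.toscrape.com practice site (the URL, spider name, and CSS selectors are assumptions for illustration, not part of the generated template):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        # Yield one dict per quote block found on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

Running scrapy crawl quotes would log each yielded item, and scrapy crawl quotes -o quotes.json would additionally save them to a JSON file.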
You can experiment with responses interactively using the Scrapy shell:

scrapy shell

If we run fetch(<start_url>) we should see a 200 response in the logs. Scrapy will save the HTML response in an object called response.
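A minimal shell session might look like this (the URL is a hypothetical stand-in for your start URL):

scrapy shell
>>> fetch("https://quotes.toscrape.com")
>>> response.status
200
>>> response.url
'https://quotes.toscrape.com'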
You can get a list of elements matching a CSS selector by running
response.css("<selector>")
To get just the first matching element, run
response.css("<selector>").get()
This returns the full HTML of that node of the DOM tree as a string.
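For example (the h1 selector is illustrative; ::text and .getall() are standard Scrapy selector features for extracting inner text and all matches, respectively):

response.css("h1")               # SelectorList of every <h1> on the page
response.css("h1").get()         # HTML string of the first <h1> only
response.css("h1::text").get()   # just the text inside the first <h1>
response.css("h1").getall()      # list of HTML strings for all matches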