
Scraping Amazon’s Data-Scraping Books with Scrapy

Cortez Ethridge
Oct 29, 2020


Scrapy is a popular framework for scraping data. Since getting Scrapy up and running involves so many steps, I decided to write a quick and dirty guide. Scrapy is not beginner friendly, and this guide is deliberately sparse and technical: it is meant as a quick reference for the boilerplate code. I compiled the code from the fantastic video series by Build With Python, which distills Scrapy initialization down to a few simple steps.

  1. Pip install
  2. Start Project
  3. Create Spider
  4. Add Field items
  5. Write spider
  6. Add anonymous user agents
  7. Start Scraping
  8. Post processing

Scrapy leans heavily on Python’s object-oriented paradigm and uses boilerplate classes much like the Django framework. The class system lets us prototype many spiders at a time. In this quick tutorial we are going to scrape books about data scraping.

If you are new to data scraping, I recommend watching the Build With Python tutorial series.

Check out the code on GitHub.

Step 1

From the command line, install Scrapy:

pip install scrapy

Step 2

Start Project

scrapy startproject amazon
cd amazon

Step 3

Create Spider

scrapy genspider singlepage https://www.amazon.com/
cd amazon/spiders

Step 4

Add Field items in items.py; these are the headers that will appear in our output.
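A minimal items.py might look something like the sketch below; the field names match the ones the spider uses later.

# items.py
import scrapy

class AmazonItem(scrapy.Item):
    # one field per output column
    product_name = scrapy.Field()
    product_author = scrapy.Field()
    product_link = scrapy.Field()
    product_imagelink = scrapy.Field()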

Step 5

Write the spider. This is where the majority of our logic will go.

First, write the class. start_urls is the website we will be scraping; in this case we will be scraping books on data scraping, how meta! Name the spider something memorable. I named mine “singlepage”, and we will call the spider by that name later.


Next, write a function named parse() that runs CSS selectors for the imported item fields. This is the hardest part, because we need a little CSS knowledge to extract our data. The spider fills four items that reflect our items.py fields: product_name, product_link, product_imagelink, and product_author. I used the Chrome extension Selector Gadget to pick up the CSS selectors for each of them.
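Roughly, singlepage.py looks like the sketch below. The search URL and the CSS selectors are placeholders, not the exact ones from my run; swap in the page you want and the selectors you grab with Selector Gadget.

# spiders/singlepage.py
import scrapy
from ..items import AmazonItem

class SinglepageSpider(scrapy.Spider):
    name = 'singlepage'
    # placeholder search URL -- point this at the listing you want to scrape
    start_urls = ['https://www.amazon.com/s?k=data+scraping+books']

    def parse(self, response):
        items = AmazonItem()
        # each .css(...).extract() call returns a list of matching strings
        items['product_name'] = response.css('.a-size-medium::text').extract()
        items['product_author'] = response.css('.a-size-base+ .a-size-base::text').extract()
        items['product_link'] = response.css('.a-link-normal::attr(href)').extract()
        items['product_imagelink'] = response.css('.s-image::attr(src)').extract()
        yield items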

Check out buildwithpython’s video on CSS selectors.

Step 6 (Optional)

Add anonymous user agents. This step is unnecessary if we are only scraping a single page, but if we are scraping hundreds of Amazon pages we will want to avoid detection, for example by pretending to be Google. Check out buildwithpython’s video for more information.

pip install scrapy-user-agents

In settings.py we register the user-agent middleware (or pin a user agent directly), as shown below. You can find a list of Google’s user agents here.
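A sketch of the settings.py change, following the scrapy-user-agents package docs; the commented-out USER_AGENT line is just an example if you would rather pin a single Googlebot string.

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # turn off Scrapy's built-in user-agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # rotate through a pool of real browser user agents on every request
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

# or pin one user agent, e.g. a Googlebot string from Google's published list
# USER_AGENT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'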

Step 7

Start scraping. We call the scraper by the name we gave it in Step 5. There are multiple output formats to choose from depending on how we want our data, including CSV and XML; we will output ours as JSON.

scrapy crawl singlepage -o data.json

Step 8

Our data comes out as nested lists. That is good for packing a lot of data into multidimensional lists, but it isn’t readable as a dataframe, so we need to do some cleaning.

import pandas as pd
# load the JSON output from the crawl
df = pd.read_json('data.json')
# pull out the list of product names
name = df["product_name"][0]
# pull out the list of image links
image = df["product_imagelink"][0]
# model our lists as a dictionary of columns
data_in = {'name': name, 'image': image}
# build a flat dataframe
df2 = pd.DataFrame(data=data_in)
Example of our data in Google Colab

In Conclusion

Scraping data with Scrapy is about as challenging as learning the web framework Django, because it fully embraces Python’s object-oriented paradigm in the form of boilerplate classes. Once we learn how to build and use these classes, we can quickly develop new scrapers for natural language processing projects.

