Edited 10/20/19

Chi Hack Night Scraper Tutorial

In this tutorial, we will create two web scrapers for the Chi Hack Night events page https://chihacknight.org/events/index.html. The first will be simple and only extract data from the event listing page. The second will be more complex and extract more complete data from each individual event page.

If you want to reference the completed scrapers, the repository already contains both web scrapers you will create in this tutorial (Author's note: not yet, actually (TODO)):

event_processor/scrapers/chihacknight_simple_spider.py

event_processor/scrapers/chihacknight_crawling_spider.py

Prerequisites

You should be comfortable with:

This tutorial assumes you are using a Linux machine, so if you’re using something else the shell commands may differ slightly.

Web Scrapers in In2It

Web scrapers are important for the In2It Chicago project because they extract all event data for the site in a relatively normalized and searchable format. While users could always use a traditional search engine to find these events, In2It Chicago is meant to focus on events related to community involvement and support. Each web scraper is tailor-made for a specific site.

The scrapers are Python classes that inherit from base classes built on scrapy (https://scrapy.org/), the most popular Python web scraping library available.
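For readers new to scrapy, a bare-bones standalone spider (independent of this project, with a placeholder URL and selectors) looks something like this:

import scrapy

# A minimal, generic scrapy spider; not part of the In2It codebase, just an
# illustration of the library the project's base classes build on.
class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.org/']  # placeholder URL

    def parse(self, response):
        # scrapy calls parse() with the response for each start URL.
        for title in response.css('h2::text').extract():
            yield {'title': title}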

In addition to web scrapers, the site also extracts data from APIs (Author's note: and what else?).

Project file structure

The web scrapers all live in the event_processor container, which is just one of the many In2It Chicago Docker containers.

The subfolder event_processor/event_processor is a Python package containing all relevant files. The important subfolders are as follows:

The folder we care about most in this tutorial is the scrapers folder. The first section covers a simple scraper that only scrapes one page. The second section covers a scraper that crawls and scrapes many pages.

Simple Web Scraper

The simpler ScraperSpider inherits from scrapy’s Spider class (https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy-spider) and includes other utility methods specific to this project.

Define the Scraper

Go ahead and create a new custom scraper file in event_processor/scrapers called chihacknight_spider.py.

Import the necessary base class for the simple scraper, ScraperSpider, and define the spider name. The # -*- coding: utf-8 -*- comment at the top declares the character encoding of the file; Python 3 assumes UTF-8 by default, so the comment is optional, but it is harmless to include.

# -*- coding: utf-8 -*-
from event_processor.base.custom_spiders import ScraperSpider

class ChiHackNightSpider(ScraperSpider):
    name = 'chihacknight'
    allowed_domains = ['chihacknight.org']
    enabled = True

The name attribute is how the scraper will be identified when running this spider from the command line.

The allowed_domains attribute restricts which domains the spider may request while it is scraping; requests to other domains are filtered out by scrapy’s offsite middleware. Because we want to scrape the Chi Hack Night website, we only specify the ‘chihacknight.org’ domain.

While the enabled attribute is not required, it is useful to know that you can set enabled to False to stop the event_processor from running or scheduling this scraper.

Initialize the Scraper

We have to specify the parameters for the scraper and tell it exactly where it will be scraping data. Define the __init__ and start_requests methods for these purposes.

def __init__(self, name=None, **kwargs):
    super().__init__(self, 'Chi Hack Night', 'https://chihacknight.org/', date_format='%b %d, %Y', **kwargs)

def start_requests(self):
    yield self.get_request('events/', {})

The date_format is a required argument to __init__ and is a string specifying the date format the scraper will be searching for. The string must follow the format expected by Python’s datetime.strptime() function. A full reference can be found at https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior. Manually examining the dates on the events page, they appear to follow the ‘%b %d, %Y’ format (for example, ‘Oct 01, 2019’).
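A quick way to sanity-check a format string is to parse a sample date with strptime directly (the date below is just an example in the listing’s style):

from datetime import datetime

# 'Oct 01, 2019' matches '%b %d, %Y': abbreviated month name, day, full year.
parsed = datetime.strptime('Oct 01, 2019', '%b %d, %Y')
print(parsed)  # 2019-10-01 00:00:00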

The start_requests() method specifies the page the requests will start from for spiders that extend the ScraperSpider base class.

Parse the Results

Spiders that inherit from the project’s base spider classes are expected to return a dictionary mapping keys to arrays of values. Each array must be the same length, and all values at the same index are assumed to belong to the same event.
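For example, a valid return value describing two events might look like this (all values here are made up):

# Every array has length 2, and index 0 of each array describes the same event.
{
    'title': ['Civic Tech Night', 'Open Data Night'],
    'url': ['https://example.org/event-1', 'https://example.org/event-2'],
    'description': ['First made-up event', 'Second made-up event']
}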

The parse() method takes in a response object which represents the HTML returned by the request. Extracting data from this object works exactly as it would in any scrapy spider. In this tutorial, the .css() method (https://docs.scrapy.org/en/latest/topics/selectors.html#using-selectors) is used to extract the data.

If we press F12 (or whatever shortcut opens the developer/inspect console in your browser) we can see that the data we need is neatly organized in a table. What looks like the title and url is in the 3rd column, the date is in the 1st column, and something that could work as the description is in the 4th column.

Taking advantage of the css :nth-child(n) selector (https://developer.mozilla.org/en-US/docs/Web/CSS/:nth-child), the following code could be used to extract the data. (Author’s note: this code is intentionally incorrect to demonstrate the usefulness of a utility function introduced later.)

def parse(self, response):
    return {
        'title': response.css('table tr td:nth-child(3) span::text').extract(),
        'url': response.css('table tr td:nth-child(3) span::attr(href)').extract(),
        'event_time': self.create_time_data(
            date=response.css('table tr td:nth-child(3) span::text').extract()
        ),
        'description': response.css('table tr td:nth-child(4)::text').extract()
    }

::text gets only the text from the retrieved elements. ::attr() gets the value of the given attribute on the retrieved elements. The extract() function returns all matches for those selectors (https://docs.scrapy.org/en/latest/topics/selectors.html#extract-and-extract-first). Both of these selectors return arrays of strings, where each string is one of the matches found.
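If you want to experiment with these selectors outside of the project, scrapy’s Selector can be run against a plain HTML string. The table row below is made up to roughly resemble the listing, not copied from the real page:

from scrapy.selector import Selector

# A made-up row shaped like the event listing: date, time, linked title, description.
html = ('<table><tr><td><p>Oct 01, 2019</p></td><td>6:00 pm</td>'
        '<td><a href="/events/1.html"><span>Hack Night</span></a></td>'
        '<td>Weekly civic tech meetup</td></tr></table>')

sel = Selector(text=html)
print(sel.css('table tr td:nth-child(3) span::text').extract())     # ['Hack Night']
print(sel.css('table tr td:nth-child(3) a::attr(href)').extract())  # ['/events/1.html']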

The event_time field is special in that it is supposed to be an object where each field of the object is an array of strings. The utility function create_time_data() should always be used for this field. We pass in the date parameter since the Chi Hack Night events page only provides a date for the event.

Testing the Scraper

To run only a specific scraper, you can run the ./start.sh command with the --spider-name parameter to specify what spider you want to run - this is perfect for testing a spider.

sudo ./start.sh --spider-name chihacknight

(Author’s note: the command above is for a Linux machine. If you have a different operating system it might look a little different for you.) However, even though each selector is valid and the logic seems sound, the event_processor container reports this error:

...

event_processor_1  |   File "/usr/src/app/event_processor/scrapy_impl/middlewares.py", line 18, in get_event_count

event_processor_1  |         raise ValueError(f'{spider.organization}: Selectors returned data of differing lengths')

event_processor_1  | ValueError: Chi Hack Night: Selectors returned data of differing lengths

event_processor_1  | 2019-10-01 01:44:28,840 - chihacknight - INFO - No data returned for https://chihacknight.org/

event_processor_1  | Event processor completed

...

So what happened? The event processor expects each field in the object returned by the scraper to be an array of values, and every array must be the same length. This is because it needs to correctly associate each piece of data with each event, which it can only do reliably if the arrays line up. Because extract() simply skips data it can’t find, some of the arrays ended up with different lengths, resulting in the error.
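To see why the equal-length requirement matters, imagine that one row on the page has no link. The extracted arrays fall out of step, and pairing values by index attaches data to the wrong events (the data below is made up):

# Three events, but the second one has no <a> tag, so its URL is simply
# missing from the array instead of being an empty placeholder.
titles = ['Hack Night', 'Data Night', 'Civic Jam']
urls = ['/events/1.html', '/events/3.html']

# Pairing by index silently gives the third event's URL to the second event.
print(list(zip(titles, urls)))
# [('Hack Night', '/events/1.html'), ('Data Night', '/events/3.html')]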

Fixing the Scraper

The ScraperSpider base class has a function empty_check_extract() which takes a selector for a repeating section of HTML and a second selector for the data within each of those sections. If it doesn’t find data using the second selector for a given section, it substitutes an empty string instead. This way, as long as the first selector of each call returns the same number of elements, the arrays returned by this function will always be the same length. We can replace each field with an empty_check_extract() call.
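As a rough mental model, a helper like this behaves roughly as sketched below. This is an illustrative guess, not the actual implementation in event_processor, and the real signature may differ:

# Illustrative sketch only; the real empty_check_extract() lives in the
# ScraperSpider base class and may not look exactly like this.
def empty_check_extract(sections, css_func, inner_selector, default_value=''):
    results = []
    for section in sections:  # e.g. one entry per 'table tr' match
        found = css_func(section, inner_selector).extract()
        # One value per section: the first match, or the default if nothing matched.
        results.append(found[0] if found else default_value)
    return results

Here is the fixed parse() method using the real helper: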

def parse(self, response):
    return {
        'title': self.empty_check_extract(response.css('table tr'), self.css_func,
            'td:nth-child(3) span::text'),
        'url': self.empty_check_extract(response.css('table tr'), self.css_func,
            'td:nth-child(3) a::attr(href)'),
        'event_time': self.create_time_data(
            date=self.empty_check_extract(response.css('table tr'), self.css_func,
                'td:nth-child(1) p::text', 'Jan 01, 2012')
        ),
        'address': list(map(lambda x: '222 Merchandise Mart Plaza, Chicago, IL 60654',
            self.empty_check_extract(response.css('table tr'), self.css_func,
                'td::text'))),
        'description': self.empty_check_extract(response.css('table tr'), self.css_func,
            'td:nth-child(4)::text')
    }

Additionally, I added a value for address which uses a lambda function to transform each returned result into the address of the Merchandise Mart. Since empty_check_extract() makes the array the same length as every other field and each element is transformed to the same string, this is a valid value for the address field. If we rerun the script, we should see something like this:

...

event_processor_1  | Running event processor...

event_processor_1  | 2019-10-01 01:56:00,762 - chihacknight - INFO - Found 7 events for Chi Hack Night.

event_processor_1  | 2019-10-01 01:56:00,815 - chihacknight - INFO - Saved 7 events for Chi Hack Night

event_processor_1  | Event processor completed

event_processor_1  | Data retrieved successfully

...

The scraper will only save events which occur in the future, based on the event date and time. Checking the database’s events table, we can find our events to confirm it worked as expected (see “Viewing the Events Table” below).

Crawling Web Scraper

The ScraperCrawlSpider inherits from scrapy’s CrawlSpider (https://docs.scrapy.org/en/latest/topics/spiders.html#crawlspider) and uses more of scrapy’s built-in features to crawl content from more than one page.
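For reference, a plain scrapy CrawlSpider (outside of this project, with placeholder URLs and selectors) looks roughly like this; ScraperCrawlSpider layers the project’s utilities on top of this machinery:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# A generic CrawlSpider, not part of the In2It codebase.
class ExampleCrawlSpider(CrawlSpider):
    name = 'example_crawl'
    allowed_domains = ['example.org']            # placeholder domain
    start_urls = ['https://example.org/events']  # placeholder URL

    # Follow every link the extractor finds and pass each page to parse_page().
    rules = (
        Rule(LinkExtractor(restrict_css='a.event-link'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        yield {'title': response.css('h1::text').extract_first()}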

Define the Scraper

Just like before, create a new scraper file in event_processor/scrapers called chihacknight_crawl_spider.py. Import ScraperCrawlSpider as the base class along with the scrapy classes Rule and LinkExtractor, and include the UTF-8 encoding comment at the top.

# -*- coding: utf-8 -*-
from event_processor.base.custom_spiders import ScraperCrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class ChiHackNightCrawlSpider(ScraperCrawlSpider):
    name = 'chihacknightcrawl'
    allowed_domains = ['chihacknight.org']
    start_urls = ['https://chihacknight.org/events']
    enabled = True
    rules = (Rule(LinkExtractor(restrict_css='table tr td:nth-child(3) a'), callback='parse_page', follow=True),)

Just like before, we define a distinct name, allowed_domains for our site, and enabled to mark that this spider is enabled. However, a crawl spider also needs to know what links it is allowed to crawl so that it doesn’t go on a crawling rampage. We also define a rules attribute which is a list of scrapy Crawling Rules (https://docs.scrapy.org/en/latest/topics/spiders.html#crawling-rules).

The first argument of the Rule constructor is a Link Extractor (https://docs.scrapy.org/en/latest/topics/link-extractors.html#topics-link-extractor), which takes in different parameters to filter links on the page. In this example, only the restrict_css parameter is needed since all links go to the same domain.
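To see how restrict_css narrows the crawl, a link extractor can be run on its own against a response object. The HTML fragment below is made up, not the real events page:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A made-up listing with one link in the third column and one somewhere else.
body = (b'<table><tr><td>Oct 01, 2019</td><td>6:00 pm</td>'
        b'<td><a href="/events/1.html">Hack Night</a></td>'
        b'<td><a href="/about.html">About</a></td></tr></table>')
response = HtmlResponse(url='https://chihacknight.org/events', body=body, encoding='utf-8')

extractor = LinkExtractor(restrict_css='table tr td:nth-child(3) a')
print([link.url for link in extractor.extract_links(response)])
# ['https://chihacknight.org/events/1.html'], only the third-column link is kept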

The crawl spider will crawl through pages itself, guided by the given rules. Rather than define a method for starting requests, we instead define a list of URL strings, start_urls, which tells the crawler where to begin.

Initialize the Scraper

Initializing the spider works much like it did for the ScraperSpider; only the date format is different, to account for the event detail pages spelling out the full month name in the date.

No start_requests() method is needed because the crawler will look at the start_urls list to know where to begin crawling.

def __init__(self, name=None, **kwargs):
    super().__init__(self, 'Chi Hack Night', 'https://chihacknight.org', date_format='%B %d, %Y', **kwargs)

Parse the Results

The ScraperCrawlSpider will crawl to each link matched by any of the rules defined in rules. If multiple rules match the same link, only the first matching rule in the list is used.

Parsing data works exactly like it did with the single-page web scraper. Examining the HTML on a detail page for a Chi Hack Night event, it looks like most of the relevant information is nicely labeled with an “itemprop” attribute that we can use in an attribute selector (https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors).

The event page doesn’t seem to contain its own URL on the page itself, so we instead get it from the url property passed in the response object.

Some of the event pages seem to have no address, so we override the empty string default value in the empty_check_extract() method for the address field to handle those cases.

For the address and description fields, the selectors end in *::text, which gets all text from all descendant elements returned by the given CSS selector.
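The difference is easy to see when the text is split across nested tags. The address markup below is made up for illustration, not copied from the real page:

from scrapy.selector import Selector

# The street and city live in nested <span> elements inside the address div.
sel = Selector(text='<div itemprop="address"><span>222 Merchandise Mart Plaza</span> <span>Chicago, IL</span></div>')

print(sel.css('[itemprop="address"]::text').extract())    # [' '], only the div's own text node
print(sel.css('[itemprop="address"] *::text').extract())  # ['222 Merchandise Mart Plaza', 'Chicago, IL']

Putting it all together, the parse_page() method looks like this: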

def parse_page(self, response):
    return {
        'title': self.empty_check_extract(response.css('#primary-content'), self.css_func, '[itemprop="name"]::text'),
        'url': list(map(lambda x: response.url, self.empty_check_extract(response.css('#primary-content'), self.css_func, '[itemprop="name"]::text'))),
        'event_time': self.create_time_data(
            date=self.empty_check_extract(response.css('#primary-content'), self.css_func, '[itemprop="startDate"]::text')
        ),
        'address': self.empty_check_extract(response.css('#primary-content'), self.css_func, '[itemprop="address"] *::text', default_value='222 Merchandise Mart Plaza, Chicago, IL 60654'),
        'description': self.empty_check_extract(response.css('#primary-content'), self.css_func, '[itemprop="description"] *::text')
    }

Testing the Scraper

Just like before, run the crawl scraper by running ./start.sh with the given crawler’s name.

sudo ./start.sh --spider-name chihacknightcrawl

After some time, you should see log messages reporting that events were found for “Chi Hack Night.”

Viewing the Events Table

To see what the data looks like in the events database, go to pgAdmin at localhost:7000 and sign in with username user@domain.com and password pgadmin. If you don’t have the server added, right click on “Servers” on the left and select Create… and then Server…

You will be prompted for information about the server you want to connect to. Give the server a meaningful name and switch to the Connection tab. The Port should be 5432 and the Maintenance database should be postgres. Both the username and password are postgres.

Once connected, navigate down the tree: Databases > events > Schemas > Tables > events. Right click on the events table and select Scripts > SELECT Script. Execute the default SELECT script by clicking on the lightning bolt button to see what events are in the database. This is a good way to evaluate whether your scrapers are extracting the correct data.