
In2It Chicago

View these docs here if you like pretty colors.

Table of Contents

Setup

Get the Code

Clone the In2ItChicago repository into your preferred directory with Git Bash on Windows or a normal terminal otherwise: git clone https://github.com/In2ItChicago/In2ItChicago.git

If you have any issues, see the troubleshooting guide further down in this document.

Install Docker

Note: If you’re running Windows Professional, it’s recommended that you use Docker for Windows instead of Docker Toolbox. Docker for Windows requires Hyper-V, while Docker Toolbox requires VirtualBox. Keep in mind that Hyper-V and VirtualBox cannot be used at the same time, so make sure you disable one before trying to use the other.

For Windows Home, download the latest version from here. Documentation is here.
For Windows Professional or Enterprise, download from here. Documentation is here.
For Mac, download from here. Documentation is here.
For Linux, download from your package manager. Documentation is here (Other distros have links on the left side of the page).
Make sure you follow any OS and distro-specific instructions for setting up Docker. It may be helpful to go through the getting started guide here.

Extra Installation Steps for Linux (Ubuntu)

These steps are for Ubuntu. Arch Linux has Docker available in pacman without any manual steps required. Other distros may require different steps.

If you’re new to Docker or you’re recovering from a failed installation attempt, it’s best to start by uninstalling older versions of Docker: sudo apt-get remove docker docker-engine docker.io

Run: sudo apt-get update
Install the following packages:
sudo apt-get install apt-transport-https
sudo apt-get install ca-certificates
sudo apt-get install curl
sudo apt-get install software-properties-common
These packages allow apt to use a repository over HTTPS.

Add Docker’s official GNU Privacy Guard (GPG) key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
This should print “OK” to the terminal.

Run: sudo apt-key fingerprint 0EBFCD88
Verify that the Key Fingerprint line shows: 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88

Set up the stable Docker repository:
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
Then run sudo apt-get update again.
Install the latest version of Docker CE: sudo apt-get install docker-ce

If there were problems during the installation, try removing docker and starting over.
sudo apt-get purge docker-ce
sudo rm -rf /var/lib/docker

Run: sudo curl -L https://github.com/docker/compose/releases/download/1.21.2/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose

Add executable permissions to the docker-compose binary: sudo chmod +x /usr/local/bin/docker-compose

Run docker-compose --version to verify it installed correctly. It should show a version and build number similar to: “docker-compose version 1.21.2, build 1719ceb”

If the docker-compose command doesn’t work, add the following line to your ~/.bashrc file:
export PATH="/usr/local/bin:$PATH"
Close and reopen your terminal(s) to apply the changes.
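After reopening your terminal, you can confirm the shell finds the binary with: which docker-compose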

Extra Installation Steps for Windows

I was unable to get Kitematic to work on Docker Toolbox, so I would recommend skipping that. Make sure virtualization is enabled in the BIOS. If you need to change virtualization settings, do a full reboot cycle, otherwise Windows may not report that the settings have changed. If you’re running Windows 10 Professional, you’ll need to make sure Hyper-V is enabled in the “Turn Windows Features On or Off” dialog. IMPORTANT: If you’re using Docker Toolbox on Windows Home edition, you’ll want to start the VirtualBox instance manually before starting Docker every time or Docker will complain about not having an IP address.

Additionally, once Docker is installed, you’ll need to tweak the VirtualBox settings slightly. Port forwarding must be configured manually to allow the host system to communicate with Docker over localhost instead of its default IP of 192.168.99.100. First, right click on the machine titled “default” and select “Settings”.

Settings

Once in the settings menu, select “Network” and then “Port Forwarding”.

Network

Finally, click the green plus in the top right corner and add a new port forwarding rule. The new rule should be configured exactly like “Rule 1” in the following picture, but you can name it whatever you like. This is the minimum configuration needed for the application to work, but you can add the other ports used by the application if you’d like to be able to connect to everything via localhost. The other ports used by this application are 3000, 9000, 5000, 7000, and 6800 if you would like to add those now.

Port Forwarding
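If you’d rather script this than click through the UI, VBoxManage can add the same NAT port-forwarding rules while the “default” VM is powered off. This is only a sketch for the optional extra ports listed above; adjust the rule names and ports as needed:
VBoxManage modifyvm "default" --natpf1 "tcp3000,tcp,127.0.0.1,3000,,3000"
VBoxManage modifyvm "default" --natpf1 "tcp7000,tcp,127.0.0.1,7000,,7000"
Repeat for ports 9000, 5000, and 6800 if you want those forwarded as well.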

Setting Up Docker

If you are using Linux, all of the subsequent Docker commands in this guide might have to be run with sudo. If you would like to be able to use Docker without sudo, look through the answers here. If you’re using Docker Toolbox on Windows Home, all subsequent statements that mention localhost should be replaced with 192.168.99.100 unless you set up port forwarding for all of the ports mentioned in the previous step.
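One common approach from those answers is to add your user to the docker group and then log out and back in (assumes a standard Docker installation):
sudo groupadd docker
sudo usermod -aG docker $USER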

Verify that Docker installed correctly with: docker run hello-world. You should see “Hello from Docker!”

Running the Code

Open a Docker terminal on Windows Home, Git Bash or some kind of bash emulator on Windows Professional, or a normal terminal otherwise, and cd into the Git repo. If on Windows, it’s probably a good idea to run scripts/fix-bad-characters.bash first because Docker behaves strangely when Windows-specific characters are sent to it. You may need to run this again in the future if more Windows characters make it into your files.

Now, run ./start.sh. If you get a permissions error, you may need to run chmod +x start.sh to grant execution permissions to the file. If all goes well, the database will be created, the scrapers will start running, and the website will start up. This process will take some time. Eventually, you should start seeing messages about events being saved. Once a message says Data retrieved successfully, the code is done running. On subsequent runs, you can run light-start.sh instead of start.sh, which skips everything except building and running the containers. If you run into any errors while running light-start.sh, try running start.sh again to clear out any bad cached data.

Several components should be visible now:

System Architecture

System Architecture

Setting up your development environment

For debugging, we’ve set up configurations to allow for remote debugging in Docker using VS Code. This allows you to set breakpoints and step through code remotely while it’s running in Docker. You can use another editor if you’d like, but you’ll have to set up remote debugging yourself. Whenever you open VS Code, it creates a directory called .vscode which stores local configurations.

This repo contains all of the components needed to run the system in separate folders:

When you’re developing, you’ll want to think of those folders as separate projects and open a separate instance of VS Code in each of those subdirectories. This is important because the remote debugger requires the folder structure of the remote and local repository to match. To do so, you can launch VS Code, then choose File -> Open Folder or open it from the command line like this: code ./in2it_site.

Once you have VS Code open, you should see a bug icon on the left panel. This contains the debugger settings. If you click the gear icon near the top right of the submenu, it will open a prompt to choose an environment. It doesn’t matter which one you choose because we’ll overwrite this file in a minute. In this repo, there is a folder called sample_vscode_config with one config per component. Replace the entire launch.json file with whatever config matches your current folder. As the comment in the files explains, you will need to replace localhost with 192.168.99.100 for Docker Toolbox. Once you have the configuration saved, you’ll be able to select it from the debug menu. When you have the code running in Docker, click the green arrow near the top left of VS Code to attach to the running process.

All of the code is running through a program called nodemon which allows you to use hot reloading while debugging. Hot reloading means that any time you change the source code in your editor, nodemon will detect the change and automatically restart the attached process. This way, testing your changes requires no manual intervention. You can try it by pressing CTRL + S on any source code file while the code is running.

You do not need to have Python or Node running locally for development since everything runs in Docker, but local installations may help with autocomplete and linting while developing. See python for VS Code and node for VS Code.

Configuring pgAdmin

This is an optional step to view what’s going on with the database. This doesn’t need to be done immediately, but it may be useful for debugging if things aren’t working as expected.

This step must be done while the code is running and after the database has been initialized (1-2 minutes after the first startup).

Go to pgAdmin at localhost:7000 and sign in with username user@domain.com and password pgadmin. On the left pane, right click on the node called Servers and create a new one as shown below.

Create Server

Enter whatever name you want for the server. I chose postgres because I’m boring.

Choose Name

Now, set up the connection properties. You should enter the properties exactly as shown in the image. The password is postgres.

Create Connection

If your database was created successfully, you should see three databases in the left pane now: events, postgres, and scheduler.
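If you’d rather check from the command line, you can also list the databases from inside the Postgres container (the container name below is a placeholder; run docker ps to find the real one):
docker exec -it <postgres-container> psql -U postgres -l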

Settings

Command Line Arguments

The following parameters can be passed to start.sh to change its runtime behavior.

Other Settings

The following settings are defined in event_processor/config.py:

Scheduler

We’ve forked a project from Nextdoor called ndscheduler to use as a scheduling system for this project. To run the scheduler with this application, the scheduler repository must be checked out into the same parent folder as this one.

To use it, cd into the parent directory of the directory you cloned this repository into and run git clone https://github.com/In2ItChicago/ndscheduler. The startup scripts in this repository check for the existence of the ndscheduler folder when running. Once this is done, go back to the In2ItChicago folder and run ./start.sh -s to start the application with the scheduler.

If all goes well, you should be able to navigate to localhost:8888 and see the scheduler. From there, you can let the scrapers run on a schedule or run them manually with the UI. localhost:6800 is the url for scrapyd, which is the middleman between the scrapers and ndscheduler.

Development Guide

IMPORTANT: If you’re using Windows, please run git config --global core.autocrlf input before committing anything. This prevents carriage returns from getting sent to the remote repository.

Our current development tasks and bugs are kept in the issues list here.
The easiest way to learn the code base and get started contributing is to add a new scraper as defined in this issue.
The issue contains instructions on how to pick a specific site.

Technical Overview

This project consists of four parts:

Detailed Docs

Getting Started

As stated previously, adding a scraper is the best way to start contributing. We’re using Scrapy for this project, a sophisticated web scraping and crawling framework. Check out the tutorials below for an introduction to web scraping and Scrapy.

Tutorials

Scrapy uses the cssselect module to implement CSS selectors. Docs can be found here. cssselect defines its selectors according to the W3C specification here, with a few exceptions listed in cssselect’s documentation.
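To get a feel for the framework, here is a minimal standalone Scrapy spider using CSS selectors. The URL and selectors are made up for illustration; real spiders in this repo should inherit from the base classes described later in this guide.

import scrapy

class ExampleEventSpider(scrapy.Spider):
    # Hypothetical spider: the start URL and selectors below are placeholders
    # and will not match a real site.
    name = 'example_events'
    start_urls = ['https://example.com/events']

    def parse(self, response):
        # Yield one item per event block found on the page.
        for event in response.css('.event'):
            yield {
                'title': event.css('.title::text').extract_first(),
                'date': event.css('.date::text').extract_first(),
                'url': event.css('a::attr(href)').extract_first(),
            }

You can run a standalone spider like this with scrapy runspider to experiment before integrating with the project’s base classes.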

Knowing when to use a scraper and when to use an API

Most websites that we’re dealing with will need to be scraped because the data on them is statically loaded from the server as HTML. However, some sites use APIs to dynamically load data. We should use these whenever possible because scrapers are fragile and need to be changed any time the content on the page changes. APIs are more stable and are less likely to have breaking changes introduced.

Here is an example of how to detect if a site has an API we can use.

  1. Go to https://chipublib.bibliocommons.com/events/search/index in Google Chrome
  2. Open the developer tools using F12 on Windows/Linux and Command+Option+I on Mac
  3. Click on the “Network” tab at the top of the toolbox
  4. Reload the page. The grid should be populated with data.
  5. Click on the “Name” column for any of the requests. A detailed view should appear and the “Headers” tab should be selected.
  6. Click on the “Response” tab. This view can contain a variety of data.
    For resource requests like images, it will say there is no data available; JavaScript files will show the JavaScript code, CSS files will show the stylesheet, and so on. The only response data we care about right now is JSON.
  7. Look for a request name that starts with “search?”. Looking through the response, you should see a JSON object.
  8. Click on the “Headers” tab. The Request URL is what your browser requested to retrieve the JSON data. We can use that same URL to get the data in our application.
  9. If you keep clicking through more requests, you should see several more that also returned JSON data.
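Once you’ve found a Request URL that returns JSON (step 8 above), a quick way to confirm it is usable outside the browser is to request it directly, for example with Python’s requests library. The URL below is just a placeholder for whatever you copied from the Headers tab:

import requests

# Paste the Request URL you copied from the browser's Network tab here.
url = 'PASTE_REQUEST_URL_HERE'

response = requests.get(url)
response.raise_for_status()
data = response.json()  # raises an error if the endpoint did not actually return JSON
print(data)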

This is the code that was used to create an API client for that site.
You can use this as a guide if you need to create your own API client. Some sites have APIs that are well-documented and designed for external use. These should be used if they are available.

Some sites may provide an iCalendar feed. Try to use the iCal reader if it is possible to do so. Some sites may also provide an RSS feed. This is an example of how to use the feedparser module to parse a feed.
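As a rough illustration of the feedparser approach (the feed URL below is a placeholder for the site’s real RSS/Atom feed):

import feedparser

# Placeholder feed URL; substitute the site's actual feed address.
feed = feedparser.parse('https://example.com/events.rss')
for entry in feed.entries:
    print(entry.title, entry.link)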

How to integrate new scrapers and API clients with the core code

All new scrapers should inherit from one of the classes listed here. All new API clients should inherit from ApiSpider, and scrapers should inherit from ScraperSpider or ScraperCrawlSpider, depending on whether the spider needs to visit multiple URLs.

The end goal of all scrapers and API clients is to transform the raw data into event objects that conform to the Event class in this file.
For each item, you’ll want to parse out the following data (as much as is available).

Once you’ve decided how to find these fields for your site, look at the existing examples to see what methods to use to extract the data.

Troubleshooting Guide

Weird errors are occurring when I start up the code

Services aren’t restarting properly with nodemon after saving

Docker says that it can’t start because ports are already allocated

When starting Docker on Windows, it complains about not having an IP address

VS Code is complaining about missing dependencies/modules/etc

Deployment

This describes how to deploy the application to the production server. All of the following commands assume your current working directory is the root of the repo. Before you do this, you will need both an SSH key for the server and the Dockerhub password. Ask a member with push access for these.

First, run deploy/tag-images.sh with a version number appended to the end. Our current version number scheme is YYYYMMDD.{REVISION}. EX: deploy/tag-images.sh 20190723.1. The revision number can be incremented if more than one deployment is done in a day. This will build the production version of the images, tag them, and save them locally on your computer.

Next, open up docker-compose.prod.yml and update all of the image versions to point to the new tag. If not all of the images were changed, not all of the tags need to be updated, but we prefer to version all of the images together since that makes it easier to keep everything in sync.

Now, run ./start-prod.sh and test to make sure everything is working locally.

Once tested, you can push the images to Dockerhub. The first time you do this, run deploy/login.sh to obtain access to push to Dockerhub; before doing that, you will need to have dockerhub_in2itchicago.txt inside your deploy folder.

Once logged in, run deploy/publish-images.sh 20190723.1 to push the images to Dockerhub (replace the version number with the one you used in the earlier step). This will take a few minutes depending on the extent of the changes. Dockerhub does a diff with the current published image to see what needs to be pushed.

After a successful push, you’re ready to deploy to the server. First, commit your changes to docker-compose.prod.yml so they can be picked up by the server. Then, run deploy/vultr_deploy.sh. For this to work, you need to have the ssh private key installed on your computer at ~/.ssh/id_rsa.

After the deploy script finishes, you can run deploy/connect_ports.sh. This will set up port forwarding to the remote server so you can monitor the applications from your local computer. If anything goes awry, you can use deploy/ssh_vultr.sh to SSH into the server.