Gathering and Using Unstructured Data

Instructors: Tyler Dauphinee and Rhys Chouinard (ATB Financial)

A practical survey of common data gathering tasks: open APIs, web scraping, and OCR (optical character recognition). Using some common Python libraries (requests, beautifulsoup, and tesseract/pytesseract), we will build a base competency in data gathering. We will then introduce one of the most widely used machine learning libraries (scikit-learn) and use it to model a few canonical datasets. Finally (time permitting), we will work through one end-to-end real-world modelling scenario for each of the three data gathering tasks: document classification, hockey bracket predictions, and movie score modelling.

Target audience: Researchers who need to automate the gathering of data from scanned documents, unruly websites, or other publicly available sources.

Duration: 3 hours

Level: Beginner to Intermediate

Prerequisites: Comfort with Python and a willingness to deal with some messy problems.

Laptop software: Sign up for a free GCP (Google Cloud Platform) trial (https://cloud.google.com/) (preferred), or install Docker locally (https://www.docker.com/products/docker-desktop).