Image by Joshua Sortino

Data Scraping

As an empirical researcher, the ever-present question I face is how to get the right data to explore my research questions. Data scraping has been my bread and butter so far! Over the past 4 years, through multiple data scraping assignments and plenty of trial and error, I've managed to amass (some) knowledge, which I present here.

Definition: Data scraping is a technique in which a computer program extracts data from the human-readable output of another program.

1

Different ways of data acquisition

Retrospective data that can be used for publishable research can be acquired in multiple ways. Here is a list with the corresponding issues for each of them.

Purchasing the data

  • Costly
  • One-time contact

2

Advantages, ethicality, and legality

While there are multiple advantages of scraping data off the internet, I would be remiss if I didn't mention a few words about the ethics and legality of such data collection.

Advantages

  • Very fast
  • Full agency
  • No NDA

Codes guiding data acquisition

  • Declarations about fair use policies while submitting manuscripts

3

Publication perspective

Here is a quick look at the stance of academic journals (especially OM-related ones) towards research carried out using scraped data.

4

Ways of scraping data

The three main ways of scraping data in R are listed here in order of increasing difficulty. Some useful sources to get you started with each of them are also provided.

Through API

  • For publicly available APIs: Watch video
  • Webpage embedded API
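To give a feel for the API route, here is a minimal sketch in R. It is not taken from the video or the workshop: the JSON text, its field names (`page`, `items`, `id`, `name`, `price`), and the variable names are all made up for illustration, and the `jsonlite` package is assumed to be installed. In a real project the response text would come from the API itself, for example via `httr::GET()` followed by `httr::content(resp, as = "text", encoding = "UTF-8")`.

```r
library(jsonlite)

# Hypothetical response body, standing in for what an API would return.
# Every field name below is an assumption made for this example.
response_text <- '{
  "page": 1,
  "items": [
    {"id": 101, "name": "Widget", "price": 9.99},
    {"id": 102, "name": "Gadget", "price": 24.50}
  ]
}'

# fromJSON() turns a JSON array of similar objects into a data frame
parsed <- fromJSON(response_text)
items <- parsed$items

print(items)
```

Because an API usually returns structured data like this, most of the work is in requesting the right pages and binding the resulting data frames together, which is why this route tends to be the easiest of the three.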

Here's an excerpt from a web-scraping workshop held at IESE that focuses on scraping a simple HTML-based website.

Click here to access the example code to follow along. 
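For readers who want a rough idea of what scraping a simple HTML page involves before opening the workshop material, here is a small offline sketch. It is not the workshop's code: the HTML snippet, the CSS selectors (`h1`, `#products`, `a.next`), and the variable names are invented for illustration, and the `rvest` package (version 1.0.0 or later) is assumed to be installed. With a live site, `read_html()` would be given the page URL instead of a literal string.

```r
library(rvest)

# Hypothetical page content; a real scrape would pass a URL to read_html()
page_source <- '<html><body>
  <h1>Product list</h1>
  <table id="products">
    <tr><th>Name</th><th>Price</th></tr>
    <tr><td>Widget</td><td>9.99</td></tr>
    <tr><td>Gadget</td><td>24.50</td></tr>
  </table>
  <a class="next" href="/page/2">Next</a>
</body></html>'

page <- read_html(page_source)

# Select nodes with CSS selectors, then pull out text, tables, or attributes
page_title <- html_text2(html_element(page, "h1"))
products   <- html_table(html_element(page, "#products"))
next_link  <- html_attr(html_element(page, "a.next"), "href")

print(page_title)
print(products)
print(next_link)
```

The same three steps (read the page, select nodes, extract their content) repeat for every page; a link such as `next_link` is what a loop would follow to reach the next page of results.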

Tutorial