Data Scraping

As an empirical researcher, the ever-present question I face is how to get the right data to explore my research questions. Data scraping has been my bread and butter so far! Over the past four years, through multiple data-scraping assignments and a great deal of trial and error, I've managed to amass (some) knowledge, which I present here.

Definition: Data scraping is a technique in which a computer program extracts data from the human-readable output of another program.

Different ways of data acquisition

Retrospective data that can be used for publishable research can be acquired in multiple ways. Here is a list with the corresponding issues for each of them.

Purchasing the data

Costly
One-time contact

Advantages, ethicality, and legality

While there are multiple advantages of scraping data off the internet, I would be remiss if I didn't mention a few words about the ethics and legality of such data collection.

Advantages

Very fast
Full agency
No NDA

Codes guiding data acquisition

Declarations about fair use policies while submitting manuscripts

Publication perspective

A quick look at the stance of academic journals (especially OM-related ones) towards research carried out using scraped data.

Ways of scraping data

The three main ways of scraping data in R are listed here in order of increasing difficulty. Some useful sources to get you started with each of them are also provided.

Through API

For publicly available APIs: Watch video
Webpage embedded API
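
To make the API route concrete, here is a minimal sketch of the usual pattern: build a request URL, fetch the JSON response, and keep only the fields you need. It is written in Python with the standard library only, rather than in R, so that it runs without extra packages. The endpoint `https://api.example.com/v1/products`, the `page` and `per_page` parameters, the `results` key, and the field names are all hypothetical; a real API will document its own.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, used purely for illustration.
BASE_URL = "https://api.example.com/v1/products"


def build_url(base_url, params):
    """Append query parameters to the base URL."""
    return f"{base_url}?{urlencode(params)}"


def parse_records(payload, fields):
    """Keep only the requested fields from each item in the JSON payload.

    Assumes the API wraps its items in a top-level "results" list.
    """
    data = json.loads(payload)
    return [{f: item.get(f) for f in fields} for item in data.get("results", [])]


def fetch_page(page, per_page=50):
    """Download one page of results as text (not called in this sketch)."""
    url = build_url(BASE_URL, {"page": page, "per_page": per_page})
    # Identifying yourself in the User-Agent is a common courtesy.
    req = Request(url, headers={"User-Agent": "research-scraper (you@example.edu)"})
    with urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")


# A canned response stands in for a live call so the sketch runs offline.
sample = '{"results": [{"id": 1, "name": "Widget", "price": 9.5, "internal": "x"}]}'
rows = parse_records(sample, ["id", "name", "price"])
print(rows)
```

In practice you would loop `fetch_page` over page numbers, pause between requests, and pass each response to `parse_records`.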

Here's an excerpt from our data-scraping workshop at IESE that focuses on scraping a simple HTML-based website.

Click here to access the example code to follow along.
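
For readers who want a feel for what scraping a simple HTML page involves before opening the workshop material, here is a small self-contained sketch. It is not the workshop code: it uses Python's built-in `html.parser` instead of R, and the table in `SAMPLE_HTML` is invented for illustration. The idea is the same in any language: walk the page's tags and collect the text inside the cells you care about.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented example page; a real scraper would download this HTML first.
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.50</td></tr>
  <tr><td>Gadget</td><td>12.00</td></tr>
</table>
"""


def scrape_table(html):
    """Return the table in `html` as a list of rows."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
header, records = rows[0], rows[1:]
print(header)
print(records)
```

Real pages are rarely this tidy, so expect to adjust which tags you match and to handle missing cells.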
