As an empirical researcher, I constantly face the question of how to get the right data to explore my research questions. Data scraping has been my bread and butter so far! Over the past 4 years, through multiple data-scraping assignments and plenty of trial and error, I've managed to amass (some) knowledge, which I present here.
Definition: Data scraping is a technique in which a computer program extracts data from the human-readable output of another program.
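To make the definition concrete, here is a minimal sketch of one program pulling structured records out of another program's human-readable output. It uses only Python's standard library so that it runs without extra packages; it is not the workshop's R code. The page content, the `TableScraper` class, and the `scrape_table` helper are all hypothetical names and data made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical human-readable output from another program: a small HTML page.
# The cities and counts are invented purely for illustration.
PAGE = """
<html><body>
  <h1>Store locations</h1>
  <table>
    <tr><th>City</th><th>Stores</th></tr>
    <tr><td>Barcelona</td><td>12</td></tr>
    <tr><td>Madrid</td><td>9</td></tr>
  </table>
</body></html>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only text that sits inside a cell is kept; the rest is page decoration.
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


def scrape_table(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows  # assumes the first row holds the column names
    return [dict(zip(header, row)) for row in body]


records = scrape_table(PAGE)
print(records)
```

The sketch assumes a single, well-formed table whose first row is the header. Real pages are rarely that tidy, which is why dedicated scraping libraries are usually the better choice in practice.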
Different ways of data acquisition
Retrospective data that can be used for publishable research can be acquired in multiple ways. Here is a list with the corresponding issues for each of them.
Purchasing the data
- One-time contact
Advantages, ethicality, and legality
While scraping data off the internet has multiple advantages, I would be remiss if I didn't say a few words about the ethics and legality of such data collection.
- Very fast
- Full agency
- No NDA
Codes guiding data acquisition
Declarations about fair use policies while submitting manuscripts
Here is a quick look at the stance of academic journals (especially OM-related ones) towards research carried out using scraped data.
Ways of scraping data
The three main ways of scraping data in R are listed here in increasing order of difficulty, along with some useful sources to get you started with each.
Here's an excerpt from our data-scraping workshop at IESE that focuses on scraping a simple HTML-based website.
Click here to access the example code to follow along.