2016-02-15

Scraping Data from HotsLogs with Python

I wrote a Python program that scrapes data from HotsLogs (a website about the PC game Heroes of the Storm) and stores it for analysis, using BeautifulSoup and Selenium. BeautifulSoup read the HTML tables, but most of the data was not sent to the client up front: tables had a 100-row limit, and more data sat in collapsed branches. Selenium clicked the JavaScript buttons that reveal those branches and page through the table records, which made them readable by BeautifulSoup. The data was then stored locally in JSON format and further distilled to CSV for analysis.
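To make the pipeline concrete, here is a minimal sketch of the approach described above, not the original project code. The table id, the "next page" CSS selector, the function names, and the sample rows are all made up for illustration. The real HotsLogs markup was different and would need its own selectors.

```python
import csv
import io
import json
import time

from bs4 import BeautifulSoup


def parse_table(html, table_id):
    """Return the rows of one HTML table as a list of dicts keyed by header text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows contain no <td> cells and are skipped
            rows.append(dict(zip(headers, cells)))
    return rows


def scrape_all_pages(driver, table_id, next_selector, max_pages=50, pause=1.0):
    """Parse the table, click the 'next' button, and repeat.

    `driver` is a Selenium WebDriver that has already loaded the page.
    `next_selector` is a hypothetical CSS selector for the paging button.
    """
    # Imported here so the parsing helpers work without Selenium installed.
    from selenium.webdriver.common.by import By

    rows = []
    for _ in range(max_pages):  # hard cap so a stuck button cannot loop forever
        rows.extend(parse_table(driver.page_source, table_id))
        buttons = driver.find_elements(By.CSS_SELECTOR, next_selector)
        if not buttons or not buttons[0].is_enabled():
            break  # assumption: no (enabled) button means this was the last page
        buttons[0].click()
        time.sleep(pause)  # crude wait for the JavaScript to redraw the table
    return rows


def rows_to_csv(rows):
    """Flatten a list of row dicts into CSV text, using the first row's keys as columns."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample markup standing in for one page of a stats table.
SAMPLE_HTML = """
<table id="stats">
  <tr><th>Hero</th><th>Games</th></tr>
  <tr><td>Hero A</td><td>120</td></tr>
  <tr><td>Hero B</td><td>95</td></tr>
</table>
"""

rows = parse_table(SAMPLE_HTML, "stats")
print(json.dumps(rows, indent=2))  # the JSON stage
print(rows_to_csv(rows))           # the CSV stage
```

Against a live page, scrape_all_pages would be called with a real WebDriver (for example one created with selenium.webdriver.Firefox()) after driver.get(...) has loaded the page. The fixed sleep is the simplest possible wait, and an explicit wait on the table contents would be more reliable.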

The motivation was that I wanted to generate my own statistics from HotsLogs, which shows summary statistics about Heroes of the Storm based on replays uploaded by the community. The raw data is displayed in HTML tables, though much of it is not sent to the client's browser until the table is expanded or paged through via JavaScript buttons. In the end, I was able to get all of the raw data I wanted, although it takes quite some time to click through and mine the site. Years later an API was released, and scraping is no longer necessary.

I uploaded the project and data to Dropbox for viewing:
Project on Dropbox
Data as JSON on Dropbox (might be slow)