I recently discovered rvest and SelectorGadget as an easy way to scrape data from websites. This is a follow-up to a previous post here, explaining how I obtained the data.
Introduction
The goal is to scrape the win/loss record for each player’s champion selection from the 2013-2015 NA/EU LCS seasons. SelectorGadget is used to find the CSS selectors for the regions of interest, and those selectors are then passed to rvest to extract the highlighted information.
Import Site
The site was imported using readLines and htmlTreeParse (from the XML package).
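As a rough illustration of that import step, the sketch below reads HTML lines and parses them with the XML package. The inline HTML and the text connection are stand-ins for the real match page so the example runs offline; they are assumptions, not the page the post scraped.

```r
library(XML)

# Stand-in for readLines(<match page URL>): a text connection keeps this offline.
con <- textConnection(c(
  "<html><head><title>Match recap</title></head>",
  "<body><table><tr><th>Cloud9</th></tr></table></body></html>"
))
raw_lines <- readLines(con)
close(con)

# Collapse the lines into a single string and parse it as HTML.
doc <- htmlTreeParse(paste(raw_lines, collapse = "\n"),
                     asText = TRUE, useInternalNodes = TRUE)

# Quick check that the document parsed: pull out the page title.
page_title <- xpathSApply(doc, "//title", xmlValue)
```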
Teams
The first step is to find which teams played during the week. In SelectorGadget, we highlight the regions of interest and deselect the regions that are not of interest.
This gives the CSS selector tr+ tr th:nth-child(4), tr+ tr th:nth-child(1), which is used with html_nodes.
The result contains the extra values “C” and “K” as well as newline characters, all of which need to be removed. A League of Legends team has five players, so each team name is repeated five times.
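A minimal sketch of the team step, assuming a small hand-written table in place of the real wiki page. The table layout, including the row that produces the stray “C” and “K” values, is hypothetical and only exists so the selector has something to match.

```r
library(rvest)

# Hypothetical markup: a header row, a match row, and a row of stray cells.
html <- paste0(
  "<table>",
  "<tr><th>Blue</th><th>W</th><th>L</th><th>Red</th></tr>",
  "<tr><th>Cloud9\n</th><th>1</th><th>0</th><th>Team SoloMid\n</th></tr>",
  "<tr><th>C</th><th></th><th></th><th>K</th></tr>",
  "</table>"
)
page <- read_html(html)

# Selector from SelectorGadget: first and fourth header cells of later rows.
sel <- "tr+ tr th:nth-child(4), tr+ tr th:nth-child(1)"
raw <- html_text(html_nodes(page, sel))

# Drop newline characters, then discard the stray "C" and "K" values.
teams <- gsub("\n", "", raw, fixed = TRUE)
teams <- teams[!teams %in% c("C", "K")]

# Five players per team, so each team name is repeated five times.
team_col <- rep(teams, each = 5)
```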
Player and Champion
This requires more work. The easiest approach was to grab the table and rely on its layout: in each row, the first value is the champion and the second is the player name. We first get the table and then extract the data into lists, because the number of items bought per game is unknown and trinkets were only introduced in the 2014 season, so the rows vary in length.
From each row, the href and title attributes can then be obtained. To deal with rows that do not come from a player, we use the fact that each player must pick summoner spells: a row is checked for a summoner spell to determine whether it is a player and not another field.
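The row filtering could be sketched roughly as follows. The link layout in each row and the partial list of summoner spell names are assumptions made for illustration; the real page will differ.

```r
library(rvest)

# Hypothetical rows: one player row (champion, player, two spells) and one other row.
html <- paste0(
  "<table>",
  "<tr><td><a href='/Rumble' title='Rumble'>R</a>",
  "<a href='/Balls' title='Balls'>Balls</a>",
  "<a href='/Flash' title='Flash'>F</a>",
  "<a href='/Teleport' title='Teleport'>T</a></td></tr>",
  "<tr><td><a href='/Bans' title='Bans'>Bans</a></td></tr>",
  "</table>"
)
rows <- html_nodes(read_html(html), "tr")

# Partial, illustrative list of summoner spells used to recognise player rows.
spells <- c("Flash", "Teleport", "Ignite", "Smite", "Heal", "Exhaust")

# Collect the title attribute of every link, row by row, as a list.
titles_per_row <- lapply(rows, function(r) html_attr(html_nodes(r, "a"), "title"))

# A row counts as a player row only if it mentions a summoner spell.
is_player <- vapply(titles_per_row, function(t) any(t %in% spells), logical(1))
player_rows <- titles_per_row[is_player]

# First title is the champion, second is the player name.
champ  <- vapply(player_rows, `[`, character(1), 1)
player <- vapply(player_rows, `[`, character(1), 2)
```

The href values can be collected the same way by passing "href" to html_attr.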
Win/Loss
To get win/loss, the selector tr+ tr th:nth-child(3) , tr+ tr th:nth-child(2) was used. As with the team names, each value is repeated five times to match the layout of the data.
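A small sketch of the repetition, assuming the selector has already returned one result per team as text (the values below are made up):

```r
# Hypothetical scraped results, one per team, in the same order as the teams.
wl_raw <- c("1", "0")

# Convert to integers and repeat each value once per player on the team.
wL <- rep(as.integer(wl_raw), each = 5)
```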
Date
To get the date, the selector .match-recap tr:nth-child(1) td was used. This returns every value in that row, so the entry containing Date: must be found and the label removed.
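One way to pick out the date is sketched below. The recap vector stands in for the text returned by the selector; the entries other than the Date: one are invented for illustration.

```r
# Hypothetical cell text from the first row of the match recap table.
recap <- c("Cloud9 vs Team SoloMid", "Date: 5/30/2015", "Patch: 5.10")

# Keep only the entry containing the "Date:" label, then strip the label.
date_line  <- grep("Date:", recap, fixed = TRUE, value = TRUE)
match_date <- trimws(sub("Date:", "", date_line, fixed = TRUE))

# One date per player row: two teams of five players.
date_col <- rep(match_date, 10)
```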
Putting it together
To put this together, the above functions are called and their results are checked to make sure they are all the same length before being combined into a data frame.
Missing data generally shows up as one of the above not being the same length as the others. Since the source is a wiki, such gaps can be fixed by adding the required information to the page. When information is missing, a warning shows the function and the web page so that the missing information can be checked on the site.
| player | champ | team | wL | date |
|---|---|---|---|---|
| Balls | Rumble | Cloud9 | 1 | 5/30/2015 |
| Meteos | Gragas | Cloud9 | 1 | 5/30/2015 |
| Incarnati0n | Kog’Maw | Cloud9 | 1 | 5/30/2015 |
| Sneaky | Sivir | Cloud9 | 1 | 5/30/2015 |
| LemonNation | Nautilus | Cloud9 | 1 | 5/30/2015 |
| Dyrus | Maokai | Team SoloMid | 0 | 5/30/2015 |
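The length check and assembly described in this section might look roughly like the sketch below. The function name build_match, its arguments, and the placeholder page names are hypothetical and are not the code used for the post.

```r
# Combine the scraped pieces into a data frame, or warn and return NULL
# when the pieces do not all have the same length.
build_match <- function(player, champ, team, wL, date, url) {
  lens <- c(player = length(player), champ = length(champ),
            team = length(team), wL = length(wL), date = length(date))
  if (length(unique(lens)) != 1) {
    bad <- names(lens)[lens != max(lens)]
    warning("Length mismatch in ", paste(bad, collapse = ", "), " for ", url)
    return(NULL)
  }
  data.frame(player, champ, team, wL, date, stringsAsFactors = FALSE)
}

# Complete data combines cleanly.
ok <- build_match(c("Balls", "Meteos"), c("Rumble", "Gragas"),
                  rep("Cloud9", 2), c(1L, 1L), rep("5/30/2015", 2), "page-a")

# A missing champion triggers the warning; capture its message for inspection.
msg <- tryCatch(
  build_match("Balls", character(0), "Cloud9", 1L, "5/30/2015", "page-b"),
  warning = function(w) conditionMessage(w)
)
```

Returning NULL for incomplete pages keeps the rest of a run going while the warning records which page needs attention.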
Using this, a vector of websites can be processed, giving for example:
| player | champ | team | wL | date |
|---|---|---|---|---|
| Dyrus | Irelia | Team SoloMid | 1 | 2015-01-24 |
| Santorin | Rek’Sai | Team SoloMid | 1 | 2015-01-24 |
| Bjergsen | Ahri | Team SoloMid | 1 | 2015-01-24 |
| WildTurtle | Jinx | Team SoloMid | 1 | 2015-01-24 |
| Lustboy | Janna | Team SoloMid | 1 | 2015-01-24 |
| Balls | Gnar | Cloud9 | 0 | 2015-01-24 |
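Running the scraper over several pages could be sketched as below. Here scrape_match is a hypothetical stub standing in for the per-page steps described earlier, and the page names are placeholders rather than real URLs.

```r
# Hypothetical stub: returns one row per page, or NULL for a page with gaps.
scrape_match <- function(url) {
  if (grepl("missing", url, fixed = TRUE)) return(NULL)
  data.frame(player = "Dyrus", champ = "Irelia", team = "Team SoloMid",
             wL = 1L, date = "2015-01-24", source = url,
             stringsAsFactors = FALSE)
}

# Placeholder page identifiers, not real addresses.
urls <- c("week1-page", "missing-page", "week2-page")

# Scrape each page, drop the ones that failed, and stack the rest.
results   <- lapply(urls, scrape_match)
results   <- Filter(Negate(is.null), results)
all_games <- do.call(rbind, results)
```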
Conclusion
These per-page results can then be combined into a final data format by looping over the pages to gather all the data. A few things are not shown here, such as standardizing the dates, checking for misspelled player names (Dyrus/Dryus), handling players whose pages have not been updated (SELFIE NO PAGE FOUND, SELFIE), and fixing bo3/5 series, where wins are not 0 or 1 but can range from 0 to 5.
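Two of these clean-up steps, date standardization and name correction, could be sketched as follows. The sample rows, the helper parse_date, and the lookup table fixes are assumptions for illustration only; the two date formats are the ones that appear in the tables above.

```r
# Hypothetical sample mixing both date formats and one misspelled name.
games <- data.frame(
  player = c("Dyrus", "Dryus", "Balls"),
  date   = c("2015-01-24", "5/30/2015", "5/30/2015"),
  stringsAsFactors = FALSE
)

# Try the ISO format first, then fall back to month/day/year.
parse_date <- function(x) {
  iso <- as.Date(x, format = "%Y-%m-%d")
  us  <- as.Date(x, format = "%m/%d/%Y")
  out <- iso
  out[is.na(iso)] <- us[is.na(iso)]
  out
}
games$date <- parse_date(games$date)

# Map known misspellings to the correct player name.
fixes <- c(Dryus = "Dyrus")
hit <- games$player %in% names(fixes)
games$player[hit] <- unname(fixes[games$player[hit]])
```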