library(rvest)
library(xml2)

## Choose a starting point
baseurl <- 'http://www.homedepot.com/b/Appliances-Dishwashers-Built-In-Dishwashers/N-5yc1vZc3nj'

## Data frame to hold results
df <- data.frame(model = character(0), rating = numeric(0), link = character(0))

url <- baseurl
loadNextPage <- TRUE

while (loadNextPage) { ## Loop through pages
  print('Reading Page')
  Sys.sleep(0.1) ## Let's be nice

  html <- url %>%
    read_html() ## pull back the page

  dw <- html %>%
    html_nodes('.plp-pod') ## focus in on the dishwashers

  model <- dw %>%
    html_node('.pod-plp__model') %>%
    html_text() ## get model ID
  model <- gsub('[^[:alnum:] ]', '', model)
  model <- trimws(sub("Model\\s([^ ]*).*$", "\\1", model)) ## remove the unwanted

  rating <- dw %>%
    html_nodes('.pod-plp__ratings') %>%
    html_node('a') %>%
    html_node('span') %>%
    html_attr('rel') %>%
    as.numeric() ## rating can be found in a link

  link <- dw %>%
    html_nodes('.plp-pod__image') %>%
    html_nodes('a') %>%
    html_attr('href') ## link to more information

  df <- rbind(df, data.frame(model = model, rating = rating, link = paste0('http://www.homedepot.com', link)))

  gotoNext <- html %>%
    html_nodes('.hd-pagination__link') %>%
    html_nodes(xpath = '//a[contains(@title,"Next")]') ## Link to the next page

  if (length(gotoNext) > 0) {
    url <- gotoNext %>% html_attr('href')
    url <- paste0('http://www.homedepot.com', url)
    loadNextPage <- TRUE ## Next page exists
  } else {
    loadNextPage <- FALSE ## We've reached the last page
  }
}
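
The model-cleaning step inside the loop is easy to try on its own, without touching the website. Below is a minimal sketch in base R: the `clean_model` helper name and the sample strings are made up for illustration, and the `Model# ...` layout is an assumption about what the scraped text looks like. Only the two regular expressions come from the scraper itself.

```r
## Hypothetical helper wrapping the two regex steps from the loop
clean_model <- function(x) {
  ## Drop everything except letters, digits and spaces (newlines, '#', '-', ...)
  x <- gsub('[^[:alnum:] ]', '', x)
  ## Keep only the token that follows "Model ", then trim leftover spaces
  trimws(sub("Model\\s([^ ]*).*$", "\\1", x))
}

## Made-up strings imitating scraped text
raw <- c("\n  Model# SHX3AR75UC  \n", "Model# LDF-5545ST")

cleaned <- clean_model(raw)
print(cleaned)
```

One thing this shows is that the first substitution also strips punctuation inside the model number, so a hyphenated ID such as `LDF-5545ST` would come back as `LDF5545ST`. That may or may not matter depending on how the IDs are used afterwards.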
Our dishwasher broke and was beyond repair, so it was time for a new one. Fortunately, Black Friday was approaching and there were plenty of deals to be found. It’s now easier than ever to pull information off the web, and much of it comes in a usable format, so if you ask me to identify a replacement dishwasher, web scraping is where I’m going to start. I’m a big fan of the R library rvest, and it took little time to pull some useful information from the Home Depot website.