Scraping for Dishwashers

Our dishwasher broke and was beyond repair. Time for a new one. Fortunately Black Friday was approaching, so there were plenty of deals to be found. It's now easier than ever to pull information off the web, and much of it comes in a usable format, so if you ask me to identify a replacement dishwasher, web scraping is where I'm going to start. I'm a big fan of the R package rvest, and it took little time to pull some useful information from the Home Depot website.

library(rvest)
library(xml2)

## Choose a starting point
baseurl <- 'http://www.homedepot.com/b/Appliances-Dishwashers-Built-In-Dishwashers/N-5yc1vZc3nj'

## Data frame to hold results
df <- data.frame(model = character(0), rating = numeric(0), link = character(0),
                 stringsAsFactors = FALSE)
url <- baseurl
loadNextPage <- TRUE
while (loadNextPage) {  ## Loop through pages
  print('Reading Page')
  Sys.sleep(0.1)  ## Let's be nice
  html <- url %>%
    read_html()  ## pull back the page
  dw <- html %>%
    html_nodes('.plp-pod')  ## focus in on the dishwashers
  model <- dw %>%
    html_node('.pod-plp__model') %>%
    html_text()  ## get model ID
  model <- gsub('[^[:alnum:] ]', '', model)  ## drop punctuation and line breaks
  model <- trimws(sub("Model\\s([^ ]*).*$", "\\1", model))  ## keep only the model number
  ## html_node() (singular) gives one result per dishwasher, NA when missing,
  ## so model, rating and link stay the same length
  rating <- dw %>%
    html_node('.pod-plp__ratings') %>%
    html_node('a') %>%
    html_node('span') %>%
    html_attr('rel') %>%
    as.numeric()  ## rating sits in the rel attribute of a span inside a link
  link <- dw %>%
    html_node('.plp-pod__image') %>%
    html_node('a') %>%
    html_attr('href')  ## link to more information
  df <- rbind(df, data.frame(model = model, rating = rating,
                             link = paste0('http://www.homedepot.com', link),
                             stringsAsFactors = FALSE))
  gotoNext <- html %>%
    html_nodes('.hd-pagination__link') %>%
    html_nodes(xpath = '//a[contains(@title,"Next")]')  ## Link to the next page
  if (length(gotoNext) > 0) {
    ## The absolute XPath may match more than one link, so keep only the first
    url <- (gotoNext %>% html_attr('href'))[1]
    url <- paste0('http://www.homedepot.com', url)
    loadNextPage <- TRUE  ## Next page exists
  } else {
    loadNextPage <- FALSE  ## We've reached the last page
  }
}
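
If you want to poke at the selectors and the model clean-up without hitting the site, something like the following should work. It's only a sketch: the HTML snippet is made up by me and merely borrows the class names from the loop above, and the model numbers and rating in it are placeholders, not scraped data. It shows two things: what the two regex steps do to the model text, and how html_nodes() and html_node() differ when a product has no rating.

```r
library(rvest)

## Hypothetical markup: it only mimics the class names used in the scraper,
## the real page structure and values are assumptions
page <- read_html('
  <div class="plp-pod">
    <div class="pod-plp__model">Model# ABC123XYZ | Internet #000000001</div>
    <div class="pod-plp__ratings"><a href="#"><span rel="4.5"></span></a></div>
  </div>
  <div class="plp-pod">
    <div class="pod-plp__model">Model# DEF456UVW</div>
  </div>')

pods <- page %>% html_nodes('.plp-pod')

## Model clean-up: strip everything except letters, digits and spaces,
## then keep the first token after "Model"
model <- pods %>% html_node('.pod-plp__model') %>% html_text()
model <- gsub('[^[:alnum:] ]', '', model)
model <- trimws(sub("Model\\s([^ ]*).*$", "\\1", model))

## html_nodes() returns only the matches that exist, so the unrated
## product silently disappears from the result
rating_all <- pods %>%
  html_nodes('.pod-plp__ratings') %>%
  html_node('span') %>%
  html_attr('rel') %>%
  as.numeric()

## html_node() returns one result per product, with NA for the unrated one
rating_one <- pods %>%
  html_node('.pod-plp__ratings') %>%
  html_node('span') %>%
  html_attr('rel') %>%
  as.numeric()

print(model)
print(length(rating_all))
print(rating_one)
```

The length difference matters because data.frame() needs its columns to line up: a vector that is one element short either throws an error or, worse, pairs a rating with the wrong model.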