Scraping for Dishwashers
Our dishwasher broke and was beyond repair. Time for a new one. Fortunately, Black Friday was approaching, so there were plenty of deals to be found. It's now easier than ever to pull information off the web, and much of it comes in a usable format. So if you ask me to identify a replacement dishwasher, web scraping is where I'm going to start. I'm a big fan of the R package rvest, and it took little time to pull some useful information from the Home Depot website. The script walks through each page of built-in dishwasher listings and collects the model number, rating, and product link for every dishwasher it finds.
library(rvest)
library(xml2)

## Choose a starting point
baseurl <- 'http://www.homedepot.com/b/Appliances-Dishwashers-Built-In-Dishwashers/N-5yc1vZc3nj'

## Data frame to hold results
df <- data.frame(model = character(0), rating = numeric(0), link = character(0),
                 stringsAsFactors = FALSE)
url <- baseurl
loadNextPage <- TRUE
while (loadNextPage) { ## Loop through pages
  print('Reading Page')
  Sys.sleep(0.1) ## Let's be nice
  html <- url %>%
    read_html() ## pull back the page
  dw <- html %>%
    html_nodes('.plp-pod') ## focus in on the dishwashers
  model <- dw %>%
    html_node('.pod-plp__model') %>%
    html_text() ## get model ID
  model <- gsub('[^[:alnum:] ]', '', model) ## strip punctuation and line breaks
  model <- trimws(sub("Model\\s([^ ]*).*$", "\\1", model)) ## remove the unwanted
  ## html_node() (singular) gives one result per dishwasher, NA when missing,
  ## so rating and link stay the same length as model
  rating <- dw %>%
    html_node('.pod-plp__ratings') %>%
    html_node('a') %>%
    html_node('span') %>%
    html_attr('rel') %>%
    as.numeric() ## rating can be found in a link
  link <- dw %>%
    html_node('.plp-pod__image') %>%
    html_node('a') %>%
    html_attr('href') ## link to more information
  df <- rbind(df, data.frame(model = model, rating = rating,
                             link = paste0('http://www.homedepot.com', link),
                             stringsAsFactors = FALSE))
  gotoNext <- html %>%
    html_nodes('.hd-pagination__link') %>%
    html_nodes(xpath = '//a[contains(@title,"Next")]') ## Link to the next page
  if (length(gotoNext) > 0) {
    url <- gotoNext %>% html_attr('href')
    url <- paste0('http://www.homedepot.com', url[1]) ## use the first match if several
    loadNextPage <- TRUE ## Next page exists
  } else {
    loadNextPage <- FALSE ## We've reached the last page
  }
}
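
The two lines that tidy up the model ID are the least obvious part of the loop, so here is a small offline sketch of what they do. The helper name clean_model and the sample strings are made up for illustration. They only imitate the sort of text html_text() might return and are not real output from the site.

```r
## Hypothetical helper wrapping the two cleanup lines from the loop
clean_model <- function(x) {
  x <- gsub('[^[:alnum:] ]', '', x)            ## drop punctuation, newlines, tabs
  trimws(sub("Model\\s([^ ]*).*$", "\\1", x))  ## keep only the token after "Model "
}

## Made-up strings imitating scraped text (assumption, not real site output)
samples <- c("\n  Model# DW1234ABC  \n", "Model# XYZ9000SS")

cleaned <- clean_model(samples)
print(cleaned)
```

The first gsub() leaves only letters, digits, and spaces, which turns "Model# DW1234ABC" into "Model DW1234ABC". The sub() then keeps the first run of non-space characters after "Model ", and trimws() removes whatever padding is left over.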