Web Scraping with R

Example of a script that scrapes a website

In this example, we will scrape the website of the Office of the Comptroller General of Paraguay, which hosts thousands of financial statements of government employees. You’ll need R with the packages loaded below, a working Firefox installation, and Java (the Selenium server runs on it).

Setup

First, clear the environment and load necessary libraries:

rm(list = ls()) 

library(pacman) # p_load() installs any missing packages before loading them
pacman::p_load("RSelenium", "tidyverse", "tidylog", "netstat",
               "wdman", "polite", "rvest")

Next, let’s ask politely whether we are allowed to scrape the site:


bow("https://portaldjbr.contraloria.gov.py/portaldjbr/")
## <polite session> https://portaldjbr.contraloria.gov.py/portaldjbr/
##     User-agent: polite R package
##     robots.txt: 1 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent
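
The output reports a crawl delay of 5 seconds, which we should respect between requests. A minimal sketch of how to store the session and reuse that delay, assuming the polite session object exposes a delay field:

session <- bow("https://portaldjbr.contraloria.gov.py/portaldjbr/")
crawl_delay <- session$delay # 5 seconds, per the site's robots.txt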

Let’s use two presidents of Paraguay as examples:

names_to_search <- c("Mario Abdo Benítez", "Horacio Cartes")

Start an RSelenium server and browser session:

rD <- RSelenium::rsDriver(
  port = free_port(),  # pick a random free port (from the netstat package)
  browser = "firefox",
  version = "latest",  # Selenium server version
  chromever = NULL     # no Chrome driver needed, since we use Firefox
)

remDr <- rD[["client"]]
# If you get an error about the port, try closing the browser like this:
# remDr$close()

# or, on Windows, kill the Java process running the Selenium server:
# system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
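
By default, Firefox may ask where to save each PDF, which would stall an unattended loop. Here is a hedged sketch of an alternative way to start the driver so PDFs are saved silently to a fixed folder; the path "C:/scraped_pdfs" is a placeholder, and the preferences are standard Firefox settings passed through RSelenium's makeFirefoxProfile():

# Sketch: a Firefox profile that saves PDFs without prompting.
# "C:/scraped_pdfs" is a placeholder path; adjust it for your machine.
fprof <- RSelenium::makeFirefoxProfile(list(
  browser.download.folderList = 2L,           # use a custom directory
  browser.download.dir = "C:/scraped_pdfs",   # where the PDFs land
  browser.helperApps.neverAsk.saveToDisk = "application/pdf",
  pdfjs.disabled = TRUE                       # skip the built-in PDF viewer
))

rD <- RSelenium::rsDriver(
  port = free_port(),
  browser = "firefox",
  version = "latest",
  chromever = NULL,
  extraCapabilities = fprof
)
remDr <- rD[["client"]]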

Navigate to the website:

initial_address <- "https://portaldjbr.contraloria.gov.py/portaldjbr/"
remDr$navigate(initial_address)

If everything has gone smoothly so far, we are almost ready to start downloading the financial statements. Let’s loop over the presidents’ names, enter each one in the search box, and download every statement listed:


for (i in seq_along(names_to_search)) {
  
  remDr$navigate(initial_address)
  
  # Locate the search box and type in the name:
  search_bar_location <- ".col-5 > input:nth-child(1)"
  
  actual_search_bar <- remDr$findElement(using = "css selector",
                                         search_bar_location)
  
  Sys.sleep(2) # give the page time to render
  
  actual_search_bar$sendKeysToElement(list(names_to_search[i]))
  
  Sys.sleep(3)
  
  # Click the search icon to run the query:
  remDr$findElement(using = "css selector", ".icon")$clickElement()

  # Wait for the results table to load:
  Sys.sleep(4)
  
  # First, check how many statements there are by counting the table rows:
  table <- remDr$findElement(using = 'css selector', value = 'table')
  
  table_html <- table$getElementAttribute('outerHTML')[[1]]
  # Parse the HTML to extract the table data
  table_data <- read_html(table_html) %>% html_table()
  # Convert the data frame to a tibble
  table_tibble <- table_data[[1]] %>% as_tibble()
  
  # Now, let's download the files:
  
  for (j in seq_len(nrow(table_tibble))) {
    
    # Click the download button for the PDF in row j. Build the selector
    # with one paste0() so no stray whitespace ends up inside the string:
    pdf_button_selector <- paste0(
      "div.row:nth-child(4) > div:nth-child(1) > div:nth-child(1) > ",
      "div:nth-child(2) > table:nth-child(1) > tbody:nth-child(2) > ",
      "tr:nth-child(", j, ") > td:nth-child(5) > div:nth-child(1) > ",
      "div:nth-child(1) > button:nth-child(1)"
    )
    remDr$findElement(using = "css selector",
                      value = pdf_button_selector)$clickElement()
    # Wait for the download to finish:
    Sys.sleep(5)
    
  }
  
  
}
  

And voilà, you should see the downloaded PDFs in your downloads folder.
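
Finally, it’s good practice to shut down the browser and the Selenium server when you’re done. A short sketch; the download path below is an assumption (Firefox’s usual default on Windows), so adjust it to your machine:

# Close the browser session and stop the Selenium server:
remDr$close()
rD$server$stop()

# Sanity check: list the downloaded PDFs (assumed default download folder):
downloads <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
list.files(downloads, pattern = "\\.pdf$")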