Rselenium Web Scraping



  • Selenium is an automation tool, and RSelenium is the package that lets us drive it from R. With RSelenium, the script behaves as if a real browser had walked through the website and collected every bit of information on its way; to the server, the Selenium driver looks like just another visitor wandering the page.



Web scraping is a very useful mechanism for extracting data from websites or automating actions on them. Normally we would use urllib or requests to do this, but things start to fail when websites use javascript to render the page rather than static HTML. For many websites the information is stored in static HTML files, but for others it is loaded dynamically through javascript (e.g. from ajax calls). The reason may be that the information is constantly changing, or it may be to prevent web scraping! Either way, you need more advanced techniques to scrape the information – this is where the selenium library can help.

What is web scraping?

To align on terms: web scraping, also known as web harvesting or web data extraction, is data scraping used to extract data from websites. A web scraping script may access the URL directly using HTTP requests or by simulating a web browser. The second approach is exactly how selenium works – it simulates a web browser. The big advantage of simulating a browser is that the website is fully rendered – whether it uses javascript or static HTML files.

What is selenium?

According to the official selenium web page, it is a suite of tools for automating web browsers, and the project is a member of the Software Freedom Conservancy. Selenium consists of three projects, each providing different functionality; if you are interested, visit their official website. The scope of this blog is limited to the Selenium WebDriver project.

When should you use selenium?

Selenium gives us the tools to perform web scraping, but when should it be used? You can generally use selenium in the following scenarios:

  • When the data is loaded dynamically – for example Twitter. What you see in “view source” is different from what you see on the page. (The reason is that “view source” only shows the static HTML; if you want to see under the covers of a dynamic website, right click and choose “inspect element” instead.)
  • When you need to perform an interactive action in order to display the data on screen – a classic example is infinite scrolling. For some websites, you need to scroll to the bottom of the page before more entries will show. What happens behind the scenes is that when you scroll to the bottom, javascript code calls the server to load more records on screen.

So why not use selenium all the time? It is a bit slower than using requests and urllib, because selenium runs a full browser, with all the overhead that brings with it. There are also a few extra setup steps required to use selenium, as you can see below.

Once you have the data extracted, you can still use similar approaches to process the data (e.g. using tools such as BeautifulSoup)

Pre-requisites for using selenium

Step 1: Install selenium library

Before starting with a web scraping sample, ensure that all requirements are in place. Selenium requires pip or pip3 to be installed; if you don’t have it, you can follow the official guide to install it for your operating system.

Once pip is installed, you can proceed with the installation of selenium with the following command:
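Installing via pip typically looks like this (use pip3 if that is how pip is named on your system):

```
pip install selenium
```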

Alternatively, you can download the PyPI source archive (selenium-x.x.x.tar.gz) and install it using setup.py:
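Unpacking the archive and running setup.py typically looks like this (the version number is a placeholder):

```
tar -xzf selenium-x.x.x.tar.gz
cd selenium-x.x.x
python setup.py install
```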

Step 2: Install web driver

Selenium simulates an actual browser. It won’t use your Chrome installation directly; instead it uses a “driver”, the browser engine used to run a browser. Selenium supports multiple web browsers, so you may choose which one to use (read on).

Selenium WebDriver refers to both the language bindings and the implementations of the individual browser controlling code. This is commonly referred to as just a web driver.

The web driver needs to be downloaded, and then it can either be added to the PATH environment variable or initialized with a string containing the path to the downloaded driver. Environment variables are out of scope for this blog, so we are going to use the second option.

From here to the end, the Firefox web driver is going to be used, but Selenium provides drivers for the other major browsers as well and you can choose any of them; Firefox is recommended if you want to follow this blog.

Download the driver to a common folder which is accessible. Your script will refer to this driver.

You can follow our guide on how to install the web driver here.

A Simple Selenium Example in Python

Ok, we’re all set. To begin, let’s start with a quick example to ensure things are working. Our first example will involve collecting a website title. To achieve this we are going to use selenium; assuming it is already installed in your environment, just import webdriver from selenium in a Python file as shown in the following.
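A minimal sketch of that first script, assuming the Selenium 3-style API (executable_path) that matches the find_element_by_xpath calls used later in this post; the driver path is a placeholder:

```python
from selenium import webdriver

# point selenium at the geckodriver you downloaded (path is an example)
driver = webdriver.Firefox(executable_path='/path/to/geckodriver')

driver.get('https://www.google.com')   # open the page
print(driver.title)                    # print the page title to the console
```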

Running this code will open a Firefox window which looks a little bit different, as can be seen in the following image, and at the end it prints the title of the website to the console – in this case it is collecting data from ‘Google’. Results should be similar to the following image:

[Image: the Firefox window opened by the web driver, with the console showing the page title ‘Google’]

Note that this was run in the foreground so that you can see what is happening. For now we close the Firefox window manually; it was intentionally left open so you can see that the web driver actually navigates just like a human would. But now that this is clear, we can add driver.quit() at the end of our code so the window is closed automatically after the job is done. The code now looks like this.
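With the quit call added, the sketch from above becomes:

```python
from selenium import webdriver

driver = webdriver.Firefox(executable_path='/path/to/geckodriver')  # example path

driver.get('https://www.google.com')
print(driver.title)

driver.quit()   # close the browser window once we are done
```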

Now the sample will open the Firefox web driver, do its job and then close the window. With this simple example out of the way, we are ready to go deeper and work through a more complex sample.

How To Run Selenium in background

In case you are running your environment in a console only, or through PuTTY or another terminal, you may not have access to a GUI. Also, in an automated environment you will certainly want to run selenium without the browser window popping up – i.e. in silent or headless mode. This is done by adding an “options” object with the “--headless” argument at the start of the script.
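A sketch of the headless setup for Firefox, using selenium’s standard options module:

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')   # run Firefox without opening a window

driver = webdriver.Firefox(executable_path='/path/to/geckodriver',  # example path
                           options=options)
```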

The remaining examples will be run in ‘online’ mode so that you can see what is happening, but you can add the above snippet to help.

Example of Scraping a Dynamic Website in Python With Selenium

Up to here, we have figured out how to scrape data from a static website; with a little time and patience you are now able to collect data from static websites. Let’s now dive a little deeper into the topic and build a script to extract data from a webpage which is loaded dynamically.

Imagine that you were asked to collect a list of YouTube videos about “Selenium”. With that information we know that we are going to gather data from YouTube and that we need the search results for “Selenium” – but these results are dynamic and change all the time.

The first approach is to replicate what we did with Google, but now with YouTube, so a new file needs to be created: yt-scraper.py
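A minimal sketch of yt-scraper.py, mirroring the Google example:

```python
# yt-scraper.py
from selenium import webdriver

driver = webdriver.Firefox(executable_path='/path/to/geckodriver')  # example path

driver.get('https://www.youtube.com')
print(driver.title)
```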

Now we have the YouTube title printed, but we are about to add some magic to the code. Our next step is to find the search box and fill it with the word we are looking for, “Selenium”, simulating a person typing it into the search. This is done using the Keys class:

from selenium.webdriver.common.keys import Keys

The driver.quit() line is going to be commented out temporarily so we are able to see what we are doing.
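A sketch of the search step; the XPath used for the search box is the one explained in the XPath section that follows:

```python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox(executable_path='/path/to/geckodriver')  # example path
driver.get('https://www.youtube.com')
print(driver.title)

search_box = driver.find_element_by_xpath('//input[@id="search"]')
search_box.send_keys('Selenium')    # type the search term
search_box.send_keys(Keys.ENTER)    # press Enter to run the search

# driver.quit()   # commented out so we can watch the browser
```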

The Youtube page shows a list of videos from the search as expected!

As you might notice, a new function has been called, named find_element_by_xpath, which could be a bit confusing at the moment because it uses strange-looking xpath text. Let’s learn a little about XPath to understand it better.

What is XPath?

XPath is an XML path used for navigation through the HTML structure of the page. It is a syntax for finding any element on a web page using XML path expression. XPath can be used for both HTML and XML documents to find the location of any element on a webpage using HTML DOM structure.

The expression above shows how XPath can be used to find an element. In our example we had ‘//input[@id=”search”]’. This finds all <input> elements which have an attribute called “id” whose value is “search”. If you use “inspect element” on the search box on YouTube, you can see there’s a tag <input id=”search” … >. That’s exactly the element we’re searching for with XPath.

There is a great variety of ways to find elements within a website; here is the full list, which is recommended reading if you want to master web scraping.

Looping Through Elements with Selenium

Now that XPath has been explained, we can move to the next step: listing videos. So far we have code that opens https://youtube.com, types the word “Selenium” into the search box and hits the Enter key so the search is performed by the YouTube engine, resulting in a list of videos related to Selenium. Let’s now list them.

Firstly, right click and “inspect element” on the video section and find the element which is the start of the video section. You can see in the image below that it’s a <div> tag with “id=’dismissable'”

We want to grab the title, so within the video, find the tag that covers the title. Again, right click on the title and “inspect element” – here you can see the element “id=’video-title'”. Within this tag, you can see the text of the title.

One last thing: remember that we are working over the internet, so sometimes we need to wait for the data to become available. In this case we are going to wait 5 seconds after the search is performed and then retrieve the data we are looking for. Keep in mind that the results could vary with internet speed and device performance.
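Continuing the yt-scraper.py sketch above, a rough version of the waiting and listing step; the ‘dismissable’ and ‘video-title’ ids are the ones found via “inspect element” above:

```python
import time

time.sleep(5)   # crude wait for the search results to load

videos = driver.find_elements_by_xpath('//*[@id="dismissable"]')
print('Videos collected:', len(videos))

for video in videos:
    title = video.find_element_by_xpath('.//*[@id="video-title"]').text
    print(title)

driver.quit()
```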

Once the code is executed you are going to see a list printed containing videos collected from YouTube as shown in the following image, which firstly prints the website title, then it tells us how many videos were collected and finally, it lists those videos.

Waiting 5 seconds works, but then you have to adjust it for each internet speed. There is another mechanism you can use, which is to wait for the actual element to be loaded – you can use this with a try/except block instead.

So instead of the time.sleep(5), you can then replace the code with:
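A sketch of that explicit wait, reusing the ‘video-title’ id from above:

```python
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    # wait up to 5 seconds for at least one video title to be present
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.ID, 'video-title'))
    )
except TimeoutException:
    print('Timed out waiting for the videos to load')
```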

This will wait up to a maximum of 5 seconds for the videos to load, otherwise it’ll timeout

Conclusion

With Selenium you can perform an endless number of tasks, from automation to automated testing – the sky is the limit. You have learned how to scrape data from static and dynamic websites and how to perform browser actions such as sending keys like “Enter”. As a next step, you can also look at BeautifulSoup to extract and search through the data.


I love Dungeons and Dragons, and in this part of the post I scrape monster data from DnD Beyond with rvest and RSelenium. Step 1 uses rvest to pull the links to the individual monster pages from the pages of search results; each link’s href attribute gives the path we want. (Anatomy of an HTML link: <a href='https://www.something.com'>Link text seen on page</a>.)

  • The remainder of the function subsets the extracted links to only those that pertain to the monster pages (removing links like the home page). Printing the output indicates that these links are only relative links, so we append the base URL to create absolute links (abs_links).
  • Finally, we can loop through all pages of results to get the hundreds of pages for the individual monsters:
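A minimal sketch of that idea, with hypothetical selectors and URL patterns standing in for the author’s actual code, producing the all_monster_urls vector used later:

```r
library(rvest)

base_url <- "https://www.dndbeyond.com"
n_pages  <- 10   # replace with the actual number of results pages

get_monster_links <- function(page_url) {
  page  <- read_html(page_url)
  links <- page %>% html_nodes("a") %>% html_attr("href")
  # keep only links that point to individual monster pages
  links <- links[grepl("/monsters/", links)]
  # the links are relative, so prepend the base URL to get absolute links
  paste0(base_url, links)
}

all_monster_urls <- unlist(lapply(seq_len(n_pages), function(i) {
  get_monster_links(paste0(base_url, "/monsters?page=", i))
}))
```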

    Step 2: Use RSelenium to access pages behind login

    In Step 1, we looped through pages of tables to get the URLs for pages that contain detailed information on individual monsters. Great! We can visit each of these pages and just do some more rvest work to scrape the details! Well… not immediately. Most of these monster pages can only be seen if you have paid for the corresponding digital books and are logged in. DnD Beyond uses Twitch for authentication which involves a redirect. This redirect made it way harder for me to figure out what to do. It was like I had been thrown into the magical, mysterious, and deceptive realm of the Feywild where I frantically invoked Google magicks to find many dashed glimmers of hope but luckily a solution in the end.

    What did not work

    It’s helpful for me to record what things I tried and failed so I can remember my thought process. Hopefully, it saves you wasted effort if you’re ever in a similar situation.

    • Using rvest’s page navigation abilities did not work. I tried the following code:

    But I ran into an error:

    • Using rvest’s basic authentication abilities did not work. I found this tutorial on how to send a username and password to a form with rvest. I tried hardcoding the extremely long URL that takes you to a Twitch authentication page, sending my username and password as described in the tutorial, and following this Stack Overflow suggestion to create a fake login button, since the authentication page had an unnamed, unlabeled “Submit” input that did not seem to conform to rvest’s capabilities. I got a 403 error.

    What did work

    Only when I stumbled upon this Stack Overflow post did I learn about the RSelenium package. Selenium is a tool for automating web browsers, and the RSelenium package is the R interface for it.

    I am really grateful to the posters on that Stack Overflow question and this blog post for getting me started with RSelenium. The only problem is that the startServer function used in both posts is now defunct. When calling startServer, the message text informs you of the rsDriver function.

    Step 2a: Start automated browsing with rsDriver

    The amazing feature of the rsDriver function is that you do not need to worry about downloading and installing other software like Docker or phantomjs. This function works right out of the box! To start the automated browsing, use the following:
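A sketch of that pattern, modeled on the rsDriver help page example (rem_dr is the client object used in the next steps):

```r
library(RSelenium)

driver <- rsDriver(browser = "chrome")   # starts a selenium server and a Chrome session
rem_dr <- driver[["client"]]             # the client object used to drive the browser
```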

    When you first run rsDriver, status messages will indicate that required files are being downloaded. After that you will see the status text “Connecting to remote server” and a Chrome browser window will pop open. The browser window will have a message beneath the search bar saying “Chrome is being controlled by automated test software.” This code comes straight from the example in the rsDriver help page.

    Step 2b: Browser navigation and interaction

    The rem_dr object is what we will use to navigate and interact with the browser. This navigation and interaction is achieved by accessing and calling functions that are part of the rem_dr object. We can navigate to a page using the $navigate() function. We can select parts of the webpage with the $findElement() function. Once these selections are made, we can interact with the selections by

    • Sending text to those selections with $sendKeysToElement()
    • Sending key presses to those selections with $sendKeysToElement()
    • Sending clicks to those selections with $clickElement()

    All of these are detailed in the RSelenium Basics vignette, and further examples are in the Stack Overflow and blog post I mentioned above.

    The code below shows this functionality in action:
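A sketch of that pattern; the sign-in URL and the element selectors below are assumptions for illustration, not the author’s actual code:

```r
rem_dr$navigate("https://www.dndbeyond.com/sign-in")   # assumed login URL

# select the username and password boxes and type into them
user_box <- rem_dr$findElement(using = "css selector", "input#username")
user_box$sendKeysToElement(list("my_username"))

pass_box <- rem_dr$findElement(using = "css selector", "input#password")
pass_box$sendKeysToElement(list("my_password", key = "enter"))   # text plus a key press

# or click the submit button explicitly
login_btn <- rem_dr$findElement(using = "css selector", "button[type='submit']")
login_btn$clickElement()
```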

    Note: Once the Chrome window opens, you can finish the login process programmatically as above or manually interact with the browser window as you normally would. This can be safer if you don’t want to have a file with your username and password saved anywhere.

    Step 2c: Extract page source

    Now that we have programmatic control over the browser, how do we interface with rvest? Once we navigate to a page with $navigate(), we need to extract the page’s HTML source code to supply to rvest::read_html. We can extract the source with $getPageSource():
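A sketch of that hand-off (the monster URL is just an example):

```r
library(rvest)

rem_dr$navigate("https://www.dndbeyond.com/monsters/goblin")   # example page

# getPageSource() returns a list of length 1, hence the [[1]]
page_source <- rem_dr$getPageSource()[[1]]
page <- read_html(page_source)
```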

    The subset [[1]] is needed after calling rem_dr$getPageSource() because $getPageSource() returns a list of length 1. The HTML source that is read in can be directly input to rvest::read_html.

    Excellent! Now all we need is a function that scrapes the details of a monster page, and a loop! In the following, we put everything together in a loop that iterates over the vector of URLs (all_monster_urls) generated in Step 1.

    Within the loop we call the custom scrape_monster_page function to be discussed below in Step 3. We also include a check for purchased content. If you try to access a monster page that is not part of books that you have paid for, you will be redirected to a new page. We perform this check with the $getCurrentUrl() function, filling in a missing value for the monster information if we do not have access. The Sys.sleep at the end can be useful to avoid overloading your computer or if rate limits are a problem.
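A sketch of that loop, with scrape_monster_page standing in for the Step 3 function and the redirect check written as described above:

```r
monster_data <- vector("list", length(all_monster_urls))

for (i in seq_along(all_monster_urls)) {
  rem_dr$navigate(all_monster_urls[i])

  # a redirect means we have not purchased this content: record NA and move on
  if (rem_dr$getCurrentUrl()[[1]] != all_monster_urls[i]) {
    monster_data[[i]] <- NA
  } else {
    page <- read_html(rem_dr$getPageSource()[[1]])
    monster_data[[i]] <- scrape_monster_page(page)
  }

  Sys.sleep(2)   # pause between requests to avoid rate limits
}
```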

    Step 3: Write a function to scrape an individual page

    The last step in our scraping endeavor is to write the scrape_monster_page function to scrape data from an individual monster page. You can view the full function on GitHub. I won’t go through every aspect of this function here, but I’ll focus on some principles that appear in this function that I’ve found to be useful in general when working with rvest.

    Principle 1: Use SelectorGadget AND view the page’s source

    As useful as SelectorGadget is for finding the correct CSS selector, I never use it alone. I always open up the page’s source code and do a lot of Ctrl-F to quickly find specific parts of a page. For example, when I was using SelectorGadget to get the CSS selectors for the Armor Class, Hit Points, and Speed attributes, I saw the following:

    I wanted to know if there were further subdivisions of the areas that the .mon-stat-block__attribute selector had highlighted. To do this, I searched the source code for “Armor Class” and found the following:

    Looking at the raw source code allowed me to see that each line was subdivided by spans with classes mon-stat-block__attribute-label, mon-stat-block__attribute-data-value, and sometimes mon-stat-block__attribute-data-extra.

    With SelectorGadget, you can actually type a CSS selector into the text box to highlight the selected parts of the page. I did this with the mon-stat-block__attribute-label class to verify that there should be 3 regions highlighted.

    Because SelectorGadget requires hovering your mouse over potentially small regions, it is best to verify your selection by looking at the source code.

    Principle 2: Print often

    Continuing from the above example of desiring the Armor Class, Hit Points, and Speed attributes, I was curious what I would obtain if I simply selected the whole line for each attribute (as opposed to the three subdivisions). The following is what I saw when I printed this to the screen:
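Selecting the whole line for each attribute amounts to something like this sketch (not the exact code):

```r
page %>%
  html_nodes(".mon-stat-block__attribute") %>%
  html_text()
```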

    A mess! A length-3 character vector containing the information I wanted but not in a very tidy format. Because I want to visualize and explore this data later, I want to do a little tidying up front in the scraping process.


    What if I just access the three subdivisions separately and rbind them together? This is not a good idea because of missing elements as shown below:
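Selecting the three subdivisions separately looks roughly like this (a sketch):

```r
page %>% html_nodes(".mon-stat-block__attribute-label") %>% html_text()
page %>% html_nodes(".mon-stat-block__attribute-data-value") %>% html_text()
page %>% html_nodes(".mon-stat-block__attribute-data-extra") %>% html_text()
```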

    For attribute-label, I get a length-3 vector. For attribute-data-value, I get a length-3 vector. For attribute-data-extra, I only get a length-2 vector! Through visual inspection, I know that the third line, “Speed”, is missing the span with the data-extra class, but I don’t want to rely on visual inspection for these hundreds of monsters! Printing these results warned me directly that this could happen. Awareness of these missing items motivates the third principle.

    Principle 3: You will need loops

    For the Armor Class, Hit Points, and Speed attributes, I wanted to end up with a data frame that looks like this:

    This data frame has properly encoded missingness. To do this, I needed to use a loop as shown below.

    The code below makes use of two helper functions that I wrote to cut down on code repetition:

    • select_text to cut down on the repetitive page %>% html_nodes %>% html_text


    • replace_if_empty to replace empty text with NA
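Sketches of those two helpers (argument names are assumptions):

```r
library(rvest)

select_text <- function(node, css) {
  node %>% html_nodes(css) %>% html_text()
}

replace_if_empty <- function(x) {
  if (length(x) == 0) NA_character_ else x
}
```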

    I first select the three lines corresponding to these three attributes with
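Using the selector identified earlier, that selection is roughly:

```r
attribute_lines <- page %>% html_nodes(".mon-stat-block__attribute")
```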

    This creates a list of three nodes (pieces of the webpage/branches of the HTML tree) corresponding to the three lines of data:

    We can chain together a series of calls to html_nodes. I do this in the subsequent lapply statement. I know that each of these nodes contains up to three further subdivisions (label, value, and extra information). In this way I can make sure that these three pieces of information are aligned between the three lines of data.
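A sketch of the loop that puts these pieces together, using the helpers above (column names are illustrative):

```r
attribute_df <- do.call(rbind, lapply(attribute_lines, function(line) {
  data.frame(
    label = replace_if_empty(select_text(line, ".mon-stat-block__attribute-label")),
    value = replace_if_empty(select_text(line, ".mon-stat-block__attribute-data-value")),
    extra = replace_if_empty(select_text(line, ".mon-stat-block__attribute-data-extra")),
    stringsAsFactors = FALSE
  )
}))
```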

    Nearly all of the code in the scrape_monster_page function repeats these three principles, and I’ve found that I routinely use similar ideas in other scraping I’ve done with rvest.


    Summary

    This is a long post, but a few short take-home messages suffice to wrap ideas together:


    • rvest is remarkably effective at scraping what you need with fairly concise code. Following the three principles above has helped me a lot when I’ve used this package.
    • rvest can’t do it all. For scraping tasks where you wish that you could automate clicking and typing in the browser (e.g. authentication settings), RSelenium is the package for you. In particular, the rsDriver function works right out of the box (as far as I can tell) and is great for people like me who are loath to install external dependencies.


    Happy scraping!




