Selenium Web Scrape



Selenium’s WebDriver module is the most important piece because it is what actually controls the browser. Controlling a browser comes with one requirement: you need to download a driver that matches the browser you are going to scrape with and make it available on the system path, a step covered in the prerequisites below.

Selenium is a browser automation tool that can also be used to scrape information from HTML pages, for example by driving Google Chrome. Scraping should be done carefully: extracting information is often against a website’s terms of service, so check before you scrape.

Introduction

Web scraping is something that will come up sooner or later in a developer’s career – be it for a business requirement or for personal use. So, how do we tackle web scraping when the time comes?

First, let’s split the scraping task into three subtasks:

  • Getting the data
  • Parsing the data to get our results
  • Outputting the results (most commonly to an Excel spreadsheet or CSV file)

Based on the subtasks, the most obvious choice would be to use Python for web scraping because:

  • It’s easy to set up for a developer
  • There are a lot of well-documented libraries (or even full frameworks, such as Scrapy) to help us out with the subtasks
    • Getting the data – Requests for static websites, Selenium with Python for dynamic websites (preferred)
    • Parsing the data – Beautiful Soup
    • Outputting the results – csv, pandas or xlsxwriter
  • As a consequence we can accomplish our goal without writing a lot of code
Wait, why not Python?

There are a couple of issues that come with the Python approach.

Validating output

Validating the output of our Python script can prove to be a time-consuming task. Every time we run our script we have to open the generated output file to see if it matches our demands.

This isn’t an overwhelming issue if we have to scrape only one website with simple markup where there isn’t much room for mistake when writing the script. However, if we have to scrape multiple complex sites (pagination, AJAX, the data needs a lot of work to format it according to our liking) suddenly we lose a lot of time just by validating the results from the generated file(s).

Sharing scraped data

The other issue that comes to mind is sharing the data with colleagues/friends that are not developers.

An obvious solution would be to send a new copy of the data every time you run the script. However, this approach is not in accordance with the developer mindset as we want to automate tasks as much as possible.

An alternative would be to set up Python on their machine, send them the script, and explain how to execute it. That means you’d have to set aside a respectable amount of time (especially if you have to do it multiple times), or even worse, your colleagues/friends might refuse to set up Python or learn what they have to do to get the data when they want it.

This poses the question: Is there an alternative way in which we could leverage something as useful as Selenium for web scraping, while at the same time overcome the aforementioned issues?

The answer is yes: QueryStorm lets you use C# inside of Excel and solves both issues, and there is also a Selenium NuGet package we can use for scraping. In addition, you don’t have to worry about writing the code for outputting the results to a CSV/XLSX file.

How does QueryStorm solve these issues?

Since you can use C# literally inside of Excel with QueryStorm, you don’t have to keep re-opening files to validate them – they are already in the same window as the (QueryStorm) IDE.

Also, the C# code is a part of the Excel workbook, so sharing is easy – just share the workbook file. The only setup the recipient needs to do is install the QueryStorm runtime with a simple (and small – 4 MB) installer and that’s it, they are ready to run the scraper on their own.

Enough talk, let’s get started and see how we would accomplish this!

Prerequisites and creating a workbook project

Of course, you need QueryStorm to follow this tutorial. You can generate a trial key and download (and install) the full version from the downloads page.

As with every “web scraping with Selenium” tutorial, you have to download the appropriate driver to interface with the browser you’re going to use for scraping. Since we’re using Chrome, download the driver for the version of Chrome you’re using. The next step is to add it to the system path. For more details check the official Selenium documentation. (Note: If you’re sharing the scraper, the recipient must also download the driver and add it to the system path.)

Open up Excel with a blank workbook, select the QueryStorm tab and click on the C# script button.

This will create a new workbook project and open it in the QueryStorm IDE.

Adding the Selenium NuGet package

To work with Selenium, we have to add the Selenium NuGet package to the project. Open the package manager either by clicking on the Manage packages button in the ribbon or by right clicking the project in the IDE and selecting the Manage packages context menu item.

Next, select the NuGet Packages tab item and search for the Selenium.WebDriver package. Now we can install the package by pressing the blue install package button. Wait for the package to be installed and exit the Package manager.


Navigating to a URL with Selenium

Now we’re ready to write some code. Let’s start off by creating an instance of a Chrome WebDriver (the driver is an IDisposable object, so it should be instantiated in a using statement) and navigating to a URL (I’ll be using this scraping test site). Additionally, let’s wait for 5 seconds before the browser is closed.
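A minimal C# sketch of this step is shown below. The exact test-site URL isn’t preserved in this excerpt, so the URL here is a placeholder; the driver is disposed automatically when the using block ends:

```csharp
using System.Threading;
using OpenQA.Selenium.Chrome;

// The driver is IDisposable, so wrap it in a using statement.
using (var driver = new ChromeDriver())
{
    // Placeholder URL; substitute the scraping test site linked in the tutorial.
    driver.Navigate().GoToUrl("https://example.com/test-site");
    Thread.Sleep(5000); // wait 5 seconds before the browser is closed
}
```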

You can run the script by pressing the Run button in the ribbon or by pressing F5.

If you’ve downloaded the driver and correctly added the directory with the driver to the system path, this example should’ve opened a Chrome window, navigated to the specified URL and closed after 5 seconds.

Scraping item names

Let’s say that instead of waiting for 5 seconds and closing the browser, we want to get the names of the top items being scraped right now from the example homepage. In addition, we can also save them to the current workbook. How do we do that?

First, create a table in the spreadsheet (select the cell(s) and press Ctrl+T), name it ResultsTable and name the column Results.

Now return to the C# script and start typing ResultsTable. You’ll see the table is shown in the code completion window and we can use it to save our results. How cool is that?

To get the complete names of the items, we have to find the item elements with the driver’s FindElements method. Therefore, to find the elements we have to supply the CSS selector that specifies the items. You can find the selector with the help of Chrome’s DevTools (it’s h4 > a). Finally, to get the complete item name, we have to get the element’s title attribute (with the IWebElement’s GetAttribute method).
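A sketch of this step, building on the previous example. The Selenium calls are the standard .NET API; writing into ResultsTable goes through QueryStorm’s workbook-table bindings, which aren’t shown in this excerpt, so that part is only indicated in a comment:

```csharp
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using (var driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("https://example.com/test-site"); // placeholder URL

    // Find the item links and read the full name from each element's title attribute.
    var names = driver
        .FindElements(By.CssSelector("h4 > a"))
        .Select(element => element.GetAttribute("title"))
        .ToList();

    // Write `names` into the Results column of ResultsTable here, using the
    // table binding QueryStorm offers in code completion (API not shown in this excerpt).
}
```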

Now when we run the script, we can see the results in our ResultsTable! We’ve managed to scrape useful data in just a couple of lines of code.


We’ll expand on this example and scrape more data in the next part of the tutorial.

Advanced

Web scraping is a very useful mechanism for either extracting data from websites or automating actions on them. Normally we would use urllib or requests to do this, but things start to fail when websites use JavaScript to render the page rather than static HTML. For many websites the information is stored in static HTML files, but for others the information is loaded dynamically through JavaScript (e.g. from AJAX calls). The reason may be that the information is constantly changing, or it may be to prevent web scraping! Either way, you need more advanced techniques to scrape the information – this is where the selenium library can help.

What is web scraping?

To align on terminology: web scraping, also known as web harvesting or web data extraction, is data scraping used to extract data from websites. A web scraping script may access the URL directly using HTTP requests or by simulating a web browser. The second approach is exactly how selenium works – it simulates a web browser. The big advantage of simulating a browser is that the website renders fully, whether it uses JavaScript or static HTML files.

What is selenium?

According to the official Selenium web page, it is a suite of tools for automating web browsers. The project is a member of the Software Freedom Conservancy and consists of three sub-projects, each providing different functionality; if you are interested, visit the official website. The scope of this blog is limited to the Selenium WebDriver project.

When should you use selenium?

Selenium is going to facilitate us with tools to perform web scraping, but when should it be used? You generally can use selenium in the following scenarios:

  • When the data is loaded dynamically – for example Twitter. What you see in “view source” is different from what you see on the page (the reason is that “view source” just shows the static HTML. If you want to see under the covers of a dynamic website, right click and “inspect element” instead)
  • When you need to perform an interactive action in order to display the data on screen – a classic example is infinite scrolling. For some websites, you need to scroll to the bottom of the page, and then more entries will show. What happens behind the scenes is that when you scroll to the bottom, JavaScript code calls the server to load more records on screen.

So why not use selenium all the time? It is a bit slower than using requests and urllib. The reason is that selenium simulates running a full browser, including the overhead that brings with it. There are also a few extra steps required to use selenium, as you can see below.

Once you have the data extracted, you can still use similar approaches to process it (e.g. using tools such as BeautifulSoup).

Pre-requisites for using selenium

Step 1: Install selenium library

Before starting with a web scraping sample, ensure that all requirements have been met. Selenium requires pip or pip3 to be installed; if you don’t have it, you can follow the official guide to install it based on the operating system you have.

Once pip is installed you can proceed with the installation of selenium, with the following command:
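```bash
pip install selenium
```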


Alternatively, you can download the PyPI source archive (selenium-x.x.x.tar.gz) and install it using setup.py:
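```bash
# Extract the archive (keeping the x.x.x version placeholder), then install.
tar xvf selenium-x.x.x.tar.gz
cd selenium-x.x.x
python setup.py install
```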

Step 2: Install web driver

Selenium simulates an actual browser. It won’t use your Chrome installation directly; instead it talks to a “driver”, a separate program that controls the browser engine. Selenium supports multiple web browsers, so you may choose which web browser to use (read on).

Selenium WebDriver refers to both the language bindings and the implementations of the individual browser controlling code. This is commonly referred to as just a web driver.

The web driver needs to be downloaded, and then it can either be added to the PATH environment variable or passed as a string containing the path to the downloaded driver when the driver object is initialized. Environment variables are out of the scope of this blog, so we are going to use the second option.

From here to the end, the Firefox web driver is going to be used, but you can choose any supported browser and its driver: Chrome uses chromedriver, Firefox uses geckodriver, Edge uses msedgedriver, and Safari ships with safaridriver. Firefox is recommended for following this blog.

Download the driver to a common folder which is accessible. Your script will refer to this driver.

You can follow our guide on how to install the web driver here.


A Simple Selenium Example in Python

Ok, we’re all set. To begin with, let’s start with a quick example to ensure things are all working. Our first example involves collecting a website title. To achieve this we are going to use selenium; assuming it is already installed in your environment, just import webdriver from selenium in a Python file, as shown in the following.
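A minimal sketch of that first script, using the Selenium 3-era API this post relies on; the geckodriver path is a placeholder:

```python
from selenium import webdriver

# Placeholder path; point this at wherever you downloaded geckodriver.
driver = webdriver.Firefox(executable_path="/path/to/geckodriver")
driver.get("https://www.google.com")
print(driver.title)  # prints the website title, e.g. "Google"
```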

Running this code opens a Firefox window, which looks a little bit different from a normal browser window, and at the end it prints the title of the website to the console – in this case, it is collecting the title from ‘Google’.

Note that this was run in the foreground so that you can see what is happening. For now we close the opened Firefox window manually; it was intentionally left open so you can see that the web driver actually navigates just like a human would. But now that we know that, we can add driver.quit() at the end of the script so the window is closed automatically after the job is done. The code now looks like this:
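The same sketch with the cleanup added:

```python
from selenium import webdriver

driver = webdriver.Firefox(executable_path="/path/to/geckodriver")  # placeholder path
driver.get("https://www.google.com")
print(driver.title)
driver.quit()  # close the browser window automatically when done
```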

Now the sample will open the Firefox web driver, do its job, and then close the window. With this little and simple example, we are ready to go deeper and learn with a more complex sample.

How To Run Selenium in background

In case you are running your environment from a console only, or through PuTTY or another terminal, you may not have access to a GUI. Also, in an automated environment you will certainly want to run selenium without the browser popping up – e.g. in silent or headless mode. This is where you add an options object at the start of the script with the "--headless" argument.
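A sketch of the headless setup for Firefox, under the same assumptions as before:

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # run the browser without a visible window
driver = webdriver.Firefox(options=options, executable_path="/path/to/geckodriver")
```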

The remaining examples will be run in ‘online’ mode so that you can see what is happening, but you can add the above snippet to help.

Example of Scraping a Dynamic Website in Python With Selenium

Up to here, we have figured out how to scrape data from a static website, and with a little time and patience you are now able to collect data from static websites. Let’s now dive a little deeper into the topic and build a script to extract data from a webpage which is dynamically loaded.

Imagine that you were asked to collect a list of YouTube videos regarding “Selenium”. With that information, we know that we are going to gather data from YouTube and that we need the search results for “Selenium”, but these results will be dynamic and will change all the time.

The first approach is to replicate what we have done with Google, but now with YouTube, so a new file needs to be created: yt-scraper.py
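A sketch of that first pass, with the same placeholder driver path as before:

```python
# yt-scraper.py -- open YouTube and print the page title.
from selenium import webdriver

driver = webdriver.Firefox(executable_path="/path/to/geckodriver")  # placeholder path
driver.get("https://www.youtube.com")
print(driver.title)
driver.quit()
```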

Now we are retrieving the YouTube page title and printing it, but we are about to add some magic to the code. Our next step is to find the search box and fill it with the word we are looking for, “Selenium”, simulating a person typing it into the search. This is done by using the Keys class:


from selenium.webdriver.common.keys import Keys

The driver.quit() line is going to be commented out temporarily so we are able to see what we are performing:
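A sketch of this step; the XPath for the search box is the one explained in the next section:

```python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox(executable_path="/path/to/geckodriver")  # placeholder path
driver.get("https://www.youtube.com")
print(driver.title)

# Find the search box and simulate a person typing the query and hitting Enter.
search_box = driver.find_element_by_xpath('//input[@id="search"]')
search_box.send_keys("Selenium")
search_box.send_keys(Keys.ENTER)

# driver.quit()  # commented out temporarily so the window stays open
```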

The YouTube page shows a list of videos from the search, as expected!

As you might notice, a new function has been called, named find_element_by_xpath, which could be kind of confusing at the moment as it uses strange XPath text. Let’s learn a little bit about XPath to understand it a bit more.

What is XPath?

XPath is a syntax for finding any element on a web page using path expressions. It can be used for both HTML and XML documents, locating any element by navigating the document’s DOM structure.

In the above example we had '//input[@id="search"]'. This finds all <input> elements which have an attribute called "id" whose value is "search". If you use “inspect element” on the search box on YouTube, you can see there’s a tag <input id="search" … >. That’s exactly the element we’re searching for with XPath.

There is a great variety of ways to find elements within a website; here is the full list, which is recommended reading if you want to master the web scraping technique.

Looping Through Elements with Selenium


Now that XPath has been explained, we can move on to the next step: listing videos. So far we have code that opens https://youtube.com, types the word “Selenium” into the search box, and hits the Enter key so the search is performed by the YouTube engine, resulting in a bunch of videos related to Selenium. Let’s now list them.

Firstly, right click and “inspect element” on the video section and find the element which is the start of the video section. You can see that it’s a <div> tag with id="dismissable".

We want to grab the title, so within the video, find the tag that covers the title. Again, right click on the title and “inspect element” – here you can see an element with id="video-title". Within this tag, you can see the text of the title.

One last thing: remember that we are working with the internet and web browsing, so sometimes we need to wait for the data to become available. In this case, we are going to wait 5 seconds after the search is performed and then retrieve the data we are looking for. Keep in mind that the results could vary due to internet speed and device performance.
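A sketch of the full listing script under the same assumptions (placeholder driver path, Selenium 3-era element lookups):

```python
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox(executable_path="/path/to/geckodriver")  # placeholder path
driver.get("https://www.youtube.com")
print(driver.title)

search_box = driver.find_element_by_xpath('//input[@id="search"]')
search_box.send_keys("Selenium")
search_box.send_keys(Keys.ENTER)

time.sleep(5)  # crude wait for the search results to load

# Collect every element carrying a video title and print its text.
videos = driver.find_elements_by_xpath('//*[@id="video-title"]')
print(f"{len(videos)} videos collected")
for video in videos:
    print(video.text)

driver.quit()
```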

Once the code is executed you are going to see a printed list of the videos collected from YouTube: it first prints the website title, then tells us how many videos were collected, and finally lists those videos.

Waiting for 5 seconds works, but then you have to adjust it for each internet speed. There’s another mechanism you can use, which is to wait for the actual element to be loaded – you can use this with a try/except block instead.

So instead of the time.sleep(5), you can replace the code with:
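A sketch of the explicit wait, using Selenium’s WebDriverWait and expected conditions:

```python
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

try:
    # Block until the video-title elements are present, up to 5 seconds.
    videos = WebDriverWait(driver, 5).until(
        EC.presence_of_all_elements_located((By.XPATH, '//*[@id="video-title"]'))
    )
except TimeoutException:
    print("Timed out waiting for the videos to load")
    driver.quit()
```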

This will wait up to a maximum of 5 seconds for the videos to load; otherwise it will time out.

Conclusion


With Selenium you are going to be able to perform an endless number of tasks, from automation to automated testing – the sky is the limit here. You have learned how to scrape data from static and dynamic websites and how to perform browser actions such as sending keys like “Enter”. Next, you can also look at BeautifulSoup to extract and search through the data.

