Selenium Web Scraping Tutorial
Web scraping allows you to extract data from websites: the HTML is fetched and parsed automatically, and the extracted data can be converted into whatever format you like for storage or analysis. The technique is commonly used for data mining.
What is Selenium?
Selenium is an automation tool for web browsers. It is primarily used for testing websites, letting you exercise a site before you put it live. It gives you the chance to perform the following tasks on a website:
- Click buttons
- Enter information into forms on the website
- Search for information on the website
Selenium is also widely used for scraping websites. Note, however, that if you scrape a website too often you risk having your IP address banned from it, so approach with caution.
How to scrape with Selenium?
In order to scrape websites with Selenium you will need Python, either Python 3.x (recommended) or Python 2.x (no longer maintained). Once you have that installed you will need the following driver and packages:
Selenium package – lets you drive the browser from Python
Chrome Driver – the executable Selenium uses to launch and control Chrome
Virtualenv – creates an isolated Python environment
- Create a new project folder. Inside it, create a file named requirements.txt and list selenium in it as a dependency.
- Then open the command line and create a virtual environment by typing the following command: $ virtualenv webscraping_example
- Activate the environment and install the dependencies by typing the following commands in the terminal: $ source webscraping_example/bin/activate, then (webscraping_example) $ pip install -r requirements.txt
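The setup steps above can be run end to end as follows. This assumes python3 is on your PATH; the standard-library venv module is used here as a drop-in for virtualenv, since it needs no extra install.

```shell
# Create and activate an isolated environment
python3 -m venv webscraping_example
. webscraping_example/bin/activate
# Declare selenium as the project dependency, then install it
echo "selenium" > requirements.txt
pip install --quiet -r requirements.txt
```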
- Back in the project folder, create another file and name it webscraping_example.py. Once done, add the following imports:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
```
- Next, put Chrome in incognito mode. This is done in the webdriver by adding the incognito argument:

```python
option = webdriver.ChromeOptions()
option.add_argument("--incognito")
```
- You will then create a new browser instance, pointing executable_path at wherever your chromedriver binary lives:

```python
browser = webdriver.Chrome(executable_path='/Library/Application Support/Google/chromedriver', chrome_options=option)
```

  (In Selenium 4 the chrome_options parameter was renamed to options, and the driver path is passed via a Service object.)
- You can now start making requests: pass in the URL of the website you want to scrape.
- You may need to create a user account with GitHub to do this, but that is an easy process.
- You are now ready to scrape the data from the website.