A Brief Introduction to Web Scraping in Python

Web scraping is simply the process of extracting data from websites.
As a programmer, you will often need to extract data from websites, so web scraping is a skill worth having.
In this tutorial, you're going to learn how to perform web scraping in Python using the requests and BeautifulSoup libraries.
Throughout the tutorial, you will work through basic web scraping examples and then implement a simple web scraper that scrapes quotations from a website.
Requirements
To follow along with this tutorial, you need to have the following Python libraries installed on your system.
Installation
$ pip install requests
$ pip install beautifulsoup4
Requests
Requests is an elegant and simple HTTP library for Python, built for human beings.
We will use requests in our project, which scrapes quotations from a particular website.
BeautifulSoup
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
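As a quick taste, here is a minimal sketch (using a short inline HTML string for illustration) of parsing a document and navigating to a tag:

```python
from bs4 import BeautifulSoup

# Parse a small inline HTML string and navigate the resulting tree
soup = BeautifulSoup('<html><head><title>Demo</title></head></html>', 'html.parser')

print(soup.title)       # the whole <title> tag
print(soup.title.text)  # just its text: Demo
```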
Diving deeper into BeautifulSoup
For instance, let's use BeautifulSoup to extract data from the HTML file below (saved as sample.html).
<!DOCTYPE html>
<html>
<head>
<title>Document</title>
</head>
<body>
<div id = 'quotes'>
<p id = 'normal'>Time the time before the time times you</p>
<p id = 'normal'>The Future is now </p>
<p id = 'special'>Be who you wanted to be when you're younger</p>
<p id = 'special'>The world is reflection of who you're</p>
</div>
<div>
<p id = 'Languages'>Programming Languages</p>
<ul>
<li>Python</li>
<li>C++</li>
<li>Javascript</li>
<li>Golang</li>
</ul>
</div>
</body>
</html>
Extracting all paragraphs in HTML
Let's extract all paragraphs from sample.html.
app.py
from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

for paragraph in soup.find_all('p'):
    print(paragraph.text)
Output:
When you run the above program, it produces the following result.
$ python app.py
Time the time before the time times you
The Future is now
Be who you wanted to be when you're younger
The world is reflection of who you're
Programming Languages

Code Explanation

Importing the library

from bs4 import BeautifulSoup

The above line of code imports BeautifulSoup into our program.
Creating a BeautifulSoup object with HTML string
html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

The above two lines read sample.html and create a BeautifulSoup object ready for parsing the data within it.
The syntax for making a BeautifulSoup object is
soup = BeautifulSoup(html_string, 'html.parser')
Finding all paragraphs and printing them
for paragraph in soup.find_all('p'):
    print(paragraph.text)

The above two lines find all paragraph tags in the HTML file and display their text.
The BeautifulSoup object we just created provides tons of methods for navigating the parse tree to find the data we want.
One of those methods is find_all(); it accepts the name of a tag, parses through the HTML string to find every tag with that name, and returns them as a list.
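A small illustration of find_all() alongside its sibling find(), which returns only the first match, using an inline HTML string for brevity:

```python
from bs4 import BeautifulSoup

html = "<p>one</p><p>two</p><p>three</p>"
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('p')         # only the first matching tag
all_tags = soup.find_all('p')  # a list of every matching tag

print(first.text)     # one
print(len(all_tags))  # 3
```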
Extracting all lists in HTML
For instance, let's tweak the above program to display the list item text found in the HTML file.
app.py
from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

for item in soup.find_all('li'):
    print(item.text)
Output :
$ python app.py
Python
C++
Javascript
Golang
Extracting paragraphs with a specific id
Apart from returning all tags in an HTML string, we can check the attributes of those tags to get specific data. For instance:
Program to Extract paragraphs with an id of normal
app.py
from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

for paragraph in soup.find_all('p'):
    if paragraph.get('id') == 'normal':
        print(paragraph.text)
Output :
$ python app.py
Time the time before the time times you
The Future is now
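As a side note, find_all() can also filter by attributes directly, so the manual id check above can be avoided. A minimal sketch with an inline HTML string:

```python
from bs4 import BeautifulSoup

html = """
<p id='normal'>First quote</p>
<p id='special'>Second quote</p>
<p id='normal'>Third quote</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# Pass attributes as keyword arguments to filter while searching
for paragraph in soup.find_all('p', id='normal'):
    print(paragraph.text)
```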
Building Our Demo Project
So far we have seen how to extract data from a local HTML file; now let's see how to extract data from a live website.
In this project, we are going to implement a web scraper that scrapes quotations from a website at a given URL.
We are going to use the requests library to pull the HTML from the website and then parse that HTML using BeautifulSoup.
Website to scrape
Note: Don't just go out there and scrape whatever website you want. First research whether scraping that site is legal, and only then build your scraper for it.
In our demo project, we are going to use the below URL to scrape quotations.
URL = 'http://quotes.toscrape.com/'
scraper.py
import requests
from bs4 import BeautifulSoup

html = requests.get('http://quotes.toscrape.com/').text
soup = BeautifulSoup(html, 'html.parser')

for span in soup.find_all('span'):
    if span.string:
        print(span.string)
Output :
$ python scraper.py
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
"It is our choices, Harry, that show what we truly are, far more than our abilities."
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
"Try not to become a man of success. Rather become a man of value."
"It is better to be hated for what you are than to be loved for what you are not."
"I have not failed. I've just found 10,000 ways that won't work."
"A woman is like a tea bag; you never know how strong it is until it's in hot water."
"A day without sunshine is like, you know, night."
Hope you found this post interesting; don't forget to subscribe to get more tutorials like this.
In case of any suggestions or comments, drop them in the comment box and I will reply to you right away.
Originally published at https://kalebujordan.com on May 15, 2020.