A Brief Introduction to Web Scraping in Python

Kalebu Jordan
May 15, 2020

Web scraping is the practice of extracting data from websites.

As a programmer, you will often need to pull data from websites, so web scraping is a useful skill to have.

In this tutorial, you’re going to learn how to perform web scraping in Python using the requests and BeautifulSoup libraries.

Throughout the tutorial, you will work through basic web scraping examples and then implement a simple web scraper that scrapes quotations from a website.

Requirements

To follow along with this tutorial, you need the following Python libraries installed on your system: requests and beautifulsoup4.

Installation

$ pip install requests 
$ pip install beautifulsoup4
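
To confirm that both packages installed correctly, a quick check along these lines may help. It is only a sketch: it imports each library and prints the version string that the library reports.

```python
# Import both libraries; an ImportError here means the install did not succeed.
import bs4
import requests

versions = {"beautifulsoup4": bs4.__version__, "requests": requests.__version__}
for name, version in versions.items():
    print(f"{name} {version}")
```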

Requests

Requests is an elegant and simple HTTP library for Python, built for human beings.

We will use requests in our project to download the page from which we scrape quotations.

BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
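
As a rough illustration of those three ideas (navigating, searching, and modifying), here is a minimal sketch. The inline HTML string is made up for this example and is not part of the tutorial's sample file.

```python
from bs4 import BeautifulSoup

# A tiny inline document (made up for this illustration).
html = "<html><head><title>Document</title></head><body><p>The Future is now</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Navigating: walk the tree with attribute access.
title_text = soup.title.string

# Searching: find the first <p> tag.
paragraph = soup.find("p")
original_text = paragraph.text

# Modifying: replace the paragraph's text in the parse tree.
paragraph.string = "The Future is here"
updated_html = str(paragraph)

print(title_text)
print(original_text)
print(updated_html)
```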

Diving deeper into BeautifulSoup

For instance, let’s use BeautifulSoup to extract data from the HTML file below (saved as sample.html)

<!DOCTYPE html>
<html>
<head>
    <title>Document</title>
</head>
<body>
    <div id='quotes'>
        <p id='normal'>Time the time before the time times you</p>
        <p id='normal'>The Future is now </p>
        <p id='special'>Be who you wanted to be when you're younger</p>
        <p id='special'>The world is reflection of who you're</p>
    </div>
    <div>
        <p id='Languages'>Programming Languages</p>
        <ul>
            <li>Python</li>
            <li>C+++</li>
            <li>Javascript</li>
            <li>Golang</li>
        </ul>
    </div>
</body>
</html>

Extracting all paragraphs in HTML

Let’s extract all paragraphs from sample.html.

app.py

from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')
for paragraph in soup.find_all('p'):
    print(paragraph.text)

Output:

When you run the program above, it produces the following result.

$ python app.py 
Time the time before the time times you
The Future is now
Be who you wanted to be when you're younger
The world is reflection of who you're
Programming Languages

Code Explanation

Importing Library

from bs4 import BeautifulSoup

This line imports the BeautifulSoup class from the bs4 package into our program.

Creating a BeautifulSoup object with HTML string

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

These two lines read sample.html and create a BeautifulSoup object that is ready to be searched.

The syntax for creating a BeautifulSoup object is:

soup = BeautifulSoup(html_string, 'html.parser')
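
BeautifulSoup also accepts an open file handle in place of a string. The sketch below assumes that approach and wraps it in a with block so the file is closed automatically. The temporary file and its contents exist only to make the example self-contained.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Create a throwaway HTML file so the example runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example.html"
    path.write_text("<p id='normal'>The Future is now</p>", encoding="utf-8")

    # The with block closes the file handle even if parsing fails.
    with open(path, encoding="utf-8") as handle:
        soup = BeautifulSoup(handle, "html.parser")

# The document was fully read during construction, so soup is still usable here.
first_paragraph = soup.p.text
print(first_paragraph)
```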

Finding all paragraphs and printing them

for paragraph in soup.find_all('p'):
    print(paragraph.text)

These two lines find all paragraphs in the HTML file and display their text.

The BeautifulSoup object we just created provides many methods for searching the parsed document to find the data we want.

One of those methods is find_all(). It accepts the name of a tag, searches the parsed HTML for every tag with that name, and returns them.
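
To make the behaviour of find_all() a little more concrete, here is a small sketch on an inline HTML string (made up for this illustration). It also shows the limit argument and the related find() method, which returns only the first match.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <p>First</p>
  <p>Second</p>
  <p>Third</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_paragraphs = soup.find_all("p")      # every <p> tag
first_two = soup.find_all("p", limit=2)  # stop after two matches
first_only = soup.find("p")              # only the first match, or None
missing = soup.find_all("table")         # no matches gives an empty result

print(len(all_paragraphs), len(first_two), first_only.text, len(missing))
```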

Extracting all list items in HTML

For instance, let’s tweak the program above to display the text of the list items found in the HTML file

app.py

from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all('li'):
    print(item.text)

Output:

$ python app.py
Python
C+++
Javascript
Golang
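
If you would rather collect the items into a Python list than print them one by one, a list comprehension is one option. This is a minimal sketch on an inline HTML string (made up here, with extra whitespace added to show what strip=True does).

```python
from bs4 import BeautifulSoup

html = "<ul><li> Python </li><li>Javascript</li><li>Golang</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) trims surrounding whitespace from each item.
languages = [item.get_text(strip=True) for item in soup.find_all("li")]
print(languages)
```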

Extracting paragraphs with a specific id

Apart from returning all tags of a given name, we can also filter on the attributes of those tags to get more specific data. For instance:

Program to extract paragraphs with an id of normal

app.py

from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')
for paragraph in soup.find_all('p'):
    # .get() returns None instead of raising KeyError when a tag has no id
    if paragraph.get('id') == 'normal':
        print(paragraph.text)

Output:

$ python app.py 
Time the time before the time times you
The Future is now
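
As an alternative to checking the id inside the loop, find_all() can do the filtering itself through keyword arguments or an attrs dictionary. The sketch below uses an inline copy of two paragraphs from the sample file, plus one extra paragraph without an id that is added here purely for illustration.

```python
from bs4 import BeautifulSoup

# Inline HTML so the sketch runs on its own; the last <p> is not in sample.html.
html = """
<div id='quotes'>
  <p id='normal'>The Future is now</p>
  <p id='special'>The world is reflection of who you're</p>
  <p>No id on this one</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword form: match <p> tags whose id attribute equals 'normal'.
normal = [p.text for p in soup.find_all("p", id="normal")]

# Dictionary form: equivalent filter, handy for attribute names that are not valid keywords.
special = [p.text for p in soup.find_all("p", attrs={"id": "special"})]

print(normal)
print(special)
```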

Building Our Demo Project

So far we have seen how to extract data from a local HTML file. Now let’s see how we can extract data from a live website.

In this project, we are going to implement a web scraper that scrapes quotations from a website at a given URL.

We are going to use the requests library to pull the HTML from the website and then parse that HTML using BeautifulSoup.
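
The two steps can be sketched as a pair of small functions. The names fetch_html and page_title are hypothetical helpers introduced for this illustration, and the 10 second timeout is an arbitrary assumption.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


def page_title(html):
    """Return the text of the <title> tag, or None if the page has no title."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None
```

Calling page_title(fetch_html(url)) with a real URL downloads the page and returns its title. Wrapping that call in try/except requests.RequestException is one way to handle network failures.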

Website to Scrape

Note: Don’t scrape just any website you want. First research what kind of scraping is permitted and legal for that site, and then build your scraper accordingly.
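
One signal you can check programmatically is the site's robots.txt file, using Python's standard urllib.robotparser module. This is only a sketch and not a substitute for reading a site's terms: the robots.txt content and the example.com URLs below are hypothetical and inlined so that the code needs no network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) reports whether the rules allow that URL.
allowed = parser.can_fetch("*", "http://example.com/quotes/")
blocked = parser.can_fetch("*", "http://example.com/private/page")
print(allowed, blocked)
```

Against a real site you would call parser.set_url() with the address of its robots.txt and then parser.read() instead of parse().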

In our demo project, we are going to scrape quotations from the URL below

URL = 'http://quotes.toscrape.com/'

scraper.py

import requests
from bs4 import BeautifulSoup

html = requests.get('http://quotes.toscrape.com/').text
soup = BeautifulSoup(html, 'html.parser')
for span in soup.find_all('span'):
    # .string is None when a span has more than one child node, so those spans are skipped
    if span.string:
        print(span.string)

Output:

$ python scraper.py 
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
"It is our choices, Harry, that show what we truly are, far more than our abilities."
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
"Try not to become a man of success. Rather become a man of value."
"It is better to be hated for what you are than to be loved for what you are not."
"I have not failed. I've just found 10,000 ways that won't work."
"A woman is like a tea bag; you never know how strong it is until it's in hot water."
"A day without sunshine is like, you know, night."

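Looping over every span works, but it depends on which spans happen to hold plain text. A more targeted variant is sketched below. It assumes each quote on the page sits in a div with class quote, with the text in a span of class text and the author in a small tag of class author. Those class names are an assumption about the site's markup, and the inline snippet is hand-written to imitate it so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hand-written snippet that imitates the assumed markup of one quote block.
sample_html = """
<div class="quote">
  <span class="text">"A day without sunshine is like, you know, night."</span>
  <span>by <small class="author">Steve Martin</small></span>
</div>
"""


def extract_quotes(html):
    """Return (text, author) pairs for every quote block found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    for block in soup.find_all("div", class_="quote"):
        text = block.find("span", class_="text")
        author = block.find("small", class_="author")
        # Skip blocks that lack either piece.
        if text is not None and author is not None:
            quotes.append((text.get_text(strip=True), author.get_text(strip=True)))
    return quotes


quotes = extract_quotes(sample_html)
print(quotes)
```

To run it against the live page, pass the HTML downloaded with requests to extract_quotes() instead of sample_html.
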
I hope you found this post interesting. Don’t forget to subscribe to get more tutorials like this.

If you have any suggestions or comments, drop them in the comment box and I will reply to you immediately.

Originally published at https://kalebujordan.com on May 15, 2020.
