Let’s Scrape the Page Using Beautiful Soup 4


I had been meaning to port my blog post about scraping a website with Beautiful Soup from Python 2.7 and Beautiful Soup 3 to Python 3 and Beautiful Soup 4. Thanks to Steve for his code, which made this easy for me. In this blog post I will be scraping the same website using Beautiful Soup 4.

Task: Extract the name and URL of every U.S. university listed on the University of Texas website, in CSV (comma-separated values) format.

Dependencies: Python 3 and Beautiful Soup 4

Script with Explanation

Importing Beautiful Soup 4

from bs4 import BeautifulSoup

This is a major difference between Beautiful Soup 3 and 4. In Beautiful Soup 3 the import was just

from BeautifulSoup import BeautifulSoup

In Beautiful Soup 4 the class is imported from the bs4 package instead.

Next we will import the urllib.request module for opening the URL.

import urllib.request

We now need to open the page at the URL below.

url="http://www.utexas.edu/world/univ/alpha/"
page = urllib.request.urlopen(url)
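
A network request can hang or fail, so it may be worth guarding the call. The following is a minimal sketch and not part of the original script: the helper name fetch_page and the 10-second default timeout are assumptions chosen for illustration.

```python
import urllib.request


def fetch_page(url, timeout=10):
    """Return the raw bytes at `url`, or None if the request fails.

    `fetch_page` and the default timeout are illustrative choices,
    not part of the original script.
    """
    try:
        # The context manager closes the connection once the body is read.
        with urllib.request.urlopen(url, timeout=timeout) as page:
            return page.read()
    except OSError as err:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print("Could not open", url, "-", err)
        return None
```

With a helper like this, the rest of the script can check for None before parsing instead of stopping with a traceback.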

Creating the Soup

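soup = BeautifulSoup(page.read(), "html.parser")

Naming the parser explicitly ("html.parser" ships with Python) avoids the warning Beautiful Soup 4 emits when no parser is given, and keeps the results consistent across machines.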

Finding the pattern in the page

Web scraping is effective only if we can find patterns in how a website marks up its content. For example, if you view the source of the University of Texas page, you can see that all university names share a common format, as shown in the screenshot below.

From the pattern we can see that every university is inside an <a> tag with the CSS class institution. So, to find all the universities, we need to find all the <a> tags whose class is institution. We can use the Beautiful Soup 4 find_all() method to accomplish this.

universities=soup.find_all('a',class_='institution')

In the line above, we used the find_all() method to find all the universities, that is, all the <a> tags with the class institution. Because class is a reserved word in Python, Beautiful Soup 4 provides the keyword argument class_ for searching by CSS class. We can iterate over the universities using the code below.

for university in universities:
    print(university['href']+","+university.string)

On this page, each university name is stored as the string of the <a> tag, and the URL is stored as its href attribute. So in the code above, university.string gives each university's name and university['href'] gives its URL.
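
To try these lookups without fetching the live page, the same calls can be run against a small inline HTML fragment. The markup below is made up for illustration and only mimics the pattern described above; the example.edu addresses are placeholders, not real entries from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the pattern described above.
html = """
<ul>
  <li><a class="institution" href="http://www.example.edu/a">Example University A</a></li>
  <li><a class="institution" href="http://www.example.edu/b">Example University B</a></li>
  <li><a href="http://www.example.edu/about">About this list</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Only <a> tags carrying the CSS class "institution" are returned,
# so the third link is skipped.
universities = soup.find_all("a", class_="institution")

rows = []
for university in universities:
    # .string is the tag's text; ['href'] reads the href attribute.
    rows.append(university["href"] + "," + university.string)

for row in rows:
    print(row)
```

If a link contains more than one child node (for example text plus a nested tag), .string is None and the concatenation fails; university.get_text() is a common alternative in that case.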

Putting it all together

The complete script for scraping the University of Texas website using Beautiful Soup 4 is shown below.

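from bs4 import BeautifulSoup
import urllib.request

# Open the page that lists the universities
url = "http://www.utexas.edu/world/univ/alpha/"
page = urllib.request.urlopen(url)

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(page.read(), "html.parser")

# Every university is an <a> tag with the CSS class "institution"
universities = soup.find_all('a', class_='institution')
for university in universities:
    # Print one "url,name" row per university
    print(university['href'] + "," + university.string)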
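
Joining the fields with a plain "," produces a broken row whenever a university name itself contains a comma. One possible variation, sketched here and not part of the original script, is to let Python's csv module handle the quoting. The helper name write_universities and the sample rows are hypothetical.

```python
import csv
import io


def write_universities(rows, out):
    """Write (url, name) pairs to `out` as CSV, quoting fields when needed."""
    writer = csv.writer(out)
    for url, name in rows:
        writer.writerow([url, name])


# Hypothetical sample data; in the script above the pairs would come from
# (university['href'], university.string) for each tag in `universities`.
sample = [
    ("http://www.example.edu/a", "Example University, Main Campus"),
    ("http://www.example.edu/b", "Example College"),
]

buffer = io.StringIO()
write_universities(sample, buffer)
print(buffer.getvalue())
```

To write to a file instead of an in-memory buffer, pass a file object opened with newline='' (for example open('universities.csv', 'w', newline=''), where the file name is only a placeholder).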

If you want more examples and sample code, I have authored a book on Beautiful Soup 4; you can find more details here