Getting All Text From Web Page Using Beautiful Soup

In this blog post, I will explain how to get only the text from a web page using Beautiful Soup 4. Beautiful Soup 4 provides the get_text() method, which returns all the text in a page. Take my own blog, kochi-coders.com, as an example. To get all the text on this blog, I can use the code below.

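from bs4 import BeautifulSoup
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

url = "http://www.kochi-coders.com"
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly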

In the above code, we created the soup object from the page at http://www.kochi-coders.com. To get all the text stored within the page, we can use the code below.

all_texts = soup.get_text()
print(all_texts)

In the above code, get_text() returns all the text content within the page.

Sample Output

Getting All Text From Web Page Using Beautiful Soup « kochi-coders.com
script
( function() {
    var query = document.location.search;
    if ( query && query.indexOf( 'preview=true' ) !== -1 ) {
        window.name = 'wp-preview-387';
    }
    if ( window.addEventListener ) {
        window.addEventListener( 'unload', function() { window.name = ''; }, false );
    }
}());
/script

But wait, are we seeing script tags and their contents in the output?

Yes, the JavaScript is present. This is because get_text() considers the contents of script elements to be text as well.

extract() removes a particular tag, along with all of its content, from the document.
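As a small illustration of extract(), the sketch below parses a made-up HTML snippet (the markup and the variable names are assumptions for this example, not taken from the blog page) and removes its single script element. extract() also hands back the element it removed, so it can be inspected afterwards.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration.
html = "<html><body><p>Hello</p><script>var x = 1;</script></body></html>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag (and everything inside it) from the tree
# and returns the detached tag.
removed = soup.find("script").extract()

print(removed.name)         # name of the tag that was taken out
print(soup.find("script"))  # None: no script element is left in the tree
print(soup.get_text())      # text of what remains
```

If the removed element is not needed afterwards, the return value can simply be ignored.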

Removing the JavaScript and printing only the text within a document can be achieved using the following:

[x.extract() for x in soup.find_all('script')]

The previous line of code removes all the script elements from the document. After this, print(soup.get_text()) prints only the text stored within the page.
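Putting the steps together, here is a minimal sketch that works on an HTML string rather than a live URL, so it can be run offline. The helper name visible_text and the sample markup are assumptions made for this example. It also removes style elements, which goes one step beyond the script removal described above, and it uses a plain loop instead of a list comprehension since the result of each extract() call is not needed. The separator and strip arguments of get_text() are used to tidy up the whitespace.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of an HTML document without script/style content.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Remove every script and style element together with its content.
    for tag in soup.find_all(["script", "style"]):
        tag.extract()
    # Join the remaining text fragments with single spaces and trim
    # whitespace from the start and end of each fragment.
    return soup.get_text(" ", strip=True)


sample = """
<html>
  <head>
    <title>Demo page</title>
    <script>window.name = '';</script>
  </head>
  <body><h1>Heading</h1><p>Some <b>bold</b> text.</p></body>
</html>
"""
print(visible_text(sample))
```

To use it on a real page, the HTML read from the URL can be passed to visible_text() in place of the sample string.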

Happy Scraping :-)

Published on December 09, 2015 08:18

Vineeth G. Nair's Blog
