Getting All Text From Web Page Using Beautiful Soup

In this blog post, I will explain how to get only the text from a web page using Beautiful Soup 4. Beautiful Soup 4 provides the get_text() method, which returns all the text in a web page. Consider my own blog, kochi-coders.com. To get all the text on this blog, I can use the code below.


from bs4 import BeautifulSoup

from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2.urlopen

url = "http://www.kochi-coders.com"

page = urlopen(url)

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page, "html.parser")


In the above code we have created the soup object from the page at http://www.kochi-coders.com. To get all the text stored within the page, we can use the code below.


all_texts = soup.get_text()

print(all_texts)


In the above code, get_text() returns all the text content within the page as a single string.
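
To try get_text() without fetching a live page, a minimal sketch along these lines may help. The HTML snippet and the variable names are made up for illustration; separator and strip are optional get_text() arguments that can tidy up the whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration (not taken from the blog).
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Hello</h1>
    <p>First <b>bold</b> paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Default call: every text node is concatenated, whitespace included.
raw_text = soup.get_text()

# separator joins the text pieces; strip=True trims each piece and drops empty ones.
clean_text = soup.get_text(separator=" ", strip=True)

print(clean_text)
```

The default call keeps the newlines and indentation of the source markup, so the second form is often easier to read when printing.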


Sample Output



Getting All Text From Web Page Using Beautiful Soup « kochi-coders.com


script

( function() {

var query = document.location.search;


if ( query && query.indexOf( 'preview=true' ) !== -1 ) {

window.name = 'wp-preview-387';

}


if ( window.addEventListener ) {

window.addEventListener( 'unload', function() { window.name = ''; }, false );

}

}());

/script


But wait, are we seeing script tags and their contents in the output?


Yes, we can see that JavaScript is present. This is because get_text() considers the contents of script tags to be text as well. extract() is used to remove a particular tag and all its content from the document.

Removing the JavaScript and printing only the text within a document can be achieved using the following:

for script in soup.find_all('script'):
    script.extract()

The previous lines of code will remove all the script elements from the document. After this, the print(soup.get_text()) line will print only the text stored within the page.
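
The same clean-up can be sketched on an inline snippet. The HTML below and the helper name visible_text are hypothetical. The sketch uses decompose(), which removes a tag from the tree like extract() but discards it instead of returning it, and it also drops style tags on the assumption that CSS is equally unwanted. Recent Beautiful Soup releases may already leave script contents out of get_text() with some parsers, so the explicit removal matters most on older versions or other parsers.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<html>
  <head>
    <title>Demo</title>
    <script>var query = document.location.search;</script>
    <style>p { color: red; }</style>
  </head>
  <body><p>Only this text should remain.</p></body>
</html>
"""


def visible_text(markup):
    """Return the page text with script and style elements removed (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    # find_all accepts a list of tag names; decompose() removes each tag and its contents.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


print(visible_text(html))
```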


Happy Scraping :-)

Published on December 09, 2015 08:18