Getting All Text From Web Page Using Beautiful Soup
In this blog post, I will explain how to get only the text from a web page using Beautiful Soup 4. Beautiful Soup 4 provides the get_text() method, which returns all the text in a web page. Consider my own blog, kochi-coders.com. To get all the text on this blog, I can use the code below.
from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen
url = "http://www.kochi-coders.com"
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly
The code above creates the soup object from the page at http://www.kochi-coders.com. To get all the text stored within the page, we can use the code below.
all_texts = soup.get_text()
print(all_texts)
In the above code, get_text() returns all the text content within the page.
Sample Output
Getting All Text From Web Page Using Beautiful Soup « kochi-coders.com
script
( function() {
var query = document.location.search;
if ( query && query.indexOf( 'preview=true' ) !== -1 ) {
window.name = 'wp-preview-387';
}
if ( window.addEventListener ) {
window.addEventListener( 'unload', function() { window.name = ''; }, false );
}
}());
/script
But wait, are we seeing script tags and their contents in the output?
Yes, the JavaScript is present. This is because get_text() considers the contents of script tags to be text as well. To get rid of them we can use extract(), which removes a particular tag and all its content from the document.
Removing the JavaScript and printing only the text within a document can be achieved using the following:
for script in soup.find_all('script'):
    script.extract()
The previous lines of code remove all the script elements from the document. After this, print(soup.get_text()) will print only the text stored within the page.
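To try the whole idea without hitting the network, here is a minimal sketch that runs on an inline HTML string. The sample markup and the helper name visible_text are my own assumptions for illustration and are not part of Beautiful Soup. It also removes style tags, which usually carry no readable text either, and passes the optional separator and strip arguments of get_text() to tidy the whitespace. Recent Beautiful Soup releases may already leave script content out of get_text() on their own, in which case the explicit removal is simply harmless.

```python
from bs4 import BeautifulSoup

# Hypothetical page, inlined so the sketch runs without network access.
SAMPLE_HTML = """
<html>
  <head>
    <title>Sample page</title>
    <script>window.name = 'wp-preview';</script>
    <style>body { color: black; }</style>
  </head>
  <body>
    <h1>Hello</h1>
    <p>Some <b>visible</b> text.</p>
  </body>
</html>
"""


def visible_text(html):
    """Return the page text with <script> and <style> elements removed.

    visible_text is an illustrative helper, not a Beautiful Soup function.
    """
    soup = BeautifulSoup(html, "html.parser")
    # extract() detaches each tag, together with its contents, from the tree.
    for tag in soup.find_all(["script", "style"]):
        tag.extract()
    # separator puts each text fragment on its own line;
    # strip=True trims the fragments and drops whitespace-only ones.
    return soup.get_text(separator="\n", strip=True)


text = visible_text(SAMPLE_HTML)
print(text)
```

If you do not need the removed tags afterwards, decompose() can be used in place of extract(); it destroys the tag instead of returning it.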
Happy Scraping 


