Web Crawling and Data Mining with Apache Nutch

This book is a user-friendly guide that covers all the necessary steps and examples related to web crawling and data mining using Apache Nutch. "Web Crawling and Data Mining with Apache Nutch" is aimed at data analysts, application developers, web mining engineers, and data scientists. It is a good start for those who want to learn how web crawling and data mining are applied in the current business world. Some prior knowledge of web crawling and data mining is an added benefit.

Paperback

First published December 24, 2013


Ratings & Reviews

Community Reviews

5 stars: 1 (6%)
4 stars: 5 (33%)
3 stars: 4 (26%)
2 stars: 0 (0%)
1 star: 5 (33%)
Displaying 1 - 5 of 5 reviews
1 review
April 23, 2014
This book is poorly written, badly organised, and full of incorrect, incomplete and misleading statements. It touches on a variety of topics and technologies that are related but not expected to dominate a book with this title. It reads more like a set of learning notes from the author's first encounter with each technology than an expert's coverage of a complex topic.

The full review is on our blog: http://www.atlantbh.com/book-review-w...
Chris
2 reviews, 4 followers
August 13, 2016
After finishing Web Crawling and Data Mining with Apache Nutch, I can't help but feel that less than half of the book was actually about Apache Nutch. While I accept that talking about how Nutch stores its crawl data is necessary, do we really need an introduction on how to install MySQL and Apache Accumulo? It is even less compelling when most of the part about installing Accumulo is copied directly from the referenced blog post.

The authors have, however, gone through the trouble of compiling information scattered through the documentation and various blog posts into one book. I would like it if the book were better organized though. It feels jumpy, repetitive, and unstructured.

It jumps back and forth between Nutch 1.x and Nutch 2.x, often without mentioning which version is being discussed. It would probably have made more sense for the authors to split it into two books, one dedicated to each version, than to try to mash them together so haphazardly.
Arthur
97 reviews, 5 followers
January 10, 2014
In our age of data explosion, it is increasingly appealing, if not necessary, to scout the myriad pages of what seems like a shrinking World Wide Web. Even if you are not tasked with crawling a subset of web pages today, you may want to grab a copy of Web Crawling and Data Mining with Apache Nutch to be well prepared in advance.

Helpfully, the book is not excessively long, so even if you are in a hurry you can cover the scope you need in a short time. Be aware that the book concentrates heavily on making related software components communicate with each other and devotes a significant portion to general setup, so you may need to check for changes in how to integrate or install the parts if you work with newer releases of the software involved.

I have to give credit to the authors: they have made every effort to showcase Nutch's capabilities while preparing your solution to scale. Better yet, Zakir and Abdulbasit empower you by sharing the intricacies of setting up Hadoop and integrating Nutch with several popular key-value or RDBMS persistence solutions, such as Accumulo and MySQL. However, Nutch crawl optimization is, for some reason, missing.

Happily, the book covers index processing, which is essential, but unfortunately, in my opinion, it does not expand enough on a necessary part: Apache Solr.

The book also covers Apache Gora, but leaves out the option to integrate with Cassandra.

In my view, this book is an ideal fit for IT practitioners and field or infrastructure people who need to quickly deliver a working Nutch prototype environment with various options.

On a less happy note, the book concentrates so heavily on infrastructure that, while reading, I wished the authors had better explained where each of the covered technologies fits. At the very least, an explanation of what Nutch comprises, supplemented with real-life usage examples and perhaps a case study or two, would not hurt. It also felt at the beginning as though the book lacks some background preparation for the reader, so at times I needed to pause and seek additional information. A list of references would be nice to have, along with a glossary of terms.

Nevertheless, overall, it is a good read: 4 out of 5 is my verdict.
38 reviews, 3 followers
February 1, 2018
Web Crawling and Data Mining with Apache Nutch focuses on implementing Apache Nutch alongside other big data technologies. The book begins with an explanation of dependencies, an overview of the Apache Nutch file structure and a simple demonstration of how Nutch can crawl web pages. The rest of the book is dedicated to implementing Nutch with different distributed architectures, including Solr, Hadoop, Accumulo and MySQL (relational). Most of the book is dedicated to implementation. I'd recommend it to experienced software, information management or data analytics professionals with a strong foundation in software implementation.

Overall not a bad book. I'll probably turn this into a weekend project just to get a feel for the different Apache products mentioned in this book and also to see how Nutch functions.
1 review
February 12, 2014
It is really a great book, and it helped me with my project.

In my project I need to crawl web content and analyze the data. From the book I learned how to use and integrate the Nutch and Solr frameworks to implement it.

If you have a similar case, I recommend reading this book.