Tika in Action

Name: Tika in Action
Rating: 3.69 (3 reviews)
ISBN: 9781935182856

Chris A. Mattmann, Jukka L. Zitting

Apache Tika is an open source toolkit that makes it easy for search engines, content management systems and other applications to detect and extract content from digital documents in all major file formats.

Tika in Action is a hands-on guide for developers working with search engines, content management systems and other similar applications who want to exploit the information locked in digital documents. It introduces you to the world of mining text and binary documents and other information sources like Internet media types and Dublin Core metadata. The book shows where Tika fits within this landscape and how readers can use Tika to build and extend applications. The book's many case studies give real-world experience from domains ranging from search engines to digital asset management and scientific data processing.

In addition to the architectural overviews, developers will find more detailed information in chapters that focus on advanced features like XMP metadata processing, automatic language detection and custom parser extensions. The book also describes common file formats like MS Word, PDF, HTML, and ZIP and the open source libraries used to process files in these formats. The included code examples are designed to support hands-on experimentation.

This book requires no previous knowledge of Tika or text mining techniques, and will be most valuable to readers with a working knowledge of Java. Tika in Action fits perfectly with other Manning books including Lucene in Action, Mahout in Action, Taming Text, Algorithms of the Intelligent Web, and Collective Intelligence in Action.

Genres: Programming

225 pages, Paperback

First published January 1, 2011

About the author

Chris A. Mattmann

1 book, 1 follower

Community Reviews

5 stars: 1 (6%)
4 stars: 11 (68%)
3 stars: 3 (18%)
2 stars: 0 (0%)
1 star: 1 (6%)

Uli Kunkel

22 reviews, 6 followers

Read

December 25, 2019

The book isn't bad, but not very practical.

Alex Ott

Author of 3 books, 209 followers

January 25, 2012

A very good book on media type detection & content extraction using the Apache Tika framework. By using Tika for text & metadata extraction, you can index & search documents in many existing formats. You can also extend Tika with support for new formats that are needed in your work. Its open source nature makes it very attractive for both open source & corporate developers, allowing flexible integration with many different systems, such as ManifoldCF, Lucene, UIMA, etc.

The book provides a comprehensive description of the framework itself, how to use it for different tasks (file format & language detection, text/metadata extraction, etc.), and how to extend it to support new file formats (both detection & data extraction). Besides this, there are several chapters dedicated to real-world use cases, showing how Apache Tika is used in different projects.

I would recommend this book to everybody who needs to perform media type detection and/or text extraction, especially those working with indexing & searching of heterogeneous documents.

P.S. I gave 4 stars only because I would have liked a more detailed description of how to create complex signatures for file formats (although this information can be found on the project's pages).

ir-dm-nlp-ml-search own-ebook own-pbook