Scraping - getting a computer to capture information from online sources - is one of the most powerful techniques for data-savvy journalists who want to get to the story first, or find exclusives that no one else has spotted. Faster than FOI and more detailed than advanced search techniques, scraping also allows you to grab data that organisations would rather you didn’t have - and put it into a form that allows you to get answers. Scraping for Journalists introduces you to a range of scraping techniques - from very simple ones that are no more complicated than a spreadsheet formula, to more complex challenges such as scraping databases or hundreds of documents. At every stage you'll see results - but you'll also be building towards more ambitious and powerful tools. You’ll be scraping within 5 minutes of reading the first chapter - but more importantly you'll be learning key principles and techniques for dealing with scraping problems. Unlike general books about programming languages, everything in this book has a direct application for journalism, and each principle of programming is related to its application in scraping for newsgathering. And unlike standalone guides and blog posts that cover particular tools or techniques, this book aims to give you skills that you can apply in new situations and with new tools.
Paul Bradshaw runs the MA in Data Journalism and MA in Multiplatform and Mobile Journalism at Birmingham City University. He publishes the Online Journalism Blog, and is the founder of investigative journalism website HelpMeInvestigate. He has written for the Guardian and Telegraph’s data blogs, journalism.co.uk, Press Gazette, InPublishing, Nieman Reports and the Poynter Institute in the US. He is the author of the Online Journalism Handbook, and co-author of Mobile-First Journalism (with Steve Hill) and Magazine Editing (with John Morrish). His ebooks include Scraping for Journalists, Finding Stories in Spreadsheets, the Data Journalism Heist and 8000 Holes: How the 2012 Olympic Torch Relay Lost its Way.
Other books which Bradshaw has contributed to include Investigative Journalism (second edition); Web Journalism: A New Form of Citizenship; and Citizen Journalism: Global Perspectives.
Bradshaw has been listed in Journalism.co.uk’s list of the leading innovators in journalism and media and Poynter’s most influential people in social media. In 2010, he was shortlisted for Multimedia Publisher of the Year, and in 2016 won the CNN MultiChoice Award for an investigation into player trafficking in Nigeria.
In addition to teaching and writing, Paul acts as a consultant and trainer to a number of organisations on social media and data journalism. You can find him on Twitter at @paulbradshaw.
This book is not only an introduction to scraping for journalists who want to know how to use or write programs to automatically collect and organize data from web pages and databases. It is also a workbook and reference. I'm definitely going to keep it on my shelf and refer to it as needed for future reporting projects. It shows readers how to do everything from using a Google Docs function to pull information out of tables on Wikipedia pages and drop it into sortable spreadsheets, to modifying code on ScraperWiki to get into a searchable aspx database, find certain keywords in pdf documents, pull them out along with a few sentences of context, and drop them into a file to look at later. The examples and instructions are clear and easy to follow. The skills, I think, are important in an age when larger and larger amounts of data are available online and reporters need to dig through it fast and accurately.