2009 Doctor of Philosophy dissertation from the University of Texas at Austin
The Internet has enabled the collection, aggregation and analysis of personal data on a massive scale. It has also enabled the sharing of collected data in various ways: wholesale outsourcing of data warehousing, partnering with advertisers for targeted advertising, data publishing for exploratory research, and so on. This has led to complex privacy questions related to the leakage of sensitive user data and the mass harvesting of information by unscrupulous parties. These questions have information-theoretic, sociological and legal aspects and are often poorly understood.

There are two fundamental paradigms for how data is released. In the interactive setting, the data collector holds the data while third parties interact with the data collector to compute some function over the database. In the non-interactive setting, the database is somehow "sanitized" and then published. In this thesis, we conduct a thorough theoretical and empirical investigation of the privacy issues involved in non-interactive data release. Both settings have been well analyzed in the academic literature, but the simplicity of the non-interactive paradigm has resulted in its being used almost exclusively in actual data releases. We analyze several common applications, including electronic directories, collaborative filtering and recommender systems, and social networks.

Our investigation has two main foci. First, we present frameworks for privacy and anonymity in these different settings, within which one might define exactly when a privacy breach has occurred. Second, we use these frameworks to experimentally analyze actual large datasets and quantify privacy issues.

The picture that has emerged from this research is a bleak one for non-interactivity. While a surprising level of privacy control is possible in a limited number of applications, the general sense is that protecting privacy in the non-interactive setting is not as easy as intuitively assumed in the absence of rigorous privacy definitions. While some applications can be salvaged, either by moving to an interactive setting or by other means, in others a rethinking of the currently taken-for-granted tradeoffs between utility and privacy appears to be necessary.
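The distinction between the two release paradigms can be made concrete with a small sketch. The toy Python below is illustrative only, not code from the dissertation: the interactive mechanism answers a count query with Laplace noise (the standard differential-privacy approach), while the non-interactive release publishes a coarsened copy of the data once and for all. All data and parameter values are hypothetical.

```python
import random

# Toy database: each record is one user's sensitive attribute (e.g., salary).
# All values here are hypothetical, for illustration only.
database = [52000, 61000, 47000, 83000, 59000]

def interactive_count(predicate, epsilon=1.0):
    """Interactive setting: the data collector keeps the data and answers
    a query, adding Laplace noise to the true count instead of releasing
    the records themselves (a count query has sensitivity 1)."""
    true_count = sum(1 for x in database if predicate(x))
    # Difference of two Exp(epsilon) variables is Laplace with scale 1/epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

def noninteractive_release(bucket=20000):
    """Non-interactive setting: the data is 'sanitized' (here, crudely
    coarsened into buckets) and published; all later analysis, benign or
    adversarial, works from this one public snapshot."""
    return sorted((x // bucket) * bucket for x in database)

print(interactive_count(lambda s: s > 60000))  # noisy answer, data stays private
print(noninteractive_release())                # published once, reused forever
```

The sketch also shows why the non-interactive paradigm is tempting: the published snapshot is simple to distribute and query, but the sanitization step carries the entire privacy burden up front.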
Arvind Narayanan is a professor of computer science at Princeton University and the director of the Center for Information Technology Policy. He was included in TIME's inaugural list of the 100 most influential people in AI.
Narayanan led the Princeton Web Transparency and Accountability Project, which uncovered how companies collect and use personal information. His work was also among the first to show that machine learning reflects cultural stereotypes.
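The flavor of that stereotype finding can be conveyed with a toy association test in the style of that line of work (the 2-D "embeddings" below are hypothetical; real studies use pretrained word vectors and carefully chosen word sets):

```python
import math

# Hypothetical word vectors; in real studies these come from a trained model.
vecs = {
    "flower": (0.9, 0.1), "insect": (-0.8, 0.2),
    "pleasant": (0.85, 0.15), "unpleasant": (-0.9, 0.1),
}

def cos(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# A word's association bias: how much closer it sits to "pleasant"
# than to "unpleasant" in the embedding space.
for w in ("flower", "insect"):
    bias = cos(vecs[w], vecs["pleasant"]) - cos(vecs[w], vecs["unpleasant"])
    print(w, round(bias, 3))
```

When the vectors are learned from human-written text, such differential associations reproduce the stereotypes present in that text.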
He received the Privacy Enhancing Technologies Award for showing that publicly available social media and web information can be cross-referenced to re-identify customers whose data had been "anonymized" by companies.
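A minimal sketch of the linkage idea behind such re-identification follows (hypothetical data and scoring, much simpler than the published attacks, which weight matches and check that the best candidate stands out with high confidence):

```python
# "Anonymized" records: user IDs replaced, but ratings left intact.
anonymized = {
    "user_17": {"Movie A": 5, "Movie B": 2, "Movie C": 1},
    "user_42": {"Movie A": 5, "Movie C": 4, "Movie D": 1},
    "user_99": {"Movie B": 3, "Movie E": 5},
}

# Auxiliary information about the target, gleaned from public sources
# (e.g., reviews the target posted under their real name).
aux = {"Movie A": 5, "Movie C": 4}

def match_score(record, aux):
    # Count auxiliary ratings approximately reproduced in the record.
    return sum(1 for movie, r in aux.items()
               if movie in record and abs(record[movie] - r) <= 1)

scores = {uid: match_score(rec, aux) for uid, rec in anonymized.items()}
best = max(scores, key=scores.get)
print(scores, "-> best match:", best)  # user_42 matches both auxiliary ratings
```

Even a handful of auxiliary data points can single out one record, which is why removing identifiers alone does not anonymize high-dimensional data.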
Narayanan prototyped and helped develop Do Not Track, a user preference conveyed through an HTTP header field.
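The mechanism itself is simple: a single request header, "DNT: 1". A minimal sketch in Python (the URL is a placeholder; whether a server honors the signal is voluntary):

```python
import urllib.request

# Send a request carrying the Do Not Track preference.
req = urllib.request.Request("https://example.com/",
                             headers={"DNT": "1"})
with urllib.request.urlopen(req) as resp:
    print(resp.status)

# A cooperating server would check the incoming header, e.g.
# request.headers.get("DNT") == "1", and disable tracking for that user.
```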
He is a co-author of the book AI Snake Oil and of a newsletter of the same name, which is read by 50,000 researchers, policymakers, journalists, and AI enthusiasts.