Research Spending & Results

Award Detail

Doing Business As Name: Carnegie-Mellon University
  • Jason Hong
  • (412) 268-1295
Award Date: 09/05/2014
Estimated Total Award Amount: $499,290
Funds Obligated to Date: $515,290
  • FY 2014 = $499,290
  • FY 2016 = $16,000
Start Date: 10/01/2014
End Date: 09/30/2017
Transaction Type: Grant
Awarding Agency Code: 4900
Funding Agency Code: 4900
CFDA Number: 47.070
Primary Program Source: 040100 NSF RESEARCH & RELATED ACTIVIT
Award Title or Description: TWC: Small: CrowdVerify: Using the Crowd to Summarize Web Site Privacy Policies and Terms of Use Policies
Federal Award ID Number: 1422018
DUNS ID: 052184116
Parent DUNS ID: 052184116
Program: Secure & Trustworthy Cyberspace

Awardee Location

Street: 5000 Forbes Avenue
Awardee Cong. District: 18

Primary Place of Performance

Organization Name: Carnegie Mellon University
Street: 5000 Forbes Avenue

Abstract at Time of Award

Everyday web users have little guidance in handling the growing number of privacy issues they face when they go online. Many web sites (some legitimate, some less so) have behaviors many would consider unexpected or undesirable. These include popular and well-known web sites, as well as web sites that aim to dupe customers with "free" trials. These kinds of sites often detail their behaviors in privacy policies and terms of use pages, but these policies are rarely read, hard to understand, and sometimes intentionally obfuscated with legal jargon, small text, and pale fonts. The goal of this research is to develop new techniques to pinpoint and summarize the most surprising and most important parts of policies. The results of this research will be made publicly available on a web site and through web browser extensions.

The major research activity will be to design, implement, and evaluate CrowdVerify, a system that combines crowdsourcing with machine learning techniques to flag the most important and unexpected behaviors of web sites. The core idea is to slice a given policy into smaller text segments, have crowd workers compare different segments, and then aggregate the results. A number of competitive rating systems will also be evaluated for scoring the importance of segments, including Elo, Glicko, and TrueSkill. Using these results, computational models will be built that can predict what people find most surprising, as well as most important, in web policies.
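The pairwise-comparison-plus-rating idea can be sketched with a minimal Elo-style update. This is an illustrative sketch only: the sentences, crowd judgments, and K-factor below are hypothetical, and the actual project also evaluated Glicko and TrueSkill as alternatives.

```python
# Minimal Elo-style rating of policy sentences from pairwise crowd judgments.
# All data here is hypothetical illustration, not the project's actual data.

def expected(r_a, r_b):
    """Elo expected score: probability A is judged more important than B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, winner, loser, k=32):
    """Shift ratings after one crowd worker picks `winner` over `loser`."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

sentences = ["We may share your data with third parties.",
             "This page uses cookies.",
             "You waive your right to a class action."]
ratings = {s: 1500.0 for s in sentences}  # everyone starts at the same rating

# Hypothetical crowd judgments: (judged more important, judged less important)
judgments = [(sentences[2], sentences[1]),
             (sentences[0], sentences[1]),
             (sentences[2], sentences[0])]
for winner, loser in judgments:
    update(ratings, winner, loser)

# Aggregate result: sentences ranked by learned importance.
ranked = sorted(ratings, key=ratings.get, reverse=True)
```

After these three judgments, the class-action waiver ends up rated highest, which matches the intuition that sentences winning their comparisons accumulate rating.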

Project Outcomes Report


This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The goal of this project is to develop new techniques to analyze and summarize terms and conditions policies on web sites, making it easy for consumers to see the most important statements. These policies tend to be long and difficult to read, with important information buried in long tracts of text.

Our team has been examining how to use crowd-based techniques to gather information about what people feel is important in these policies. More specifically, we slice a given policy into individual sentences and then show pairs of sentences at a time to crowd workers, asking them to choose which statement is more important. Using this data, we have been building language models that can be used on terms and conditions policies that we have not yet seen, to help predict what people will view as important.
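The slicing-and-pairing step described above can be sketched as follows. The sentence splitter and the sample policy text are assumptions for illustration; the real system would use a more robust segmenter.

```python
# Sketch of the pair-generation step: split a policy into sentences and
# enumerate the sentence pairs that would each become one crowd task.
# The splitting heuristic and policy text are illustrative assumptions.
import itertools
import re

def split_sentences(policy_text):
    """Naive splitter: break on whitespace following ., !, or ?."""
    parts = re.split(r'(?<=[.!?])\s+', policy_text.strip())
    return [p for p in parts if p]

policy = ("We collect your email address. "
          "We may sell aggregated data to partners. "
          "Refunds are not available after 30 days.")

sentences = split_sentences(policy)
# Each unordered pair is one comparison task shown to a crowd worker.
pairs = list(itertools.combinations(sentences, 2))
```

For n sentences this yields n(n-1)/2 candidate comparisons, which is why reducing the amount of crowd work per policy matters as policies grow long.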

To date, we have collected crowd data on 20 different ecommerce web site policies. We have also used machine learning techniques to build some language models. We are currently applying these language models to thousands of web policies that we have crawled, and organizing the results into a web site that consumers can use to quickly see the most important things they should know about a site.

From a scientific and intellectual perspective, the main contribution of our work is in developing new techniques for having crowd workers analyze policies. We have also investigated techniques for optimizing the amount of work required of crowd workers, and analyzed which categories of statements people find most important in terms and conditions policies. Lastly, our data set is available on our web site.

From a broader contributions perspective, our work has the potential to help consumers quickly understand the most important items in lengthy terms and conditions policies. Our work also has the potential to help consumer advocates understand what consumers are most worried about and pinpoint unusual statements in these policies. Lastly, our work may be of interest to journalists and to the companies whose policies are being analyzed.

Last Modified: 04/09/2018
Modified by: Jason Hong
