Award Abstract # 1224035
TWC: Small: Critter@home: Content-Rich Traffic Trace Repository from Real-Time, Anonymous, User Contributions

NSF Org: CNS
Division Of Computer and Network Systems
Recipient: UNIVERSITY OF SOUTHERN CALIFORNIA
Initial Amendment Date: August 28, 2012
Latest Amendment Date: August 28, 2012
Award Number: 1224035
Award Instrument: Standard Grant
Program Manager: Jeremy Epstein
jepstein@nsf.gov
 (703)292-8338
CNS
 Division Of Computer and Network Systems
CSE
 Direct For Computer & Info Scie & Enginr
Start Date: September 1, 2012
End Date: August 31, 2014 (Estimated)
Total Intended Award Amount: $375,000.00
Total Awarded Amount to Date: $375,000.00
Funds Obligated to Date: FY 2012 = $375,000.00
History of Investigator:
  • Jelena Mirkovic (Principal Investigator)
    mirkovic@isi.edu
Recipient Sponsored Research Office: University of Southern California
3720 S FLOWER ST FL 3
LOS ANGELES
CA  US  90089-0701
(213)740-7762
Sponsor Congressional District: 37
Primary Place of Performance: USC-Information Sciences Institute
4676 Admiralty Way, Suite 1001
Marina del Rey
CA  US  90292-6611
Primary Place of Performance Congressional District: 36
Unique Entity Identifier (UEI): G88KLJR3KYT5
Parent UEI:
NSF Program(s): CYBERCORPS: SCHLAR FOR SER,
Secure & Trustworthy Cyberspace
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
04001213DB NSF Education & Human Resource
Program Reference Code(s): 7434, 7923, 9102, 9178, SMET
Program Element Code(s): 166800, 806000
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

There are very few publicly available network traces that contain application-level data, because of the enormous privacy risk that sharing such data creates. Application-level data is rich with personal and private information, such as human names and social security numbers, that criminals can monetize. Yet such data is necessary for realistic testing of research products and for understanding trends in networking and network applications.

This project develops a publicly accessible, diverse and fresh archive of content-rich network data, contributed by volunteer users, called Critter-at-home. Users join the Critter overlay whenever online, offering their data to interested researchers. Privacy of data contributors is protected by several means. First, contributors may opt to host their own data on their machines, thus retaining full control over it. Second, we process contributed data to modify all personal and private information (PPI) and we encrypt it. Third, no human apart from the contributor ever accesses the raw, PPI-sanitized, data. Instead, researchers query the data via our Critter-at-home framework, and they receive aggregate statistics (counts, distributions, etc.) of the traffic features they query for. Fourth, all contact with a contributor is at her discretion and is done through an anonymous network, where contributor identities are hidden.

The archive this project creates will greatly advance security research by providing necessary data for its validation and for data mining. This archive will further be valuable to the broader networking community, e.g., for realistic traffic generation, as ground truth in traffic classification, and for many other purposes.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Researchers need real-world network data from real users for computer network and security research. Unfortunately, because such data contains large amounts of personal information (such as what web sites a user visits), collecting such data and granting access to researchers is often deemed to carry too many privacy risks. This is especially true for content-rich data, that is, application-level data such as the content of websites a user visits. Critter, which stands for "Content-Rich Traffic Trace Repository", aims to provide this much-needed content-rich data to researchers through a network of volunteer data contributors. Critter allows researchers to run certain queries on user data and returns aggregate responses to protect user privacy. Unlike traditional network data sources, where researchers work with an ISP or other large organization to gain access to data, Critter connects researchers with individual end-users willing to share their data for research purposes.
Since Critter works on an individual level, users retain much more control over their data and how it is used than in traditional data collection methods, such as network traffic traces collected at a university. Users keep their "raw" data locally on their machine and can withdraw their data at any time. When sharing privacy-sensitive data, the original data always remains under the control of its owner. The data owner releases information by responding to queries with a numerical value. These responses are aggregated on the Critter Server before being returned to a researcher. Since we release only aggregate responses, many active and passive attacks that work against data sets such as sanitized network traces or sanitized logs are ineffective in our context.

Figure 1 illustrates how queries work in Critter. First, (1) a researcher submits a query via the public portal. Data contributors' clients (2) poll for new queries and (3) retrieve this new query. The Critter client processes this query if the data contributor's policy permits it, and returns the result. The Critter Server aggregates the results and stores these aggregated results for the researcher to retrieve.
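
The query flow above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the Critter implementation: the class names (Query, Server, Client), the feature string format, and the policy callback are all assumptions made for the example, and the aggregation step is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Query:
    query_id: int
    feature: str  # hypothetical feature name, e.g. "visits:example.org"


@dataclass
class Server:
    """Hypothetical stand-in for the public portal plus the Critter Server."""
    queries: List[Query] = field(default_factory=list)
    replies: Dict[int, List[int]] = field(default_factory=dict)

    def submit(self, feature: str) -> Query:
        # Step 1: a researcher submits a query.
        query = Query(len(self.queries), feature)
        self.queries.append(query)
        self.replies[query.query_id] = []
        return query

    def poll(self, last_seen: int) -> List[Query]:
        # Steps 2 and 3: a client polls and retrieves queries it has not seen.
        return [q for q in self.queries if q.query_id > last_seen]

    def reply(self, query_id: int, value: int) -> None:
        # Only a numerical value leaves the contributor's machine.
        self.replies[query_id].append(value)


@dataclass
class Client:
    """Hypothetical contributor client; raw data stays in local_data."""
    local_data: Dict[str, int]
    policy: Callable[[Query], bool]
    last_seen: int = -1

    def run_once(self, server: Server) -> None:
        for query in server.poll(self.last_seen):
            self.last_seen = query.query_id
            if self.policy(query):  # answer only if the policy permits it
                server.reply(query.query_id, self.local_data.get(query.feature, 0))


server = Server()
query = server.submit("visits:example.org")
clients = [
    Client({"visits:example.org": 4}, policy=lambda q: True),
    Client({"visits:example.org": 9}, policy=lambda q: False),  # opts out
    Client({}, policy=lambda q: True),
]
for client in clients:
    client.run_once(server)
print(sorted(server.replies[query.query_id]))
```

In this sketch the second contributor's policy rejects the query, so the server never sees that contributor's value.
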
Our result aggregation provides privacy protection through "hiding in a crowd". The Critter Server enforces k-anonymity criteria before any result is returned to the researcher. If a researcher asks how often users visit a particular website during a specific week, k-anonymity ensures that the returned result is a set of grouped responses such that each group has a single value representing at least k different contributors' replies.

Figure 2 depicts how this works with k = 3 and an example of responses from four data contributors to a query about how often each contributor visits a particular website. Since Group#2 contains fewer than k = 3 contributors, we cannot release its information. Instead, we drop this group and return only Group#1. Once results are aggregated, an attacker cannot know for sure which contributors participated in a query or know any single contributor's response to a query.
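
The group-and-drop step can be illustrated with a short Python sketch. The report does not say how Critter forms its groups, so the fixed-width bucketing below is an assumption, as are the function name, the bucket width, and the four reply values (they are not the values shown in Figure 2).

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def k_anonymous_groups(
    replies: Dict[str, int], k: int, bucket_width: int
) -> List[Tuple[Tuple[int, int], int]]:
    """Group replies into value ranges and release only groups of size >= k.

    Assumption: groups are fixed-width value ranges. Each released group is
    reported as ((low, high), number_of_contributors).
    """
    buckets: Dict[int, Set[str]] = defaultdict(set)
    for contributor, value in replies.items():
        low = (value // bucket_width) * bucket_width
        buckets[low].add(contributor)

    released = []
    for low in sorted(buckets):
        members = buckets[low]
        if len(members) >= k:  # groups with fewer than k contributors are dropped
            released.append(((low, low + bucket_width - 1), len(members)))
    return released


# Hypothetical replies from four contributors: weekly visits to one website.
example = {"a": 2, "b": 3, "c": 4, "d": 17}
print(k_anonymous_groups(example, k=3, bucket_width=5))  # [((0, 4), 3)]
```

With these made-up values, contributors a, b, and c fall into one group of size 3 and are released as a single range, while contributor d sits alone in a second group, which is dropped.
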

The end result of this NSF-funded effort is the implemented Critter System, including an easy-to-install Critter client, and a small base of volunteer data contributors. Becoming a data contributor through Critter is easy. We have created a simple install process and a self-updating client that works under Windows and Mac. To see more or to join Critter plea...

