
Research Spending & Results

Award Detail

  • Chao Huang
Award Date: 12/17/2015
Estimated Total Award Amount: $150,000
Funds Obligated to Date: $171,250
  • FY 2016 = $171,250
Start Date: 01/01/2016
End Date: 12/31/2016
Transaction Type: Grant
Awarding Agency Code: 4900
Funding Agency Code: 4900
CFDA Number: 47.041
Primary Program Source: 040100 NSF RESEARCH & RELATED ACTIVIT
Award Title or Description: SBIR Phase I: Providing Automatic Anomaly Prediction and Diagnosis Software as a Service for Cloud Infrastructures
Federal Award ID Number: 1548867
DUNS ID: 080044293
Program: SBIR Phase I
Program Officer:
  • Peter Atherton
  • (703) 292-8772

Awardee Location

Street:154 Grand Street
City:New York
County:New York
Awardee Cong. District:10

Primary Place of Performance

Organization Name:Cloud Solutions LLC
Street:805 Transom View Way
Cong. District:04

Abstract at Time of Award

The broader impact/commercial potential of this Small Business Innovation Research (SBIR) Phase I project will be to greatly improve the robustness and diagnosability of many real-world cloud computing infrastructures. The proposed technology will significantly reduce the downtime of production cloud systems, which can attract more users to cloud computing and thus benefit the expanding segment of society and the economy that depends on cloud technology. The project will also advance the state of the art of cloud system reliability research by putting research results into real-world use.

This SBIR Phase I project will transform system anomaly management for production cloud computing infrastructures. The novelty of the company's solution lies in three unique features: 1) automatic multivariate anomaly detection that enables high-fidelity anomaly alerts without imposing any configuration burden on the user; 2) early anomaly alerts raised before major system problems occur; and 3) anomaly diagnosis that generates hints on why an anomaly occurs. The proposed research will produce novel and practical anomaly prediction and diagnosis solutions that will be validated in real-world cloud infrastructures. Specifically, the project consists of two thrusts: 1) online multivariate anomaly prediction that explores new lightweight unsupervised learning algorithms for achieving high-fidelity anomaly alerts and providing time-to-failure estimations; and 2) automatic anomaly diagnosis that identifies possible causes of an anomaly to greatly expedite the anomaly troubleshooting process in the cloud. The company will implement the software products and carry out case studies with partners on real-world cloud computing infrastructures.

Project Outcomes Report


This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

In January 2016, we received an NSF SBIR Phase I grant to study the feasibility of providing anomaly detection and diagnosis Software as a Service (SaaS) for cloud infrastructures. We first developed a scalable and resilient analysis engine to host our anomaly detection and root cause inference algorithms in public clouds and to provide a web interface through which users access our anomaly detection and diagnosis services over the Internet. We then developed a set of monitoring agents that connect our analysis engine to different metric data sources, including Amazon CloudWatch monitoring APIs, Google Cloud Monitoring APIs, Docker containers, Kubernetes clusters, VMware hypervisors, Splunk, and ELK data repositories. We launched our beta product in March and made our products publicly available in May. We have been doing Proof-of-Concept (POC) testing with a set of prospective customers since then.
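The agent architecture described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not InsightFinder's actual agent code: the record fields and the `run_agent` loop are assumptions made for the example. The idea is that each source-specific reader (CloudWatch, Docker, etc.) yields raw samples, and the agent normalizes them into one common record format before forwarding them to the analysis engine.

```python
import json
import time

def normalize_sample(source, metric, value, ts=None):
    """Normalize a raw metric sample from any data source (CloudWatch,
    Docker stats, Splunk, ...) into the common record the analysis
    engine ingests. Field names here are illustrative assumptions."""
    return {
        "source": source,          # e.g. "cloudwatch", "docker"
        "metric": metric,          # e.g. "CPUUtilization"
        "value": float(value),
        "timestamp": ts if ts is not None else time.time(),
    }

def run_agent(read_batch, send):
    """Hypothetical agent loop: pull one batch of (source, metric, value)
    tuples from a source-specific reader and forward each sample as a
    normalized JSON record to the analysis engine."""
    for source, metric, value in read_batch():
        send(json.dumps(normalize_sample(source, metric, value)))
```

In this sketch, adding support for a new data source only requires writing a new `read_batch` reader; the engine-facing record format stays fixed.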

Our research aims at answering a set of key commercialization feasibility questions: 1) can our multivariate anomaly detection algorithms raise advance alerts with high accuracy for real-world system failures? 2) can we significantly reduce false alarms compared to traditional anomaly detection algorithms? 3) can we provide useful diagnostic hints about the anomaly root cause? and 4) can we provide real-time alerts for large-scale real-world distributed applications? We carried out a data-driven research study to answer these questions.

We evaluated our anomaly detection and diagnosis algorithms using real-world failure data sets provided by our POC testing customers. The root causes of those system failures range from hardware failures and human errors to software bugs, or a combination of those three factors. The failure data sets are typically large, consisting of several weeks of monitoring data samples and hundreds or thousands of metrics. We also compared our algorithms with a set of common alternative approaches such as simple threshold-based alerting and clustering (DBSCAN). Our anomaly detection solution not only achieves a 100% detection rate (detecting all true anomalies) but also raises alerts hours or days earlier than customers' existing tools by capturing early warning signs. Our false alarm rates are orders of magnitude lower than those of existing approaches, which can significantly reduce the alert processing cost. Moreover, our faulty metric inference algorithm can provide useful hints on the anomaly root cause, which can potentially reduce the incident triage time from hours or days to minutes. Our tool also detected some production system failures that were missed by existing alert tools.
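A toy version of unsupervised multivariate anomaly detection can illustrate why joint deviation beats per-metric thresholds: fit a per-metric baseline on normal-operation data, then score each new sample by its combined deviation across all metrics. The z-score combination and the threshold value below are illustrative assumptions, not the company's actual algorithm.

```python
import statistics

def fit_baseline(train):
    """Fit a (mean, stdev) baseline per metric from a window of normal
    samples; each sample is a list of metric values."""
    cols = list(zip(*train))
    return [(statistics.mean(c), statistics.pstdev(c) or 1.0) for c in cols]

def anomaly_score(sample, baseline):
    # Combine per-metric z-scores into one multivariate score, so a
    # sample is flagged on joint deviation rather than on any single
    # noisy metric crossing a hand-set threshold.
    return sum(((x - m) / s) ** 2
               for x, (m, s) in zip(sample, baseline)) / len(sample)

def detect(samples, baseline, threshold=9.0):
    """Return indices of samples whose score exceeds the threshold
    (threshold chosen arbitrarily for this sketch)."""
    return [i for i, s in enumerate(samples)
            if anomaly_score(s, baseline) > threshold]
```

Because the baseline is learned from the data itself, the user never configures per-metric thresholds, which matches the "no configuration burden" goal stated in the abstract.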

We have also implemented an initial prototype of a system call analysis system that provides deep-dive debugging for applications running inside public cloud infrastructures. Our system call monitoring agent imposes less than 1% CPU overhead on the customer's environment. The monitoring agent transmits compressed system call traces to our server only after an anomaly alert is raised. The system call analysis module can estimate the fault impact scope (e.g., global vs. local impact) and produce a ranked list of root-cause-related functions. We tested our prototype using 10 real software bugs in 4 common open source software systems (Cassandra, Apache, Hadoop, MySQL). The results show that we can accurately estimate the fault impact scope and rank the root-cause-related functions within the top 25 candidate faulty functions out of millions of application functions, which can greatly reduce the debugging time for the application developer.
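One simple way to rank root-cause candidates from system call traces is to compare each function's call frequency in the anomaly window against the normal window: the functions whose behavior shifts most are the prime suspects. The sketch below is a deliberately simplified stand-in for the real inference algorithm, with made-up function names in the test data.

```python
from collections import Counter

def rank_suspect_functions(normal_trace, anomaly_trace, top_k=25):
    """Rank functions by how much their relative call frequency changes
    between the normal window and the anomaly window. A simplified
    illustration of frequency-shift ranking, not the actual module."""
    normal = Counter(normal_trace)
    anomalous = Counter(anomaly_trace)
    n = sum(normal.values()) or 1
    a = sum(anomalous.values()) or 1
    funcs = set(normal) | set(anomalous)
    # Counter returns 0 for missing keys, so functions that only appear
    # during the anomaly (or that vanish) get a large shift score.
    scores = {f: abs(anomalous[f] / a - normal[f] / n) for f in funcs}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A `top_k` of 25 mirrors the "top 25 candidate faulty functions" result reported above, shrinking the developer's search space from millions of functions to a short list.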

We also tested with a range of distributed applications deployed on production cloud infrastructures. The distributed applications we tested span hundreds of computing nodes. Our system can complete anomaly detection within several seconds and train hundreds of models in parallel within tens of seconds. We therefore believe that our approach is scalable and practical for large-scale system monitoring and real-time analytics. During our Phase IB project, we also implemented a set of open source monitoring agents to integrate InsightFinder with different types of systems, as well as an initial prototype of a log event classification and anomaly detection tool.
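Training hundreds of per-node models in parallel is straightforward because each model depends only on its own node's metric stream. The sketch below shows the pattern with a thread pool; the per-node "model" here is just a (mean, stdev) baseline, an assumption for illustration rather than the system's actual model.

```python
from concurrent.futures import ThreadPoolExecutor
import statistics

def train_model(series):
    # Stand-in per-node "model": a (mean, stdev) baseline over the
    # node's metric history. The real models would be richer.
    return (statistics.mean(series), statistics.pstdev(series))

def train_all(metric_streams, workers=32):
    """Train one model per node in parallel. Since the per-node fits are
    independent, hundreds of models can be trained concurrently,
    mirroring the tens-of-seconds training time reported above."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(metric_streams,
                        pool.map(train_model, metric_streams.values())))
```

For CPU-bound model fitting a process pool (`ProcessPoolExecutor`) would be the more natural choice; a thread pool is used here only to keep the example self-contained.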

The project provided valuable internship opportunities for three graduate students and one female undergraduate student to gain software development and data analytics skills. Techniques developed in this project have a significant impact on improving the diagnosability and robustness of many real-world cloud computing infrastructures. The commercial potential of the cloud anomaly prediction and diagnosis technology is substantial and has been demonstrated by our initial commercialization success. As the proposed technology increases the robustness of cloud infrastructures, it allows more users to adopt cloud computing and thus benefits the whole segment of society that depends on cloud technology. The project also advances the state of the art of cloud reliability management research by putting the technology into real-world use and enhancing it to address real-world challenges.



Last Modified: 10/12/2016
Modified by: Chao Huang
