A White-Box Testing Model For Deep Learning Systems

How do you find errors in a system that exists in a black box whose contents are a mystery even to experts?

That is one of the challenges of perfecting self-driving cars and other deep learning systems that are based on artificial neural networks—known as deep neural networks—modeled after the human brain. Inside these systems, a web of neurons enables a machine to process data with a nonlinear approach and, essentially, to teach itself to analyze information through what is known as training data.

When an input is presented to a “trained” system—like an image of a typical two-lane highway shown to a self-driving car platform—the system recognizes it by running an analysis through its complex logic system. This process largely occurs inside a black box and is not fully understood by anyone, including a system’s creators.

Any errors also occur inside the black box and are thus difficult to identify and fix. This opacity presents a particular challenge to identifying “corner case” behaviors that occur outside normal operating parameters. For example, a self-driving car system might be trained to recognize curves in two-lane highways in most instances. However, if the lighting is dimmer or brighter than normal, the system may fail to recognize a curve and an error could occur.

Shining a light into the black box of deep learning systems is what researchers from Lehigh University and Columbia University have achieved with DeepXplore, the first automated white-box testing framework for such systems. The group includes Yinzhi Cao, assistant professor of computer science and engineering at Lehigh; Junfeng Yang, associate professor of computer science at Columbia; Suman Jana, assistant professor of computer science at Columbia; and Columbia Ph.D. student Kexin Pei.

Evaluating DeepXplore on real-world datasets, the researchers have been able to expose thousands of unique incorrect corner-case behaviors. The team has released its software as open source for other researchers to use, and has launched a website where people can upload their own data to see how the testing process works.

The researchers presented their findings and won a Best Paper Award at the 2017 biennial ACM Symposium on Operating Systems Principles (SOSP) conference in Shanghai, China, on Oct. 29 in a session titled Bug Hunting.

“Our DeepXplore work proposes the first test coverage metric called ‘neuron coverage’ to empirically understand if a test input set has provided bad versus good coverage of the decision logic and behaviors of a deep neural network,” says Cao, assistant professor of computer science and engineering and an artificial intelligence expert.

In addition to introducing neuron coverage as a metric, the researchers demonstrate how a technique for detecting logic bugs in more traditional systems—called differential testing—can be applied to deep learning systems.

“DeepXplore solves another difficult challenge of requiring many manually labeled test inputs. It does so by cross-checking multiple deep neural networks and cleverly searching for inputs that lead to inconsistent results from the deep neural networks,” says Yang, associate professor of computer science.

“For instance, given an image captured by a self-driving car camera, if two networks think that the car should turn left and the third thinks that the car should turn right, then a corner case is likely in the third deep neural network. There is no need for manual labeling to detect this inconsistency.”

The team evaluated DeepXplore on real-world datasets, including Udacity self-driving car challenge data, image data from ImageNet and MNIST, Android malware data from Drebin, and PDF malware data from Contagio/VirusTotal, and on production-quality deep neural networks trained on these datasets, such as those ranked top in Udacity’s self-driving car challenge.

Their results show that DeepXplore found thousands of incorrect corner-case behaviors (e.g., self-driving cars crashing into guard rails) in 15 state-of-the-art deep learning (DL) models with a total of 132,057 neurons trained on five popular datasets containing around 162 GB of data.

White-box testing model leads to greater neuron coverage

According to a paper to be published soon, DeepXplore is designed to generate inputs that maximize a DL system’s neuron coverage.

“At a high level, neuron coverage of DL systems is similar to code coverage of traditional systems, a standard metric for measuring the amount of code exercised by an input in a traditional software,” the authors write.

“However, code coverage itself is not a good metric for estimating coverage of DL systems as most rules in DL systems, unlike traditional software, are not written manually by a programmer but rather [are] learned from training data.”

“We found that for most of the deep learning systems we tested, even a single randomly picked test input was able to achieve 100-percent code coverage—however, the neuron coverage was less than 10 percent,” says Jana, assistant professor of computer science.

On average, the inputs generated by DeepXplore achieved 34.4 percent higher neuron coverage than the same number of randomly picked inputs and 33.2 percent higher neuron coverage than the same number of adversarial inputs (inputs that an attacker has intentionally designed to cause a machine learning model to make a mistake).
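To make the metric concrete, the following Python sketch computes neuron coverage for a toy fully connected network: the share of neurons whose output rises above a threshold for at least one input in a test set. It is a simplified illustration, not DeepXplore's own code. The tiny network, its weights, the ReLU activation, the threshold of 0 and all function names are assumptions made for this example.

```python
def relu(value):
    """Rectified linear unit: pass positive values, clamp the rest to 0."""
    return value if value > 0.0 else 0.0


def layer_forward(weights, biases, inputs):
    """Return the ReLU outputs of one fully connected layer."""
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def neuron_coverage(layers, test_inputs, threshold=0.0):
    """Fraction of neurons activated above `threshold` by at least one input."""
    total_neurons = sum(len(biases) for _, biases in layers)
    covered = set()
    for sample in test_inputs:
        activations = sample
        for layer_idx, (weights, biases) in enumerate(layers):
            activations = layer_forward(weights, biases, activations)
            for neuron_idx, output in enumerate(activations):
                if output > threshold:
                    covered.add((layer_idx, neuron_idx))
    return len(covered) / total_neurons


# Hypothetical toy network: 2 inputs -> 3 hidden neurons -> 2 output neurons.
toy_layers = [
    ([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0]),
]

one_input = [[1.0, 0.0]]
three_inputs = [[1.0, 0.0], [-1.0, -1.0], [0.0, 1.0]]

print(neuron_coverage(toy_layers, one_input))     # 0.4
print(neuron_coverage(toy_layers, three_inputs))  # 1.0
```

In this toy case a single input exercises only two of the five neurons, even though it runs every line of the code. Coverage rises only when inputs that trigger the remaining neurons are added, which is the gap between code coverage and neuron coverage that the researchers describe.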

Using multiple DL systems to find logic bugs

Cao and Yang show how multiple deep learning systems with similar functionality (e.g., self-driving cars by Google, Tesla and Uber) can be used as cross-referencing oracles to identify erroneous corner cases without manual checks. For example, if one self-driving car decides to turn left while others turn right for the same input, one of them is likely to be incorrect. Such differential testing techniques have been applied successfully in the past for detecting logic bugs without manual specifications in a wide variety of traditional software.

In their paper, the researchers show how differential testing can be applied to DL systems.
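The cross-referencing idea can be sketched in a few lines of Python. The example below runs the same inputs through several models and flags every input on which their predictions differ, with no hand-written labels. It is only an illustration of differential testing in general, not the researchers' implementation: the three stand-in "models" are hypothetical functions, and the input format (road curvature, brightness) is invented for the example.

```python
from collections import Counter


def find_disagreements(models, inputs):
    """Return a finding for every input on which the models disagree.

    Each finding is (input, predictions, majority_label, dissenter_indices).
    """
    findings = []
    for sample in inputs:
        predictions = [model(sample) for model in models]
        counts = Counter(predictions)
        if len(counts) > 1:  # at least two different answers
            majority_label, _ = counts.most_common(1)[0]
            dissenters = [
                idx for idx, label in enumerate(predictions)
                if label != majority_label
            ]
            findings.append((sample, predictions, majority_label, dissenters))
    return findings


# Hypothetical stand-ins for three steering models.
# Each input is a (curvature, brightness) pair; negative curvature bends left.
def model_a(sample):
    curvature, _ = sample
    return "left" if curvature < 0 else "right"


def model_b(sample):
    curvature, _ = sample
    return "left" if curvature < 0 else "right"


def model_c(sample):
    curvature, brightness = sample
    if brightness < 0.2:  # assumed flaw: fails in dim light
        return "right"
    return "left" if curvature < 0 else "right"


road_scenes = [(-0.5, 0.8), (0.4, 0.9), (-0.5, 0.1)]
findings = find_disagreements([model_a, model_b, model_c], road_scenes)

for sample, predictions, majority, dissenters in findings:
    print(sample, predictions, "majority:", majority, "dissenters:", dissenters)
```

Only the dimly lit left curve is flagged, and the third model is singled out as the dissenter. The majority answer is treated as a likely reference rather than a guarantee, since a disagreement shows only that at least one model is wrong.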

Finally, the researchers’ novel testing approach can be used to retrain systems to improve classification accuracy. During testing, they improved classification accuracy by up to 3 percent by retraining a deep learning model on inputs generated by DeepXplore, compared with retraining on the same number of randomly picked or adversarial inputs.

“DeepXplore is able to generate numerous inputs that lead to deep neural network misclassifications automatically and efficiently,” adds Yang. “These inputs can be fed back to the training process to improve accuracy.”

“Our ultimate goal,” says Cao, “is to be able to test a system, like self-driving cars, and tell the creators whether it is truly safe and under what conditions.”

Story by Lori Friedman