Using Big Data to Study Creativity in Scientific Enterprise

Quantifying Creativity

Creativity has long been an intangible concept, says Ting Wang.

The complex process of connecting two seemingly unrelated scientific ideas is difficult to understand.

"It's kind of magic," says Wang, assistant professor of computer science and engineering. "How are you connecting these two thoughts?"

Wang has a possible answer to that question that's both creative and concrete: He's using big data to better understand the underlying mechanisms of the creative process.

Until now, researchers studying creativity in the scientific enterprise have focused on the references of scientific publications to gain an understanding of how the publications relate to one another. A paper's creativity, therefore, has been measured by examining how it connects previously disconnected knowledge. The greater the difference between the idea presented in a new paper and the claims made by the works cited within it, the more creative the new idea.

However, this approach doesn't give a complete picture of how the authors actually wrote the paper, says Wang. The authors have consumed, and possibly been inspired by, information beyond the publications they've referenced, he says. "That information would be critical to understanding the novelty or the creativity of those who published the paper because it reflects how we take information and how we digest it and how we actually produce something new."

In a paper titled "Inspiration or Preparation? Explaining Creativity in Scientific Enterprise," Wang and his colleagues—Lehigh doctoral student Xinyang Zhang and Dashun Wang, associate professor of management and organizations at Northwestern University's Kellogg School of Management—explain an approach that allows them to "quantitatively assess the creativity of a paper, an author, an institution or even a discipline" for the first time. They also develop a predictive framework that "accurately identifies the most critical knowledge to fostering target scientific innovations." Zhang presented the paper at the International Conference on Knowledge Management in October.

Making the jump

In defining the creativity of an idea, Wang and his colleagues consider two factors: its rarity, or how many others have taken a similar approach, and the disconnect between the idea in the paper and the papers it cites. How big was the jump from an existing idea to the new idea?

To find out, the team looked to information consumption. The most comprehensive dataset that captures information consumption in scientific enterprise, they write, is the web traffic generated by researchers, which reveals what online resources they access. Wang and his colleagues used two web-scale, longitudinal datasets—Indiana University Click and Microsoft Academic Graph—to contrast authors' information consumption behaviors (input) against published scientific papers (output).

The Indiana University Click dataset is an anonymized collection of 53.5 billion web requests initiated by researchers at Indiana University between September 2006 and May 2010. The Microsoft Academic Graph (MAG) dataset consists of 120.9 million papers published in 24,843 venues across all scientific fields. After identifying in the MAG dataset all papers published from 2007 to the present with at least one Indiana University-affiliated author, the team correlated the two datasets.

"You have input and output, and we try to find the corresponding part. Then you can sort of figure out what input leads to what output, how big the leap was [from one idea to another]"—or, in other words, the level of creativity of the publication, says Wang.

Due to privacy and technology constraints, the team could not track information consumption and knowledge production at the individual level. They were, however, able to study the correlation at the organizational level and found "remarkable predictability in creative processes": For 59 percent of papers across all scientific fields, 25.7 percent of their creativity could be readily explained by their potential authors' information consumption.
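The two factors described in this section, rarity and the size of the jump between ideas, can be made concrete with a toy calculation. The Python sketch below is an illustration only and is not the team's method: it assumes each paper is reduced to a list of reference labels, and it treats a pair of references as a bigger "jump" the less often earlier papers cited that pair together. The function names, the smoothing and the sample data are all hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def pair_counts(corpus):
    """Count how many papers in the corpus cite each pair of references together."""
    counts = Counter()
    for refs in corpus:
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


def toy_creativity(refs, counts, n_papers):
    """Average 'surprise' of a paper's reference pairs (hypothetical score).

    A pair that earlier papers rarely cited together gets a high surprise
    value, so a paper that links rarely connected work scores higher.
    """
    pairs = list(combinations(sorted(set(refs)), 2))
    if not pairs:
        return 0.0  # a paper with fewer than two references links nothing
    # Add-one smoothing keeps never-seen pairs finite.
    surprise = [-math.log((counts[p] + 1) / (n_papers + 1)) for p in pairs]
    return sum(surprise) / len(pairs)


# Made-up prior literature: references "a" and "b" are routinely cited together.
prior = [["a", "b"], ["a", "b"], ["a", "b"], ["c", "d"]]
counts = pair_counts(prior)

familiar = toy_creativity(["a", "b"], counts, len(prior))  # a well-worn pairing
novel = toy_creativity(["a", "d"], counts, len(prior))     # a pairing not seen before
print(familiar, novel)
```

In this toy setup the paper that links "a" and "d" scores higher than the one that repeats the familiar "a" and "b" pairing. The study itself goes further by also comparing such outputs against what researchers were reading online.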

Speeding up the process

An understanding of the creative process can potentially provide a valuable tool to streamline that process, filtering out useless information and focusing on the most critical pieces. With that in mind, the team leveraged their findings about predictability to develop a predictive framework.

"We are kind of buried in this sea of information," says Wang. "So when you want to do something, the first step is to see what others have done. Particularly if you're doing very cutting-edge research and the frontier is so large, there are a lot of related works. You are aware or maybe not aware [of those works]."

The team's framework, says Wang, can reveal the next step a researcher should take and speed up the creative process.

"If we understand how you make the jump from one thought to another, eventually a machine or an algorithm [can be created] to recommend a lot of information, saying 'this is one piece of information you might want to look at,'" he explains.

This could be especially valuable for those working toward a specific target innovation, and it is not limited to scientific enterprise.

"Indeed," the team writes, "[our framework's] mechanistic nature makes it potentially applicable for describing creative processes in other domains as well, such as musical, artistic and linguistic creativity."

Creative differences

Wang and his colleagues also found that creativity plays out differently across academic disciplines. They compared biology and computer science specifically.

Reference pairs for publications in biology, Wang says, tend to be close together. These small steps occur, he says, because biology is a more established, risk-averse discipline resting on a large amount of existing work. In fact, he says, ideas with higher levels of creativity may not be as well received in the biology community due to its more conservative tendencies.

Computer science, on the other hand, tends to link diverse, more difficult-to-connect ideas. "Computer science has a little more room for risk [because] it won't lead to any consequential results," says Wang.

Therefore, Wang explains, papers in computer science demonstrate higher levels of creativity: "One can observe that biology follows a lognormal distribution, while computer science apparently follows a bimodal distribution, peaking at both low and high creativity scores. Such phenomena may be explained by that compared with biology, computer science is a relatively 'engineering' discipline, featuring more frequent fusion of originally disconnected knowledge."

Risk vs. reward

Wang, who runs the DataPower Lab at Lehigh, works on "both sides of data mining." He conducts research that invents new concepts and methods that empower large-scale data mining, and his work also bridges disciplinary boundaries for the application of these advances to privacy, security and trust issues. Prior to his arrival at Lehigh, Wang worked as a security analyst at IBM's Thomas J. Watson Research Center, where he wondered if analyzing web traffic might reveal to an outsider the company's research activity. This eventually led to his work on creativity and information consumption. Privacy, he says, remains a concern in this research.

"On the positive side, I try to understand complex societal, technological, and business phenomena using massive amounts of data," Wang explains. "On the negative side, I see how such big data approaches break privacy, safety, transparency."

Recalling his work at IBM, Wang notes the concerns a private-sector company might have with this use of data.

For example, researchers are not permitted to search a patent database, he explains, because narrowing the scope of a search and focusing on a particular type of patent can enable a competitor to determine the specific patents a company is going to file, putting that company at a disadvantage.

In a nonprofit environment like a university, these concerns still exist, and discussions about them are becoming increasingly important. Any potential risk, though, is weighed against its reward.

"I think we are in a time where huge amounts of data give us lots of possibilities, and these things can only happen at this moment because previously we didn't have this kind of data available," says Wang.

Still, says Wang, data is limited when it comes to creativity. Researchers generate ideas from many sources—conversations with colleagues or print copies of books, for example—and those sources cannot be captured using the dataset he and his colleagues used for this study. However, this "may be an important step towards a better understanding of this creative process," he says.

 

This story appears as "Quantifying Creativity" in the 2017 Lehigh Research Review.