Getting the most from ultralarge datasets

In the words of Virginia Commonwealth University associate computer science professor Vojislav Kecman, Ph.D., “There is no part of contemporary human activity left untouched by the need and desire to collect data.”

Indeed. Human beings, he explains, are immersed in a sea of data generated by myriad sources – sensors, cameras, microphones, software and other devices. Whether it takes the form of measurements, images and patterns or sounds, Web pages and tunes, the data arrives in ultralarge-scale datasets that humans simply cannot process. But, he says, algorithms and methods can be developed specifically to perform the job of learning from data.

“There is no efficient tool for dealing with millions of records,” Kecman explains. “This is why we are attacking this problem in our lab. We are developing algorithms to handle ultralarge datasets in all areas of human activities by using statistical learning approaches and models. We are inventing new ways to analyze datasets and we’re creating novel mathematical structures.”

In his lab at the VCU School of Engineering, Kecman and his students are developing and applying models in bioinformatics, medicine, engineering, science, e-commerce and Web mining. They are tackling the challenges associated with what is known in the information technology industry as “Big Data.” The challenges presented by vast amounts of digital data are so many, in fact, that the White House announced the “Big Data” initiative in March 2012.

“We’re very proud that we started working on the Big Data project long before it became an initiative of the U.S. government,” Kecman said. “We believe we’re just a step ahead of many other machine learning (ML) and data mining (DM) labs.”

By teaching graduate students how to better analyze and mine today’s ultralarge databases, Kecman is giving them a competitive edge in the workplace. He is also developing a new undergraduate data analysis course that goes beyond the traditional statistics approach of years past, which is not equipped to handle today’s Big Data problems.

“Our results show that we’re able to handle millions of records with higher accuracy in a much shorter amount of time with fewer resources,” he said.

In the sea of data in today’s world, that’s quite a life raft.

“Our results show that we’re able to handle millions of records with higher accuracy in a much shorter amount of time with fewer resources.”

– Vojislav Kecman, Ph.D.