Big Data Man

By Michael McLaughlin / November/December 2014
November 5th, 2014

You can thank David Blei ’97 for all those personalized suggestions of things to buy that pop up on your screen whenever you’re online. In fact, the easiest way for this Columbia University computer science and statistics professor to explain his field of expertise—“probabilistic topic modeling”—is to talk about Amazon or Netflix or Etsy, the wildly popular website that sells handmade goods and crafts. (Etsy uses a variation of an algorithm that Blei cowrote.)

Photo: courtesy David Blei

Amazon and Netflix apply advanced formulas related to probabilistic topic modeling to your activity. To suggest music or a TV show you might like, their methods may use algorithms that not only analyze your past choices but also compare them with those of similar shoppers.
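To make the idea of comparing your choices with those of similar shoppers concrete, here is a minimal sketch in Python. It is not the method Amazon, Netflix, or Etsy actually uses, and it is not topic modeling itself; it is a simple neighbor-based recommender that measures overlap between viewing histories with Jaccard similarity. The user names, item names, and the `recommend` function are all hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Overlap between two sets of items: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, histories, top_n=3):
    """Rank items the target has not chosen yet, weighting each item
    by how similar its other fans' histories are to the target's."""
    mine = histories[target]
    scores = Counter()
    for user, items in histories.items():
        if user == target:
            continue
        sim = jaccard(mine, items)
        for item in items - mine:
            scores[item] += sim
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


# Hypothetical viewing histories, invented for illustration.
histories = {
    "ana": {"drama_a", "drama_b", "comedy_a"},
    "ben": {"drama_a", "drama_b", "drama_c"},
    "cy": {"comedy_a", "comedy_b"},
    "dee": {"scifi_a", "scifi_b"},
}

print(recommend("ana", histories))  # ['drama_c', 'comedy_b']
```

Ben shares two of Ana's three choices, so his remaining pick ranks first. Dee shares nothing with Ana, so her picks are never suggested.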

The predictions can be amazingly—and eerily—accurate. Blei’s pioneering doctoral work on probabilistic topic modeling at UC Berkeley first allowed computers to summarize the content of a large collection of data.

“It used to be difficult to handle 10,000 documents,” Blei says, “but now we can process millions.” Because of his work, Blei was awarded a prestigious National Science Foundation Presidential Early Career Award for Scientists and Engineers in 2011.

The applications for Blei’s breakthrough go far beyond stores and streaming services. Researchers can now scan visual images, like photos, or find patterns among common ancestors by looking at genetic data. Lawyers digging through subpoenaed e-mails, historians studying government records, or anyone who doesn’t know what’s buried in a mountain of data also stand to benefit. “Sure, you could hire a thousand people to read every document and summarize them,” Blei says, “but this does it automatically.”

At Brown, Blei concentrated in math and computer science, focusing on artificial intelligence. At Berkeley he became interested in text data mining, a direct precursor to his current work. With another grad student and their adviser, he wrote the groundbreaking algorithm, known as “latent Dirichlet allocation,” that spawned the field of probabilistic topic modeling.

The ongoing challenge is to do more with increasingly large data sets and to do it faster. Until now, many search engines have only hunted for keywords that match a user’s specific search terms. A search for “cat,” for example, would not turn up links to sites about “felines.”

Blei’s algorithms may help to correct that deficiency. His model, for example, could show that weather is the unifying subject among pages discussing temperature, rainfall, and wind. “We need to take advantage of all this information,” Blei says. “Something like keyword search can fall short.”
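For readers curious how a topic model might group pages this way, the following is a toy sketch of latent Dirichlet allocation fitted by collapsed Gibbs sampling, written in plain Python. It rests on several assumptions: Gibbs sampling is one standard way to fit the model, not necessarily the procedure in Blei's own work; the tiny "pages," the `lda_gibbs` function, and the settings `alpha`, `beta`, and `iters` are all invented for illustration. With so little data the resulting topics can vary from run to run.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of token lists. Returns (theta, topic_word):
    theta[d][k] is the estimated share of topic k in document d, and
    topic_word[k][w] counts how often word w was assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [{w: 0 for w in vocab} for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start by giving every word occurrence a random topic.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Favor topics common in this document and fond of this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word


# Hypothetical mini "pages," invented for illustration.
docs = [
    "temperature rainfall wind temperature".split(),
    "wind rainfall temperature wind".split(),
    "cat feline kitten cat".split(),
    "feline kitten cat feline".split(),
]
theta, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print(f"topic {k}: {top}")
```

The model is never told that "weather" or "cats" exist. Any grouping it finds comes only from which words tend to appear together on the same page.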

At Princeton, where Blei previously worked, he teamed up with Ken Norman, a neuroscientist, to study human memory. The researchers performed brain scans on volunteers as they memorized and recited a list of words. A computer program then examined the brain scans for patterns that reveal how people store and retrieve language in their minds.

