USC News

So Much Data, So Little Time

07/07/08
MySpace looks to USC researchers to help overwhelmed servers keep up with millions of new users on the social networking system.
By Eric Mankin
Professor Shahram Ghandeharizadeh and USC graduate student Shahin Shayandeh with MySpace representatives Felipe Cariño Jr. and Tin Zaw, from left

Photo/Chuck Espinoza
Imagine a moment – a thousandth of a second – in the life of a MySpace computer.

In that fraction of time, thousands of fingers on mice, spread across thousands of square miles, have clicked in urgent requests for countless chunks of data: a photo requested in Minneapolis, video needed in Des Moines, a forum comment registered in Detroit.

As the vast social networking system grows, each millisecond becomes more and more crowded with requests. But now a USC specialist is working to make sure that the answers keep coming back quickly, even with tens of millions of new users.

“If MySpace were less successful,” said USC Viterbi School professor Shahram Ghandeharizadeh, “there would be no problem. But at the current volume of transactions, getting to the data quickly becomes an issue.”

The key to speed and capacity is called DRAM.

“Ideally,” said Ghandeharizadeh, the director of the USC Database Laboratory, “you want all the data requested to be in the quick-access cache memory of the servers, the DRAM, rather than having to retrieve it from the servers’ disc memory, which is much slower.”

But the total volume of data created by users is far more than the DRAM cache will hold. And as the user population grows at an accelerating pace, more and more requests arrive to query a larger and larger body of information. Even the innovative Berkeley DataBase system that MySpace uses to keep interactions quick is coming under increasing strain.

The Berkeley system keeps MySpace information flowing by quadruple redundancy: Each section of users is served not by one but four overlapping servers that share DRAM space, making the system faster, more reliable and more scalable.

“Now it works,” said Felipe Cariño, who heads MySpace Research, the company’s in-house R&D facility, “but if you double it, it may not.” And the population may in fact double as people in other countries learn to meet and greet each other on their own sites, he said.

Ghandeharizadeh is working with Cariño, a 1995 USC Marshall School Executive MBA, to find a way around the impending squeeze. Cariño dubs their effort the “Gemini Project,” after the famous twins: “Two heads, Viterbi and MySpace, coming together.”

The collaborators have been exploring a new method for maintaining and replacing the data kept in DRAM. Up to now, the replacement policy has been simple: Data that has remained in the DRAM longest without being accessed is overwritten by new data, an approach commonly known as least-recently-used (LRU) replacement. There is room for improvement.
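The replacement rule just described, overwriting whatever has gone longest without being accessed, can be illustrated with a minimal Python sketch. This is not MySpace's code: the class name `LRUCache`, the byte-based capacity and the object names are assumptions made purely for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.items = OrderedDict()  # key -> size in bytes, least recently used first

    def get(self, key):
        """Return True on a hit and mark the key as most recently used."""
        if key not in self.items:
            return False
        self.items.move_to_end(key)
        return True

    def put(self, key, size):
        """Insert an object, evicting least recently used entries to make room."""
        if size > self.capacity_bytes:
            return  # the object could never fit, so leave the cache untouched
        if key in self.items:
            self.used_bytes -= self.items.pop(key)
        while self.used_bytes + size > self.capacity_bytes:
            _, evicted_size = self.items.popitem(last=False)  # oldest access first
            self.used_bytes -= evicted_size
        self.items[key] = size
        self.used_bytes += size


# Usage: ten small text objects fill the cache, then one large video arrives.
cache = LRUCache(capacity_bytes=100)
for i in range(10):
    cache.put(f"text{i}", 10)
cache.get("text0")        # text0 becomes the most recently used entry
cache.put("video", 90)    # pushes out the nine entries accessed longest ago
print(list(cache.items))
```

In this toy run a single large object displaces nine small ones, which is the weakness a size-aware policy tries to address.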

Another method is potentially more effective: “heuristic” replacement, in which data is given simple but useful characterizations, which a program then uses to guide replacement.

The heuristic algorithm that the MySpace Intrapreneurial Research Group is adapting to the MySpace database comes out of a recent Ph.D. research thesis done by USC graduate student Shahin Shayandeh, who is a member of the team, along with three MySpace computer scientists.

The key element is taking file size into account as well as creation date. Large objects, like video files, kick many small objects out of memory when they are loaded. While the general solution is based on how frequently objects are accessed, another rule is to keep very large objects out of DRAM altogether.

“When you have a gigantic video file that takes up the same space as hundreds or even thousands of text files, uploading it is a shock to the system,” Ghandeharizadeh said.

Based on their research, Cariño and Ghandeharizadeh are hopeful that the new algorithm will adapt to MySpace demands and deliver the desired improvements in performance.

“Simulation studies show the heuristic method is a marvel. But seeing whether it delivers in a real ultra-large system, such as the one at MySpace, remains to be seen,” Ghandeharizadeh said.

Ghandeharizadeh cherishes the opportunity to take the software to limits not reached by others. He and Cariño both specialize in reliable, ultra-large systems, of which only three or four exist in the entire world. MySpace is one of them. It includes, according to Cariño, “10 data centers and thousands of servers in 30-plus countries supporting local cultures and languages.”

The opportunity delights Ghandeharizadeh, who does not have many occasions to work directly on the ultra-big databases for which his creations are designed.

Cariño noted that the MySpace Research mission is clear and tightly focused “so that the final result must be a system or prototype that creates a new product or technology.”