The Forgotten Data Centers

Data centers are where the Internet and cloud services live, so they have been getting a lot of public attention in recent years. In technology news and research papers, it's common to see IT giants like Google and Facebook publicly discuss and share the designs of the mega-scale data centers they operate. But another important type of data center, the multi-tenant data center (commonly called a "colocation" or "colo"), has been largely hidden from the public and rarely discussed, at least in research papers, even though it's very common in practice and located almost everywhere, from Silicon Valley to the gambling capital, Las Vegas.

Unlike a Google-type data center, where the operator manages both the IT equipment and the facility, a multi-tenant data center is a shared facility where multiple tenants house their own servers in shared space and the data center operator is mainly responsible for facility support (like power, cooling, and space). Although the boundary is blurring, multi-tenant data centers can generally be classified as either wholesale or retail: wholesale data centers (like Digital Realty) primarily serve large tenants, each with a power demand of 500kW or more, while retail data centers (like Equinix) mostly target tenants with smaller demands.

Multi-tenant data centers serve almost all industry sectors, including finance, energy, major web service providers, and content delivery network providers. Even some IT giants lease multi-tenant data centers to complement their own data center infrastructure. For example, Google, Microsoft, Amazon, and eBay are all large tenants in a hyper-scale multi-tenant data center in Las Vegas, NV, and Facebook leases a large data center in Singapore to serve its users in Asia.

Multi-tenant data centers and clouds are also closely tied. Many public cloud providers, like Salesforce, that don't want to or can't build their own massive data centers lease capacity from multi-tenant data center providers. Even the largest players in the public cloud, like Amazon, use multi-tenant data centers to quickly expand their services, especially in regions outside the U.S. In addition, with the emergence of hybrid cloud as the most popular option, many companies are housing their private clouds entirely in multi-tenant data centers, while large public cloud providers are forging partnerships with multi-tenant data center providers to help tenants use public clouds to complement their private deployments.

Today, the U.S. alone has over 1,400 large multi-tenant data centers, which consume nearly five times as much energy as all Google-type data centers combined (37.3% versus 7.8% of all data center energy usage, excluding tiny server closets). Driven by the surging demand for web services, cloud computing, and the Internet of Things, the multi-tenant data center industry is expected to continue its rapid growth. While public attention mostly goes to the large IT giants who continuously expand their data center infrastructure, multi-tenant data center providers are also building out their own, at an even faster pace.

Despite their prominence, multi-tenant data centers have been much less studied than Google-type data centers by the research community. While these two types of data centers share many of the same high-level goals, like energy efficiency, utilization, and renewable integration, many of the existing approaches proposed for Google-type data centers don't apply to multi-tenant data centers, which face additional challenges due to the operator's lack of control over tenants' servers. Even worse, individual tenants manage their own servers with little coordination with others (in fact, tenants typically don't even know whom they're sharing the data center with).

As a concrete example, consider the problem we studied in our recent HPCA'16 paper. Keeping the servers' aggregate power usage below the data center capacity at all times is extremely important for ensuring data center uptime. When the aggregate power occasionally exceeds the capacity (called an emergency, a consequence of power oversubscription), a common technique in Google-type data centers is to carefully lower the servers' power consumption so as to meet multi-level power capacity constraints while minimizing the performance degradation. In a multi-tenant data center, the operator can't do this, because the tenants themselves control the servers. Further, who should reduce power, and by how much, must be carefully decided to minimize the performance loss, but these decisions require the operator to know tenants' private information (e.g., what workloads are running, or how much performance is lost when certain servers' power is lowered). So, these challenges can't be addressed by existing technological approaches alone; instead, they require novel market designs along with new advances in data center architecture, with the goal of providing mechanisms that are "win-win" for both data center operators and tenants.
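To make the flavor of such a market design concrete, here is a minimal sketch in Python. To be clear, this is not the mechanism from the HPCA'16 paper; it's a toy illustration in which tenants bid the (private) cost they incur per kW of power reduction, and the operator buys the cheapest reductions needed to get back under capacity. All names and numbers are made up.

```python
# A toy sketch of market-based "emergency" power capping in a
# multi-tenant data center. Tenants submit bids describing how much
# power they can shed and at what cost; the operator greedily buys
# the cheapest reductions until the capacity deficit is covered.

from dataclasses import dataclass

@dataclass
class Bid:
    tenant: str
    max_reduction_kw: float  # how much power this tenant can shed
    cost_per_kw: float       # tenant's reported performance cost per kW shed

def cap_power(bids, deficit_kw):
    """Select the cheapest reductions covering `deficit_kw`.

    Returns (tenant, kW shed, payment) tuples under a simple
    pay-as-bid rule. A real mechanism needs more care, e.g., to make
    truthful cost reporting each tenant's best strategy.
    """
    selected = []
    remaining = deficit_kw
    for bid in sorted(bids, key=lambda b: b.cost_per_kw):
        if remaining <= 0:
            break
        shed = min(bid.max_reduction_kw, remaining)
        selected.append((bid.tenant, shed, shed * bid.cost_per_kw))
        remaining -= shed
    if remaining > 0:
        raise RuntimeError("bids cannot cover the power deficit")
    return selected

if __name__ == "__main__":
    bids = [
        Bid("tenant-A", max_reduction_kw=50, cost_per_kw=2.0),
        Bid("tenant-B", max_reduction_kw=30, cost_per_kw=0.5),
        Bid("tenant-C", max_reduction_kw=40, cost_per_kw=1.0),
    ]
    # Aggregate power is 60 kW over the facility's capacity.
    for tenant, kw, payment in cap_power(bids, deficit_kw=60):
        print(f"{tenant}: shed {kw:.0f} kW, paid {payment:.2f}")
```

Even this toy version hints at the real difficulty: a pay-as-bid rule gives tenants an incentive to overstate their costs, which is exactly why careful mechanism design is needed.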

The above was one example of the added complexity due to the multi-tenant setting, but there are many others.  We have a few papers on this topic (HPCA’16, HPCA’15, Performance’15, …), but will be looking into it much more in the coming years. We hope others do as well!

Reporting from SoCal NEGT

Last week, USC hosted our annual Southern California Network Economics and Game Theory (NEGT) workshop. (Thanks to David Kempe and Shaddin Dughmi for all the organization this year!) It's always a very fun workshop, and it really does a great job of sustaining a multidisciplinary community around CS, EE, and Econ in the LA area. We've been doing it for so long now that the faculty & students really know each other well…

As always, there were lots of great talks.  In particular, we had a great set of keynotes again this year.

The first keynote was Ashish Goel, who gave an inspiring talk about his work on "Crowdsourced democracy," where he has managed an incredible thing: he has built a system for participatory budgeting (the process where the community votes on particular social works projects and the outcome of the voting actually determines budget priorities). His system has now been used in a wide variety of cities, and whenever it's used, he's gotten to run experiments alongside the official voting that allow him to gather data about the effectiveness of a variety of platform designs for participatory voting. This in turn has motivated some deep theoretical work on the efficiency of different platform designs, which looks like it has the possibility of impacting real practice in the coming years! It is a truly exciting place where theory and practice are intertwined, and where a researcher is attacking a problem of crucial societal importance.

The second keynote came from Kevin Leyton-Brown, who gave a similarly ambitious talk about work he's pursuing that questions the standard models of game theory. Kevin's work typically takes a hard look at the interaction of theoretical and practical issues in algorithmic game theory, and this work does the same: it questions the typical theoretical abstractions about agent behavior and strives to build better models of how people actually react in strategic settings. It is great to see someone from the computer science side of algorithmic game theory getting engaged with behavioral economics, an area of economics that is, to this point, fairly untouched by computer scientists.

Of course, there were lots of interesting talks from the “locals” too, but I’ll stop here.  We’ve now been doing this for seven years, and I’m so glad that it’s going strong — I’m looking forward to trucking over to UCLA for next year’s incarnation!

Introducing DOLCIT

At long last, we have gotten together and created a “Caltech-style” machine learning / big data / optimization group, and it’s called DOLCIT: Decision, Optimization, and Learning at the California Institute of Technology.  The goal of the group is to take a broad and integrated view of research in data-driven intelligent systems. On the one hand, statistical machine learning is required to extract knowledge in the form of data-driven models. On the other hand, statistical decision theory is required to intelligently plan and make decisions given imperfect knowledge. Supporting both thrusts is optimization.  DOLCIT envisions a world where intelligent systems seamlessly integrate learning and planning, as well as automatically balance computational and statistical tradeoffs in the underlying optimization problems.

In the Caltech style, research in DOLCIT spans traditional areas from applied math (e.g., statistics and optimization) to computer science (e.g., machine learning and distributed systems) to electrical engineering (e.g., signal processing and information theory). Further, we will look broadly at applications, ranging from information and communication systems to the physical sciences (neuroscience and biology) to social systems (economic markets and personalized medicine).

In some sense, the only thing that’s new is the name, since we’ve been doing all these things for years already.  However, with the new name will come new activities like seminars, workshops, etc.  It’ll be exciting to see how it morphs in the future!

(And, don’t worry, RSRG is still going strong — RSRG and DOLCIT should be complementary with their similar research style but differing focuses with respect to tools and applications.)

Caltech CMS is hiring

I’m happy to announce that the Computing and Mathematical Sciences (CMS) department will be continuing to grow this year.  The ad for our faculty search is now up, so spread the word!

As you'll see, the ad is intentionally broad. We are looking for strong applicants across computer science and applied math. We look for impressive, high-impact work rather than enforcing preconceived notions of what is hot at the moment, and we welcome areas on the periphery of computing and applied math too: candidates at the interface of EE, mechanical engineering, economics, biology, physics, etc. are definitely encouraged to apply! One of the strengths of our department (and Caltech in general) is the permeability of the boundaries between fields.

Algorithms & Uncertainty at the Simons Institute

I’m hoping that most of the people who read this blog have already heard, but in case they haven’t — next Fall, the Simons Institute is hosting a semester-long program on Algorithms and Uncertainty, which I am co-organizing with Avrim Blum, Anupam Gupta, Robert Kleinberg, Stefano Leonardi, and Eli Upfal.

It should be a very interesting semester, and we’ve already lined up a long list of interesting long-term participants.  The planning for the workshops is just beginning, but there will be three main events: an initial “Boot Camp” and then two workshops: “Optimization and Decision-Making Under Uncertainty” and “Learning, Algorithm Design, and Beyond Worst-Case Analysis”.

More info about all of the events will be posted here as it becomes available.

The reason for posting about this now is that a call recently went up for Simons-Berkeley Research Fellowships, which are for junior researchers within six years of their PhD.  Please spread the word about the application — the deadline is December 15, 2015.

A new ACM proposal for integrating conference and journal publication models

The relative merits (and problems) of the conference and journal publication models are a ubiquitous topic of conversation at almost every conference/workshop I've ever attended. It's been great to see lots of experiments with adjustments to the models in recent years at, e.g., ICML, VLDB, SIGGRAPH, etc. For various reasons, these venues have been willing and able to be quite bold in experimenting with publication models and creating various "jourference" hybrids.

It seems that the success of these experiments may now have given birth to a broader solution.  This month’s CACM outlines a proposal for a new approach toward merging conference and journal publication models in CS.  They also devote a point/counterpoint to a discussion of the relative merits of the proposal.  If you haven’t yet seen it, give it a read and be sure to fill out the survey afterwards.  This is a crucial issue for the CS community in my mind.

To put myself out there a bit: I think the proposal is a great option for CS, and I hope that we can move to a model like this across the ACM. The conference publication model is great at allowing fast publication and creating a sense of community in different research areas, but it has very significant issues stemming from the lack of time for revisions to address reviewer comments and from the challenge of creating "journal versions" after the fact (duplicated review effort, awkwardness in assessing novelty, etc.). As anyone who has worked at the interface of CS and other fields knows, this creates significant hurdles for visibility, especially for students, while also duplicating reviewing effort across the community.

How much energy does Bitcoin require?

You hear a lot about Bitcoin these days — how it is or isn’t the future of currency… In the middle of such a discussion recently, the topic of energy usage came up: Suppose Bitcoin did take over — what would the sustainability impact be?

It's certainly a complicated question, and I don't think I have a good answer to it yet. But a first-order question along the way would be: how much energy is required per Bitcoin transaction?

There's a nice analysis of this over at Motherboard, and the answer it comes to is this: a single Bitcoin transaction uses enough electricity to power ~1.5 American households for a day! By comparison, a Visa transaction requires somewhere on the order of the electricity needed to power 0.0003 households for a day…

That's a much bigger gap than I would've guessed, and it highlights that a lot of energy-efficiency improvement would be needed if Bitcoin were to grow dramatically! Of course, it's not clear whether that is even possible, since the security of the system depends on the difficulty of the computations (and thus on large energy consumption). So, in a sense, the significant energy usage is part of the protection against attacks… I guess there are probably some interesting research questions here.

See the full article for a discussion of how they arrived at these numbers.  It’s a back-of-the-envelope calculation, but a pretty reasonable estimate in my mind.
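For a feel of how such an estimate comes together, here's a rough sketch of the style of calculation in Python. Every input below is an assumption I've plugged in for illustration (roughly 2015-era guesses, not the article's exact figures): estimate the network's total power draw from its hash rate and miner efficiency, spread that energy over a day's transactions, and compare to a typical household's daily usage.

```python
# Back-of-the-envelope: energy per Bitcoin transaction.
# All inputs are illustrative assumptions; tweak them to see
# how sensitive the answer is.

HASH_RATE_H_PER_S = 450e15          # assumed network hash rate (hashes/second)
MINER_EFFICIENCY_J_PER_H = 0.5e-9   # assumed energy per hash (~0.5 J/GH)
TX_PER_DAY = 120_000                # assumed daily transaction volume
HOUSEHOLD_KWH_PER_DAY = 30.0        # rough average U.S. household usage

# Total network power draw (watts), assuming all hashing runs at this efficiency.
network_power_w = HASH_RATE_H_PER_S * MINER_EFFICIENCY_J_PER_H

# Network energy per day (kWh), then per transaction.
network_kwh_per_day = network_power_w * 24 / 1000
kwh_per_tx = network_kwh_per_day / TX_PER_DAY

print(f"Network draw: {network_power_w / 1e6:.0f} MW")
print(f"Energy per transaction: {kwh_per_tx:.1f} kWh")
print(f"Household-days per transaction: {kwh_per_tx / HOUSEHOLD_KWH_PER_DAY:.2f}")
```

With these made-up inputs the answer lands right around 1.5 household-days per transaction, but the real takeaway is how directly the result scales with whatever hash rate and hardware efficiency you assume.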