Autoscale, a.k.a. “Dynamic right-sizing”, at Facebook

A bit of news on the data center front, for those who may have missed it:  Facebook recently announced the deployment of a new power-efficient load balancer called “Autoscale.”  Here’s their blog post about it.

Basically, the quick and dirty summary of the design is this: adapt the number of active servers so that it is roughly proportional to the workload, and adjust the load balancing to keep the active servers “busy enough,” rather than ending up in a situation where lots of servers are each very lightly loaded.
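To make the idea concrete, here is a minimal Python sketch of this style of dynamic right-sizing.  It is not Facebook’s implementation; the proportional sizing rule, the target utilization, and the per-server capacity below are illustrative assumptions.

    # Illustrative sketch of dynamic right-sizing: keep only enough servers
    # "active" to serve the current load at a target utilization, and route
    # requests to the active pool instead of spreading them thinly.
    # The constants and the sizing rule are assumptions, not Facebook's policy.
    import math

    SERVER_CAPACITY_RPS = 500.0   # assumed requests/sec a single server can handle
    TARGET_UTILIZATION = 0.75     # assumed "busy enough" operating point
    MIN_ACTIVE = 1

    def active_pool_size(load_rps):
        """Number of servers to keep active, proportional to the workload."""
        needed = load_rps / (SERVER_CAPACITY_RPS * TARGET_UTILIZATION)
        return max(MIN_ACTIVE, math.ceil(needed))

    def dispatch(servers, load_rps, request_id):
        """Round-robin over the active subset; the rest can idle or run batch jobs."""
        active = servers[: active_pool_size(load_rps)]
        return active[request_id % len(active)]

    fleet = ["web%03d" % i for i in range(100)]
    # Around midnight the load is low, so only a small subset stays active.
    print(active_pool_size(3000.0))          # -> 8 of the 100 servers
    print(dispatch(fleet, 3000.0, 42))       # -> "web002"

The point of the sketch is simply that the load balancer decides how much of the fleet needs to be “awake” at any moment and concentrates traffic there.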

So, the ideas are closely related to what’s been going on in academia over the last few years.  Some of them are likely inspired by the work of Anshul Gandhi, Mor Harchol-Balter, et al. (who have been chatting with Facebook over the past few years), and the architecture is actually quite similar to the “Net Zero Data Center Architecture” developed by HP (which incorporated some of our work, e.g. these papers, joint with Minghong Lin, who now works with the infrastructure team at Facebook).

While Facebook isn’t the first tech company to release something like this, it’s always nice to see it happen.   And, it will give me more ammo to use when chatting with people about the feasibility of this sort of design.  It is amazing to me that I still get comments from folks about how “data center operators don’t care about energy”…  So, to counter that view, here are some highlights from the post:

“Improving energy efficiency and reducing environmental impact as we scale is a top priority for our data center teams.”

“during low-workload hours, especially around midnight, overall CPU utilization is not as efficient as we’d like. […] If the overall workload is low (like at around midnight), the load balancer will use only a subset of servers. Other servers can be left running idle or be used for batch-processing workloads.”

Anyway, congrats to Facebook for taking the plunge.  I hope that I hear about many other companies doing the same in the coming years!

The Community Seismic Network: Citizen Science and the Cloud

We almost missed the chance to highlight that the cover story of the July 2014 issue of the Communications of the ACM (CACM) is a paper by a Caltech group on the Community Seismic Network (CSN). This note is about CSN as an example of a system in a growing, important nexus: citizen science, inexpensive sensors, and cloud computing.

CSN uses inexpensive MEMS accelerometers or the accelerometers in phones to detect shaking from earthquakes. The CSN project builds accelerometer “boxes” that contain an accelerometer, a SheevaPlug, and cables. A citizen scientist merely affixes the small box to the floor with double-sided sticky tape and connects cables from the box to power and to a router. Installation takes minutes.

Analytics running on the SheevaPlug, or on some other computer connected to the accelerometer, processes the raw data streaming in from the sensor and detects local anomalous acceleration. Local anomalies could be due to somebody banging on a door, a big dog jumping off the couch (a frequent occurrence in my house), or an earthquake. The plug computer or phone sends messages to the cloud when it detects a local anomaly. An accelerometer may measure 200 samples per second, but messages get sent to the cloud at rates that range from one per minute to one every 20 minutes. The local anomaly message includes the sensor id, location (because phones move), and magnitude.
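For a rough sense of the sensor-side data flow, here is a small Python sketch of this kind of edge processing.  The ~200 Hz sample rate and the message fields come from the description above, but the anomaly test, thresholds, and names are illustrative assumptions, not the actual CSN code.

    # Toy sketch of the sensor-side pipeline: sample the accelerometer at
    # ~200 Hz, flag local anomalies relative to a running baseline, and send
    # only a small summary message to the cloud. The detector and constants
    # are illustrative assumptions, not the CSN implementation.
    import json
    import time
    from collections import deque

    SAMPLE_RATE_HZ = 200          # roughly 200 samples per second, as in the post
    WINDOW_SECONDS = 10
    THRESHOLD_FACTOR = 5.0        # assumed: flag samples far above the recent baseline

    recent = deque(maxlen=SAMPLE_RATE_HZ * WINDOW_SECONDS)

    def is_local_anomaly(sample):
        """Flag a sample that is much larger than the recent average magnitude."""
        recent.append(abs(sample))
        baseline = sum(recent) / len(recent)
        return abs(sample) > THRESHOLD_FACTOR * max(baseline, 1e-6)

    def anomaly_message(sensor_id, lat, lon, magnitude):
        """Small summary sent to the cloud: sensor id, location, and magnitude."""
        return json.dumps({
            "sensor_id": sensor_id,
            "lat": lat,
            "lon": lon,
            "magnitude": magnitude,
            "timestamp": time.time(),
        })

    # A big spike after a stretch of quiet readings triggers a single message.
    for a in [0.01] * 2000 + [0.8]:
        if is_local_anomaly(a):
            print(anomaly_message("plug-0042", 34.14, -118.13, abs(a)))

The key design point is the bandwidth asymmetry: the sensor looks at hundreds of samples per second, but the cloud only ever sees a trickle of small, timestamped summaries, which is part of what makes it plausible to fuse data from hundreds of sensors on inexpensive cloud services.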

There are four critical differences between community networks and traditional seismic networks:

  • Community sensor fidelity is much poorer than that of expensive instruments.
  • The quality of deployment of community sensors by ordinary citizens is much more varied than that of sensors deployed by professional organizations.
  • Community sensors can be deployed more densely than expensive sensors. Think about the relative density of phones versus seismometers in earthquake-prone regions of the world such as Peru, India, China, Pakistan, Iran and Indonesia.
  • Community sensors are deployed where communities are located, and these locations may not be the most valuable for scientists.

Research questions investigated by the Caltech CSN team include: Are community sensor networks useful? Do the lower fidelity, varied installation practices, and relatively random deployment result in networks that don’t provide value to the community and don’t provide value to science? Can community networks add value to other networks operated by government agencies and companies? Can inexpensive cloud computing services be used to fuse data from hundreds of sensors to detect earthquakes within seconds?


Rigor + Relevance beyond Caltech

I’ve been busy traveling for most of the last month, so I missed my chance for timely postings about a couple of exciting happenings that highlight how theory can impact practice. But here are two that I can’t resist a quick post about, even if it is a bit late.

First, this year’s ACM SIGCOMM award recipient is George Varghese, for “sustained and diverse contributions to network algorithmics, with far reaching impact in both research and industry.”  I think that’s a pretty nice summary. George’s work has been an inspiration for me since grad school, when I worked through his book on Network Algorithmics. He is one of the masters at finding places where theoretical/algorithmic foundations can have practical impact, and he’s also a master at communicating theoretical work to systems folks.  As a result, he’s one of the few people who can consistently get theory-driven work into SIGCOMM, which makes him the envy of many in the community.  It’s really great to see someone with such strong theory chops being recognized by the networking systems community.

The second piece of news is another highlight of the impact of Leslie Lamport’s work.  Mani wrote a great post about him when he received the Turing award earlier this year, in case you missed it.  I finally got to see him give a technical talk at the Microsoft Faculty Summit this year, and was excited to hear about how some of his work on formal specification and model checking is finally getting adopted by industry.  He mentioned in his talk that Amazon was beginning to adopt it for some large internal projects, and was even budgeting development time for TLA+, the formal specification language that Leslie invented.   After hearing this, I started to dig around, and it turns out that James Hamilton has recently blogged about incorporating TLA+ into the development of DynamoDB.


Communication and Power Networks: Forward Engineering (Part II)

In Part I of this post, I explained the idea of reverse and forward engineering, applied to TCP congestion control.   Here, I will describe how forward engineering can help the design of ubiquitous, continuously-acting, and distributed algorithms for load-side participation in frequency control in power networks. One of the key differences is that, whereas on the Internet both the TCP dynamics and the router dynamics can be designed to obtain a feedback system that is stable and efficient, a power network has its own physical dynamics with which our active control must interact.
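To give a rough sense of what “interacting with the physical dynamics” means, here is a simplified sketch (my own notation, not the exact model used in the post): the frequency deviation $\omega_j$ at each bus $j$ obeys swing dynamics of the form

\[
M_j \dot{\omega}_j = P^m_j - d_j - D_j \omega_j - \sum_{k} P_{jk},
\]

where $P^m_j$ is a disturbance in generation or load, $d_j$ is the controllable load-side adjustment, $D_j \omega_j$ models frequency-sensitive loads, and the $P_{jk}$ are branch power flows out of bus $j$. The forward engineering question is then: choose the control law for $d_j$, using only local measurements such as $\omega_j$, so that the closed-loop system behaves like a distributed algorithm for an optimization problem of our choosing, e.g. minimizing the aggregate disutility $\sum_j c_j(d_j)$ of adjusting loads while rebalancing power, with the local frequency deviation playing the role of a Lagrange multiplier.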


Communication and Power Networks: Forward Engineering (Part I)

This blog post will contrast another interesting aspect of communication and power networks: designing distributed control through optimization.  This point of view has been successfully applied to understanding and designing TCP (Transmission Control Protocol) congestion control algorithms over the last decade and a half, and I believe that it can be equally useful for thinking about some of the feedback control problems in power networks, e.g., frequency regulation.
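The canonical instance of this viewpoint is network utility maximization (NUM). As a quick reminder of the form it takes (standard notation from the literature, summarized here rather than taken from this post):

\[
\max_{x \ge 0} \; \sum_i U_i(x_i) \quad \text{subject to} \quad \sum_{i \,:\, l \in r(i)} x_i \le c_l \ \ \text{for every link } l,
\]

where $x_i$ is the sending rate of flow $i$, $U_i$ is a concave utility function, $r(i)$ is the set of links flow $i$ traverses, and $c_l$ is the capacity of link $l$. Reverse engineering shows that existing TCP variants act as distributed primal-dual algorithms for particular choices of $U_i$, with congestion signals such as loss or delay playing the role of link prices; forward engineering runs the logic in the other direction, starting from the optimization problem and deriving the protocol.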

Even though this simple and elegant theory does not account for many important details that an algorithm must deal with in a real network, it has been successfully put into practice.  Any theory-based design method can only provide the core of an algorithm, around which many important enhancements must be developed to create a deployable product. The most important value of a theory is to provide a framework for understanding issues, clarifying ideas, and suggesting directions, often leading to a new opportunity or a simpler, more robust, and higher-performing design.

In Part I of this post, I will briefly review the high-level idea using TCP congestion control as a concrete example.  I will call this design approach “forward engineering,” for reasons that will become clear later.   In Part II, I will focus on power: how frequency regulation is done today, the new opportunities that lie ahead, and how forward engineering can help capture them.


A report from (one day of) EC

This past week, a large part of our group attended ACM EC up in Palo Alto.  EC is the top algorithmic game theory conference, and it has been getting stronger and stronger each year.  I was on the PC this year, and I definitely saw very strong papers fail to make the cut (to my dismay)… In fact, one of the big discussions at the business meeting of the conference was how to handle the growth of the community.

Having found out about the increasingly competitive acceptance standards, I was even happier that our group was so well-represented.  We had four papers on a variety of topics, from privacy to scheduling to equilibrium computation.  I’ll give them a little plug here before talking about some of my highlights from the conference…


A report from NSDI

Last week, I attended NSDI for the first time in quite a few years… I only managed to be at the conference for a day and a half, but there was a lot of interesting stuff going on even in just that short time.

For me, it’s always stimulating to attend pure systems conferences like NSDI, given the contrast in research style with my own.  For example, there were more than a few papers where, somewhere in the implementation, a quite challenging resource allocation problem came up, and the authors just applied a simple heuristic and moved past it without a second thought.  I, on the other hand, would be distracted for months trying to figure out optimality guarantees, etc.  That is, of course, a lot of fun to do and sometimes pays off, but it’s always good to be reminded that simple heuristics are often good enough…

If you only look at four papers, which should they be?

Well, of course, you should start with the best paper award winner:

The topic of this paper highlights that, despite the fact that NSDI is a true systems conference, there were definitely a few papers that took a theoretical/rigorous approach to design.  (Of course our paper did, but there were others too!)
