Privacy as plausible deniability?

As I was flying to the NSDI PC meeting this week I was catching up on reading and came across an article on privacy in the Atlantic that (to my surprise) pushed nearly the same perspective on privacy that we studied in a paper a year or so ago: privacy as plausible deniability.

The idea is that hacks, breaches, monitoring behavior, etc. are so common and hard to avoid that relying on tools from crypto or differential privacy isn’t really enough.  Instead, if someone really cares about privacy they probably need to take that into account in their actions.  For example, you can assume that Google/Facebook/etc. are observing your behavior online and that this is impacting prices, advertisements, etc. Tools from privacy, encryption, etc. can’t really help with this.  However, tools that add “fake” traffic can.  If an observer knows that you are using such a tool then you always have plausible deniability about any observed behavior, and if the fake traffic is chosen carefully, then it can counter the impact of personalized ads, pricing, etc.  There are now companies such as “Plausible Deniability LLC” that do exactly this!
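To make the “fake traffic” idea a bit more concrete, here is a minimal sketch of what such a tool might do.  To be clear, this is my own toy illustration and not how any actual product works: the function name `mix_with_decoys`, the decoy pool, and the number of decoys per real query are all made-up assumptions.

```python
import random


def mix_with_decoys(real_queries, decoy_pool, decoys_per_real=3, seed=None):
    """Hide each real query among randomly chosen decoy queries.

    Hypothetical helper for illustration only: every real query is sent
    together with `decoys_per_real` decoys, in shuffled order, so an
    observer of the stream cannot tell which entries reflect real intent.
    """
    rng = random.Random(seed)
    stream = []
    for query in real_queries:
        # Draw distinct decoys for this batch (assumes the pool is large enough).
        batch = [query] + rng.sample(decoy_pool, decoys_per_real)
        rng.shuffle(batch)
        stream.extend(batch)
    return stream


# Made-up example data.
real = ["flights to Boston", "hotel near MIT"]
decoys = ["cat videos", "weather in Paris", "used cars", "pasta recipes", "tax forms"]

observed = mix_with_decoys(real, decoys, decoys_per_real=3, seed=0)
for entry in observed:
    print(entry)
```

Uniformly random decoys like these are probably the weakest version of the idea.  Choosing the decoys “carefully”, as described above, is where the real work would be.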

On the research front, we looked at this in the context of the following question: If a consumer knows that their behavior is being observed and cares about privacy, can the observer infer the true preferences of the consumer?  Our work gives a resounding “no”.  Using tools from revealed preference theory, we show that the observer not only cannot learn, but that every set of observed choices can be “explained” as consistent with any underlying utility function from the consumer.  Thus, the consumer can always maintain plausible deniability.

If you want to see the details, check it out here!   And, note that the lead author (Rachel Cummings) is on the job market this year!

P.S. The NSDI PC meeting was really stimulating!  It’s been a while since I had the pleasure of being on a “pure systems” PC, and it was great to see quite a few rigorous/mathematical papers be discussed and valued.  Also, it was quite impressive to see how fair and thorough the discussions were.  Congrats to Aditya and Jon on running a great meeting!

Data Markets in the Cloud

Over the last year, while I haven’t been blogging, one of the new directions that we’ve started to look at in RSRG is “data markets”.

“Data Markets” is one of those phrases that means lots of different things to lots of different people.  At its simplest, the idea is that data is a commodity these days — data is bought and sold constantly. The challenge is that we don’t actually understand too much about data as an economic good.  In fact, it’s a very strange economic good and traditional economic theory doesn’t apply…

(Nearly) A year later

It’s been one year since I started as executive officer (Caltech’s name for department chair) for our CMS department…and, not coincidentally, it’s been almost that long since my last blog post!  But now, a year in, I’ve got my administrative legs under me and I think I can get back to posting at least semi-regularly.

As always, the first post back after a long gap is a news filled one, so here goes!

Caltech had an amazing faculty recruitment year last year!  Caltech’s claim to fame in computer science has always been pioneering disruptive new fields at the interface of computing: quantum computing, DNA computing, sparsity and compressed sensing, algorithmic game theory, … Well, this year we began an institute-wide initiative to redouble our efforts on this front and it yielded big rewards.  We hired six new mid-career faculty at the interface of computer science!  That is an enormous number for Caltech, where the whole place only has 300 faculty…

The Forgotten Data Centers

Data centers are where the Internet and cloud services live, and so they have been getting lots of public attention in recent years. In technology news and research papers, it’s not uncommon to see IT giants like Google and Facebook publicly discuss and share the designs of the mega-scale data centers they operate. But another important type of data center, the multi-tenant data center (commonly called “colocation” or “colo”), has been largely hidden from the public and rarely discussed (at least in research papers), although it’s very common in practice and located almost everywhere, from Silicon Valley to the gambling capital, Las Vegas.

Unlike a Google-type data center, where the operator manages both the IT equipment and the facility, a multi-tenant data center is a shared facility where multiple tenants house their own servers in shared space and the data center operator is mainly responsible for facility support (like power, cooling, and space). Although the boundary is blurring, multi-tenant data centers can generally be classified as either wholesale or retail: wholesale data centers (like Digital Realty) primarily serve large tenants, each having a power demand of 500 kW or more, while retail data centers (like Equinix) mostly target tenants with smaller demands.

A new ACM proposal for integrating conference and journal publication models

The relative merits (and problems) of the conference and journal publication models are a ubiquitous topic of conversation at almost every conference/workshop I’ve ever attended.  It’s been great to see lots of experiments with adjustments to the models in recent years at, e.g., ICML, VLDB, SIGGRAPH, etc.  For various reasons, these venues have been willing and able to be quite bold at experimenting with the publication models and creating various “jourference” hybrids.

It seems that the success of these experiments may now have given birth to a broader solution.  This month’s CACM outlines a proposal for a new approach toward merging conference and journal publication models in CS.  They also devote a point/counterpoint to a discussion of the relative merits of the proposal.  If you haven’t yet seen it, give it a read and be sure to fill out the survey afterwards.  This is a crucial issue for the CS community in my mind.

To put myself out there a bit — I think the proposal is a great option for CS, and I hope that we can move to a model like this across the ACM.  I think the conference publication model is great at allowing fast publication and creating a sense of community in different research areas, but that it has lots of very significant issues that come from the lack of time for revisions to address reviewer comments and the challenge of creating “journal versions” after the fact (duplicated review effort, awkwardness in assessing novelty, etc.).  As anyone who has worked at the interface of CS and other fields knows, this issue creates significant hurdles for visibility — especially for students — while also leading to significant duplication of reviewing effort from the community.

How much energy does Bitcoin require?

You hear a lot about Bitcoin these days — how it is or isn’t the future of currency… In the middle of such a discussion recently, the topic of energy usage came up: Suppose Bitcoin did take over — what would the sustainability impact be?

It’s certainly a complicated question, and I don’t think I have a good answer to it yet.  But, a first order question along the way would be: how much energy is required per Bitcoin transaction?  

There’s a nice analysis of this over at Motherboard, and the answer it comes to is this: a single Bitcoin transaction uses enough electricity to power ~1.5 American households for a day!   By comparison, a Visa transaction requires somewhere on the order of the electricity to power 0.0003 households for a day…  
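To put the gap in one number, here is a quick calculation using only the two figures quoted above.  The variable names are mine, and the inputs are the article’s rough estimates, so treat the result as an order-of-magnitude figure rather than a measurement.

```python
# Per-transaction electricity, in "household-days", as quoted above.
BITCOIN_HOUSEHOLD_DAYS = 1.5    # approximate figure from the article
VISA_HOUSEHOLD_DAYS = 0.0003    # order-of-magnitude figure from the article

# How many Visa transactions use the same electricity as one Bitcoin transaction.
ratio = BITCOIN_HOUSEHOLD_DAYS / VISA_HOUSEHOLD_DAYS

print(f"One Bitcoin transaction is roughly {ratio:,.0f} Visa transactions' worth of electricity")
```

So, taking the estimates at face value, the gap is roughly a factor of 5,000.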

That’s certainly a much bigger gap than I would’ve guessed, and highlights that there would certainly be a lot of energy-efficiency improvement needed if Bitcoin were to grow dramatically!  Of course, it’s not clear whether that would be possible, since the security of the system depends on the difficulty of the computations (and thus large energy consumption).  So, in a sense, the significant energy usage becomes part of the protection against attacks…  I guess there are probably some interesting research questions here.

See the full article for a discussion of how they arrived at these numbers.  It’s a back-of-the-envelope calculation, but a pretty reasonable estimate in my mind.

Some thoughts on broad privacy research strategy

Let me begin by saying where I think the interesting privacy research question does not lie. The interesting question is not how people and organizations currently behave with respect to private information. Current behaviors are a reflection of culture, legislation, and policy, and all of these have proven themselves to be quite malleable in our current environment. So the interesting question when it comes to private information is this: how could and should people and organizations behave, and what options could or should they even have? This is a fundamental and part-normative question, and one that we cannot address without a substantial research effort. Despite being part-normative, this question can be useful in suggesting directions for even quite mathematical and applied research.

The first thing I’d like to ask is: what do we need to understand better in order to decide how to address this question? I see three relevant types of research that are largely missing:
1. We need a better understanding of the utility and harm that individuals, organizations, and society can potentially incur from the use of potentially sensitive data.
2. We need a better understanding of what the options for behavior could look like—which means we need to be open to a complete reinvention of the means by which we store, share, buy, sell, track, compute on, and draw conclusions from potentially sensitive data. Thus, we need a research agenda that helps us understand the realm of possibilities, and the consequences such possibilities would have.
3. It is, of course, important to remember the cultural, legislative, and policy context. It’s not enough to understand what people want and what is feasible. If we care about actual implementation, we must consider this broader context.

The first two of these points can and must be addressed with mathematical rigor, incorporating the perspectives of a wide variety of disciplines. Mathematical rigor is essential for a number of reasons, but the clearest one is that privacy is not an area where we can afford to deploy heuristic solutions and then cross our fingers. While inaccurate computations can later be redone for higher accuracy, and slow systems can later be optimized for better performance, privacy, once lost, cannot be “taken back.”

The second point offers the widest and richest array of research challenges. The primary work to address them will involve the development of new theoretical foundations for the technologies that would support these various interactions on potentially sensitive data.

For concreteness, let me give a few example research questions that fall under the umbrella of this second point:
1. What must be revealed about an individual’s medical data in order for her to benefit from and contribute to advances in medicine? How can we optimize the tradeoff of these benefits against potential privacy losses and help individuals make the relevant decisions?
2. When an offer of insurance is based on an individual’s history, how can this be made transparent to the individual? Would such transparency introduce incentives to “game” the system by withholding information, changing behaviors, or fabricating one’s history? What would be the impact of such incentives for misbehavior, and how should we deal with them?
3. How could we track the flow of “value” and “harm” through systems that transport large amounts of personal data (for example, the system of companies that buy and sell information on individuals’ online behavior)? How does this suggest that such systems might be redesigned?