
Google Zanzibar for the rest of us

Greg Sarjeant

Google Zanzibar is the belle of the authorization ball. Everyone’s talking about it, and it’s easy to understand why. It powers authorization for hundreds of Google’s apps, so surely it can handle whatever you and I throw at it. But at what cost? And does that mean that it’s the right solution for the rest of us?

If you’re familiar with Zanzibar, then you know its key features: gargantuan scale, high availability, strong consistency, and a relationship-based data model. But its defining characteristic is actually centralization. To use a Zanzibar-like model for your authorization system, you must centralize every piece of data that could ever be needed to render an authorization decision and store it in the authorization system itself. That centralization is a massive tradeoff, and it’s not practical for most companies. The Googles of the world can pull it off, but is there a Zanzibar for the rest of us?

What Is Google Zanzibar?


There are plenty of “What is Google Zanzibar?” posts out there that provide an overview of the system (we even have one!), so we won’t belabor that here. It’s worth noting, though, that there’s no “Zanzibar” product that Google provides. Nor does the Zanzibar whitepaper provide a complete technical specification. In this post, when we say “Google Zanzibar,” we are referring to the architecture described in the whitepaper, which has become the foundation of several authorization implementations.

If you haven’t read the paper, it’s worth a read. Zanzibar is an incredible feat of engineering. From the abstract:

Zanzibar scales to trillions of access control lists and millions of authorization requests per second to support services used by billions of people. It has maintained 95th-percentile latency of less than 10 milliseconds and availability of greater than 99.999% over 3 years of production use.

It achieves these eye-popping stats by using Google’s famed Spanner database as the underlying storage and replication mechanism, along with multiple layers of aggressive, lightning-fast caching. Hats off to the teams of engineers who made it possible!

Of course, when you’re Google, that’s kind of…what you have to do. You operate a trillion-dollar business, you serve billions of users, and on top of all that, there’s no room for error. Returning a stale result is strictly verboten.

Watch our video to dive deeper into the core design decisions of Zanzibar and learn about the pros and cons of each.

The Key Tradeoff: Centralization

Google positions Zanzibar as a “consistent, global authorization system.” It’s no accident that “consistent” is the first adjective in that description. It was the primary consideration of the design. But strong consistency is a diabolical challenge at Google’s scale. How can you be sure that you know exactly when every change to the authorization system and all of its clients happened, when all of those systems are distributed around the world?

Google accomplished this through aggressive centralization. Everything that has to do with authorization at Google is centralized in Zanzibar – specifically, in its database, Spanner. Spanner acts as the source of truth for the authorization model, all authorization-relevant data, and even the clock that clients use to associate application changes with authorization changes.
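
To make this concrete, the paper describes a protocol in which a client asks Zanzibar for an opaque consistency token (a “zookie”) whenever content changes, stores that token alongside the new content version, and sends it back with later permission checks so the check is evaluated against an ACL snapshot at least as fresh as the content. Here’s a minimal Python sketch of that flow; the client objects and their method names are hypothetical, not a real Zanzibar API.

    # Rough sketch of the zookie flow described in the Zanzibar paper.
    # `zanzibar` and `db` are hypothetical clients; method names are illustrative.

    def save_document(db, zanzibar, doc_id, new_body, user):
        # A content-change check both authorizes the write and returns an
        # opaque timestamp token (a "zookie") that captures "now".
        result = zanzibar.content_change_check(object=f"doc:{doc_id}", user=user)
        if not result.allowed:
            raise PermissionError("user may not edit this document")
        # Store the zookie next to the new content version.
        db.save_version(doc_id, body=new_body, zookie=result.zookie)

    def read_document(db, zanzibar, doc_id, user):
        version = db.latest_version(doc_id)
        # Passing the stored zookie back asks Zanzibar to evaluate the check
        # against an ACL snapshot at least as fresh as the content itself.
        allowed = zanzibar.check(
            object=f"doc:{doc_id}",
            relation="viewer",
            user=user,
            at_least_as_fresh=version.zookie,
        )
        if not allowed:
            raise PermissionError("user may not view this document")
        return version.body

This is what it means for Zanzibar to be the clock: the application never compares timestamps itself, it just carries Zanzibar’s token around.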

Why is Centralization a Problem?


Let’s look a bit more closely at the phrase “all authorization-relevant data” from the previous section. At first glance, this sounds innocuous enough. We’ll just dump our roles into Zanzibar and get on with our lives. But as far as Zanzibar is concerned, authorization-relevant data is any piece of data that an app needs to render an authorization decision. Zanzibar has to store all of that data in Spanner, because that’s how it assigns an authoritative timestamp (remember, it’s also the clock) to every authorization change. That includes the obvious things like roles, but it also encompasses a surprising amount of application data, such as:

Reporting relationships

  • Can your manager view your HR info? Zanzibar needs to know who your manager is.

File/folder relationships

  • Are folder permissions inherited by their children? Zanzibar needs to know that entire hierarchy.

Repository attributes

  • Can everybody view public repositories? Zanzibar needs to know which repositories are public.

Okay, well at least it doesn’t care about things like modification dates.

  • Do issues get locked to new comments after a period of inactivity? Hi, Zanzibar.
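
To make that list concrete, here’s roughly what those facts look like once they’re translated into relationship tuples, using the ⟨object⟩#⟨relation⟩@⟨user⟩ notation from the paper. The namespaces and relation names below are made up for illustration; the point is that every one of these facts has to live in Spanner before Zanzibar can answer a check that depends on it.

    # Illustrative relationship tuples in the paper's object#relation@user
    # notation. Namespaces and relation names are hypothetical.
    tuples = [
        "employee:alice#manager@user:bob",          # Bob is Alice's manager
        "folder:finance#parent@folder:root",        # folder hierarchy...
        "doc:q3-budget#parent@folder:finance",      # ...all the way down
        "repo:anvil#reader@group:everyone#member",  # one way to model a public repo
    ]
    # Attribute-like data (say, an issue's last-activity timestamp) doesn't
    # map onto tuples nearly as cleanly, which is part of the problem.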

But that data is in my database

The problem with requiring the authorization system to own all authorization data is that there’s really very little pure authorization data in any application. The majority of it is just application data that is sometimes used to make authorization decisions.

[Figure: Venn diagram showing the overlap between application data and authorization data]

This duality can create significant strain for development teams. Imagine that you’re working on an application and you want to offload your authorization to Zanzibar. You’ve been storing all of your data in an application database (or databases, if you have multiple services). Maybe it’s Postgres, maybe it’s something else – but it’s definitely not Zanzibar’s Spanner instance. Before you start to use Zanzibar, you need to give it all the data that it needs to make those decisions. But you also still need to use a lot of that data in your application. This leaves you with two options, neither of which is great:

  1. Fetch the application data from Zanzibar whenever you need it for an application operation
  2. Synchronize the data between your application database(s) and Zanzibar

Fetching common data like file/folder hierarchies or repository attributes from an API any time you need it in your application is at best a big performance hit. At worst, it may make simple operations impossible. Suppose you want to list all the files and folders that are under a common parent folder. This is a simple join or document lookup if you have access to the data, but your authorization system may not even let you fetch the data in this way; all it expects you to do with it is render an authorization decision.
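
To see the difference, here’s the kind of query you get to write when the hierarchy lives in your own database. This is a sketch assuming a psycopg2-style connection and a hypothetical nodes table with a parent_id column.

    # Listing a folder's contents is one query when you own the data.
    # Table and column names are hypothetical.
    def list_children(conn, folder_id):
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT id, name, kind
                FROM nodes
                WHERE parent_id = %s
                ORDER BY name
                """,
                (folder_id,),
            )
            return cur.fetchall()

    # If the hierarchy lives only in the authorization service, there's no
    # equivalent: a relationship-based API is built to answer "can this user
    # do X to this object?", not "what's inside this folder?"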

Anyone who’s ever tried to keep data in sync between two sources knows how brittle that can be, and how tricky it is to remediate when the sources inevitably drift. You need good error reporting, tracing, and auditing to ensure that every change to the application state is correctly reflected in your authorization system. When the systems diverge, you’ll encounter subtle errors that are hard to reproduce and isolate. That’s the last thing you want from the system that powers so much of your application’s security.
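
To see why, consider the naive version of option 2: a dual write from the application to the authorization service. The sketch below uses a hypothetical authorization client and schema, but the failure mode is the familiar one.

    # Naive dual write: update the application database, then mirror the
    # change into the authorization service. Names are hypothetical.
    def move_folder(conn, authz, folder_id, new_parent_id):
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE nodes SET parent_id = %s WHERE id = %s",
                (new_parent_id, folder_id),
            )
        conn.commit()  # the application now believes the folder has moved...

        try:
            authz.write_tuple(f"folder:{folder_id}#parent@folder:{new_parent_id}")
        except Exception:
            # ...but if this call fails (network blip, deploy, rate limit),
            # the two systems quietly disagree about the hierarchy. Retries,
            # outbox tables, and reconciliation jobs exist to close exactly
            # this gap, and auditing exists to notice when they don't.
            raise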

Only at Google

For a company operating at Google’s scale, a Zanzibar-style system is probably still the right move. But Google operates under a very different set of constraints than most of us do.

Google has access to technologies like Spanner and its sister caching services, which enable it to centralize all authorization-relevant data. These technologies are not just outside of the usual stack for most of us – they’re out of reach. Most of us use Postgres, maybe with a dash of Redis.

Google is famous for scaling by enforcing technical homogeneity across the company. It runs the world’s largest monorepo. Developers can only use a small number of approved languages and frameworks. It has scores of engineers constantly paying down technical debt. True to form, the move to Zanzibar was enforced via a top-down directive to every development team, and the migration took years, app by app.

Sadly (or happily, depending on your perspective), the rest of us don’t live this life. As Jean Yang laid out in Building for the 99% Developers, most of us live in the jungle of legacy apps, polyglot codebases, a mix of old and new, and constant pressure to move forward without refactoring.

Consistency, yes – but not centralization

So the question is: what should the rest of us do? Should we deploy Zanzibar-style systems too? The answer is no. Unless you're Google (or insert the FAANG of your choice), no. We need something for the rest of us.

The good news is that there are now some commercial and open source options out there. The bad news is that, like the architecture described in the paper, they all require you to centralize your data.

We said earlier that there’s very little pure authorization data – most of it is both application data and authorization data. But there’s a meaningful distinction. Some of your data – roles, teams – is general-purpose data that is primarily used for authorization operations and is shared by multiple services. A lot of it, such as file/folder hierarchies and modification dates, is primarily used to implement domain-specific application functionality.

Rather than storing all of this data either with the authorization system or with the application, a more effective approach would be to store the general-purpose data in the authorization system and the domain-specific data in the application database. Then you could pull in the relevant application data only when it’s needed to make an authorization decision.

One way that current authorization solutions do this is to pass the domain-specific data in at evaluation time (there’s a sketch of what that looks like after the list below). This keeps the majority of the application data in the application, where you need it most, and it lets you use it in the authorization system when you need it for an authorization decision. But there are drawbacks to this approach:

  • You’re sending a lot more data over the wire every time you make an authorization decision
  • You need to remember which authorization data lives in the authorization system and which doesn’t
  • It’s more difficult to trace how an authorization decision was made when you need to debug
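
Concretely, evaluation-time data passing tends to look something like the sketch below: the application gathers the domain-specific facts it already has and ships them along with the check. The client, parameter names, and data-access helpers here are hypothetical, but the shape is representative of systems that accept contextual input at decision time.

    # Hypothetical check call that passes domain-specific data at evaluation
    # time. Roles and teams live in the authorization service; everything in
    # `context` is fetched from the application database on every request.
    def can_comment(authz, db, user_id, issue_id):
        issue = db.get_issue(issue_id)      # hypothetical data access
        repo = db.get_repo(issue.repo_id)

        return authz.check(
            actor=f"user:{user_id}",
            action="comment",
            resource=f"issue:{issue_id}",
            context={
                "repo_is_public": repo.is_public,
                "issue_locked": issue.locked,
                "days_since_last_activity": issue.days_since_last_activity,
            },
        )

Every key in that context dictionary crosses the wire on every check, and nothing in the code tells you which facts the authorization system already knows about. That’s the drawback list above, in miniature.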

We’re getting warmer, but we’re not there yet.

The Little Bear’s Authorization Solution


If you’re thinking it’s a bit disingenuous of us to rail against centralization like this, we agree! After all, it’s how Oso Cloud has worked to date. But that’s never been the end state for us. We have a longer-term vision for authorization, one where your authorization system adapts to your application architecture, not the other way around.

What if you could answer authorization questions without centralizing your authorization data AND without passing your application data with each authorization request? What would something like that look like? Could your application and your authorization system collaborate on authorization decisions?

We think a hybrid model like this is just right. It would let the authorization system take care of the general-purpose logic that it’s responsible for, while the application handles the domain-specific logic that it’s responsible for. Neither would have to pass data to the other. Your team wouldn’t have to build and maintain brittle synchronization systems.

This isn’t an easy problem, but that’s what makes it fun. We’ve actually been planning this for years, and it’s not far from being reality. We still have a couple edges to sand down, but if we’ve piqued your interest, drop us a line on Slack or set up an appointment to talk to us. Or torch us on Twitter – that works too.

