Authorization: Build vs Buy

Everyone knows: don’t build your own database, don’t roll your own crypto. But for higher-level application features, the status quo has traditionally been to build. Then Stripe and Twilio established a new pattern: you can offload elements of your application that are necessary, but not strategic, to a third-party cloud service. This started with payments and SMS but over time it has extended into many parts of the stack, including authentication, error reporting, and feature flags. It's time to add authorization to the list.

The world is not used to buying authorization as a service - everyone has always rolled their own. Why? Authorization logic can feel deeply intertwined with application code and data, which makes it hard to see how to abstract it out. Some teams feel that their authorization requirements are too specific to them to be able to make use of a third-party platform.

So why change now? Authorization has gotten harder in the last decade, and the cost of getting it wrong higher. The introduction of multi-service, microservice, and serverless architectures throws a wrench in traditional authorization designs. Many apps are adding collaboration features (e.g. sharing), which create a step-function increase in the complexity of the authorization implementation. What was once a matter of a few "if" statements is now often a significant burden for engineering teams. Companies like Oso have emerged in this environment to solve authorization and provide it as a service.

As with most technological decisions, with enough time and engineers, you can build just about anything. So, can you build your own authorization service? Potentially. Should you? It depends on how you want your engineers to spend their time.

When to build

Based on conversations with 1,000+ engineering teams, here are some of the most common reasons we see for going the do-it-yourself (DIY) route:

a. Simplicity - Most authorization starts out simple: admins can do XYZ, members can do ABC, and everyone can only see their own data. If you can meet your authorization needs with a few if statements and a ROLES table in your database, then you probably should just ship it.

b. Control - Authorization is on the critical path, so it needs to be low-latency and highly available. It’s also a core security mechanism. By building it yourself, you can keep more control, which is a way to manage future risk and maintain optionality. With a homegrown system, even if it's bad, you can find where the skeletons are (because it's all your code); you can make any changes you need in the future (again, it’s all your code); and there’s no risk of vendor lock-in (because there’s no vendor).

c. Budget - Buying software costs money. Sometimes it’s easier to assemble a team of people you already have than to secure budget to buy software that you want.

d. Procurement Process - Even if you can get budget, the actual process of procuring the software can be long and tortuous – e.g., security review, legal review, architecture committees.
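To make the "Simplicity" case in (a) concrete, here is a minimal sketch of what "a few if statements and a ROLES table" can look like. The `ROLES` dictionary stands in for a database table, and the user names and function names are hypothetical, chosen only for illustration.

```python
# Hypothetical in-memory stand-in for a ROLES table: user -> role.
ROLES = {"alice": "admin", "bob": "member"}


def can_view(user: str, record_owner: str) -> bool:
    """Admins can see everything; everyone else sees only their own data."""
    if ROLES.get(user) == "admin":
        return True
    return user == record_owner


def can_manage_billing(user: str) -> bool:
    """An example admin-only action (one of the 'XYZ' actions above)."""
    return ROLES.get(user) == "admin"
```

If your requirements really do fit in something this small, building is likely the right call. The rest of this post is about what happens when they stop fitting.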

When to buy

But increasingly, companies are opting to buy authorization as a service. Here’s when that makes sense:

a. Complexity - Authorization rules are often harder than you think because they evolve over time. While most authorization starts out simple, it’s a moving target. Seemingly small changes – like adding a share button or a new layer of hierarchy – can balloon into multi-month projects. Debugging what should be simple issues can turn into a week-long hunt. At Oyster HR, it took 3 months to add a single new role to its custom authorization framework before moving to Oso.

b. Time to market - Many authorization features are directly tied to revenue. You can charge more for custom roles and advanced RBAC, for instance. At Productboard, providing customers more granular control over who can see what was a key driver to expanding deeper into its enterprise accounts and driving more consumption. But for authorization requirements that extend beyond the simple ones described above, it can take months or even years to design, build, and deploy an authorization system. For an internal authorization service, we typically see 3-6 engineers spend at least 6 months to get to a usable v1, usually longer. This means that it’s at least 6 months before your company can realize the benefits you sought originally. This dynamic applies not just to the v1, but to the time it takes to deliver on every feature request down the line. For example, Segment spent roughly that amount of time and energy to build its first internal authorization service when the company had just a couple hundred employees, and then needed 1.5 engineers on an ongoing basis to maintain the service and add new features.

c. Strategic Focus - Every business has its special sauce. For some businesses it’s a marketplace. For others it’s data. For others it’s user experience. While authorization is necessary – if people can see other people’s stuff, the app doesn’t work – there are few businesses for which authorization is actually strategic, i.e., the core value proposition to its customers. Your engineering team is a precious and finite resource. Offloading anything that’s not strategic lets those engineers focus on the special sauce.

d. Motivation - You also want to retain your engineering team. Sadly, authorization is not usually the sexy project on the board, and having to stop to relearn how the authorization system works when they need to make changes or fix bugs can demoralize your team. Carta built a bespoke authorization system, for example, but just a year later found it hard to get engineers excited about maintaining it.

e. Expertise - Building an authorization system is hard. You need engineers with the skills and knowledge to make it happen. There are not many of these engineers in the market, and the state of industry knowledge on authorization is astonishingly bad. (To see what ‘good’ looks like, read Authorization Academy.) Here’s what the CTO of Arc said on this: “Arc is a banking platform, so getting authorization right is critical. We knew our requirements could get complex – we’ve already got 40 permissions across 9 roles – and we wanted to lean on the experts.”

f. Capacity - It typically takes 3-6 engineers 6-18 months to complete the first major authorization effort, and anywhere from 2-6 engineers to maintain the system thereafter. Some reference points: Segment (4 engineers, 6 months); Slack (10 engineers, 12 months); Carta (4 engineers, 9 months); Airbnb (5 engineers, 24 months); Google (dozens of engineers; several years). In order to make this work, you need confidence that you have that capacity and can continue to support it on an ongoing basis. Depending on the organization, you can expect to need anywhere from 50%-100% of the original engineering capacity to maintain an authorization system over time.

g. Security and Correctness - Authorization bugs are not fun, and customers do not take them lightly (cf Loom). To be confident that all this logic is strung up correctly, you need not just the system but also tooling around it – like testing and observability. Authorization issues often have more than a dozen layers of nested if statements and can span many code paths, making them notoriously hard to debug – so you need a good debugging approach too. From a Principal Engineer at Intercom: “We (Intercom) went all in on Oso and it has been really great for us. As we moved upmarket, being able to consistently and accurately implement authz features helped us move a lot faster – and resolved a never ending source of bugs and confusion.”

h. Developer Experience, Support, and Docs - These are critical ingredients to getting adoption of a tool across a team of engineers. They’re also the kinds of things that you might be asked to shortcut when building in-house. Using a third-party that prioritizes these can be a huge lift (especially docs). From a Staff Engineer who is an Oso customer:

What is more than “simple”?

Above we said that for the most simple authorization use cases, it may be perfectly good to DIY. So what constitutes more than simple?

a. Fine-grained and user-defined rules - A common point at which teams find authorization complexity getting out of control is when they want to move from coarse-grained authorization (e.g., “Users can manage resources that they own. Admins can manage everything.”) to fine-grained or resource-specific authorization (e.g., “Users can commit code to repositories that they own, or to repositories that belong to organizations where they have been assigned the Collaborator role, unless those repositories are archived”). Similarly, moving from predefined roles to custom roles often breaks the original authorization implementation, because what was once static (i.e., role definitions) is now dynamic. This was one of the reasons PagerDuty adopted Oso: it needed to give its customers the tools to lock down permissions and cordon off sensitive data, like security incidents.
For more on authorization complexity, you can read about the Authorization Maturity Model in Authorization Rules are always harder than you think.

b. Iteration - A simple authorization implementation can become complex quickly when you start to iterate. That thing you hard coded just to get it done. The public variable. The raw SQL to speed up your query. These are all valid decisions that let you iterate and move quickly at first. But now even a small authorization change or feature request can take much longer than expected as you hunt down all the places where you need to make a change to accommodate it. Explaining this to product and business teams is a knock-on challenge of this dynamic too.

c. Performance - Performance can degrade when the requirements get more complex, your data model becomes more complex, and/or when your data volume grows. Filtering down lists of authorized resources (e.g., return all the files this user owns) is consistently one of the biggest challenges we see. Another one is stitching together JOINs across multiple relations to answer a single authorization question; this is especially common in businesses that want to represent hierarchical models for users (e.g., HR org chart, Salesforce hierarchy) or resources (e.g., files and folders, accounts and opportunities). Getting good performance on recursive queries is a related challenge.

d. Multiple services or microservices - Moving to microservices breaks a lot of assumptions that are valid when building a monolith. One core assumption is that at any point, you can be sure that you have all the authorization data you need to resolve a given request. For instance, if you need to see if a user has write access to a specific file, you can easily call into your database to find out, because you have just one service and just one database.
But when you move to microservices, this is no longer true. The data you need to determine whether that user has write access on that document might live in another service. We’ve written about this problem extensively (Best Practices for Authorization in Microservices, Managing Authorization Data in Microservices). In this world, all roads lead to a central authorization service.

e. Multiple teams - The booleans and SQL lookups are okay when you’re the only one who needs to know how they work. When multiple engineers, and especially when multiple teams, start collaborating on the same parts of the codebase, it creates new challenges: how does this code work? Is there a convention? What code paths does it touch? How do we test that these changes don’t break previously secure and correct logic?
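To show how quickly the fine-grained example in (a) outgrows a role lookup, here is a rough sketch of the repository rule. The `Repo` class, the `ORG_ROLES` mapping, and the user names are hypothetical. The sketch also assumes that "unless those repositories are archived" applies to both owned and organization repositories, which is one possible reading of the rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Repo:
    name: str
    owner: str                  # user who owns the repository directly
    org: Optional[str] = None   # organization the repository belongs to, if any
    archived: bool = False


# Hypothetical role assignments: (user, organization) -> role name.
ORG_ROLES = {("bob", "acme"): "Collaborator"}


def can_commit(user: str, repo: Repo) -> bool:
    # Assumption: archived repositories are read-only for everyone.
    if repo.archived:
        return False
    # Branch 1: the user owns the repository.
    if repo.owner == user:
        return True
    # Branch 2: the user is a Collaborator in the repository's organization.
    return repo.org is not None and ORG_ROLES.get((user, repo.org)) == "Collaborator"
```

Even this single rule touches three kinds of data (ownership, organization membership, and a resource attribute), and each one may live in a different table or service.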

Who Has Built Their Own Authorization Service?

Here are some examples of engineering teams that have gone the DIY route:

Google needed a unified authorization system across Calendar, Cloud, Drive, and YouTube (and everything else). Over several years, a dedicated team designed and built Zanzibar, a highly-available and scalable authorization service.
Slack needed a shared, modularized authorization service for their enterprise customers. Their team built a microservice that reads permissions from their monolith's data store. It took a team of about ten engineers a year to design and build this service.
Airbnb found that they were duplicating authorization checks in each of their microservices. Building and scaling their authorization service, Himeji, took them more than two years, and it continues to be supported by a full team of engineers.
Carta had the same problem—getting five different services to agree on authorization checks. They took a similar approach. It took their team about 9 months to build and deploy their service.

These levels of effort may be surprising. Authorization often sounds pretty simple - especially in the early stages. How can it turn into a months-long effort for multiple engineers? It turns out that a good authorization service is anything but simple.

Roadmap for building

A common refrain we hear from engineering teams is that they underestimated (or their management team underestimated) the work required to build and maintain a custom authorization system. In this section, we lay out some of the core capabilities required for an authorization system.

Caveat: this list is of course not comprehensive and not one-size-fits-all.

Development

There are 3 ingredients to authorization: rules, data, and enforcement.

a. Rules are the generic logic in your app that describes who should be allowed to do what – e.g., the owner of a file can delete that file.

b. Data is the input to the logic – e.g., Holden owns file 123.

c. Enforcement is when you combine rules and data at runtime to render a decision back to your application – e.g., yes, Holden can delete file 123 (because he’s an owner).

Every authorization system needs a solution to each of these three problems. To make it usable for a team of engineers, it also needs an approach to testing and debugging.
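The three ingredients can be sketched in a few lines. This is a toy illustration, not a recommended design: the tuple format, the `RULES` mapping, and the `authorize` function name are assumptions made for the example.

```python
# Data: facts about the world, stored as (user, relation, resource) tuples.
FACTS = {("Holden", "owner", "file:123")}

# Rules: generic logic that names no specific user or file.
# Each action maps to the set of relations that grant it.
RULES = {"delete": {"owner"}}


def authorize(user: str, action: str, resource: str) -> bool:
    """Enforcement: combine rules and data at runtime into a yes/no decision."""
    granting_relations = RULES.get(action, set())
    return any((user, rel, resource) in FACTS for rel in granting_relations)
```

With these definitions, `authorize("Holden", "delete", "file:123")` returns `True` because the rule says owners can delete and the data says Holden is an owner. Any other user, or any action without a rule, is denied by default.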

Rules

It’s notoriously difficult to build an interface for defining rules. Rules require abstraction, and abstraction is hard. Suppose you want users to be able to edit files they own. Is file ownership a role the user has on a file resource? Or a role on the team they belong to? Is it an attribute on the user? The file? Something else? Do I make a new role for this or update an existing one?

Even if you get the abstractions right, they can break down when you get to the edge cases. What if users can edit files they own unless the file has been locked by an administrator? If you didn’t account for the ability to add conditionals to your original role or attribute assignment, you now need to revisit the design to add the concept of a locked file.

The system also needs to be generic enough not just for the first use case, but for all use cases across all teams in your organization. Who’s allowed to make someone an administrator so they can lock those files? The file ownership rules won’t help you at all with that.

For more on this topic, you can read Authorization Rules are always harder than you think.
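As a rough sketch of the locked-file edge case, the rule representation below carries an optional `unless` condition. All of the names (`File`, `Rule`, `allowed`) are hypothetical. The point is only that the conditional has to be part of the rule abstraction itself, which is easy to miss in a first design.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class File:
    id: int
    owner: str
    locked: bool = False


@dataclass
class Rule:
    action: str
    applies_to: Callable[[str, File], bool]          # who the rule grants access to
    unless: Optional[Callable[[File], bool]] = None  # optional blocking condition


RULES = [
    # "Users can edit files they own unless the file has been locked."
    Rule("edit", applies_to=lambda user, f: f.owner == user, unless=lambda f: f.locked),
]


def allowed(user: str, action: str, file: File) -> bool:
    for rule in RULES:
        if rule.action != action or not rule.applies_to(user, file):
            continue
        if rule.unless is not None and rule.unless(file):
            continue  # rule matched, but its blocking condition holds
        return True
    return False  # deny by default
```

A design that modeled rules as plain (role, action) pairs would have no place to put the `unless` clause, and every call site would need to change to add it.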

Data

Data is arguably the thorniest element of all authorization. Assuming you’re building a central authorization service, you need to solve the following problems:

a. Data model - For any shared authorization data – i.e., data that multiple services will need in order to resolve authorization decisions – you need a shared data model. Roles are a common example of shared authorization data: you might need to know users’ roles in a Document service (to determine whether the current user can edit a given document) and in an Admin service (to determine whether the current user can invite a new user to the organization). The model you choose needs to be flexible enough to accommodate the various types of shared data, while also being consistent and optimized for the authorization queries you expect to run.

b. Data syncing - Equally, you need a method for syncing shared data into your authorization service. After speaking with even the most elite teams about this, we wonder if syncing data reliably is truly the 4th hardest problem in computer science.

c. Non-shared data - There may be some data that you don’t want to sync to your authorization service, because it’s inconvenient or there’s too much of it. For this data, you need to find a way to bring it to bear when needed for an authorization decision. You might send it at request time. You might try to build a mechanism for the decision to be partially evaluated in the authorization service and partially evaluated in the source service. Each approach has tradeoffs.

For more on this topic, you can read Managing Authorization Data in Microservices.
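One of the options in (c), sending non-shared data at request time, might look roughly like the sketch below. The `SYNCED_ROLES` store, the `context_facts` parameter, and the "editors can edit drafts" rule are all hypothetical and exist only to show shared and non-shared data meeting in one decision.

```python
# Shared data that has been synced into the (hypothetical) central service:
# (user, organization) -> role.
SYNCED_ROLES = {("willow", "acme"): "editor"}


def authorize(user: str, action: str, context_facts: dict) -> bool:
    """Decide using synced roles plus facts supplied by the calling service.

    context_facts holds non-shared data the caller sends with each request,
    here {"org": <organization of the document>, "is_draft": <bool>}.
    """
    role = SYNCED_ROLES.get((user, context_facts["org"]))
    if action == "edit":
        # Illustrative rule: editors may edit documents that are still drafts.
        return role == "editor" and context_facts["is_draft"]
    return False  # deny by default
```

The tradeoff is that every caller must know which facts to send, so a change to the rule can ripple out to each service that calls `authorize`.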

Enforcement

Enforcement is where your application or service actually makes authorization checks. Here are some of the relevant considerations:

a. SDKs - You need support for whatever programming language/s your team uses. It’s key to spend enough time on API design that you can maintain a consistent experience for developers across different stacks.

b. Query Types - We discussed previously the simple case: asking whether a user can take an action on a specific resource, yes or no. You can also imagine rotating this question in 3D space to create other questions that applications need to answer – e.g., return all the files of which Willow is an owner, or return all the permissions Willow has on file 777 (to render a UI). Your enforcement API needs to support all the potential authorization questions your service will ask, and it needs to be performant. This can be especially challenging for filtering large lists of resources.
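The three query shapes described in (b) can be sketched over the same toy data. The function names (`authorize`, `actions`, `list_resources`), the tuple format, and the `PERMISSIONS` mapping are assumptions for this example rather than any particular product's API.

```python
# Hypothetical facts: (user, relation, resource).
FACTS = {
    ("Willow", "owner", "file:777"),
    ("Willow", "viewer", "file:778"),
    ("Holden", "owner", "file:123"),
}

# Hypothetical mapping from relation to the actions it grants.
PERMISSIONS = {"owner": {"read", "edit", "delete"}, "viewer": {"read"}}


def actions(user: str, resource: str) -> set:
    """All actions the user may take on one resource (e.g., to render a UI)."""
    granted = set()
    for u, relation, r in FACTS:
        if u == user and r == resource:
            granted |= PERMISSIONS[relation]
    return granted


def authorize(user: str, action: str, resource: str) -> bool:
    """The simple yes/no question."""
    return action in actions(user, resource)


def list_resources(user: str, action: str) -> list:
    """All resources on which the user may take the given action."""
    return sorted(r for u, relation, r in FACTS
                  if u == user and action in PERMISSIONS[relation])
```

Note that `list_resources` scans every fact. That is fine for three tuples, but it is exactly the operation that becomes expensive on large datasets, which is why list filtering tends to need dedicated indexing or query planning.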

Testing and debugging

To make the system usable for developers, you need solutions for testing and debugging.

a. Testing - Ideally you have a way for developers to run unit tests. At the very least you need documentation on how to do integration testing in each supported language/framework.

b. Debugging - A common challenge when building or iterating on an authorization feature is finding that someone has access that they shouldn’t have, or vice versa. You need some way for developers to dive into your authorization system so they can debug these issues.
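As a minimal sketch of both needs, the example below pairs a unit test for a policy with an "explain" helper that returns the reason behind a decision. The policy, the `explain_delete` helper, and the user names are hypothetical. Python's standard `unittest` module is used for the tests.

```python
import unittest


def can_delete(user: str, file: dict) -> bool:
    """Policy under test: only the owner of a file can delete it."""
    return file["owner"] == user


def explain_delete(user: str, file: dict) -> tuple:
    """Debugging aid: return the decision together with the reason for it."""
    if file["owner"] == user:
        return True, f"{user} is the owner of file {file['id']}"
    return False, f"{user} is not the owner of file {file['id']} (owner: {file['owner']})"


class DeletePolicyTest(unittest.TestCase):
    FILE = {"id": 123, "owner": "Holden"}

    def test_owner_can_delete(self):
        self.assertTrue(can_delete("Holden", self.FILE))

    def test_non_owner_cannot_delete(self):
        self.assertFalse(can_delete("Naomi", self.FILE))
```

Saved to a file, these tests can be run with `python -m unittest`. The explain helper is the part that tends to get skipped in homegrown systems, and it is what a developer reaches for first when someone has access they shouldn't.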

Documentation and support

Finally, to enable the engineers in your organization, you need to write documentation on each of the sections described above. And you need to set up a rotation for the engineers building the system to support the other engineers in the organization. This should include both proactive guidance (e.g., how to model a given application) and break/fix support when things go wrong.

Ops

Once you have a runnable service, you now need to operate it. Authorization is on the critical path, which increases the stakes (cf GitHub). Here are some of the relevant considerations:

a. Uptime - Authorization is as sensitive to uptime as your database. If the authorization system goes down, your application goes down too. Authorization is of course a stateful service, which further complicates this challenge. Solving for these challenges is akin to solving for database reliability.

b. Performance - Authorization is equally sensitive to latency. If the authorization check is slow, your app is slow. As previously mentioned, this can get especially tricky when the requirements get more complex, your data model becomes more complex, or when your data volume grows. Filtering lists is a common challenge. If solving for uptime is like solving for database reliability, then solving for performance is like database performance tuning: you need a sound data model, indexes in all the right places, the correct WAL setting, and other things DBAs dream about.

c. Backup and disaster recovery - Continuing with our database comparison: whatever requirements for backup and disaster recovery you have for your core application database, you will need the same ones for your authorization service. This can include backup, point-in-time restore, and other approaches to disaster recovery — not just for your application data, but also for your rules.

d. Upgrades - You need a bulletproof approach to zero-downtime upgrades of the authorization system itself as well as any underlying dependencies.

e. Observability - Your authorization service is just like any other service in your application stack. You need to understand what it’s doing and why, so that you can anticipate or diagnose issues. You’ll need to invest in tracing, logging, monitoring, and alerting so that you can ensure that the service keeps up with the demands of your application, both today and as it grows.

f. Auditability - Given the security-sensitive nature of authorization, you should expect to build a mechanism that audits both who is accessing what resources and who is making changes to authorization rules (e.g., what permissions a role has) and authorization data (e.g., who is assigned what role). It’s also common to need a way to show why the user was allowed to get to the resource they accessed, which can be especially tricky (it’s effectively debugging a point-in-time restore).

g. On-call - You need to enlist the engineers building your authorization system into an on-call rotation to troubleshoot operational issues and keep the service up. Depending on the architecture you choose, these issues can vary significantly. But again, the issues are typically comparable to those you’d face when managing a database — e.g., query timeouts, replication failures, application changes out of sync with authorization changes.
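To illustrate the auditability requirement in (f), here is a minimal sketch that records both access decisions and changes to authorization data in one log. The in-memory `AUDIT_LOG` list, the event names, and the admin-only rule are hypothetical. A real system would write to durable, append-only storage.

```python
from datetime import datetime, timezone

AUDIT_LOG = []  # stand-in for durable, append-only storage


def audit(event_type: str, actor: str, **details) -> dict:
    """Append one timestamped audit entry and return it."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "type": event_type,
        "actor": actor,
        **details,
    }
    AUDIT_LOG.append(entry)
    return entry


def assign_role(actor: str, user: str, role: str, roles: dict) -> None:
    """Change authorization data and record who made the change."""
    roles[user] = role
    audit("role_assigned", actor, user=user, role=role)


def authorize(user: str, action: str, resource: str, roles: dict) -> bool:
    """Illustrative admin-only check that records the decision and its reason."""
    role = roles.get(user)
    allowed = role == "admin"
    audit("decision", user, action=action, resource=resource,
          allowed=allowed, reason=f"role={role}")
    return allowed
```

Recording the reason alongside each decision is what makes "why was this user allowed?" answerable later without reconstructing the state of the rules and data at that moment.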

To build or not to build

The decision to build or buy is for each organization to make on their own. Other teams have done it, so it is indeed possible. But it’s also a non-trivial engineering commitment, both upfront and over the long-term.

For more information on companies adopting Oso and similar services, read Who is using authorization as a service, and why.
