The Promise and Challenges of Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation (RAG) pipelines are great for finding relevant information, but terrible at respecting permissions.
A few weeks ago, I met an ML engineer who had built a RAG chatbot interface for his company’s engineering documents. The system worked well – so well that the HR organization wanted their own chatbot too. Like any engineer, he was keen on keeping the system as simple as possible. There was only one problem – HR’s knowledge base was full of sensitive documents. When it came to enforcing permissions on his RAG vector searches, he had no idea where to start.
This crystallized a challenge I've been observing across many organizations: while RAG systems excel at semantic search and knowledge retrieval, they often struggle with the nuanced access controls that are required in the real world. The very strength of these systems is their ability to find semantically related information across vast document stores. This becomes a vulnerability when sensitive data is involved.
Three trends are turning this problem into a pressing concern:
Multi-tenancy at scale: Organizations aren't just building RAG systems for single teams. They're creating platforms that serve multiple business units, segments, or even external clients. Each tenant brings their own data sensitivity requirements, regulatory obligations, and access patterns. A system that worked for a single tenant quickly becomes a compliance nightmare when you introduce organizational boundaries.
Regulatory scrutiny: GDPR's right to be forgotten, CCPA's data minimization requirements, and sector-specific regulations like SOX or HIPAA don’t get put on pause for technological innovation. These frameworks expect fine-grained control over data access and clear audit trails.
The isolation paradox: If you partition your vector data too aggressively to maintain security boundaries, you lose the cross-domain insights that make semantic search valuable. If you're too permissive, you risk exposing sensitive information to unauthorized users. Finding the right balance requires thinking carefully about authorization as a first-class architectural concern, not a feature to be bolted on later.
In this post, we'll focus on building RAG systems where authorization decisions are explicit, auditable, and aligned with business requirements.
When that ML engineer I mentioned started thinking about adding HR documents to his RAG system, his first instinct brought him to two naive solutions: bulk permission checks and iterative filtering.
The Naive Solutions
Vector databases introduce a gap between your authorization logic and your semantic search. The authorization logic is tied to the application data. The semantic search is tied to the vector data. You need to bridge that gap to provide an authorized semantic search. To date, we’ve had to use duct tape and rope to build those bridges.
Solution #1: Filter After Search
The first approach is naive enough that it clearly won’t work for most cases, but let’s get it out of the way. You fetch a batch of results, pull them into your application, then iterate through them in application code, filtering down to the authorized results with a garden-variety for loop. You ask for the top 10 results, filter out the 8 the user shouldn’t see, keep 2, then ask for results 11-20, filter those, and so on until you have enough authorized results to return.
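Sketched out, the loop looks something like this. Note that vector_search and user_can_read are hypothetical stand-ins for your vector database client and your application’s permission check:

def filter_after_search(user, prompt_embedding, needed=10):
    # Hypothetical post-filtering sketch: keep paging through similarity
    # results until enough authorized documents have been collected.
    results, offset = [], 0
    while len(results) < needed:
        batch = vector_search(prompt_embedding, offset=offset, limit=10)
        if not batch:
            break  # ran out of candidates before finding enough authorized ones
        results.extend(doc for doc in batch if user_can_read(user, doc))
        offset += 10  # each pass is another similarity search and round trip
    return results[:needed]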
In a large system, you might cycle through hundreds or thousands of results before finding enough that the user is authorized to see. Each iteration costs another vector similarity search, two network hops, and a pass through your for loop, turning what should be a fast operation into something slow and inefficient.
Solution #2: Filter Before Search
The second approach involves fetching the authorized list of resource IDs from your application database and stuffing them into a massive filter before performing the vector search. If a user has access to 50,000 documents across your organization, you're essentially saying "find the most semantically relevant documents from this specific set of 50,000."
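In sketch form, with the same hypothetical helpers (the exact filter syntax varies by vector database):

def filter_before_search(user, prompt_embedding, needed=10):
    # Fetch every resource ID the user may read from the application database...
    authorized_ids = get_authorized_document_ids(user)  # potentially tens of thousands of IDs
    # ...then push the entire list into the vector query as a metadata filter.
    return vector_search(
        prompt_embedding,
        limit=needed,
        filter={"id": {"$in": authorized_ids}},  # filter syntax differs across vector DBs
    )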
This approach also works for small data sets, but it breaks down fast: the filters become unwieldy, query performance degrades, and you're often working against the query optimizer rather than with it.
The Right Solution: In-Database Authorization
The fundamental issue with these approaches is that they treat authorization as an external concern – something that happens outside the vector search operation. They create the authorization gap and then build a rickety bridge instead of asking whether the gap should exist at all. The solution is to push the authorization logic down into the database where the vectors are stored.
This approach, which I'll call "in-database authorization," treats permissions as a first-class concern that can participate directly in query planning and execution. Rather than fetching results and then filtering them, you express authorization constraints as part of the query itself, allowing the database's query optimizer to find the most efficient execution strategy. That means fewer round trips, faster response times, and a simpler, auditable architecture.
For this sample implementation, I’ll store vector embeddings and application data in the same database. This enables fundamentally different query patterns, giving you access to the database's built-in join operations, transaction boundaries, and query optimization capabilities.
I’ll also be using the Oso SQLAlchemy Integration, which brings authorization capabilities to SQLAlchemy.
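To make the idea concrete before diving into the walkthrough, here is roughly what an authorized vector query looks like if you hand-write the predicate in plain SQLAlchemy, using the Document model defined later in this post. user_id, prompt_embedding, and manager_department_ids are stand-ins, and the predicate mirrors the example policy we’ll define shortly; the Oso integration’s job is to derive this kind of predicate from your policy so you don’t maintain it by hand:

from sqlalchemy import select, or_

# Hand-rolled predicate: public documents, documents I created,
# or documents in a department I manage.
stmt = (
    select(Document)
    .where(or_(
        Document.is_public,
        Document.created_by == user_id,
        Document.department_id.in_(manager_department_ids),  # stand-in subquery
    ))
    .order_by(Document.content_embeddings.l2_distance(prompt_embedding))
    .limit(10)
)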
Code Walkthrough
The example system is a RAG chatbot for an internal knowledge base. It takes a user’s prompt, searches for documents relevant to that prompt, feeds those documents into an LLM, and returns the response. Feel free to clone the code here and follow along.
Language
The system is written in Python. It uses the SQLAlchemy ORM and is backed by a Postgres database with pgvector installed. I’ve chosen a hosted Postgres database, but you can also provide a local Postgres database URL via an environment variable.
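If you’re following along locally, the setup is the usual SQLAlchemy engine plus the pgvector extension. A minimal sketch – the environment variable name here is an assumption, so check the repo’s README for the real one:

import os
from sqlalchemy import create_engine, text

engine = create_engine(os.getenv("DATABASE_URL", "postgresql+psycopg://localhost/rag_demo"))

# pgvector must be enabled once per database before Vector columns work
with engine.connect() as conn:
    conn.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))
    conn.commit()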
Permissions layer
The chatbot uses Oso Cloud to make authorization checks. In Oso Cloud, rules and permissions are written using Polar, a logic programming language for authorization. Polar is flexible, and lets you define complex logic in a clean and maintainable way. In the example policy, we will enforce role-based, relationship-based, and attribute-based access control models.
We will minimize our data sync requirements by using Oso Cloud’s Local Authorization feature – instead of syncing application data with Oso Cloud, the chatbot will request pre-computed logic that can be run against Postgres to complete the authorization request. This makes that authorization gap even smaller.
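To illustrate what that means in practice: instead of answering “allowed or denied?”, Oso Cloud can hand back a SQL fragment that your own database evaluates. A minimal sketch using the oso-cloud SDK’s list_local API (the environment variable name is illustrative, and I’m omitting the data bindings configuration it requires – the SQLAlchemy integration below wraps all of this for you):

import os
from oso_cloud import Oso, Value

oso = Oso(url="https://cloud.osohq.com", api_key=os.getenv("OSO_CLOUD_API_KEY"))

# Returns a SQL WHERE-clause fragment over documents.id; Postgres evaluates it,
# so no document data ever leaves your database.
filter_sql = oso.list_local(Value("User", "1"), "read", "Document", "documents.id")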
main.polar
actor User {}

resource Department { # HR, Engineering
  roles = ["manager", "member"];
}

resource Document {
  # roles are groups of permissions
  roles = ["reader", "editor"];

  # permissions are the actions that can be performed on the resource
  permissions = ["read", "edit", "delete"];

  # relationships
  relations = {
    department: Department,
    creator: User
  };

  # RBAC
  "read" if "reader";
  "edit" if "editor";
  "reader" if "editor";

  # ReBAC
  "editor" if "creator";
  "delete" if "creator";
  "reader" if "manager" on "department";

  # ABAC
  "read" if is_public(resource);
}
Data Model
The data model naturally supports authorized semantic search. Rather than treating authorization and semantic search as separate concerns, I structured the data to enable efficient joins between authorization metadata and vector operations.
The models are defined in models.py as SQLAlchemy declarative models. Note the use of the Oso SQLAlchemy Integration, whose relation() method tells Oso how different columns relate to one another. This is what allows it to compute the requested SQL logic.
models.py
from sqlalchemy import Column, Integer, String, ForeignKey, Text, Enum
from sqlalchemy.orm import Mapped, declarative_base
import sqlalchemy_oso_cloud.orm as oso
from pgvector.sqlalchemy import Vector
import enum

# Models
Base = declarative_base()

class RoleType(enum.Enum):
    ADMIN = "admin"
    MANAGER = "manager"
    MEMBER = "member"

class User(Base, oso.Resource):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(255), nullable=False)

    documents = oso.relation("Document", back_populates="creator")
    assignments = oso.relation("UserAssignment", back_populates="user")

class Department(Base, oso.Resource):
    __tablename__ = 'departments'

    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(100), nullable=False, unique=True)

    documents = oso.relation("Document", back_populates="department")
    assignments = oso.relation("UserAssignment", back_populates="department")

class Role(Base, oso.Resource):
    __tablename__ = 'roles'

    id = Column(Integer, primary_key=True, autoincrement=True)
    role_name = Column(Enum(RoleType), nullable=False, unique=True)

    assignments = oso.relation("UserAssignment", back_populates="role")

class UserAssignment(Base, oso.Resource):
    __tablename__ = 'user_assignments'

    id = Column(Integer, primary_key=True, autoincrement=True)
    user_id = Column(Integer, ForeignKey('users.id', ondelete='CASCADE'), nullable=False)
    role_id = Column(Integer, ForeignKey('roles.id', ondelete='RESTRICT'), nullable=False)
    department_id = Column(Integer, ForeignKey('departments.id', ondelete='CASCADE'), nullable=False)

    user = oso.relation("User", back_populates="assignments")
    role = oso.relation("Role", back_populates="assignments")
    department = oso.relation("Department", back_populates="assignments")

class Document(Base, oso.Resource):
    __tablename__ = 'documents'

    id = Column(Integer, primary_key=True, autoincrement=True)
    title = Column(String(500), nullable=False)
    content = Column(Text, nullable=False)
    content_embeddings = Column(Vector(3072))
    created_by = Column(Integer, ForeignKey('users.id', ondelete='RESTRICT'), nullable=False)
    department_id = Column(Integer, ForeignKey('departments.id', ondelete='RESTRICT'), nullable=False)

    creator = oso.relation("User", back_populates="documents")
    department = oso.relation("Department", back_populates="documents")

    is_public: Mapped[bool] = oso.attribute(default=False)
Now it’s simple to retrieve an authorized list of relevant documents. This single query performs the vector search, the filtering, and the authorization in one step.
db.py
from typing import List

from oso_cloud import Value
from sqlalchemy.orm import Session
import sqlalchemy_oso_cloud

from models import Document

def get_authorized_documents(db: Session, user: Value, permission: str, prompt: List[float]):
    # One round trip: Postgres evaluates the similarity ordering and the
    # authorization predicate together.
    documents = db.scalars(
        sqlalchemy_oso_cloud.select(Document)
        .order_by(Document.content_embeddings.l2_distance(prompt))
        .authorized(user, permission)
        .limit(10)
    ).all()
    return documents
The application uses OpenAI’s ‘text-embedding-3-large’ model to convert the prompt to a vector, and ‘gpt-4o-mini’ to generate the final response.
ai.py
import os

from openai import OpenAI
from pgvector.sqlalchemy import Vector

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def generate_embeddings(content: str):
    embedding = openai_client.embeddings.create(input=content, model="text-embedding-3-large")
    return Vector(embedding.data[0].embedding)

def generate_response(prompt: str, documents: str):
    system_prompt = f"""
    You are a helpful assistant that can answer questions about the following documents in the database.
    Only answer questions that are related to the documents.
    {documents}
    """
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
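Wired together, the per-prompt flow is just a few lines. This is a sketch of the glue; the repo’s actual entrypoint adds the CLI loop described below:

from oso_cloud import Value

def answer_prompt(session, user_id: int, prompt: str) -> str:
    user = Value("User", str(user_id))              # the impersonated user
    prompt_embedding = generate_embeddings(prompt)  # text -> vector
    docs = get_authorized_documents(session, user, "read", prompt_embedding)
    context = "\n\n".join(f"{d.title}: {d.content}" for d in docs)
    return generate_response(prompt, context)       # only authorized docs ever reach the LLM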
All that’s left is to populate the database with some seed data.
data.py
user_data = [
    {"name": "jane"},
    # ...
    {"name": "karen"}
]

department_data = [
    {"name": "Engineering"},
    {"name": "Human Resources"}
]

role_data = [
    {"role_name": RoleType.MANAGER},
    {"role_name": RoleType.MEMBER}
]

document_data = [
    {"title": "document_2", "content": "Jerry is working on a new project, codenamed 'The Great Escape'", "created_by": 2, "department_id": 1},
    # ...
    {"title": "document_5", "content": "Karen is 49 years old", "created_by": 4, "department_id": 2}
]

user_assignment_data = [
    {"user_id": 1, "role_id": 1, "department_id": 1},  # Jane is a manager in Engineering
    # ...
    {"user_id": 4, "role_id": 2, "department_id": 2}   # Karen is a member in HR
]
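One detail worth calling out: the seed script needs to embed each document’s content before inserting it, so vector search has something to match against. Roughly:

# Sketch of the seeding step; the real script lives in the repo.
for doc in document_data:
    doc["content_embeddings"] = generate_embeddings(doc["content"])

session.add_all([Document(**doc) for doc in document_data])
session.commit()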
The user interface is a simple command line prompt loop.
First, you choose which user you would like to impersonate. In the real world, this step would be powered by an authentication service.
Now, you can issue prompts. Below are some test prompts with the expected results.
data.py
# A manager can read a document if it is in their department
# A member can read a document if they are the creator
# Anyone can read a public document
# Jane can read documents 1, 2, and 4
test_prompts = {
    "jane": [
        "What project is Jerry working on?",  # Should know
        # ...
        "How old is Karen?",  # Should NOT know
    ]
}
We have a working authorized RAG chatbot!
This architectural approach enables authorized semantic search by bringing your authorization logic as close to your data as possible. It eliminates the authorization gap entirely. Your authorization service manages your authorization logic while your query planner does the heavy lifting of applying it.
Takeaways
- Authorization is a hard requirement for any RAG system that touches sensitive data.
- Design your RAG pipeline for security from day one; by the time you need it, it’s too late to bolt it on.
- Co-locate embeddings and metadata to enable in-query authorization.
- Treat authorization as query logic, not post-processing.
- Oso and Local Authorization can simplify your stack, eliminating data dependencies between services and streamlining implementation.
Every day seems to bring a new security scandal in the AI space. The risks and damages are real. Authorization can't be an afterthought – the data model and stack decisions you make early on will determine what's possible.
Better to solve for secure data access upfront than during your first audit!
If you’d like to learn more about authorizing LLMs, check out our latest chapter in the Authorization Academy. If you want to give Oso a try, you can create a free account here.
Until next time!