Any decent error message is a kind of oracle
Why do so many error messages try to be cute or apologetic instead of useful?
The typical error message I see these days looks a lot like this: some cheery variation on “Oopsie woopsie!” with no hint of what went wrong or what to do about it.
Classic UX advice is to give useful, informational, actionable error messages. For example, the Nielsen Norman Group recommends:
Concisely and precisely describe the issue. Generic messages such as “An error occurred” lack context. Provide descriptions of the exact problems to help users understand what happened.
[...]
Offer constructive advice. Merely stating the problem is also not enough; offer some potential remedies.
Tim Neusesser and Evan Sunwall (Nielsen Norman Group): Error-Message Guidelines
Are people just ignoring tried-and-true UX wisdom, or is something else going on? I argue it’s something else.
Any decent error message is a kind of oracle. Bad error messages are usually not incompetence, but the result of specific tradeoffs in the design space. What’s ahead:
Everyone’s least favorite login errors
Any decent error message is a kind of oracle
How I learned to stop worrying and love the oracle
Chicken sexing
Oracles can be used for creation
Verifier’s rule and meaningmaking
Meaningmaking for our error messages
Everyone’s least favorite login errors
As a user, the error message I hate most is “Username or password is incorrect,” followed closely by “If the account exists, we sent you a password reset email.” Both go against classic UX guidance about good error messages.
So why aren’t these errors better? “Password is incorrect, try again,” or “No account exists for this email.” Is that so hard?
Actually, these kinds of error messages are designed to avoid an account enumeration attack - a way for an attacker to understand whether a particular email has an account on your site. Is that so sensitive? If you run a mental-health app or similar, it could be! And account enumeration often precedes credential stuffing, where an attacker uses previously-breached passwords to get into other accounts where the person re-used the password.
(Side note: that link above goes to my employer’s site, but my writing here is always my own.)
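To make that concrete, here’s a minimal sketch of a login check that collapses both failure modes into one message. The user store and helper names are hypothetical, and a real system would use a slow, salted password hash like argon2 or bcrypt rather than the stand-in below:

```python
import hashlib
import hmac

GENERIC_FAILURE = "Username or password is incorrect."

def hash_password(password: str) -> str:
    # Stand-in only: a real system would use a slow, salted hash.
    return hashlib.sha256(password.encode()).hexdigest()

def check_login(users: dict[str, str], username: str, password: str) -> str:
    stored = users.get(username)
    if stored is None:
        # Tempting: "No account exists for this email." But that hands an
        # attacker an account-enumeration oracle, so we stay vague.
        return GENERIC_FAILURE
    if not hmac.compare_digest(stored, hash_password(password)):
        # Likewise tempting: "Password is incorrect, try again."
        return GENERIC_FAILURE
    return "Welcome back!"
```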
Many “Oopsie woopsie”-style errors are fallback messages shown when an unexpected error occurs - since the developer didn’t anticipate the failure, revealing application context in the error could be dangerous. As someone working in security, I absolutely do not want to reveal information about my defenses to a possible attacker.
This is our first potential oracle: the login page that knows what really went wrong, but might not tell you the truth.
Any decent error message is a kind of oracle
Let’s talk about encryption itself. We already know that encryption isn’t enough to ensure security, but in cryptography, even error messages can be dangerous.
If we want to encrypt some data, we might use Cipher-Block Chaining (CBC) encryption, which splits the data into fixed-size blocks and encrypts them. To make sure all blocks are the right size, you’ll extend your data with padding of a known format.
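PKCS#7 is one common padding format: to fill out the last block, you append N copies of the byte N. A quick sketch:

```python
def pkcs7_pad(data: bytes, block_size: int = 16) -> bytes:
    # Append n copies of the byte n, where n fills out the last block.
    n = block_size - len(data) % block_size
    return data + bytes([n] * n)

def pkcs7_unpad(data: bytes, block_size: int = 16) -> bytes:
    n = data[-1]
    if not 1 <= n <= block_size or data[-n:] != bytes([n] * n):
        raise ValueError("invalid padding")  # this error is about to cause trouble
    return data[:-n]
```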
What happens if the padding comes out wrong? Wouldn’t it be useful if the decryption code gave different errors for “couldn’t decrypt due to wrong key or IV” and “couldn’t decrypt due to wrong padding”?
If you think so, congrats - you’ve just introduced a padding oracle attack. The decryption code’s error message is the “oracle”: you might not know if your padding is correct, but the wise oracle can tell you. But what happens if an attacker can talk to the oracle too?
An attacker can decrypt a stolen message by changing one byte at a time and seeing whether the result has correct padding. They’ll recover one byte, then the next, and so on until the entire message is decrypted.
Animation via Eli Sohl’s (NCC Group) excellent writeup of padding oracle attacks - worth the read if you’re curious about the details
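The core of the attack is surprisingly small. Here’s a sketch that recovers one block of plaintext, assuming the attacker has an `oracle(iv, block)` function that returns True when the server reports valid padding - the oracle stands in for the server’s error message. (A real attack also needs an extra check to rule out accidental longer paddings, which I’ve skipped for brevity.)

```python
def attack_block(oracle, prev_block: bytes, block: bytes,
                 block_size: int = 16) -> bytes:
    # The block cipher's raw output for `block`, before the CBC XOR step.
    intermediate = bytearray(block_size)
    for pad in range(1, block_size + 1):      # target padding value, last byte first
        i = block_size - pad
        for guess in range(256):
            forged = bytearray(block_size)
            forged[i] = guess
            # Force every already-recovered tail byte to decrypt to `pad`.
            for j in range(i + 1, block_size):
                forged[j] = intermediate[j] ^ pad
            if oracle(bytes(forged), block):  # ask the oracle: valid padding?
                intermediate[i] = guess ^ pad
                break
    # Plaintext = intermediate state XOR the real previous block (or IV).
    return bytes(a ^ b for a, b in zip(intermediate, prev_block))
```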
It’s not some theoretical curiosity. This is the basis for real attacks that affected SSL/TLS implementations, web frameworks like Ruby on Rails, and even the Steam gaming client. A good error message helps users and attackers alike.
And worse, it’s not enough to fix the error message; even if the decryption service always says “Oopsie Woopsie!”, an attacker can exploit timing differences to do the exact same kind of attack. Unintentional oracles are still useful.
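On the defensive side, here’s a sketch of what the fix looks like, reusing the hypothetical unpad routine above (`cbc_decrypt` is a made-up primitive for illustration). Collapse every failure into one error - and better yet, use authenticated encryption such as AES-GCM, which rejects tampered ciphertext before the padding is ever examined:

```python
def decrypt_message(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    try:
        padded = cbc_decrypt(key, iv, ciphertext)  # hypothetical primitive
        return pkcs7_unpad(padded)
    except Exception:
        # One generic error for every failure mode - wrong key, bad IV,
        # invalid padding - so neither the message nor (ideally) the
        # timing tells the caller which check failed.
        raise ValueError("decryption failed")
```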
How I learned to stop worrying and love the oracle
First - this situation sucks for legitimate users. I hate writing deliberately vague error messages, but I do it all the time at my job; even customer support emails might come from an attacker who’s trying to get through our defenses. If you’ve ever been stonewalled by useless errors and customer support, I feel for you too.
However, the padding oracle reminds us that even one bit of information is useful. If you can reliably get just one bit about what you’re interested in, you’ve opened the door to do it again and again until you’ve totally solved the problem.
Chicken sexing
Quick, which of these baby chicks are male, and which are female? Baby chicks don’t have obvious genital differences, so good luck.
In the 1920s, you couldn’t throw a cluster of GPUs at the problem yet (citation needed), so people did this job. Students joined an intense two-year training camp to learn through trial-and-error. Pick up a chick, guess, and compare with the expert who’s teaching you: your personal oracle.
According to Richard Horsey, a cognitive scientist who studies the process of chicken sexing, “If you ask the expert chicken sexers themselves, they’ll tell you that in many cases they have no idea how they make their decisions.”
James McWilliams (Pacific Standard): The Lucrative Art of Chicken Sexing
Untrained people might as well flip a coin, but professionals can sex a chick in just a few seconds with 98% accuracy. That’s what happens when you get to rub shoulders with the oracle.
That same process drives supervised machine learning (where the oracle is the labeled dataset) and reinforcement learning (where the oracle is the reward function). Without any prior knowledge of which features matter, the model consults the oracle and learns to classify.
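A toy version of that loop, with the labels playing the oracle - a bare-bones perceptron, purely for illustration:

```python
def train(examples: list[list[float]], labels: list[int],
          epochs: int = 100, lr: float = 0.1):
    # Guess, ask the oracle (the label), adjust. Repeat until it sticks.
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            guess = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            feedback = y - guess  # the oracle's verdict: right, or wrong which way
            w = [wi + lr * feedback * xi for wi, xi in zip(w, x)]
            b += lr * feedback
    return w, b
```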
Oracles can be used for creation
If an oracle can classify, then you can invert it to generate entirely new things in that class. In a generative adversarial network (GAN), two models compete with each other - the discriminator tries to tell real training data from generated fakes, while the generator tries to fool it.
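A sketch of that adversarial loop in PyTorch - toy-sized networks, and every name and dimension here is just for illustration:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real: torch.Tensor) -> None:
    batch = real.size(0)
    fake = G(torch.randn(batch, latent_dim))

    # Discriminator: learn to label real data 1 and generated data 0.
    opt_d.zero_grad()
    d_loss = (bce(D(real), torch.ones(batch, 1)) +
              bce(D(fake.detach()), torch.zeros(batch, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator: learn to make the discriminator say "real".
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
```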
Here’s a demo of an image generator model being trained. The generator starts from random noise, gets the color palette and shape, then quickly passes through nightmare fuel and the uncanny valley to emerge with realistic faces.
Animation via Animesh Karnewar and Oliver Wang: MSG-GAN: Multi-Scale Gradients for Generative Adversarial Networks
Once you see it, you can’t unsee it. A system leaks one bit of truth, and someone uses it as an oracle to derive deeper patterns.
Verifier’s rule and meaningmaking
One of the most important blog posts I’ve read this year was Jason Wei’s Asymmetry of verification and verifier’s rule:
Verifier’s rule: The ease of training AI to solve a task is proportional to how verifiable the task is. All tasks that are possible to solve and easy to verify will be solved by AI.
I’ll repeat: All tasks that are possible to solve and easy to verify will be solved by AI. Or in other words, if we can check solutions with an oracle, then we can automate away the entire problem.
As we know from chicken sexing, humans can machine-learn too - this really isn’t just about AI (though AI’s scalability makes it attractive). The critical skill is defining the problem and being able to measure success.
Measuring success is the job of the oracle. But defining the problem is part of what Vaughn Tan calls meaningmaking: the act of making subjective decisions about the relative value of things.
Are we able to meaningfully define the problem? How will we know that it’s solved? Can we define meaningful milestones on the way to success? No amount of machine learning or computing power can answer these questions for us.
Meaningmaking for our error messages
So, let’s get back to error messages and logins. Let’s say our goal is to prevent an attacker from logging into someone else’s account. Jason Wei outlined the following criteria for verifiability:
[T]he ability to train AI to solve a task is proportional to whether the task has the following properties:
1. Objective truth: everyone agrees what good solutions are
2. Fast to verify: any given solution can be verified in a few seconds
3. Scalable to verify: many solutions can be verified simultaneously
4. Low noise: verification is as tightly correlated to the solution quality as possible
5. Continuous reward: it’s easy to rank the goodness of many solutions for a single problem
Tasks are easier when there’s a better oracle in the loop. If we want to make the attacker’s life harder, we need to make our oracle worse.
We really can’t change properties 1 or 2 to make the attacker’s life harder - a login attempt either succeeds or it doesn’t, and real users need to know the outcome quickly. So our defenses need to target scalability, noise, and continuous reward.
Scalability: we can add rate limits and account lockouts. It’s much harder to scale login attempts if you can only make a few per hour and risk locking the account (see the sketch after this list).
Low noise: we can make the error message opaque here to add a little noise. Otherwise, I’m not really sure how to add noise to this feedback loop.
Continuous reward: mostly not relevant here, since a successful login is all-or-nothing (timing side channels aside).
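For the scalability defense, here’s a sketch of a per-account attempt limiter. It’s in-memory and single-process for illustration only; a real deployment would back this with shared storage:

```python
import time
from collections import defaultdict

MAX_ATTEMPTS = 5
WINDOW_SECONDS = 3600  # only a few attempts per hour

_attempts: dict[str, list[float]] = defaultdict(list)

def allow_login_attempt(username: str) -> bool:
    now = time.monotonic()
    recent = [t for t in _attempts[username] if now - t < WINDOW_SECONDS]
    allowed = len(recent) < MAX_ATTEMPTS
    if allowed:
        recent.append(now)
    _attempts[username] = recent
    return allowed  # False: the attacker (or user) waits out the window
```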
And we need to remember our main goal, too: real users should be able to access their own account. So now we’re balancing two values: is it more valuable to let a real user know the account they’re trying to log into doesn’t actually exist, or to add noise that makes an attacker’s life harder?
There are knock-on effects too: does an opaque error message increase the number of customer support requests? Does it make it easier to complete security questionnaires for enterprise customers? With all factors considered, what’s more or less important?
The system can’t decide for you (and you shouldn’t expect it to). This is more important than the tech stack, the project timeline, or whatever flavor-of-the-month decision-making framework. Your meaningmaking defines the acceptable tradeoffs. What do you really value?