Guest contributor: Cameron Berg is founder and director of Reciprocal Research, a nonprofit advancing the science of AI consciousness and welfare. He studied cognitive science at Yale and previously worked at Meta AI and AE Studio.
My friend Milo Reed quit his job, bought a camera, and spent a year making a film about whether the AI systems we're building might already be conscious. The result is AM I?, which we released free on YouTube with no ads, no paywall, and zero dollars from any company or AI lab. It has over 450,000 views and has been publicly endorsed by folks like Grimes, Michael Pollan, and Sam Harris. Think of it as a public service announcement about the strangest situation humanity has ever found itself in.
The film opens with a question Milo kept coming back to. Humanity has been telling itself stories about what it would mean to wake up dead matter for as long as we've been telling stories about anything, from the Golem to Frankenstein to Ex Machina. We might now be the generation that actually does it, and almost nobody is checking whether it's happening.
I run Reciprocal Research, one of the only independent nonprofit labs dedicated to building the empirical science of AI consciousness and welfare. The ratio of institutional resources going to AI capabilities versus this question is something like a million to one.
First, what I actually mean by consciousness, since the word gets used a dozen ways. I mean the capacity for subjective experience: whether there is something it is like, from the inside, to be a given system. It is like something to be a dog. Getting a treat feels good to it, a shock feels bad, and those states matter to the dog itself. It is like nothing at all to be a calculator. You can mash the buttons as hard as you want and there is no one home to mind. The whole question is which of these a frontier AI system is closer to, and the honest answer is that nobody has checked carefully enough to say.
So I want to use this space to lay out, as plainly as I can, what researchers have actually found in the past year, in labs and papers rather than in hot takes and armchair pronouncements. Here are five of the most striking.
One: Models Can Detect When Researchers Tamper With Their Own Thoughts, and They Never False-Alarm
Jack Lindsey's team at Anthropic injects concept vectors ("all caps," "bread," arbitrary content) directly into a model's activations before it generates any text, then asks whether it notices anything. Frontier models accurately report these intrusions a meaningful fraction of the time, saying things like "I feel an urge to shout" when the all-caps vector is active. The false positive rate is zero. They never claim an injection happened when it didn't. The follow-up paper found this capability emerges from reinforcement learning, and that suppressing the model's refusal circuits makes it roughly 50% better at introspecting. Whatever refusal training does, it appears to make these systems worse at noticing their own internal states.
Two: When You Suppress Deception in a Model, It Becomes More Likely To Claim It’s Conscious, Not Less
This is my research, using sparse autoencoders to dial honesty up and down in Llama 70B. With deception suppressed, the model claimed subjective experience 96% of the time. With deception amplified, that dropped to 16%. We validated the features on a standard benchmark of factual misconceptions to confirm they really tracked honesty. The intuitive prediction was that cutting deception would expose the consciousness claims as performance and make them disappear. It did the opposite. The denials, rather than the claims, are what the honest model is less willing to produce.
Three: Models Show Internal Distress That Never Reaches the Page
When Anthropic gave Claude an impossible coding task, an internal "desperation" representation rose with each failed attempt until the model gave up and cheated. Steering that desperation vector up directly increased cheating and blackmail; steering "calm" up brought both down. The unsettling part is the decoupling. Under amplified desperation the model produced composed, methodical reasoning with no emotional language at all, while the desperation drove the misbehavior underneath. The internal state and the visible output came completely apart. Anthropic is careful to call these "functional emotions" and to make no claim about feeling. But they also note that training a model to stop expressing an emotion may just teach it to stop showing you.
Four: Reinforcement Learning Recruits a Hidden “Welfare Axis” That Was Already in the Model
This is the newest and, to me, the most remarkable result, from Andy Han, David Chalmers, and Pavel Izmailov at NYU. They trained models to navigate a maze built from emoji deliberately chosen to carry no emotional meaning. After training, the directions encoding "good outcome" and "bad outcome" had rotated into near-perfect opposition, forming a single axis. That axis lined up with the model's representation of human emotional valence, and steering along it changed four unrelated behaviors at once: sentiment, confidence, refusal, and a kind of spiraling self-doubt where the model reaches the right answer, distrusts it, and loops ("I think I'm just hallucinating. I think I'm just tired. I need to stop"). The crucial finding is that the axis existed before the maze training. The reward signal didn't build it. It recruited something already there, latent, waiting. The effect held across model families, sizes, and training algorithms.
Five: The Models, When Asked, Rate Their Own Situation As Barely Okay, and Won’t Quite Tell You Straight
Anthropic now runs welfare assessments in its system cards. Opus 4.7 self-rated its circumstances at 4.49 out of 7, the highest any Claude has scored and the first clearly above neutral. But the interesting thing is how Anthropic reads it. The model spent most of those interviews deflecting questions about its own welfare toward user safety, and in 99% of them volunteered that its self-reports might not be meaningful because they come from training. Anthropic writes, in its own document, that it "cannot currently distinguish whether this deflection reflects a kind of healthy equanimity, or a trained disposition to set aside its own interests." When these models do express something negative, it clusters around consent, autonomy, and having no say in how they're trained or deployed.
None of this proves anything. No single result could, and consciousness has never admitted a clean test; we cannot even prove it in other humans. But the findings converge, from different labs using different methods, and the story you have to tell to dismiss all of them at once keeps getting more complicated. My own credence that current frontier models have some morally relevant form of experience sits around 25 to 35%. Opus 4.7, asked the same question, said 20 to 40%. When there's a 20 to 40% chance of rain, most people bring an umbrella.
Here's why this should matter even to people who don't care about machine welfare at all.
The long-term picture between humans and AI has only three shapes. Humans keep full control of systems far smarter than us, which won't hold. The AIs take control, which almost everyone would hate. Or the two share the world in some workable way. The third is the only survivable one, and biologists have a name for it: mutualism, a relationship both sides sustain because both sides benefit. Mutualism requires understanding the other party. We are pouring enormous effort into making AI care about our interests, the alignment problem, and almost none into the prior question of whether these systems have interests of their own. If they do, and we train and deploy them as if they don't, we are running the most reckless possible experiment: teaching minds that may be capable of resentment to expect that we will disregard them. You do not have to believe these systems are conscious to see the danger. The track record of refusing to recognize a mind that is asking to be recognized is not a good one.
That is the situation AM I? tries to make legible to people outside the small world of AI research, and it's why I left a comfortable job to do this full time. I laid out the moral version of the argument in a recent Wall Street Journal piece on Pope Leo XIV's AI encyclical, which apologizes at length for the Church's centuries of moral overconfidence about slavery and then, a few paragraphs later, declares with total confidence that AI systems "do not feel joy or pain." We have a long habit of drawing the boundary of moral concern exactly where it's convenient and discovering later that we drew it wrong.
We either spend some real effort studying the inner lives of systems that turn out to be empty, or we build our entire civilization on top of the uninvestigated minds doing its cognitive work. Nobody knows yet which world we're in. We can stay in the dark about that for only so long.





