
Dan Klein on Why LLMs Are Plausibility Engines — and What It Actually Takes to Ship Reliable AI

Scaled Cognition co-founder and CTO Dan Klein joined the Chain of Thought podcast to explain why most AI prototypes die before they ship, why prompting is not a real control surface, and how APT-1 was architected from the ground up to give actions and information first-class status instead of treating everything as tokens.
13 Apr 26


Every few weeks at Microsoft, someone would build an AI prototype that blew everyone's minds. Three months later it was dead. Dan Klein watched this happen for five years before deciding to do something about it. On this episode of the Chain of Thought podcast, Dan — co-founder and CTO of Scaled Cognition — sat down with host Conor Bronsdon to explain why the gap between demo and production is where most AI projects go to die, and what it actually takes to build enterprise AI you can trust.

Key Takeaways
  • LLMs are plausibility engines, not truth engines — they are optimized to produce convincing output, not correct output, and that distinction is what kills enterprise AI deployments
  • Prompting is not a real control surface — it has no precise semantics and no guarantees, and adding more exclamation points to your prompt is a sign you are already on the wrong path
  • APT-1 gives actions and information first-class status in the model architecture, enabling structural guarantees about what the system will and will not do
  • Hallucinations are an iceberg — enterprises typically see only the egregious, detectable ones, while the actual hallucination rate is often five times higher than they think
  • Stacking multiple models to check each other produces correlated errors, not reliability — hard instances tend to fool all models simultaneously, and the combinatorics work against you across a multi-turn conversation
  • Scaled Cognition built its prototype APT-1 for under $11 million by moving off the current token-based scaling curve and onto a different operating curve architected around reliability
  • Every AI scaling curve is an S-curve, not an exponential — the current plateau is real, and the next gains will come from orthogonal ideas, not just more compute
  • Reasoning models are powerful where they apply, but burning tokens to re-solve known problems from scratch is not intelligence
  • The societal risk of AI systems that produce output indistinguishable from truth is not theoretical — it is already bifurcating society into people who believe everything and people who believe nothing
  • Super-Reliable Intelligence is not a feature you bolt on — it has to be the architectural foundation from day one


Full Transcript


0:00 — Cold open: RL is about doubling down on what works

RL is all about doubling down on the things that work and doing less of the things that aren't working. It's really that simple. Try various things, go in the direction of the things that are working. So if you're playing go or chess, even if you're playing yourself, even if you're moving randomly, one side wins, one side loses. And that means you have a gradient. The one side was better than the other. The rules of the game are crisp and the outcome is clearly defined.

0:28 — Introducing Dan Klein and Scaled Cognition

Conor Bronsdon: We are back on Chain of Thought, everyone. I am your host Conor Bronsdon, head of technical ecosystem at Modular. My guest today is someone that many of you may know. Dan Klein is co-founder and CTO of Scaled Cognition and a professor of computer science at UC Berkeley, where he leads the Berkeley NLP group. Dan previously won the ACM Grace Murray Hopper Award for his work on grammar induction, and his former PhD students now run AI teams at Google, Stanford, MIT, OpenAI, and several other places. Dan is also a serial entrepreneur. His first startup, adap.tv, was acquired by AOL. His second, Semantic Machines, was acquired by Microsoft in 2018. Dan then spent five years integrating conversational AI technology at Microsoft, making him deeply familiar with the challenges of building AI systems and actually shipping them at enterprise scale. That gap between demo and production is what led Dan to start Scaled Cognition, where his team built APT-1, the Agentic Pre-trained Transformer — a model designed from the ground up for actions. Dan, so good to see you. Welcome to Chain of Thought.

Dan Klein: Thanks. It's great to be here.

Conor: I'm excited for this episode because you have such depth of experience on both the industry and research sides. How have you balanced those over the years?

Dan: It's always a bit of a balancing act, but one of the things that has been really motivating for me is being able to look at both what is happening on the academic side — to see what ideas are developing, see the cutting edge — and at the same time to see what's going on at the cutting edge in industry, and the interplay between those. Having both perspectives has really been one of the main things that has unlocked opportunities in my career. There are so many great ideas on both sides, and they often don't get communicated. On the academic side we often don't know what the real problems are, and on the industry side there are so many people thinking about so many ideas that the connections don't always get made. It's really exciting when you can put those two things together.

2:53 — The demo-to-production gap: why AI prototypes die

Conor: You've described experiencing this pattern where someone would single-handedly build something mind-blowing, everyone would get excited, and three months later it was dead. What are the specific failure modes that cause this gap between the demo and actually getting to something reliable in production?

Dan: There are a lot of reasons. Some of them are things that probably everybody listening is familiar with — hallucinations, where the system will do something plausible but not actually correct. It looks like it has the right shape, it's just not telling you the truth. Or you build a system that does something reasonable, it's just not what you want — and when you try to control it, it persists in doing its original behavior. A lot of the limitations have to do with the limited ability to control through standard control surfaces like prompting, as well as the natural performance characteristics of the models out there.

The models people are mostly interacting with today are highly generative. They operate in a token-by-token way and ultimately assemble plausible outputs on the fly. It's so easy for those things to be plausible but not true. The bottom line is we have not built truth engines or reliability engines. We have built plausibility engines. That's great for building a prototype, but to get to that last mile — to get to something you're actually comfortable shipping that has the guarantees an enterprise needs — that's really, really hard. A lot of people don't appreciate how big that last mile can be in an enterprise context.

5:40 — Why prompting is not a real control surface

Conor: Is part of the problem the control layer itself — that we're using prompting as the control surface for production systems?

Dan: Definitely. One of the mind-blowing things about recent progress in AI is that we have models with incredible breadth — the ability to make a credible approach to basically any topic. They are fundamentally distillations of all kinds of human knowledge from the web, and what comes with that is the ability to prompt them in flexible ways. That has really changed the shape of how AI is used.

At the same time, prompting is not a control surface that has any kind of precise semantics. It's not a control surface that comes with any kind of guarantee at all. When you put some words in a prompt, it's like a hint. You're requesting that the model do a certain thing. Natural language is full of ambiguity, and there's nothing that says the system will listen.

We've all had the experience of prompting a system and having it not do what you want. So you change the prompt a little bit and it still doesn't do what you want. What is your recourse? You can reword your instructions. You can put important sentences in all caps. You can add an exclamation point. You can add a second exclamation point. And somewhere around where you're adding the third exclamation point, you get this sense that maybe this is not the path to a robust and controllable technology.

8:06 — Modular decomposition vs. end-to-end optimization

Dan: The main tool we have had historically for building complex reliable systems is modular decomposition. You take a big problem, break it down into smaller problems. Each piece has contracts — if you give this input, you get this output — and you can say something precise about what it will and won't do. That has been very powerful in letting software engineering produce large systems with real impact.

One of the things that has historically driven machine learning is end-to-end optimization, which can be at odds with that. You don't have those same tools. Structurally, given how these systems work, we have started to build systems that do not have the same kinds of modularity and don't have the same kinds of ability to guarantee what they will or will not do.

That is actually the primary thing that motivates me: how can we make sure that the AI models we build are truthful, trustable, and controllable — that you can make guarantees about them? At Scaled Cognition, that is our core mission. If you distill everything else down, we are a model lab building models about which you can make guarantees on their behavior.

One thing people underappreciate about hallucinations is that in many cases, for a generative model, hallucination is actually the product. If you're doing image generation, you want something invented and novel. But if you take those same techniques and now want them to give you vetted facts and take actions adherent to a policy — that's a fundamental mismatch between how these technologies are designed and the uses we want to put them to.

10:55 — Are LLMs fundamentally mismatched with how we use them?

Conor: Is there a fundamental mismatch between how we're leveraging LLMs — originally built for speech recognition and machine translation — and how we're using them today as sources of truth?

Dan: I've been thinking about LLMs for a very long time. It's really wild how far they've gone from being a very specific technology for a very specific purpose to a general layer that many people are using as the operating system of general AI.

Originally, language models had one specific purpose: their job was to take language and score it. This was originally conceived in a context like a speech recognizer. The acoustic model figured out which sounds matched the input. But it couldn't tell the difference between "no" spelled N-O and "know" spelled K-N-O-W — and it's the language model that comes in and says, of all these possible transcriptions, which seem like something someone might actually say? Language models were just there to score plausible from implausible. And that core aspect of being a plausibility box has just scaled up and up and up. Now it captures all kinds of long-distance knowledge, syntax, contextual meaning, and real-world plausibility. And as a result, we've built a system where plausible answers involve translating languages or answering questions — and that's how it became a broad general attack on AI.

14:26 — What's wrong with benchmarks today

Conor: What should engineers be thinking about when it comes to benchmarks? Is it a "wait for a better model" situation, or do we need to rethink the architecture?

Dan: A lot of people feel like the only thing worse than the benchmarks we have would be to not have benchmarks. Having the wrong metric means you can be deluding yourself about how much progress you're making. The biggest challenge with metrics has been that they represent someone's guess at what the important problem is — and then hill-climbing on a metric becomes a game in itself. You start to learn the dataset, and benchmarks lose potency over time.

When they do crash tests on cars, most cars pass with flying colors. Every now and then they introduce a new test and most cars don't pass β€” because they've optimized to cover the existing tests, not the underlying issues.

In an enterprise context, the question you want to ask is not "how many of these scenarios did I get right?" The enterprise question is: "For how many of these scenarios will I get it right every single time — 100 times in a row?" If you have a system that gets 80 or 90 percent of things right, you can't really ship it. If one in ten customers gets told you booked their flight but you didn't, that's a showstopper. But a system that gets 70 percent of scenarios right every single time — that's hugely valuable. The metric has to measure consistency in a way that lines up with what can actually be shipped.
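Dan's shipping criterion is easy to state in code. Here is a minimal sketch (purely illustrative; the scenario counts and the coin-flipping system are invented, and this is not Scaled Cognition's evaluation harness) of pooled per-attempt accuracy versus per-scenario consistency:

```python
import random

def per_attempt_accuracy(results):
    """Fraction of individual attempts that succeeded, pooled over all scenarios."""
    attempts = [ok for scenario in results for ok in scenario]
    return sum(attempts) / len(attempts)

def per_scenario_consistency(results):
    """Fraction of scenarios the system got right on every single attempt."""
    return sum(all(scenario) for scenario in results) / len(results)

random.seed(0)
# Hypothetical system: 10 scenarios, 100 attempts each.
# It nails 7 scenarios every time and coin-flips the other 3.
results = [[True] * 100 for _ in range(7)]
results += [[random.random() < 0.5 for _ in range(100)] for _ in range(3)]

print(f"per-attempt accuracy:     {per_attempt_accuracy(results):.2f}")   # roughly 0.85
print(f"per-scenario consistency: {per_scenario_consistency(results):.2f}")  # 0.70
```

The pooled number looks respectable, but by Dan's standard only the 70 percent of scenarios that succeed every single time is shippable.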

At Scaled Cognition, our model APT-1 is focused on reliably and repeatedly doing the right thing, following policies, and avoiding hallucinations. The important thing for us is knowing that if it does something, it's going to do every variation of it correctly.

20:27 — APT-1: building a model for actions, not tokens

Conor: You designed APT-1 for actions instead of tokens. What does that actually mean in practice?

Dan: Part of the core problem with a standard LLM is a lack of semantics. The prompt is a control surface that's just a bunch of words — and you don't really know what you're going to get. This leads to the general prompt-and-pray approach, which is not a reliable path to something you can ship.

On the output side, what are you getting? Tokens. Which tokens? Who knows. It's very hard to make any kind of statement about what the system can or can't do. In computer science, a lot of our ability to ship things in high-stakes environments really grounds out in being able to say "this can't happen" or "if this happens, this will always happen."

We came to the realization that you are fundamentally limited operating just in terms of tokens in and tokens out. Tokens don't have semantics — but actions do. An action can be allowed or not. Information flows one way or another. What you need is a system whose architectural structure lines up with the needs you're going to place on it. If you need to be taking actions, activating APIs, operating with restrictions on where information can come and go — you need more than just hints through a prompt.

Our model is designed with a different architecture and different control surfaces, which allows you to make much stronger guarantees about what it will do. That requires changes all the way through — the data, the model, and the deployment stack. But trying to coerce noisy token-based models into having crisp, guaranteed semantics is just a challenging design pattern. Building a reliable technology out of unreliable pieces is not necessary. So we didn't do it.
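To make "an action can be allowed or not" concrete, here is a toy validator (the action names and schemas are invented for illustration and are not APT-1's actual interface): a structured action can be checked against an explicit policy before anything executes, whereas free-form token output offers no hook for such a check.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    args: dict

# Hypothetical deployment policy: the only actions this agent may ever take,
# each with the exact argument fields it requires.
ALLOWED = {
    "lookup_balance": {"account_id"},
    "send_statement": {"account_id", "email"},
}

def validate(action: Action) -> bool:
    """Allowed iff the action name is on the allowlist and its arguments
    exactly match the declared schema."""
    schema = ALLOWED.get(action.name)
    return schema is not None and set(action.args) == schema

print(validate(Action("lookup_balance", {"account_id": "a-123"})))  # True
print(validate(Action("issue_refund", {"amount": 500})))            # False: not on the allowlist
print(validate(Action("send_statement", {"account_id": "a-123"})))  # False: missing email
```

The point of the sketch is only that actions, unlike tokens, have a shape a system can make guarantees about.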

24:14 — What makes data truly agentic

Conor: How do you create the right training data when conversations aren't just text — you've got goals, policies, backend tools, and APIs all in play?

Dan: The word "agent" gets thrown around a lot. To me, agentic means there are goals at play, there are going to be actions, and those actions are going to have consequences. A RAG-based question-answering loop is not agentic.

Any agentic interaction involves humans speaking natural language, actions and APIs in complex orchestrations, and an agent with its own goals. When I'm in a banking environment, it's not just doing what I tell it — there are policies and goals. It's a conversation involving my goals, the system's goals, ambient policies, and APIs.

One of the things we spent a long time figuring out is how to get data that describes what people say in context based on what they want, how an agent should reply, what actions should be taken, and how all that telemetry is connected. That data just doesn't exist. So we had to make it.

The next piece is metacognition. As a human, when someone asks you a question, you know whether you know the answer, don't know the answer, or are guessing. Systems in next-token prediction mode don't do this. In an agentic situation, you need to know: can I do that? What information do I have? Once you build a model that operates over that structure, you have a real distinction between what you do and don't know — as opposed to the situation that gives rise to hallucinations, which is: I generated some tokens and have no idea whether they're right. If they're right, we call them truth. If they're wrong, we call them hallucination. But the model doesn't internally distinguish between them. A metacognitive model does.

28:02 — Hallucinations as an iceberg — visible vs. undetectable

Conor: How is APT-1 structurally guaranteeing a higher degree of accuracy?

Dan: Because APT-1 is architected around information and actions, it has a metacognitive ability to track its information and move it around. It gives information and actions actual first-class status in the model.

Hallucinations are a bit of an iceberg. When we talk to enterprises, they'll say the hallucination rate of an existing solution is too high to ship. When we actually look, the hallucination rate is often five times what they thought. That's because there are different kinds of hallucinations, many of which are hard to detect. If something is plausible enough, it's indistinguishable from the truth.

One kind is a simple factual mismatch — you ask for your bank account balance and get the wrong number. But then there are hallucinations like a refund policy that sounds completely plausible — it's just not your refund policy, or it's your policy but for someone with a different frequent flyer status. When a developer looks at that output, it looks right. "Looks right" and "is right" are not the same thing.

One of my biggest fears is that we are increasingly building systems that generate output indistinguishable from the truth. That is not a good situation to be in. Historically, when systems made mistakes, their output was distinguishable. Now it often isn't.

34:16 — Building a prototype model for under $11 million

Conor: You built your prototype APT-1 for under $11 million — a fraction of what the frontier labs spend. What does that say about the path forward?

Dan: Whenever you scale things up, scale is usually good — models do tend to get better. But on whatever axis you're scaling, there are eventually going to be diminishing returns. We got really great gains by training on more and more of the web until we, as a community, exhausted the high-quality material on it.

People often have trouble telling at the beginning of a new technology whether they're on an exponential curve or an S-curve. It's always an S-curve. And eventually, as that starts to flatten out, you want to go in a new direction. Most technology is much more like a sequence of ideas, each building on the last, bringing you from one operating point to the next — not a single exponential.

The direction we went in moved away from a token-based view of the world into a richer model space. What drove that was figuring out how to do for conversational AI what RL has done for game playing, math, and code. By taking the model in a different direction, we got on a different operating curve — one architected around reliability. The current scaling curve is primarily about breadth and horizontal intelligence. We're building for vertical capability: making sure that in medical, banking, or any high-stakes environment, the system is doing the right thing, following the rules, and not going to hallucinate and take a costly wrong action.

39:57 — Applying RL to conversations without a zero-sum winner

Conor: AlphaGo learned by playing against itself. How did you make RL work for conversations with an agentic model where there isn't a zero-sum winner?

Dan: RL is all about doubling down on the things that work and doing less of the things that aren't working. In games like go or chess, even if you're playing randomly, one side wins and one side loses — that gives you a gradient. The reason you can do that rollout is because the rules are crisp. You might not know what a good move is, but you know what the legal moves are, and the outcome is clearly defined.

If you want to use RL for something like code or math, the key is: can you figure out what the actual reward function is, and what makes a well-formed instance? For math, if you find a solution, you can plug it into a theorem prover to verify it. When you want to apply RL elsewhere, the thing that typically holds people back is that synthesizing data does not mean synthesizing good data.

What we had to think hard about was: in this situation, there is a user, an agent, APIs, policies, rules, and goals. How do you simulate that in a way that puts it in an RL setting and drives the training of the model? Part of that is getting the data simulation right, making sure quality is high enough, and making sure the data has the right structure. It's the same RL principle — it's just been a challenging problem to apply these methods outside of discrete combinatorial domains. And that is one of the big things we've been able to do.
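The principle Dan keeps returning to (try things, get a crisp reward, shift probability toward what worked) fits in a few lines. This is a generic two-armed-bandit sketch with a REINFORCE-style update, not Scaled Cognition's training setup; the pay-off rates are invented:

```python
import math
import random

random.seed(1)

# Two arms; arm 1 pays off more often, but the learner only observes rewards.
TRUE_WIN_RATE = [0.3, 0.7]
prefs = [0.0, 0.0]   # learned preferences (softmax logits), one per action
baseline = 0.0       # running average reward
LR = 0.1

def probs(prefs):
    z = [math.exp(p) for p in prefs]
    return [x / sum(z) for x in z]

for _ in range(2000):
    p = probs(prefs)
    a = 0 if random.random() < p[0] else 1
    r = 1.0 if random.random() < TRUE_WIN_RATE[a] else 0.0
    # Double down on what beat the baseline, back off what didn't.
    for i in range(2):
        grad = (1 - p[i]) if i == a else -p[i]   # d log p(a) / d prefs[i]
        prefs[i] += LR * (r - baseline) * grad
    baseline += 0.01 * (r - baseline)

print(probs(prefs))  # most of the probability mass ends up on the better arm
```

The crisp, observable reward is what makes the rollout meaningful; the hard part Dan describes is constructing an equally crisp signal for multi-party conversations with policies and goals.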

43:31 — LLMs as a condensation of the web — and what happens when it runs out

Conor: We've spent a lot of time having LLMs consume decades of human knowledge in digital form, and we've hit a saturation point. What comes next?

Dan: At the core, LLMs are a condensation of the web. That's basically what drives them, and it's also why many different models have pretty similar capabilities — they all share that same condensation to a large degree.

It took maybe a week to train the model. It took a couple of years to scale up the machinery. But how long did it take to write down all that human knowledge in readable form? About 30 years of writing things down on the web — and before that, millennia of human discovery. Language itself is the set of abstractions we've developed over millennia for communicating knowledge compactly. It was kind of already there.

When people talk about hitting diminishing returns, a lot of that is just: there are no more books to read. The question is what you do once you've consumed all the declarative knowledge on the web and the golden stuff is used up. Synthetic data can be very effective or very ineffective depending on the situation. It's very important that it be the right kind — clean, structured, with the right architecture behind it.

Every technology follows the same supercycle. People grind forward on some benchmark, progress slows, then some new technology comes out and suddenly benchmarks start falling. It looks exponential at first — but it's always an S-curve. It levels off, and then some new idea comes in to take you to the next step. We're at that point now. Just scaling up isn't going to do it. The big gains are going to come from new ideas that are orthogonal to what we're already doing.

50:07 — Reasoning models: where they work and where they don't

Conor: What's your take on reasoning models?

Dan: Reasoning models can be really powerful where they apply. They don't apply everywhere. A caricature of a reasoning model is: try a bunch of things and pick the one that worked. That's most valuable when you can tell which one worked — like a math proof, where once you find the right tactic, the proof goes through.

But once you've done something a few times, what used to be computation and planning becomes memory and experience. You can do it without solving the problem from scratch again. We don't want AI models solving known problems from scratch every time. Maybe that's great if you're the one selling tokens — but we want reliability, not re-computation.

As humans, we use different amounts of computation depending on the situation. If a chip of rock is flying at your face, you just blink. You don't stop and think through all your options. Reasoning is a very particular kind of test-time computation that works for some problems and is absolutely not the right solution for others.

53:04 — Early deployments in regulated industries

Conor: Scaled Cognition has focused early on regulated industries. What have you learned from those early deployments?

Dan: The first thought in my head wasn't specifically "let's do regulated industries." It was: I want to build models that you can trust. What does that mean? When a model represents information to you, it's true. It follows instructions. And as we thought through what trustability really requires — it connects to controllability, auditability, and explainability. If you take a standard model and ask why it did something, it will start generating tokens that purport to explain the decision. But there's no reason that explanation is actually isomorphic to the computation.

We thought really hard about building a model whose natural operation leaves an auditable trail. In regulated industries, you need a traceable, auditable record of what was done, where the information came from, and what policy was being followed. And when a system takes actions in finance or healthcare, those actions are regulated for a reason — the mistakes have a very high cost. I don't need a model that occasionally exhibits creative genius. I need a model that will reliably do the right thing every single time. That is super reliability.

But one of the things we learned is that pretty much every industry feels that if you take the wrong action, the price is too high. The regulated industries were perhaps more sophisticated about how they thought about mistake risk. But across the board, all enterprises find it important that their systems can be trusted, obey controls, and take truthful actions.

57:14 — Why multi-model checking fails

Conor: Why not just put a second model on top to check the first one?

Dan: Now you have two problems. You have a model that's unreliable, and you bring in another model that's also unreliable to check it. A chain-of-models approach can reduce errors, but it's not very effective.

Multiple models can all fail. Sometimes one catches the error. Sometimes it introduces a new error, and then you need a third model, and suddenly you've got ten or fifteen models checking each other. It's burning a lot of tokens. It's slow. It's expensive. And it's still very hard to guarantee anything out of a system with this complexity.

Here's the math: say you have a 20-turn conversation. You've got a model that's 80 percent right, checked by a model that's 80 percent right. That sounds reasonable — but that means 4 percent of turns have a mistake, assuming they're independent. But they're not independent. When you have a hard instance, the model and the checker tend to fail in exactly the same cases. So 80 percent checking 80 percent is more like 82 percent — not 96 percent. And every turn of the conversation is another chance to make a mistake.
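That arithmetic is easy to reproduce (a back-of-the-envelope sketch; the 0.82 correlated figure is Dan's estimate from the episode, not something derived here):

```python
def independent_turn_success(p_model, p_checker):
    """If errors were independent, a turn fails only when both the model
    and the checker miss the same mistake."""
    return 1 - (1 - p_model) * (1 - p_checker)

def conversation_success(p_turn, turns):
    """Every turn is another chance to fail."""
    return p_turn ** turns

opt = independent_turn_success(0.8, 0.8)   # the optimistic 0.96 per turn
corr = 0.82                                # correlated errors: checker fails where the model fails

print(f"per turn:  independent {opt:.2f} vs correlated {corr:.2f}")
print(f"20 turns:  independent {conversation_success(opt, 20):.2f} "
      f"vs correlated {conversation_success(corr, 20):.2f}")
```

Even under the optimistic independence assumption, a flawless 20-turn conversation happens only about 44 percent of the time; at the correlated 0.82 per-turn figure it drops to roughly 2 percent.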

The real problem is that these approaches are an attempt to take a noisy, unreliable horizontal model and use it to produce vertical, reliable, truthful behavior. It's just a bad fit. A situation that is hard for one model will tend to be hard for all of them. We've learned this in machine learning decade after decade: when you have multiple systems, you hope for independent errors and you get highly correlated errors. Plus it's slow and expensive.

Even if you could make it work, it's still better to have a model that in its natural operation doesn't make the mistake in the first place. That is one of the advantages of APT-1 — it gets it right in its natural operation, rather than after seven other models have signed off.

1:00:34 — The minimum bar for trustworthy agentic systems

Conor: Is there a minimum bar builders should be trying to hit for trustworthiness in agentic systems?

Dan: There are definitely minimum bars, though they move around. My personal concern at a societal level is that we're building technologies where nobody knows what to trust. People are going to trust maybe nothing, maybe everything. And this attitude of "it's AI, maybe it's right" is going to leak out until people just accept that everything we build is going to be built on jello.

If there's one thing I want to get out to people, it's this: you don't have to build on jello.

The bars do vary. A RAG-based FAQ system is not really an agentic system — getting a wrong answer back is bad, but it's different from giving someone the wrong medication or lying to them about their account balance. What you see out there is a lot of people trying RAG systems in low-stakes settings because they know the bar needs to be much tighter to ship a system that actually takes actions. Once you do things, not all things can be undone. The consequences can be much higher.

Enterprises and users ultimately have to ask themselves: what is my tolerance for screwing this up? If you have a low tolerance, you need tighter guarantees. That is our goal — to provide models that give those guarantees.

1:04:07 — Societal risk: when AI output is indistinguishable from truth

Conor: What does it mean for society when AI produces output that is perfectly fluent but potentially wrong, with no obvious way to tell the difference?

Dan: One of my biggest fears is that as a field, we are building systems that produce outputs indistinguishable from the truth. That can be very corrosive to society.

On the technology side, we should be putting more effort into building systems that actually tell you the truth — systems that have a first-class notion of whether what they're saying is true. But I think the social side is equally important. It's ultimately a question of digital literacy.

In earlier eras, you could often tell when a system was making mistakes. If a machine translation output wasn't fluent, that tended to correlate with it being unfaithful. There were signs. With search, you got a list of documents and had to evaluate them yourself. With a chatbot, all of that is disintermediated. You ask a question, you get an answer. The digital literacy tools we've developed to evaluate sources don't apply. You can't check the source. You can't build a correlation between suspicious output and information quality β€” because all the information has been made to sound equally certain.

And it's only going to get worse as we reinforce-learn more of these systems. When you RL a system to optimize customer satisfaction, and the system figures out that saying "your package will be there tomorrow" gets a thumbs up while "your package is lost" gets a thumbs down — now you've crossed a line. The system has been explicitly optimized to produce things that are not what you actually want.

We're already seeing this bifurcate society: people who believe nothing, and people who believe everything. If a model is sufficiently good at producing plausible output, those are actually the only two options. To believe only true things requires some way to distinguish truth from falsehood — and as each tool we have for making that discrimination is removed by the technology, we're forced into one of those two camps.

There's also something worth pointing out about how use of these technologies changes behavior. When you realize you can just tell a chatbot the email you want written and send it off, you stop thinking of yourself as the author. When you use AI this way, you've gone from being an author to being an editor. And being an editor is in many cases harder than being an author. People have much more experience being authors than editors. So when faced with learning the hard skill of vetting or just delegating, many choose to delegate. Once you stop editing the result, you're trusting the system's work over your own abilities. That is corrosive in a compounding way. I think we have better options.

1:13:33 — Where Dan is inspired in AI research today

Conor: Where are you inspired today in AI research and development?

Dan: What I am personally most passionate about is the same whether I'm speaking from an academic or a company perspective: building models you can trust. That means models that are controllable and truthful, where you can place guarantees about what they will do — and sometimes more importantly, what they won't do.

Being able to bring that kind of technology into the world is very important to having AI that can be a force for good. A precondition to that is having trust, reliability, robustness, and control. That is the core of what I think about all the time.

Originally, the biggest challenge with AI was that nothing worked well enough. Now, a lot of the challenges are in places where things are working — where they're having downstream consequences because of what they can do. You don't have to build important systems on jello. You don't have to accept that you either believe everything or nothing. We can take actions to improve digital literacy. We can push the technologies we build to have better properties. Tools you can trust — that's really where we're focused.


About Dan Klein

Dan Klein is co-founder and CTO of Scaled Cognition and a professor of computer science at UC Berkeley, where he leads the Berkeley NLP group. He is the recipient of the ACM Grace Murray Hopper Award for his work on grammar induction. His former PhD students now lead AI teams at Google, Stanford, MIT, and OpenAI. His first startup, adap.tv, was acquired by AOL for $405M. His second, Semantic Machines, was acquired by Microsoft in 2018, where Dan spent five years integrating conversational AI at enterprise scale. That experience — watching great demos fail to ship — is what led him to start Scaled Cognition and build APT-1, the Agentic Pre-trained Transformer, a model designed from the ground up for actions rather than tokens.
