In February 2024, Klarna's AI agent handled 2.3 million customer conversations in a single month — the equivalent work of 700 full-time employees — cutting average resolution time from 11 minutes to 2 minutes. That's not a prediction. That's a company that shipped something, measured it, and published the results.
If you've been using ChatGPT as a browser chat window and wondering how to actually build an AI agent with ChatGPT that does real things, this guide covers the full picture: how agents think, two concrete paths to building your first one, real code, and the honest cost breakdown nobody else includes.
What is an AI agent — and how is it different from a chatbot?
Most "AI chatbots" you see today are sophisticated autocomplete. Send a message, get a reply. Done. An AI agent is different in one fundamental way: it takes actions.
A chatbot answers. An agent decides what to do next, calls tools to do it, checks the result, then decides again. That loop keeps running until the task is complete — not until it generates one response.
The anatomy of any agent looks like this:
- LLM (the brain) — GPT-4o processes your goal and chooses what action to take
- Tools — functions the model can call: search the web, read a file, send an email, query a database
- Memory — short-term (conversation history) and long-term (vector store with embedded documents)
- Loop — the agent cycles through reasoning and action until the task is finished
Sam Altman, CEO of OpenAI, described the shift in January 2025: "We are now at the point where AI can do many things in an 'agentic' setting — taking sequences of actions, doing research, writing and executing code."
This loop has a name: the ReAct cycle (Reason + Act). Perceive the situation. Think about what tool helps. Call it. Observe the result. Repeat until done. It's not magic — it's a while-loop with a language model making the decisions.
How does an AI agent actually make decisions?
Think of it like a chef, not a recipe card. A recipe follows steps in sequence. A chef looks at what's available, decides what to make, reaches for tools as needed, tastes along the way. That's closer to how an agent works.
The decision cycle:
- Perceive — receive the user's goal
- Think — which tool helps here?
- Act — call the tool with specific parameters
- Observe — read what the tool returned
- Repeat or respond — is the goal achieved? If not, loop again.
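Stripped of API details, the cycle above is just a loop with a decision-maker inside it. Here is a toy sketch where a stubbed-out `fake_model` stands in for GPT-4o — `fake_model` and `TOOLS` are illustrative names, not part of any SDK:

```python
# Toy ReAct loop: a stubbed "model" decides whether to act or respond.
TOOLS = {
    "lookup_capital": lambda country: {"France": "Paris"}.get(country, "unknown"),
}

def fake_model(goal, observations):
    # A real agent would call GPT-4o here; we hard-code one decision path.
    if not observations:
        return ("act", "lookup_capital", goal)  # Think: pick a tool
    return ("respond", f"The capital is {observations[-1]}.", None)

def run_toy_agent(goal: str) -> str:
    observations = []
    while True:
        decision, payload, arg = fake_model(goal, observations)  # Perceive + Think
        if decision == "respond":                                # Goal achieved
            return payload
        observations.append(TOOLS[payload](arg))                 # Act + Observe

print(run_toy_agent("France"))  # → The capital is Paris.
```

Swap `fake_model` for a real model call and `TOOLS` for real functions, and you have the skeleton every agent framework wraps in varying amounts of abstraction.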
GPT-4o reaches roughly 94% accuracy on tool-calling benchmarks — meaning it correctly identifies which function to call and with what arguments almost every time. Not perfect. Good enough for production.
The four core components of every AI agent
Before you write a single line of code, understand what you're building. Every functional agent has four parts:
1. A language model as the decision-maker
GPT-4o is the current default for most production agents. It handles function calling reliably, and OpenAI's Agents SDK (released March 2025) is built around it. GPT-4o-mini works well for simpler routing tasks at roughly one-tenth the cost — useful once you know what you're doing.
2. Tools the agent can actually use
Tools are Python functions you write and describe to the model using JSON Schema. The model decides when to call them and with what arguments. Your code actually runs them. Common examples:
- `search_web(query)` — live search via Bing or Brave API
- `read_file(path)` — read a local document
- `query_database(sql)` — run a SQL query against your data
- `send_email(to, subject, body)` — send through the Gmail API
The model doesn't execute tools. It issues instructions. You control what actually runs.
3. Memory at two levels
Short-term memory is the conversation history — every message passed to the model each call. Long-term memory requires a vector store: you embed documents and retrieve relevant chunks at query time. OpenAI's built-in File Search handles this cleanly if you're on the Assistants API.
Without memory, your agent forgets everything between sessions. With it, it can reference past conversations, company policies, or product documentation.
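The retrieval step behind long-term memory is simpler than it sounds: embed the query, then rank stored chunks by cosine similarity. A minimal sketch, using hand-made 3-dimensional vectors in place of real embeddings (in production these would come from an embedding model such as `text-embedding-3-small`):

```python
import math

# Toy long-term memory: text chunks paired with illustrative embedding vectors.
MEMORY = [
    ("Refund policy: refunds within 30 days.", [0.9, 0.1, 0.0]),
    ("Shipping: orders ship in 2-3 business days.", [0.1, 0.9, 0.0]),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, k=1):
    # Rank stored chunks by similarity to the query embedding, return the top k.
    ranked = sorted(MEMORY, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# A query whose embedding points toward the refund document:
print(retrieve([1.0, 0.0, 0.0]))  # → ['Refund policy: refunds within 30 days.']
```

OpenAI's File Search does this chunking, embedding, and ranking for you; the sketch just shows what happens under the hood.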
4. An orchestration layer
This is the code running the loop. Options: OpenAI Agents SDK directly (simplest for single agents), LangGraph (better for complex conditional flows), or CrewAI and Agno (best when you need multiple agents collaborating). We've shipped systems with all of them. For a first agent, OpenAI's SDK is the right move — built-in tracing, guardrails, and tool registration without extra dependencies.
Two paths: build with code or without
Not everyone needs Python. Here's the real distinction:
Path 1 — No code (OpenAI GPT Builder or Assistants playground)
If your goal is a custom assistant with specific knowledge and some tool access, the OpenAI platform handles this without a line of code. Define instructions, upload documents, connect integrations. Works well for internal Q&A bots, customer support, and document search.
Des Traynor, co-founder of Intercom, described building their Fin support agent: "Building Fin on top of GPT-4 took our team weeks, not years. Fin now resolves over 50% of support questions without any human involvement — that is a step-change in what's possible." That result came from a focused use case, not a complicated architecture.
Path 2 — Python + OpenAI Agents SDK
For anything more complex — agents that write and execute code, chain external API calls, or manage multi-step workflows — you need code. Here's a minimal working agent skeleton:
```python
import json

from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for current information on a topic",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "The search query"}
                },
                "required": ["query"],
            },
        },
    }
]

def run_agent(user_message: str) -> str:
    messages = [
        {"role": "system", "content": "You are a research agent. Use tools to answer accurately."},
        {"role": "user", "content": user_message},
    ]
    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )
        message = response.choices[0].message
        # No tool call means the agent is done
        if not message.tool_calls:
            return message.content
        # Process each tool call and feed results back
        messages.append(message)
        for tool_call in message.tool_calls:
            # Arguments arrive as a JSON string — parse before use
            args = json.loads(tool_call.function.arguments)
            # Replace this with your actual tool implementation
            result = execute_tool(tool_call.function.name, args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })
```
This is the skeleton. Add real tool implementations, error handling, and logging — and you have something worth shipping.
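What might `execute_tool` look like? One common pattern is a registry dict that maps tool names to functions — a hypothetical sketch, with a stubbed `search_web` where a real search API call would go:

```python
# A hypothetical execute_tool dispatcher for the skeleton above.
def search_web(query: str) -> str:
    # Stub: a real implementation would call a search API (Bing, Brave, etc.)
    return f"Top results for: {query}"

TOOL_REGISTRY = {"search_web": search_web}

def execute_tool(name: str, args: dict) -> str:
    if name not in TOOL_REGISTRY:
        return f"Error: unknown tool '{name}'"
    try:
        return TOOL_REGISTRY[name](**args)
    except Exception as exc:
        # Return errors as text so the model can see them and recover next loop
        return f"Error running {name}: {exc}"

print(execute_tool("search_web", {"query": "AI agents"}))  # → Top results for: AI agents
```

Returning errors as strings instead of raising is deliberate: the model reads the error on the next loop iteration and can retry with different arguments.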
What does it actually cost?
Nobody publishes this honestly. Here's the real breakdown for GPT-4o at early 2025 pricing:
| Use case | Approximate tokens/day | Estimated daily cost |
|---|---|---|
| Light internal tool (10 queries) | ~50K | ~$0.25 |
| Customer support agent (100 queries) | ~500K | ~$2.50 |
| Heavy research agent (500 queries) | ~2.5M | ~$12.50 |
GPT-4o-mini cuts costs by roughly 10x for simpler subtasks. Most production agents use a mix — mini for classification and routing, full 4o for complex reasoning steps. Use the Agents SDK's built-in tracing from day one. Debugging a black-box agent loop is genuinely painful without it.
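You can run this arithmetic yourself. The per-million-token prices below are assumptions based on early-2025 published rates and will drift, so swap in current pricing before trusting the numbers:

```python
# Back-of-envelope daily cost estimator. Prices are assumed early-2025 rates
# in USD per 1M tokens — verify against current published pricing.
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def daily_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 100 support queries/day at ~4K input + 1K output tokens each:
print(round(daily_cost("gpt-4o", 400_000, 100_000), 2))       # → 2.0
print(round(daily_cost("gpt-4o-mini", 400_000, 100_000), 2))  # → 0.12
```

The output lines up with the table's order of magnitude: a 100-query support agent on GPT-4o lands around a couple of dollars a day, and routing the same load to mini drops it by roughly 10x.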
Limitations worth knowing before you ship
After 50+ projects building AI systems for clients in fintech, legal, and e-commerce, our team has learned some things the hard way. Agents struggle with:
- Long multi-step tasks without checkpoints — they can drift from the original goal after many tool-call loops
- Tasks requiring precise numerical accuracy — always validate math with a dedicated tool, never rely on the model's arithmetic alone
- Real-time data without proper tool access — a model trained on 2024 data doesn't know what happened last week
When we implemented a RAG-based support agent for a fintech client, it reduced support tickets by 40% in three months. But the first two weeks were spent adding guardrails — the agent occasionally generated policy details that weren't in the source documents. The fix was straightforward: constrain it to only answer from retrieved context, never from model memory. But you have to build that constraint in intentionally.
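The simplest place to build that constraint is the system prompt itself: state the rule explicitly and give the model a sanctioned way to refuse. A sketch of the pattern — the wording here is illustrative, not the exact prompt we shipped:

```python
# Constrain a RAG agent to retrieved context by making the grounding rule
# explicit in the system prompt. Illustrative wording, not a shipped prompt.
def build_grounded_messages(context_chunks: list[str], question: str) -> list[dict]:
    context = "\n\n".join(context_chunks)
    system = (
        "Answer ONLY from the context below. If the answer is not in the "
        "context, say 'I don't have that information'. Never guess and never "
        "use knowledge from outside the context.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

msgs = build_grounded_messages(
    ["Refunds are issued within 30 days of purchase."],
    "What is the refund window?",
)
print(len(msgs))  # → 2
```

Prompt constraints alone aren't airtight, which is why production systems layer on output checks (for example, verifying that cited policy details actually appear in the retrieved chunks) — but the explicit refusal path eliminates most of the hallucinated-policy failures we saw.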
The ecosystem is also genuinely noisy right now. Hugging Face surpassed 1 million available models by mid-2025, and top agent frameworks together accumulated over 400,000 GitHub stars. Don't let that overwhelm you into framework-hopping. Pick one, build something that works, then evaluate whether you need more complexity.
If you're aiming to move past proof-of-concept into something production-grade, Yaitec's team of 10+ specialists has shipped agents across fintech, legal automation, and content systems. We're happy to help you design the right architecture for your specific use case — contact us and tell us what you're trying to build.
Start small, then scale what works
Gartner forecasts that by 2025, 50% of enterprises using generative AI will have at least one agent in production — up from fewer than 1% in 2023. The gap between "thinking about it" and "running in production" is smaller than most people expect.
Pick a single, boring, well-defined task in your workflow. Something that follows clear rules and has a measurable outcome. Build that agent first. Get the tool calls working. Add memory. Ship it. Then build the next one.
The companies winning with agents aren't the ones with the most sophisticated architectures. They're the ones who stopped planning and started measuring.