With billions being invested in AI and the infrastructure around it, the industry has picked up a breakneck pace ever since ChatGPT popularized the field several years ago. Now the entire semiconductor industry seemingly revolves around skyrocketing demand for AI data centers. The question on everyone’s lips: are the models good enough to make a material impact, and what risks come with using AI?
Machine learning technology has certainly helped make strides in many areas of industry and research. Voice recognition is far more reliable, medical analysis is faster and more accurate, materials science is quickly evolving, and even weather prediction and climate tracking are seeing massive strides, thanks to the ability of bots to vastly speed up or add precision to processes performed by humans.
Is AI really getting better?
And yet, anyone who’s lived through the past few years has witnessed the almost-monthly improvement on every front: ChatGPT keeps getting smarter and doesn’t forget context as easily, Perplexity digs up information ever more effectively, Midjourney no longer creates six-fingered humans, and video generators like Sora don’t defy basic physics so often. Gigantic disasters can and do happen due to over-eager agentic bots, but the error rate is being reduced by the day, and the number of guardrails continues to grow.
Anthropic’s CEO said that AI could push unemployment as high as 20% in the next five years, and Microsoft’s ceaseless charge to integrate Copilot into every facet of its OS means that AI is inescapable for the average user. So, if AI is going to be everywhere, what makes it tick, and what factors could improve a given model?
To understand that, we must break down what makes AI function, and what could make any given model better. After all, the models’ outputs need to become more trustworthy and/or of higher quality than a common bowl of digital slop.
How LLMs work
To that end, LLM-based models (both text and agentic) are expanding their reasoning capabilities and reducing the hallucination rate. This is achieved in several ways, but one common theme among all the latest versions of popular models is extra-large context windows and hundreds of billions, sometimes trillions, of parameters.
Context windows for LLMs are measured in tokens (words, fragments, or symbols) and grew from around 512 tokens in 2018 to over 1 million in current-generation models, an improvement of more than 2,000x in just seven years. Larger windows give the model a bigger workspace to formulate its response, enabling much more detailed “thinking,” better conversation memory, contextual awareness, and the ability to consult additional data like webpages, documents, and even entire code repositories.
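To make the idea of tokens concrete, here is a deliberately simplified sketch. Real models use learned subword vocabularies (such as byte-pair encoding), not a whitespace-and-punctuation split; the `toy_tokenize` helper below is a hypothetical stand-in, only meant to show that a sentence breaks into more tokens than words.

```python
import re

# Toy illustration of tokenization. Real LLM tokenizers use learned
# subword vocabularies; this crude split just shows the general idea
# that text is chopped into small units before the model sees it.
def toy_tokenize(text: str) -> list[str]:
    """Split text into crude word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

prompt = "Context windows grew from 512 tokens to over 1 million."
tokens = toy_tokenize(prompt)
print(len(tokens), tokens[:5])
```

A context window of 1 million tokens means roughly this list could be a million entries long, with the model attending over all of them at once.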
A larger window doesn’t imply a model is smarter, but it is necessary to support more advanced reasoning, particularly multi-step reasoning and multi-modal reasoning (more on those below). Image and video generators don’t use context windows per se, and their tokens are instead pixels and movement vectors, but the respective analogs to context windows enable the much-improved final rendering quality we see these days, as they’re able to consult more images/videos as source material.
Parameters are values in the model that lend more or less weight to certain connections between their training information, like relationships between words and facts. Having more parameters generally allows models to capture more complex, interconnected information, though increasing the number also increases the cost of running queries. A high number of parameters is essential for research-grade models, while simple search/classification engines will be fine with “only” a few billion.
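The cost of adding parameters is easy to see with back-of-the-envelope arithmetic. The parameter counts below are illustrative rather than vendor figures, and the sketch assumes each parameter is stored at 16-bit (2-byte) precision, a common choice for inference:

```python
# Rough memory cost of holding a model's weights in memory.
# Assumption: 2 bytes per parameter (16-bit precision); counts are
# illustrative, not tied to any specific commercial model.
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    return n_params * bytes_per_param / 1024**3

for name, n in [("7B (classifier-scale)", 7e9),
                ("70B (mid-range)", 70e9),
                ("1T (research-scale)", 1e12)]:
    print(f"{name}: ~{weight_memory_gb(n):,.0f} GB of weights")
```

This is why a few-billion-parameter model fits on one consumer GPU while trillion-parameter models need racks of accelerators, and why query cost grows with parameter count.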
Multi-modality is also one of the linchpins of contemporary models of various types. The advancement means that models consider not just text (or pixels for images, or vectors for video) when generating their output. For example, chatbots can now read images, charts, code, and even videos, and use them as references when formulating replies to your queries. Retrieval-Augmented Generation (RAG) is also becoming commonplace, where a bot supplements and/or verifies its answer with external information it looks up.
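The RAG loop itself is conceptually simple: retrieve the most relevant external document, then prepend it to the prompt before the model answers. Production systems rank documents with vector embeddings; the sketch below substitutes plain word overlap so it stays self-contained, and the documents and query are invented examples:

```python
import re

# Minimal sketch of Retrieval-Augmented Generation (RAG).
# Real systems rank documents by embedding similarity; simple word
# overlap stands in for that here.
def words(s: str) -> set[str]:
    return set(re.findall(r"\w+", s.lower()))

def word_overlap(a: str, b: str) -> int:
    return len(words(a) & words(b))

def retrieve(query: str, docs: list[str]) -> str:
    # Pick the document sharing the most words with the query.
    return max(docs, key=lambda d: word_overlap(query, d))

docs = [
    "The 2018 models used roughly 512-token context windows.",
    "Image generators were initially trained on under 10 million images.",
]
query = "How big were early context windows?"
context = retrieve(query, docs)
# The retrieved text is injected into the prompt the model actually sees.
augmented_prompt = f"Context: {context}\n\nQuestion: {query}"
print(augmented_prompt)
```

Because the model answers from retrieved text rather than memory alone, RAG both reduces hallucinations and lets a bot cite sources newer than its training cutoff.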
Conversely, visual generators can rely on textual information to better understand prompts (prompt adherence), provide captions, and cross-reference information. One particularly neat trick is “zero-shot learning,” in which the model infers what a certain animal (say, a lion) is and generates a picture of it, having obtained information from textual context and description rather than being specifically trained on images of lions.
Multi-step reasoning is another feature you might have noticed in some bots, and it is quickly becoming commonplace. It’s probably the closest analog to human reasoning: a bot breaks a task or question down into separate parts, devoting most of its brainpower to each step and evaluating the results before moving on. You might even have noticed some bots retracing their steps when hitting a dead end, just like humans would.
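The break-down-and-backtrack pattern can be sketched as a depth-first search over alternative steps. This is a toy model of the behavior, not how any vendor implements reasoning internally; the plan, the step names, and the use of `None` to mark a dead end are all invented for illustration:

```python
# Toy sketch of multi-step reasoning with backtracking: at each step the
# "model" tries alternatives in order, and retraces its steps when a
# branch fails. Real models do this implicitly in their reasoning traces.
def solve(plan, state=()):
    if not plan:                  # every step completed: success
        return list(state)
    alternatives, *rest = plan
    for option in alternatives:   # try each alternative for this step
        if option is not None:    # None marks a dead end for this branch
            result = solve(rest, state + (option,))
            if result is not None:
                return result
    return None                   # backtrack: no alternative worked

plan = [
    ["parse question"],
    [None, "gather facts"],       # first attempt fails, second succeeds
    ["compose answer"],
]
print(solve(plan))
```

The failed first attempt at the middle step never appears in the final chain, which mirrors how a reasoning bot discards dead-end branches before presenting its answer.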
This type of reasoning is powerful, but since it takes a long time to compute, it’s generally reserved for premium usage plans. Models like Anthropic’s Claude are particularly adept at multi-step reasoning, having been designed with development tasks in mind, even going as far as saving their “state” to files to better handle long-term tasks. Most, if not all, contemporary models have “fast” and “thinking” modes of operation.
Tool use is quickly becoming critical. Almost by definition, a repetitive task should be automated by a computer, and to that end, a model needs to integrate with and use the APIs of commonly available tools. As examples, Google’s Gemini can interact with most of the Google Workspace ecosystem, while Anthropic’s Claude made a living from day one as a coding assistant, integrating with many developer tools. Anthropic is also testing how LLMs run entire businesses, with mixed results. ChatGPT also has a plug-in system of its own. In effect, these models can now interact with these services just as well as (or better than) any human.
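Under the hood, tool use boils down to the model emitting a structured request and the host application dispatching it to real code. The JSON shape, tool names, and registry below are illustrative assumptions, not any vendor’s actual wire format:

```python
import json

# Sketch of tool use: the model emits a structured "tool call" and the
# host application looks it up and runs it. Tool names and the JSON
# shape are made up for illustration.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}

def dispatch(tool_call_json: str):
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]         # look up the registered tool
    return fn(**call["arguments"])   # run it with the model's arguments

# Pretend the model asked to run a calculation:
result = dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}')
print(result)  # 5
```

The safety implications discussed below follow directly from this pattern: whatever sits in the tool registry, the model can invoke, which is why guardrails around tool access matter so much.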
Training set sizes matter, too. Any bot, whatever its type, is only as good as the data it’s trained on. This characteristic’s evolution is fairly predictable, given that it’s mainly limited by the capabilities of the underlying hardware, and that too has seen massive leaps in under a decade.
For an LLM, the average training set size was around 13 billion tokens in 2018, and the amount is now estimated to be well over 20 trillion. Image generators were initially trained on less than 10 million images, a stark contrast to the multiple billions of today. Videos take up a lot of space and RAM, and early generators made do with under 1 million videos evaluated, while today they analyze billions of clips.
All combined, the techniques detailed above help lower hallucination rates and make for “smarter” bots overall that are capable of executing more tasks than before. Answer accuracy is improving all the time, and agentic bots are also much less prone to making boneheaded decisions when manipulating their respective tools.
Trust in a bot’s output or operations includes a concept of safety — not just in the politico-social sense of defining what information is safe for a bot to provide, but also the relative safety of its operations when using tools. After all, it’s not ideal for your bot to suddenly email everyone in your contact list because it misinterpreted an exclamation, executed irreversible operations on a batch of images you want touched up, or cleaned up your thesis’s formatting by removing all the content.
Safety is a fairly hot topic right now, given the growth of agentic and tool-based AI. Grok has been under the microscope for safety in particular, as legislation begins to surface as a result of AI’s ease-of-use.
Each vendor has its own mixed set of approaches to this topic, called “guardrails.” Safety is, however, a trade-off, as some models will be far more conservative than others when answering questions or executing tasks and can err too much on the side of caution, refusing to answer innocuous questions. Generally speaking, the more capable they are, the more careful they tend to be. After all, with great power comes great responsibility.
Highlights of popular models
The characteristics and improvements described above generally apply to almost any contemporary, full-sized model, but here are a few key highlights from each vendor:
GPT 5.2 (OpenAI): The newer version of OpenAI’s flagship model claims a much lower hallucination rate (37%, down from 62%) and should be up to 10x more computationally efficient, with much-improved response quality, whether in text or code. It’s now fully multi-modal and can interpret images, video, and audio to formulate responses. It’s also capable of using real-time information.
Although it’s a generalist model at its core, its plugin architecture allows it to be integrated almost anywhere, serving equally well as a browser search tool or a coding assistant. ChatGPT is also customizable with custom instructions and offers multiple personalities, letting the user select the desired style and tone for responses. However, when GPT-5 was initially released, some users were not happy with its outputs.
Gemini 3 (Google): Released in late 2025, Gemini 3 is a generalist model equipped with a Deep Think architecture that allows it to plan, pause, and self-correct before responding. Google claims the multi-step reasoning improvements let it top benchmarks in coding and reasoning tasks. It’s natively multi-modal, taking in common types of digital media and code repositories as inputs. Users of the Google ecosystem (Gmail, Chrome, Workspace, etc.) can benefit from Gemini’s tight integration with those services.
There are also Gemini Gems, shareable chatbots that you can tailor for specific tasks. Google’s AI Studio ought to make it easy for developers to integrate Gemini into their applications, too. Google’s Antigravity platform also allows users to expand on its abilities for bigger tasks, but it doesn’t quite stick the landing. In one infamous example, one of Gemini’s agents wiped a user’s entire HDD.
Claude 4.5 (Anthropic): Claude has been designed as a model for programmers from the get-go, so it’s no wonder that it claims to be optimized for multi-hour tasks and scores particularly well in coding and reasoning benchmarks. It excels at complex operations and uses hybrid reasoning (a mix of fast and accurate reasoning modes), and is naturally well integrated with GitHub and other development tools, being capable of using several in parallel.
All Claude 4.5-based models are multimodal and multilingual. Anthropic prides itself on designing Claude with a safety-first approach and particularly strong guardrails, with the model reportedly scoring quite high on safety tests. That’s a particularly welcome feature in a bot whose main output is code, which intrinsically needs to be correct. Interestingly, Claude can write its “state” to files if given file access, letting it improve its continuity on long-term tasks.
Grok 4.1 (xAI): Grok 4.1 is one of the most powerful AI models on the planet, thanks to its multi-modality, two-million-token context window, and reasoning capabilities. It’s built on a Mixture of Experts (MoE) architecture, in which the model activates specialist parts of itself to answer a question rather than running in its entirety, making for faster answers and more efficient computing while retaining answer quality. This has helped the Elon Musk-led company’s flagship thinking model excel in various benchmarks, particularly text generation and search.
Unlike other models, like GPT-5 and Claude, Grok 4.1 Thinking draws on a real-time data set, which may give it an advantage, as it has a later knowledge cutoff. While safety is an issue on Grok models in particular, the model excels at thinking and reasoning.
Mistral Large and variants (Mistral AI): Mistral has the Mistral Large model as its flagship offering (released in 2024), but the company focuses on offering multiple variants for integration into products and services, each optimized for a particular type of task and/or desired computing efficiency. As examples, Mixtral uses a mixture-of-experts, Codestral and Devstral are targeted at development services, Pixtral and Voxtral handle visual and audio recognition, and Magistral excels at reasoning.
Many of Mistral’s models are published as open-weight models under the Apache 2.0 license, while the higher-end variants generally require a commercial license. They’re best thought of as models-as-a-service; Mistral doesn’t have many end-user applications compared to rivals like ChatGPT.
Where AI is headed next
At this point, you may be asking yourself what lies beyond “models keep getting smarter.” In the short term, that’s definitely where all the low-hanging fruit is, enabled by Nvidia’s and AMD’s technological advancements with their respective accelerators, plus all the investment in AI data centers, though TSMC itself is reportedly ‘very nervous’ about an AI bubble.
In AI, optimization is also paramount, as Total Cost of Ownership (TCO) is king for an AI data center, due to the power-guzzling nature of the tasks at hand. Any optimization is welcome; for example, a few years ago, it would have been difficult to predict that a data format like FP4 (4-bit floating point) would ever become useful. Now, Nvidia is pushing its own standard, NVFP4.
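To see why 4 bits is such an aggressive target, consider that it leaves only 16 representable levels per value. The sketch below uses plain uniform integer quantization as a stand-in; NVFP4 itself is a block-scaled floating-point format and works differently, and the weight values here are made up:

```python
# Rough sketch of 4-bit quantization: map floats in [lo, hi] onto 16
# levels. This is simple uniform integer quantization for illustration;
# NVFP4 is a block-scaled floating-point format and is not shown here.
def quantize_4bit(values, lo=-1.0, hi=1.0):
    scale = (hi - lo) / 15                          # 16 levels -> 15 gaps
    codes = [round((v - lo) / scale) for v in values]
    codes = [min(15, max(0, c)) for c in codes]     # clamp to 4 bits
    dequant = [lo + c * scale for c in codes]       # reconstruct floats
    return codes, dequant

weights = [-0.9, -0.2, 0.1, 0.37, 0.95]
codes, approx = quantize_4bit(weights)
print(codes)
print(approx)
```

Each weight shrinks from 32 or 16 bits to 4, an immediate 4-8x memory and bandwidth saving, at the cost of the rounding error visible in the reconstructed values.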
The first endgame goal is for AI to become deeply integrated into software ecosystems, from web- and device-based applications to operating systems. A good portion of the internet and our devices already depends on cloud services like Amazon Web Services (AWS), Azure, et al.
AI services will soon be no different — as their APIs and models get integrated into every single bit of software, in the medium-term, a good portion of the computing world will cease to function without them.
For example, almost every application has a search function of some sort, something AI bots are particularly adept at. Yes, on-device AI is widespread, but much as happened with cloud service providers, the convenience and ease of development of an external API will trump almost everything else, implicitly sending out lots of your data for processing.
Agents and integrations
AI agents set much of the scene for the future of AI. Theoretically, you can ask an agent to perform a task and it will do it for you, feeding into a larger LLM that is working on a bigger task. However, the main issue for agentic AI is trusting its actions; just ask the person who had their application’s production environment wiped by Replit for no apparent reason. At least the bot was honest; not every employee is that forthcoming.
Getting developers hooked on using AI APIs in their apps is one thing, but you can cut out the middleman if you are the app. OpenAI’s ChatGPT Atlas, Perplexity’s Comet, and Atlassian’s Arc are all browsers that put their respective services front and center, conveniently bypassing Chrome, Firefox, Safari, and other points of entry to the internet.
Being the internet’s gatekeeper is a position of absolute power: you control the user’s eyeballs, can collect advertising money, and can suggest, cajole, plead, and strong-arm users into using your services. Last year, Perplexity and Search.com put in offers to buy Chrome from Google to the tune of $35 billion, a deal that ultimately didn’t go through.
A revenue stream from selling the abilities of your bots is all well and good, but trading in user data is the business gift that keeps on giving. The amount of data that conventional services already hold about people is staggering, but with heavy AI usage, it may rise to another level entirely.
AI’s privacy problem
The issue is twofold: firstly, people have long, in-depth conversations with LLMs, where they provide lots of personal details, rather than just a handful of Google searches. Secondly, once you grant a bot access to your data or services, there’s little more than a Terms of Service statement stopping it from siphoning it all away. Many developers might not even be aware of just how much of the user’s data is traveling through their app and being sent elsewhere.
Chatbot logs have already been used in court multiple times, and their far longer and more detailed nature makes them much better proof of conditions or intent than simple search terms. At some point, an AI bot (or all of them) may well have better insight into your life and patterns than you do yourself. Such detailed information is worth a lot of money to the right bidder, and the amount, accuracy, and price of said information are all likely to rise.
AI companies like OpenAI are planning to go one step further and make their own devices. It’s not difficult to imagine that at some point, OpenAI or Meta might release their own smartphones where everything is AI-centric and every byte of your documents is intimately known. The Ray-Ban Meta glasses may have interesting uses, but it’s chilling to know that one day, AI might be watching and parsing everything you see.
All told, there might not be one grand unifying vision among AI companies, but one thing is fairly certain: they’re all looking to become, and likely will become, fully entrenched in your professional and personal lives.
