Developers of agentic AI have been making some big claims. The promise has been of autonomous systems that can do everything, from booking our flights and keeping an eye on competitors in real time to handling entire procurement cycles, all without needing an actual human to hit “confirm.” And while the technology needed to achieve most of these marvels already largely exists, the infrastructure necessary to make it work reliably at scale still leaves much to be desired.
Gartner recently projected that over 40% of agentic AI projects will be canceled before the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. That’s pretty striking, especially in view of the expectation that autonomous agents would finally herald AI’s coming-of-age. And yet, this should not really surprise anyone who has seen the undeniable limitations these agents exhibit in the real world. Most people assume the underlying issue to be related to the quality of the models themselves. Although this might seem plausible, it is a little off the mark.
Why the Web Resists Agents
Consider what a capable agent actually needs. Accessing a website and getting a response is just the start; it then has to translate that response into something usable. Not only that, it has to do it consistently, in real time, and at a scale that makes the whole exercise worthwhile to begin with.
Given the web’s current shape, this is a daunting task. Just take online platforms as an example. There is no technical reason why an independent agent could not compare different platforms and make the choice that best suits users’ preferences. However, those same platforms currently depend on that information not being readily available. To maintain their advantage, they work on increasingly personalized results, sponsored placements, and urgency cues to shape user behavior and tip the scales in their favor. Without access to pertinent data, no AI agent will ever be able to complete tasks on the web or automate selecting the best option for its users.
The result is a web that works reasonably well for general browsing but systematically discourages automated access. Some early findings illustrate this clearly. Oxylabs is about to release the Web Openness Index, which scores over 120 countries on various aspects of web accessibility. The findings show:
- The global average score for practical reachability — essentially, how well a site responds to standard automated HTTP requests — stands at 83.4 out of 100.
- The score for anti-automation friction, which covers CAPTCHAs, rate limiting, fingerprinting, and bot detection, averages 62.8 (a lower score means more friction).
- And structured data interoperability — whether sites return data in formats that machines can actually work with — drops even further to 60.3.
Those 20-plus-point gaps are structural. Sites generally respond to automated requests, but restrictions abound, and data is often returned in machine-unfriendly formats. Agents that depend on reliable, timely, structured information will often fall into that gap.
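The size of that gap follows directly from the three averages quoted above. The short Python sketch below only redoes that subtraction; the dictionary keys are illustrative labels, not the index's official field names.

```python
# Average scores quoted in the list above (each out of 100).
# The key names are illustrative labels, not official index fields.
scores = {
    "practical_reachability": 83.4,
    "anti_automation_friction": 62.8,
    "structured_data_interoperability": 60.3,
}

baseline = scores["practical_reachability"]

# Distance between reachability and each of the other two dimensions.
gaps = {
    name: round(baseline - value, 1)
    for name, value in scores.items()
    if name != "practical_reachability"
}

for name, gap in gaps.items():
    print(f"{name}: {gap} points below reachability")
```

Both differences come out above 20 points, which is the gap the paragraph refers to.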
Data-Starved AI
Within organizations, agents face a different but related problem: a lack of usable data. In other words, the relevant data exists but has not been cleaned, tagged, or structured in a way that an AI system can understand. The same applies to customer-facing applications built on agentic systems. Without real-time web data (current prices, live inventory, policy updates, market movements), they have no choice but to reason from a frozen version of the world.
Latency is another problem. Put simply, an agent that eventually returns the right answer is far less useful than one that returns it fast enough to act on. When dealing with autonomous systems, the tolerance for delay is even lower. In each case, the constraint is the same: agents need context they can trust, and they’re not getting it — not from their own organizational data, and not from the web.
This data starvation is pervasive. Many enterprises have invested heavily in data lakes and warehouses, but the information remains siloed, inconsistent, or outdated. For example, a customer service agent trying to resolve a complaint needs access to the latest order status, return policies, and inventory levels — all of which may reside in separate systems with different data formats. Without a unified, real-time data layer, the agent cannot perform effectively. The same holds for internal agents tasked with supply chain optimization: they require live feeds from suppliers, logistics providers, and demand forecasts. When these data sources are not integrated and cleansed, the agent's decisions become unreliable.
Moreover, the issue extends to the web at large. Consider a travel agent that needs to compare flight prices, hotel availability, and local weather conditions. Each airline, hotel chain, and weather service may present data in a different structure, often behind CAPTCHAs or with rate limits. The agent must spend significant computational resources just to parse and normalize the data, reducing its speed and reliability. This is why many early agentic AI projects have failed to scale — they underestimated the effort required to obtain clean, structured, and timely data from the open web.
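To make the normalization burden concrete, here is a minimal Python sketch of what an agent has to do before it can compare two offers. Everything in it is hypothetical: the source names `airline_a` and `airline_b`, the payload shapes, and the `Offer` type are invented for illustration and do not describe any real provider's format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """Common shape that every source is mapped into."""
    source: str
    price: float
    currency: str


def normalize_airline_a(payload: dict) -> Offer:
    # Hypothetical shape: {"fare": {"amount": "129.90", "ccy": "EUR"}}
    fare = payload["fare"]
    return Offer("airline_a", float(fare["amount"]), fare["ccy"].upper())


def normalize_airline_b(payload: dict) -> Offer:
    # Hypothetical shape: {"price_cents": 11950, "currency": "eur"}
    return Offer("airline_b", payload["price_cents"] / 100, payload["currency"].upper())


# One normalizer per source; each new site needs its own entry.
NORMALIZERS = {"airline_a": normalize_airline_a, "airline_b": normalize_airline_b}


def cheapest(raw: dict) -> Offer:
    """Normalize every payload, then pick the lowest price.

    Assumes all offers share one currency; a real system would convert first.
    """
    offers = [NORMALIZERS[name](payload) for name, payload in raw.items()]
    return min(offers, key=lambda offer: offer.price)


raw_responses = {
    "airline_a": {"fare": {"amount": "129.90", "ccy": "EUR"}},
    "airline_b": {"price_cents": 11950, "currency": "eur"},
}
print(cheapest(raw_responses))
```

Even in this toy case, every new source needs its own mapping function, which is the maintenance cost the paragraph describes.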
Solving a Problem That’s Been Solved Before
It is easy to forget, but this is not the first time the sheer volume of information has outstripped our capacity to process it. The early web is particularly instructive here. It already held a vast amount of knowledge, but that knowledge was not useful in its raw state. What made the difference back then was infrastructure built for scale. Web crawlers were deployed to index pages, scrapers were used to compare prices online, and monitoring systems were put in place to track fraudulent ads and brand impersonation across thousands of domains. All of these innovations required the ability to collect public web data reliably and at scale.
A more recent example comes from Debunk.org, a partner of the pro bono Project 4β. This non-profit, which fights online disinformation and fraud, conducted an investigation that uncovered a large-scale, multilingual scam operation targeting former fraud victims. The investigation identified over 50,000 ads, 459 domains, and more than 1,100 related web pages, with an estimated reach of 52 million people across Europe. That kind of coverage requires systematic, automated data collection at scale.
Agentic AI needs an infrastructure of the same kind, except with even higher demands, because agents do more with data than any previous application. They need information that is structured, current, complete, and returned fast enough to support real-time action. The historical precedent shows that when proper data pipelines and scraping infrastructure are built, remarkable outcomes are possible. However, many organizations still treat data collection as an afterthought, relying on manual processes or fragile scripts. To succeed with agentic AI, they must invest in robust, scalable data extraction and normalization tools.
Another lesson from the web's history is the importance of standards. The spread of web APIs (Application Programming Interfaces) through the 2000s and 2010s provided a structured way for systems to communicate. Yet today, many websites still lack comprehensive APIs, forcing agents to resort to screen scraping or parsing HTML, which is error-prone and slow. The industry would benefit from a push towards open, standardized data formats and APIs that are agent-friendly. Until that happens, third‑party data aggregation services and proxy networks will remain essential for agents to function at scale.
The Three Cs of Reliable Agent Infrastructure
As noted above, all of this is unlikely to happen organically. For platforms, opening up to frictionless automated access means ceding control over discovery, ranking, and customer relationships. While this is beneficial for the consumer and invites reshaping business models accordingly, it is also a threat to short-term revenue. The infrastructure that makes agentic systems work reliably has to be built independently. Three requirements, or three Cs, stand out:
Consistency: agents that encounter unreliable data sources produce unreliable behavior, and unreliable behavior is the fastest route to project cancellations. Consistency means that the data an agent receives must be predictable in format, availability, and timeliness. This requires not only robust scraping infrastructure but also fallback mechanisms when a primary data source fails. For example, a price comparison agent should be able to switch to a secondary source if the main source rate-limits or goes offline. Caching strategies, redundant data providers, and health monitoring are all critical to ensure consistency.
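The fallback idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `SourceUnavailable` and `fetch_with_fallback` are hypothetical names, and the two fetchers stand in for real calls to a primary and a secondary price source.

```python
from typing import Callable, Optional, Sequence


class SourceUnavailable(Exception):
    """Raised by a fetcher when its source is rate-limited or offline."""


def fetch_with_fallback(
    fetchers: Sequence[Callable[[], float]],
    cached: Optional[float] = None,
) -> float:
    """Try each source in priority order, then fall back to a cached value."""
    for fetch in fetchers:
        try:
            return fetch()
        except SourceUnavailable:
            continue  # this source failed; move on to the next one
    if cached is not None:
        return cached
    raise SourceUnavailable("all sources failed and no cached value exists")


def primary_source() -> float:
    # Stand-in for a source that is currently rate-limiting the agent.
    raise SourceUnavailable("primary source returned HTTP 429")


def secondary_source() -> float:
    # Stand-in for a healthy secondary source.
    return 19.99


price = fetch_with_fallback([primary_source, secondary_source], cached=21.50)
print(price)
```

The cached value is deliberately the last resort: it keeps the agent responsive, but it trades currency for availability, so a real system would also record how old that value is.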
Currency: real-time access to prices, inventory, availability, and policy is what separates an agent reasoning based on current facts from one reasoning by reference to stale assumptions — in most commercial contexts, the latter creates more problems than it solves. Currency demands low‑latency data pipelines. Even a few minutes of delay can cause an agent to book a flight at an outdated price or recommend a product that is out of stock. Technologies like streaming data, webhooks, and incremental updates are necessary to keep the agent’s knowledge fresh. In financial services, where seconds matter, agents must operate on sub‑second data feeds.
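One simple way to enforce currency is to attach a timestamp to every value and refuse to act on anything older than a set limit. The Python sketch below assumes a hypothetical `is_fresh` helper and an arbitrary two-minute limit; the right threshold depends entirely on the use case.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fresh(
    fetched_at: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if the value was fetched no longer than max_age ago."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current - fetched_at <= max_age


# Illustrative limit only; a trading agent would need a far tighter one.
MAX_PRICE_AGE = timedelta(minutes=2)

now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
recent_quote = now - timedelta(seconds=30)
old_quote = now - timedelta(minutes=10)

print(is_fresh(recent_quote, MAX_PRICE_AGE, now=now))
print(is_fresh(old_quote, MAX_PRICE_AGE, now=now))
```

An agent that gets a stale result here should refetch before acting, rather than book or recommend on the old value.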
Compliance: access built outside fair standards tends to provoke countermeasures that raise barriers for all automated systems, so any infrastructure worth building has to be sustainable, not just technically but in practice. Compliance involves respecting robots.txt, avoiding excessive request rates, and adhering to terms of service. It also means ensuring that data collection does not infringe on copyright or data privacy regulations like GDPR. Building compliant infrastructure is not just a legal necessity; it also reduces the risk of being blocked. Proxy rotation, ethical scraping practices, and clear data usage policies are essential components.
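Respecting robots.txt is the easiest part of this to show in code. The Python sketch below uses the standard library's `urllib.robotparser` on an inline robots.txt so that it runs without network access. The rules, the `shop.example` URLs, and the `example-agent` user-agent string are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real agent would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 2
"""

AGENT = "example-agent"  # hypothetical user-agent string


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = build_policy(ROBOTS_TXT)

# Check each URL against the policy before requesting it.
print(policy.can_fetch(AGENT, "https://shop.example/products/42"))
print(policy.can_fetch(AGENT, "https://shop.example/checkout/cart"))

# Use the site's requested delay if it states one, else a conservative default.
delay_seconds = policy.crawl_delay(AGENT) or 1.0
print(delay_seconds)
```

This covers only the robots.txt check and a request delay. Terms of service, copyright, and privacy obligations such as GDPR need review that no parser can automate.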
These three Cs are interdependent. An agent that is consistent but not current will make decisions on outdated data; one that is current but not consistent may produce erratic results; and one that ignores compliance will eventually lose access to its data sources. Therefore, any serious agentic AI deployment must address all three simultaneously.
Organizations that succeed with agentic AI will treat data infrastructure as a core asset, not an afterthought. They will invest in dedicated teams to build and maintain data pipelines, negotiate data access agreements, and develop monitoring tools. They will also work with specialized vendors that provide reliable web data extraction services, offering pre‑built connectors to hundreds of sites, automated CAPTCHA solving, and proxy management. This is already happening in domains like e‑commerce, travel, and finance, where real‑time data is a competitive necessity.
Looking ahead, wider adoption of structured-data standards like schema.org and of headless commerce architectures may eventually make the web more agent‑friendly. But until then, the onus is on developers and businesses to build the infrastructure that bridges the gap. The web was not designed for agents. Within organizations, the context agents need is often not easily accessible to them or even readily available. These are data quality problems that can be solved and infrastructure problems that we are actively solving. Finally, what we as a society truly need is to decide if we are ready to welcome AI agents or if we want to keep holding them back.