Why This Guide Exists
In 2025, investors poured $110 billion into private AI companies globally, more than doubling the previous year’s total, according to the Stanford AI Index Report 2025[1]. More than 80% of those AI projects will fail before reaching production, according to the RAND Corporation[2], twice the failure rate of comparable non-AI technology projects. S&P Global[3] found that the percentage of enterprises abandoning the majority of their AI initiatives surged from 17% to 42% between 2024 and 2025.
Builder.ai burned through $445 million before filing for bankruptcy[4] after audits revealed its AI was a facade. The shopping app nate raised $42 million on claims of AI automation that, according to the SEC[5], amounted to zero percent. The SEC and DOJ have now charged its founder with securities fraud and wire fraud.
These failures share a root cause. Standard financial and legal diligence evaluates revenue, contracts, intellectual property, and corporate structure. These evaluations remain necessary, but they cannot determine whether the AI actually works, whether it will continue working at scale, or whether its known technical limitations create downstream liability. The evaluation frameworks investors rely on were built for a different category of software, and the gap between how capital enters the AI market and how rigorously it evaluates what it funds continues to widen.
This guide introduces the core technical concepts that investors need to evaluate AI companies, illustrated through documented cases where inadequate technical diligence produced measurable losses. Each section concludes with a set of questions designed to surface the risks that financial diligence alone cannot capture.
How AI Differs from Traditional Software
Most software investors are familiar with SaaS, or Software as a Service, a model where cloud-hosted applications are sold on a subscription basis. SaaS products follow a well-understood repair cycle. If a customer reports a bug, an engineering team can diagnose the issue, write a fix, and deploy a patch within hours or days. The cost is low and the turnaround is predictable.
AI systems built on machine learning operate on fundamentally different economics. When an AI model produces incorrect, biased, or harmful outputs, the fix is rarely a simple code change. The team must first identify the root cause, which requires transparency and explainability tools to understand what the model is doing when it produces the undesired result. Once the cause is identified, the team must determine the appropriate correction: adjusting the training data, modifying model weights, or adding output filters. If retraining is required, the process involves cleaning or augmenting the training dataset, retraining the model from scratch or near-scratch, and then evaluating the retrained model across every use case to confirm the fix did not introduce new failures.
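To make that retrain-and-revalidate step concrete, the sketch below shows one way a team might gate a retrained model behind a regression check across its existing evaluation suites. It is a minimal illustration, not any vendor’s actual process; the function names, test suites, and threshold are hypothetical.

```python
# Minimal sketch of a post-retraining regression gate (hypothetical names and thresholds).
# A retrained model is promoted only if it does not degrade accuracy on any
# previously passing evaluation suite.

def accuracy(model, suite):
    """Fraction of test cases in a suite the model answers correctly."""
    correct = sum(1 for case in suite if model(case["input"]) == case["expected"])
    return correct / len(suite)

def should_promote(old_model, new_model, suites, max_regression=0.01):
    """Compare the retrained model against the current one across every suite."""
    for name, suite in suites.items():
        old_score = accuracy(old_model, suite)
        new_score = accuracy(new_model, suite)
        if new_score < old_score - max_regression:
            print(f"Regression on '{name}': {old_score:.2%} -> {new_score:.2%}")
            return False
    return True

# Toy usage: the "models" are stand-ins for real inference calls.
suites = {
    "billing_questions": [{"input": "refund window?", "expected": "30 days"}],
    "safety_filters": [{"input": "tip withholding legal?", "expected": "no"}],
}
def old(text): return "30 days" if "refund" in text else "no"
def new(text): return "30 days" if "refund" in text else "no"
print(should_promote(old, new, suites))
```

The point of the sketch is scope: a SaaS patch is verified against one bug, while a retrained model must be re-verified against every use case it already served.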
This process is expensive. The Epoch AI research group[6] documented that the cost of training frontier AI models has grown at 2.4x per year since 2016. Google spent an estimated $191 million training Gemini Ultra. OpenAI spent approximately $78 million training GPT-4. Meta spent around $170 million on Llama 3.1 405B. Ongoing maintenance, retraining, and infrastructure scaling[7] account for 20% to 30% of total cost of ownership over a three-year period.
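As a back-of-envelope illustration of how those figures compound, the sketch below projects a training budget forward at the 2.4x annual growth rate Epoch AI documents and folds in the 20% to 30% maintenance share of three-year total cost of ownership. The starting cost, horizon, and the way the two figures are combined are placeholder assumptions for illustration.

```python
# Back-of-envelope projection using the figures cited above (assumptions, not forecasts).
# Starting point: a $78M training run (GPT-4-scale); growth: Epoch AI's 2.4x per year;
# maintenance, retraining, and scaling: 20-30% of three-year total cost of ownership.

base_training_cost = 78e6   # dollars, placeholder starting point
growth_per_year = 2.4       # Epoch AI's documented growth rate

for year in range(0, 4):
    projected = base_training_cost * growth_per_year ** year
    print(f"Year {year}: projected frontier training run ~${projected / 1e6:,.0f}M")

# If maintenance is 20-30% of three-year TCO, then TCO = training_cost / (1 - share).
for share in (0.20, 0.30):
    tco = base_training_cost / (1 - share)
    print(f"Maintenance at {share:.0%}: three-year TCO ~${tco / 1e6:,.0f}M "
          f"(~${(tco - base_training_cost) / 1e6:,.0f}M beyond the initial run)")
```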
Elon Musk provided a high-profile example of what happens when these costs compound. In March 2026, after ten of twelve original co-founders departed his AI company xAI, Musk acknowledged the platform was fundamentally flawed and would require a complete rebuild, as reported by CNBC[8] and TechCrunch[9]. Training Grok 3 required 200,000 GPUs, the specialized processors that power AI computations. The rebuild timeline extends to mid-2026. Even with access to virtually unlimited capital and compute, the timeline for correcting foundational AI errors cannot be compressed the way a SaaS hotfix can.
Amazon provided another instructive example. The company spent years developing an AI-powered recruiting tool to automate resume screening, only to discover it had learned to systematically discriminate against women. The system, trained on a decade of hiring data from a male-dominated industry, penalized resumes containing the word ‘women’s’ and downgraded graduates of all-women’s colleges. Amazon attempted to correct the bias but ultimately scrapped the project entirely, according to MIT Technology Review[10], after concluding the system could not be made reliably fair.
For investors, this means the cost of getting AI wrong is structurally different from the cost of getting traditional software wrong. A SaaS product with a bug loses some customers temporarily. An AI model with a foundational flaw may require a complete teardown, and the capital required to rebuild may exceed remaining runway.
Questions for Evaluating Technical Foundation
- What is the estimated cost of a full model retraining cycle, and how does that figure compare to the company’s current runway?
- How frequently has the team retrained or significantly updated its core model since initial deployment?
- What percentage of the engineering team’s time is spent on model maintenance versus new feature development?
When the AI Is a Facade
The most direct form of AI investment risk involves companies that claim to use artificial intelligence while relying on human labor to perform the core function. Industry observers sometimes call these “AI facades”: systems where the appearance of automation masks manual operations.
Builder.ai, a London-based startup valued at $1.5 billion with Microsoft among its backers, marketed itself as an AI-powered app development platform. The company filed for bankruptcy in May 2025 after audits revealed that nearly 700 engineers in India were manually coding customer projects behind the scenes, as reported by The Register[11] and Rest of World[12]. Builder.ai had reported $220 million in 2024 revenue. Actual revenue was $55 million. The Wall Street Journal[13] had flagged discrepancies as early as 2019, yet subsequent funding rounds proceeded.
Amazon’s Just Walk Out technology, marketed as a fully AI-powered cashierless shopping experience, relied on approximately 1,000 contractors in India who manually reviewed roughly 70% of all transactions, according to The Verge[14]. Amazon withdrew the technology from its Fresh grocery stores in 2024 after these operational details became public.
Albert Saniger, founder of the shopping app nate, raised over $42 million from investors by claiming his platform completed e-commerce purchases through proprietary AI. According to the SEC’s complaint[5], the app had an automation rate of effectively zero percent. Hundreds of contractors in the Philippines manually processed every transaction. During demonstrations for investors, employees manually completed purchases to simulate automated functionality. In April 2025, the SEC and DOJ jointly charged Saniger with securities fraud and wire fraud, each carrying a maximum sentence of 20 years, as reported by Fortune[15].
These cases share a pattern. Standard financial diligence examined revenue, contracts, and growth metrics. Legal diligence reviewed corporate structure and intellectual property filings. In each case, the evaluation process accepted the company’s characterization of its own technology without independent verification. A technical review of the system architecture, the codebase, or the operational workflow would have surfaced the deception before investors deployed capital.
The SEC has begun treating AI misrepresentation as a priority enforcement area. In March 2024, the SEC charged Delphia (USA) Inc. and Global Predictions Inc.[16] for making false and misleading statements about their use of AI. In January 2025, Presto Automation settled similar charges[17] after its AI drive-through product required human intervention for the majority of orders. In February 2025, the SEC established a dedicated Cybersecurity and Emerging Technologies Unit focused on AI-related misconduct.
Questions for Verifying AI Authenticity
- Can the team provide a live demonstration where the AI system processes entirely novel inputs without advance preparation?
- At what points in the product workflow, if any, do human operators review, correct, or complete tasks initiated by the AI?
- Has an independent third party reviewed the system architecture to confirm the core functionality is automated?
When the AI Exceeds Its Limits
The second category of risk involves companies that genuinely use AI but deploy it beyond its verified capabilities. Understanding this risk requires understanding how the most widely deployed form of AI, called a large language model (or LLM), actually works.
An LLM is a mathematical prediction engine. It takes a text prompt, converts that text into numerical representations, and generates the statistically most probable next sequence of words. LLMs produce language by predicting it, one word at a time, based on patterns absorbed from massive training datasets. Think of them as extraordinarily sophisticated autocomplete systems that can generate entire paragraphs instead of single words.
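The toy sketch below makes the “sophisticated autocomplete” point concrete: the model’s only job is to pick a likely next word given what came before, with no notion of whether the resulting sentence is true. The vocabulary and probabilities are invented for illustration and are far simpler than a real model’s.

```python
import random

# Toy next-word prediction. A real LLM computes these probabilities with a neural
# network over a vocabulary of tens of thousands of tokens; the mechanism of
# "pick a probable continuation" is the same. All values here are invented.

next_word_probs = {
    "policy": {"allows": 0.7, "covers": 0.3},
    "allows": {"retroactive": 1.0},
    "covers": {"retroactive": 1.0},
    "retroactive": {"refunds": 1.0},
    "refunds": {"within": 1.0},
    "within": {"90": 1.0},
    "90": {"days": 1.0},
    "days": {"[end]": 1.0},
}

def predict_next(context):
    """Sample the next word in proportion to its probability given the last word."""
    probs = next_word_probs.get(context[-1], {"[end]": 1.0})
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights)[0]

text = ["the", "bereavement", "policy"]
while text[-1] != "[end]" and len(text) < 12:
    text.append(predict_next(text))

print(" ".join(w for w in text if w != "[end]"))
# Output reads like fluent policy language whether or not any such policy exists.
```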
This architecture creates a structural limitation called hallucination, the technical term for when an AI system generates confident-sounding information that is entirely fabricated. Hallucination emerges from the mathematics of prediction itself. Because the model generates text based on statistical probability rather than factual understanding, it will occasionally produce sequences that sound plausible but have no basis in reality.
Peer-reviewed research has established that hallucination cannot be fully eliminated from large language models. Xu and colleagues (2024)[18] at the National University of Singapore demonstrated through formal computational theory that hallucination is an innate limitation of LLMs when applied as general-purpose problem solvers. Banerjee and colleagues (2024)[19] published a complementary finding, establishing that hallucination stems from the fundamental mathematical structure of transformers (the neural network architecture powering all modern LLMs) and persists regardless of architectural improvements, dataset enhancements, or fact-checking mechanisms.
Private Sector Cases
The practical consequences of deploying LLMs without adequate mitigation strategies have been well-documented across both the private and public sectors.
11x, an AI sales automation startup backed by Andreessen Horowitz and Benchmark, experienced customer churn rates as high as 70% to 80% within three months of onboarding during the summer of 2024, according to employees who spoke to Sifted[20]. Former customers cited hallucinated information, poor lead quality, and underperforming email functionality as reasons for cancellation. ZoomInfo, listed as a client on 11x’s website, stated publicly that the product did not meet expectations. TechCrunch[21] reported that several companies listed as 11x clients explicitly denied any business relationship, and that actual recurring revenue after trial periods may have been $3 million against a claimed $14 million.
In February 2024, the British Columbia Civil Resolution Tribunal ruled Air Canada liable after its AI chatbot fabricated a bereavement fare refund policy, as documented by the American Bar Association[22]. The chatbot told a grieving customer, Jake Moffatt, that he could apply for a discounted fare retroactively within 90 days; no such policy existed. The tribunal ordered Air Canada to pay $812.02 in damages and rejected the airline’s argument that the chatbot should be treated as a separate legal entity. The ruling established a precedent: companies bear legal responsibility for the hallucinations their AI systems produce.
Google’s rollout of AI Overviews in May 2024 demonstrated hallucination risk at consumer scale, as analyzed by MIT Technology Review[23]. Google launched the feature, powered by its Gemini model, to hundreds of millions of users. The system recommended that people eat rocks for digestive health (sourced from a satirical Onion article) and put glue on pizza to keep cheese from sliding off (sourced from an eleven-year-old Reddit joke). Google CEO Sundar Pichai acknowledged publicly that hallucinations remain an ongoing challenge.
Public Sector Cases
New York City’s MyCity chatbot, launched by Mayor Eric Adams to help business owners navigate city regulations, began advising businesses to break the law, as investigated by The Markup[24]. The chatbot told employers they could take workers’ tips (a violation of New York Labor Law Section 196-d), advised landlords they could discriminate against tenants with Section 8 vouchers, and provided incorrect minimum wage figures. The chatbot remained publicly accessible for months after these errors were documented. When New York City’s incoming administration reviewed the program, budget officials described it as inadequate for public-facing deployment.
Alaska’s court system developed a chatbot called AVA (Alaska Virtual Assistant) to help residents navigate probate proceedings, as reported by NBC News[25]. The system repeatedly hallucinated, at one point directing users to contact alumni from a law school that does not exist in Alaska. The project, originally scoped as a three-month effort, extended past fifteen months because of the persistent difficulty in preventing the model from generating fabricated information. Project leads noted that people relying on incorrect probate information could make costly legal errors.
These cases illustrate a consistent pattern. Hallucination is a known, documented, mathematically established property of the models these products rely on. When companies deploy LLM-based products without adequate mitigation strategies, the resulting harm flows to customers, to end users, and ultimately to the investors whose capital funded the deployment. Startups that claim to have eliminated hallucination are making a claim that contradicts the peer-reviewed research. Investors evaluating these companies need to assess the mitigation strategy rather than accepting elimination claims at face value.
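One common mitigation pattern, sketched below, is to refuse to surface an answer unless it can be grounded in approved source material, and to log every blocked response so the team can report a measured hallucination rate rather than an anecdote. The grounding check here is deliberately crude (keyword overlap) and the sources are invented; production systems use stronger retrieval and verification, but the architecture of check, block, and measure is the point.

```python
# Minimal sketch of a grounding guardrail with rate tracking (illustrative only).
# The word-overlap check is a stand-in for real retrieval-and-verification logic.

APPROVED_SOURCES = [
    "Bereavement fares must be requested before travel.",
    "Tips belong to employees and may not be withheld by employers.",
]

def is_grounded(answer, sources, min_overlap=0.5):
    """Crude check: enough of the answer's words must appear in one approved source."""
    answer_words = set(answer.lower().split())
    for source in sources:
        overlap = len(answer_words & set(source.lower().split())) / len(answer_words)
        if overlap >= min_overlap:
            return True
    return False

stats = {"served": 0, "blocked": 0}

def respond(model_answer):
    """Serve the answer only if grounded; otherwise block it and log the event."""
    if is_grounded(model_answer, APPROVED_SOURCES):
        stats["served"] += 1
        return model_answer
    stats["blocked"] += 1
    return "I can't confirm that. A support agent will follow up."

print(respond("Tips belong to employees and may not be withheld"))
print(respond("You may request a bereavement refund retroactively within 90 days"))
total = stats["served"] + stats["blocked"]
print(f"Blocked (candidate hallucination) rate: {stats['blocked'] / total:.0%}")
```

An investor does not need to read this code line by line; the diligence question is whether the company can produce the equivalent of that final logged rate for its own production traffic.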
Questions for Evaluating Hallucination Risk
- What is the measured hallucination rate for this product in production environments, and how is that rate tracked?
- What guardrails prevent hallucinated outputs from reaching end users?
- Does the company acknowledge hallucination as a structural property of the underlying model, or does the company claim to have eliminated it?
- What liability framework governs customer-facing deployments, and has legal counsel reviewed the terms of service for AI-generated output?
When Claims Outpace Evidence
The third category of risk involves AI companies that make extraordinary technical claims without submitting those claims to rigorous, standardized evaluation. The pitch deck describes a breakthrough. The demo is polished. The founding team projects confidence. The question is whether independent evidence supports the narrative.
In artificial intelligence, the standard mechanism for verifying performance claims is called a benchmark. A benchmark is a standardized test, developed and maintained by the research community, that measures how well an AI model performs on specific tasks under controlled conditions. Benchmarks provide the only objective, reproducible way to compare one model against another and to evaluate whether a given model performs as its developers claim.
Several benchmarks have become widely recognized across the industry. MMLU (Massive Multitask Language Understanding) tests broad knowledge across 57 academic subjects using 15,908 questions. HumanEval measures code generation ability across 164 programming challenges. HELM (Holistic Evaluation of Language Models)[26], developed by Stanford University, evaluates models across 42 real-world scenarios, measuring accuracy alongside fairness, robustness, and efficiency.
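In practice, a benchmark run reduces to a simple loop: pose each standardized question, score the model’s answer against the reference, and report accuracy. The sketch below shows that loop for MMLU-style multiple-choice items; `ask_model` is a placeholder for whatever inference call a team actually uses, and the two items are invented examples rather than real benchmark questions.

```python
# Minimal sketch of a multiple-choice benchmark evaluation (MMLU-style format).
# ask_model is a placeholder for a real inference call; items are invented examples.

def ask_model(question, choices):
    """Placeholder: a real harness would query the model and parse its chosen letter."""
    return "A"  # stand-in answer

benchmark_items = [
    {"question": "Which gas is most abundant in Earth's atmosphere?",
     "choices": {"A": "Nitrogen", "B": "Oxygen", "C": "Argon", "D": "CO2"},
     "answer": "A"},
    {"question": "What is the derivative of x^2?",
     "choices": {"A": "x", "B": "2x", "C": "x^2", "D": "2"},
     "answer": "B"},
]

correct = 0
for item in benchmark_items:
    prediction = ask_model(item["question"], item["choices"])
    correct += int(prediction == item["answer"])

accuracy = correct / len(benchmark_items)
print(f"Accuracy: {accuracy:.1%} on {len(benchmark_items)} items")
# A full MMLU run covers the 57 subjects cited above; the scoring logic is the same.
```

The mechanics are simple enough that “we haven’t benchmarked it yet” is rarely a resource constraint; it is usually a choice.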
The Stanford HAI AI Index Report 2025[1] found that standardized responsible AI evaluations remain rare among major model developers and are applied inconsistently where they exist. McKinsey’s 2024 Global AI Survey[27] reported that more than 80% of organizations saw no tangible impact on enterprise-level EBIT from their use of generative AI. These findings suggest that the gap between AI investment levels and demonstrated, measurable returns remains significant.
For investors, the implication is clear. A team claiming technical excellence should be prepared to substantiate that claim through independently verifiable benchmarks appropriate to the product’s use case. The absence of benchmark data is itself a finding.
Questions for Evaluating Technical Claims
- Has the team evaluated its model against established, third-party benchmarks appropriate to the stated use case?
- What were the specific benchmark results, and how do they compare to publicly available alternatives?
- Have any independent third parties validated these results?
- What is the model’s measured failure rate in production, as distinct from its performance in controlled testing environments?
The Model Dependency Question
A growing number of AI startups build their products on top of third-party foundation models accessed through APIs (Application Programming Interfaces, the technical connections that allow one software system to use another’s capabilities). A company may market a sophisticated AI product while the core intelligence is provided by OpenAI’s GPT, Anthropic’s Claude, or Google’s Gemini.
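The sketch below illustrates how thin such a product can be, and one architectural mitigation an investor might look for: routing every model call through a provider-agnostic interface so the foundation model can be swapped if the upstream vendor changes pricing or terms. The classes and provider names are illustrative; no real vendor SDK is being called.

```python
# Illustrative sketch of a provider-agnostic model layer (no real vendor SDKs called).
# If every feature calls a vendor API directly, the "product" is largely the vendor's
# model; an abstraction layer at least makes the dependency explicit and swappable.

from abc import ABC, abstractmethod

class ModelProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Return the model's completion for a prompt."""

class VendorAProvider(ModelProvider):
    def complete(self, prompt: str) -> str:
        # A real implementation would call the vendor's API; stubbed here.
        return f"[vendor A completion for: {prompt!r}]"

class VendorBProvider(ModelProvider):
    def complete(self, prompt: str) -> str:
        return f"[vendor B completion for: {prompt!r}]"

class ProductService:
    """All product features route through one interface, so the vendor can change."""
    def __init__(self, provider: ModelProvider):
        self.provider = provider

    def draft_sales_email(self, lead_name: str) -> str:
        return self.provider.complete(f"Write a short intro email to {lead_name}.")

service = ProductService(VendorAProvider())
print(service.draft_sales_email("Acme Corp"))
# An upstream repricing or deprecation becomes a swap rather than a rebuild:
service.provider = VendorBProvider()
print(service.draft_sales_email("Acme Corp"))
```

Diligence can ask to see where this boundary sits in the actual codebase, and how much product value remains on the startup’s side of it.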
This architecture creates a specific category of investment risk that the financial services industry has begun calling fourth-party risk, where a portfolio company’s core capability depends on a vendor’s vendor, as described by Swept AI[28] and Founder Shield[29]. When the foundation model provider updates, reprices, or discontinues its model, every downstream product built on that model is affected. AI-related securities class actions reached 12 filings in the first half of 2025 alone, on pace to surpass the 15 cases recorded in all of 2024, and many of these lawsuits targeted companies accused of inflating their AI capabilities.
The OECD Due Diligence Guidance for Responsible AI[30], published in February 2026, establishes a framework that explicitly addresses AI supply chain risk and calls on enterprises to map their full AI dependency chain. For investors, the question is whether a portfolio company has built defensible technology or has built a user interface on top of another company’s model, a distinction that materially affects valuation, competitive moat, and long-term viability.
Questions for Evaluating Model Dependency
- Does the company use proprietary models, third-party foundation models, or a combination?
- If the company relies on a third-party model, what happens to the product if that model provider changes pricing, terms of service, or discontinues the model?
- What percentage of the product’s core functionality could be replicated by a competitor using the same third-party API?
Closing
The AI investment landscape is evolving faster than the diligence frameworks used to evaluate it. The cases documented in this guide represent billions of dollars in investor losses, regulatory enforcement actions with criminal penalties, and a growing body of case law establishing corporate liability for AI-generated outputs. Adequate technical evaluation conducted before deployment would have prevented each of these outcomes.
The distinction between marketing and capability is the most consequential filter an investor can apply to the current AI market. Whether the intelligence lives in the code or in a back office, whether the system can handle real-world deployment without generating liability, and whether the team has honestly evaluated the limitations of its own technology are questions that financial models and legal reviews cannot answer. Technical due diligence is the only evaluation framework designed to surface these risks before capital is deployed. The investors who adopt it will not avoid every loss, but they will stop funding the failures that were knowable from the start.
Sources
- AI Index Report 2025[1], Stanford University Human-Centered AI Institute, 2025
- The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed[2], RAND Corporation, 2024
- AI Experiences Rapid Adoption, but with Mixed Outcomes[3], S&P Global Market Intelligence, 2025
- Builder.ai’s $450M Fall: Microsoft And QIA-Backed AI Darling Files For Bankruptcy[4], Yahoo Finance, 2025
- SEC Charges Albert Saniger Mantinan for Fraudulent AI Claims[5], U.S. Securities and Exchange Commission, 2025
- The Rising Costs of Training Frontier AI Models[6], Epoch AI, arXiv:2405.21015, 2024
- How Much Does It Cost to Train Frontier AI Models?[7], Epoch AI, 2024
- Elon Musk Says xAI Must Be Rebuilt as Co-Founder Exodus Continues[8], CNBC, 2026
- Not Built Right the First Time: Musk’s xAI Is Starting Over Again, Again[9], TechCrunch, 2026
- Amazon Ditched AI Recruitment Software Because It Was Biased Against Women[10], MIT Technology Review, 2018
- Builder.ai Coded Itself Into a Corner, Now It’s Bankrupt[11], The Register, 2025
- Builder.ai Promised AI-Built Apps. Who Really Did the Work?[12], Rest of World, 2025
- AI Startup Engineer.ai Inflated Its AI Capabilities[13], The Wall Street Journal, 2019
- Amazon’s Just Walk Out Technology Relied on 1,000 Workers in India[14], The Verge, 2024
- A Tech CEO Has Been Charged with Fraud for Saying His Startup Was Powered by AI[15], Fortune, 2025
- SEC Charges Two Investment Advisers with Misleading AI Claims[16], U.S. Securities and Exchange Commission, Press Release 2024-36, 2024
- SEC Charges Presto Automation for Misleading AI Product Statements[17], U.S. Securities and Exchange Commission, 2025
- Hallucination Is Inevitable: An Innate Limitation of Large Language Models[18], Xu, Z., Jain, S., and Kankanhalli, M., arXiv:2401.11817, 2024
- LLMs Will Always Hallucinate, and We Need to Live With This[19], Banerjee, S. et al., arXiv:2409.05746, 2024
- 11x Faces Scrutiny over Customer Churn and Toxic Office Culture[20], Sifted, 2025
- a16z- and Benchmark-Backed 11x Has Been Claiming Customers It Doesn’t Have[21], TechCrunch, 2025
- BC Tribunal Confirms Companies Remain Liable for AI Chatbot Information[22], American Bar Association, Business Law Today, 2024
- Why Google’s AI Overviews Gets Things Wrong[23], MIT Technology Review, 2024
- NYC’s AI Chatbot Tells Businesses to Break the Law[24], The Markup, 2024
- Alaska’s Court System Built an AI Chatbot. It Didn’t Go Smoothly.[25], NBC News, 2026
- Holistic Evaluation of Language Models (HELM)[26], Stanford Center for Research on Foundation Models, 2023
- The State of AI in Early 2024: Gen AI Adoption Spikes and Starts to Generate Value[27], McKinsey & Company, Global AI Survey, 2024
- AI Vendor Risk in Financial Services: How the FS AI RMF Changes Third-Party and Fourth-Party AI Oversight[28], Swept AI, Financial Services AI Risk Management Framework, 2025
- Understanding AI Investor Risk: Analyzing Recent Claims and Their Impact[29], Founder Shield, 2026
- OECD Due Diligence Guidance for Responsible AI[30], Organisation for Economic Co-operation and Development, 2026