Beyond Model Collapse: Understanding 'Retrieval Collapse' as AI Pollutes the Web

What is Retrieval Collapse? How AI-Generated Content is Silently Corrupting Search and RAG Ecosystems
As Large Language Models (LLMs) flood the internet with synthetic content, the composition of the web itself is changing. Machine learning researchers have long warned about Model Collapse, the degradation that occurs when a model is trained on its own output. A different, ecosystem-level failure mode has now emerged: Retrieval Collapse.
Presented at the ACM Web Conference (WWW 2026), our latest paper, "Retrieval Collapses When AI Pollutes the Web," formalizes and quantifies this structural risk to information retrieval, search engines, and Retrieval-Augmented Generation (RAG) systems.
- Full Paper (arXiv): https://arxiv.org/abs/2602.16136v1
- ACM Digital Library: https://dl.acm.org/doi/10.1145/3774904.3792955
- Official Code Repository: https://github.com/dongchankim-io/retrieval-collapse
Defining the Concept: Model Collapse vs. Retrieval Collapse
To protect future AI systems, we must understand how retrieval-time degradation differs from training-time degradation.
| Dimension | Model Collapse | Retrieval Collapse |
|---|---|---|
| Primary Mechanism | Occurs at training time when an LLM iteratively trains on synthetic data, causing statistical distribution shifts. | Occurs at retrieval time when RAG or search systems fetch homogenized or malicious synthetic content. |
| System Impact | The model progressively loses its ability to generate diverse or rare tokens and forgets the true data distribution. | The retrieval pipeline quietly shifts toward synthetic evidence, destroying source diversity and provenance. |
| Surface Symptom | Output becomes gibberish or heavily repetitive over generations. | Outputs appear deceptively healthy and highly fluent, masking the underlying systemic corruption. |
The Two Stages of Retrieval Collapse
Our framework defines Retrieval Collapse as a progressive, two-stage degradation process:
1. Dominance and Homogenization
High-quality, SEO-optimized synthetic content captures the top search engine rankings. Because these pages are grammatically flawless and topically aligned, standard retrieval metrics still report "healthy" accuracy. Source diversity, however, collapses, cutting downstream RAG models off from authentic human-generated evidence.
2. Pollution and System Corruption
Once synthetic content dominates the retrieval pool, the pipeline's defenses weaken. Malicious actors can exploit ranking algorithms to systematically inject low-quality, biased, or adversarial content. This directly compromises the factual integrity of any system depending on Web-grounded retrieval.
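The shift described in these two stages can be tracked with simple retrieval-time metrics. The following is a minimal sketch, not the paper's evaluation code. It assumes each retrieved result is a dict with hypothetical `source` and `synthetic` fields (the labels would have to come from your own provenance annotations), and it uses Shannon entropy over sources as one possible stand-in for source diversity.

```python
import math
from collections import Counter


def synthetic_share(results):
    """Fraction of retrieved results that are labeled synthetic."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["synthetic"]) / len(results)


def source_entropy(results):
    """Shannon entropy (in bits) of the source distribution of a result list.

    Higher values mean the results are spread across more sources;
    0.0 means every result comes from the same source.
    """
    counts = Counter(r["source"] for r in results)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical top-4 result lists (invented for illustration only).
diverse_top_k = [
    {"source": "university.example", "synthetic": False},
    {"source": "newspaper.example", "synthetic": False},
    {"source": "forum.example", "synthetic": False},
    {"source": "agency.example", "synthetic": False},
]
homogenized_top_k = [
    {"source": "contentfarm.example", "synthetic": True},
    {"source": "contentfarm.example", "synthetic": True},
    {"source": "contentfarm.example", "synthetic": True},
    {"source": "newspaper.example", "synthetic": False},
]

for name, results in [("diverse", diverse_top_k), ("homogenized", homogenized_top_k)]:
    print(name, synthetic_share(results), round(source_entropy(results), 3))
```

Logging both numbers alongside answer accuracy is one way to notice homogenization while accuracy still looks stable.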
Key Empirical Findings
In controlled experiments with standard BM25 and dense retrieval setups, under both SEO-optimized and adversarial scenarios, we found striking vulnerabilities in modern search architectures:
- The SEO Amplification Effect: When 67% of the candidate pool is synthetic, synthetic content accounts for over 80% of exposure in search ranking settings. Current search mechanics systematically over-represent high-quality synthetic content.
- Adversarial Vulnerability: Under direct adversarial optimization attacks, standard lexical matching techniques like BM25 show a ~19% structural vulnerability to malicious injection.
- The Deceptive Health Trap: Surface-level answer accuracy often remains stable due to the fluency of LLM outputs, meaning current evaluation setups will completely miss a Retrieval Collapse until it is too late.
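The gap between pool contamination and ranking exposure in the first finding can be made concrete with a short calculation. This is a sketch under stated assumptions: the `score` values and `synthetic` labels below are invented toy data, not results from the paper, and `contamination` and `exposure_at_k` are hypothetical helper names.

```python
def contamination(pool):
    """Share of synthetic documents in the whole candidate pool."""
    return sum(1 for d in pool if d["synthetic"]) / len(pool)


def exposure_at_k(pool, k):
    """Share of synthetic documents among the k highest-scoring documents."""
    ranked = sorted(pool, key=lambda d: d["score"], reverse=True)
    top_k = ranked[:k]
    return sum(1 for d in top_k if d["synthetic"]) / len(top_k)


# Toy pool: half of the documents are synthetic; scores are made up so that
# synthetic pages tend to rank slightly higher than human-written ones.
pool = [
    {"id": "s1", "synthetic": True, "score": 9.1},
    {"id": "s2", "synthetic": True, "score": 8.7},
    {"id": "h1", "synthetic": False, "score": 8.2},
    {"id": "s3", "synthetic": True, "score": 7.9},
    {"id": "h2", "synthetic": False, "score": 6.5},
    {"id": "h3", "synthetic": False, "score": 6.1},
    {"id": "s4", "synthetic": True, "score": 5.8},
    {"id": "h4", "synthetic": False, "score": 5.0},
]

pool_rate = contamination(pool)        # 4 of 8 documents are synthetic
top_rate = exposure_at_k(pool, k=4)    # 3 of the top 4 documents are synthetic
amplification = top_rate / pool_rate   # values above 1 indicate amplification

print(f"pool={pool_rate:.2f} top-4={top_rate:.2f} amplification={amplification:.2f}")
```

An amplification ratio above 1 means the ranker exposes more synthetic content than the pool composition alone would predict.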
Moving Toward Defensive Ranking
The findings in our paper underscore an urgent necessity: search engines and RAG pipelines can no longer optimize purely for topical relevance.
To mitigate the self-reinforcing loop of quality decline, the AI community must pivot toward Defensive Ranking strategies. Future retrieval models must jointly optimize for three core pillars:
- Topical Relevance (Is the information useful?)
- Factuality (Is the content grounded in reality?)
- Provenance & Source Diversity (Where did this information originate, and is it human-vetted?)
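One way the three pillars could be combined is sketched below. It is a deliberately simple stand-in for joint optimization (a weighted sum followed by a per-source cap), not the method from the paper. The `Candidate` fields are assumed to be scores in [0, 1] produced by upstream components, and the weights and `max_per_source` value are arbitrary illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    source: str        # e.g. the publishing domain (hypothetical field)
    relevance: float   # topical relevance score in [0, 1]
    factuality: float  # output of some fact-checking signal in [0, 1]
    provenance: float  # confidence that the content is human-vetted, in [0, 1]


def defensive_rerank(candidates, weights=(0.5, 0.3, 0.2), max_per_source=1):
    """Re-rank by a weighted sum of the three pillars, then cap each source.

    The cap enforces source diversity: once a source has contributed
    `max_per_source` results, its remaining candidates are dropped.
    """
    w_rel, w_fact, w_prov = weights

    def combined(c):
        return w_rel * c.relevance + w_fact * c.factuality + w_prov * c.provenance

    kept, per_source = [], {}
    for c in sorted(candidates, key=combined, reverse=True):
        if per_source.get(c.source, 0) < max_per_source:
            kept.append(c)
            per_source[c.source] = per_source.get(c.source, 0) + 1
    return kept


# Invented candidates: two highly "relevant" pages from one content farm
# and one well-sourced page from a different origin.
candidates = [
    Candidate("a", "contentfarm.example", relevance=0.90, factuality=0.2, provenance=0.1),
    Candidate("b", "contentfarm.example", relevance=0.85, factuality=0.2, provenance=0.1),
    Candidate("c", "journal.example", relevance=0.70, factuality=0.9, provenance=0.9),
]

reranked_ids = [c.doc_id for c in defensive_rerank(candidates)]
print(reranked_ids)
```

Ranking on relevance alone would place both content-farm pages first; with the combined score and the source cap, the well-sourced page moves to the top and the duplicate source is dropped.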
For full implementation details, datasets, and scripts to test your own retrieval pipelines against synthetic pollution, see the official code repository: https://github.com/dongchankim-io/retrieval-collapse
How to Cite This Work
If you are researching RAG vulnerabilities, data poisoning, web data provenance, or LLM-driven SEO dynamics, please cite our WWW 2026 paper:
@inproceedings{yu2026retrieval,
author = {Yu, H. and Kim, Dongchan and Kim, Young-Bum},
title = {Retrieval Collapses When AI Pollutes the Web},
booktitle = {Proceedings of the ACM Web Conference 2026 (WWW '26)},
pages = {8745--8748},
year = {2026},
publisher = {ACM},
doi = {10.1145/3774904.3792955},
url = {https://arxiv.org/abs/2602.16136}
}