How Criminal Data Threatens Trust in AI
By Chris Kubecka, author of The Drone Wars OSINT Field Guide to Russian Drone Footage & Verification and How to Hack a Modern Dictatorship with AI: The Digital CIA/OSS Sabotage Book, and espionage target.
Originally pitched to China Daily
In the borderlands between Myanmar and Thailand, roughly thirty thousand people, some of them Chinese, are enslaved by human-trafficking gangs in compounds run by criminal syndicates. They are forced to spend their days building fake online companies, romance profiles, and customer-service chats that power digital scams. What few realize is that their coerced words are now creeping into artificial-intelligence systems trusted by billions.
These operations have discovered a new kind of extraction economy. Using generative-AI tools, they spin up thousands of slick corporate websites that look indistinguishable from legitimate firms. Each page contains registration numbers, executive biographies, employee portraits, testimonials, and even press releases, all synthetically generated. Search-engine optimization ensures that these sites appear authentic. Once indexed, they are scraped into the training datasets used to refine large language models.
The contamination works quietly. A model such as DeepSeek, or any locally trained large-language-model derivative, encounters these pages and records them as genuine business entities. The poison is subtle but powerful: when a user later asks whether a company or job offer is legitimate, the model may confidently reply that it is. What should have been a safeguard becomes a deception mechanism.
I first warned of this vector in a presentation at the University of Oxford’s “Cybersecurity in AI” conference in 2019, which covered protecting against espionage, the skewing of tags and categorization, data manipulation, securing data in cloud computing, and auditing algorithms (DOI: 10.13140/RG.2.2.16142.38722). I argued then that data itself would become the new attack surface for artificial intelligence. Six years later, the prediction has materialized.
This poisoning does not require control over vast amounts of information. Studies show that a few hundred false samples, around 250 data points, can reliably distort an LLM’s perception of truth. That scale matches the way these criminal websites operate: each is packed with a few hundred fabricated testimonials, staff profiles, or investment tips. Together they form a lattice of artificial legitimacy that machines cannot easily distinguish from reality.
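To put that scale in perspective, here is a back-of-the-envelope illustration in Python. The corpus size and the example company name are assumptions chosen only to show the order of magnitude, not measurements from any particular model.

```python
# Back-of-the-envelope illustration (assumed figures, not measurements):
# how small a slice of a web-scale training corpus ~250 poisoned documents occupy.

poisoned_docs = 250               # the scale reported in poisoning studies
corpus_docs = 3_000_000_000       # hypothetical document count for a web-scale corpus

fraction = poisoned_docs / corpus_docs
print(f"Poisoned share of the corpus: {fraction:.8%}")  # roughly 0.00000833%

# Volume-based filtering cannot catch this: the poison is a rounding error in the
# statistics of the dataset, yet every one of those documents repeats the same
# fabricated claim about the same fake company.
```

The footprint is so small that it only becomes visible when a user asks the model about the one company the poison was written to legitimize.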
This is not a problem of the “dark web.” It thrives on visibility. Every fake company has a phone number, a building photograph, and polished Chinese- and English-language copy. The very metrics used to rank credibility online (links, traffic, freshness) become weapons for deceit. When an AI model absorbs that architecture of fraud, it reproduces it in natural-language form, giving false reassurance to the next potential victim.
Safeguarding against this requires more than new filters. It demands a cultural shift inside the AI community: treating training data as critical infrastructure. Data provenance, traceability, and auditability should be as fundamental as encryption or access control. Researchers must be able to trace where a model learned a claim and how that knowledge might have been manipulated.
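To make provenance concrete, below is a minimal sketch in Python of a per-document provenance record, attached at scrape time and carried alongside the text into any downstream dataset. The field names, the crawler label, the example URL, and the fake company text are illustrative assumptions of mine, not an existing industry standard.

```python
# Minimal sketch of per-document provenance metadata (illustrative, not a standard).
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(source_url: str, raw_text: str, collector: str) -> dict:
    """Build a traceable origin record for one scraped training document."""
    return {
        "source_url": source_url,                        # where the text came from
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "content_sha256": hashlib.sha256(raw_text.encode("utf-8")).hexdigest(),
        "collected_by": collector,                       # crawler or pipeline identity
        "verification_status": "unreviewed",             # to be updated by later audits
    }

record = provenance_record(
    "https://fake-company.example/about",                # hypothetical URL
    "Acme Global Trading Ltd. is a fully licensed employer hiring remote staff...",
    "crawler-v2",
)
print(json.dumps(record, indent=2))
```

Nothing in such a record proves a page is honest; its value is that auditors can later trace a suspicious claim back to the exact source, date, and pipeline that introduced it.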
Transparency, however, collides with commercial secrecy. Model builders guard their corpora as trade secrets, while data brokers profit from selling aggregated “clean” datasets that often hide their origins. Without an industry standard for source verification, bad data moves freely between research centres, start-ups, and cloud vendors. Once a poisoned dataset is folded into a foundation model, the infection becomes, for practical purposes, permanent.
The consequences extend beyond consumer scams. Information operations can exploit the same technique to distort public understanding or discredit competitors. Poisoned corpora can make an AI system appear biased, untrustworthy, or even supportive of extremist narratives. It is a slow-motion form of cognitive warfare: plausible, scalable, and difficult to reverse.
Protecting the integrity of machine learning therefore means protecting the integrity of the humans behind it. That starts with recognizing forced digital labour for what it is: not just a humanitarian crisis but a serious cybersecurity issue as well. The people coerced into writing scam scripts are, unwillingly, the first layer of training data for models that shape our perception of truth.
AI should illuminate deception, not inherit it. The next generation of standards must embed that principle. Without transparency in data collection, algorithmic auditing, and genuine accountability, we risk building intelligence systems that echo the language of exploitation.
The smartest machines we have ever built are only as honest as the data we feed them. Right now, too much of that data is written by criminal networks.
China Daily source: https://global.chinadaily.com.cn/a/202501/10/WS6780c2b6a310f1265a1da236.html
Oxford citation URL: https://www.researchgate.net/publication/341617134_Chris_Kubecka_SecEvangelism
If you want to support further articles and research, consider becoming a paid subscriber or buying one of my books. :-)
📌 More on Me • Chris Kubecka — Wikipedia
#CyberSecurity #CyberCrime #China #NationStateThreats #Hacking #OSINT #Myanmar #TheHacktress #AI
Chris Kubecka is the founder and CEO of Hypasec NL, an esteemed cyberwarfare expert, an advisor to numerous governments and UN groups, and a freelance journalist. She is the former Aramco Head of the Information Protection Group and Joint Intelligence Group, a former Distinguished Chair of the Middle East Institute, and a veteran USAF aviator with U.S. Space Command service. She specializes in critical infrastructure security and unconventional digital threats and risks. When not getting recruited by dodgy nation-states or embroiled in cyber espionage, she hacks dictatorships & drones (affiliate link to my books) and drinks espresso.
@SecEvangelism on Instagram, X, BlueSky, LinkedIn, and Substack

