
The Hidden Dangers of Data Augmentation

AI thrives on data. Without large, diverse, and representative datasets, even the most sophisticated models will fall short. But in many sensitive areas—like detecting harmful or abusive content—collecting real-world data is extremely difficult. Privacy concerns, ethical constraints, and the emotional toll on annotators all mean researchers often work with small or outdated datasets.

To bridge the gap, the field has embraced data augmentation. This means creating new training examples from existing ones, or generating fresh samples using large language models (LLMs) such as GPT-4. In theory, augmentation is the perfect solution: it scales quickly, protects people from exposure to harmful content, and allows researchers to generate almost limitless data.
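As a rough sketch of the LLM-based approach, the snippet below shows how labelled candidate examples might be requested with the openai Python client. The prompt, model name, and label scheme are illustrative assumptions, not the setup used in any study cited here.

```python
# Minimal sketch: asking an LLM for labelled synthetic examples.
# Assumes the `openai` package and an OPENAI_API_KEY in the environment;
# the prompt, model name, and labels are illustrative only.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Write {n} short social-media comments a moderator might review. "
    "Mix neutral remarks, mild teasing, and clearly hurtful insults. "
    "Start each line with one label: NEUTRAL, AMBIGUOUS, or HARMFUL."
)

def generate_candidates(n: int = 20) -> str:
    """Return raw model output; it still needs parsing and human review."""
    response = client.chat.completions.create(
        model="gpt-4",  # any capable chat model
        messages=[{"role": "user", "content": PROMPT.format(n=n)}],
        temperature=1.0,  # higher temperature gives more varied wording
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_candidates(5))
```

In practice the raw output still needs parsing, deduplication, and review before it becomes training data, which is where the limitations discussed below start to show.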

Diagram showing an original image of a tabby cat and six augmented images with variations: flip, rotation, blur, exposure, contrast, and grayscale—illustrating data augmentation techniques often used in data science.
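For the image case in the diagram, the same idea takes only a few lines. The sketch below uses torchvision transforms as one common option; the parameter values are illustrative, not tuned.

```python
# Minimal sketch of the augmentations shown in the diagram (flip, rotation,
# blur, exposure/brightness, contrast, grayscale) using torchvision.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # flip
    transforms.RandomRotation(degrees=15),                 # rotation
    transforms.GaussianBlur(kernel_size=3),                # blur
    transforms.ColorJitter(brightness=0.4, contrast=0.4),  # exposure / contrast
    transforms.RandomGrayscale(p=0.2),                     # grayscale
])

image = Image.open("cat.jpg")                   # hypothetical input image
variants = [augment(image) for _ in range(6)]   # six new training variants
```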

But as we’ve learned—and as research confirms—synthetic data is not a cure-all. It helps fill gaps, but it can’t fully replace the messiness, nuance, and unpredictability of the real world.

What Research Shows about Data Augmentation

Kazemi et al. (2025) found that combining small amounts of real data with larger pools of LLM-generated data slightly improved performance in harmful content detection. Synthetic examples can stretch limited resources and boost results when used carefully.
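As a loose sketch of that setup (not the authors' pipeline), one might train on a mixture of a small real dataset and a larger synthetic pool while keeping the evaluation set strictly real. The tiny inline datasets below are placeholders for illustration only.

```python
# Sketch: train on real + synthetic text, evaluate on held-out real data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder data (1 = harmful, 0 = benign); real work would use far more examples.
real_texts = ["you did great today", "nobody wants you here", "see you at practice",
              "everyone thinks you're pathetic", "thanks for the help", "go away, loser"]
real_labels = [0, 1, 0, 1, 0, 1]
synthetic_texts = ["you're such a waste of space", "great job on the project",
                   "stop embarrassing yourself", "happy to lend a hand"]
synthetic_labels = [1, 0, 1, 0]

# Hold real examples out for evaluation; synthetic data is used only for training.
x_train, x_test, y_train, y_test = train_test_split(
    real_texts, real_labels, test_size=2, random_state=0, stratify=real_labels)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(x_train + synthetic_texts, y_train + synthetic_labels)

# The number that matters is performance on real, never-seen examples.
print("F1 on held-out real data:", f1_score(y_test, model.predict(x_test)))
```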

But quality depends on how the data is generated. Poor prompts produce unrealistic outputs, and even advanced LLMs often refuse to produce toxic or extreme content. Kumar et al. (2024) had to bypass safety filters ("jailbreak" the models) just to obtain authentic bullying language, showing how difficult it is to capture harsher, real-world patterns.

Another key limitation: most datasets are static snapshots. They don’t capture how language evolves over time—new slang, memes, or platform-specific behaviors—which leaves models less prepared for emerging trends.

Why Synthetic Data Falls Short

AI-generated text is often too clean, too polished, and too generic. It misses the slang, in-jokes, emoji, and shifting cultural references that real users employ every day. Content filters make the problem worse, since they prevent the generation of the most severe or explicit cases, leaving synthetic datasets biased toward mild examples.

This isn’t unique to language. In self-driving research, cars train on endless simulated scenarios but still fail in unfamiliar real-world edge cases. And in AI research more broadly, there’s the risk of model collapse: when models are trained repeatedly on synthetic data, they can drift further from reality as errors and biases accumulate.
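The collapse effect can be illustrated with a toy experiment: fit a simple model to data, sample new "data" from the fit, and repeat. The sketch below uses a Gaussian for simplicity; it is an analogy for the feedback loop, not a simulation of an LLM.

```python
# Toy illustration of model collapse: each generation is trained only on
# samples from the previous generation's fitted model.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)   # "real" data, std = 1

for _ in range(300):
    mu, sigma = data.mean(), data.std()           # fit a simple model
    data = rng.normal(mu, sigma, size=200)        # next generation sees only synthetic data

# Estimation errors compound, so the spread typically shrinks well below 1
# and the mean wanders away from 0.
print(f"mean = {data.mean():.3f}, std = {data.std():.3f}")
```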

Why Real Data Still Matters

Even small sets of authentic examples provide critical value:

  • Anchor models to reality instead of artificial patterns.
  • Capture change as slang, memes, and abuse styles evolve.
  • Include outliers and rare cases that synthetic data tends to miss.
  • Build trust with stakeholders who want assurance the model works in real conditions.

Without these grounding points, models risk learning only approximations of human behavior—useful in theory, but fragile in practice.

The Balance

Synthetic data is powerful for scaling quickly and reducing exposure to harmful content. But it can’t replace the messiness of real-world communication. The best approach is balance: use augmentation for breadth, and ground every system with a core of high-quality real examples collected over time.

Our Dataset

In our own cyberbullying detection work, we’ve adopted this principle. Synthetic examples help us cover a wide range of insults, neutral statements, and ambiguous edge cases. At the same time, we deliberately incorporate carefully collected real-world cases—especially those gathered across different years and platforms. These authentic examples serve as anchors, letting us see how online language evolves and making sure our models don’t drift into learning only sanitized or artificial patterns.
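One lightweight way to keep those anchors visible (a sketch of the general idea, not our exact schema) is to store provenance with every example, so evaluation and drift checks can filter by source, platform, or year.

```python
# Sketch: attach provenance to each example so real "anchor" cases can always
# be separated from synthetic ones. Field names and values are illustrative.
from dataclasses import dataclass

@dataclass
class LabeledExample:
    text: str
    label: str      # e.g. "harmful", "neutral", "ambiguous"
    source: str     # "real" or "synthetic"
    platform: str   # e.g. "forum", "chat", "social"
    year: int

examples = [
    LabeledExample("placeholder real comment", "harmful", "real", "forum", 2019),
    LabeledExample("placeholder real comment", "neutral", "real", "chat", 2024),
    LabeledExample("placeholder generated comment", "ambiguous", "synthetic", "social", 2025),
]

# Evaluation and drift checks can be restricted to real data from a given period.
recent_real = [ex for ex in examples if ex.source == "real" and ex.year >= 2023]
print(len(recent_real), "recent real anchor examples")
```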

References

  • Ataman, A. (2025). Synthetic Data vs Real Data: Benefits, Challenges in 2025. AIMultiple.
  • Kazemi, A., et al. (2025). Synthetic vs. Gold: The Role of LLM-Generated Labels and Data in Cyberbullying Detection. arXiv preprint arXiv:2502.15860.
  • Kumar, Y., et al. (2024). Bias and Cyberbullying Detection and Data Generation Using Transformer AI Models and Top LLMs. Electronics, 13(17), 3431.
  • Myers, A. (2024). AI steps into the looking glass with synthetic data. Stanford Medicine.
