
Keeping Synthetic Data in Check: Lessons from Henrietta Lacks

  • Gayathri
  • Oct 16, 2024
  • 3 min read

As AI and ML models become the primary tool of decision making in more and more sectors (healthcare, engineering, finance), they need more fuel. And their fuel is data. When they coined the phrase "Data is the new oil", I'm sure they didn't think it would start to run out just as fast! What once seemed like an endlessly available, ever-expanding resource is now scarce.

 

In the quest for more fuel, each of the big foundation model providers has made acquisitions and deals. Companies like Reddit and Shutterstock, from an older, more trusting, freeware age of the internet, suddenly gained value as sources of "real" data. Apart from the copyright lawsuits and loss of user trust, this is also turning out to be a finite source. The next frontier is "synthetic data": data that models the real world, made using the very algorithms that are trained to mimic it. There are many paid and open-source tools in this space (we have one too; click here to know more), but the rules for their creation and use are ill-defined.

 

A book and an incident from a different space, medical research, can serve as a cautionary tale and a framework for thinking about synthetic data generation. "The Immortal Life of Henrietta Lacks", published in 2010, and its movie version made in 2017, tell the story of an African American woman who died tragically at 31 of a particularly aggressive cervical cancer. The cancer cells, christened HeLa cells, were taken to a lab for testing, and the cultured cells were subsequently supplied commercially to research laboratories around the world. Research labs were hungry for cell lines on which to test medication and study diseases and viruses. Meanwhile, the late Mrs. Lacks' family languished in poverty, with limited access to the medical advances found using the cells harvested from their mother.

 

The book raises several important questions about the nature of consent and medical ethics. Substitute the word "data" for "cells", and many of those questions apply directly to the use of scraped data. Synthetic data is presented as an answer. One portion of the book talks about how most of the tissue samples (animal and human) in most of the labs on earth are contaminated with the HeLa line of cells. This compromises the results of other studies and can be dangerous.

 

Fewer people are asking what will happen when algorithm-created data "contaminates" real-world data. How will we separate the artificial from the real? Such contamination is a very real possibility.

 

At the highest end of the scale are malicious deepfakes: synthetically generated voice or video of real people. These are clearly at the criminal end of the spectrum and are already on the radar of lawmakers and enforcement agencies worldwide. At the lower end of the scale, though, even using synthetic data made to help models be more "accurate" carries risk. For example, substituting missing values with representative data in a forecasting model for climate change could mean we are ignoring real change or running the risk of making incorrect assumptions. Using synthetically generated speech in an LLM could mean that we are not finding and using "real changes" in speech patterns or, even more damagingly, not representing the speech of marginalized groups who might not have been represented on the internet. Try looking for text or transcripts in a language like Tulu. Even well-represented languages like Hindi do not get good translations from LLMs in many cases. (Link to our previous work). Increasing dependence on synthetic data will worsen this issue.
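To make the missing-values point concrete, here is a minimal sketch in Python. The series is made up for illustration and is not real climate data, and the helper names `slope` and `fill_with_mean` are hypothetical. It fills gaps with the average of the observed readings, one simple form of "representative" data, and compares the trend before and after. It also keeps a flag for every filled-in value, so the artificial can still be separated from the real later.

```python
from statistics import mean


def slope(ys):
    """Least-squares slope of ys against time steps 0, 1, 2, ..."""
    xs = range(len(ys))
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def fill_with_mean(ys):
    """Replace None entries with the mean of the observed values.

    Also returns a parallel list of flags marking which values are synthetic.
    """
    observed = [y for y in ys if y is not None]
    fill = mean(observed)
    values = [fill if y is None else y for y in ys]
    is_synthetic = [y is None for y in ys]
    return values, is_synthetic


# Hypothetical series (an assumption for illustration): rises 0.5 per step.
true_series = [10.0 + 0.5 * t for t in range(10)]

# Pretend the last four readings were never recorded.
with_gaps = true_series[:6] + [None] * 4

filled, is_synthetic = fill_with_mean(with_gaps)

true_slope = slope(true_series)
filled_slope = slope(filled)

# Using only the values flagged as real recovers the original trend here,
# because the six observed readings are consecutive time steps.
real_only = [y for y, synthetic in zip(filled, is_synthetic) if not synthetic]
real_only_slope = slope(real_only)

print(f"true trend:       {true_slope:.3f} per step")
print(f"after mean fill:  {filled_slope:.3f} per step")
print(f"real values only: {real_only_slope:.3f} per step")
```

In this toy case the filled series shows a much flatter trend than the real one, because the filled values sit at the historical average while the real values kept rising. Production imputation methods are more sophisticated than a plain average, but the same caution applies: without a provenance flag, nobody downstream can tell which points were measured and which were made up.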

 

Unless we limit the use of synthetic data to very specific use cases of modeling and understanding, we run the risk of data running wild and of technology that is less friendly to humans.


At AGH Advisors, we have the capability to generate synthetic data and augment existing data. Click on the video below for a quick glimpse of how data augmentation can work.


 



 


 
