Generative artificial intelligence (AI) systems are only as good as the data they’re trained on.
Complicating matters, the training data sets behind today's most popular and most effective AI systems have a notorious transparency problem.
After all, the data that makes up a model's training regime is typically a core part of its competitive advantage, too. Companies don't want to just give away the recipe for free.
Researchers from MIT, Harvard, Carnegie Mellon, MLCommons and seven other institutions audited and traced more than 1,800 data sets used to fine-tune the large language models (LLMs) that are pushing today's generative AI ecosystem forward.
In a paper published last week (Oct. 30), entitled “The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI,” the researchers argued for “a radical rethink” of how AI systems are built.
“Increasingly, widely used dataset collections are treated as monolithic, instead of a lineage of data sources, scraped (or model generated), curated, and annotated, often with multiple rounds of re-packaging (and re-licensing) by successive practitioners,” the paper stated.
“This lack of understanding can lead to data leakages between training and test data, expose personally identifiable information (PII), present unintended biases or behaviours, and generally result in lower quality models than anticipated,” the researchers added.
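To make the train/test leakage problem the researchers describe more concrete, here is a minimal sketch, not taken from the paper, that flags test examples whose normalized text already appears in a training set; the field choices and normalization are illustrative assumptions.

```python
# Minimal sketch: detect exact-duplicate overlap between training and test
# splits, one common form of train/test data leakage. Normalization and the
# example strings below are illustrative assumptions, not from the paper.

def normalize(example: str) -> str:
    """Collapse whitespace and lowercase so trivial variants still match."""
    return " ".join(example.lower().split())

def find_leakage(train_texts: list[str], test_texts: list[str]) -> set[str]:
    """Return test examples whose normalized form also appears in the training data."""
    train_set = {normalize(t) for t in train_texts}
    return {t for t in test_texts if normalize(t) in train_set}

if __name__ == "__main__":
    train = ["The quick brown fox jumps over the lazy dog.", "AI systems need data."]
    test = ["AI systems   need data.", "A completely new sentence."]
    leaked = find_leakage(train, test)
    print(f"{len(leaked)} of {len(test)} test examples overlap with the training data")
```

When data sets are repackaged and re-licensed across many hands, even a simple check like this becomes hard to run, because no single practitioner can see the full lineage.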
The paper found that the largest share of training data comes from online encyclopedias (21.5%), with Wikipedia.org accounting for 14.6% of that; social media follows (15.9%), led by sites like Reddit (6.2%) and Twitter (4%); and the general web (11.2%), news (11.1%) and entertainment web resources (9%) round out the top five source domains.
But despite the researchers' initiative, the ongoing lack of training-data transparency makes it difficult for many organizations to assess the reliability and quality of the data behind today's top LLMs. That, in turn, affects their ability and willingness to integrate generative AI solutions into their workflows, as well as to trust and stress-test those systems' outputs.
Read also: Walled Garden LLMs Build Enterprise Trust in AI
Data preprocessing, including cleaning, labeling and augmentation, plays a crucial role in preparing data for AI models. Lack of transparency in these processes can introduce errors and reduce the overall reliability of AI systems.
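As an illustration of what transparent preprocessing can look like, the sketch below attaches a simple provenance record (source, license and an audit trail of cleaning steps) to each example; the schema and field names are assumptions for demonstration, not a published standard.

```python
# Illustrative sketch: carry provenance metadata alongside each record as it
# moves through cleaning steps, so downstream users can audit what was done.
# The Record schema (source, license, steps) is an assumed example format.
import re
from dataclasses import dataclass, field

@dataclass
class Record:
    text: str
    source: str                                      # e.g. a URL or upstream dataset name
    license: str                                      # e.g. "CC-BY-4.0" or "unknown"
    steps: list[str] = field(default_factory=list)    # audit trail of applied transforms

def strip_html(rec: Record) -> Record:
    """Remove HTML tags left over from web scraping."""
    rec.text = re.sub(r"<[^>]+>", " ", rec.text)
    rec.steps.append("strip_html")
    return rec

def redact_emails(rec: Record) -> Record:
    """Replace email addresses, a simple example of PII redaction."""
    rec.text = re.sub(r"\b\S+@\S+\.\S+\b", "[EMAIL]", rec.text)
    rec.steps.append("redact_emails")
    return rec

if __name__ == "__main__":
    rec = Record("<p>Contact me at jane@example.com</p>",
                 source="example.com", license="unknown")
    for step in (strip_html, redact_emails):
        rec = step(rec)
    print(rec.text.strip(), rec.steps)
```

Keeping that audit trail attached to the data, rather than discarding it at each repackaging step, is essentially what the researchers mean by treating a collection as a lineage rather than a monolith.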
Additionally, many enterprises operate in heavily regulated industries such as healthcare, finance and telecommunications. Without a clear understanding of the data used in AI models, it becomes challenging to demonstrate compliance with data protection laws.
And of course, without transparent data, it is difficult for businesses to predict the performance of AI systems accurately.
“Data is foundational to building the models, training the AI — the quality and integrity of that data is important,” Michael Haney, head of Cyberbank Digital Core at FinTech platform Galileo, the sister company of Technisys, told PYMNTS in March.
That’s why, to train an AI model to perform to the necessary standard, many enterprises are relying solely on their own internal data to avoid compromising model outputs.
Read also: Google and Microsoft Spar Over Training Rights to AI Data
PYMNTS Intelligence reports that more than 8 in 10 business leaders (84%) believe generative AI will positively impact the workforce, and Elon Musk has gone on the record saying that the technology will “render all jobs obsolete.”
But for that future to come to fruition, organizations will need to overcome many obstacles to build out their own AI models. For one, with most companies still relying on technologically outdated legacy systems, simply establishing the infrastructure needed to accommodate AI systems can be a tall task.
Once the infrastructure is in place, preparing data remains a time-consuming obstacle for businesses. That's why some observers believe vertically oriented AI models offer enterprises the safest, most productive way forward: LLMs and GPT systems trained for industry-specific use cases on validated, audited data sets, regularly retrained and updated per sector-specific guidelines, and able to be fine-tuned by organizations using their own data.
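As a rough sketch of what that last step, fine-tuning on an organization's own data, might look like in practice, the example below uses the widely available Hugging Face transformers and datasets libraries; the base model, hyperparameters and the local file internal_docs.jsonl are placeholder assumptions, not recommendations.

```python
# Rough sketch: fine-tune a small causal language model on an organization's
# own documents. "distilgpt2" and "internal_docs.jsonl" (one {"text": ...}
# object per line) are placeholders chosen for illustration only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "distilgpt2"  # a vertical deployment would pick a domain-appropriate model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the internal corpus and tokenize it.
dataset = load_dataset("json", data_files="internal_docs.jsonl", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because the fine-tuning corpus never leaves the organization, this kind of setup pairs naturally with the data-privacy posture described below.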
“Enterprises are right to push for data privacy, up to and including running the whole stack on their own network,” Taylor Lowe, CEO and co-founder of AI developer platform Metal, told PYMNTS in an interview published in July.