news-13092024-224735

Today, we’re discussing AI hallucinations and how DataGemma addresses the problem by grounding large language models (LLMs) in real-world data. LLMs can process vast amounts of text and generate valuable insights, but they sometimes present inaccurate information with confidence, a phenomenon known as “hallucination.”

Data Commons is a comprehensive repository of trustworthy data sourced from reputable organizations such as the UN, WHO, CDC, and Census Bureaus. This vast database contains over 240 billion data points across various statistical variables, enabling policymakers, researchers, and organizations to access reliable insights on a wide range of topics.

DataGemma is the first open model designed to connect LLMs with real-world data from Data Commons. By drawing on this dataset, DataGemma aims to reduce hallucinations and improve the accuracy of generated responses. It integrates Data Commons with Gemma models to improve factuality and reasoning through two distinct approaches:

1. RIG (Retrieval-Interleaved Generation) has the model proactively query trusted sources and fact-check its output against Data Commons. When the model produces a statistic while generating a response, that figure is checked against Data Commons, so the answer is grounded in authoritative data.

2. RAG (Retrieval-Augmented Generation) allows language models to incorporate relevant information beyond their training data, resulting in more comprehensive and informative outputs. DataGemma retrieves relevant context from Data Commons before the model begins generating a response, using Gemini 1.5 Pro’s long context window to hold that context, which minimizes the risk of hallucinations.
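
To make the difference between the two approaches concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not DataGemma's implementation: `TOY_STORE` is a hypothetical in-memory stand-in for Data Commons with made-up placeholder values, and the `[DC(variable, place) -> guess]` marker syntax, `rig_resolve`, and `rag_prompt` are invented for this example. In the RIG-style path, the draft is generated first and each statistic is checked afterwards; in the RAG-style path, the facts are fetched first and placed in the prompt.

```python
import re

# Hypothetical stand-in for Data Commons: (variable, place) -> value.
# The figures are made-up placeholders, not real statistics.
TOY_STORE = {
    ("population", "Exampleland"): "1,000",
    ("median_age", "Exampleland"): "30",
}


def lookup(variable, place):
    """Return the stored value, or None when the toy store has no match."""
    return TOY_STORE.get((variable, place))


# Assumed marker format for a statistic the model wants checked:
# [DC(variable, place) -> model's own guess]
MARKER = re.compile(r"\[DC\((\w+),\s*(\w+)\)\s*->\s*([^\]]+)\]")


def rig_resolve(draft):
    """RIG-style: replace each marked guess in a finished draft with the stored value."""
    def replace(match):
        variable, place, guess = match.groups()
        value = lookup(variable, place)
        # Keep the model's guess, clearly flagged, when nothing is found.
        return value if value is not None else guess.strip() + " (unverified)"
    return MARKER.sub(replace, draft)


def rag_prompt(question, keys):
    """RAG-style: fetch facts first, then put them in the prompt ahead of the question."""
    facts = []
    for variable, place in keys:
        value = lookup(variable, place)
        if value is not None:
            facts.append(f"{variable} of {place}: {value}")
    context = "\n".join(facts)
    return f"Context:\n{context}\n\nQuestion: {question}"


draft = "Exampleland has [DC(population, Exampleland) -> 900] residents."
print(rig_resolve(draft))
print(rag_prompt("How many people live in Exampleland?",
                 [("population", "Exampleland")]))
```

In the RIG-style call, the model's guess of 900 is overwritten by the stored value, while an unmatched marker keeps the guess and is flagged as unverified. In the RAG-style call, the retrieved facts travel with the question, which is why a long context window matters when many facts are retrieved.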

Preliminary findings with RIG and RAG are promising: both improve the accuracy of language models, particularly when handling numerical facts. This suggests that users will encounter fewer hallucinations when using AI models for research, decision-making, or general inquiries.

As research on DataGemma continues, the goal is to further refine these methodologies, subject them to rigorous testing, and eventually integrate the enhanced functionality into Gemma and Gemini models. By making DataGemma an “open” model, the aim is to encourage broader adoption of these Data Commons-led techniques to ensure the reliability and trustworthiness of AI tools for everyone.

Researchers and developers can get started with DataGemma through quickstart notebooks for both the RIG and RAG approaches. For more information on how Data Commons and Gemma work together, refer to our research post. The ultimate objective is to give people accurate information so they can make informed decisions and better understand the world through AI.