OpenAI seeks to partner with outside organizations to develop new AI training data sets


It is widely acknowledged that the existing data sets used to train AI models suffer from significant flaws. 


One of these flaws is the over-representation of U.S. and Western-centric images in image corpora, largely because Western images dominated the internet when the data sets were compiled.


Additionally, recent research by the Allen Institute for AI has highlighted the presence of toxic language and biases in the data sets used to train large language models such as Meta's Llama 2.


Models amplify these flaws, which can cause harm in various ways. To address this issue, OpenAI has introduced Data Partnerships. Through this initiative, OpenAI intends to collaborate with third-party organizations to create new data sets that it hopes will improve on existing ones.


The announcement states that Data Partnerships aim to enable more organizations to help shape the future of AI and to benefit from more useful models.


OpenAI says that to create AI that is safe and beneficial to humanity, AI models must deeply understand all subject matters, industries, cultures, and languages. This requires broad and comprehensive training data sets.


OpenAI emphasizes that incorporating an organization's content can make AI models more useful to that organization by improving the models' understanding of its domain.




Under the Data Partnerships program, OpenAI plans to collect expansive data sets that accurately reflect human society and are not readily accessible online today. While the company envisions incorporating data from various modalities such as images, audio, and video, it particularly seeks data that represents human intention across different languages, topics, and formats, including long-form writing and conversations.


OpenAI is committed to working closely with organizations to digitize training data if required, utilizing optical character recognition and automatic speech recognition tools. Any sensitive or personal information will be removed to ensure data privacy.


Initially, OpenAI aims to develop two types of data sets. The first is an open-source data set that will be publicly available for anyone to use in AI model training. The second is a set of private data sets for training proprietary AI models.


The private data sets are intended for organizations that prefer to keep their data private while still benefiting from OpenAI models that better understand their domains.


OpenAI has already collaborated with the Icelandic Government and Miðeind ehf to enhance GPT-4's ability to understand Icelandic and with the Free Law Project to improve its models' comprehension of legal documents.


OpenAI says it is seeking partners who want to help teach AI to understand the world so that it can be maximally helpful to everyone.


While OpenAI's efforts to build better data sets are commendable, the task of mitigating data set biases has proven to be challenging for many experts worldwide. At the very least, one hopes that OpenAI maintains transparency throughout the process and openly acknowledges the challenges involved in creating these data sets.


Despite the ambitious language used in the blog post, it is important to recognize that there is a clear commercial motivation driving OpenAI's desire to enhance the performance of its models, potentially at the expense of others. 


Furthermore, the announcement does not appear to address compensation for data owners, which raises concerns in light of open letters and lawsuits from creators who allege that OpenAI has used their work without permission or payment.




