{"id":11536,"title":"We need responsible data foundations for Foundation Models","link":"https:\/\/www.digitalisierung-und-gemeinwohl.de\/en\/2024\/10\/17\/we-need-responsible-data-foundations-for-foundation-models\/","date":"17. October 2024","date_unix":1729148820,"date_modified_unix":1779960565,"date_iso":"2024-10-17T07:07:00+00:00","content":"<p>Common-good organizations have a clear mission: They work for the benefit of society, whether in social work, environmental protection, or supporting marginalized groups. For example, \u201cPaulinenpflege Winnenden\u201d runs a facility for youth and disability assistance, \u201cGr\u00fcne Erde\u201d sells ecological and sustainable products, and the association of therapy clinics \u201cLudwigsm\u00fchle\u201d helps people with addictions. These organizations\u2019 structures and actions are designed to prioritize social goals. A <a href=\"https:\/\/dserver.bundestag.de\/btd\/20\/064\/2006465.pdf\" target=\"_blank\" aria-label=\"Opens in a new tab\">formal definition<\/a> (in German) of these enterprises states that their business activities are based on common-good goals, that they reinvest most of their profits to achieve social objectives, and that their structures promote participation and social justice. But what happens when these common-good goals do not align with the digital tools they use?<\/p>\n<p>In recent years, common-good organizations have also started using large foundation models. Language models like GPT and products built on them, such as ChatGPT or Copilot, are increasingly being used, for instance, <a href=\"https:\/\/www.digitalisierung-und-gemeinwohl.de\/2023\/12\/13\/smartes-wissensmanagement-potenziale-generativer-ki-fuer-humanitaere-organisationen\/\" target=\"_blank\" aria-label=\"Opens in a new tab\" rel=\"noopener\">in the knowledge management of a large humanitarian organization<\/a> (link in German). 
These models, often developed by big tech companies, offer various advantages: organizations can increase their efficiency through support with routine tasks like documentation and reporting, and the ability of the models to process and analyze large amounts of data in a short time is useful in many work contexts \u2013 for example, evaluating donation campaigns, climate data, or patient information. However, they are based on training data that often do not reflect the values essential to common-good organizations, such as fairness, equality, diversity, and the protection of human rights.<\/p>\n<h2><strong>Biased training data and their impact on foundation models<\/strong><\/h2>\n<p>Training data \u2013 the data foundations of these models \u2013 are often large datasets sourced from the internet through automated processes like web scraping. These training data are a key reason why the models are so powerful. But the crucial question is: powerful for whom? Numerous examples show that the models not only reproduce existing biases and stereotypes but even amplify them: from <a href=\"https:\/\/unesdoc.unesco.org\/ark:\/48223\/pf0000387483\" target=\"_blank\" aria-label=\"Opens in a new tab\" rel=\"noopener\">cases of gender-based violence<\/a> in generative AI models, to <a href=\"https:\/\/arxiv.org\/pdf\/2310.09219\" target=\"_blank\" aria-label=\"Opens in a new tab\" rel=\"noopener\">gender bias in letters of recommendation<\/a> generated by language models, to <a href=\"https:\/\/theintercept.com\/2022\/12\/08\/openai-chatgpt-ai-bias-ethics\/\" target=\"_blank\" aria-label=\"Opens in a new tab\" rel=\"noopener\">racist advice<\/a> and <a href=\"https:\/\/www.npr.org\/sections\/goatsandsoda\/2023\/10\/06\/1201840678\/ai-was-asked-to-create-images-of-black-african-docs-treating-white-kids-howd-it-\" target=\"_blank\" aria-label=\"Opens in a new tab\" rel=\"noopener\">stereotypical 
portrayals of doctors<\/a> and <a href=\"https:\/\/www.npr.org\/sections\/goatsandsoda\/2023\/10\/06\/1201840678\/ai-was-asked-to-create-images-of-black-african-docs-treating-white-kids-howd-it-\" target=\"_blank\" aria-label=\"Opens in a new tab\" rel=\"noopener\">other professions<\/a>. These data mostly reflect Western and patriarchal perspectives and values. Biases are thus inevitable: the one-sided data foundation can lead the models to favor or disadvantage certain groups, genders, or viewpoints, failing to adequately represent the diversity and complexity of society, as the examples above illustrate.<\/p>\n<h2><strong>A faulty foundation leads to flawed outcomes<\/strong><\/h2>\n<p>Foundation models serve as the foundation for many different AI applications, whether for chatbots, translation, or knowledge management tools. If the foundation of a house is skewed, the walls built on it will be unstable. The same is true for AI models: if the training data already contain biases and errors, these will carry over to the applications built on these models. For instance, if a common-good organization uses a language model based on racist training data for a chatbot providing advice to migrants, the chatbot could pass those biases on to the very people it is meant to support.<\/p>\n<p>The decisions and responses of the model would then fail to meet the ethical standards or the specific needs of the organizations and the people affected. The examples above also highlight how vulnerable certain \u2013 particularly marginalized \u2013 groups and individuals can become when AI-generated predictions or evaluations box them into categories. A good example is the use of AI in human resources, where algorithms, due to biased training data, may systematically disadvantage certain applicants \u2013 for instance, women with migrant backgrounds. 
This can happen when the model incorrectly assumes that people from certain countries or with certain names are less suitable for specific positions. In working with migrants, biases embedded in the models can also lead to people being disadvantaged due to stereotypical assessments \u2013 for example, being passed over for housing assignments or misclassified in language courses. Such decisions are often based on biased data and entrench existing inequalities rather than reducing them.<\/p>\n<h2><strong>Values-based models: a better choice for common-good-oriented organizations?<\/strong><\/h2>\n<p>Common-good-oriented organizations are increasingly relying on AI foundation models without considering the values embedded in the underlying training data. They should consider choosing AI models whose training data align better with their own values. It is in their interest to use technology that is not only efficient but also ethically sound \u2013 fair and inclusive.<\/p>\n<p>Alternative approaches should also be made more transparent. After all, having a choice gives us as customers and users power. And a values-based choice reinforces the credibility of common-good enterprises <em>and<\/em> strengthens their impact.<\/p>\n<h2><strong>Our study: a close analysis of training data<\/strong><\/h2>\n<p>To address these challenges, we at reframe[Tech] are conducting a study that closely examines the training data of foundation models. We are collaborating with Professor Anne L. Washington from New York University, currently a fellow at Stanford University and author of <a href=\"https:\/\/academic.oup.com\/book\/55143\" target=\"_blank\" aria-label=\"Opens in a new tab\" rel=\"noopener\">&#8220;Ethical Data Science: Prediction in the Public Interest&#8221;<\/a>, to develop criteria for evaluating the data foundations of foundation models. 
This study will be published in early 2025.<\/p>\n<p>Common-good organizations have the opportunity to represent their values in the digital space as well \u2013 and we want to help them do so responsibly. Just as a stable foundation ensures the security of an entire house, the data foundations of AI models must be stable and values-based so that the applications built on them meet the ethical standards of the organizations that use them.<\/p>\n<h2><strong>Stay informed!<\/strong><\/h2>\n<p>For continuous updates and to engage in discussions on this topic, follow our <a href=\"https:\/\/bertelsmannstiftung-my.sharepoint.com\/personal\/teresa_staiger_bertelsmann-stiftung_de\/Documents\/linkedin.com\/company\/reframe-tech-algorithmen-f\u00fcrs-gemeinwohl\" target=\"_blank\" aria-label=\"Opens in a new tab\" rel=\"noopener\">LinkedIn account<\/a> or feel free to contact me at <a href=\"mailto:teresa.staiger@bertelsmann-stiftung.de\" target=\"_blank\" rel=\"noopener\">teresa.staiger@bertelsmann-stiftung.de<\/a>.<\/p>\n","excerpt":"<p>The rapid spread of foundation models like GPT has sparked a lot of excitement but also raises questions about the values embedded in these technologies. This is especially true of the underlying training data, which significantly influence how AI models function and the results they produce. Organizations focused on the common good should ask whether they would prefer models whose training data better represent their values and convictions. However, there is too little evidence so far. 
That\u2019s why we will be conducting a study to take a closer look at the training data \u2013 to raise awareness among organizations and increase transparency in the use of foundation models.<\/p>\n","thumbnail":"https:\/\/www.digitalisierung-und-gemeinwohl.de\/wp-content\/uploads\/sites\/34\/2024\/10\/AdobeStock_893035854_KONZERN_ST-DG_Montage.png","thumbnailsquare":"https:\/\/www.digitalisierung-und-gemeinwohl.de\/wp-content\/uploads\/sites\/34\/2024\/10\/AdobeStock_893035854_KONZERN_ST-DG_Montage.png","authors":[{"id":7882,"name":"Teresa Staiger","link":"https:\/\/www.digitalisierung-und-gemeinwohl.de\/en\/blogger\/teresa-staiger\/"}],"categories":[{"id":739,"name":"Artificial Intelligence","link":"https:\/\/www.digitalisierung-und-gemeinwohl.de\/en\/category\/artificial-intelligence\/"},{"id":738,"name":"Data","link":"https:\/\/www.digitalisierung-und-gemeinwohl.de\/en\/category\/data\/"}],"tags":[{"id":315,"name":"Artificial Intelligence","link":"https:\/\/www.digitalisierung-und-gemeinwohl.de\/en\/tag\/artificial-intelligence\/"},{"id":710,"name":"data","link":"https:\/\/www.digitalisierung-und-gemeinwohl.de\/en\/tag\/data\/"},{"id":274,"name":"Ethics","link":"https:\/\/www.digitalisierung-und-gemeinwohl.de\/en\/tag\/ethics\/"},{"id":711,"name":"foundation models","link":"https:\/\/www.digitalisierung-und-gemeinwohl.de\/en\/tag\/foundation-models\/"},{"id":712,"name":"LLM","link":"https:\/\/www.digitalisierung-und-gemeinwohl.de\/en\/tag\/llm\/"},{"id":709,"name":"training data","link":"https:\/\/www.digitalisierung-und-gemeinwohl.de\/en\/tag\/training-data\/"}]}