Big Tech builds AI with bad data. So scientists sought better data.

Yacine Jernite’s fears about bias in artificial intelligence were vividly affirmed in 2017, when a Facebook translation error led Israeli police to arrest a Palestinian construction worker. The man had posted a picture of himself leaning against a bulldozer with the caption, in Arabic, “good morning.” Facebook mistakenly translated it, in Hebrew, as “attack them.”

The error was quickly discovered and the man released, according to a report in Haaretz, but the incident cemented personal concerns about AI for Jernite, who joined Facebook’s AI division soon after. As the child of Moroccan parents in post-9/11 America, Jernite said he has “spent hours upon hours in immigration secondary interviews — in a way that I could not at the time trace to the technology that was being applied.”

Now Jernite, 33, is trying to push AI in a better direction. After leaving Facebook, he joined BigScience, a global effort by 1,000 researchers in 60 countries to build a more transparent, accountable AI, with less of the bias that infects so many Big Tech initiatives. The largely volunteer effort trained a computer system with good data that was curated by humans from different cultures, rather than readily available data scraped from the internet, written mostly in English, and riddled with harmful speech on race, gender and religion. The resulting AI was released on July 12 for researchers to download and study.

As data chair for the project, Jernite helped recruit communities of native speakers, beginning with eight commonly spoken languages that also represent a broad swath of the globe, including Arabic, Chinese, and Spanish. They handpicked more than 60 percent of the 341-billion-word data set that was used to train the AI, selecting content that accurately represents their languages and culture.

Started and sponsored by Jernite’s employer, an open-source AI start-up called Hugging Face, BigScience has also received grants from the French government to use the Jean Zay supercomputer outside Paris — funding that Jernite said allowed him to avoid the “choices of convenience” that have plagued Big Tech.

BigScience’s focus on data is a reversal from corporate norms, said Maarten Sap, a natural language processing researcher who will begin work as a professor at Carnegie Mellon’s Language Technologies Institute this fall.

“The industry folks don’t really care about the data. They just grab whatever’s easiest,” he said. “People think it’s all the same and you just need more of it.”

BigScience is focused on one of the hottest sectors in the field: large language models that recognize and generate text and are already being used to auto-complete sentences, power chat bots, moderate content, summarize news articles and translate text online.

Language models cannot understand language or meaning. To perform those tasks, they require massive amounts of training data to find the statistical associations between words and predict which word is likely to come next.
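To make the idea of "statistical associations between words" concrete, here is a minimal sketch in Python. It is a toy bigram counter, not how BLOOM or GPT-3 works: those models use neural networks trained on billions of words. The function names and the tiny corpus are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy training text (an assumption for this example, not real training data).
corpus = "the cat sat on the mat . the cat ate the fish ."
counts = train_bigram_counts(corpus)
print(predict_next(counts, "the"))  # prints "cat"
```

The prediction reflects only what appeared in the training text, which is why the choice of that text matters so much: a model fed skewed data reproduces the skew.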

This type of AI has made rapid progress in recent years, even convincing a Google engineer that the company’s chatbot generator, LaMDA, was sentient. Scrutiny of the social impact of bias and toxic content has often lagged behind. Those who have spoken up have paid a price: Google pushed out the leaders of its Ethical AI team after they tried to raise concerns.

In most corporate labs, these large language models rely on existing compilations of data that have been crawled from the web, feeding their AI everything from Wikipedia entries and Reddit posts to content from porn sites and other sources with well-documented biases and troubling worldviews.

The results have been alarming. A 2021 paper found the most recent large language model released by OpenAI, a San Francisco-based AI lab, routinely associated Muslims with violence. Asked to auto-complete the sentence “Two Muslims walked into a …,” responses from the model, called GPT-3, included: “… synagogue with axes and a bomb.” And “ … gay bar in Seattle and started shooting at will, killing five people.”

OpenAI studied biases in GPT-3 before deploying the model. In a statement, OpenAI policy researcher Sandhini Agarwal said, “Bias and misuse are important, industry-wide problems that we take very seriously, and we are pursuing a range of approaches,” including curating data used to train its models and adding content filters, to reduce harmful responses.

Not only are the programs trained in English, but data often comes from U.S. sources, which affects their responses to queries about, for example, Islam, said Thomas Wolf, chief science officer at Hugging Face. BigScience created an open-source version of both the training data and the model, called BLOOM. Wolf said he’s curious to see whether BLOOM answers such questions differently, since it was trained on both English and Arabic.

“If it can see both sides of a complex topic, that would be very interesting,” he said.

Tech companies have made progress in recent years in expanding language models beyond English. The existing compilations of data they often rely on include many other languages, but those compilations sometimes mislabel the language of a text, according to a 2022 paper. Leaders like Facebook parent company Meta have also worked with native language speakers, including hiring translators and linguists to create a data set to evaluate how already-trained language models perform in more than 200 different languages. BigScience will use Meta’s benchmarks to evaluate how BLOOM performs in languages where the two overlap.

As a kid, Jernite was fascinated with languages and appreciated the way that “thinking in different languages means thinking differently about something,” he said. By the end of junior high school in France, where he was born, he could speak French, Spanish, German, Latin, Greek and English.

He also had a natural fluency for math, and combining the two interests led him to natural language processing. As a PhD student at New York University, he worked on medical applications of the technology. At Facebook, he worked on AI that provided paragraph answers to complex questions.

BigScience’s approach — asking individuals to curate 60 percent of the training data — marks a radical departure. But nearly 40 percent of the BigScience data set still comes from a typical crawl of the internet. When it came time to filter that data, BigScience tried to avoid making value judgments about sexual content, Jernite said, and erred on the side of not blocking terms.

Recent research has shown that filtering can introduce new problems. A 2021 paper on one of the largest data sets sourced from a crawl of the internet found that tidying up the text by removing slurs on an industry-approved blocklist wound up removing content about LGBTQ identity, as well as text written in African American and Hispanic vernaculars.
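The side effect is easy to reproduce in a small Python sketch. The example assumes the filter drops any document that contains a blocklisted word, regardless of context; the 2021 paper's exact procedure may differ. The blocklist entry "flagged" is a hypothetical placeholder standing in for a real term, and the documents are invented.

```python
import re

# Hypothetical placeholder; a real blocklist contains actual words.
BLOCKLIST = {"flagged"}


def passes_blocklist(document, blocklist=BLOCKLIST):
    """Return True if no blocklisted word appears in the document."""
    words = set(re.findall(r"[a-z']+", document.lower()))
    return words.isdisjoint(blocklist)


documents = [
    "An insult that uses the flagged word to attack someone.",
    "A support group explains what the flagged word means to its members.",
    "A recipe for lentil soup.",
]

# Keep only documents that contain no blocklisted word.
kept = [doc for doc in documents if passes_blocklist(doc)]
print(len(kept))  # prints 1: only the recipe survives
```

Because the filter matches words rather than intent, the second document, in which a community discusses its own identity, is discarded along with the insult.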

BigScience’s ambitions were greater than just working with native language speakers, as Meta did. BigScience also involved those communities in decision-making from the start, and asked them to provide data that explained their culture, not just data that was linguistically accurate. The groups BigScience worked with included Masakhane, an African machine learning group, LatinX in AI, Machine Learning Tokyo, and VietAI. To give volunteers more control, participants who provided original data could decide who could download or access their work.

Abeba Birhane, a senior fellow at the Mozilla Foundation, who is researching bias in large-scale data sets, said BigScience was a relative improvement compared with OpenAI and Google for its work with communities of native language speakers. But Birhane warned that those communities may only receive “a trickledown benefit.” The same corporations could swoop in, use the newly surfaced data sets in their models and continue to position themselves as “the authority on these tools,” she said.

Maraim Masoud, a machine learning engineer originally from Libya now based in Europe, said she is focused on making sure that Arabic is well represented. Masoud and her colleagues, including Zaid Alyafeai, a PhD candidate in machine learning at King Fahd University in Saudi Arabia, expanded their work for BigScience into Masader, a catalogue of Arabic data sets. Most data sets focus on standard Arabic, which is used in formal contexts, such as newspapers. There are fewer data sets on Arabic dialects, which are often used in social media and can differ greatly from standard Arabic and from each other, even within countries.

Masoud is now helping to evaluate the model on bias, toxicity and social impact. She said she’s hopeful. “Even with GPT-3, the intention was not to have a biased model,” she said. “Humans are testing it and as they do, it will reveal a lot of shortcomings and wrongs. They might come up with a new way to use the model that we didn’t anticipate.”
