Facebook hopes researchers will use the open-source data set, which it announced Thursday, to help judge whether AI systems work well for people of different ages, genders, and skin tones, and in different types of lighting. (The data set is not meant to be used to train AI to identify people by their gender, age, or skin tone, the company said, as this would violate the data set’s terms of use.) Facebook also made the data set available to its own teams; the company said in a blog post that it is “encouraging” them to use it.
The data set, called “Casual Conversations,” includes 45,186 videos of 3,011 people from around the United States. Facebook gave the data set that name because participants were recorded while giving unscripted answers to a variety of pre-chosen questions.
Facebook had humans label the lighting conditions in videos and label participants’ skin tones according to the Fitzpatrick scale, which was developed in the 1970s by a dermatologist to classify skin colors.
Though some AI data sets include people who agreed to participate, people are often unaware that they have been included in one. That’s been the case with images used to build some of the key data sets for training facial-recognition software. And tech companies including Facebook have used ImageNet, an enormous data set of all kinds of images (including those of people) gathered from the internet, to advance their progress in AI.
The Casual Conversations data set features the same group of paid actors that Facebook previously used when it commissioned the creation of Deepfake videos for another open-source data set (Facebook hoped people in the artificial-intelligence community would use that one to come up with new ways to spot technologically manipulated videos online and stop them from spreading). Cristian Canton Ferrer, research manager at Facebook AI, told CNN Business that the Casual Conversations data set includes some information that was not used when Facebook created the Deepfake data set.
Canton said paying participants — who had to spend several hours being recorded in a studio — seemed fair given what Facebook got in return. Participants in this data set can also tell Facebook to remove their information in the future for any reason, he said.
Canton knows much more work needs to be done to make AI systems fair. He said he hopes to get feedback from academic researchers and companies so that, over time, fairness can be better measured.
One area he is considering expanding on in the future is the way gender tends to be defined in data sets. Computers are typically tasked with looking at gender in a very narrow way, as binary labels of “male” or “female” that may be applied automatically, while humans increasingly recognize gender with a growing number of terms that may change over time. In the Casual Conversations data set, participants were asked to self-identify as “male,” “female,” or “other,” Canton said.
“‘Other’ encapsulates a huge gamut of options there,” he said.