Wed. Dec 1st, 2021

If the AIs of the future are, as many tech companies seem to hope, going to see through our eyes in the form of AR glasses and other wearables, they will need to learn to understand the human perspective. We're used to it, of course, but there are remarkably few first-person video recordings of everyday tasks out there, which is why Facebook has compiled a few thousand hours of them for a new publicly available dataset.

The challenge Facebook is trying to address is that even today's most impressive object and scene recognition models have been trained almost exclusively on third-person perspectives. A model can recognize a person cooking, but only if it sees that person standing in a kitchen, not if the view is from the person's own eyes. It can recognize a bike, but not from the rider's perspective. It's a shift in perspective that we take for granted, since it's a natural part of our experience, but one that computers find quite difficult.

The solution to a machine learning problem is generally either more data or better data, and in this case it cannot hurt to have both. So Facebook contacted research partners around the world to collect first-person video of common activities like cooking, grocery shopping, tying shoelaces or just hanging out.

The 13 partner universities collected thousands of hours of video from more than 700 participants in 9 countries, and it should be said up front that the participants were volunteers who controlled the level of their own involvement and how identifiable they were. Those thousands of hours were whittled down to 3,000 by a research team that watched, edited and hand-annotated the video, adding their own footage of staged scenarios they could not capture in the wild. It's all described in this research article.

The recordings were captured with a variety of methods, from glasses-mounted cameras to GoPros and other devices; some researchers also chose to scan the environment the person was operating in, while others tracked gaze and other metrics. It all goes into a dataset Facebook calls Ego4D, which will be made freely available to the research community at large.

Two images, one showing computer vision that successfully identified objects and another showing that it failed in the first person.

“For AI systems to interact with the world as we do, the AI field must evolve to an entirely new paradigm of first-person perception. That means teaching AI to understand daily life activities through human eyes in the context of real-time motion, interaction, and multi-sensory observations,” said lead researcher Kristen Grauman in a Facebook blog post.

As hard as it may be to believe, this research and the Ray-Ban Stories smart glasses are completely unrelated, except in that Facebook clearly believes first-person understanding is becoming increasingly important to multiple disciplines. (The 3D scans may find a use in the company's Habitat AI training simulator, however.)

“Our research is strongly motivated by applications in augmented reality and robotics,” Grauman told TechCrunch. “First-person perception is critical to enabling the AI assistants of the future, especially as wearables like AR glasses become an integral part of how people live and move through everyday life. Think about how helpful it would be if the assistants on your devices could remove that cognitive overload from your life, understanding your world through your eyes.”

The global spread of the collected video is a very deliberate choice. It would be deeply short-sighted to include footage from only a single country or culture. Kitchens in the United States look different from French, Rwandan and Japanese ones. Making the same dish with the same ingredients, or performing the same general task (cleaning, exercising), can look very different even between individuals, let alone across entire cultures. So, as Facebook's post puts it: “Compared to existing datasets, the Ego4D dataset provides a greater diversity of scenes, people, and activities, increasing the applicability of models trained for people across backgrounds, ethnicities, occupations, and ages.”

Examples from Facebook of first-person video and the environments in which it was taken.

The dataset is not the only thing Facebook is releasing. With advances in data collection like this, it is common to also publish a set of benchmarks that test how well a given model uses the information. For example, with a set of pictures of dogs and cats, you might want a standard benchmark that tests how effectively a model tells the two apart.
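The dogs-and-cats example can be made concrete in a few lines. This is a toy sketch with invented names, not anything from the Ego4D release: at its core, a benchmark is just a fixed set of labeled examples plus an agreed scoring rule.

```python
# Toy sketch of a classification benchmark: fixed labeled examples
# plus a scoring rule. All names here are invented for illustration.

def accuracy(model, labeled_examples):
    """Fraction of examples the model labels correctly."""
    correct = sum(1 for x, label in labeled_examples if model(x) == label)
    return correct / len(labeled_examples)

# A trivial stand-in "model": calls anything tagged "whiskers" a cat.
def toy_model(image_tag):
    return "cat" if "whiskers" in image_tag else "dog"

examples = [
    ("whiskers-closeup", "cat"),
    ("fetch-in-park", "dog"),
    ("whiskers-on-sofa", "cat"),
]
print(accuracy(toy_model, examples))  # 1.0 on this contrived set
```

Real benchmarks differ mainly in scale and in the metric used; the structure, held-out labeled data scored by a shared rule, is the same.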

In this case, things are a little more complicated. Simply identifying objects from a first-person point of view is not that hard; it's just another angle, and it would not be especially new or useful either. Do you really need a pair of AR glasses to tell you “that's a tomato”? No: like any other tool, an AR device should tell you something you don't know, and to do so it needs a deeper understanding of things like intentions, contexts and linked actions.

To that end, the researchers came up with five tasks that could theoretically be accomplished by analyzing this first-person imagery:

  • Episodic memory: tracking objects and concepts in time and space, so that arbitrary questions like “where are my keys?” can be answered.

  • Forecasting: understanding sequences of events, so that questions like “what's next in the recipe?” can be answered, or so things can be flagged preemptively, such as “you left your car keys in the house.”

  • Hand-object interaction: identifying how humans grasp and manipulate objects and what happens when they do, which could feed into episodic memory or perhaps inform the actions of a robot meant to mimic them.

  • Audiovisual diarization: connecting sound with events and objects, so that speech or music can be tracked intelligently in situations like asking what song was playing in the café, or what the boss said at the end of the meeting. (“Diarization” is their word.)

  • Social interaction: understanding who is speaking to whom and what is being said, both to inform the other processes and for immediate uses like live captions in a noisy room with several people.
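To make the first of these tasks a bit more tangible: in its simplest imaginable form, episodic memory reduces to a lookup over timestamped sightings. The sketch below is purely hypothetical; the names and data layout are invented for illustration and bear no relation to Ego4D's actual benchmark format.

```python
# Hypothetical sketch of the episodic-memory idea: if a system logs
# timestamped sightings of objects, "where are my keys?" becomes a
# lookup of the most recent sighting. Structure invented for illustration.

sightings = [
    (10.0, "keys", "kitchen counter"),
    (42.5, "phone", "sofa"),
    (97.2, "keys", "hallway table"),
]

def last_seen(obj, log):
    """Location of the most recent sighting of `obj`, or None."""
    hits = [(t, place) for t, name, place in log if name == obj]
    return max(hits)[1] if hits else None

print(last_seen("keys", sightings))    # hallway table
print(last_seen("wallet", sightings))  # None
```

The hard part, of course, is producing those sightings from raw egocentric video in the first place; that perception step is what the benchmark actually measures.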

Of course, these are not the only possible applications or benchmarks, just an initial set of ideas for testing whether a given AI model actually grasps what is happening in first-person video. The Facebook researchers performed a baseline run on each task, described in their paper, which serves as a starting point. The video summarizing the research also offers somewhat pie-in-the-sky examples of what each of these tasks would look like if it succeeded.

While the 3,000 hours, carefully hand-annotated over 250,000 researcher-hours, as Grauman was careful to point out, are an order of magnitude more than what is out there now, there is still plenty of room to grow, she noted. The team plans to expand the dataset and is actively adding partners as well.

If you're interested in using the data, keep an eye on the Facebook AI Research blog and perhaps get in touch with one of the many, many people listed on the paper. It will be released in the coming months, once the consortium works out exactly how to manage that.
