
Who is Mistaken? Carl Vondrick MIT.




Who is Mistaken?

Benjamin Eysenbach (MIT), Carl Vondrick (MIT), Antonio Torralba (MIT)

arXiv: v1 [cs.CV] 4 Dec 2016

Abstract

Recognizing when people have false beliefs is crucial for understanding their actions. We introduce the novel problem of identifying when people in abstract scenes have incorrect beliefs. We present a dataset of scenes, each visually depicting an 8-frame story in which a character has a mistaken belief. We then create a representation of characters' beliefs for two tasks in human action understanding: predicting who is mistaken, and when they are mistaken. Experiments suggest that our method for identifying mistaken characters performs better on these tasks than simple baselines. Diagnostics on our model suggest it learns important cues for recognizing mistaken beliefs, such as gaze. We believe models of people's beliefs will have many applications in action understanding, robotics, and healthcare.

1. Introduction

In Figure 1, one person has a mistaken belief about their environment. Can you figure out who is mistaken? You likely can tell the woman is about to sit down because she incorrectly believes the chair is there. Although you can see the complete scene, the character inside the scene has an imperfect view of the world, causing an incorrect belief.

Figure 1: Can you determine who believes something incorrectly in this scene? In this paper, we study how to recognize when a person in a scene is mistaken. Above, the woman is mistaken about the chair being pulled away from her in the third frame, causing her to fall down. The red arrow indicates false belief. We introduce a new dataset of abstract scenes to study when people have false beliefs. We propose approaches to learn to recognize who is mistaken and when they are mistaken.

The ability to recognize when people have incorrect beliefs will enable several key applications in computer vision, such as in action understanding, robotics, and healthcare. For example, understanding beliefs could prevent accidents by warning people who are unaware of danger, help robots to have more fluid interactions with humans [15], provide clues for anticipating human actions [14, 27], and better generate visual humor [6]. How do we give machines the capability to understand what a person believes?

In this paper, we take a step towards understanding visual beliefs. We introduce the novel problem of recognizing incorrect beliefs in short visual stories. We propose two new tasks aimed at understanding which people have false beliefs. Given a visual story, we aim to recognize who is mistaken and when they are mistaken. For example, in Figure 1, the woman is mistaken in the third frame.

To study this problem, we present a dataset of abstract scenes [34] that depict visual stories of people in various types of everyday situations. In each story, one or more people have mistaken beliefs, and we seek to recognize these people. Abstract scenes are ideal for studying this problem because we can economically create large datasets that focus on high-level human activities, such as ones influenced by people's beliefs. The scenarios in our dataset are diverse and characters are mistaken for many reasons, such as occlusion or unexpected actions.

We present a simple model for learning to recognize mistaken characters in short sequences. Our model uses person-centric representations of scenes and combines information across several timesteps to better recognize mistaken characters. Experiments show that our model learns to recognize people's beliefs better than baselines, suggesting that it is possible to make progress on understanding visual beliefs. Although we only train our model to predict mistaken beliefs, experiments suggest that our model is internally learning important cues for beliefs, such as human gaze.

The first contribution of this paper is introducing two new computer vision tasks for understanding people's beliefs in short visual sequences of images. The second contribution is the release of a new dataset for both training and evaluating models for recognizing beliefs. The third contribution is a model for starting to tackle these belief tasks. In the remainder of this paper, we describe these contributions in detail. In Section 2, we review related work. In Section 3, we describe our new dataset and its collection procedure. In Section 4, we motivate the who and when tasks. In Section 5, we present several simple baseline models. In Section 6, we show experiments to analyze our models. We believe recognizing mistaken people in scenes can have a large impact in many applications, and we hope this paper will spur additional progress on this important problem. Code, models, data, and animated scenes will be made available.

Figure 2: Visual Beliefs Dataset: We introduce a new dataset of abstract scenes to study visual beliefs. We show five example scenes from our dataset. The red arrows indicate that a person has a false belief in that frame. Each scene (row) contains eight images, depicting a visual story when read left to right. The caption below each scene was collected during annotation for visualization purposes only. The captions of the five example scenes are: "woman wonders where her food went"; "The woman thinks the boy broke the painting, but it was the girl."; "The girl thought the boy would get off the teeter totter safely."; "The Blonde Man Thinks the Dusty-Haired Boy is Flirting with Him"; "The couple mistakenly thinks it's ok to eat the mushrooms."

2. Related Work

Beliefs and Intentions: Our paper builds on several works that study the beliefs of people. In psychology, previous work has focused on theory of mind, that is, people's ability to reason about the beliefs of others. [24] note that people use gaze-following for theory of mind, and failing to solve this problem may indicate a disability. From a computational perspective, [23] study theory of mind in robotics and its applications to human-robot interaction. [30] explore people's intentions in real-world surveillance footage. [3] propose a Bayesian model for learning beliefs based on a partially-observable Markov Decision Process. [33] propose using probabilistic programming to infer the beliefs and desires of people in RGBD videos. Our approach similarly captures the beliefs of people, but we focus on how these beliefs differ from reality. We focus on learning the beliefs of characters directly from visual scenes.

Common Sense: Our work complements efforts to learn common sense. Understanding how people interact with each other and their world is an important step towards identifying their beliefs. [31] extract common sense from object detection corpora, while [8] learn visual common sense by browsing the Internet. [26] use abstract clip art images to learn how people, animals and objects are likely to interact. [17], [29] and [19] learn a model for physics given videos of colliding objects. [1] explore understanding social interactions in crowded spaces. [21] explore causality in unconstrained video to understand social games. In this work, we study the subset of common sense related to visual beliefs.

Activity Understanding: Our work is related to activity understanding in vision [5, 28, 7, 20, 10]. Systems for understanding human actions typically leverage a variety of cues, such as context, pose, or gaze [22]. Our work complements action understanding in two ways. First, we study visual beliefs, which may be a useful signal for better understanding people's activities. Second, recognizing visual beliefs often requires an understanding of people's actions.

Abstract Images: We take advantage of abstract images pioneered by [34], which have become popular for studying high-level vision tasks. [2] use abstract images to learn features that encode the semantic similarity between images. [6] use abstract images to detect visual humor. [32] explore binary question-answering in abstract scenes, and [11] learn to predict object dynamics in clip art. While these approaches reason about image-level features and semantics, our approach looks at character-level features. Importantly, two characters in the same scene can have different beliefs about the world, so each character should have a different character-level feature. Additionally, we extend this previous work to multi-frame scenes depicting visual stories.

Figure 3: Dataset Statistics: We summarize statistics of our dataset. (left) Our dataset contains more not-mistaken characters than mistaken characters, but the ratio of mistaken to not-mistaken characters is the same for characters facing left and right. (middle) We show the (x, y) location of every character in every frame. The distribution for mistaken characters and not-mistaken characters appears similar. (right) There is a slight bias for people to be mistaken towards the end of the video. We compare against (and outperform) baselines that use this bias.

3. Dataset

We collected a dataset of abstract scenes to study the beliefs of characters. Each scene in our dataset consists of a sequence of 8 frames, and shows an everyday situation. In each scene, one or more people believe something incorrectly about their environment. There are many reasons why a person may have a false belief, including occlusion and misinterpreting intentions.
Although the characters inside the scenes do not know if they are mistaken, we designed the dataset so that third-party viewers can clearly recognize who is mistaken.

Our dataset complements existing abstract scene datasets. For example, our dataset builds upon the abstract scenes VQA dataset [2] in two ways: first, frames in our dataset are grouped into scenes telling stories over several timesteps; second, characters in our dataset frequently have mistaken beliefs. Additionally, our dataset complements the abstract humor dataset [6]. People with false beliefs may cause humor in a scene.

We believe abstract scenes provide a good benchmark for studying visual beliefs. We originally tried to collect a dataset of real videos containing people with false beliefs (such as suspense movies), but we encountered significant difficulty scaling up dataset collection. While many real videos contain characters with mistaken beliefs, these beliefs are very complex. This complexity made large-scale annotation very expensive. We believe abstract scenes are suitable for understanding visual beliefs today because they allow the field to gradually scale up complexity on this important problem. In this work, we focus on mostly obvious false beliefs in abstract scenes.

We use our dataset for both learning and evaluation of models for detecting mistaken characters in scenes. We show a few examples of our dataset in Figure 2 and summarize statistics in Figure 3. We collected this dataset on Mechanical Turk [25]. First, we asked workers to illustrate scenes. Then, we asked workers to annotate mistaken characters. In the remainder of this section, we describe how we built this dataset. The appendix contains additional details.

Generating Scenes

In the illustration step, workers were told to drag and drop clipart people and objects into eight frames to tell a coherent story. The interface was a modified version of [2].
We told workers that at least one frame should contain a character who has a mistaken belief about the world. In addition to illustrating these eight frames, workers also wrote a scene-level description and eight frame-level descriptions. These descriptions were used during the annotation step, but were not used to train or evaluate our models. The illustration interface was designed to ensure the scenes were diverse. First, the background of each scene was randomly chosen to be indoors or outdoors. Second, the people, animals, and objects an illustrator could choose to add were randomly chosen.

Annotation

In the annotation step, the goal was to label which characters have mistaken beliefs. We hired workers to review the previously illustrated scenes, and write one yes/no question for each frame. For each frame, workers wrote the true answer to the question and the answer according to each of the characters. We labeled a character as mistaken if its answer was different from the true answer.

In total, we collected 1,496 scenes, 1,213 of which passed our qualification standards. These scenes were the collective effort of 215 workers. On average, each frame contains 1.71 characters, and an average of 0.53 characters are mistaken per frame; characters are mistaken in 31.03% of frames. A pool of 237 workers annotated each scene twice. The labels for whether a character was mistaken were consistent between workers 61.15% of the time, indicating that in some scenes it was unclear whether a character was mistaken. In this paper, we only consider scenes where characters are clearly mistaken or not.

Quality Control

In pilot experiments, we found that many workers misunderstood the task. We used three tools to filter out bad scenes and teach workers how to create good scenes. First, workers were required to complete a qualification quiz before starting the illustration and annotation steps. In the quiz for the illustration task, workers identified good and bad scenes.
In the quiz for the annotation step, workers filled in characters' answers for a scene with preselected questions. These quizzes forced workers to think about the beliefs of characters. Adding these quizzes significantly increased the quality of our data. Second, for the illustration task we manually reviewed the first scene completed by each worker. If the scene was incoherent or did not contain a mistaken character, we disallowed the worker from illustrating more scenes. Third, we showed annotators the scene-level and frame-level descriptions for the scenes they were annotating, helping them understand what the illustrator intended.

What Causes Mistaken Beliefs?

What causes characters to have mistaken beliefs? Figure 2 shows a few scenes from our dataset that highlight different types of mistaken beliefs. In the first scene, the woman is mistaken because the dog is occluded behind the couch, and because she cannot see actions outside her field of view. In the second scene, the woman falsely accuses the boy of breaking the painting because she cannot observe events when she is not present. The girl in the third scene mistakenly assumes the boy can safely get off the teeter totter because of her faulty reasoning about physics. In the fourth scene, the boy wearing a red shirt misinterprets the intentions of the other boy. In the last scene, the woman wearing the red shirt lacks the common sense that some mushrooms are poisonous.

Recognizing mistaken characters requires detecting each of these types of mistaken beliefs. To recognize characters with these types of incorrect beliefs, models will need a rich understanding of the scene, such as action recognition, gaze following, physical reasoning, and common sense knowledge. In this work, we take the first step and tackle a subset of these challenges.

4. Belief Tasks

We study two tasks for recognizing mistaken people:

Task 1: Who is mistaken?
In this task, given a scene and a character, the goal is to predict whether the character is mistaken in any frame in the scene. This task has several applications in identifying people who may be confused or unaware of danger.

Task 2: When are they mistaken?
In this task, given a frame, the goal is to predict whether any character is mistaken in this frame. This task has applications in identifying when people might be confused in settings where it is not possible to know who is confused, such as in a crowd.

Joint Task: We also explore a joint task where we seek to simultaneously recognize who is mistaken as well as localize when they are mistaken in time.

5. Method

We now describe an approach for predicting who is mistaken and when they are mistaken. Recognizing mistaken characters requires looking beyond a single frame, as knowledge of the past or knowledge of the future can provide important signals for recognizing mistakes. For example, in the second scene of Figure 2, a model must see that the woman was not present when the girl broke the painting to understand why she falsely accused the boy. Our model for detecting mistaken characters will look at the past, present, and future. The model must also understand what a person may know and what they might not. To understand that a person is mistaken, the model should be able to determine that the scene is different from what the person believes.

Person-Centric Representation

Before predicting whether a character is mistaken, we must tell our model which character to focus on. We use a person-centric representation of the world, where the model takes the perspective of an outside observer focusing on a specific character. For each frame in the scene, we center the frame at the head of the specified character. We also flip the frame so the specified character always faces left. For example, in Figure 4, the frame in the upper left can be viewed from each of the three characters' perspectives.
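The centering and flipping steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `person_centric`, the enlarged canvas of size (2H-1, 2W-1), and the zero fill value are all hypothetical choices made here so that any head position can be centered without cropping the frame.

```python
import numpy as np

def person_centric(frame, head_xy, faces_left, pad_value=0):
    """Re-center `frame` on a character's head and make the character face left.

    frame:      array of shape (H, W) or (H, W, C)
    head_xy:    (x, y) pixel position of the character's head (assumed known)
    faces_left: True if the character already faces left
    """
    h, w = frame.shape[:2]
    x, y = head_xy
    if not faces_left:
        # Mirror horizontally so the character faces left; the head column moves too.
        frame = frame[:, ::-1]
        x = w - 1 - x
    # Hypothetical canvas size: large enough to center any head without cropping.
    canvas = np.full((2 * h - 1, 2 * w - 1) + frame.shape[2:], pad_value,
                     dtype=frame.dtype)
    top, left = (h - 1) - y, (w - 1) - x
    canvas[top:top + h, left:left + w] = frame
    return canvas  # the head now sits at canvas[h - 1, w - 1]

# Usage: one view per character of the same frame.
frame = np.arange(12).reshape(3, 4)
view = person_centric(frame, head_xy=(1, 2), faces_left=False)
```

Because the whole frame is kept and only shifted and mirrored, content outside the character's field of view remains visible to the model in every character's view.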
Alternative approaches that remove parts of the frame outside the character's field of view may struggle to reason about what the character cannot see.

Figure 4: Person-Centric Representation: We use a visual representation that focuses on the character of interest. (Panels: Original, Woman's Perspective, Man's Perspective, Boy's Perspective.)

Visual Features

We use a frame-wise approach by extracting visual features for each frame and concatenating them temporally to create a time-series. We extract visual features from the person-centric images using the AlexNet convolutional network [16] trained on ImageNet [9]. We use activations from POOL5, and further downsample by a factor of two. The resulting feature has size (256, 12, 21). One alternative is to use a handcrafted representation which exploits the fact that we have parameters for the rendering model of our abstract scenes. Using natural image features may allow for easier domain adaptation to real images. Moreover, although the features we use are trained on natural images (i.e. ImageNet), we found success at using them in abstract scenes, possibly because the quality of abstract scenes is high enough.

Learning

To learn to predict whether a person is mistaken or not, we train a regularized convolutional logistic regression model, supervised by annotations from our training set. Suppose our image sequences are length $T$ and our features are $D$ dimensional. Let $\phi(x_i, p_j) \in \mathbb{R}^{T \times D}$ represent the features for sequence $x_i$ for person $p_j$, and let $y_{i,j} \in \{0, 1\}^T$ be our binary target, indicating whether person $p_j$ is mistaken in each frame of sequence $x_i$. Our vector of predictions is $\hat{y}_{i,j} \in \mathbb{R}^T$. We optimize the objective:

$$\min_w \; -\sum_{i,j,t} \left( y_{i,j}^t \log \hat{y}_{i,j}^t + (1 - y_{i,j}^t) \log (1 - \hat{y}_{i,j}^t) \right) \qquad (1)$$

where $\hat{y}_{i,j}^t = \left( w * \phi(x_i, p_j) + b \right)^t$. The learned weight vector $w \in \mathbb{R}^{K \times D}$ represents the convolutional kernel, where the parameter $K$ specifies the temporal width; $b \in \mathbb{R}$ is the learned bias. For simplicity, we omit the L2 penalty on $w$.
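To make objective (1) concrete, the following NumPy sketch computes per-frame predictions and the cross-entropy loss for one character in one scene. It is a hedged illustration rather than the authors' code: the function names `predict_mistaken` and `cross_entropy` are hypothetical, the sigmoid is an assumption added here so that the predictions are valid probabilities inside the logarithms, and the centered alignment of the kernel over zero-padded borders is likewise an assumption. The L2 penalty is omitted, as in the text.

```python
import numpy as np

def predict_mistaken(phi, w, b):
    """Per-frame probability that a character is mistaken.

    phi: (T, D) features of one scene for one character
    w:   (K, D) temporal kernel of width K
    b:   scalar bias
    """
    T = phi.shape[0]
    K = w.shape[0]
    # Zero-pad along time so that every frame gets one prediction.
    before = (K - 1) // 2
    after = K - 1 - before
    padded = np.pad(phi, ((before, after), (0, 0)))
    # Slide the kernel over time (cross-correlation; for a learned kernel this
    # differs from convolution only by reversing w along the time axis).
    scores = np.array([np.sum(w * padded[t:t + K]) for t in range(T)]) + b
    # Assumption: logistic link mapping scores to probabilities.
    return 1.0 / (1.0 + np.exp(-scores))

def cross_entropy(y, y_hat, eps=1e-12):
    """Negative log-likelihood summed over frames, as in objective (1)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Usage with random stand-in features (T = 8 frames, D = 16, K = 3).
rng = np.random.default_rng(0)
phi = rng.normal(size=(8, 16))
w = rng.normal(size=(3, 16)) * 0.1
y = np.array([0, 0, 1, 1, 1, 0, 0, 0])
loss = cross_entropy(y, predict_mistaken(phi, w, b=0.0))
```

In practice the loss would be summed over all scenes and characters and minimized with a gradient-based optimizer; a single-frame model corresponds to the special case K = 1.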
The superscript $(\cdot)^t$ gives the entry of a vector corresponding to frame $t$ in a scene. We denote convolution as $*$, which is performed temporally. To handle border effects, we pad these features with zeros. The convolutional structure of our model encodes our prior that characters' beliefs are temporally invariant.

                          Task
Method           Who+When     Who          When
Chance
Time             62.9 (1.9)   52.4 (1.8)   64.3 (2.2)
Pose             51.9 (2.1)   50.3 (3.5)   54.8 (1.9)
Time+Pose        60.6 (2.0)   51.6 (1.2)   61.2 (1.9)
Single Image     61.1 (1.7)   59.7 (3.3)   62.0 (2.0)
Multiple Image   66.6 (1.8)   64.1 (2.8)   67.5 (1.8)

Table 1: Quantitative Evaluation: We evaluate the accuracy of our model versus various baselines on the who task, the when task, and the j