Language is only half the story of communication. Much of our communication is embodied in our interaction with the world through gaze and gesture, and in our interpretation of our situation in terms of experienced and recognized emotions. We must therefore pay close attention to the communication of emotions through the colouring of our speech and the livening of our face with appropriate expression. We also need to be sensitive to the deictic and interpretive cues that are conveyed through our eye gaze, facial expressions, hand movements and body language.
Artificial Intelligence and Image Processing research has in recent years developed a focus on the recognition of emotions as expressed through facial gestures or expressions, conscious and unconscious. Facial feature extraction from images, and preserving or simulating a facial expression in a synthesized face, are key steps in developing a system capable of effective human-computer interfacing: by altering their features into different facial expressions during conversation, humans are able to get their emotions across and affect the flow of a spontaneous conversation. A raised eyebrow, or a mouth contorted in a smirk or angry snarl, can convey a person’s emotional state. Humans have the ability to detect and interpret such facial movements and adapt their response in seconds or even milliseconds. In the context of our Teaching Head experiments, just providing appropriate rather than neutral or inappropriate facial expressions, during an otherwise identically delivered lesson, can make a whole grade point of difference in the students’ results.
Recognition and Synthesis of Facial Gestures
The problem of recognizing emotion from the facial expression in a single image has been turned into a straightforward Machine Learning problem by the availability of a number of databases, or corpora, consisting of multiple images for a range of subjects, for each of 6 putative basic emotions plus an additional neutral case. Whilst we work with these databases, and have achieved promising results and new optimizations for the Image Processing and Machine Learning tasks involved, the databases themselves, and the paradigm they represent, have a number of limitations. Some of the most common techniques are holistic and somewhat simplistic, using Principal Component Analysis (PCA) on whole images. In collaboration with associate investigators at Beijing University of Technology, we have been exploring the limits of this technique, applying it to smaller components of the image and then fusing the results, with good success, as well as exploring the use of appropriate image processing and dimension reduction techniques.
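The idea of applying PCA to smaller components of the image and then fusing the results can be sketched as follows. This is only a minimal illustration under stated assumptions, not the pipeline used in the collaboration: the 2 x 2 patch grid, the nearest-centroid classifier, the majority-vote fusion, all function names and the synthetic "images" are hypothetical choices made for the example.

```python
import numpy as np

def fit_pca(X, k):
    """Return (mean, components) for the rows of X; components has shape (k, d)."""
    mean = X.mean(axis=0)
    # SVD of the centred data: the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def project(X, mean, comps):
    """Coordinates of the rows of X in the PCA subspace."""
    return (X - mean) @ comps.T

def split_patches(images, grid=2):
    """Cut (n, h, w) images into grid x grid patches, each flattened to a row."""
    n, h, w = images.shape
    ph, pw = h // grid, w // grid
    return [images[:, r * ph:(r + 1) * ph, c * pw:(c + 1) * pw].reshape(n, -1)
            for r in range(grid) for c in range(grid)]

def train(images, labels, k=3, grid=2):
    """Fit one PCA model and one set of class centroids per patch."""
    classes = np.unique(labels)
    models = []
    for P in split_patches(images, grid):
        mean, comps = fit_pca(P, k)
        Z = project(P, mean, comps)
        centroids = np.stack([Z[labels == c].mean(axis=0) for c in classes])
        models.append((mean, comps, classes, centroids))
    return models

def predict(models, images, grid=2):
    """Classify each patch by nearest centroid, then fuse by majority vote."""
    votes = []
    for (mean, comps, classes, centroids), P in zip(models, split_patches(images, grid)):
        Z = project(P, mean, comps)
        dist = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        votes.append(classes[dist.argmin(axis=1)])
    votes = np.stack(votes)  # shape (n_patches, n_images)
    return np.array([np.bincount(v).argmax() for v in votes.T])

# Synthetic stand-in for an expression corpus: two classes of noisy 8x8 images.
rng = np.random.default_rng(0)
templates = rng.normal(size=(2, 8, 8))
labels = np.repeat([0, 1], 20)
train_imgs = templates[labels] + 0.1 * rng.normal(size=(40, 8, 8))
test_imgs = templates[labels] + 0.1 * rng.normal(size=(40, 8, 8))

models = train(train_imgs, labels)
predicted = predict(models, test_imgs)
```

Majority voting is only one possible fusion rule; weighting each patch by its validation accuracy, or concatenating the per-patch projections before classification, are equally plausible alternatives.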
Another approach to recognizing expressions is to recognize the individual facial gestures using Active Appearance Models (AAM) that are supposed to track individual points on the face, including in particular the mouth and eye areas. Theoretically, an AAM is capable of modelling and reconstructing any human face, including any facial expression or gesture displayed by the subject. Emotions in human subjects are classified by assigning the juxtaposition of facial features in a particular expression to one of a set of emotion bins, which represent the six globally recognised emotions: anger, fear, disgust, joy, sadness and surprise. The AAM may also be used to model the speech gestures for purposes of lip-reading, or for synthesis of the visemes we use in the Thinking Head, Head X.
The 6 basic emotions may also be used for synthesis, often based on Active Appearance Models, which is again what we use for emotion expression in Head X. Because it depends on accurate identification and tracking of these keypoints, and on the use of these tracks or their deviation from the home or normal position, recognition of an expression in an individual photo of an unknown person remains very difficult, and even for a known person can be subject to significant error. However, the average over many images and many subjects delivers a standardized emotion signature that can be used to allow our Thinking Heads to express not only the 6 basic emotions illustrated below, but also arbitrary mixes of them, the so-called hybrid emotions.
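The notion of an averaged emotion signature, and of hybrid emotions as mixes of signatures, can be illustrated with a small sketch. It assumes that a signature is simply the mean displacement of tracked keypoints from their neutral positions and that a hybrid is a weighted sum of signatures; the function names and the two-keypoint toy data are hypothetical, and this is not a description of how Head X is implemented.

```python
import numpy as np

def emotion_signatures(neutral, examples):
    """Average keypoint displacement from neutral for each emotion.

    neutral:  array (n_points, 2) of keypoint positions for the neutral face.
    examples: dict mapping emotion name -> array-like (n_images, n_points, 2).
    """
    return {name: np.asarray(pts, dtype=float).mean(axis=0) - neutral
            for name, pts in examples.items()}

def blend(neutral, signatures, weights):
    """Neutral face plus a weighted sum of emotion signatures (a hybrid emotion)."""
    face = np.array(neutral, dtype=float)
    for name, w in weights.items():
        face += w * signatures[name]
    return face

# Toy data: two keypoints, image coordinates with y increasing downwards.
neutral = np.zeros((2, 2))
examples = {
    "joy": [[[0, -1], [0, 0]], [[0, -3], [0, 0]]],  # first keypoint rises
    "surprise": [[[0, 0], [0, -4]]],                # second keypoint rises
}
signatures = emotion_signatures(neutral, examples)
hybrid = blend(neutral, signatures, {"joy": 0.5, "surprise": 0.5})
```

With weights of 0.5 each, the hybrid face moves every keypoint half as far as the corresponding pure expression would.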
For both recognition and synthesis purposes, better accuracy may be obtained with models, including both PCA and AAM, that are based on movie snippets (that is, sequences of images) rather than single images. A more general and powerful technique available for moving images is optical flow, in which the movement of small patches from one frame to the next is represented by a motion vector. This technique also allows for the estimation of depth given known motion (most typically where either the camera or the object is stationary), in much the same way that stereo disparity is found between two simultaneous images taken a known distance apart. The points that move the fastest or slowest in a particular direction are typically those that are useful for an articulation or gesture model.
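The patch-to-motion-vector idea behind optical flow can be sketched with exhaustive block matching. This is a deliberately simple stand-in, assuming a sum-of-squared-differences cost and a small square search window; production optical flow methods are considerably more refined, and the function name and parameters here are hypothetical.

```python
import numpy as np

def patch_motion(prev, curr, y, x, size=5, radius=3):
    """Motion vector (dy, dx) of the size x size patch with top-left corner (y, x).

    Searches curr within +/- radius pixels for the patch that best matches
    the reference patch in prev, using the sum of squared differences.
    """
    ref = prev[y:y + size, x:x + size].astype(float)
    best, best_cost = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            # Skip candidate patches that fall outside the frame.
            if yy < 0 or xx < 0 or yy + size > curr.shape[0] or xx + size > curr.shape[1]:
                continue
            cand = curr[yy:yy + size, xx:xx + size].astype(float)
            cost = np.sum((ref - cand) ** 2)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best

# Synthetic frames: the second is the first shifted 1 pixel down and 2 right.
rng = np.random.default_rng(1)
prev = rng.random((20, 20))
curr = np.roll(prev, shift=(1, 2), axis=(0, 1))
motion = patch_motion(prev, curr, y=6, x=6)
```

Repeating this for a grid of patches gives a dense field of motion vectors, from which the fastest and slowest moving points can be picked out.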
A related approach is to look for points of interest in an individual image, or in a sequence of images. However, many interest point detector approaches are ill suited for use in 3D environments because they were originally designed for 2D applications. Interest point detectors have mainly been developed to take advantage of lines, corners, ridges and blobs, but this biases their effectiveness to environments and objects that have these distinguishing features in them.
Hand Gestures and Body Language
The visual techniques we have discussed are not limited to the face, and indeed the later techniques are borrowed from a more general application of image and video processing. Emotions are not expressed only by the face either, and application of similar techniques to the hand in particular, and the body in general, is also important. Genetic programming techniques can derive novel interest points in an image that do not depend on conventional features or hand-identified points of interest. These interest points are robust against various lighting conditions, as well as distortion or rotation. They are also repeatable and can be found when the same scene is shown with angular distortions.
We have developed a new genetic interest point detector algorithm that combines a grammar-guided search process with intermediate caching of results to minimize the total number of required detector evaluations. The fitness function uniquely uses depth information within a virtual 3D environment to measure the effectiveness of repeatable feature detections as a scene changes and leverages other aspects of 3D environments to better gauge interest point repeatability. This facilitates evolutionary exploration of the search space and produces interest point detectors that are more robust when handling 3D environments, even when depth data is not directly provided to the interest point detector.
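To make the idea of a depth-aware repeatability measure concrete, the sketch below scores a detector by reprojecting its detections from one view into another using known depth and camera motion, then counting how many are found again. It is an illustration only, not the fitness function of the algorithm described above: the pinhole camera with its principal point at the origin, the pure camera translation, the pixel tolerance and all names are assumptions.

```python
import math

def reproject(u, v, depth, f, t):
    """Map pixel (u, v) with known depth into a camera translated by t = (tx, ty, tz).

    Assumes a pinhole camera with focal length f (in pixels) and the
    principal point at the image origin.
    """
    X, Y, Z = u * depth / f, v * depth / f, depth   # back-project to 3D
    Xc, Yc, Zc = X - t[0], Y - t[1], Z - t[2]       # express in the moved camera
    return f * Xc / Zc, f * Yc / Zc                 # project again

def repeatability(points_a, depths_a, points_b, f, t, tol=1.5):
    """Fraction of detections in view A found again within tol pixels in view B."""
    if not points_a:
        return 0.0
    hits = 0
    for (u, v), d in zip(points_a, depths_a):
        pu, pv = reproject(u, v, d, f, t)
        if any(math.hypot(pu - bu, pv - bv) <= tol for bu, bv in points_b):
            hits += 1
    return hits / len(points_a)

# Two detections in view A; only the first is detected again in view B.
score = repeatability(
    points_a=[(10.0, 0.0), (0.0, 20.0)],
    depths_a=[2.0, 4.0],
    points_b=[(5.0, 0.0), (50.0, 50.0)],
    f=100.0,
    t=(0.1, 0.0, 0.0),
)
```

In a virtual 3D environment the depths and the camera motion are known exactly, so a score of this kind can be computed without any manual annotation.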
Another technique that is useful for looking at the motion of the human body is to project an array of laser dots onto the scene, or to actually fix dots to parts of the human body for motion capture. These dots may be visible or infrared. We have been experimenting with visible laser dots, and more recently with the Microsoft Kinect, which projects a dense matrix of infrared dots. The dots are viewed by an infrared camera a known distance away from the laser, and distance can be calculated from the disparity of each dot from its "at infinity" position, as with stereo cameras.
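The disparity-to-distance calculation mentioned above follows the standard triangulation relation Z = f b / d, where f is the focal length in pixels, b the baseline between projector and camera, and d the disparity in pixels. A minimal sketch, with illustrative numbers that are not the specifications of any particular device:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Distance in metres to a dot, by triangulation: Z = f * b / d.

    disparity_px: shift of the dot from its "at infinity" position, in pixels.
    focal_px:     camera focal length, in pixels.
    baseline_m:   distance between projector and camera, in metres.
    """
    if disparity_px <= 0:
        return float("inf")  # no shift means the dot is effectively at infinity
    return focal_px * baseline_m / disparity_px

# Hypothetical values: 600 px focal length, 10 cm baseline, 30 px disparity.
distance = depth_from_disparity(30.0, focal_px=600.0, baseline_m=0.1)
```

Because depth is inversely proportional to disparity, a one-pixel error matters far more for distant dots than for near ones.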
Eye Gaze and Pointing
In fact, it is not just the hand that points – we can point with our nose, or most typically, we can indicate something just by looking at it. Of course this, like most of our gestures, is usually largely unconscious. Eye gaze is very important to young children as they learn the meaning of language, and babies look at their mothers' faces and eyes, work out what mum is looking at, and then look at it themselves to ground the concepts they are also hearing linguistically.
The eyes are actually somewhat easier to find than the mouth, and finding a face and mouth often includes locating the eyes as part of the process. Locating the pupil and iris in relation to the eyeball allows relatively accurate identification of gaze. Understanding and utilizing gaze is thus also part of our broader understanding of gesture, with directing the eyes at something being just as much a gesture as moving the lips to smile, or moving the mouth to say a word. All of these fall into the general class of what we call speech gestures.
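How the pupil position relative to the eyeball translates into a gaze direction can be sketched with a simple spherical eye model. This assumes the eyeball is a sphere of known radius in pixels, viewed front-on, so that the pupil offset from the projected eyeball centre equals the radius times the sine of the gaze angle. The function name and the example numbers are hypothetical, and practical gaze trackers also need head pose and per-user calibration.

```python
import math

def gaze_angles(pupil, eye_centre, eyeball_radius):
    """Approximate gaze (yaw, pitch) in degrees from the pupil offset.

    pupil, eye_centre: (x, y) image coordinates in pixels, y increasing downwards.
    eyeball_radius:    eyeball radius in pixels (assumed known or calibrated).
    Positive yaw means towards increasing x; positive pitch means downwards.
    """
    def angle(offset):
        # Clamp to [-1, 1] so measurement noise cannot break asin.
        s = max(-1.0, min(1.0, offset / eyeball_radius))
        return math.degrees(math.asin(s))

    return angle(pupil[0] - eye_centre[0]), angle(pupil[1] - eye_centre[1])

# Pupil 6 px to the right of the eyeball centre, eyeball radius 12 px.
yaw, pitch = gaze_angles(pupil=(106.0, 50.0), eye_centre=(100.0, 50.0), eyeball_radius=12.0)
```

Intersecting the resulting gaze ray with the scene is what turns a gaze estimate into a pointing gesture.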
From Desks to Gyms, from Soldiers to Museums
The classic concept of speech recognition or emotion recognition is someone seated and talking to another person, or working with a computer. The Kinect is a relatively new peripheral for the Xbox 360, and is designed to take gaming away from the desk and into a much more physical world, and it enables our robots and heads to see the world in a whole new way too.
When we are working, or working out, there are issues of cognitive load and general workload that are very closely allied to our emotions and expressions, and indeed can also be reflected in our expressions. This also has an effect on how effectively we are operating, and is related to how well we have learned the skills we are learning, which has consequences for how much effort we need to put into a task and how much is left over for other things. This area of situation awareness is very important for the elite sportsperson or soldier, and we are working in both the sports and the defence areas on ways of characterizing these more cognitive aspects of emotion. Busyness and boredom are others of the many attributes that we call cognitive emotions.
Our research in emotions feeds directly into our Thinking and Teaching Head research. In particular, our AVAST project directly teaches social skills to children with disabilities, including helping them to watch for, interpret and produce expressions and gestures appropriately. Our work has also been used in public artistic performances. In an interactive and award-winning display at the Powerhouse Museum, developed in collaboration with colleagues at the University of Western Sydney, we displayed ideas or thought bubbles that were related in meaning to the words used in conversations between museum-goers and the Head, while surrounding musical sounds in the room were selected to match the emotions being expressed, giving some viewers a sense of being immersed in the inner world of the Head's thoughts and feelings.
Audiovisual & Brainmuscle Computer (ABC) Interfaces
On this page we have discussed only the visual aspects of our work. Clearly emotion is also expressed in auditory speech, and we are also exploring the use of EEG for assessing emotional state, cognitive load, situation awareness and skill acquisition, and have demonstrated a number of basic capabilities in the area of learning to play a game or shoot a gun. An important aspect of this work is distinguishing the parts of the signal that are really brain signal (EEG) from those that are actually muscle (EMG, EOG, ECG, etc.). These are the subject of separate projects in Audio/Speech Signal Processing and Brain Computer Interface and Medical Imaging.