- Main investigator:
- Lapedriza Garcia, Àgata
- Information and communication technologies
- Area of specialization:
- Information and communication technologies
- Affiliation center:
- Faculties eHealth Center
- UNESCO codes:
- 120304, 120302, 120317, 120320, 120601
- Collaborates with:
- e-Health Center
The amount of digital information available online has grown exponentially in recent years. As a result, semantic search has become one of the most serious problems in this context. Text data can now be searched quickly and conveniently, but the problem is far from solved for audiovisual data.
The AIWELL group develops computer vision and artificial intelligence algorithms to extract information from static images or videos. The group works specifically on:
- Algorithms for the automatic recognition of objects in natural images for their subsequent classification and use in spontaneous environments as well as their cognitive interpretation.
- Algorithms for the recognition of emotions, gestures and non-verbal language, using images and videos of people, to construct user-friendly human-machine interaction interfaces and analyse social interactions between people.
- Applications of vision to the automation of processes that require advanced artificial intelligence.
Understanding complex visual scenes is one of the hallmark tasks of computer vision. Given a picture or a video, the goal of scene understanding is to build a representation of its content (i.e., which objects it contains, how they are related, what actions any people in it are performing, what place it depicts, etc.).
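As a rough illustration of what such a representation might hold, the sketch below stores the place, the objects, their pairwise relations and the actions of people in a small Python data structure. All class and field names (`SceneObject`, `Relation`, `SceneDescription`, `describe`) are hypothetical and chosen only for this example; they do not describe the group's actual systems.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneObject:
    name: str
    box: Tuple[int, int, int, int]  # (x, y, width, height) in pixels


@dataclass
class Relation:
    subject: int    # index into SceneDescription.objects
    predicate: str  # e.g. "holds", "next to"
    target: int     # index into SceneDescription.objects


@dataclass
class SceneDescription:
    place: str
    objects: List[SceneObject] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)
    actions: Dict[int, str] = field(default_factory=dict)  # object index -> action

    def describe(self) -> List[str]:
        """Turn relations and actions into short readable statements."""
        lines = [
            f"{self.objects[r.subject].name} {r.predicate} {self.objects[r.target].name}"
            for r in self.relations
        ]
        lines += [f"{self.objects[i].name} is {a}" for i, a in self.actions.items()]
        return lines


# Hand-written example values; a real system would predict them from pixels.
scene = SceneDescription(
    place="kitchen",
    objects=[SceneObject("person", (40, 30, 80, 200)), SceneObject("cup", (130, 120, 20, 25))],
    relations=[Relation(0, "holds", 1)],
    actions={0: "drinking"},
)
print(scene.place, scene.describe())
```

Keeping objects, relations and actions in separate fields mirrors the separate questions listed above, so each could in principle be filled in by a different recognition module.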
With the appearance of large scale databases like ImageNet and Places, and the recent success of machine learning techniques such as Deep Neural Networks, scene understanding has experienced a great deal of progress. This progress has made it possible to build vision systems capable of addressing some of the above-mentioned tasks.
This line of research is being undertaken in collaboration with the computer vision group at the Massachusetts Institute of Technology. Our goal is to improve existing algorithms for scene understanding and to define new problems made attainable by recent advances in neural networks and machine learning.
Recognition of facial expressions
Facial expressions are a very important source of information for the development of new technologies. As humans we use our faces to communicate our emotions, and psychologists have studied emotions in faces since the publication of Charles Darwin’s early works. One of the most successful emotion models is the Facial Action Coding System (FACS) [2], in which a particular set of action units (facial muscle movements) acts as the building blocks for six basic emotions (happiness, surprise, fear, anger, disgust and sadness).
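The idea of action units as building blocks can be sketched as a simple lookup: each basic emotion is associated with a prototype set of action units (AUs), and a set of detected AUs is ranked against every prototype. The AU combinations below are illustrative assumptions taken from commonly quoted prototype tables (they vary between sources and are not given in this text), and the Jaccard-overlap scoring and the name `match_emotions` are likewise hypothetical choices for this example, not the group's method.

```python
# Illustrative prototype AU sets (assumed; exact combinations vary by source).
PROTOTYPES = {
    "happiness": {6, 12},
    "sadness": {1, 4, 15},
    "surprise": {1, 2, 5, 26},
    "fear": {1, 2, 4, 5, 7, 20, 26},
    "anger": {4, 5, 7, 23},
    "disgust": {9, 15, 16},
}


def match_emotions(active_aus):
    """Rank the six basic emotions by Jaccard overlap with the active AUs."""
    active = set(active_aus)
    scores = {}
    for emotion, prototype in PROTOTYPES.items():
        union = active | prototype
        scores[emotion] = len(active & prototype) / len(union) if union else 0.0
    # Highest overlap first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# AU 6 (cheek raiser) and AU 12 (lip corner puller) active together.
print(match_emotions([6, 12])[0])
```

A learned classifier would replace the fixed table in practice; the lookup is only meant to make the "building blocks" relationship concrete.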
The automatic understanding of this universal language (very similar in almost all cultures) is one of the most important research areas in computer vision. It has applications in many fields, such as the design of intelligent user interfaces, human-computer interaction, the diagnosis of disorders and even reactive advertising. In this line of research we propose to design and apply state-of-the-art supervised algorithms to detect and classify emotions and action units.
Nevertheless, there is a far greater range of emotions than just this basic set. With better-than-chance accuracy, we can predict, among other things, the outcome of a negotiation, users' preferences in binary decisions, and perceived deception. In this line of research we collaborate with the Social Perception Lab at Princeton University (http://tlab.princeton.edu/) to apply automated algorithms to real data from psychology labs.
Human pose recovery and behaviour analysis
Human action/gesture recognition is a challenging area of research that deals with the problems of recognizing people in images, detecting and describing body parts, inferring their spatial configuration, and performing action/gesture recognition from still images or image sequences, including multimodal data. Because of the large pose parameter space inherent in human configurations, body pose recovery is a difficult problem that involves dealing with several distortions, including illumination changes, partial occlusions, changes in the point of view, rigid and elastic deformations, and high inter- and intra-class variability, to mention just a few. Despite the difficulty of the problem, modern computer vision techniques and new trends deserve further attention, and promising results are expected in the next few years.
Moreover, several subareas have recently been defined, such as affective computing, social signal processing, human behaviour analysis, and social robotics. The effort involved in this area of research will be offset by its potential applications: TV production, home entertainment (multimedia content analysis), education, sociology research, surveillance and security, improved quality of life by means of monitoring or automatic artificial assistance, etc.
Computer vision and cognition
We have observed huge progress in computer vision over the last four years, mainly because of the appearance of big datasets of labelled images, such as ImageNet [1] and Places, and the success of deep-learning algorithms when they are trained with this large amount of data. Since this turning point, performance has increased in many computer vision applications, such as scene recognition, object detection and recognition, image captioning, etc.
However, despite this amazing progress, there are still some tasks that are very hard for a machine to solve, such as image question-answering, or describing, in detail, the content of an image. Humans perform these tasks easily not just because of our capacity for detecting and recognizing objects and places, but because of our ability to reason about what we see. To be capable of reasoning about something, one needs cognition. Computers cannot currently reason about visual information because computer vision systems do not include artificial cognition. One of the main obstacles to developing cognitive systems for computer vision has been the lack of training data. However, the recent Visual Genome work [4] presents the first dataset that enables the modelling of such systems and opens the door to new research goals.
This line of research aims to explore how to add cognition to vision systems, in order to create algorithms that can reason about visual information.
Computer vision and emotional AI
In recent years we have observed an increasing interest, both within academia and within the computer vision industry, in systems for understanding how people feel and how visual information affects our mood and emotions. The computer vision and emotional AI line of research focuses on creating image-understanding systems that include aspects of emotional intelligence in the process of interpreting visual information. These systems have many applications. For example, they can be applied to the care and assistance of people, online education, and human-computer interaction.
In this line of research we work with advanced deep-learning techniques. The line of research combines several computer vision topics, such as face analysis, pose and gesture analysis, action recognition, scene recognition, object detection, and object/scene attribute recognition, to extract high-level information from images and videos.
Object recognition in images is still one of the most important research topics in computer vision. Given an image or a video, the goal of object recognition is to recognize and localize all the objects. In the last few years, this topic has experienced an impressive gain in performance with the use of Deep Neural Networks, and big datasets such as ImageNet.
In recent years, end-to-end learning algorithms have revolutionized many areas of research, such as computer vision, natural language processing, gaming and robotics. Deep-learning techniques have achieved the highest levels of success in many of these tasks, given their astonishing capability to model both the features/filters and the classification rule.
The algorithms developed in this line of research will focus on enhancing deep-learning architectures and improving their learning capabilities, in terms of invariant (rotation, translation, warping, scaling) feature extraction, computational efficiency and parallelization, speeding up the network learning times, and connecting images to sequences.
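To make one of these invariance properties concrete, the toy sketch below shows translation invariance in one dimension: a filter response shifts along with the input, and taking the global maximum of that response yields a feature that does not change under the shift. The circular boundary handling, the example signal and kernel, and the function names are all assumptions made for illustration; this is not the architecture developed in this line of research.

```python
def circular_correlate(signal, kernel):
    """Cross-correlate signal with kernel, wrapping around at the borders."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def global_max_pool(feature_map):
    """Collapse a feature map to its single strongest response."""
    return max(feature_map)


def shift(signal, k):
    """Circularly shift signal to the right by k positions."""
    k %= len(signal)
    return signal[-k:] + signal[:-k]


signal = [0, 0, 1, 3, 1, 0, 0, 0]  # a small "bump" pattern
kernel = [1, 2, 1]                 # a filter that responds to the bump

# The pooled feature is identical for every shifted copy of the input.
features = [
    global_max_pool(circular_correlate(shift(signal, k), kernel))
    for k in range(len(signal))
]
print(features)
```

The same pooling step does nothing for rotation, warping or scaling, which is why those invariances need dedicated architectural or training strategies.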
These algorithms will be applied to real computer vision problems in the field of neuroscience, in collaboration with the Princeton Neuroscience Institute. These problems range from the detection and tracking of rodents in low-resolution videos, image segmentation and limb detection, to motion estimation of whiskers using high-speed cameras and in vivo calcium image segmentation of neural network activity in rodents.