M5 Visual Recognition
Visual recognition addresses the visual task of explaining the content of an image in terms of "what is it?" and "where is it?". The answer to these questions is usually a class label corresponding to the object or object types in the image, a tight bounding box containing the object in question, or, at a finer level, the region (pixels) that forms its outline. These tasks are called, respectively, image classification, object detection, and semantic segmentation. A fourth question, "give me objects like this one", requires learning a similarity metric between images, even when they come from different modalities, such as sketches and photographs, through so-called encoder-decoder architectures. The VR module covers neural network architectures addressing these four types of tasks and, as a practical complement, methods to implement them.
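To make the four output types concrete, here is a minimal sketch using PyTorch and torchvision pretrained models; the specific weight identifiers and the random stand-in image are illustrative assumptions, not part of the module material.

```python
# Sketch of the four task outputs: a label, boxes, a pixel mask, a similarity.
# Weight names are illustrative; a real pipeline would preprocess a real image.
import torch
from torchvision import models
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.segmentation import fcn_resnet50

x = torch.rand(3, 224, 224)  # stand-in for a preprocessed RGB image

# Image classification: one class label for the whole image.
clf = models.resnet50(weights="IMAGENET1K_V2").eval()
with torch.no_grad():
    label = clf(x.unsqueeze(0)).argmax(dim=1)          # shape: (1,)

# Object detection: a bounding box, label, and score per detected object.
det = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
with torch.no_grad():
    boxes = det([x])[0]["boxes"]                       # shape: (N, 4)

# Semantic segmentation: a class label for every pixel.
seg = fcn_resnet50(weights="DEFAULT").eval()
with torch.no_grad():
    mask = seg(x.unsqueeze(0))["out"].argmax(dim=1)    # shape: (1, 224, 224)

# Retrieval ("give me objects like this one"): compare learned embeddings.
emb = torch.nn.Sequential(*list(clf.children())[:-1])  # drop classifier head
with torch.no_grad():
    f1 = emb(x.unsqueeze(0)).flatten(1)
    f2 = emb(torch.rand(1, 3, 224, 224)).flatten(1)
similarity = torch.nn.functional.cosine_similarity(f1, f2)
```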
Specifically, this module gives students an overview of the latest deep learning methods for solving visual recognition problems. The final aim is the understanding of complex scenes: building feasible systems for automatic image understanding that can answer which objects appear in a scene and where they are located.
Having addressed the task of classification in module M2, students will learn about a large family of deep convolutional network architectures that have proved successful at the visual tasks of detection and segmentation. In addition to these two visual tasks, this module also addresses advanced topics in deep learning, such as architectures for image generation (GANs and VAEs) and encoder-decoder architectures for multimodal applications.
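As a taste of the generative side, below is a minimal VAE sketch in PyTorch; the layer sizes and the 784-dimensional flattened input are illustrative assumptions, not the architectures covered in the module.

```python
# A tiny VAE: an encoder that predicts a latent Gaussian q(z|x), and a
# decoder that reconstructs the input from a sample of that latent code.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, dim=784, hidden=256, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)       # mean of q(z|x)
        self.logvar = nn.Linear(hidden, latent)   # log-variance of q(z|x)
        self.dec = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(),
            nn.Linear(hidden, dim), nn.Sigmoid()
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    rec = nn.functional.binary_cross_entropy(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```

A GAN replaces the reconstruction objective with an adversarial game between a generator and a discriminator; both families are encoder-decoder-style generative models in the sense used above.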
Module Project: Multimodal Recognition

This project explores several key tasks in visual recognition (object detection and segmentation, image captioning, and image generation) that highlight some of the main challenges involved in recognizing and understanding the visual content of images.
We begin with object detection and segmentation, which locate and classify objects within an image, a fundamental step for extracting structured visual information that supports higher-level semantic understanding. The project then moves to image captioning, where visual features must be translated into natural-language descriptions; this demanding task evaluates a model's ability to achieve high-level comprehension of image content. Finally, the project examines image generation, where models synthesize new images from textual descriptions. This illustrates the inverse process: using semantic text understanding to produce coherent visual content. In this context, the project also investigates how image generation can support and enhance the training of image-captioning models.
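To give a feel for the captioning task, here is a skeletal encoder-decoder captioner in PyTorch; the ResNet-18 encoder, LSTM decoder, and all dimensions are illustrative assumptions rather than the project's prescribed architecture.

```python
# Encoder-decoder captioning skeleton: a CNN encodes the image into a feature
# vector, which conditions an LSTM that predicts the caption token by token.
import torch
import torch.nn as nn
from torchvision import models

class Captioner(nn.Module):
    def __init__(self, vocab_size, embed=256, hidden=512):
        super().__init__()
        cnn = models.resnet18(weights="DEFAULT")
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # global image features
        self.img_proj = nn.Linear(512, embed)   # map image feature to embedding dim
        self.embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, images, captions):
        # Prepend the projected image feature as the first "token", then
        # predict each caption token from the image and the tokens before it.
        feats = self.encoder(images).flatten(1)        # (B, 512)
        v = self.img_proj(feats).unsqueeze(1)          # (B, 1, E)
        w = self.embed(captions)                       # (B, T, E)
        out, _ = self.lstm(torch.cat([v, w], dim=1))   # (B, T+1, H)
        return self.head(out)                          # per-step token logits
```

Training such a model with synthetic image-caption pairs produced by a text-to-image generator is one way the generation task can feed back into captioning, as the project proposes.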
M5 Schedule – Academic Year 2025-2026 – Student Guide

