PAVIS schools continue the series of VIPS schools organised by Prof. Murino at the University of Verona from 2004 to 2010. Their format differs from that of most other schools of this type.

Our goal is to give students an in-depth understanding of the specific topic covered. To this end, we typically invite one or two recognised experts to lecture for about 16 hours over 3-4 days, possibly with lab (practice) classes as well. This format favours closer interaction between the speakers and the students, to the clear advantage of the latter, who gain a better understanding of the several facets of the themes addressed.

To date, we have organised five editions of the PAVIS schools, with the support of outstanding scholars, on interesting topics in computer vision and machine learning. The programmes and further details of each school are listed below.

5th PAVIS School - 2014

Scene understanding and object recognition

Speaker: Antonio Torralba

October 28 - October 30, 2014 – Sestri Levante (GE), Italy

Scene understanding and object recognition in context

The goal of this school will be to introduce recent advances in scene recognition, multiclass object detection and object recognition in context. The class will cover global features for scene recognition (gist, deep features, …), databases for scene understanding (crowdsourcing, image annotation, …), methods for multiclass object detection (short summary of object detection approaches with emphasis on multiclass techniques) and current approaches for object recognition in context and scene understanding. The theoretical sessions will be complemented with guided experiments in MATLAB.

Invited Speakers

Antonio Torralba received the degree in telecommunications engineering from Telecom BCN, Spain, in 1994 and the Ph.D. degree in signal, image, and speech processing from the Institut National Polytechnique de Grenoble, Grenoble, France, in 2000. He is an Associate Professor of Electrical Engineering and Computer Science at the Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge. From 2000 to 2005, he undertook postdoctoral training at the Brain and Cognitive Science Department and the Computer Science and Artificial Intelligence Laboratory, MIT. Dr. Torralba is an Associate Editor of the IEEE Trans. on Pattern Analysis and Machine Intelligence, and of the International Journal of Computer Vision. He received the 2008 National Science Foundation (NSF) Career award, the best student paper award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2009, and the 2010 J. K. Aggarwal Prize from the International Association for Pattern Recognition (IAPR).


Tuesday, October 28th

Lecture: 9:30 - 13:00 and 14:30 - 18:00

Introduction to scene perception and recognition

We will describe the current state of the art in scene recognition together with motivation from cognitive psychology.

  • Scene recognition: Gist and global image features
  • Scene attributes: some experimental methods to get them
  • Eye movements and models of attention

Wednesday, October 29th

Lecture: 9:30 - 13:00 and 14:30 - 18:00

Multiclass object detection

We will describe the challenges and opportunities that arise from trying to detect many objects simultaneously. We will review current techniques in transfer learning and context models.

  • Multiclass object detection (HOG, deep features, …)
  • Transfer learning
  • Object detection in context
  • 3D datasets, reconstruction and recognition

Thursday, October 30th

Lecture: 9:30 - 13:00 and 14:30 - 18:00

Large databases for scene and object recognition

On this day we will focus on techniques that rely on having very large databases available. We will review current approaches for image annotation and non-parametric methods for scene parsing. We will also review applications inspired by human perception.

  • Large databases, dataset bias
  • Image annotation and crowd-sourcing
  • SIFT flow and label transfer
  • Scene memory

Please note: Each lecture will combine theory and exercises so as to make the class more interactive. The practical sessions will be carried out in MATLAB (please bring a laptop with you). Students are assumed to have basic knowledge of algebra, probability and statistics.


4th PAVIS School - 2013

Large Scale Visual Recognition of Object Instances and Categories

Speakers: Andrew Zisserman and Andrea Vedaldi

September 18-20, 2013 – Sestri Levante (GE), Italy

The goal of this school is to introduce a number of state-of-the-art fundamental techniques in image understanding as well as to demonstrate the use of open source software to implement them in applications. Theoretical aspects that will be covered include image representations suitable for registration, object instance, and object category matching (including regions of interest, local descriptors, descriptor metrics, quantisation, indexing, and histogramming) as well as machine learning techniques to train models for given object types (linear and non-linear large scale support vector machines and related kernel representations and optimisation methods). Alternating with the theoretical sessions, in a series of guided experiments the students will explore how such ideas can be implemented in software by using MATLAB and open source libraries such as VLFeat.

Invited Speakers

Andrew Zisserman is Professor of Engineering Science at the University of Oxford, where he leads the Visual Geometry Group. He is one of the most cited authors in computer science, with more than 270 papers in major computer vision and machine learning conferences and journals. He has published several books, including "Visual Reconstruction" (with Andrew Blake) and "Multiple View Geometry in Computer Vision" (with Richard Hartley). His work introduced groundbreaking ideas in visual geometry, matching, retrieval, and object recognition. He is the recipient of several awards, including three Marr prizes, the most prestigious award in computer vision. He is a Fellow of the Royal Society and of the Royal Society of Engineering.

Andrea Vedaldi has been University Lecturer in Engineering Science at the University of Oxford since 2012. His research interests include the automatic interpretation of images, machine learning and large scale optimisation. He is the author of more than thirty papers in major computer vision and machine learning conferences and journals, as well as the leading author of the VLFeat computer vision library. From 2008 to 2012 he was a postdoctoral researcher and junior research fellow at the University of Oxford, supported by the Glasstone Research Fellowship in Science and the New College W. W. Spooner Fellowships. He received the PhD and MSc degrees in Computer Science from the University of California at Los Angeles in 2008 and 2005 respectively (outstanding PhD and MSc thesis awards), and the BSc degree in Information Engineering from the University of Padua in 2003.


Wednesday September 18th


Morning - 9:30-13:00

Local features and descriptors

  • Covariant detectors
  • Descriptors

Matching and recognition using local features

  • Greedy matching
  • Second nearest-neighbour test
  • Geometric verification
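
As an illustration of the second nearest-neighbour test listed above, here is a minimal sketch in plain Python. It is not taken from the course material: the function name `ratio_test_matches`, the toy 2-D descriptors, and the threshold of 0.8 (a commonly used value) are all assumptions made for the example. A match is kept only when the nearest reference descriptor is clearly closer than the second nearest.

```python
import math

def ratio_test_matches(query, reference, max_ratio=0.8):
    """Return (query_index, reference_index) pairs that pass the ratio test.

    max_ratio=0.8 is an assumed, commonly used threshold, not a course setting.
    """
    matches = []
    for qi, q in enumerate(query):
        # Distance from this query descriptor to every reference descriptor, sorted.
        ranked = sorted((math.dist(q, r), ri) for ri, r in enumerate(reference))
        if len(ranked) < 2:
            continue  # the test needs a runner-up to compare against
        (best, best_idx), (second, _) = ranked[0], ranked[1]
        # Keep the match only if the best candidate clearly beats the runner-up.
        if best < max_ratio * second:
            matches.append((qi, best_idx))
    return matches

# Hypothetical toy descriptors (real ones would be e.g. 128-D SIFT vectors).
query = [(0.0, 0.0), (5.0, 5.0)]
reference = [(0.1, 0.0), (4.0, 4.0), (6.0, 6.0)]
matches = ratio_test_matches(query, reference)
```

In this toy data the first query descriptor has one unambiguous neighbour and is kept, while the second lies equally far from two reference descriptors and is rejected as ambiguous.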

Efficient visual search

  • From images to symbols: visual words
  • Bag of visual words and inverted index
  • VLAD
  • Locality-sensitive hashing
  • Product quantization
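
The idea behind the bag of visual words and the inverted index can be sketched as follows. This is a simplified illustration under stated assumptions, not the method taught in the school: the image names, the visual word ids, and the plain vote-counting score are hypothetical, and practical systems usually weight words (for example with tf-idf) rather than counting shared words.

```python
from collections import Counter, defaultdict

def build_inverted_index(images):
    """Map each visual word id to the set of image names that contain it."""
    index = defaultdict(set)
    for name, words in images.items():
        for word in words:
            index[word].add(name)
    return index

def search(index, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = Counter()
    for word in set(query_words):
        # Only images listed under this word are touched, which is what
        # makes the inverted index efficient for large collections.
        for name in index.get(word, ()):
            votes[name] += 1
    return votes.most_common()

# Hypothetical database: each image is already quantised into visual word ids.
images = {"a.jpg": [3, 7, 7, 12], "b.jpg": [7, 19], "c.jpg": [1, 2]}
index = build_inverted_index(images)
ranking = search(index, [7, 12, 40])
```

Here `ranking` lists `a.jpg` (two shared words) ahead of `b.jpg` (one shared word), and `c.jpg` is never visited because it shares no word with the query.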

Afternoon - 14:30-18:00

Large scale retrieval and applications

Practical - Bring your own laptop, instructions at


Thursday September 19th


Morning - 9:30-13:00

  • Object categories and intra-class variability
  • Supervised learning
  • Bag-of-words models for classification
  • Other image representations
  • Dataset and evaluation
  • Large scale linear learning

Afternoon - 14:30-18:00


Friday September 20th


Morning - 9:30-13:00

  • Sliding window object detection
  • HOG detectors
  • Advanced HOG-based representations
  • Learning with structure
  • Part-based model

Afternoon - 14:30-18:00

  • Latent structure
  • Recapitulation and current research challenges

The school is endorsed by GIRPR (Gruppo Italiano Ricercatori in Pattern Recognition).


3rd PAVIS School - 2012

Component Analysis methods for Human Sensing

Speakers: Fernando De la Torre and Jeffrey Cohn

October 2-5, 2012 – Sestri Levante (GE), Italy

Enabling computers to understand human behavior has the potential to revolutionize many areas that benefit society such as clinical diagnosis, human computer interaction, and social robotics. A critical element in the design of any behavioral sensing system is to find a good representation of the data for encoding, segmenting, classifying and predicting subtle human behavior. In this tutorial we will review component analysis (CA) techniques (e.g. kernel principal component analysis, support vector machines, spectral clustering) that are commonly used to learn spatial and temporal patterns of human behavior. The aim of CA is to decompose a signal into interesting components that explicitly or implicitly (e.g. kernel methods) define the representation of the signal. CA techniques are especially appealing because many can be formulated as eigen-problems, offering great potential for efficient learning of linear and non-linear representations of the data without local minima. Although CA methods have been widely used, there is still a need for a better mathematical framework to analyze and extend CA techniques. In the first part of the tutorial we will review existing CA techniques such as PCA, LDA, NMF, ICA,… and standard extensions (e.g., kernel, latent variable models, tensor factorization). In the second part of the tutorial we will show how several extensions of the CA methods outperform state-of-the-art algorithms in problems such as temporal alignment of human behavior, activity recognition, face recognition, facial expression recognition, temporal segmentation/clustering of human activities, joint segmentation and classification of human behavior, and facial feature detection in images. Applications of automatic measurement and synthesis of facial expression and prosody will include advances in basic research in nonverbal communication, avatars, and biomedical applications in psychiatry (Major Depressive Disorder) and medicine (physical pain).
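
To make the remark about eigen-problems concrete, here is a minimal sketch of PCA as an eigen-problem in plain Python. It is an illustration only, not the speakers' formulation: the toy data and all function names are hypothetical, and power iteration is used simply as an easy way to obtain the leading eigenvector of the covariance matrix without external libraries.

```python
import math

def covariance_matrix(rows):
    """Sample covariance of the data (one observation per row)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def leading_eigenpair(matrix, iterations=200):
    """Power iteration: repeatedly apply the matrix and renormalise."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v

# Hypothetical 2-D data lying close to the diagonal.
data = [[2.0, 1.9], [1.0, 1.1], [0.0, 0.1], [-1.0, -1.2], [-2.0, -1.9]]
cov = covariance_matrix(data)
eigenvalue, direction = leading_eigenpair(cov)
```

The returned `direction` is the first principal component (close to the diagonal for this data) and `eigenvalue` is the variance captured along it. Because the solution is an eigenvector of a symmetric matrix, it does not depend on a lucky initialisation, which is the sense in which such formulations avoid local minima.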

Invited Speakers

Fernando De la Torre is Associate Research Professor in the Robotics Institute at Carnegie Mellon University. He received the BSc degree in telecommunications, as well as the MSc and PhD degrees in electronic engineering from La Salle School of Engineering at Ramon Llull University, Barcelona, Spain in 1994, 1996, and 2002, respectively. His research interests are in the fields of computer vision and machine learning. Specifically, he is interested in modeling and recognizing human behavior, with a focus on understanding human behavior from multimodal sensors (e.g. video, body sensors). He has done extensive work on facial image analysis (e.g., facial expression recognition, facial feature tracking). In machine learning his interest centers on developing efficient and robust methods to model high-dimensional data. Currently, he is directing the Component Analysis Laboratory and the Human Sensing Laboratory at Carnegie Mellon University. He has more than 100 publications in refereed journals and conferences. He has organized and co-organized several workshops and has given tutorials at international conferences on the use and extensions of component analysis.

Jeffrey Cohn is Professor of Psychology at the University of Pittsburgh and Adjunct Faculty at the Robotics Institute, Carnegie Mellon University. He received his PhD in psychology from the University of Massachusetts at Amherst. Dr. Cohn has led interdisciplinary and inter-institutional efforts to develop advanced methods of automatic analysis of facial expression and prosody, and has applied those tools to research in human emotion, interpersonal processes, social development, and psychopathology. He co-developed the influential Cohn-Kanade, MultiPIE, and Pain Archive databases, co-edited two recent special issues of Image and Vision Computing on facial expression analysis, and co-chaired the 8th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2008).


Tuesday October 2nd

1:00 PM - 2:30 PM (Jeffrey Cohn)

Why do faces and facial expressions matter?

How do we measure facial expression?

  • Emotion-specified versus anatomically-based
  • Categories versus dimensions
  • Static versus dynamic

Automatic analysis and synthesis

2:45 PM - 5:30 PM (Fernando De la Torre)

Traditional linear models (e.g., PCA, LDA, SVM, CCA, OCA, PLS, MDS, k-means, NMF and ICA)

Standard extensions of linear models: kernel methods, latent variable models, tensor factorization

Unified view of component analysis


Wednesday October 3rd

9:00 AM - 12:00 PM (Fernando De la Torre)

Intro to Facial Expression Analysis

Modeling with extended CA methods

  • Robust Principal Component Analysis (RPCA) and applications
  • Low-rank matrix completion

Alignment/detection (Active Appearance Models and extensions)

  • Filtered Component Analysis (FCA)
  • Kernel Parametrized PCA
  • Metric Learning for image alignment


  • Discriminative Cluster Analysis (DCA)
  • Source Constrained Clustering (SCC)

2:00 PM - 5:00 PM (Jeffrey Cohn)


  • Emotion and micro-expression recognition
  • Facial action coding system
  • Valence, arousal, and related dimensions

Person-specific and person-independent approaches

Thursday October 4th

9:00 AM - 12:00 PM (Fernando De la Torre)


  • Multimodal Oriented and Pareto Discriminant Analysis
  • Detection Segmentation-SVM
  • Matrix Completion for multi-label image classification


  • Bilinear reduced rank regression and applications to FES
  • Subspace Regression
  • Robust and Continuous Regression
  • Supervised Local Subspace Learning

2:00 PM - 5:00 PM (Jeffrey Cohn)

Applications of automatic analysis and synthesis

  • Pain
  • Depression
  • Mother-infant

Training and testing

  • Metrics for evaluation and the problem of skewed data
  • Issues in selecting training and testing data
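
The problem of skewed data mentioned above can be illustrated with a short sketch. The labels below are hypothetical (5 positive frames out of 100) and the helper names are invented for the example. It shows how a detector that never fires still scores high accuracy, while the F1 score exposes the failure.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = event)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when there are no positives at all.
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical skewed test set: the event occurs in 5 of 100 frames.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100  # a "detector" that never reports the event

acc = accuracy(y_true, always_negative)
f1 = f1_score(y_true, always_negative)
```

With these labels the accuracy is 95% although no event is ever detected, whereas F1 is zero, which is why metrics that account for class imbalance are usually preferred for rare behaviours.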


Friday October 5th

9:00 AM - 11:30 AM (Fernando De la Torre)

Time series analysis

  • Canonical Time Warping (CTW) for multimodal alignment of human behavior
  • Aligned Cluster Analysis (ACA) for clustering human motion
  • Unsupervised temporal commonality discovery
  • Segment-SVM and applications to event detection, sequence labeling, and early event detection

1:00 PM - 3:00 PM (Jeffrey Cohn)


  • Fast-FACS
  • Combining manual and automatic measurement

Vocal prosody and longitudinal designs




2nd PAVIS School - 2011

2D and 3D Visual Recognition: Approaches and Methods

Speakers: Fei-Fei Li and Silvio Savarese

March 21-24, 2011 – Genova, Italy

This school covers a number of critical topics in computer vision and visual recognition, including object recognition, categorization, and scene understanding. Lectures offer a general introduction to the main challenges and objectives of visual recognition as well as state-of-the-art methodologies (such as bag-of-words and 2D part-based methods) for object and scene recognition. Lectures also cover recent developments in 3D spatial reasoning for joint object and scene understanding. The school includes laboratory sessions with open-ended coding projects in MATLAB.

Invited Speakers

Fei-Fei Li, Stanford University, USA

Silvio Savarese, University of Michigan, USA


1st PAVIS School - 2010

Social Signal Processing: State of the Art and Prospects

Speakers: Daniel Gatica-Perez and Alessandro Vinciarelli

July 18-22, 2010 – Sestri Levante (GE), Italy

This school follows the series of intensive courses (VIPS Schools) aimed at PhD students and researchers in the areas of Computer Vision, Pattern Recognition and Image Processing. It is organized and sponsored by the PLUS (Pattern analysis, Learning, and image Understanding Systems) laboratory of the Istituto Italiano di Tecnologia, Genova (Italy) jointly with the VIPS (Vision, Image Processing, and Sound) lab of the University of Verona. The course is residential and spans 5 days, so that attendees can establish a more productive interaction with the lecturers. The number of participants is limited to 50. If more applications are received, priority will be given to PhD students.

Invited Speakers

Daniel Gatica-Perez, IDIAP, Switzerland

Alessandro Vinciarelli, University of Glasgow, Scotland