Recent Advances in Visual Data Summarization

CVPR 2019 Tutorial

Location: Room 203C

Sunday, June 16, 2019, 1:30 pm - 5:30 pm


  • Rameswar Panda: Research Staff Member, IBM Research AI, MIT-IBM Watson AI Lab.
  • Ehsan Elhamifar: Assistant Professor, Northeastern University.
  • Michael Gygli: Research Scientist, Google Research, Zurich.
  • Boqing Gong: Research Scientist, Google Research, Seattle.
  • Tutorial Description

    Visual data summarization has many applications ranging from computer vision (video summarization, video captioning, active visual learning, object detection, image/video segmentation, etc.) to data mining (recommender systems, web data analysis, etc.). As a consequence, important new research topics and problems are emerging, including (i) online and distributed summarization, (ii) weakly supervised summarization, (iii) summarization of sequential data, and (iv) summarization in networks of cameras, in particular for surveillance tasks. The objective of this tutorial is to present the audience with a unifying perspective on the visual data summarization problem from both theoretical and application standpoints, and to discuss, motivate and encourage future research that will spur disruptive progress in the emerging field of summarization.


  • 1:30 pm - 1:50 pm: Introduction and Overview: Rameswar Panda [Slides]
  • 1:50 pm - 2:40 pm: Dynamic Subset Selection: Ehsan Elhamifar [Slides]
  • 2:40 pm - 3:30 pm: Video Summarization Objectives: Michael Gygli [Slides]
  • 3:30 pm - 3:50 pm: Break.
  • 3:50 pm - 4:40 pm: Weakly Supervised Video Summarization: Rameswar Panda [Slides]
  • 4:40 pm - 5:30 pm: Sequential Determinantal Point Processes: Boqing Gong [Slides]
  • Abstracts

  • Dynamic Subset Selection: Algorithms, Theory and Applications to Procedure Learning (Ehsan): Subset selection is the task of finding a small subset of the most informative points in a large dataset, and it has many applications in computer vision, including image and video summarization, data clustering, active visual learning and classifier selection, among others. Despite many studies, the majority of existing methods ignore dynamics and important structured dependencies among points, and require many pairs of datasets and ground-truth summaries for effective learning. In this talk, I will discuss a new class of utility functions that generalizes the well-known facility location objective to structured settings, develop scalable algorithms based on extensions of submodular maximization, and discuss the theoretical underpinnings of the developed methods. I will then turn to an important application in vision, understanding procedural videos, where I show that tools from dynamic subset selection significantly improve performance over existing methods. I will also discuss incorporating high-level reasoning into the developed methods by learning from humans with a small amount of annotation.
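  For intuition, the classical facility-location objective and its greedy maximization can be sketched in a few lines. The snippet below is a minimal illustration of the unstructured setting only, not the dynamic/structured extensions presented in the talk, and all names in it are illustrative.

    ```python
    import numpy as np

    def facility_location(S, selected):
        # F(A) = sum_j max_{i in A} S[i, j]: how well the selected
        # points "cover" every point in the dataset under similarity S.
        if not selected:
            return 0.0
        return S[list(selected)].max(axis=0).sum()

    def greedy_select(S, k):
        # Classical greedy for monotone submodular maximization,
        # which enjoys a (1 - 1/e) approximation guarantee.
        selected = []
        for _ in range(k):
            best = max(
                (i for i in range(len(S)) if i not in selected),
                key=lambda i: facility_location(S, selected + [i]),
            )
            selected.append(best)
        return selected

    # Toy data: two well-separated clusters of five points each.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(8, 1, (5, 2))])
    S = -np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # similarity = -distance
    reps = greedy_select(S, 2)  # picks one representative per cluster
    ```

  With two well-separated clusters, the second greedy step gains far more by covering the uncovered cluster than by adding a near-duplicate, so the two representatives span both clusters.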
  • Video Summarization Objectives during Training and Testing (Michael): The omnipresence of video recording devices has created the need to automatically edit and summarize videos. Video summarization is a challenging task, however: what characterizes a good summary depends on the context and the task one aims to execute. This makes obtaining ground truth for summarization datasets and evaluating summarization methods difficult. As a result, datasets are typically small, and it is unclear how well existing evaluation metrics align with human preferences. In this talk, I will first discuss existing datasets and how recent works compensate for the lack of large-scale datasets. Approaches for this include pre-training on other tasks or using weakly supervised and unsupervised training objectives. Others rely on web priors or use topic similarity to summarize multiple videos jointly. Second, I will discuss the advantages and disadvantages of existing evaluation metrics. Finally, I will propose ideas on how to train better models and track the performance of summarization models more reliably.
  • Weakly Supervised Video Summarization (Rameswar): Many of the recent successes in video summarization have been driven by the availability of large quantities of labeled training data. However, in the vast majority of real-world settings, collecting such datasets by hand is infeasible due to the cost of labeling or the paucity of data in a given domain. One increasingly popular approach is to use weaker forms of supervision that are potentially less precise but can be substantially less costly than producing explicit annotations for the given task. In this talk, we will first discuss different forms of weak supervision that can be leveraged when summarizing videos. We will present how the context of additional topic-related videos can provide more knowledge and useful clues for extracting semantically meaningful video summaries. Next, we will introduce how the context of a video, e.g., video-level labels, helps generate a meaningful video summary while avoiding the huge number of human-labeled video-summary pairs required by fully supervised algorithms. Finally, we will describe how sparse optimization methods that exploit content correlations across multiple videos, whether in a camera network or resulting from a web search, help generate an informative multi-video summary describing the whole video collection.
  • Sequential Determinantal Point Processes: Models, Algorithms, and Applications in Diverse and Sequential Subset Selection (Boqing): Determinantal point processes (DPPs) were first used to characterize the Pauli exclusion principle, which states that two identical particles cannot occupy the same quantum state simultaneously. The notion of exclusion has made the DPP an appealing tool for modeling diversity in applications such as video summarization and image ranking. In this talk, I will give a gentle review of DPPs and then present sequential DPPs (seqDPPs), a probabilistic model we originally proposed for modeling video summarization as a supervised, diverse, and sequential subset selection process; in contrast, prior approaches to video summarization were largely unsupervised. This talk will cover both seqDPPs and hierarchical seqDPPs, three tailored training algorithms (maximum likelihood estimation, large-margin, and reinforcement), and their applications to vanilla video summarization as well as query-focused video summarization.
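  To see how a determinant encodes the exclusion/diversity behavior at the heart of any DPP (including the sequential variants above): under an L-ensemble DPP, the probability of selecting a subset A is proportional to det(L_A), the determinant of the kernel restricted to A. The snippet below is a minimal sketch with made-up kernel values.

    ```python
    import numpy as np

    def dpp_unnormalized_prob(L, A):
        # P(A) ∝ det(L_A): the determinant of the kernel
        # restricted to the rows and columns indexed by A.
        return np.linalg.det(L[np.ix_(A, A)])

    # Kernel over three items: items 0 and 1 are near-duplicates
    # (similarity 0.9), item 2 is dissimilar to both (0.1).
    L = np.array([
        [1.0, 0.9, 0.1],
        [0.9, 1.0, 0.1],
        [0.1, 0.1, 1.0],
    ])

    p_dup = dpp_unnormalized_prob(L, [0, 1])  # redundant pair: 1 - 0.9^2 = 0.19
    p_div = dpp_unnormalized_prob(L, [0, 2])  # diverse pair:   1 - 0.1^2 = 0.99
    ```

  Geometrically, det(L_A) is the squared volume spanned by the feature vectors of the items in A, so near-parallel (redundant) items shrink the volume and the probability.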
  • Target Audience

    The intended audience includes academics, graduate students and industrial researchers who are interested in state-of-the-art machine learning techniques for information extraction and summarization in large high-dimensional datasets that are mixed, multi-modal, inhomogeneous, heterogeneous, or hybrid. Attendees with a mathematical and theoretical inclination will enjoy the tutorial as much as those with practical interests.

    Speaker Bios

  • Rameswar Panda is currently a Research Staff Member at IBM Research AI, MIT-IBM Watson AI Lab, Cambridge, USA. Prior to joining IBM, he obtained his Ph.D. in Electrical and Computer Engineering from the University of California, Riverside in 2018. His primary research interests span the areas of computer vision, machine learning and multimedia. In particular, his current focus is on developing semi-supervised, weakly supervised, and unsupervised algorithms for solving different vision problems. His work has been published in top-tier conferences such as CVPR, ICCV, ECCV and MM, as well as high-impact journals such as TIP and TMM.
  • Ehsan Elhamifar is an Assistant Professor in the College of Computer and Information Science (CCIS) and the director of the Mathematical, Computational and Applied Data Science (MCADS) Lab at Northeastern University. Prof. Elhamifar is a recipient of the DARPA Young Faculty Award and the NSF CISE Career Research Initiation Award on the topic of Big Data Summarization. Previously, he was a postdoctoral scholar in the Electrical Engineering and Computer Science (EECS) department at the University of California, Berkeley. Prof. Elhamifar obtained his PhD from the Electrical and Computer Engineering (ECE) department at Johns Hopkins University. Prof. Elhamifar's research areas are machine learning, computer vision and optimization. He is interested in developing scalable and robust algorithms that can address the challenges of complex and massive high-dimensional data. Specifically, he uses tools from convex, nonconvex and submodular optimization, sparse and low-rank modeling, deep learning and high-dimensional statistics to develop algorithms and theory, and applies them to solve challenging real-world problems, including big data summarization, procedure learning from instructional data, large-scale recognition with small labeled data and active learning for visual data.
  • Michael Gygli is a research scientist at Google AI in Zurich, working under Prof. Vittorio Ferrari. Before joining Google, Michael was the head of AI at a startup, leading its efforts to automate video editing through summarization and highlight detection. In 2017 he obtained a PhD from ETH Zurich for his thesis on Interest-Based Video Summarization via Subset Selection, under the supervision of Prof. Luc Van Gool. Michael has published several papers at venues such as CVPR, ICCV, ECCV, ICML and MM.
  • Boqing Gong is a research scientist at Google, Seattle and a remote principal investigator at ICSI, Berkeley. His research in machine learning and computer vision focuses on modeling, algorithms, and visual recognition. Before joining Google in 2019, he worked at Tencent and was a tenure-track Assistant Professor at the University of Central Florida (UCF). He received an NSF CRII award in 2016 and an NSF BIGDATA award in 2017, both of which were the first of their kind ever granted to UCF. He is/was a (senior) area chair of NeurIPS 2019, ICCV 2019, ICML 2019, AISTATS 2019, AAAI 2020, and WACV 2018–2020. He received his Ph.D. in 2015 at the University of Southern California, where the Viterbi Fellowship partially supported his work.