The ECCV/ACM Multimedia 2016 workshop entitled “Computer Vision for Audio-Visual Media” (CVAVM) is dedicated to the role of computer vision for audio-visual media. Audio-visual data is readily available since it is simple to acquire and the great majority of videos today contain an audio track. Audio-visual media are ubiquitous in our daily life: movies, TV programs, music videos and YouTube clips, to cite just a few. Moreover, audio-visual media exist on various platforms: TVs, movie theaters, tablets and smartphones. Audio-visual media are also used in many casual and professional contexts and applications, such as entertainment, machine learning, biomedicine, games, education and movie special effects.

The goals of this workshop are to (1) investigate the great research opportunities of audio-visual data/media processing and editing, (2) gather researchers working on audio-visual data/media and (3) present and discuss the latest trends in research and technology with paper presentations and invited talks.

Our workshop investigates any applications and algorithms that combine visual and audio information. The first major thrust is how the combination of audio and visual information can simplify or improve “traditional” computer vision applications, in particular (but not limited to) action recognition, video segmentation and 3D reconstruction. The second major thrust is the exploration of emerging, novel and unconventional applications of audio-visual media, for example movie trailer generation, video editing, and video-to-music alignment.

We invite anyone who is interested in audio-visual data and media. Our CVAVM workshop is organized by ECCV and coordinated together with the ACM Multimedia Conference. It is a wonderful and exciting opportunity to foster collaboration between the computer vision and multimedia communities. People from both communities are welcome to submit papers and attend the workshop.

Important dates:

Paper registration (title, abstract and authors): June 24, 2016
Full paper submission: June 26, 2016
Notification of acceptance: July 20, 2016
Camera-ready paper due: July 24, 2016
Workshop date: October 16, 2016 (morning)


The workshop will be on October 16, 2016. See venue information below.

09:00 – 09:10 Welcome and Opening Remarks
09:10 – 09:55 Invited keynote 1 by Prof. William Freeman (MIT)
10:00 – 10:20 Oral 1: “Speech-driven Facial Animation Using Manifold Relevance Determination” by Samia Dawood, Yulia Hicks, and David Marshall
10:20 – 10:40 Oral 2: “Suggesting Sounds for Images from Video Collections” by Matthias Soler, Jean-Charles Bazin, Oliver Wang, Andreas Krause and Alexander Sorkine-Hornung
10:40 – 11:00 Coffee break
11:00 – 11:20 Oral 3: “GeThR-Net: A Generalized Temporally Hybrid Recurrent Neural Network for Multimodal Information Fusion” by Ankit Gandhi, Arjun Sharma, Arijit Biswas, and Om Deshmukh
11:20 – 12:05 Invited keynote 2 by Rodolphe Gelin (SoftBank Robotics, Aldebaran)
12:05 – 12:15 Closing Remarks

Keynote talks:

William T. Freeman is a Professor at MIT. He will be talking about exciting technologies for audio-visual data.

Rodolphe Gelin is EVP Chief Scientific Officer at Aldebaran. He will be talking about exciting applications of audio-visual data for robotics and interaction.



The CVAVM workshop is part of the ECCV and ACM MM 2016 workshops. It will take place in Amsterdam, The Netherlands, on 16 October 2016. The venue is room C0.02 at the Roeterseiland complex of the University of Amsterdam (Roetersstraat 11, 1018 WB, see Google map). Please see the ECCV webpage for information on venue, accommodations, and other details.

Paper Submissions:

Our CVAVM workshop invites paper submissions on any applications and algorithms that combine visual and audio information. See the list of topics below.

Paper submissions are handled through the workshop’s CMT website.

The paper submission deadline is June 26, 2016. The submission process is similar to that of the ECCV main conference; see the guidelines and template on the ECCV webpage. Papers are limited to 14 pages (excluding references), including figures and tables. The reviewing will be double-blind, and each submission will be reviewed by at least two reviewers.

Additional comments: you can upload supplementary material from the author console (i.e. after initial registration of the paper).

Topics include (but are not limited to):

– 3D reconstruction and tracking
– scene and action recognition, and video classification
– video segmentation and saliency
– speaker identification
– speech recognition in videos
– automatic video captioning
– virtual/augmented reality and tele-presence
– human-computer interaction
– robotics
– joint audio-visual processing
– automatic generation of videos
– trailer generation
– video and movie manipulation
– video synchronization
– image sonification
– video-to-music alignment
– joint audio-video retargeting


Workshop chairs:

Jean-Charles Bazin, Disney Research
Zhengyou Zhang, Microsoft Research
Wilmot Li, Adobe Research

Committee members:

Dinesh Manocha, University of North Carolina
Ivan Dokmanic, EPFL
Michael Rubinstein, Google and MIT
Timothy R. Langlois, Adobe Research