Title: Modeling Visual Composition for Explainable Image Understanding
Speaker: Prof. Ying Wu
Time: June 5, 2019, 15:30
Venue: Room 311 (Middle Section), South Building No. 1
Host: 钟胜
Biography:
Dr. Ying Wu is a Full Professor of Electrical and Computer Engineering at Northwestern University, Evanston, Illinois. He received his B.S. from Huazhong University of Science and Technology, Wuhan, China, in 1994, his M.S. from Tsinghua University, Beijing, China, in 1997, and his Ph.D. in electrical and computer engineering from the University of Illinois at Urbana-Champaign (UIUC), Urbana, Illinois, in 2001. In 2001, he joined the Department of Electrical and Computer Engineering at Northwestern University as an assistant professor; he was promoted to associate professor in 2007 and to full professor in 2012. His current research interests include computer vision, autonomous robots, pattern recognition, machine learning, multimedia data mining, and human-computer interaction. He serves as an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE T-PAMI), IEEE Transactions on Image Processing (IEEE T-IP), IEEE Transactions on Circuits and Systems for Video Technology (IEEE T-CSVT), the SPIE Journal of Electronic Imaging (JEI), and the IAPR journal Machine Vision and Applications (MVA). He has served as Program Chair and Area Chair for top conferences including CVPR, ICCV, and ECCV. He received the Robert T. Chien Award at UIUC in 2001 and the NSF CAREER Award in 2003. He is a Fellow of the IEEE.
Abstract:
The key issue in visual modeling is coping with the uncertainty in visual patterns, which generally exhibit enormous variations. Deep-network-based methods have recently demonstrated impressive results on many visual recognition tasks. They learn, by brute force, a highly non-linear mapping from the visual space to the “label” space. Generally, they do not explicitly exploit the structure of the visual space, but rather depend on the coverage provided by a huge amount of training data. In practice, however, it is not always possible to collect “sufficient” training data that covers the complicated visual diversity well.
The diversity of visual appearances has many causes. An important one is variation in structural composition: a larger visual pattern can be decomposed into a set of smaller component patterns, each of which may decompose further (a face, for example, decomposes into eyes, a nose, and a mouth). It is these possibly endless structural compositions that produce the enormous visual complexity and diversity. This is not yet well investigated or understood, and thus deserves further study.
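To make the combinatorics concrete, the following is a minimal Python sketch added for illustration; the slot/alternative counting scheme is an assumption of this summary, not a model from the talk. If a pattern has a few component slots and each slot admits several alternative sub-patterns, the number of distinct compositions grows exponentially.

    # Toy illustration: each component slot is filled independently.
    def num_compositions(num_slots: int, alternatives_per_slot: int) -> int:
        """Count the distinct patterns produced by independent slot choices."""
        return alternatives_per_slot ** num_slots

    # Even a shallow hierarchy explodes combinatorially:
    print(num_compositions(5, 10))  # 100000 distinct patterns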
In this talk, I will first present our recent work on a visual compositional model that unifies the modeling of structure and appearance. It designs a stochastic grammar based on the probabilistic And/Or-Graph to model structural composition. In addition, this structural decomposition is grounded to images via deep networks to handle the uncertainty in visual appearance. The model can be learned effectively in an end-to-end fashion. Then, I will present a case study of the human pose estimation task using compositional models. Instead of imposing simple relationships among sub-parts, we propose a deeply learned compositional model (DLCM) that learns the compositionality of the human body. This yields a novel model with a hierarchical compositional architecture and bottom-up/top-down inference stages. In addition, we propose a novel bone-based part representation. It not only compactly encodes the orientations, scales, and shapes of the parts, but also avoids their potentially large state spaces. With significantly lower complexity, our approach outperforms state-of-the-art methods on three benchmark datasets, which demonstrates the advantages of modeling visual composition.
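For intuition about the And/Or-Graph grammar, here is a minimal Python sketch; the node types, probabilities, and sampling routine below are illustrative assumptions rather than the actual model from the talk. And-nodes compose all of their children, Or-nodes stochastically select one alternative, so sampling the grammar yields one concrete structural composition.

    import random
    from dataclasses import dataclass
    from typing import List, Union

    @dataclass
    class Terminal:
        # Leaf pattern; in the talk's model, leaves are grounded to image
        # evidence via deep networks (omitted in this toy sketch).
        name: str

    @dataclass
    class AndNode:
        # Composition: all children appear together.
        children: List["Node"]

    @dataclass
    class OrNode:
        # Alternatives: exactly one child is chosen, with grammar probabilities.
        children: List["Node"]
        probs: List[float]

    Node = Union[Terminal, AndNode, OrNode]

    def sample(node: Node) -> List[str]:
        """Sample one concrete composition (a parse) from the grammar."""
        if isinstance(node, Terminal):
            return [node.name]
        if isinstance(node, AndNode):
            return [leaf for child in node.children for leaf in sample(child)]
        chosen = random.choices(node.children, weights=node.probs, k=1)[0]
        return sample(chosen)

    # A face is eyes AND a nose AND a mouth; the eyes may be open OR closed.
    face = AndNode([
        OrNode([Terminal("open eyes"), Terminal("closed eyes")], [0.8, 0.2]),
        Terminal("nose"),
        Terminal("mouth"),
    ])
    print(sample(face))  # e.g. ['open eyes', 'nose', 'mouth']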
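Similarly, here is a minimal sketch of what a bone-based part representation might look like; the toy skeleton and helper function are hypothetical, not the paper's exact formulation. Each part is encoded by the vector from its parent joint to its child joint, which compactly captures orientation and scale in a low-dimensional state.

    import numpy as np

    # Hypothetical toy skeleton: (child_joint, parent_joint) index pairs.
    BONES = [(1, 0), (2, 1), (3, 0), (4, 3)]

    def bone_representation(joints: np.ndarray) -> np.ndarray:
        """Map (J, 2) joint coordinates to (B, 2) bone vectors."""
        return np.stack([joints[c] - joints[p] for c, p in BONES])

    joints = np.array([[0, 0], [0, 1], [0, 2], [1, 0], [2, 0]], dtype=float)
    bones = bone_representation(joints)
    # Each row's direction encodes a part's orientation; its norm, the scale.
    print(np.linalg.norm(bones, axis=1))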