Online Pose Estimation and Tracking Approach with Siamese Region Proposal Network

Authors: Cheng Fang, Lingwei Quan, Cunyue Lu


Human pose estimation and tracking aim to accurately identify and locate the positions of human joints in video. This computer vision task is of great significance for human motion recognition, behavior understanding, and scene analysis. Human pose estimation has seen remarkable progress in recent years; however, human pose tracking, especially online tracking, requires more research. In this paper, a framework called PoseSRPN is proposed for online single-person pose estimation and tracking. We attach a pose estimation branch to a Siamese network to incorporate Single-person Pose Tracking (SPT) and Visual Object Tracking (VOT) into one framework. The pose estimation branch has a simple network structure that replaces the complex upsampling and convolution network structure with deconvolution. By augmenting the loss of the fully convolutional Siamese network with the pose estimation task, pose estimation and tracking can be trained in one stage. Once trained, PoseSRPN relies only on a single bounding box initialization and produces human joint locations. Experimental results show that, while maintaining good pose estimation accuracy on the COCO and PoseTrack datasets, the proposed method achieves a speed of 59 frames/s, which is faster than other pose tracking frameworks.
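
The one-stage training described in the abstract, in which the Siamese tracking loss is augmented with a pose estimation term, can be illustrated with a minimal sketch. The abstract does not give the loss formulation, so everything here is an assumption made for illustration: the function names `mse` and `joint_loss`, the weight `pose_weight`, the use of a per-joint mean squared error over heatmaps, and the treatment of the tracking loss as a precomputed scalar.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two flattened heatmaps of equal length."""
    if len(pred) != len(target):
        raise ValueError("heatmaps must have the same number of elements")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    track_loss: float,
    pred_heatmaps: Sequence[Sequence[float]],
    gt_heatmaps: Sequence[Sequence[float]],
    pose_weight: float = 1.0,  # hypothetical balancing weight, not from the paper
) -> float:
    """Add a pose term to a tracking loss so both tasks share one objective.

    track_loss    -- scalar loss of the Siamese tracking branch (assumed given)
    pred_heatmaps -- one flattened predicted heatmap per joint
    gt_heatmaps   -- one flattened ground-truth heatmap per joint
    """
    if len(pred_heatmaps) != len(gt_heatmaps):
        raise ValueError("need one ground-truth heatmap per predicted joint")
    # Average the per-joint heatmap error (an assumed choice of pose loss).
    pose_loss = sum(
        mse(p, g) for p, g in zip(pred_heatmaps, gt_heatmaps)
    ) / len(pred_heatmaps)
    return track_loss + pose_weight * pose_loss


# Toy usage: two joints, each with a two-pixel heatmap.
total = joint_loss(
    track_loss=0.5,
    pred_heatmaps=[[0.0, 1.0], [1.0, 1.0]],
    gt_heatmaps=[[0.0, 0.0], [1.0, 1.0]],
    pose_weight=2.0,
)
```

Because both terms feed a single scalar, one backward pass would update the shared backbone for tracking and pose at the same time, which is the sense in which training happens in one stage.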

Keywords: Computer vision, Siamese network, pose estimation, pose tracking.



