Shifted Window Based Self-Attention via Swin Transformer for Zero-Shot Learning
Authors: Yasaswi Palagummi, Sareh Rowlands
Abstract:
Generalised Zero-Shot Learning (GZSL) is a more demanding variant of zero-shot learning in which test samples may belong to either seen or unseen classes. GZSL methods are typically biased towards the seen classes because they learn a model that must recognise both seen and unseen classes using data samples from the seen classes only. This frequently causes samples from unseen classes to be misclassified as seen classes, which makes GZSL more challenging. In this work, we propose Swin-GZSL, an approach that leverages the shifted-window-based self-attention of the Swin Transformer in the inductive GZSL problem setting. We run experiments on three popular benchmark datasets used for ZSL and its variants: CUB, SUN, and AWA2. The results show that our Swin Transformer based model achieves a state-of-the-art harmonic mean on two datasets, AWA2 and SUN, and a near-state-of-the-art harmonic mean on the third, CUB. More importantly, the technique has linear computational complexity, which reduces training time significantly. We also observe less bias than in most existing GZSL models.
Keywords: Generalised Zero-shot Learning, Inductive Learning, Shifted-Window Attention, Swin Transformer, Vision Transformer.
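The harmonic mean reported in the abstract is the usual GZSL summary metric: it combines per-class accuracy on seen classes (S) and on unseen classes (U) as H = 2SU / (S + U), so a model biased towards seen classes scores lower than a balanced one with the same average accuracy. The following is a minimal sketch of that computation; the function name and the accuracy values are hypothetical and are not taken from the paper's results.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean H = 2 * S * U / (S + U) of seen and unseen accuracy."""
    total = seen_acc + unseen_acc
    if total == 0:
        # Both accuracies are zero; define H as 0 to avoid dividing by zero.
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total


# Hypothetical accuracies, for illustration only (not results from the paper).
# Both pairs have the same arithmetic mean (0.60), but different balance.
balanced = harmonic_mean(0.60, 0.60)
biased = harmonic_mean(0.90, 0.30)

print(f"balanced model H: {balanced:.3f}")
print(f"biased model H:   {biased:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two accuracies, the biased pair scores lower than the balanced pair even though their arithmetic means are equal.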