JaCoText: A Pretrained Model for Java Code-Text Generation
Authors: Jessica López Espejel, Mahaman Sanoussi Yahaya Alassan, Walid Dahhane, El Hassane Ettifouri
Abstract:
Pretrained transformer-based models have shown high performance in natural language generation tasks. However, a new wave of interest has surged: automatic programming language generation. This task consists of translating natural language instructions into programming code. Although well-known pretrained models for language generation have achieved good performance in learning programming languages, effort is still needed in automatic code generation. In this paper, we introduce JaCoText, a model based on Transformer neural networks. It aims to generate Java source code from natural language text. JaCoText leverages the advantages of both natural language and code generation models. More specifically, we study findings from the state of the art and use them to (1) initialize our model from powerful pretrained models, (2) perform additional pretraining on our Java dataset, (3) carry out experiments combining unimodal and bimodal data during training, and (4) scale the input and output length during fine-tuning. Experiments conducted on the CONCODE dataset show that JaCoText achieves new state-of-the-art results.
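To make the task and step (4) concrete, the following is a minimal, illustrative sketch (not the authors' released code) of fine-tuning a T5-style pretrained sequence-to-sequence checkpoint on a single natural-language-to-Java pair with enlarged input and output lengths; the checkpoint name, the example pair, and the length values are assumptions for illustration only.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumption: any T5-family checkpoint could stand in here; the abstract does not name one.
model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical training pair: a natural-language description as source, Java code as target.
source = "generate java: return the largest value in the array values"
target = "int maxValue(int[] values) { int m = values[0]; for (int v : values) m = Math.max(m, v); return m; }"

# Step (4) of the abstract: enlarged input/output lengths during fine-tuning (values are illustrative).
inputs = tokenizer(source, max_length=512, truncation=True, return_tensors="pt")
labels = tokenizer(target, max_length=256, truncation=True, return_tensors="pt").input_ids

# One standard seq2seq training step: the model returns the cross-entropy loss for the labels.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()  # gradients for an optimizer step in a full fine-tuning loop

At inference time, model.generate(inputs.input_ids, max_length=256) would produce the Java token sequence, which is then scored against references with metrics such as BLEU or CodeBLEU.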
Keywords: Java code generation, Natural Language Processing, Sequence-to-sequence Models, Transformers Neural Networks.
References:
[1] Wasi Ahmad and Saikat Chakraborty and Baishakhi Ray and Kai-Wei Chang, Unified Pretraining for Program Understanding and Generation. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021.
[2] Long Phan and Hieu Tran and Daniel Le and Hieu Nguyen and James Annibal and Alec Peltekian and Yanfang Ye, CoTexT: Multi-task Learning with Code-Text Transformer. Proceedings of the 1st Workshop on Natural Language Processing for Programming, 2021.
[3] Daya Guo and Shuo Ren and Shuai Lu and Zhangyin Feng and Duyu Tang and Shujie Liu and Long Zhou and Nan Duan and Alexey Svyatkovskiy and Shengyu Fu and Michele Tufano and Shao Kun Deng and Colin B. Clement and Dawn Drain and Neel Sundaresan and Jian Yin and Daxin Jiang and Ming Zhou, GraphCodeBERT: Pre-training Code Representations with Data Flow. 9th International Conference on Learning Representations, 2021.
[4] Frank F. Xu and Uri Alon and Graham Neubig and Vincent Josua Hellendoorn, A Systematic Evaluation of Large Language Models of Code. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, 2022.
[5] Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N Gomez and Lukasz Kaiser and Illia Polosukhin, Attention is All you Need. Advances in Neural Information Processing Systems, 2017.
[6] Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova, BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, 2019.
[7] Daniel Khashabi and Snigdha Chaturvedi and Michael Roth and Shyam Upadhyay and Dan Roth, Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences. NAACL, 2018.
[8] Christopher Clark and Kenton Lee and Ming-Wei Chang and Tom Kwiatkowski and Michael Collins and Kristina Toutanova, BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, 2019.
[9] Yasmin Moslem and Rejwanul Haque and John Kelleher and Andy Way, Domain-Specific Text Generation for Machine Translation. arXiv, 2022.
[10] Jingqing Zhang and Yao Zhao and Mohammad Saleh and Peter J. Liu, PEGASUS: Pretraining with Extracted Gap-sentences for Abstractive Summarization, 2019.
[11] Peter J. Liu and Yu-An Chung and Jie Ren, SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders, 2019.
[12] Alec Radford and Karthik Narasimhan, Improving Language Understanding by Generative Pretraining. 2018.
[13] Alec Radford and Jeffrey Wu and Rewon Child and David Luan and Dario Amodei and Ilya Sutskever, Language Models are Unsupervised Multitask Learners. 2019.
[14] Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 2020.
[15] Shuai Lu and Daya Guo and Shuo Ren and Junjie Huang and Alexey Svyatkovskiy and Ambrosio Blanco and Colin B. Clement and Dawn Drain and Daxin Jiang and Duyu Tang and Ge Li and Lidong Zhou and Linjun Shou and Long Zhou and Michele Tufano and Ming Gong and Ming Zhou and Nan Duan and Neel Sundaresan and Shao Kun Deng and Shengyu Fu and Shujie Liu, CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. CoRR, 2021.
[16] Maxim Rabinovich and Mitchell Stern and Daniel Klein, Abstract Syntax Networks for Code Generation and Semantic Parsing. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1), 2017.
[17] Pengcheng Yin and Graham Neubig, A Syntactic Neural Model for General-Purpose Code Generation. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, 2017.
[18] Hamel Husain and Ho-Hsiang Wu and Tiferet Gazit and Miltiadis Allamanis and Marc Brockschmidt, CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. CoRR, 2019.
[19] Zhangyin Feng and Daya Guo and Duyu Tang and Nan Duan and Xiaocheng Feng and Ming Gong and Linjun Shou and Bing Qin and Ting Liu and Daxin Jiang and Ming Zhou, CodeBERT: A Pretrained Model for Programming and Natural Languages, 2020.
[20] Iz Beltagy and Kyle Lo and Arman Cohan, SciBERT: A Pretrained Language Model for Scientific Text. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[21] Jinhyuk Lee and Wonjin Yoon and Sungdong Kim and Donghyeon Kim and Sunkyu Kim and Chan Ho So and Jaewoo Kang, BioBERT: a pretrained biomedical language representation model for biomedical text mining. Bioinformatics, 2020.
[22] Iz Beltagy and Matthew E. Peters and Arman Cohan, Longformer: The Long-Document Transformer. arXiv:2004.05150, 2020.
[23] Kaitao Song and Xu Tan and Tao Qin and Jianfeng Lu and Tie-Yan Liu, MASS: Masked Sequence to Sequence Pretraining for Language Generation. International Conference on Machine Learning, 2019.
[24] Luca Di Liello and Matteo Gabburo and Alessandro Moschitti, Efficient pretraining objectives for Transformers, 2021.
[25] Taku Kudo and John Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
[26] Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Luke Zettlemoyer, Mapping Language to Code in Programmatic Context. EMNLP, 2018.
[27] Kishore Papineni and Salim Roukos and Todd Ward and Wei-Jing Zhu, Bleu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
[28] Shuo Ren and Daya Guo and Shuai Lu and Long Zhou and Shujie Liu and Duyu Tang and Ming Zhou and Ambrosio Blanco and Shuai Ma, CodeBLEU: a Method for Automatic Evaluation of Code Synthesis, 2020.
[29] Mike Lewis and Yinhan Liu and Naman Goyal and Marjan Ghazvininejad and Abdelrahman Mohamed and Omer Levy and Veselin Stoyanov and Luke Zettlemoyer, BART: Denoising Sequence-to-Sequence Pretraining for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
[30] Daya Guo and Duyu Tang and Nan Duan and Ming Zhou and Jian Yin, Coupling Retrieval and Meta-Learning for Context-Dependent Semantic Parsing. Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), 2019.
[31] Nicholas Locascio and Karthik Narasimhan and Eduardo DeLeon and Nate Kushman and Regina Barzilay, Neural Generation of Regular Expressions from Natural Language with Minimal Domain Knowledge. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
[32] Xiaojun Xu and Chang Liu and Dawn Song, SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning, 2017.
[33] Victor Zhong and Caiming Xiong and Richard Socher, Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning, 2017.
[34] Pengcheng Yin and Bowen Deng and Edgar Chen and Bogdan Vasilescu and Graham Neubig, Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow. Association for Computing Machinery, 2018.
[35] Yinhan Liu and Jiatao Gu and Naman Goyal and Xian Li and Sergey Edunov and Marjan Ghazvininejad and Mike Lewis and Luke Zettlemoyer, Multilingual Denoising Pretraining for Neural Machine Translation. Transactions of the Association for Computational Linguistics, 2020.
[36] Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, 2019.