Title:
Evaluating the performance of multilingual models in answer extraction and question generation.
Authors:
Moreno-Cediel A; Departamento de Ciencias de la Computación, Universidad de Alcalá, 28805, Alcalá de Henares, Spain., Del-Hoyo-Gabaldon JA; Departamento de Ciencias de la Computación, Universidad de Alcalá, 28805, Alcalá de Henares, Spain., Garcia-Lopez E; Departamento de Ciencias de la Computación, Universidad de Alcalá, 28805, Alcalá de Henares, Spain. eva.garcial@uah.es., Garcia-Cabot A; Departamento de Ciencias de la Computación, Universidad de Alcalá, 28805, Alcalá de Henares, Spain., de-Fitero-Dominguez D; Departamento de Ciencias de la Computación, Universidad de Alcalá, 28805, Alcalá de Henares, Spain.
Source:
Scientific reports [Sci Rep] 2024 Jul 05; Vol. 14 (1), pp. 15477. Date of Electronic Publication: 2024 Jul 05.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Nature Publishing Group Country of Publication: England NLM ID: 101563288 Publication Model: Electronic Cited Medium: Internet ISSN: 2045-2322 (Electronic) Linking ISSN: 20452322 NLM ISO Abbreviation: Sci Rep Subsets: PubMed not MEDLINE; MEDLINE
Imprint Name(s):
Original Publication: London : Nature Publishing Group, copyright 2011-
References:
Rus, V., Cai, Z., & Graesser, A. Question generation: Example of a multi-year evaluation campaign. In Proceedings of the Workshop on the Question Generation Shared Task and Evaluation Challenge (QGSTEC) (2008).
Wang, W., Feng, S., Wang, D., & Zhang, Y. Answer-guided and semantic coherent question generation in open-domain conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 5066–5076. https://doi.org/10.18653/v1/D19-1511 .
Duan, N., Tang, D., Chen, P., & Zhou, M. Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 866–874. https://doi.org/10.18653/v1/D17-1090 .
Rebuffel, C. et al., Data-QuestEval: A referenceless metric for data-to-text semantic evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 8029–8036. https://doi.org/10.18653/v1/2021.emnlp-main.633 .
Pan, L., Chen, W., Xiong, W., Kan, M.-Y., & Wang, W. Y. Zero-shot fact verification by claim generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Online: Association for Computational Linguistics, Aug. 2021, pp. 476–483. https://doi.org/10.18653/v1/2021.acl-short.61 .
Le, N.-T., Kojiri, T., & Pinkwart, N., Automatic question generation for educational applications – the state of art. In Advanced Computational Methods for Knowledge Engineering, T. van Do, H. A. L. Thi, and N. T. Nguyen, Eds., in Advances in Intelligent Systems and Computing. Cham: Springer International Publishing, 2014, pp. 325–338. https://doi.org/10.1007/978-3-319-06569-4_24 .
Liu, M., Calvo, R. & Rus, V. G-Asks: An intelligent automatic question generation system for academic writing support. Dialogue Discourse 3, 101–124. https://doi.org/10.5087/dad.2012.205 (2012). (PMID: 10.5087/dad.2012.205)
Mitkov, R., & Ha, L. A. Computer-aided generation of multiple-choice tests. In Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing, 2003, pp. 17–22. Accessed: Dec. 21, 2022. [Online]. https://aclanthology.org/W03-0203.
Rao, D. C. H. & Saha, S. K. Automatic multiple choice question generation from text: A survey. IEEE Trans. Learn. Technol. 13(1), 14–25. https://doi.org/10.1109/TLT.2018.2889100 (2020). (PMID: 10.1109/TLT.2018.2889100)
Olney, A. M., Graesser, A. C. & Person, N. K. Question generation from concept maps. Dial. Discourse https://doi.org/10.5087/dad.2012.204 (2012). (PMID: 10.5087/dad.2012.204)
Rus, V., Wyse, B., Piwek, P., Lintean, M., Stoyanchev, S., & Moldovan, C. The first question generation shared task evaluation challenge. In Proceedings of the 6th International Natural Language Generation Conference, Association for Computational Linguistics, Jul. 2010. Accessed: Dec. 21, 2022. https://aclanthology.org/W10-4234.
Chali, Y. & Hasan, S. A. Towards topic-to-question generation. Comput. Linguist. 41(1), 1–20. https://doi.org/10.1162/COLI_a_00206 (2015). (PMID: 10.1162/COLI_a_00206)
Heilman, M., & Smith, N. A. Good question! Statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California: Association for Computational Linguistics, Jun. 2010, pp. 609–617. Accessed: Dec. 05, 2022. https://aclanthology.org/N10-1086.
Du, X., Shao, J., & Cardie, C. Learning to ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada: Association for Computational Linguistics, Jul. 2017, pp. 1342–1352. https://doi.org/10.18653/v1/P17-1123 .
Vaswani, A., et al., Attention is all you need (2017), https://doi.org/10.48550/ARXIV.1706.03762 .
Lopez, L., Cruz, D., Cruz, J., & Cheng, C. Transformer-based end-to-end question generation (2020).
Chan, Y.-H., & Fan, Y.-C. A recurrent BERT-based model for question generation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 154–162. https://doi.org/10.18653/v1/D19-5821 .
Wang, S. et al., PathQG: Neural question generation from facts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online: Association for Computational Linguistics, Nov. 2020, pp. 9066–9075. https://doi.org/10.18653/v1/2020.emnlp-main.729 .
Zhou, Q., Yang, N., Wei, F., Tan, C., Bao, H., & Zhou, M. Neural question generation from text: A preliminary study (2017). arXiv: https://doi.org/10.48550/arXiv.1704.01792 .
Song, L., Wang, Z., Hamza, W., Zhang, Y., & Gildea, D. Leveraging context information for natural question generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 569–574. https://doi.org/10.18653/v1/N18-2090 .
Steuer, T., Filighera, A., & Rensing, C. Remember the facts? Investigating answer-aware neural question generation for text comprehension. In Artificial Intelligence in Education, I. I. Bittencourt, M. Cukurova, K. Muldner, R. Luckin, and E. Millán, Eds., in Lecture Notes in Computer Science. Cham: Springer International Publishing, 2020, pp. 512–523. https://doi.org/10.1007/978-3-030-52237-7_41 .
Murakhovs’ka, L., Wu, C.-S., Laban, P., Niu, T., Liu, W., & Xiong, C. MixQG: Neural Question generation with mixed answer types. In Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, United States: Association for Computational Linguistics, Jul. 2022, pp. 1486–1497. https://doi.org/10.18653/v1/2022.findings-naacl.111 .
Liu, B., et al., Learning to generate questions by learning what not to generate. In The World Wide Web Conference, in WWW ’19. New York, NY, USA: Association for Computing Machinery, May 2019, pp. 1106–1118. https://doi.org/10.1145/3308558.3313737 .
Sasazawa, Y., Takase, S., & Okazaki, N. Neural question generation using interrogative phrases. In Proceedings of the 12th International Conference on Natural Language Generation, Tokyo, Japan: Association for Computational Linguistics, Oct. 2019, pp. 106–111. https://doi.org/10.18653/v1/W19-8613 .
Kim, Y., Lee, H., Shin, J., & Jung, K. Improving neural question generation using answer separation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019, https://doi.org/10.1609/aaai.v33i01.33016602 .
Ma, X., Zhu, Q., Zhou, Y. & Li, X. Improving question generation with sentence-level semantic matching and answer position inferring. Proc. AAAI Conf. Artif. Intell. https://doi.org/10.1609/aaai.v34i05.6366 (2020). (PMID: 10.1609/aaai.v34i05.6366)
Naeiji, A., An, A., Davoudi, H., Delpisheh, M., & Alzghool, M., Question generation using sequence-to-sequence model with semantic role labels. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia: Association for Computational Linguistics, May 2023, pp. 2830–2842. Accessed: Sep. 04, 2023. https://aclanthology.org/2023.eacl-main.207.
Sun, Y., Liu, S., Dan, Z., & Zhao, X. Question generation based on grammar knowledge and fine-grained classification. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea: International Committee on Computational Linguistics, Oct. 2022, pp. 6457–6467. Accessed: Sep. 04, 2023. [Online]. Available: https://aclanthology.org/2022.coling-1.562.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423 .
Sun, Y., Chen, C., Chen, A. & Zhao, X. Tibetan question generation based on sequence to sequence model. Comput. Mater. Contin. 68(3), 3203–3213. https://doi.org/10.32604/cmc.2021.016517 (2021). (PMID: 10.32604/cmc.2021.016517)
Gusmita, R. H., Durachman, Y., Harun, S., Firmansyah, A. F., Sukmana, H. T., & Suhaimi, A. A rule-based question answering system on relevant documents of Indonesian Quran translation. In 2014 International Conference on Cyber and IT Service Management (CITSM), Nov. 2014, pp. 104–107. https://doi.org/10.1109/CITSM.2014.7042185 .
Shao, T., Guo, Y., Chen, H. & Hao, Z. Transformer-based neural network for answer selection in question answering. IEEE Access 7, 26146–26156. https://doi.org/10.1109/ACCESS.2019.2900753 (2019). (PMID: 10.1109/ACCESS.2019.2900753)
Naseem, T., et al., A semantics-aware transformer model of relation linking for knowledge base question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Online: Association for Computational Linguistics, Aug. 2021, pp. 256–262. https://doi.org/10.18653/v1/2021.acl-short.34 .
Yang, Z., Hu, J., Salakhutdinov, R., & Cohen, W., Semi-supervised QA with generative domain-adaptive nets. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada: Association for Computational Linguistics, Jul. 2017, pp. 1040–1050. https://doi.org/10.18653/v1/P17-1096 .
See, A., Liu, P. J., & Manning, C. D. Get to the point: Summarization with pointer-generator networks (2017). arXiv: https://doi.org/10.48550/arXiv.1704.04368 .
Vinyals, O., Fortunato, M., & Jaitly, N. Pointer networks. In Advances in Neural Information Processing Systems, Curran Associates, Inc., 2015. Accessed: Dec. 29, 2022. https://papers.nips.cc/paper/2015/hash/29921001f2f04bd3baee84a12e98098f-Abstract.html.
Nallapati, R., Zhou, B., dos Santos, C., Gülçehre, Ç., & Xiang, B. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 280–290. https://doi.org/10.18653/v1/K16-1028 .
Subramanian, S., Wang, T., Yuan, X., Zhang, S., Trischler, A., & Bengio, Y. Neural models for key phrase extraction and question generation. In Proceedings of the Workshop on Machine Reading for Question Answering, Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 78–88. https://doi.org/10.18653/v1/W18-2609 .
Sun, X., Liu, J., Lyu, Y., He, W., Ma, Y., & Wang, S. Answer-focused and position-aware neural question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 3930–3939. https://doi.org/10.18653/v1/D18-1427 .
Back, S., Kedia, A., Chinthakindi, S. C., Lee, H., & Choo, J. Learning to generate questions by learning to recover answer-containing sentences. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online: Association for Computational Linguistics, Aug. 2021, pp. 1516–1529. https://doi.org/10.18653/v1/2021.findings-acl.132 .
Rodriguez-Torrealba, R., Garcia-Lopez, E. & Garcia-Cabot, A. End-to-End generation of multiple-choice questions using text-to-text transfer transformer models. Expert Syst. Appl. 208, 118258. https://doi.org/10.1016/j.eswa.2022.118258 (2022). (PMID: 10.1016/j.eswa.2022.118258)
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., & McClosky, D. The stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, Maryland: Association for Computational Linguistics, Jun. 2014, pp. 55–60. https://doi.org/10.3115/v1/P14-5010 .
Kumar, V., Muneeswaran, S., Ramakrishnan, G., & Li, Y.-F. ParaQG: A system for generating questions and answers from paragraphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 175–180. https://doi.org/10.18653/v1/D19-3030 .
Arumae, K., & Liu, F. Guiding extractive summarization with question-answering rewards. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 2566–2577. https://doi.org/10.18653/v1/N19-1264 .
Dugan, L., et al., A feasibility study of answer-agnostic question generation for education. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 1919–1926. https://doi.org/10.18653/v1/2022.findings-acl.151 .
Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020).
Uto, M., Tomikawa, Y., & Suzuki, A. Difficulty-controllable neural question generation for reading comprehension using item response theory. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 119–129. https://doi.org/10.18653/v1/2023.bea-1.10 .
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019).
Lord, F. M., Applications of Item Response Theory to Practical Testing Problems. Routledge, 2012. https://doi.org/10.4324/9780203056615 .
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, pp. 311–318. https://doi.org/10.3115/1073083.1073135 .
Banerjee, S., & Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan: Association for Computational Linguistics, Jun. 2005, pp. 65–72. Accessed: Nov. 24, 2022. [Online]. Available: https://aclanthology.org/W05-0909.
Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81. Accessed: Dec. 23, 2022. [Online]. Available: https://aclanthology.org/W04-1013.
Ushio, A., Alva-Manchego, F., & Camacho-Collados, J. Generative language models for paragraph-level question generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 670–688. https://doi.org/10.18653/v1/2022.emnlp-main.42 .
Gatt, A. & Krahmer, E. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. J. Artif. Intell. Res. 61, 65–170. https://doi.org/10.1613/jair.5477 (2018). (PMID: 10.1613/jair.5477)
Laban, P., Wu, C.-S., Murakhovs’ka, L., Liu, W., & Xiong, C. Quiz design task: Helping teachers create quizzes with automated question generation. In Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, United States: Association for Computational Linguistics, Jul. 2022, pp. 102–111. https://doi.org/10.18653/v1/2022.findings-naacl.9 .
Gutiérrez-Fandiño, A., et al., MarIA: Spanish language models. 2021, https://doi.org/10.48550/ARXIV.2107.07253 .
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. SQuAD: 100,000+ Questions for machine comprehension of text (2016). https://doi.org/10.48550/ARXIV.1606.05250 .
Xue, L. et al., mT5: A massively multilingual pre-trained text-to-text transformer (2020). https://doi.org/10.48550/ARXIV.2010.11934 .
Shazeer, N. GLU variants improve transformer. arXiv:2002.05202 (2020).
Press, O., Smith, N. A., & Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation (2022).
Muennighoff, N. et al., Crosslingual generalization through multitask finetuning (2022). https://doi.org/10.48550/ARXIV.2211.01786 .
Seonwoo, Y., Kim, J.-H., Ha, J.-W., & Oh, A. Context-aware answer extraction in question answering (2020). https://doi.org/10.48550/ARXIV.2011.02687 .
Bird, S., & Loper, E. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 214–217. Accessed: Jan. 09, 2023. [Online]. Available: https://aclanthology.org/P04-3031.
Vedantam, R., Lawrence Zitnick, C., & Parikh, D. CIDEr: Consensus-based image description evaluation. Presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575. Accessed: Dec. 23, 2022. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2015/html/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.html.
Sharma, S., Asri, L. E., Schulz, H., & Zumer, J. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation (2017). https://doi.org/10.48550/ARXIV.1706.09799 .
Salton, G., Wong, A. & Yang, C. S. A vector space model for automatic indexing. Commun. ACM 18(11), 613–620. https://doi.org/10.1145/361219.361220 (1975). (PMID: 10.1145/361219.361220)
Xu, W., Napoles, C., Pavlick, E., Chen, Q. & Callison-Burch, C. Optimizing statistical machine translation for text simplification. Trans. Assoc. Comput. Linguist. 4, 401–415. https://doi.org/10.1162/tacl_a_00107 (2016). (PMID: 10.1162/tacl_a_00107)
Wu, Y. et al., Google’s neural machine translation system: Bridging the gap between human and machine translation (2016) https://doi.org/10.48550/ARXIV.1609.08144 .
Morris, A. An information theoretic measure of sequence recognition performance. IDIAP Research Report (2002).
Morris, A., Maier, V., & Green, P. From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition (2004). https://doi.org/10.21437/Interspeech.2004-668 .
Vajjala, S., Majumder, B., Gupta, A., & Surana, H. Practical Natural Language Processing, 1st ed. O’Reilly Media, 2020, p. 454.
Patil, S. Question generation using transformers (2023). Accessed Sep 06, 2023. https://github.com/patil-suraj/question_generation.
Entry Date(s):
Date Created: 20240705 Latest Revision: 20240708
Update Code:
20250114
PubMed Central ID:
PMC11226668
DOI:
10.1038/s41598-024-66472-5
PMID:
38969767
Database:
MEDLINE

Further Information

Multiple-choice test generation is one of the most complex NLP problems, especially in languages other than English, where prior research is scarce. A review of the literature shows that earlier methods, such as rule-based systems and early neural networks, have given way to a more recent architecture, the Transformer, for the tasks of Answer Extraction (AE) and Question Generation (QG). This study therefore focuses on finding and developing better models for the AE and QG tasks in Spanish, using an answer-aware methodology. For this purpose, three multilingual models (mT5-base, mT0-base and BLOOMZ-560M) were fine-tuned on three datasets: a Spanish translation of the SQuAD dataset; SQAC, a native Spanish dataset; and their union (SQuAD + SQAC), which yields slightly better results. Regarding the models, the performance of mT5-base was compared with that of two newer models, mT0-base and BLOOMZ-560M. These models had been fine-tuned for multiple tasks in the literature, including AE and QG, but in general the best results are obtained from the mT5 models trained in this study on the SQuAD + SQAC dataset, with some further good results from mT5 models trained only on SQAC. For evaluation, the widely used BLEU-1 to BLEU-4, METEOR and ROUGE-L metrics were computed, on which mT5 outperforms some similar research works. In addition, CIDEr, SARI, GLEU, WER and cosine similarity were calculated to provide a benchmark within the AE and QG problems for future work.
(© 2024. The Author(s).)
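The answer-aware methodology described in the abstract can be sketched as follows: the extracted answer span is marked inside the source context before the pair is fed to a sequence-to-sequence model such as mT5. This is a minimal illustrative sketch; the `<hl>` highlight token and the `generate question:` task prefix are common conventions in answer-aware QG pipelines, not necessarily the exact format used in the paper.

```python
# Build the input string for answer-aware question generation:
# the answer span is wrapped in highlight tokens so the model
# knows which part of the context the question should target.
def build_qg_input(context: str, answer: str) -> str:
    """Wrap the answer span in <hl> tokens and add a task prefix."""
    start = context.find(answer)
    if start == -1:
        raise ValueError("answer must appear verbatim in the context")
    end = start + len(answer)
    highlighted = f"{context[:start]}<hl> {answer} <hl>{context[end:]}"
    return f"generate question: {highlighted}"

# Example with a Spanish context, matching the paper's target language:
print(build_qg_input("La capital de España es Madrid.", "Madrid"))
```

The resulting string would then be tokenised and passed to the fine-tuned model, whose decoded output is the generated question; the reference question from SQuAD or SQAC serves as the target during fine-tuning.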