Treffer: Model Checking Using Large Language Models—Evaluation and Future Directions.
Weitere Informationen
Large language models (LLMs) such as ChatGPT have risen in prominence recently, leading to the need to analyze their strengths and limitations for various tasks. The objective of this work was to evaluate the performance of large language models for model checking, which is used extensively in various critical tasks such as software and hardware verification. A set of problems were proposed as a benchmark in this work and three LLMs (GPT-4, Claude, and Gemini) were evaluated with respect to their ability to solve these problems. The evaluation was conducted by comparing the responses of the three LLMs with the gold standard provided by model checking tools. The results illustrate the limitations of LLMs in these tasks, identifying directions for future research. Specifically, the best overall performance (ratio of problems solved correctly) was 60%, indicating a high probability of reasoning errors by the LLMs, especially when dealing with more complex scenarios requiring many reasoning steps, and the LLMs typically performed better when generating scripts for solving the problems rather than solving them directly. [ABSTRACT FROM AUTHOR]
Copyright of Electronics (2079-9292) is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)