Benchmarking Large Language Models for Motorway Driving Scenario Understanding
Graz University of Technology, Institute of Automotive

SAE Technical Papers (1906-current) Available online

Format:
Book
Conference/Event
Author/Creator:
Zhou, Ji, author.
Contributor:
Eichberger, Arno
Yang, Aixi
Zhao, Yongqi
Conference Name:
2024 International Conference on Smart Transportation Interdisciplinary Studies (2024-12-13 : Nanjing, China)
Language:
English
Physical Description:
1 online resource
Place of Publication:
Warrendale, PA : SAE International, 2025
Summary:
Systematic testing of Automated Driving Systems (ADS) requires finding relevant test cases. Extracting critical cases, also called edge or corner cases, from naturalistic driving data is a complex and error-prone task. Large Language Models (LLMs) have been employed in recent years for virtual testing of ADS; however, quantitative benchmarking of LLM performance on this task has rarely been investigated. In this paper, six LLMs were selected, based on their differing characteristics, to benchmark their ability to understand ADS functional scenarios on motorways. A novel scenario classification model was introduced to increase the granularity with which motorway driving scenarios are categorized. Driving scenarios described in natural language were defined to test how well these LLMs understand various scenarios and convert them into standardized structured data. To benchmark the models in a standardized manner, the same prompt-engineering approach and the same dataset were used with each selected LLM, and the models' sensitivity to variations in language style was explored. For each group of classified driving scenarios, the test data was split into two different natural-language description formats, both of which were fed to the LLMs. The test results indicate that the "gpt-4-1106-preview" model achieves the highest accuracy, followed by "gpt-3.5-turbo" and "llama3-70b-instruct", while the other LLMs show error consistency between 40% and 60%. The "gpt-4-1106-preview" and "llama3-70b-instruct" models show lower error consistency between their outputs for the two natural-language formats, indicating greater robustness to varying textual inputs. This work contributes to the application of LLMs to scenario extraction for ADS testing.
Notes:
Vendor-supplied data
Publisher Number:
2025-01-7146
Access Restriction:
Restricted for use by site license