
Page 1: Addressing Uncertainty in Performance Measurement of Intelligent Systems

Addressing Uncertainty in Performance Measurement of Intelligent Systems

Raj Madhavan1,2

Elena Messina1

Hui-Min Huang1

Craig Schlenoff1

1Intelligent Systems Division, National Institute of Standards and Technology (NIST)

& 2Institute for Systems Research (ISR), University of Maryland, College Park

Commercial equipment and materials are identified in this presentation in order to adequately specify certain procedures. Such identification does not imply recommendation or endorsement by NIST, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose. The views and opinions expressed are those of the presenter and do not necessarily reflect those of the organizations with which he is affiliated.

Page 2: Addressing Uncertainty in Performance Measurement of Intelligent Systems

Measuring Performance of Intelligent Systems

Performance Evaluation, Benchmarking, and Standardization are critical enablers for wider acceptance and proliferation of existing and emerging technologies

Crucial for fostering technology transfer and driving industry innovation

Currently, no consensus or standards exist on

– key metrics for determining the performance of a system

– objective evaluation procedures to quantitatively deduce/measure the performance of robotic systems against user-defined requirements

The lack of ways to quantify and characterize performance of technologies and systems has precluded researchers working towards a common goal from

– exchanging and communicating results,

– inter-comparing robot performance, and

– leveraging previous work that could otherwise avoid duplication and expedite technology transfer.

Page 3: Addressing Uncertainty in Performance Measurement of Intelligent Systems

Measuring Performance of Intelligent Systems

The lack of ways to quantify and characterize technologies and systems also hinders adoption of new systems

– Users don’t trust claims by developers

– There is a lack of knowledge about how to match a solution with a problem

Users may be reluctant to try a new technology for fear of expensive failure:

– Think of the “graveyards” of unused equipment in some places

Page 4: Addressing Uncertainty in Performance Measurement of Intelligent Systems

Challenges in Measuring Performance of IS

Diversity of applications and deployment scenarios for the IS

Complexity of the Intelligent System itself
– Software components
– Hardware components
– Interactions between components
– System of Systems

Lack of a well-defined mathematical foundation for dealing with uncertainty in a complex system

– methods for computing performance measures and related uncertainties

– techniques for combining uncertainties and making inferences based on those uncertainties

– approaches for estimating uncertainties for predicted performance

Page 5: Addressing Uncertainty in Performance Measurement of Intelligent Systems

Uncertainty and Complexity

Uncertainty and complexity are often closely related

The abilities to handle uncertainty and complexity are directly related to the levels of autonomy and performance

Page 6: Addressing Uncertainty in Performance Measurement of Intelligent Systems

Autonomy Levels for Unmanned Systems (ALFUS) Framework

Standard terms and definitions for characterizing the levels of autonomy for unmanned systems

Metrics, methods, and processes for measuring autonomy of unmanned systems

Contextual Autonomous Capability: http://www.nist.gov/el/isd/ks/autonomy_levels.cfm/ (Hui-Min Huang)

Page 7: Addressing Uncertainty in Performance Measurement of Intelligent Systems
Page 8: Addressing Uncertainty in Performance Measurement of Intelligent Systems

Addressing Uncertainty in Performance Measurement via Complexity

In this context, the performance we are trying to measure is taken to mean successful completion of the mission

Being able to handle higher levels of mission and environmental complexity results in higher system performance

We can determine whether program-specific performance requirements are achievable

[Figure: Mobility example relating the three axes of the framework. Mission complexity/uncertainty is characterized by UGV metrics (maximum speed/acceleration, endurance distance/duration, minimum turn/bank radius) and UGV team metrics (coordination within the team and with bystanders). Environment complexity/uncertainty ranges from flat, paved surfaces through unpaved surfaces to unknown terrain. System complexity/uncertainty ranges from the mobility subsystem through the UMS to the team organization.]

Page 9: Addressing Uncertainty in Performance Measurement of Intelligent Systems

Test Methods (1): Hurdle Test Method

The purpose of this test method is to quantitatively evaluate the vertical step surmounting capabilities of a robot, including variable chassis configurations and coordinated behaviors, while being remotely teleoperated in confined areas under lighted and dark conditions.

Metrics
• Maximum elevation (cm) surmounted for 10 repetitions
• Average time per repetition

Hurdle Test Method Results: Numbers indicate successful repetitions. A count of 10 corresponds to a reliability of 80% (probability of success) that the robot can perform the task at the associated apparatus setting. Measurement uncertainty (in measuring obstacle traverse capability): one half of the obstacle size increment (5 cm) and of the elapsed time unit (30 s).
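A minimal sketch (in Python) of how the bookkeeping for this test method might look. The data layout, function name, and the reading of the uncertainties (half the 5 cm increment, half the 30 s time unit) are illustrative assumptions, not NIST's actual tooling:

def summarize_hurdle_test(results_by_height, increment_cm=5.0, time_unit_s=30.0):
    """results_by_height: {height_cm: [(success, elapsed_s), ...]} per apparatus setting.

    Returns the maximum elevation with 10 successful repetitions, the average
    time per repetition at that elevation, and the stated measurement
    uncertainties (half the obstacle-size increment, half the time unit).
    """
    qualifying = []
    for height, reps in sorted(results_by_height.items()):
        successes = sum(1 for ok, _ in reps if ok)
        if successes >= 10:  # 10 successes ~ 80% reliability, per the caption
            qualifying.append((height, sum(t for _, t in reps) / len(reps)))
    if not qualifying:
        return None  # no setting was surmounted reliably
    max_height, avg_time = qualifying[-1]
    return {
        "max_elevation_cm": max_height,
        "avg_time_per_rep_s": avg_time,
        "elevation_uncertainty_cm": increment_cm / 2.0,  # +/- 2.5 cm
        "time_uncertainty_s": time_unit_s / 2.0,         # +/- 15 s
    }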

Page 10: Addressing Uncertainty in Performance Measurement of Intelligent Systems

[Figure: Comms example relating the same three axes. Requirement: to communicate effectively throughout the mission. Mission complexity/uncertainty spans projecting remote situational awareness from down range and traversing while performing the comms task. Environment complexity/uncertainty ranges from a flat, paved surface with no surrounding objects, through the UMS operating area, to the presence of EMI. System complexity/uncertainty ranges from the comms subsystem through the UMS to the UMS team comms plan.]

Page 11: Addressing Uncertainty in Performance Measurement of Intelligent Systems

Test Methods (2): Radio Comms (LOS) Test Method

The purpose of this test method is to quantitatively evaluate the line-of-sight (LOS) radio communications range for a remotely teleoperated robot.

Metric
• Maximum distance (m) downrange at which the robot completes tasks to verify the functionality of control, video, and audio transmissions.

Line-of-Sight Radio Comms Test Method: Stations are placed every 100 m for testing two-way communications. Multiple test tasks at each station together provide the repetitions needed for repeatability.
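As a hedged illustration (not the actual scoring software), the metric reduces to finding the farthest station at which all three verification tasks pass. The data layout and the all-tasks-pass criterion below are assumptions:

STATION_SPACING_M = 100  # stations every 100 m, per the test method description

def max_los_range(task_results):
    """task_results: {distance_m: {"control": bool, "video": bool, "audio": bool}}

    Returns the farthest station at which control, video, and audio
    transmissions were all verified.
    """
    verified = [d for d, tasks in sorted(task_results.items()) if all(tasks.values())]
    return max(verified) if verified else 0

# Example: all tasks verified through 400 m, video drops at 500 m -> metric is 400 m
trials = {
    100: {"control": True, "video": True, "audio": True},
    200: {"control": True, "video": True, "audio": True},
    300: {"control": True, "video": True, "audio": True},
    400: {"control": True, "video": True, "audio": True},
    500: {"control": True, "video": False, "audio": True},
}
print(max_los_range(trials))  # 400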

Page 12: Addressing Uncertainty in Performance Measurement of Intelligent Systems

• System – a set of interacting or interdependent components forming an integrated whole intended to accomplish a specific goal

• Component – a constituent part or feature of a system that contributes to its ability to accomplish a goal

• Capability – a specific purpose or functionality that the system is designed to accomplish

• Technical Performance – metrics related to quantitative factors (such as accuracy, precision, time, and distance) as required to meet end-user expectations

• Utility Assessment – metrics related to qualitative factors that gauge the quality or condition of being useful to the end-user

SCORE

SCORE (System, Component and Operationally Relevant Evaluations)
• Is a unified set of criteria and software tools for defining a performance evaluation approach for complex intelligent systems
• Provides a comprehensive evaluation blueprint that assesses the technical performance of a system, its components, and its capabilities through isolating and changing variables, as well as capturing end-user utility of the system in realistic use-case environments

Page 13: Addressing Uncertainty in Performance Measurement of Intelligent Systems

How SCORE Handles Complexity

• The complexity of the “system under test” grows as more components are introduced into the evaluation

• Components evaluated in the elemental tests are less complex than sub-systems (which contain multiple components), which are in turn less complex than the whole system

• SCORE tests at these various levels of complexity

• Data in the following slides indicate that the results of the elemental tests can accurately predict the performance of the sub-system tests (which are more complex), and so on.
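One way to picture the layering is sketched below. The level names and record layout are assumptions for illustration, not SCORE's actual software tools:

from dataclasses import dataclass

@dataclass
class Evaluation:
    level: str    # "elemental", "sub-system", or "system"
    metric: str   # e.g., "BLEU", "Low-level Concept Transfer"
    subject: str  # the component, sub-system, or whole system under test
    score: float

def scores_at_level(evaluations, level):
    """Collect results at one level of complexity so that elemental scores
    can be compared against the sub-system and system scores they are
    expected to predict."""
    return [e.score for e in evaluations if e.level == level]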

Page 14: Addressing Uncertainty in Performance Measurement of Intelligent Systems

• GOAL – Demonstrate capabilities to rapidly develop and field free-form, two-way speech-to-speech translation systems enabling English and foreign language speakers to communicate with one another in real-world tactical situations.

• NIST was funded over the past three years to serve as the Independent Evaluation Team for this effort.

• METRICS (as specified by DARPA)

• System usability testing – provides overall scores for the capabilities of the whole system

• Software component testing – evaluates components of a system to see how well they perform in isolation

TRANSTAC

Page 15: Addressing Uncertainty in Performance Measurement of Intelligent Systems

TRANSTAC: A Quick Tutorial on Speech Translation

"Please open the car door." → Automatic Speech Recognition (ASR) → Machine Translation (MT) → Text To Speech (TTS) → "يرجى فتح باب السيارة" (Arabic: "Please open the car door.")
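A conceptual sketch of that chain follows. The three functions are hypothetical placeholders standing in for real ASR, MT, and TTS engines, not a TRANSTAC API:

def asr(english_audio):
    """Automatic Speech Recognition: audio in, source-language text out."""
    return "please open the car door"  # placeholder for a real recognizer

def mt(source_text):
    """Machine Translation: source-language text in, target-language text out."""
    return "يرجى فتح باب السيارة"  # placeholder for a real MT engine

def tts(target_text):
    """Text To Speech: target-language text in, synthesized audio out."""
    return b"\x00"  # placeholder for synthesized audio bytes

def translate_speech(english_audio):
    # The pipeline on this slide: ASR -> MT -> TTS
    return tts(mt(asr(english_audio)))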

Page 16: Addressing Uncertainty in Performance Measurement of Intelligent Systems

Automated Metrics: For speech recognition, we calculated Word Error Rate (WER). For machine translation, we calculated BLEU and METEOR.
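For reference, WER is the word-level edit distance between reference and hypothesis transcripts, normalized by reference length. A minimal sketch of the standard definition (not the specific scoring tool used in the evaluation):

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("please open the car door", "please open car door"))  # 0.2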

TTS Evaluation: Human judges listened to the audio outputs of the TTS engines and compared them to the text strings fed into the engine. They then gave a Likert score indicating how understandable the audio was. WER was also used to judge the TTS output.

Low-Level Concept Transfer: A direct, quantitative measure of the transfer of the low-level elements of meaning. In this context, a low-level concept is a specific content word (or words) in an utterance. For example, the phrase "The house is down the street from the mosque." is one high-level concept, but is made up of three low-level concepts (house, down the street, mosque).

Likert Judgment: A panel of bilingual judges rated the semantic adequacy of the translations, an utterance at a time, choosing from a seven point scale.

High-Level Concept Transfer: The number of utterances judged to have been successfully transferred. The high-level concept metric is an efficiency metric, showing the number of successful utterances per unit of time as well as accuracy.
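As described, the metric combines an accuracy rate with a throughput rate. A small sketch, under the assumption that judgments arrive as one boolean per utterance:

def high_level_concept_metrics(judged_utterances, total_minutes):
    """judged_utterances: one bool per utterance, True if the panel judged
    the high-level concept to have been successfully transferred."""
    successes = sum(judged_utterances)
    return {
        "accuracy": successes / len(judged_utterances),
        "successful_utterances_per_minute": successes / total_minutes,
    }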

Surveys/Semi-Structured Interviews: After each live scenario, the Soldiers/Marines and the foreign language speakers filled out a detailed survey asking them about their experiences with the TRANSTAC systems. In addition, semi-structured interviews were performed with all participants, exploring questions such as "What did you like?", "What didn't you like?", and "What would you change?"

TRANSTAC Metrics

Page 17: Addressing Uncertainty in Performance Measurement of Intelligent Systems

SCORE Level           Metric                        Team 1  Team 2  Team 3
Elemental             BLEU                             1       2       2
Elemental             METEOR                           1       2       2
Elemental             TTS                              1       1       2
Sub-System            Low-level Concept Transfer       1       2       2
System                Likert Judgment                  1       2       2
System                High-Level Concept Transfer     1       2       3
System (Qualitative)  User Surveys                     1       2       3

(Rows are ordered by increasing complexity; entries are team rankings.)

From these data, it appears that:

• the quantitative performance of the elements of the systems has a direct correlation to the quantitative performance of the sub-systems;

• the quantitative performance of the sub-systems has a direct correlation to the quantitative performance of the overall system;

• the quantitative performance of the overall system has a direct correlation to the qualitative perception of the soldiers using the systems.

TRANSTAC
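As an illustrative check (one plausible way to quantify the claim, not the evaluation's actual analysis), a plain correlation over the team rankings in the table above shows the elemental, sub-system, and system rankings moving together:

from math import sqrt

rankings = {
    "BLEU (elemental)":               [1, 2, 2],
    "Low-level Concept (sub-system)": [1, 2, 2],
    "Likert Judgment (system)":       [1, 2, 2],
    "High-Level Concept (system)":    [1, 2, 3],
    "User Surveys (qualitative)":     [1, 2, 3],
}

def corr(x, y):
    """Pearson correlation applied to the rank vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

print(corr(rankings["BLEU (elemental)"], rankings["Likert Judgment (system)"]))   # 1.0
print(corr(rankings["BLEU (elemental)"], rankings["User Surveys (qualitative)"])) # ~0.87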

Page 18: Addressing Uncertainty in Performance Measurement of Intelligent Systems

In Conclusion …

Thank you!