NIST launches ARIA program to assess societal impacts, ensure trustworthy AI systems

The U.S. National Institute of Standards and Technology (NIST) has introduced the ‘Assessing Risks and Impacts of AI’ (ARIA) program, a testing, evaluation, validation, and verification (TEVV) initiative aimed at deepening the understanding of artificial intelligence’s capabilities and effects. The program evaluates the societal risks and impacts of AI systems, particularly in real-world interactions, and aims to establish methods for measuring how systems perform within societal contexts once deployed. It also aims to help organizations and individuals determine whether a given AI technology will be valid, reliable, safe, secure, private, and fair once deployed.

ARIA’s outcomes will contribute to the U.S. AI Safety Institute’s efforts to establish reliable and trustworthy AI systems. The program follows several announcements NIST made around the 180-day mark of the Executive Order on trustworthy AI, as well as the U.S. AI Safety Institute’s unveiling of its strategic vision and international safety network.

Additionally, ARIA is one of several NIST evaluation initiatives that address the agency’s assignment under the President’s Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence (14110) to launch an initiative to create guidance and benchmarks for evaluating and auditing AI capabilities.

ARIA expands on the AI Risk Management Framework, which NIST released in January 2023, and helps to operationalize the framework’s risk measurement function, which recommends that quantitative and qualitative techniques be used to analyze and monitor AI risks and impacts. ARIA will help assess those risks and impacts by developing a new set of methodologies and metrics for quantifying how well a system maintains safe functionality within societal contexts.

“In order to fully understand the impacts AI is having and will have on our society, we need to test how AI functions in realistic scenarios — and that’s exactly what we’re doing with this program,” Gina Raimondo, U.S. Commerce Secretary, said in a media statement. “With the ARIA program, and other efforts to support Commerce’s responsibilities under President Biden’s Executive Order on AI, NIST and the U.S. AI Safety Institute are pulling every lever when it comes to mitigating the risks and maximizing the benefits of AI.”

“The ARIA program is designed to meet real-world needs as the use of AI technology grows,” said Under Secretary of Commerce for Standards and Technology and NIST Director Laurie E. Locascio. “This new effort will support the U.S. AI Safety Institute, expand NIST’s already broad engagement with the research community, and help establish reliable methods for testing and evaluating AI’s functionality in the real world.”

“Measuring impacts is about more than how well a model functions in a laboratory setting,” said Reva Schwartz, NIST Information Technology Lab’s ARIA program lead. “ARIA will consider AI beyond the model and assess systems in context, including what happens when people interact with AI technology in realistic settings under regular use. This gives a broader, more holistic view of the net effects of these technologies.”

The ARIA effort will inform the work of the U.S. AI Safety Institute, and over time the U.S. AI Safety Institute Consortium may help enhance and produce ARIA-style evaluations at scale for use across industries.

The initial ARIA activities will focus on risks and impacts associated with large language models (LLMs), including the use of AI agents. The risks and impacts of LLMs will be evaluated across three levels: model testing, red-teaming, and field testing. As part of an evaluation of safe and trustworthy AI, submitting organizations will be required to provide documentation about their models, approaches, mitigations, and guardrails, along with information about their governance processes. In future evaluations, documentation requirements may be expanded and may constitute part of the final score.

While the first set of ARIA activities will focus on risks related to the generative AI technology of LLMs, the ARIA evaluation environment is flexible, and future iterations will broaden beyond generative AI. The ARIA participant community and other researchers can provide input on future evaluation topics, domains, and technologies. For example, subsequent ARIA evaluations may consider other generative AI technologies such as text-to-image models, or other forms of AI such as recommender systems or decision support tools.

ARIA will originate a suite of qualitative, quantitative, and mixed methods to measure risks, impacts, trustworthy characteristics, and technical and societal robustness of models within the specified context of use. NIST will develop these metrics in close collaboration with ARIA participants. Selected ARIA evaluation output data will be made available as a rich corpus for research purposes, including the development of novel metrics for use in ARIA. 

NIST evaluations provide all participants with the opportunity to obtain vital information about their submitted technology components, make adjustments based on what they learn, and resubmit for further testing. While many organizations evaluate their technology internally, participating in NIST evaluations allows them to see what is working in their models, often in comparison with other organizations on the same tests, with the same data, and under the same conditions.

While the ARIA evaluations are open to all who wish to participate, the evaluation cycle typically concludes with a participant-only workshop to discuss new and promising approaches that may help submitters understand how to improve their models. Teams participating in ARIA can expect to glean information during testing and the workshop(s) that will help them deliver safe and trustworthy AI.

NIST evaluation results are made publicly available. The level of information to be made public is determined in advance for each evaluation. ARIA participants may decide to anonymize their submissions so that each team knows only how it performed in comparison to the others.

Even when results are not tied to a particular participating organization, the public will have access to the specific results of all technologies that have been evaluated. That information is valuable in gaining an understanding of how these technologies perform in a real-world context.
