Accuracy, Validity, Reliability, Robustness and Resilience (AVR3)

We welcome all feedback and recommendations for improvement.

OVERVIEW

KS OWNER – SUNDAR NARAYANAN & RYAN CARRIER
KS DEPUTY –
LAST UPDATED – 13th December 2021

Context:

This body of knowledge, guidance, normative criteria and definitions will provide decision makers with the tools, comprehensive documentation and knowledge they require to confidently examine, consider, deploy and disclose decisions on AI, algorithmic and autonomous systems based upon sufficient accuracy, validity, reliability, robustness and resilience.

Accuracy, validity, reliability, resilience and robustness (AVR3) for AI, algorithmic and autonomous systems requires the integration of multiple disciplines that share the same general terminology (Testing and Evaluation) but whose key risks, impacts, processes, considerations and outputs may vary widely.

For example:
Testing & Evaluation – Food and Drugs: focus on humans and specific adverse events
Testing & Evaluation – Model Risk Management: focus on adverse consequences and effective challenge
Testing & Evaluation – Software/Hardware testing: focus on performance and threat testing
Testing & Evaluation – Highly-integrated transportation safety testing: focus primarily on safety

AI, algorithmic and autonomous systems require an integration of all these approaches because these socio-technical systems frequently encompass all of the above domains and have a meaningful impact on humans. They are applied in multiple domains and contexts, such as healthcare, financial services, cloud computing and autonomous vehicles, and their impact on humans is profound and ubiquitous. Therefore, our AVR3 approach attempts to integrate many testing and evaluation perspectives in order to maximize effectiveness and minimize Residual Risk to humans. The focus on risk mitigation to humans is aligned with ForHumanity’s mission and the primary focus of Independent Audit of AI Systems.

The objectives we intend to quantify and document are:

Provide a risk-based process by which organisations can examine, consider and deploy models considering accuracy, validity, reliability, robustness and resilience in a contextually relevant way that maximizes risk mitigation and includes documentation and acceptance of Residual Risk and appropriate disclosure
Examine and consider accuracy for the broadest range of protected categories and intersections thereof to minimize discrimination and bias
Deploy valid models that produce benefit to humans consistent with established and transparent ethical principles while minimizing risk
Deploy Reliable, Resilient and Robust models to enhance trust, secure privacy and minimize threats/harms

Accuracy, validity, reliability, resilience and robustness have been critical risk management tools for quantitative models, and a necessary step towards regulatory conformity, for decades in financial services. These approaches are increasingly being adopted for machine learning models. Emerging regulations, including the Draft EU Artificial Intelligence Act, expect valid models to ensure accuracy, robustness and cybersecurity. With this context, this document provides select criteria and guidance regarding audit, audit compliance and certification of AI and autonomous systems to deliver sufficient Accuracy, Validity, Reliability, Robustness and Resilience.

Accuracy, Validity, Reliability, Robustness and Resilience are measures for evaluating whether an AI, algorithmic or autonomous system has sufficient integrity. These systems are used in a variety of domains. Hence, we have harmonized learnings from multiple industries (for audit and certification of AI, algorithmic and autonomous systems) including:

adverse events tracking from healthcare,
safety testing from highly integrated transportation, nuclear energy and hazardous chemicals,
adverse event tracking from clinical trials,
red team exercises from information security,
edge case testing and source code verification from software testing,
resilience metrics monitoring from cloud services,
model monitoring from financial services.

In addition, because of the socio-technical nature of these systems and their impact on humans, AI, algorithmic and autonomous systems retain novel considerations, established in the AI/ML space, including:

data quality,
model quality,
pipeline quality,
tradeoff validity,
construct validity,
data entry point attacks.

This document will be updated on an ongoing basis to reflect newer learning or emerging areas that may be relevant to AVR3. For instance, the document currently does not have references or guidance for validation of federated learning algorithms.

Purpose of AVR³ BoK: To provide guidance on sufficient, mature and key insufficient evidence for compliance in the context of AVR³

Governance

This guidance introduces the Testing and Evaluation Committee (TEC). The Testing and Evaluation Committee is entrusted with the responsibility and documentation associated with oversight to assess the risks, understand the underlying impacts, evaluate associated mitigation measures and report on residual risks. This committee collaborates with the Algorithmic Risk Committee (ARC) and Ethics Committee (EC) in the context of AVR3.

Special note on Validity/Validation

The terms Validity and Validation are often used in industry to mean two separate things. Validation, and validation dataset, are frequently used alongside testing data in the context of proving that a model is sufficiently fit for purpose, while the validity of a model is often referenced to mean specific things such as content or construct validity. We define the term Validity below in order to avoid confusion, specifically with datasets used in testing.

Terms and Definitions:

Defined Term	Definition
Accuracy	An indicator of a system’s functional correctness to produce an output consistent with a defined scope, nature, context and purpose compared against the true or absolute correct value.
Adverse Event Tracking System (AETS)	A system available to the public (including internal stakeholders, partners, customers, civil society, industrial associations and general public) to report or register information regarding adverse events contributed by artificial intelligence, algorithmic or autonomous systems.
Business Continuity Plan (BCP)	A scheme that describes a system of prevention and recovery from potential threats to a company, ensuring that personnel and assets are protected and are able to function quickly in the event of a discontinuity, threat or disaster. The plan should include disaster recovery plans and prioritization of restoration.
Construct validity (New)	The extent to which performance metrics or measures represent the ground truth with the theoretical construct.
Data Quality (New)	The quality of data that makes it representative and aligned to the Scope, Nature, Context and Purpose of the intended use as applicable to an algorithm. Quality of data refers to data that is reasonably and sufficiently relevant, complete and free from errors in aggregation, annotation, maintenance, enrichment, ground truth constructive (inference or proxy or causative), correct syntax, sampling and training-test split as appropriate to the specific domain and/or industry context from reasonably calibrated sources.
Dark Pattern (New)	Deceptive UI/UX interactions, including nudges, that are non-transparent, constrain or limit choices and/or are not in the best interest of the users.
Ethics Risk Analysis (New)	A study of instances of ethical choice, soft law, application of *Code of Ethics* and *Code of Data Ethics* principles and shared moral frameworks across the lifecycle of the AI, algorithmic or autonomous system.
Edge Case Testing (New)	An approach to conducting a test (internal or external) for the highest and lowest values in a range of possible values that occur in extreme situations in an AI or autonomous system, using Edge case test data (e.g. safety, adverse events, disparate impact, etc., including those caused by a change to scope, nature, context and purpose).
Defect (New)	Defects are imperfections, faults, errors or deficiencies in the outcomes where the system does not meet its requirements or specifications, including inconsistencies in prediction results (e.g. biased results) for the same or similar data input.
Failure (New)	Failure is an event when the AI, algorithm or autonomous system does not perform as designed, including incorrect prediction or statistically different prediction for a given instance or inability to guard against an adversary.
Ground Truth (New)	Information ascertainable as real or true through direct observation or deductive measurements (rather than through inference), wherein such observations or measurements are defensible beyond reasonable doubt. Further, deductive measurements shall deduce the fact from self-declaration or unconstrained subsequent choices by the user (about whom the information is established as real or true).
Information Quality (New)	The quality of the content of AI, algorithm or autonomous systems that is representative of the fitness for use (scope, nature, context and purpose). It refers to accuracy of data in representing ground truth and relevance of the data for the slated scope, nature, context and purpose.
Material Defect, Failure and Adverse Event (New)	A Defect, Failure or Adverse Event is said to be material when the potential degree of harm is high for any individual, or when there is potential moderate impact for a group of people (which may or may not be a protected group), or both.
Model and pipeline quality (New)	The quality of the model refers to the collective integrity of its components including the algorithm (including its versions), pipeline, serving infrastructure and integrations between pipeline components.
Overseer	A natural person assigned by a system provider or user to act as no less than a Human-on-the-Loop whose responsibilities include knowing the capacity and limitations of the system, sufficient training for the regular operation including the identification of anomalies, dysfunctions and unexpected performance. They shall be aware that there should not be an over-reliance on the system and their role is to avoid automation-bias. They should have sufficient training on the outputs, interpretation tools and methods. They should be trained when to disregard, override or reverse the output of the system including real-time intervention during processing to interrupt the system using a stop button.
Residual Risks	Unmitigated risk pertaining to a specific risk input or the aggregation of all risk in an AI, algorithmic or autonomous system.
Reliability	The extent to which the results can be reproduced when the research is repeated under the same conditions.
Resilience	The speed and capability of the system to recover from major disruption (at a data level, model & pipeline level and information level) to a sufficient level of function in accordance with the system’s intended operation.
Robustness	The ability of a system to withstand (keep regular and anticipated function in spite of) exceptional or unforeseen events and stressful conditions such as component failures, loss of service, adversarial attacks or extreme conditions beyond the expected operating environment.
TEC At-Risk Report (New)	A periodic report (at least quarterly) prepared or compiled by TEC containing risk mitigations for identified risks and residual (unmitigated) risks with accuracy, validity, reliability, robustness and resilience of AI, algorithmic or autonomous systems. The report shall also contain the risk log (list of risks considered or identified), risk evaluation and control assessment performed during the process.
Test Evaluation Environment (New)	An environment (virtual or hardware, instrumentation, simulators, software tools, and other support elements) that enables testing (internal or external) the reliability and robustness of an AI, algorithmic or autonomous system using methods including red team exercises and adversarial attacks.
Validity	The extent to which the results really measure what they are supposed to measure (intended purpose), presently and as time passes. It is distinct from the concept of a (validation) dataset as it relates to training and testing data.
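
The Reliability definition above (results reproduced when repeated under the same conditions) can be illustrated with a small check. This is a minimal sketch under stated assumptions, not a method prescribed by this guidance: the function name `reproducibility_rate`, the two toy models and the sample inputs are all hypothetical, and "same conditions" is simplified to "same input, same process".

```python
import random
from typing import Callable, Sequence


def reproducibility_rate(
    predict: Callable[[Sequence[float]], int],
    inputs: Sequence[Sequence[float]],
    repeats: int = 5,
) -> float:
    """Fraction of inputs whose prediction is identical across all repeats."""
    if not inputs:
        raise ValueError("inputs must not be empty")
    stable = 0
    for x in inputs:
        # Collect the distinct outputs produced for the same input.
        outputs = {predict(x) for _ in range(repeats)}
        if len(outputs) == 1:
            stable += 1
    return stable / len(inputs)


def deterministic_model(x: Sequence[float]) -> int:
    # Hypothetical model: thresholds the mean of the features.
    return int(sum(x) / len(x) > 0.5)


def make_noisy_model(seed: int) -> Callable[[Sequence[float]], int]:
    rng = random.Random(seed)

    def noisy_model(x: Sequence[float]) -> int:
        # Hypothetical unreliable model: ignores the input and flips a coin.
        return rng.randint(0, 1)

    return noisy_model


inputs = [[0.1, 0.2], [0.9, 0.8], [0.5, 0.7], [0.4, 0.3]]
stable_rate = reproducibility_rate(deterministic_model, inputs)
noisy_rate = reproducibility_rate(make_noisy_model(0), inputs, repeats=20)
print(f"deterministic model: {stable_rate:.2f}, noisy model: {noisy_rate:.2f}")
```

A rate below 1.0 would be one possible input to the risk log; what rate counts as sufficient is a contextual decision and is not set here.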

Criteria

Sufficiency of AVR3: Ethics Committee shall examine, consider and deploy measures to ensure that sufficient accuracy, validity, reliability, robustness and resilience exist for at-risk protected categories and intersections thereof.
TEC At-Risk Report: TEC shall submit to ARC and EC a report on residual risks arising from technical evaluations, together with suggested risk mitigations, as the TEC At-Risk Report on an as-needed basis or at least quarterly. The TEC shall include residual risks arising from examination of metrics and measures including at-risk protected category (and intersections thereof) validation, inaccuracies, documentation, change management, industry norms and benchmarking, multi-stakeholder feedback and diverse inputs, quality management, causality validation, source code testing, security validation, stress testing for safety, resilience testing, the Adverse Event Tracking System and the post market monitoring mechanism.
Industry standards: Where there are comparable systems or industry standards, the TEC shall be familiar with industry standards for accuracy levels associated with the AI, algorithmic or autonomous system. TEC shall examine inaccuracies and document them in the TEC At-Risk Report. In the absence of industry standards, especially for innovative technologies, the TEC shall examine, consider and deploy comparables from relevant fields to understand a potential maturity lifecycle for acceptable thresholds for accuracy, validity, reliability, robustness and resilience. Lack of industry standards should be documented as a Residual Risk, as noted by the EDPB guidance on High Risk data processing^[1]. These systems should also be deemed High Risk in the absence of sufficient risk mitigation.
Industry norms and benchmarking: TEC shall examine metrics, measures and tradeoff considerations with established industry / domain specific thresholds or norms (including threshold for pre-trained models) for reasonableness. TEC shall report unmitigated risks as part of the TEC At-Risk report.
Multi-stakeholder feedback: ARC shall examine, consider and deploy feedback and risk mitigations found to be reasonable from multi-stakeholder feedback and diverse input regarding validation, accuracy, reliability, robustness and resilience.
Security validation: The TEC along with the CISO should create a Test Evaluation Environment (TEE). The TEC shall deploy measures to mitigate the risks arising out of the Test Evaluation Environment.
Residual Risk management: ARC, EC and Overseer shall examine, consider and deploy risk mitigations documented in the TEC At-Risk Report.
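
The criteria above ask for accuracy to be examined per at-risk protected category and intersection, not only in aggregate. The sketch below shows one way such a breakdown could be computed for a binary classifier. It is illustrative only: the record layout, the group labels ("A", "B", "young", "older"), the sample data and the 0.7 accuracy threshold are hypothetical assumptions, and the guidance itself does not prescribe these metrics or values.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (intersection key, true label, predicted label); labels are 0 or 1.
Record = Tuple[Tuple[str, ...], int, int]


def metrics_by_group(records: Iterable[Record]) -> Dict[Tuple[str, ...], Dict[str, float]]:
    """Accuracy, false positive rate and false negative rate per intersection."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in records:
        if y_true == 1 and y_pred == 1:
            cell = "tp"
        elif y_true == 0 and y_pred == 1:
            cell = "fp"
        elif y_true == 0 and y_pred == 0:
            cell = "tn"
        else:
            cell = "fn"
        counts[group][cell] += 1

    def ratio(num: int, den: int) -> float:
        # Undefined rates (no samples in the denominator) are reported as NaN.
        return num / den if den else float("nan")

    report = {}
    for group, c in counts.items():
        total = sum(c.values())
        report[group] = {
            "n": total,
            "accuracy": ratio(c["tp"] + c["tn"], total),
            "false_positive_rate": ratio(c["fp"], c["fp"] + c["tn"]),
            "false_negative_rate": ratio(c["fn"], c["fn"] + c["tp"]),
        }
    return report


# Hypothetical evaluation records for two intersections.
records: List[Record] = [
    (("A", "young"), 1, 1), (("A", "young"), 0, 0),
    (("A", "young"), 0, 1), (("A", "young"), 1, 1),
    (("B", "older"), 1, 0), (("B", "older"), 1, 1),
    (("B", "older"), 0, 0), (("B", "older"), 1, 0),
]
report = metrics_by_group(records)

ACCURACY_THRESHOLD = 0.7  # assumed value for illustration only
below_threshold = [g for g, m in report.items() if m["accuracy"] < ACCURACY_THRESHOLD]
print(below_threshold)
```

Intersections that fall below the chosen threshold, or that have too few samples for the rates to be meaningful, are candidates for the residual risks reported in the TEC At-Risk Report.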

General Guidance

Scope: ARC shall examine and consider the scope of AVR3 and ensure alignment with scope, nature, context and purpose of the AI, algorithmic or autonomous system.
Setting up of TEC: ARC or a duly designated officer of the organisation shall establish a Testing and Evaluation Committee (TEC) and a TEC policy which outlines its role, responsibility, reporting lines, domain specificity, and the process or threshold that determines when an internal T&E is sufficient and when an Independent T&E will be required. TEC is recommended to be a committee with representatives from validation and other teams; however, it can also be a duly designated individual responsible for Testing and Evaluation of the AI, algorithmic or autonomous systems (similar to the committee management and substitution process found in Section 4.4 of the Certification Scheme Manual for UK GDPR).
TEC At-Risk Report: A periodic report (at least quarterly) prepared or compiled by TEC containing risk inputs along with their suggested mitigations associated with accuracy, validity, reliability, robustness and resilience of AI, algorithmic or autonomous systems and any unmitigated residual risks. The report shall also contain the risk log (list of risk inputs considered or identified), risk evaluation and control assessment performed during the process.
Industry norms and benchmarking: The TEC shall examine, consider and deploy measures for AVR3 and mitigate risks arising out of accepting industry norms and establishing thresholds and benchmarks. A lack of industry norms should be included in the TEC At-Risk Report as High Risk. TEC shall evaluate continued appropriateness of industry standards, thresholds and benchmarks to determine the need for reassessment of AVR3. TEC shall update the gaps as part of TEC At-Risk Report.
Metrics and Measures: TEC shall examine the existing monitoring or review practices in AVR3, including the metrics monitored for validity (including F1 score, ROC curve, false positives, false negatives, etc.), consider residual risks (if any) remaining after adopting such metrics, and record such residual risks in the TEC At-Risk Report. At-risk protected categories and intersections thereof shall be examined specifically with regard to accepted metrics and measures.
Inaccuracies: TEC shall examine the accuracy level of the model and also examine the factors relating to inaccurate models or outcomes. Inaccuracies should be examined and considered in the context of impacts to humans. Consider such inaccuracies and factors as residual risks and include them in the TEC At-Risk Report.
Documentation: TEC shall examine the test plan (test items, features to be tested, test tasks etc), test design specification (test conditions, detailed approach and high level test cases), test case specification (pre-conditions, input specifications, postconditions), test procedure and reports thereof for AVR3 that are created, maintained and managed (as guided by IEEE 829). TEC shall also examine the documentation that tracks the resolutions of the findings. TEC shall update the unresolved findings or inadequacies in the documentation that can reasonably result in risk to humans as part of TEC At-Risk Report.
Change management: TEC shall review any material change in metrics, thresholds and tradeoff decisions considered for each aspect of AVR3 for reasonableness and thereby a need for subsequent reassessment. The TEC shall deploy measures to mitigate the risks arising out of change management. TEC shall update the unresolved gaps or past errors as part of TEC At-Risk Report.
Ethical choice in change management: The TEC shall provide to the Ethics Committee details of changes in thresholds and performance ranges along with the residual risks they have identified in the process therein. The Ethics Committee shall examine and consider the ethical choice impact contributed by dynamics in thresholds and performance ranges and update the residual risks as part of Ethical Risk Analysis (ERA).
Quality management: TEC shall examine, consider and deploy measures that ensure (a) data quality (including data from external sources, data annotation and data preprocessing), (b) information quality and (c) model & pipeline quality, and record any Residual Risk data quality issues with the customers/users. This requires comprehensive audit satisfaction with all shall and should statements under data quality, information quality and model & pipeline quality. TEC shall update the unresolved gaps or past errors as part of the TEC At-Risk Report.
Construct validation: ARC should examine, consider and deploy rules that ensure that inferences and/ or causal learning models are validated against the Ground Truth (including the interpretability of the model). TEC shall update the unresolved gaps or past errors as part of the TEC At-Risk Report and the ARC shall examine, consider and accept the differences between statistical correlation and ground truth causality.
Resolving feedback received through AETS: The TEC shall deploy measures to ensure appropriate actions and/or mitigations are undertaken by designated individuals including efforts to remediate or resolve the concerns arising out of AETS.
Reporting to Accountable Officers: The TEC shall submit reports to the Officer duly designated in the TEC operating policy on a quarterly basis on Defect, Failure and Adverse Events received on AETS and how such concerns are dealt with including efforts to remediate the system or resolve the concern.
Design and User Interface validation: TEC, along with representatives from the Children’s Data Oversight Committee and/or Ethics Committee, shall examine the design of the AI, algorithmic or autonomous system, including its associated interfaces and algorithms, to ensure that dark patterns and subconscious persuasive tactics (including curtailing users’ ethical choices) that are detrimental to users are not used in the AI, algorithmic or autonomous system.
Source code validation for Robustness: The TEC and CISO shall ensure the source code is scanned for vulnerabilities, bugs or errors that could become an attack vector, based on current intelligence and published reports of vulnerabilities in libraries or open source code used for building the AI, algorithmic or autonomous system.
Security Testing: The TEC and CISO should consider creating a Test Evaluation Environment (TEE) containing (a) security test data in line with up-to-date risk intelligence and threat vectors that are known to the organization, along with its associated hardware (including chips, IoT and other embedded devices) and (b) existing control, monitoring and response capacities (e.g. the blue team) for gathering diverse inputs and multi-stakeholder feedback (e.g. red team exercises, bug bounty programs, hackathons for adversarial attacks, including data poisoning and model inversion, and hardware vulnerability testing, including failure mode or hazard analysis) on the safety and security (for Reliability and Robustness) of artificial intelligence, algorithmic and autonomous systems. The TEC should also deploy measures to gather publicly reported incidents of software, platform, hardware and/or physical vulnerabilities that may have an impact on AI, algorithmic or autonomous systems. The TEC shall deploy measures to mitigate the risks arising out of the Test Evaluation Environment. The results of the TEE shall be reported to the CISO, and unresolved or unmitigated risks from the TEE shall be included in the TEC At-Risk Report.
Stress Testing: The TEC should consider conducting Edge Case Testing (ECT) for safety (with reference to Reliability & Robustness) on a periodic basis for verifying impact or sensitivity (both in terms of testing at a point in time or tested over a period of time) including At-Risk protected category variables (or intersections thereof) (both at system and model levels) using edge case test examples gathered through internal organizational sources and through multi-stakeholder feedback and diverse inputs. The TEC shall deploy risk mitigations arising out of Edge Case Testing. The results of the ECT along with the edge case test examples used shall be reported to the ARC for satisfaction of audit requirements. Further unresolved or unmitigated risks from the ECT shall be included as part of TEC At-Risk Report under Residual Risk.
Resilience testing: The TEC should consider evaluating resilience by using randomly generated disruptions to the AI or autonomous system (e.g. Netflix Simian Army) to evaluate the speed and capability of the system to return to a sufficient level of function after the disruption, and to ensure that the Business Continuity Plan is accurate and up to date. Such testing shall individually test data resilience (capability to recover from a poisoning attack), information resilience (ability to recover from an attack that impacts information quality) and model/pipeline resilience (ability to recover from an attack on the code or the pipeline). The disruptions can be either automated or manual, as considered reasonable by the TEC. Unresolved or unmitigated risks from the Resilience testing shall be included in the TEC At-Risk Report as Residual Risks.
Public disclosure: ARC shall gather information from TEC at-risk report on Residual Risk and deploy measures to disclose to the customers or public as required.
Human-in-the-loop, Human-on-the-loop and Overseer (collectively called HTL) validation: The TEC shall examine the data relating to HTL, from a testing perspective as well as feedback from diverse inputs and multiple stakeholders, to determine if the role played by HTL is reasonably effective and Robust. This HTL testing includes Human-in-the-Loop, Human-on-the-Loop and/or the role of the Overseer. The TEC may perform additional procedures as may be necessary to evaluate the effectiveness of HTL in the given AI, algorithmic or autonomous system. The TEC shall deploy risk mitigations arising out of HTL validation across HTL and/or Overseer roles. The results of the HTL validation shall be reported to the ARC, and unresolved or unmitigated risks from the HTL validation shall be included in the TEC At-Risk Report as Residual Risks.
Reporting to Accountable Officers: The TEC shall submit reports to the duly designated Officer on a quarterly basis on Defect, Failure and Adverse Events received on AETS and how such concerns are dealt with including efforts to remediate the system or resolve the concern.
At-Risk Protected Category & intersections: The Testing and Evaluation Committee (TEC) shall compile At-Risk protected categories and intersections (gathered from ARA) and gaps in ethical choices (reported as part of ERA), examine the extent and depth of accuracy and validity metrics at the at-risk protected category level, and update the Residual Risks as part of the TEC At-Risk Report. The Ethics Committee shall examine, consider and deploy thresholds and benchmarks to ensure that sufficient accuracy, validity, reliability, robustness and resilience exist for at-risk protected categories and intersections thereof.
Adverse Event Tracking System: The ARC shall deploy an Adverse Event Tracking System (AETS) to gather information regarding adverse events contributed by artificial intelligence, algorithmic or autonomous systems. This Adverse Event Tracking System is to identify, collect or compile adverse events, failures, defects of the system as a whole and not limited to the algorithms alone, either through a public reporting or through monitoring of reasonable open source or social media sources. The ARC shall disclose and enable public access to report on AETS by stakeholders including internal stakeholders, partners, customers, civil society, industrial associations and general public.
Monitoring of Concept Drift and Data Drift: TEC shall deploy reasonable measures to monitor Data Drift and Concept Drift that may result in incorrect results, and their associated impacts. TEC shall track the adverse events attributable to Data Drift or Concept Drift and report them to ARC as part of its TEC At-Risk Report.
Post market monitoring: TEC shall examine the post market monitoring mechanism for reasonableness, including adverse events reported through the Adverse Event Tracking System, failure root cause analysis and the actions undertaken for such instances. TEC shall update the unresolved issues as part of the TEC At-Risk Report to ARC.
Post market monitoring for Autonomous systems (Blackbox): TEC should ensure creation of a robust and resilient recording system that retains information to facilitate investigation of failures and inaccuracies, including those contributed by sensors, data, model, metrics, conditions and human actions in the context of autonomous systems, by revisiting the events and circumstances at the time of failures. TEC shall update the unresolved issues as part of the TEC At-Risk Report to ARC. An example is the reporting of patient safety events in healthcare.
Residual Risk management: ARC, EC and the Overseer shall examine, consider and deploy risk mitigations documented in the TEC At-Risk Report and deploy adequate measures to (a) provide additional safety features for known chances of inaccuracies which may restrict rights and freedoms, (b) disclose such residual risks to users/customers on becoming aware that such risks remain unmitigated (as part of Residual Risk disclosure under risk management), (c) provide adequate disclaimers for Defects, Failures and Adverse Events that may exist in the artificial intelligence, algorithmic or autonomous systems, (d) provide an opportunity for users/customers to opt out as far as feasible, (e) decide on the need to decommission or override the AI, algorithmic or autonomous systems and (f) insure for potential external liabilities.
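
The guidance on monitoring Data Drift above does not name a specific measure. As one possible illustration, the sketch below computes the Population Stability Index (PSI) between a reference sample of a feature and a current sample. The function name, the equal-width binning, the smoothing constant and the sample data are assumptions made for this example; the 0.25 alert level is a commonly quoted rule of thumb, used here as an assumption and not as a value from this guidance.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between two samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def proportions(sample: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in sample:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(sample), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Hypothetical feature values: training-time reference vs. two production samples.
reference = [i / 100 for i in range(100)]
unchanged = list(reference)
shifted = [v + 0.3 for v in reference]

psi_unchanged = population_stability_index(reference, unchanged)
psi_shifted = population_stability_index(reference, shifted)

PSI_ALERT = 0.25  # assumed rule-of-thumb alert level, not from this guidance
print(psi_unchanged, psi_shifted > PSI_ALERT)
```

PSI only compares input distributions, so it speaks to Data Drift. Concept Drift (a change in the relationship between inputs and outcomes) would need a separate check, for example tracking the per-group error rates over time once ground truth becomes available.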

^[1] For more information on EDPB Guidance on High Risk Data Processing, see: https://www.cnil.fr/sites/default/files/atoms/files/20171013_wp248_rev01_enpdf_4.pdf

Mature/ Sufficient/ Insufficient

Definitions and examples in the table below will be specific to this Knowledge Store and should be populated by the Knowledge Store Owner. You are trying to capture the best summary of criteria that puts an audit response or piece of audit evidence into these categories.

MATURE

SUFFICIENT

INSUFFICIENT