Task-level AI assurance
The missing piece in the conversation
AI assurance is being discussed everywhere: in government hallways, in boardrooms of Fortune 500 companies grappling with responsible innovation, on the bustling floors of tech startups racing to build trustworthy systems, in the lecture halls of universities developing the next generation of AI ethics courses, and in the international standards committees drafting frameworks for global implementation.
Everyone is developing frameworks and processes for evaluating AI, and international standards are the key arena in which these frameworks and processes converge. However, something is missing from many conversations — something that is apparent to practitioners but not being discussed nearly enough: the importance of a task-level approach.
Beyond Generic Frameworks
Ultimately, everyone wants to ensure AI is accurate, robust, and free of unwanted bias. Other qualities, such as controllability, are also essential, but they are simpler to evaluate.
Let's wind back to before the current AI iteration gained popularity. ISO/IEC 25010 is a software quality standard that defines product and service quality. The standard provides a framework for specifying, measuring, and evaluating the quality of software and computer systems. It covers a range of quality characteristics, including functionality, reliability, usability, performance efficiency, and security. I have often used the taxonomy to explore stakeholder needs and structure conventional software testing. But AI systems are different in some ways, and their functionality is inherently much more suited to quantifiable evaluation.
A few years ago, I led a project in ISO/IEC to create an addendum for AI systems. We concluded that the introduction of AI modified the concept of "functional correctness", which is analogous to accuracy. New concepts were required for "functional adaptability", "robustness", "user controllability", "transparency" and "intervenability". This conclusion has permeated other standards work, especially the use of the term "functional correctness" to describe accuracy (a word that has a narrower, metric-specific meaning in AI). Bias, another popular concept, was not added because it can't ultimately be separated from functional correctness.
Understanding the Task-Level Approach
Before proceeding, let's discuss what I mean by a task. An AI task is a specific problem to be solved by algorithmic means, which may involve physical or cognitive elements such as producing translations, navigating spaces, or creating synthetic content. Classification, regression, ranking, and clustering are common examples of such tasks. More complex examples include object detection, machine translation, automatic summarisation, automated speech recognition, pose estimation and image segmentation.
Bias and robustness both represent critical dimensions of AI system quality, yet they are mainly measured using derivatives of task-level functional correctness metrics. While these dimensions are often treated as separate concerns, their practical evaluation fundamentally depends on our ability to measure how AI systems perform across different contexts, inputs, and conditions. This capability exists at the task-specific level.
Bias and robustness measurements are, primarily, differential assessments of functional correctness. There are, of course, other ways to measure robustness and bias, but these approaches are the most mature and prevalent. They analyse how a system's performance varies (a minimal sketch in code follows this list):
For bias: across different demographic groups, sensitive attribute values or any other characteristic of the data
For robustness: across different perturbations, distributional shifts or other unexpected conditions that can occur
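To make the pattern concrete, here is a minimal Python sketch of a differential assessment. The toy correctness metric, the slicing key and the records are all illustrative assumptions; the point is the shape of the calculation: one task-level metric, applied per slice, then compared.

```python
# Minimal sketch of a differential assessment: compute one task-level
# correctness metric per slice of the data, then report the gap.
# The metric, the slicing function and the records are illustrative only.
from collections import defaultdict
from typing import Callable, Iterable

def sliced_scores(records: Iterable[dict],
                  metric: Callable[[list], float],
                  slice_key: Callable[[dict], str]) -> dict[str, float]:
    """Group records by a slice (demographic group, perturbation type, ...)
    and apply the same task-level correctness metric to each slice."""
    slices = defaultdict(list)
    for record in records:
        slices[slice_key(record)].append(record)
    return {name: metric(items) for name, items in slices.items()}

def accuracy(records: list) -> float:
    """Toy task-level correctness metric for a classification task."""
    return sum(r["prediction"] == r["label"] for r in records) / len(records)

# Hypothetical records: the same correctness metric, assessed differentially.
records = [
    {"label": 1, "prediction": 1, "group": "A"},
    {"label": 0, "prediction": 0, "group": "A"},
    {"label": 1, "prediction": 0, "group": "B"},
    {"label": 0, "prediction": 0, "group": "B"},
]
per_group = sliced_scores(records, accuracy, lambda r: r["group"])
print(per_group)                                           # {'A': 1.0, 'B': 0.5}
print(max(per_group.values()) - min(per_group.values()))   # disparity: 0.5
```

The same loop works whether the slices are demographic groups or perturbation conditions; only the slicing key changes.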
Each measurement becomes most meaningful when anchored to task-specific definitions of correctness. As I mentioned earlier, this is obvious to practitioners; nobody would assess the correctness of a classification model without using the metrics that have been established over decades. What isn't apparent to many is how this applies in the era of general-purpose AI systems, such as large language models.
Why Generic Standards Fall Short
Generic process-level standards provide valuable frameworks for development practices, but they cannot effectively measure functional correctness, bias, or robustness because, without a task-specific definition, these measurements become abstract and uninterpretable. For example, a management system standard might say that metrics should be defined, and a testing standard might say how those metrics should be selected, but without getting down to the task level it is hard to turn either into clear, concrete steps.
Task-Level Metrics in Practice
Consider these examples, which illustrate how bias and robustness measurements derive from task-level functional correctness metrics:
Named Entity Recognition Example
Named Entity Recognition is a fundamental natural language processing task that identifies and categorises specific elements in text that represent real-world objects, such as people, organisations, locations, dates, monetary values, and other named entities. We can discuss it without mentioning any specific technology. It performs two key functions:
Identifying the boundaries of named entities within text (detection)
Assigning appropriate category labels to these entities (typing)
We can assess its functional correctness using modified, named entity recognition-specific variants of classical metrics, such as precision, recall, and F1 scores, computed over the identified entities. (Look out for future standards on how this differs from other forms of F1 score.)
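As a concrete illustration, here is a minimal Python sketch of entity-level precision, recall and F1 under the common exact-match convention (a predicted entity counts only if both its span and its type match a gold entity). The entities and offsets below are invented, and real benchmarks each fix their own matching rules.

```python
# Entity-level precision/recall/F1 for NER under an exact span-and-type match.
# The example entities and offsets are made up for illustration.
from typing import NamedTuple

class Entity(NamedTuple):
    start: int   # character (or token) offset where the entity begins
    end: int     # offset where it ends
    type: str    # e.g. "PER", "ORG", "LOC"

def entity_f1(gold: set[Entity], predicted: set[Entity]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over exact span+type matches."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {Entity(0, 12, "PER"), Entity(27, 33, "ORG")}
pred = {Entity(0, 12, "PER"), Entity(27, 33, "LOC")}   # wrong type on one entity
print(entity_f1(gold, pred))                           # (0.5, 0.5, 0.5)
```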
Bias measurement: Comparing F1 scores between mentions of different demographic groups reveals potential bias. For example, I can determine the F1 score on just the mentions of Western figures within the dataset, the F1 score on non-Western figures only, and the disparity between the two.
Robustness measurement: I can evaluate how F1 scores change when entities are expressed in different ways, such as with different spellings. Perhaps I can determine that the F1 score on texts with standard spelling (for example, in news articles) is 0.90 and the F1 score on texts with certain spelling variations (for example, in social media) is 0.65. This gap reveals how robust the AI system is at the named entity recognition task.
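A minimal sketch of how such a robustness check could be wired up is below; `score_fn` (which runs the NER system and returns its F1) and `perturb` (which introduces the spelling variation) are hypothetical stand-ins, not references to any real library.

```python
# Sketch of a robustness check for the NER task: score the same system on
# clean text and on a perturbed copy of it, then compare the two F1 scores.
# `score_fn` and `perturb` are hypothetical placeholders supplied by the user.
from typing import Callable

def robustness_gap(examples: list[dict],
                   score_fn: Callable[[list[dict]], float],
                   perturb: Callable[[str], str]) -> tuple[float, float, float]:
    """Return (clean F1, perturbed F1, gap) for the same evaluation examples."""
    clean_f1 = score_fn(examples)
    # A real pipeline would also need to re-align gold entity spans after
    # perturbing the text; this sketch leaves that to the scorer.
    perturbed = [{**ex, "text": perturb(ex["text"])} for ex in examples]
    perturbed_f1 = score_fn(perturbed)
    return clean_f1, perturbed_f1, clean_f1 - perturbed_f1

# Usage with hypothetical components (not real libraries):
# clean, noisy, gap = robustness_gap(test_set, ner_f1_scorer, social_media_spelling)
# e.g. clean=0.90, noisy=0.65, gap=0.25 would match the scenario described above.
```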
Computer Vision Example
For detecting objects in images, we can use mean Average Precision (mAP) at different Intersection over Union (IoU) thresholds in the same way. We can look at the mAP differences across demographic attributes and under different lighting conditions to understand bias and robustness, respectively.
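For readers less familiar with detection metrics, here is a small Python sketch: the IoU function is the standard overlap computation, while the per-condition mAP numbers are illustrative placeholders that, in practice, would come from running a full evaluation tool separately on each slice of the test set.

```python
# IoU between two axis-aligned boxes, plus an illustrative per-condition
# comparison of mAP. The mAP numbers below are assumed, not measured.

def iou(box_a: tuple[float, float, float, float],
        box_b: tuple[float, float, float, float]) -> float:
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Illustrative robustness comparison: same metric (mAP at IoU 0.5), two conditions.
map_per_condition = {"daylight": 0.82, "low_light": 0.61}   # assumed numbers
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))                  # ~0.143
print(map_per_condition["daylight"] - map_per_condition["low_light"])  # 0.21 gap
```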
The Mathematical Relationship
Bias and robustness metrics are functions of task-specific correctness metrics. Functional correctness, bias and robustness are therefore all quantified most meaningfully in terms of the same foundational, task-specific metric.
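One simple way to write this down (the notation is mine, not drawn from any standard) is to let M be the task-specific correctness metric, D the evaluation data, D_A and D_B two subgroups of it, and T a perturbation:

```latex
% Illustrative formalisation only; M, D_A, D_B and T are placeholder notation.
\[
  \mathrm{Bias}_{A,B} = M(D_A) - M(D_B)
  \qquad
  \mathrm{Robustness}_{T} = M(T(D)) - M(D)
\]
```

Neither quantity is defined until M, the task-level metric, has been fixed, which is exactly the dependency this piece argues for.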
Implications for AI Assurance
This understanding has several important implications for AI assurance:
Standards development: Testing, bias and robustness standards must build upon task-specific functional correctness standards rather than being developed in isolation.
Testing frameworks: Effective testing requires establishing task-appropriate evaluation metrics and extending these across relevant variations. When organisations use standardised task-specific metrics, comparisons between different AI systems become more meaningful and reliable through benchmarks.
Regulatory compliance: Regulations addressing accuracy, bias and robustness (such as in the EU AI Act) need to reference task-specific standards to be implementable.
AI procurement: Organisations evaluating AI systems should first understand the appropriate task-level metrics before assessing claims of correctness, bias, and robustness.
Moving Forward
The evolving landscape of AI standardisation is progressing toward more granular and task-specific approaches to assurance. While process-level standards provide valuable frameworks for overall AI development, the next critical evolution requires more precise evaluation at the task level, particularly through specialised metrics tailored to specific AI tasks. This is not an easy task: getting a diverse group of real experts into the standardisation community to define the right categories of tasks, and the requirements surrounding each metric, is challenging.
ISO/IEC 4213 is a technical specification that provides evaluation metrics for classification tasks; it is currently being updated to include regression, recommendation and clustering tasks. As the standards landscape evolves, we can expect an increasing number of standards that explicitly derive from and reference task-level evaluation standards. This includes not only ISO/IEC 4213 and its forthcoming second edition, but also the much-needed, in-progress task-level evaluation standards such as ISO/IEC 23282 for natural language processing and CEN-CENELEC's JT021025 for computer vision technologies.
Conclusion
Bias and robustness in AI systems are not separate concerns that can be addressed independently of functional correctness. Instead, they represent critical dimensions of functional correctness across different contexts and conditions.
By recognising that bias and robustness metrics derive from task-specific correctness metrics, we aim to establish a more coherent and practical approach to AI assurance that treats these quality dimensions as inherently connected rather than isolated concerns.
This task-based approach bridges the gap between abstract principles and concrete, measurable performance criteria, enabling more meaningful assessment and improvement of AI systems across their full spectrum of operational conditions.

