The Transform-and-Perform Framework: Explainable Deep Learning Beyond Classification

In recent years, visual analytics (VA) has shown promise in alleviating the challenges of interpreting black-box deep learning (DL) models. While the focus of VA for explainable DL has been mainly on classification problems, DL is gaining popularity in high-dimensional-to-high-dimensional (H-H) problems such as image-to-image translation. In contrast to classification, H-H problems have no explicit instance groups or classes to study. Each output is continuous, high-dimensional, and changes in an unknown non-linear manner with changes in the input. These unknown relations between the input, model, and output necessitate analyzing them in conjunction, leveraging symmetries between them. Since classification tasks do not exhibit some of these challenges, most existing VA systems and frameworks allow limited control of the components required to analyze models beyond classification. Hence, we identify the need for and present a unified conceptual framework, the Transform-and-Perform framework (T&P), to facilitate the design of VA systems for DL model analysis focusing on H-H problems. T&P provides a checklist to structure and identify workflows and analysis strategies to design new VA systems, and to understand existing ones, uncovering potential gaps for improvement. The goal is to aid the creation of effective VA systems that support the structuring of model understanding and the identification of actionable insights for model improvements. We highlight the growing need for new frameworks like T&P with a real-world image-to-image translation application. We illustrate how T&P effectively supports the understanding and identification of potential gaps in existing VA systems.

There are several challenges in interpreting DL model behavior: the increasing complexity of model architectures [12], dataset size [13], the high dimensionality [14] of inputs and outputs [15], the relation of input changes to output effects [16], and the output type (e.g., classification, regression). In contrast to classification, the complexities of analyzing high-dimensional regression outputs arise in DL models for high-dimensional-to-high-dimensional (H-H) problems. H-H problems involve translating or converting data from one form to another, for example, generating a CT from an MRI brain scan [7]. These problems can be defined as Y = f(X), where inputs X are processed by the model f, which can be highly non-linear, to generate outputs Y. The translation is defined for continuous scalar spaces f: R^n → R^m, where X ∈ R^n, Y ∈ R^m. Specifically, text and categorical data are out of scope for our work. Although a classification problem has the same form, we target the more generic case, where Y has no explicit groups or classes and is a continuous high-dimensional space, i.e., m ≫ 1 and m ≈ n, making the model difficult to interpret.
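The contrast between a low-dimensional classification output and a continuous H-H output can be sketched with a toy numpy example. This is an illustrative sketch, not from the paper: the thresholding "classifier" and the mean-filter "translator" are stand-ins for trained DL models.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((64, 64))  # input image, X in R^n with n = 64*64

def classify(x):
    # Classification: R^n -> R^1, a single low-dimensional output with
    # explicit classes to group by.
    return int(x.mean() > 0.5)

def translate(x):
    # H-H translation: R^n -> R^m with m >> 1 and m ~ n. A toy smoothing
    # filter stands in for a non-linear image-to-image model f.
    out = x.copy()
    out[1:-1, 1:-1] = (x[:-2, 1:-1] + x[2:, 1:-1] +
                       x[1:-1, :-2] + x[1:-1, 2:]) / 4.0
    return out

Y = translate(X)
print(classify(X))  # one discrete value: instances group naturally into classes
print(Y.shape)      # (64, 64): every output variable is continuous, no classes
```

Every output pixel in `Y` can respond differently to input changes, which is why instance- or class-level analysis strategies from classification do not carry over directly.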
In recent years, visual analytics (VA) has shown promise in DL model interpretation [12], including model understanding [13], [17], [18], compression and design [19], [20], [21], out-of-distribution analysis [22], and hyperparameter tuning [23]. The focus has primarily been on classification [15], which does not exhibit the complexities of H-H problems. The complexity in interpreting H-H problems arises from the lack of explicit and interpretable data groups, causing user questions to be less well-defined. Hence, the VA system needs to support multiple user workflows for a given goal. For example, the user might need to explore groups across inputs and outputs before studying the associated model behavior. With a focus on classification, existing VA systems often do not allow sufficient control for such analysis and are challenging to extend to H-H problems.
Conceptual VA frameworks support various stages [24], [25] of machine learning (ML) and explainable AI workflows [26], [27]. The inputs, model, and outputs have been identified as fundamental components for analyzing DL models [28]. However, H-H problems add a level of complexity. There is little work on the needs and the analysis process of the components in conjunction, which is required to identify underlying symmetries in the data and model behavior. Further, the identification and analysis of multiple user workflows to study unknown input-output relations is missing. This analysis is especially relevant for studying problems beyond classification and has received little attention.
To address these challenges, we propose Transform-and-Perform (T&P), a unified framework that supports VA designers in structuring and identifying workflows and analysis strategies to design VA systems. T&P primarily focuses on H-H problems. The end goal is to support the creation of effective VA systems that support model developers in complex model understanding and improvement. In addition, T&P supports understanding and uncovering potential gaps in existing VA systems. To this end, we facilitate analysis of relations across the 3Ws of model behavior: 1) when a model behavior occurs, i.e., input and ground-truth analysis, 2) how & why a model behavior occurs, i.e., model and training-context analysis, and 3) what the model behavior is, i.e., model output and performance analysis. We utilize the symmetries in input-output relations to study model behavior. These symmetries, which define how the outputs change with input changes, are intrinsic to a model and unify a broad class of models [16]. For example, in MNIST classification [29], the output class is expected to be invariant, i.e., constant, for an acceptable range of rotations. In CT brain segmentation [30], the output mask is expected to be equivariant, i.e., to change in the same way as the input, for all input rotations. These relations underlie the breakthrough DL performance and provide a formal way to understand models' inductive biases. Inductive biases are the internal assumptions that aid the model in making inferences beyond seen data. Hence, understanding these relations is critical to model interpretation.
The T&P framework derives its name from its focus on transforming, i.e., modifying, the 3Ws to perform analysis and identify actionable insights that boost model performance.
We define the 3Ws (when, how & why, what) of model behavior and how their analysis can be supported. We structure the study of unknown input-output relations via their symmetries, i.e., invariant (no effect), equivariant (same effect), and generally variant (other effects) relations. We propose a user question generation method to support user workflows centered around the 3Ws. We provide a checklist for using T&P to identify workflows and analysis strategies for VA system design. We illustrate through practical use cases how T&P facilitates VA system design with an MRI medical image-to-image formation model, and the benefit of T&P in understanding and uncovering potential gaps in existing VA systems.

CHALLENGES & OVERVIEW
Below are the primary challenges (C1-C6) in building VA systems for H-H DL models addressed in this work. We additionally motivate and provide an overview of T&P.
DL models can be formalized as Y = f(X). The explainable AI (XAI) centric perspective focuses on understanding the models' internals. We identify the following challenges that impact the end user's ability to understand models and their underlying inductive biases.
C1 Size of datasets - Group-based exploration can aid in identifying global model patterns in huge datasets. Identifying such global patterns is otherwise complex with instance-level analysis [13].

C2 High-dimensional inputs - Pre-defined class groups in classification tasks are often insufficient due to intra-class variance [31]. For example, a model could detect large, brightly lit objects differently than objects under shadows [32]. This grouping is more complex in H-H problems, where groups are not inherently present.

C3 Architectural complexity - Model complexity and variability depend on the use case. Existing VA tools often focus on linear architectures [13] and primarily analyze the last compressed layer. This analysis is not possible in H-H problems, where non-linear models [33] are often used, making the analysis complex.

C4 High-dimensional outputs - The number of output neurons/classes in classification tasks is small. In contrast, H-H problems [3] have high-dimensional outputs similar in size to the input. Output variables, like pixels within an instance, can perform differently and have unique sensitivity maps with respect to the input. Studying and summarizing this information is challenging.

C5 Output type - In regression problems, like H-H problems, there is no clear demarcation of right versus wrong. Since the outputs are continuous, operating ranges need to be defined [34]. There are no explicit groups, and users may be unaware of the specific input characteristics (e.g., noise level [35]) that affect the model. These unknowns necessitate identifying custom input properties, i.e., control variables, to analyze.

C6 Input-output relations - Studying input-output symmetries [16] can support the understanding of a model's inductive biases [11]. These relations need to be studied with respect to control variables, which may be unknown and need to be explored. Further, the relation studied might be complex. For example, an MNIST [29] model output might be invariant to acceptable rotations [36] and expected to vary otherwise.

Studying models that encode complex input-output relations involves the above challenges. VA can support alleviating the challenges of such analysis. However, addressing these challenges, especially C3-C6, which co-occur and increase complexity in H-H problems, is minimally explored in the VA community. These co-occurring complexities in H-H problems lead to model behavior that is unknown a priori, resulting in users performing multiple exploration tasks and complicating VA design. These workflow possibilities must be explicitly considered. Hence, we propose the T&P framework to facilitate VA system design by identifying workflows and analysis strategies for studying such models (see Fig. 1). T&P takes the models, datasets, and analysis goals of interest as inputs. The model analysis component defines a means to interact with DL models via a visualization by providing a model behavior specification. This specification is designed around the three model components discussed previously, the 3Ws: when, how & why, and what of model behavior. The user performs analyze, search, and query tasks [37] on the model behavior until gaining knowledge to update the analysis input. T&P is detailed in Section 4.

RELATED WORK
The following sections describe related work on existing VA systems and conceptual VA frameworks.

Visual Analytics Systems
VA systems proposed for explainable DL implicitly support the 3Ws, i.e., when, how/why, and what. This section focuses on existing VA systems, analyzed based on their level of support of the 3Ws for various tasks. These identified support levels are used as a basis to develop our framework. Some existing VA systems focus only on the input [38], the output [39], or the model space. Rauber et al. [40] proposed a system to visualize the relationships between activations for classification tasks. Although the system supports model understanding, identifying model problems requires more control over the input-output space, for example, supporting understanding of how intra-class changes affect activations to analyze failures. The need to analyze the input-output relations is accentuated while studying complex high-dimensional outputs. For example, in segmentation, variations in activations are caused by both the class and the segmented object size [41]. These complexities highlight the challenge of building systems for high-dimensional outputs.
Prior works have also shown support for two of the three components (input, model, or output). Targeting model understanding, Hohman et al. [17] summarized feature attributions associated with specific images. Das et al. [42] visualized neuron connections contributing to misclassifications and adversarial attacks. The analysis of adversarial attacks partially supports understanding inductive biases, showing the benefit of additional output analysis for model understanding. However, the user's ability to define misclassifications is limited, i.e., to those between predefined classes. User support to add perturbations has been shown, like adding noise [35], [43] to a given image instance to analyze its effect on the output. However, to improve models, it is essential to study higher-level failure patterns, i.e., at a group level [32].
Partial support for all three analysis components (input, model, and output) has also been shown for goals ranging from model design [19], [21] and diagnosing the transfer learning process [46] to model exploration [18], [35]. However, the level of control over the components is limited. For example, input analysis is often limited to the instance level [18], [35] or based on predefined class groups [19], [21]. These limitations make it challenging to use such VA systems on large datasets or where intra-class variation is high. Additionally, these systems are not extendable to regression cases with no explicit groups to study [47]. Targeting exploration of industry-scale models, ActiVis [13] identified the need for group-level analysis of large datasets, allowing users to define custom groups to study activation and output patterns.
However, support for users unaware of what essentially forms a group or a class (i.e., input control groups or variables) and for studying unknown input-output relations is not addressed. On the other hand, automatic methods like LIME [48] are challenging to use since the unknown input-output relations (see Section 2) necessitate interaction.
VA systems also support Generative Adversarial Networks (GANs), consisting of a generator network that generates high-dimensional outputs and a discriminator network that classifies real versus fake data. These systems [49], [50] focus on understanding the performance of the discriminator and the generated data distribution. Since the data is still grouped by real versus fake, they do not fully reflect the challenges of H-H problems. Although some work exists on understanding the generator's input space [51], its relation to the generated images has been studied only for predefined input characteristics, like age. Furthermore, there is little work on understanding the generated output image quality required in H-H problems. While VA systems for segmentation models have been proposed [15], their high-dimensional outputs are grouped based on object presence and the sub-spaces of segmented pixels, which is challenging in H-H problems.
In summary, most prior VA systems insufficiently support analysis of the 3Ws for studying the unknown input-output relations required in H-H problems. Prior systems assume that users have knowledge of some of the 3Ws. When the model behavior is unknown beforehand, the user is likely to explore [37] input-output relations, complicating VA design.

Ontologies & Frameworks
Our contribution is a new VA framework addressing the analysis of DL models for H-H problems. Foundational workflows such as the visualization [45] and VA [44] models conceptualize the insight generation process and are generic. With the increasing need to understand ML models, conceptual methods have been proposed to identify the various stages [24] and interactions [25] in the ML pipeline where VA can help. Five main areas of VA-assisted ML have been identified [24]: data examination and preparation, model understanding, feature and parameter analysis, the learning process, and result analysis.
Focusing on ML model analysis, Krause et al. [26] proposed a VA workflow to help data scientists and domain experts explore, diagnose, and understand the decisions made by a binary classifier. Cashman et al. [52] presented a workflow that used VA principles to help the model selection process. However, the critical model analysis component has not been addressed to understand the black-box nature of DL models. The framework proposed by Spinner et al. [27] incorporated concepts for model refinement, understanding, and diagnosis. However, that framework focuses on the iterative improvement process. It does not describe how VA designers can support understanding and diagnosing an ML/DL model for a given model iteration. Applying XAI methods to the input, output, and model has been identified as a means for DL model analysis [28]. However, in all of the proposed work, the challenges and a formal process of extracting input-output relation rules for complex DL models have not been described. Specifically, how VA designers can support the analysis of each 3W component and their interactions to study unknown input-output relations has not been addressed. Additionally, the user's knowledge of and focus on the 3Ws, which drive the VA system design, have not been discussed. Previous works have also proposed frameworks for understanding input-output relations in visualization [53] and DL processes [16], leveraging their intrinsic symmetries using geometric principles. We build upon these to formalize the study of unknown input-output relations in complex H-H DL models.
With the increasing need for VA for DL analysis and the challenges described in Section 2, existing VA frameworks need to be further elaborated for building systems for emerging H-H problems in DL. In contrast to previous work, our focus is primarily on the model analysis stage. The primary goal is to support the design of more effective VA systems for DL models for H-H problems.

THE TRANSFORM-AND-PERFORM FRAMEWORK
We use a combination of the visualization workflow proposed by van Wijk [45] and the diamond-shaped workflow proposed by Keim et al. [44] as the baseline representation of our Transform-and-Perform (T&P) framework, as shown in Fig. 2. We chose these as a baseline because they are used as a foundation in the literature and have proven to be generic. T&P extends these frameworks to analyze DL models for H-H problems. We specify the four fundamental blocks, i.e., the input, the visualization, the (model behavior) specification to the visualization, and the knowledge gained [45]. A feedback loop is included to update the analysis inputs based on the insights gained [44]. Since T&P focuses on DL model analysis, in contrast to prior work, we move the models to the analysis input, along with the datasets and user analysis goals. Users interact with the VA system to analyze models using the model behavior specification.

Analysis Input
The analysis input serves as an input/output block for all inputs required by the model analysis stage. At a minimum, the user defines the goal(s) of the analysis along with the dataset(s) and model(s) to be analyzed. Specifically, the dataset(s) of interest include the data types and the training, testing, and evaluation datasets. The model(s) include their architecture, learned parameters, and training hyper-parameters such as optimizers, loss functions, and preprocessing rules. The user goal(s) also indicate the focus and could serve as a switch for the supported workflows in the model analysis stage. For example, goals can include out-of-distribution analysis [22] or model compression [21].

Model Analysis
The model analysis stage is the central component of T&P.
The components of the model analysis stage are spread across the areas of visualization, XAI, and ML (Fig. 1). The aim is to highlight the benefit of integrating how experts from these areas analyze a model. To begin, VA system designers can display an image using a default model specification or create intelligent views based on the inputs. The user interacts with the system via the visualization. In the following sections, we describe the model analysis stage. We first structure the study of input-output relations (Section 4.3), which supports the definition of the 3Ws. Next, we detail the model behavior specification by defining the 3Ws (Sections 4.4.1, 4.4.2, and 4.4.3) and how they can be leveraged to support workflows (Section 4.4.4). Finally, we provide a checklist for using T&P to identify workflows and analysis strategies for system design (Section 4.5).

Formalizing Input-Output Relations
In this section, we structure the study of input-output relations to support the definition of the 3Ws.
The regularities arising from the structure of the problem domain impose a corresponding structure in the models being learned and can aid in interpretation. Bronstein et al. [16] recently proposed a unifying mathematical framework to exploit these regularities using geometric principles of invariance and equivariance. Kindlmann et al. [53] and McNutt [54] proposed similar frameworks for studying visualizations. To study models and their inductive biases, we extrapolate the algebraic visualization design framework [53], [54] to reason about model behavior by analyzing intrinsic symmetries/asymmetries between their inputs and outputs (see Fig. 3). Here, α is an operator applied to the input with respect to control variables, such as noise or rotation, and ω is a function that describes how the output changes, which can also vary non-linearly with respect to α.
To support high-dimensional data and complex models, we include inter-sample (φ_X, φ_f, φ_Y) and intra-sample (ψ_X, ψ_f, ψ_Y) selection operators on X, f, and Y, respectively. Inter-sample refers to a set of samples, such as multiple images. Intra-sample refers to within-sample groups, such as a group of pixels in an image. Like Bronstein et al. [16], we aim to understand model behavior by exploring changes in the input and their corresponding output effects. We identify three categories of input-output relations: invariant, equivariant, and generally variant relations, referred to as IEV relations. Invariant relations are those where the output does not change with input changes, i.e., ω = identity (see Figs. 3 and 4). Equivariant relations are those where the output changes in the same way as the input, i.e., ω = α. We further generalize the input-output relations in the literature [16] to include generally variant relations. This set includes all relations where the output changes in a different and unknown manner to the input, i.e., ω is an unknown, possibly non-linear function, func(α). These relations support the study of complex input-output relations occurring in H-H problems.
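The three IEV categories can be probed empirically by comparing a model's response to a transformed input against the identity and equivariance cases. The following is a minimal numpy sketch under simplifying assumptions (exact tolerance-based comparison, toy models standing in for f); real H-H models would need operating ranges and group-level aggregation.

```python
import numpy as np

def iev_relation(f, alpha, X, tol=1e-6):
    """Classify the input-output relation of model f under operator alpha:
    invariant         : f(alpha(X)) == f(X)         (omega = identity)
    equivariant       : f(alpha(X)) == alpha(f(X))  (omega = alpha)
    generally variant : otherwise                   (omega = func(alpha))
    """
    out, out_t = f(X), f(alpha(X))
    if np.allclose(out_t, out, atol=tol):
        return "invariant"
    if np.allclose(out_t, alpha(out), atol=tol):
        return "equivariant"
    return "generally variant"

X = np.arange(16, dtype=float).reshape(4, 4)
flip = lambda x: x[:, ::-1]  # alpha: horizontal flip

# Toy "models": a global mean is invariant to the flip, an element-wise
# scaling is equivariant, and a row-wise cumulative sum is neither.
print(iev_relation(lambda x: np.full_like(x, x.mean()), flip, X))  # invariant
print(iev_relation(lambda x: 2 * x, flip, X))                      # equivariant
print(iev_relation(lambda x: np.cumsum(x, axis=1), flip, X))       # generally variant
```

In practice such probes would be run over inter-sample groups (φ_X) and parameterized families of α, rather than single instances.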
The complexity of model interpretability increases with the IEV relations the models encode. For example, in MNIST [29] classification, R^n → R^1, the single output variable or neuron belongs to a class. The output class should be invariant [56] to input rotation (see Fig. 4a) and other general corruptions such as noise, blur, or fog [57]. On the other hand, the user may expect the output probabilities to be generally variant to some transforms. For example, one may expect that under a horizontal squeeze operation α, i.e., making the digits look like ones, ω is a quasi-linear function mapping probabilities to class one. Similar transformations must be analyzed, for example, α = occlusion [58] on input subspaces (ψ_X) mimicking real-world obstacles.
In high-dimensional classification [30], [39], R^n → R^m with m ≈ n, where each variable or neuron of the high-dimensional output belongs to a class, the complexity increases. In CT brain segmentation [30], outputs should change in the same way as the inputs, i.e., be equivariant (ω = α) when α = {rotation, translation} (see Fig. 4b). However, the output should be invariant to intra-class variations.
In H-H problems, R^n → R^m with m ≈ n, where each variable or neuron of the high-dimensional output is continuous, the challenge of identifying patterns increases significantly. The model learns complex input-output relations, i.e., several possibly unknown α and ω. For example, in brain MRI denoising [59], the model should ideally be invariant to input noise, i.e., ω = identity when α = {X + η, η ∼ N(0, σ) | σ ∈ [0, 1)}. However, it is common for outputs to change in an unknown, possibly non-linear manner with the input (see Fig. 4c); an arbitrary ω may take the form of Eq. 1. To increase the H-H problem complexity further, the output effect function (ω) on applying an operation (α) to the input could vary across output variables. For example, on applying α = noise, the model may have an invariant effect on the background output pixels (ψ_Y = background) but exhibit a relationship as in Eq. 1 on the foreground pixels (ψ_Y = foreground). T&P provides a methodical approach for studying these complex relations by exploring the input-output space via understanding α-ω relations. When studied together with the model space f, α and ω provide critical insights into model behavior and its causes. For example, models with different inductive biases may reduce the effect presented in Eq. 1. Example generic α operators are provided, and application-specific ones [57], [60], [61] can be identified as needed. More details on these modules and their interplay are presented in the following sections.
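The denoising example above can be sketched as a noise sweep that measures the output effect ω separately per intra-sample output group (ψ_Y). This is an illustrative sketch: the mean filter stands in for a trained denoiser, and the foreground/background split is a hypothetical grouping.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x):
    # Stand-in "model": a 3x3 mean filter built from shifted views
    # (not a trained DL denoiser).
    pad = np.pad(x, 1, mode="edge")
    return sum(pad[i:i + x.shape[0], j:j + x.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

X = np.zeros((32, 32))
X[8:24, 8:24] = 1.0                       # synthetic foreground square
fg, bg = X == 1.0, X == 0.0               # intra-sample output groups (psi_Y)

for sigma in (0.0, 0.2, 0.5):
    noisy = X + rng.normal(0.0, sigma, X.shape)   # alpha: additive noise level
    dev = np.abs(denoiser(noisy) - denoiser(X))   # omega: deviation from clean output
    print(f"sigma={sigma:.1f}  fg dev={dev[fg].mean():.3f}  "
          f"bg dev={dev[bg].mean():.3f}")
```

If ω were truly the identity, both group deviations would stay near zero for all σ; the sweep exposes how far the model is from the ideal invariance, and whether the effect differs across output groups.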

Model Behavior Specification
In this section, we detail the components of the model behavior specification. We define the 3Ws and analysis support strategies. We then describe how they work together to support user workflows with the analyze, search, query component.

4.4.1 When
The when component aims to help the user understand the input space X, i.e., the parameters of the model input under which the model behaves a certain way. Fig. 5 details the elements, i.e., aspects, of each 3W component and how their analysis can be supported. As detailed, the when elements include the input datasets and their corresponding ground-truth sets. This component supports the user in studying and comparing inputs (X), subspaces of samples (ψ_X), input control variables (α), groups (φ_X), and instances. Strategies with examples from literature to support this component are provided below. The references are indicative and, in many cases, need to be further extended to H-H problems.
- Analyzing control groups (φ_X) in addition to instances enables the analysis of large datasets [13] (C1). Analysis of data distributions [38] also supports C1.
- Identification and definition of inter-sample groups [47], [62], [63] (φ_X) enable the analysis of high-dimensional inputs (C2, C5, C6) where pre-defined groups are insufficient due to intra-class variation, or where control groups do not exist, as in regression.
- With the analysis of input sub-spaces or intra-sample groups [32] (ψ_X), the user can study input attributes that affect the output (C2, C6).
- Supporting extraction and analysis of the semantics of input and ground-truth samples [41], [64], [65] can support high-dimensional input analysis.
- Identifying input control variables (α) enables the study of model behavior changes and IEV relations (C5, C6).
- With input perturbation via control variables α [35], [43], [51], IEV relations can be studied (C6).

4.4.2 How & Why

The how & why component aims to help the user understand the model space f and the training context, i.e., how and why a given model behavior occurs. Strategies to support this component include:

- The study of intermediate (ψ_f) model attributes (e.g., a layer [17], [42]) enables the study of complex models and IEV relations (C3, C6), for example, to design operations like rotation-equivariant convolutions [56].
- The study of groups of models [19] or their attributes [21] (φ_f) supports comparison (C3).
- Complex model analysis can be supported by extracting deeper insights [20] via the base model elements f. For example, previous works have utilized gradients [67], layer-wise relevance propagation [18], [68], activation maximization [69], and neuron importance [70] methods.
- Methods to dissect why a behavior occurs aid the decoupling of a change in output (ω) due to a) a change in input (α) versus b) a difference in model processing (C5, C6). This differentiation is critical in interpreting H-H models and is minimally explored in the literature.
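The per-output-variable sensitivity mentioned under C4 can be illustrated with a brute-force finite-difference probe; gradient-based methods [67] compute the same quantity far more efficiently for real models. The element-wise toy model below is an illustrative assumption, not from the paper.

```python
import numpy as np

def sensitivity_map(f, X, out_idx, eps=1e-4):
    """Finite-difference sensitivity of one output variable to every input
    variable: a brute-force sketch of a per-pixel sensitivity map."""
    base = f(X).flat[out_idx]
    sens = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        Xp = X.copy()
        Xp[idx] += eps
        sens[idx] = (f(Xp).flat[out_idx] - base) / eps
    return sens

X = np.arange(9, dtype=float).reshape(3, 3)
f = lambda x: x ** 2                  # toy element-wise "model"
S = sensitivity_map(f, X, out_idx=4)  # sensitivity of the center output pixel
print(np.round(S, 2))                 # non-zero only at the center input pixel
```

For an H-H model, each of the m output variables has its own such map, which is exactly the summarization challenge (C4) that the when and how & why strategies above must address together.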

4.4.3 What
The what component aids the user in understanding what the model behavior is, i.e., the output characteristics and performance. The what elements include the model outputs, the respective ground truths, and performance metrics for evaluation. As detailed in Fig. 5, this component supports studying outputs and their characteristics, identifying patterns within (ψ_Y) and across samples (φ_Y), and effects on outputs caused by input changes (ω). Strategies to support this component include:

- Identification of output control groups [22] (φ_Y) enables the analysis of large datasets (C1). Analysis of output distributions [62] also aids C1.
- Perturbation/modification of outputs [71] enables studying the input characteristics behind a given performance (C6).
- Analyzing output subspaces [15], [39] (ψ_Y) supports high-dimensional outputs (C4, C5, C6) to detect local failure patterns. This analysis is instrumental when clear groups do not exist or are insufficient.
- Identification and definition of inter-sample output groups [22], [72] (φ_Y) enable the study of input characteristics (α) that cause a specific performance trend (ω) and aid in studying inductive biases (C5, C6).
- Extraction of output semantics enables the study of IEV relations and high-dimensional outputs (C4, C6).
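Inter-sample output grouping (φ_Y) by a performance metric can be sketched as follows. The data, the two latent error regimes, and the simple mean-split grouping are all illustrative assumptions; a real system would let the user define groups interactively or use richer clustering.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative per-sample outputs and ground truths for 100 samples.
preds = rng.random((100, 8, 8))
scale = rng.choice([0.01, 0.2], 100)[:, None, None]   # two latent error regimes
truths = preds + rng.normal(0.0, scale, preds.shape)

# phi_Y: inter-sample output groups defined by a performance metric (MAE).
mae = np.abs(preds - truths).mean(axis=(1, 2))
groups = np.where(mae > mae.mean(), "high-error", "low-error")

for g in ("low-error", "high-error"):
    m = groups == g
    print(f"{g}: {m.sum()} samples, mean MAE {mae[m].mean():.3f}")
```

Once such groups are identified, the user can look for shared input characteristics (α) of the high-error group, closing the loop back to the when component.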

4.4.4 Analyze, Search, Query
The previous sections described how the 3Ws are characterized to effectively address the challenges of DL problems (see Section 2). This section describes how different workflows in a VA system can be built by combining the 3Ws. We propose the analyze, search, query component (see Fig. 1), which supports user tasks and workflows to study input-output relations (C6) and make sense of complex models via interaction with the 3Ws. The definitions of analyze, search, and query align with those of Munzner [37]. A workflow is a sequence of user tasks. Depending on the user goal, each task can focus on one or more of the 3Ws, about some of which the user may have little knowledge. Ideally, the system must aid in understanding these unknown 3Ws of interest. Which parts are analyzed interactively versus automatically varies based on the system goals, tasks, and system design. Once the required 3Ws are updated, the model behavior specification is communicated to the user via the visualization component.
User Question Generation: We describe a method for VA designers to generate user questions or tasks, which are a mixture of the 3Ws, based on the system goals (see Fig. 6). The 3Ws are represented as a continuous triangular space with the vertices representing when, how & why, and what. The position of a question within the triangle represents its focus. Being at a vertex indicates that the question focuses on the single 3W component at that position and leads to the most straightforward questions. Each user question involves analyze, search, and query tasks. Commonly, the user would query known components and start with higher-level analyze or search tasks on unknown components from the 3Ws to understand them. Unknowns increase as one moves away from the vertices and edges toward the center of the triangle, complicating design. We identify seven main focus areas of user questions, centered around one, two, or all of the 3Ws.
When Centric tasks are centered around studying and comparing datasets X. This category also includes the study of input characteristics contributing to specific and known outputs and model behavior.
How/Why Centric tasks are centered around the study and comparison of models f. Such questions might be posed in the context of known inputs and outputs. What Centric tasks focus on the study and comparison of model outputs Y. These can be analyzed for fixed inputs and models to understand failure origins.
When & How/Why Centric tasks focus on studying input characteristics and their impact on the model internals f. This task may be posed in the context of the outputs. How/Why & What Centric tasks focus on understanding how and why the model generates specific outputs, i.e., how f affects Y. This analysis can be done in isolation or with known input constraints.
When & What Centric tasks focus on understanding the input-output relations, i.e., studying how X impacts Y, with or without known model space attributes. When, How/Why & What Centric: the user has minimal understanding of all 3Ws, and tasks involve exploring the unknown X, f, Y and their relations (α-f-ω). The user would likely perform multiple explorations, increasing the support required from the VA system.
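The triangular question space of Fig. 6 can be sketched numerically: a question's emphasis on each of the 3Ws maps to a point via barycentric coordinates. The vertex positions and the weight encoding are illustrative assumptions, not part of T&P itself.

```python
import numpy as np

# Triangle vertices for the 3Ws (positions chosen arbitrarily for illustration).
V = {"when": np.array([0.0, 0.0]),
     "how_why": np.array([1.0, 0.0]),
     "what": np.array([0.5, 0.866])}

def question_position(when, how_why, what):
    """Map a question's focus weights onto the 3W triangle via barycentric
    coordinates. Points near a vertex focus on one component; points near
    the center mix all 3Ws and imply more unknowns for the VA design."""
    w = np.array([when, how_why, what], dtype=float)
    w /= w.sum()
    return w[0] * V["when"] + w[1] * V["how_why"] + w[2] * V["what"]

print(question_position(1, 0, 0))  # at the 'when' vertex: single-component question
print(question_position(1, 1, 1))  # centroid: all 3Ws equally unknown
```

Such a mapping could let designers plot the questions a system supports and spot uncovered focus areas visually.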
Similar to the triplet model of McNutt [54], each user question focuses on understanding one or more of the 3Ws by fixing and varying the other components. Table 1 shows how example questions can be placed in the context of the focus areas and structured with the triplet model using the 3Ws. Further, example VA systems from the literature are provided for each focus area. For example, a task can be to discover, i.e., analyze, the noise level at which a classification model [75] breaks. As per T&P, the focus is on understanding how the input transform a affects the output change v for a fixed model. When mapped as per Fig. 6, it is evident that possible workflows focusing on the model are missing. Here, model correctness can also be defined by the part of the image the model looks at to make decisions. Conversely, it is also possible that the user question is very high-level, i.e., close to the center of the triangle. With directional support from the examples in Table 1, the designer can break down the high-level question into multiple, more specific ones. The following section demonstrates this mapping via T&P to help define and identify workflows and analysis strategies for system design.
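The fix-vary-understand triplet can be made concrete with a small data structure. Below is a minimal Python sketch of this idea; all names are hypothetical (T&P is a conceptual framework and does not prescribe an implementation):

```python
from dataclasses import dataclass, field

# A user question in the triplet form used by T&P: fix some components,
# vary others, and understand the effect on the rest. Components are the
# 3Ws: "when" (inputs), "how_why" (model), "what" (outputs).
@dataclass
class UserQuestion:
    fix: set = field(default_factory=set)
    vary: set = field(default_factory=set)
    understand: set = field(default_factory=set)

    def focus(self):
        # The unknown components (varied or to-be-understood) determine the
        # focus area in the 3W triangle; more unknowns lie closer to its center.
        return self.vary | self.understand

# Example: "at which noise level does the classification model break?"
# -> fix the model, vary the input (noise), understand the output change.
q = UserQuestion(fix={"how_why"}, vary={"when"}, understand={"what"})
print(sorted(q.focus()))
```

Such a structure lets a designer enumerate questions and check which focus areas of Fig. 6 remain uncovered.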

Concept to Usage: A Checklist
We describe a checklist to use T&P for identifying workflows and analysis strategies for system design. Existing VA systems can be mapped per stage below for understanding or improvement. VA designers can perform some steps in parallel or go back to any stage for iterative improvement.
Define input: The analysis inputs, i.e., the system goals, datasets, and supported models, must be defined (see Section 4.1). System-specific inputs should be provided.
Map & verify goal: Each system goal must be described in terms of the 3Ws. Designers must re-verify the requirement with users if any of the 3Ws are missing, and the goal from the define input stage may be redefined.
Define & formalize questions: The system goal must be broken down into possible questions. Each question should be mapped to the focus areas of the 3Ws (see Fig. 6) and structured into the fix-vary-understand model shown in Table 1. Based on this, if required focus areas are missing, identify possible questions with the support of Table 1 and reassess their need with users.
Similarly, if the questions are high-level and close to the central focus area, they should be broken down into smaller questions. The designer can perform this stage with or after the identify support mechanisms stage.
Identify support mechanisms: The 3W elements required per question should be identified (see Fig. 5). Designers should assess the input-output (IEV) relations of interest (see Section 4.3). The analysis to be supported must be identified by referring to the thought seeds in Fig. 5. Strategies per relevant challenge can be considered (see Sections 4.4.1, 4.4.2, and 4.4.3). System inputs should be refined based on this stage. Further, updates to the supported questions may also be required.
Design & verify: After using T&P to identify questions, analysis strategies, and examples in Sections 4.4.1, 4.4.2, and 4.4.3 and Table 1, designers can use these to focus their research process and then design their systems. The prototype can be tested with users and iteratively improved with this checklist. We show below how this checklist can aid in design. For simplicity, we choose a CNN image classification model (R^n → R^1) with the goal of analyzing its robustness to data corruptions like noise (define input). The goal implies the need to analyze perturbations on the input, their effects on the model activations, and performance, i.e., all 3Ws (map & verify goal). The designer could have pre-defined sub-questions, such as searching, i.e., exploring, model performance and activation changes under data perturbations. Formalizing these questions as per T&P (define & formalize), they fall under the corresponding focus areas. Exploring model performance under data perturbations can be further structured into understanding the input transform a and output change v for a fixed model. If questions are still complex or unclear, thinking about the aspects of the 3Ws can help formalize further (identify support mechanisms). For example, with Fig.
5, the need for an input control-variable (a) selection, an input perturbation to apply a, and a viewer to understand the output changes (v) are identified. Further, it is noted that the identification of control variables (a) and inter-instance input groups (f_X) is also essential to consider for image inputs (C2). The designer can choose not to focus on specific aspects of support; pre-defined class groups (f_X) can be used rather than supporting group identification. Based on these insights, the initial high-level question can be further broken down into simpler questions: a) understanding the input characteristics of a group that has a specific fixed accuracy, and b) understanding the invariance of accuracy when varying input control variables such as noise. The above only shows an example sequence of the design process with T&P. With these specific questions, designers can focus their research to design and build their systems (design & verify), detailed further in the following section. Multiple smaller questions can be supported for complex workflows, enabling the user to actively interact with the model analysis block until the goal is reached. Actionable insights identified via the VA system relate to updating or correcting the input datasets, models, or their training parameters to repeat the analysis. Alternatively, the user may have reached the desired solution and stops the analysis. Section 4 described T&P for complex DL model analysis such as H-H problems. Section 5 illustrates T&P's applications.
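The noise-robustness walkthrough above can be sketched end-to-end in a few lines. The following numpy sketch varies the control variable a = noise level and observes the resulting accuracy (v); `toy_model` is a hypothetical stand-in for a trained CNN, and the data are synthetic:

```python
import numpy as np

def toy_model(x):
    # Stand-in classifier (hypothetical): predicts class 1 when the mean
    # intensity exceeds 0.5, as a placeholder for a real CNN.
    return int(x.mean() > 0.5)

rng = np.random.default_rng(0)
images = rng.uniform(0.6, 1.0, size=(20, 8, 8))  # all "class 1" by construction
labels = [1] * len(images)

# a = noise level (the input control variable); v = the accuracy change.
accuracies = []
for a in [0.0, 0.2, 0.4, 0.8]:
    noisy = images + rng.normal(0.0, a, size=images.shape)
    acc = float(np.mean([toy_model(x) == y for x, y in zip(noisy, labels)]))
    accuracies.append(acc)
    print(f"noise={a:.1f} accuracy={acc:.2f}")
```

A VA system would expose the same loop interactively: the slider selects a, and the viewer plots the accuracy trend to locate the breakpoint.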

USE CASES
In this section, we first present two cases for understanding and improving existing VA systems with T&P. We then demonstrate how T&P facilitates VA system design in a real-world H-H application.

Existing VA Systems Through the Lens of T&P
We highlight how T&P can be used to understand and uncover potential research gaps in existing VA systems. Although they do not reflect all the challenges of H-H problems, we illustrate the possibilities of T&P with systems that focus on classification (R^n → R^1) and GANs (R^n → R^m → R^1). Cases are selected based on their support for the 3Ws and their variation in data and models.
ActiVis: Kahng et al. proposed ActiVis [13] for interpreting industry-scale DL classification models. ActiVis enables exploration through instances, subsets, and model activations to identify failure causes. We structure the understanding of ActiVis with T&P (see Section 4.5). The system supports model exploration, demonstrated with an industry-scale ranking model for content recommendations by analyzing a large set of numerical features (define input). With the analysis of user-defined input subsets, model activations, and classification performance, all 3Ws are supported (map & verify goal). The various ActiVis components can be structured by mapping the support mechanisms. Specifically, the Computational Graph Overview of the model architecture allows layer selections (c_f). The selected layer activations can be studied via a t-SNE Projected View or a Matrix View of the activations per neuron/class with the neuron activation view. The Instance Selection Panel displays the classification results of instances. The system allows the definition of instance subsets (f_X) to handle large datasets. ActiVis mainly supports the following high-level formalized questions: (Q1) understanding model activations across inputs, (Q2) understanding the performance across inputs, and (Q3) understanding the failure causes.
With this mapping to T&P, it is evident that support for understanding the input space is limited. Given that the model inputs are high-dimensional, it is challenging for the user to know which subsets are useful to analyze. By expecting users to define subsets, ActiVis assumes a certain level of user knowledge about the when component. As per T&P (identify support mechanisms), users unaware of what forms an instance subset (f_X) could benefit from additional means to explore input groups. This change also impacts the questions: understanding possible input groups based on their semantics (Q4), performance values (Q5), or model activations (Q6) can now be supported. With the suggestions from Table 1, analysis of input semantics can be supported via a Facets [38]-like interface (design & verify). The ActiVis authors identify input group exploration as future work based on user feedback. With T&P, this future work would have been identified as a potential gap and could have been verified with users early in the design process. Further, for the example ranking model with numeric input features, it could be helpful to study output probability trends (v) by changing the value of a = feature [35], [76]. This analysis can support identifying sub-groups of data within which the model behavior is invariant to input changes. In contrast to previous frameworks [24] that provide a high-level overview of ActiVis, T&P focuses on an in-depth understanding to identify extensions.
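A probability-trend analysis of the kind suggested here can be sketched as follows. `score` is a hypothetical stand-in for the ranking model, and the feature names and weights are invented for illustration:

```python
import math

def score(features):
    # Hypothetical ranking model: a logistic score over two numeric features.
    w = {"age": 0.08, "clicks": 0.5}
    z = sum(w[k] * v for k, v in features.items()) - 4.0
    return 1.0 / (1.0 + math.exp(-z))

base = {"age": 30, "clicks": 3}
# a = the value of one numeric feature; v = the model's output probability.
trend = [(a, score({**base, "age": a})) for a in range(20, 61, 10)]
for a, p in trend:
    print(f"age={a} p={p:.3f}")
# For this toy model the trend is monotone in the swept feature, which is
# exactly the kind of pattern a trend viewer (v) would reveal.
```

Plotted as a line graph, such a sweep lets the user spot feature ranges within which the model's behavior is invariant.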
GANViz: Wang et al. proposed GANViz [49] to understand the adversarial process of GANs. Formalizing the understanding with T&P, GANViz focuses on understanding DCGANs [77] for images (define input). In DCGANs, the Generator (G) network generates fake data from a random noise vector (R^n → R^m, with n < m) and uses them to fool the Discriminator (D) network (R^m → R^1). Defining & formalizing questions as per T&P, GANViz mainly supports (Q1) understanding performance during training, (Q2) interpretation of D's decisions, (Q3) interpretation of G's output distribution, and (Q4) comparative behavior analysis of selected samples in D. To support these questions, GANViz contains a metric view with line graphs of various performance metrics over time (Q1). A probability view provides sorted probability values and corresponding summarized thumbnails of real/fake samples (Q1). To further explore the quality of the generator, a distribution view (Q3) contains t-SNE feature summarizations of the real/fake images. Additionally, users can perform selections (D's f_X) in this view to obtain statistics of D's real/fake prediction probabilities. The TensorPath and the activation comparison view (Q2, Q4) summarize D's architecture and analyze and compare the essential features used to make real-versus-fake classifications with a visualization similar to parallel coordinates. Multiple pairs of activations across sources or timestamps can be compared. The system groups samples by true-positive/negative (f_Y), false-positive/negative (f_Y), and real/fake samples (f_X).
By mapping these support mechanisms to T&P, we note limited support for analyzing G's input space and custom subgroups within G's outputs (f_Y). This translates to understanding G's input latent space (Q5) and analyzing outputs per user-defined group (Q6). For Q5, since G's inputs are random vectors [77], they cannot be interpreted directly, contrary to ActiVis. With T&P, analyzing how these input vectors and changes to them (a) impact the output (v) can give insight into this. For example, the designer can enable latent-space understanding by supporting a = numeric perturbation or a = interpolation between samples and their corresponding output changes. This analysis can be supported by extending existing work [51] to cases where input characteristics (e.g., a = age) are unavailable beforehand (design & verify), for example, by supporting changes in the input magnitude and user-identified features and studying their output effects to identify and map input characteristics. Image similarity metrics such as the structural similarity index (SSIM) can further be used for effective output comparison. User feedback on GANViz also highlighted such a question on G's input-output mapping. This powerful analysis can allow users to sample the input space as required for dataset creation. Analyzing output subgroups across real and fake samples (Q6) can help the end user identify and compare similar images. For example, this analysis can support the study of the erratic behavior of bold/italic digits discussed by the authors. These potential gaps were identified with T&P's checklist in Section 4.5.
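The suggested a = interpolation analysis can be sketched with numpy. `generator` is a hypothetical stand-in for G (a real DCGAN generator would be a trained network):

```python
import numpy as np

def generator(z):
    # Stand-in for G: maps a latent vector to a flattened "image".
    # A trained DCGAN generator would replace this toy mapping.
    return np.tanh(np.outer(z, z)).ravel()

rng = np.random.default_rng(1)
z0, z1 = rng.normal(size=8), rng.normal(size=8)

# a = interpolation factor between two latent vectors; v = output change,
# measured here as the distance from the first generated sample.
changes = []
for a in np.linspace(0.0, 1.0, 5):
    z = (1 - a) * z0 + a * z1
    v = float(np.linalg.norm(generator(z) - generator(z0)))
    changes.append(v)
    print(f"a={a:.2f} output_change={v:.3f}")
```

Walking the slider a from 0 to 1 and watching v (or the images themselves) lets the user map which latent directions correspond to which output characteristics.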
If T&P is used during design, such gaps can be identified early on and discussed with users to verify their need. Although some potential gaps for GANViz and ActiVis identified by T&P were also indicated by users in their respective user studies, the practical benefit of the new features remains to be tested and is out of scope for this paper.

VA System for the fastMRI Use Case
Since few VA systems currently focus on H-H problems, a new one was created. This section illustrates the potential of T&P to support the design of a new VA system for a real-world H-H problem.
Referencing T&P's checklist in Section 4.5, we, the VA designers (referred to as the designer henceforth), first define the support for DL model exploration for developers, from 1) early model development to 2) advanced stages, for 2D image-to-image translation applications (define input). To operationalize T&P, we select a less explored real-world medical image formation case, i.e., image generation from raw data. Taking fewer measurements to acquire MRIs reduces acquisition time and benefits costs and patient comfort, but reduces the resulting MRI quality. The chosen fastMRI [55] case uses DL to reconstruct a high-quality, artifact- and noise-free MRI image from a low-quality one (see Fig. 7) acquired with an accelerated acquisition process. To illustrate real-world benefits, we utilize the baseline UNet [33] model (M_1) and hyper-parameters made available by the fastMRI organizers [55] for the brain multi-coil dataset. Due to GPU memory constraints, the number of features of the first UNet convolution layer is reduced from the defined 256 to 64. All other training parameters are maintained. Identification of Actionable Insights: Here, the designer starts with the first goal of early-stage model exploration, i.e., understanding model learnings across inputs.
The designer identifies the focus on the how & why and when components with the map & verify goal stage (see Section 4.5). Although the designer observes that the what component is missing from this mapping, the assumption is that it is not very important in the early model development stage. Hence, this gap is ignored. Next, as per T&P, to define & formalize questions with the support of Table 1, the goal is broken down into understanding model learnings for a given set of inputs (Q1) and understanding the input characteristics that are treated similarly by the model (Q2). Next, the designer identifies support mechanisms with the help of Fig. 5. The designer marks the input datasets with their ground truths, the trained model, and the training context as the inputs required by the system (define input). Further, from the analysis strategies in Fig. 5, the designer includes input instance group definition (f_X), input pixel group definition (c_X), model layer selection (c_f), and activation pixel group definition (c_f). Based on T&P's examples provided per strategy (design & verify), a t-SNE scatter plot is included to analyze activations [40]. Although more sophisticated image concept analysis exists [32], it is challenging to extend to medical images and is out of scope for this paper. Hence, simple image viewers are included. An initial prototype S_1 is created and deployed for verification. Using this system on M_1 (see Fig. 8a), the model developer, i.e., the user, loads the inputs and selects the UNet bottleneck layer activations of a patient's MRI images with different imaging parameters (f_X). Unusually, three distinct clusters are observed in the t-SNE activation embedding (see Fig.
8a), corresponding to the lower, mid, and upper brain slices. The user wonders whether the activation differences are due to variations in the input intensities or whether this is a model problem. The designer is thereby informed of the wrongly made initial assumptions, the need for output analysis, the need for studying input-output (a-v) relations, and the need for more support for studying input differences.
Based on this insight, the designer redefines the goal (define input) as understanding and verifying model learnings across inputs. With the support of Table 1, the user questions are updated to include understanding input-output relations (Q3) and input distribution analysis (Q4). With T&P's checklist (identify support mechanisms), the designer adds functionality to study input distributions via histograms and to define custom input control groups (a) to study their effect on performance (v), such as the mean square error (MSE), via line graphs (design & verify). The updated system S_2 is sent to the user once again for analysis of M_1 (see Fig. 8b). To verify model invariance to brain slice location for a single patient, the user defines a = slice position and studies its effect on the MSE (v). The user observes an unusually high MSE (v) at the top brain slice. Further, a noisy ground truth (GT) is seen on visual inspection of the images (see Fig. 8b), indicating a possible problem with the inputs. Since the training context is available as meta-data in the VA system, the user notes that M_1 normalizes the input images per slice, which destroys the original image distributions.
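The normalization problem found here is easy to reproduce on toy data: per-slice normalization rescales each slice independently, so a genuinely dim slice is made as bright as the others, while scan-level normalization preserves relative intensities. A minimal numpy sketch (synthetic data, not fastMRI):

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy "scan": four slices, where the top (last) slice is genuinely dimmer.
scan = np.stack([rng.uniform(0.5, 1.0, (8, 8)) for _ in range(3)]
                + [rng.uniform(0.0, 0.2, (8, 8))])

# Per-slice normalization: each slice rescaled to [0, 1] on its own.
per_slice = np.stack([(s - s.min()) / (s.max() - s.min()) for s in scan])
# Scan-level normalization: one rescaling for the whole volume.
per_scan = (scan - scan.min()) / (scan.max() - scan.min())

# Per-slice normalization makes the dim top slice as bright as the rest,
# destroying the original distribution; scan-level keeps it dim.
print("top-slice mean, per-slice :", round(float(per_slice[-1].mean()), 2))
print("top-slice mean, scan-level:", round(float(per_scan[-1].mean()), 2))
```

The histograms added to S_2 surface exactly this distortion, which is what lets the user trace the high top-slice MSE back to the preprocessing step.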
The user corrects this to a scan-level normalization and trains a new model M_2, keeping the other training parameters of M_1 constant. Using the existing system on M_2, the user observes that the original distributions are maintained with scan normalization (see Fig. 8c) and that slices across most brain locations are processed similarly (see Fig. 9). The change (v) in performance across slices (a) is as expected. Correcting the normalization method improves the structural similarity index (SSIM)/peak signal-to-noise ratio (PSNR) on the validation set from 0.89/32.4 in M_1 to 0.90/33.7 in M_2. The top brain slice is still treated differently (see Fig. 9), likely due to its visual differences compared to the rest. However, verifying whether this is caused by model problems versus input differences requires VA methods for high-dimensional non-invariant data, currently unsolved in the literature. The user continues the analysis until a desired model is created.

Fig. 8. Exploratory user workflow for M_1. Stage a: the user selects the bottleneck layer [33] and observes that the model processes slices differently based on location (upper (p), mid (q), and lower (r) brain clusters). Stage b: on analysis, the user spots unusual model performance (MSE, i.e., the mean square error between the predicted and ground-truth image) and ground truth (GT) images. Stage c: on further analysis, the user narrows the problem down to the normalization method used, i.e., per-slice normalization, and corrects it to scan normalization, preserving image distributions.

Fig. 9. Verifying M_2 using scan normalization. Via the bottleneck layer activations, the user observes that most slices are treated similarly (cluster p), except for the top brain slice (cluster q). Performance (MSE versus slice location) improves toward the top of the brain, as expected.
Understanding Model Break Points: For the second goal of exploring advanced models, the designer considers users with a relatively stable model M_n, with the central question of verifying model robustness on real-world data. This translates to the analysis of perturbed inputs that simulate real-world data, their effects on the model activations, and performance, i.e., all 3Ws (map & verify goal). With the support of Table 1 (define & formalize questions), this is broken down into possible questions: understanding input parameters that impact performance (Q5); understanding input parameters processed similarly by the model (Q2); identifying model breakpoints for high-dimensional outputs (Q6); and understanding or parameterizing inputs (Q7). Q2 is already supported and Q7 is not the designer's focus, hence Q5 and Q6 are considered. Referring to the strategies for each challenge in Sections 4.4.1, 4.4.2, and 4.4.3 to identify support mechanisms, the designer identifies the need to define and explore control variables (a) that affect performance. An analysis of the output sub-spaces (c_Y) is also included to analyze complex failures. Support for analysis, performance trends, and the various selection operators required is already present in S_2. The designer includes histograms to analyze input parameter distributions and output metric relations (design & verify). Output image and error-map viewers are included for the analysis of local failures. Additionally, an input slider is included to visually assess the gradual breakpoint [35]. Finally, a new system S_3 is created.
With S_3, the user explores the effect of acceleration, noise, and slice location (multiple a) on the MSE in M_n (see Fig. 10). All critical input control variables, such as acceleration, are identified by analyzing their effect on performance (a-v). Further exploring the model breakpoint across a = acceleration with the system S_3, the user observes that the model activations close to 1x and 12x acceleration are different (see Figs. 11a and 11d versus 11b and 11c). Observing the MSE trend in Fig. 11e, the user notes that M_n breaks when the acceleration approaches 1x and declines after 10x. Next, the user would like to verify the reconstruction quality for scans with a pathology (e.g., a tumor). The user synthetically creates images with a tumor offline. Since it is not the paper's focus, tumors are simulated with a Gaussian blob of a given size and location. These created images and their metadata are loaded into S_3 for analysis. The user observes that the model suppresses the input blobs in its predictions (see Fig. 12). As seen, the model only fails in the blob region, i.e., a local failure. To support this analysis, the SSIM can be studied. Further analyzing v = change in SSIM with a = blob reveals that the model is more likely to fail for large blobs and high accelerations. Such failures can be catastrophic, so the user returns to fix the model.
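The synthetic-pathology check can be sketched as follows: insert a Gaussian blob into a toy image and measure the similarity between a prediction that suppresses the blob and a target that keeps it. The blob parameters are hypothetical, and for brevity a single-window SSIM is used instead of the usual windowed SSIM (e.g., skimage's):

```python
import numpy as np

def gaussian_blob(shape, center, sigma, amplitude=1.0):
    # Synthetic "tumor": a Gaussian bump at the given center.
    y, x = np.indices(shape)
    d2 = (y - center[0]) ** 2 + (x - center[1]) ** 2
    return amplitude * np.exp(-d2 / (2 * sigma ** 2))

def global_ssim(a, b, c1=1e-4, c2=9e-4):
    # Single-window SSIM over the whole image, a simplification of the
    # usual 11x11 windowed SSIM; c1, c2 follow the standard constants
    # for a data range of 1.0.
    ma, mb = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - ma) * (b - mb)).mean()
    return float(((2 * ma * mb + c1) * (2 * cov + c2))
                 / ((ma ** 2 + mb ** 2 + c1) * (va + vb + c2)))

base = np.full((32, 32), 0.3)
ssims = []
for sigma in [1.0, 3.0, 5.0]:          # a = blob size
    target = base + gaussian_blob(base.shape, (16, 16), sigma)
    prediction = base                  # the model suppresses the blob entirely
    ssims.append(global_ssim(prediction, target))
    print(f"blob sigma={sigma} ssim={ssims[-1]:.3f}")
# SSIM degrades as the blob grows, flagging the local failure that a
# global MSE alone might understate.
```

This is the kind of study v = change in SSIM with a = blob supports: the larger the simulated pathology, the stronger the similarity drop.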
In this section, we highlighted how T&P supports VA system design. Using the H-H case, we showed how T&P can help design a system that yields non-trivial actionable insights. The use case is meant as an illustration; building a complex VA system is out of scope. Further, perturbations such as various types of noise, motion, and blurring [57] (a) can be studied generically across multiple applications, such as cross-modality translation (MR-CT [7]), super-resolution [78], or image registration [8].

DISCUSSION & CONCLUSION
We present a conceptual framework, Transform-and-Perform (T&P), that extends existing frameworks to emerging high-dimensional-to-high-dimensional (H-H) problems. The outputs in H-H problems are high-dimensional and continuous, and have complex, unknown input-output relations. These unique characteristics complicate interpretation. T&P facilitates the design of effective VA systems to analyze DL models for such problems by providing a checklist to identify and structure workflows and analysis strategies during the design process. T&P enables analysis with the 3Ws (when, how & why, what) of model behavior. With the structured means of studying input-output relations (IEV), the question generation method, and the analysis strategies, better support for analyzing the 3Ws is provided.
With ActiVis and GANViz, we show how T&P supports understanding and identifying potential improvements in existing systems. With a practical MRI image-to-image use case, we highlight how T&P supports new VA system design and indicates research opportunities. Currently, limited systems exist for H-H problems, and with more developments, a complete evaluation of T&P with VA designers will become possible. As highlighted, many challenges still need to be addressed to further study H-H models. The study of non-invariant relations is challenging with existing methods of analyzing high-dimensional data [14], [79]. Another significant challenge with H-H problems is supporting the study of patterns in sub-spaces of inputs and outputs. This support is necessary since the intra-instance output variables can be processed differently by the model.
T&P focuses on continuous scalar spaces (R). Specifically, text and categorical data are out of scope. Since (high-dimensional) classification models still have continuous outputs, i.e., probabilities, they are in scope. The limited support for non-continuous data arises from the need for special support to study input-output (a-v) relations. Input transformations in such spaces are discrete, making change trends challenging to study. Extensions of T&P to cover such inputs may be feasible but are out of scope for this manuscript. T&P focuses on explainable DL beyond classification for H-H problems. We primarily discuss imaging models in our cases. Additionally, although T&P supports dynamic model changes, it can be further elaborated for more interactive, progressive analysis and updates. In conclusion, although T&P can be extended for broader support, we show its value in supporting VA system design and revealing research opportunities for complex H-H problems.

Fig. 1. An overview of the Transform-and-Perform (T&P) framework. The models, datasets, and analysis goals are the inputs. The main model analysis component outlines the process of interacting with the models via a visualization (Vis) front end by specifying/exploring model behavior. The user can analyze, search, and query the when, how & why, and what, i.e., the 3Ws of model behavior. This analysis is done until the user gains knowledge or actionable insights to update or correct the analysis input. Challenges (C1-C6) in DL model analysis addressed by each of the 3Ws and their joint analysis are shown. (Boxes: input/output entities; circles: processes; icons distinguish human and machine actors.)

Fig. 2. Adaptation of existing frameworks for the proposed framework: (a) the framework of Keim et al. [44] as represented by Sacha et al. [24]; (b) van Wijk [45]; (c) the T&P framework; (c.1) a preview of the main model behavior specification unit of T&P, elaborated in Fig. 1. dS/dt and dK/dt in (b) represent the change in the specification (S) and knowledge (K) over time (t), respectively.
(Vis) component to communicate with the model behavior specification component or change visualization parameters. The model behavior specification component specifies the model behavior the user wants to analyze. Together with the visualization component, this enables model understanding. The model behavior specification includes all the aspects of model behavior, i.e., the 3Ws: 1) when a model behavior occurs, focusing on the inputs and their ground truth; 2) how & why a model behavior occurs, focusing on the models and their training parameters; and 3) what the model behavior is, focusing on the model outputs and performance. Designers can define the aspects of and the support for the 3Ws. With the analyze, search, query component, designers can define the workflows across the 3Ws to support analysis. Specifically, the analyze, search, query component supports the study of the 3Ws and their unknown relations. For example, designers can support a workflow to study input data characteristics [32].

Fig. 3. Structuring the study of model behavior via intrinsic symmetries or asymmetries between inputs and outputs. a is a transform on the input, and v is the corresponding transform on the output.

Fig. 4.

Fig. 5. Defining the when, how & why, and what components, i.e., the 3Ws, and thought seeds on how their analysis can be supported.

4.4.2
The how & why component aids the user in understanding the model space, i.e., how the model behaves and why this happens under input-output constraints. The how & why elements include the model architectures, parameters, and training context, i.e., hyper-parameters and data preprocessing strategies. This component supports users in studying and comparing model and training parameters, features, internal behavior, and the relations and trends of model attributes (see Fig. 5). Strategies to support this component include analyzing the training context [66] and model attributes to understand model internals (C3).

Fig. 6. Generation of user questions using the 3Ws: the when, how & why, and what components. The position of a task within the triangle (right) represents the focus of the user task (left).

Fig. 7. Low-quality MRI inputs to the DL model acquired with acceleration, and their corresponding expected high-quality ideal output.

Fig. 10. Identifying important input control variables (a), such as acceleration and noise level, that impact model (M_n) performance.

Fig. 11.

Fig. 12. Understanding model performance with a = Gaussian blob representing a pathology. The blob in the input (left) is suppressed in the prediction (middle) when compared to the target (right). This highlights the model's sensitivity to pathologies and local output failures.

TABLE 1
Examples of user questions and literature from each analysis focus area. Practical examples are mapped based on their overall analysis focus area. The table is meant to be indicative and to aid VA designers in breaking up user goals into simpler tasks via the 3Ws.