Explainable Federated Learning: A Lifecycle Dashboard for Industrial Settings

As the adoption of federated learning (FL) in the manufacturing industry grows and systems get increasingly complex, a need to inspect their behavior arises. Stakeholders of the FL process want a more transparent system to understand the current state and analyze how its performance changed over time. However, current representation approaches are often not designed for industrial applications and do not cover the entire FL model lifecycle. We propose the lifecycle dashboard, which considers the different requirements and perspectives of industrial stakeholders by visualizing information from the FL server. In addition, our representation approach is generic enough to be applied to different use cases and industries. We evaluate the lifecycle dashboard in a semistructured expert interview, show improvements in the understandability of FL systems, and discuss possible use cases in the industry.

The increasing adoption of data-driven strategies in the manufacturing industry has led to an abundance of data waiting to be analyzed for competitive advantages. 1 Smart manufacturing allows companies to act more efficiently, reduce production errors, and anticipate upcoming events, such as disruptions in the supply chain. 2 One common approach for analyzing large amounts of data is machine learning (ML). ML can be implemented in several ways to utilize industrial data, which is often distributed and collected on edge devices and machines. First, ML models can be trained centrally on aggregated training data from machines operating under similar conditions. However, using a centralized cloud infrastructure also has several drawbacks. The high latency between the edge devices and the cloud is challenging for applications requiring fast response times. Data collected from edge devices also often include business-sensitive information that must adhere to privacy and legal guidelines, making data transfers infeasible. Additionally, as centralized infrastructure is expensive to build, it would also increase the dependence on third-party providers. 3 Alternatively, ML models could be trained directly on edge devices. While this would reduce latency, it comes with several challenges, including processing large amounts of data on the restricted resources available on edge devices. 4 In addition, less training data are available on single devices, resulting in poor-performing ML models.
This problem has led to the adoption of federated learning (FL) systems, as proposed by McMahan et al. 5 After registering at a central FL server, all participating clients receive an ML model with randomized parameters. Clients use their local data to train and return the updated model to the FL server. Afterward, the server applies an FL algorithm, for example, FedAvg, 5 that averages all gathered model parameters and sends the new model back to the clients. The number of these exchanges, also known as communication rounds, can be predefined or dependent on the required model's performance. 6 FL has been used in the manufacturing industry to improve anomaly detection and condition monitoring, facilitated by various data sources, such as edge devices equipped with Internet of Things (IoT) sensors. 7 However, the adoption of industrial FL is slowed by the increasing complexity of applications and their lack of transparency. Explainable Artificial Intelligence systems try to solve this problem by making their behavior and decisions more understandable to users. 8 Explainable FL would provide stakeholders, including the service provider and consumers, detailed information about the system that can be used to optimize services like predictive maintenance and evaluate the global model's performance over time. As training data are not shared with the central server, data visualizations are limited and can only access aggregated and insensitive information from clients.
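To make the aggregation step concrete, the following minimal sketch shows a weighted parameter average in the spirit of FedAvg; the array-based model representation, the sample-count weighting, and the function name are illustrative assumptions rather than the actual server implementation.

```python
import numpy as np

def fed_avg(client_params, client_sizes):
    """Weighted average of client model parameters (FedAvg-style).

    client_params: one list of numpy arrays per client (that client's layer weights)
    client_sizes:  number of local training samples per client, used as the weight
    """
    total = sum(client_sizes)
    weights = [n / total for n in client_sizes]
    # Accumulate into zero arrays shaped like the first client's layers.
    averaged = [np.zeros_like(layer) for layer in client_params[0]]
    for params, weight in zip(client_params, weights):
        for i, layer in enumerate(params):
            averaged[i] += weight * layer
    return averaged

# Example: three clients, each with a single 2x2 weight matrix.
clients = [[np.full((2, 2), v)] for v in (1.0, 2.0, 3.0)]
print(fed_avg(clients, client_sizes=[100, 200, 700])[0])  # skewed toward the largest client
```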
This article addresses how the status and progress of FL systems can be conveyed to different industrial stakeholders. Our main contribution is to demonstrate an explainable representation approach that improves the transparency and understandability of FL systems by visualizing the lifecycle of FL models.

BACKGROUND OF SMART MANUFACTURING
Besides supplying their customers with physical goods, companies making use of the digital transformation also provide services and offer their customers integrated solutions facilitated by smart products and IoT technologies. 9 In an industrial context, the digital representations of these devices and machines are called assets, 7 which generate and collect data that can be used to train local ML models providing services like condition monitoring and predictive maintenance. Additionally, original equipment manufacturers (OEMs) ensure that the machines they sold operate in optimal working conditions to prolong their health and lifespan. However, the available training data for single assets are limited, resulting in poor predictions. Better performance could be achieved by combining training data from different customers, who often use the same machines, purchased from a single OEM, in similar application areas; however, business-sensitive data and network conditions prohibit this. Therefore, OEMs use FL to create a shared model for machines operating in similar environments, leveraging unique customer information while guaranteeing data privacy. To create a representation approach that makes the system more transparent and allows the evaluation of FL, the stakeholders of this process first need to be identified. According to Beverungen et al., 10 the two main actors of smart service systems are service providers and service consumers. In the industrial FL system, the OEM takes on the former role while its customers take on the latter. We also consider their requirements and knowledge of FL to identify three stakeholder roles in industrial FL systems.
Service Engineers work for the OEM and use their knowledge about industrial applications to find suitable use cases for FL. Like Domain Experts in ML, they support the model-building process by using their knowledge of real-world use cases where the model would be applied. 11 For example, if service engineers know about industrial pumps that work in similar environments and need to be monitored regarding their temperature, they can suggest using FL to improve anomaly detection models. Service engineers require detailed information about the system's current status, including which clients participate and how knowledge is shared between them, to improve the performance of provided services. This information would enable them to make optimizations, such as changing the cohort of a client or excluding underperforming clients from the system. Service engineers also need to evaluate if FL benefits the clients or if they would be better off training their models individually.
Data Scientists are responsible for model development and validation. 11 They work for the OEM to analyze and gain new insights from the data generated by assets. Data scientists require information about what data are recorded on edge devices to make appropriate design decisions for creating ML models. Expanding on the previous example, data scientists would design and implement the anomaly detection model for industrial pumps. Understanding why clients performed better or worse than their peers during training would also help them adapt their models accordingly.
Machine Operators often work on site and are responsible for several machines. In an ML context, they are Model Consumers 11 and use the model output of their machines to find potential problems. For example, if a predictive maintenance service detects issues, machine operators can check if there have been changes in the operating environment or if sensors have been misconfigured.

Industrial Federated Learning
Industrial FL (IFL), as proposed by Hiessl et al., 7 adapts FL to the use cases and characteristics of the industry, including heterogeneous training data and diverse client requirements. Clients with the same asset type, ML problem, and FL algorithm are potential federation candidates and are grouped into populations. Additionally, IFL only allows clients with similar working conditions and data distributions to federate together in cohorts, resulting in overall better-performing ML models. As the processes and results of IFL are currently not transparent, stakeholders lack a representation approach that supports them in their decision-making. Our work builds upon IFL to make the system more understandable and to better serve industrial stakeholders' requirements.
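The grouping logic can be made concrete with a minimal sketch. In the snippet below, a client's data distribution is reduced to a single scalar and the similarity check is a simple threshold on that value; these are illustrative simplifications and not the actual criteria of the IFL system.

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass
class ClientInfo:
    client_id: str
    asset_type: str      # e.g., "pump"
    ml_problem: str      # e.g., "anomaly_detection"
    fl_algorithm: str    # e.g., "FedAvg"
    data_mean: float     # stand-in summary of the local data distribution

def group_clients(clients, similarity_threshold=1.0):
    """Group clients into populations (same asset type, ML problem, and FL
    algorithm), then split each population into cohorts of clients with
    similar data distributions."""
    population_key = lambda c: (c.asset_type, c.ml_problem, c.fl_algorithm)
    cohorts = {}
    for pop_key, members in groupby(sorted(clients, key=population_key), key=population_key):
        members = sorted(members, key=lambda c: c.data_mean)
        current, cohort_id = [], 0
        for client in members:
            # Start a new cohort when the distribution gap exceeds the threshold.
            if current and client.data_mean - current[-1].data_mean > similarity_threshold:
                cohorts[(pop_key, cohort_id)] = current
                current, cohort_id = [], cohort_id + 1
            current.append(client)
        if current:
            cohorts[(pop_key, cohort_id)] = current
    return cohorts
```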

Federated Learning Visualizations
Open-source frameworks are often used to speed up the development of FL systems. While new frameworks are continually being developed, the most established ones are TensorFlow Federated and FATE, 12 with the latter allowing visualizations using FATEBoard. FATEBoard shows basic information like running time, log outputs, and current status for different components while also giving an overview of the progress of jobs. 13 Unfortunately, no detailed information regarding client anomalies is provided. 14 Another alternative is NVIDIA FLARE, which adapts existing workflows for federation and provides visualizations during FL experiments using TensorBoard. 15 While several open-source frameworks support FL visualizations, they mainly focus on representing the model training and validation stages. Our contribution goes beyond this by covering the typical industrial FL process, beginning with the initial client registration of real-world assets and ending with the final model deployment, where models are used in production.
The visualization of FL systems has also been studied in the literature. Wei et al. 16 used FL in a car racing game and enabled visual inspections of the current status. Using this system, users could see how much each client contributed and how the global model improved over time. Although their method enables a thorough examination of the FL process for their specific use case, privacy-sensitive data from clients are also displayed in their visualizations. One of the most comprehensive studies on visualizing FL was conducted by Li et al., 14 who proposed a system for inspecting the running process in horizontal FL. In addition to basic metric visualizations, they enable client anomaly detection, pairwise comparison of clients, and contribution analysis. Our work follows a similar approach for visualizing the model training phase while also focusing on the other stages of the industrial FL process. We propose the lifecycle dashboard to convey and represent the status of FL systems to industrial stakeholders by making processes more transparent and understandable.

SYSTEM DESIGN
This section outlines the design and architecture of the lifecycle dashboard and the representation approaches implemented to meet the varying requirements of industrial stakeholders. While this section expands on the IFL system proposed by Hiessl et al., 7 the core concepts can also be applied to other domains.

Requirements
Li et al. 14 have studied what requirements a visualization system for horizontal FL needs to fulfill to meet the demands and expectations of experts. Expanding on this analysis, we gathered additional feedback from stakeholders, including a senior data scientist with ten years of experience, a researcher working in the field of FL for four years, and an industrial project manager with more than 15 years of experience. We derived requirements and expected features for our representation approach by conducting an informal interview session, where we initiated a dialog based on the following topics:
› What functionality would you expect from an FL dashboard?
› What data should the dashboard display?
› How are you planning on using the dashboard for your use case?
Using open-ended questions as a starting point, our conversations also touched on other aspects, including genericity for different use cases and how to evaluate FL compared to individual training. From our discussions, we derived the following requirements.
System View: As service engineers want to evaluate if FL is suitable for a specific use case, they must understand the system's structure and whether FL benefits the participating clients. A system overview would allow them to understand how knowledge is shared and if processes can be improved. For example, if service engineers experiment with the system by adding or removing clients, they want to know how these actions impact the global model. In contrast, machine operators need detailed information about specific clients. Both macro and micro views must be supported to meet the expectations of both stakeholder groups.
Client Behavior: To better understand FL processes, it is essential to know the status of clients and if they participated in communication rounds. 14 Especially during training, it is important to see for each communication round which clients participated, how much they contributed, and how they fared compared to their peers. In our interviews, data scientists mentioned that abnormal client behavior, such as under- or overperforming, should also be visualized, as this information can be used to optimize the system. For example, the whole system could benefit by replicating the working environments of clients that consistently overperform.
FL Evaluation: For the evaluation of FL systems, not only the shared model's performance is relevant, as individual clients could achieve better results when training on their own. They could also negatively contribute to the global model if they operate in different environments that impact the quality of their training data. For example, if a new client with unusual lighting conditions joins the system, service engineers need to know if this client can still benefit from the shared FL model. Therefore, a way to evaluate if FL was beneficial for clients is required.
Genericity: In our interviews with industry experts, possible FL use cases were discussed, which can differ significantly from each other. As populations of clients can work on different ML problems, the captured metrics that measure the performance can also vary. For example, metrics like accuracy and loss are suitable for evaluating classification tasks while not appropriate for regression models. Therefore, the representation approach must provide a certain level of genericity to be applicable for different kinds of ML model types.

IFL Lifecycles
Models in FL systems undergo several lifecycle stages. 17 After the initial client registration and management, models are trained locally on clients and are aggregated by a central server. The updated model is then evaluated and optionally deployed by the clients. As these steps are central to the FL process, they need to be more transparent for stakeholders. The following lifecycle stages are examined in greater detail as the IFL system emphasizes them.
Client management: Upon starting the IFL server, clients can register to participate in the system. They also send information about themselves, including their asset information, ML problem, and training data distribution. The server uses this data to group similar clients in cohorts that federate together.
Training: Clients receive an initial model from the server and start training it on their local dataset. For each communication round, the clients also evaluate the model's performance on a predefined test dataset to collect metrics like accuracy and loss before sending the updated model parameters and metrics to the server.
Validation: After the last communication round, clients can evaluate the final model on their local test datasets. In the IFL system, clients also train a local model individually without participating in FL. With this additional model, clients can evaluate if they benefitted from FL or achieved better performance by training independently. In the last step, the performance of both models is sent to the server.
Deployment: Clients can now optionally deploy their model and use it in production. In most cases, clients will deploy the new model if it surpasses the current one in performance.
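To make the validation and deployment steps more tangible, the following sketch shows how a client could compare the final FL model against its individually trained model, report the outcome, and decide whether to deploy; the function, field names, and the accuracy-only comparison are illustrative assumptions rather than the actual IFL client implementation.

```python
from enum import Enum
from typing import Optional, Tuple

class LifecycleStage(Enum):
    CLIENT_MANAGEMENT = "client_management"
    TRAINING = "training"
    VALIDATION = "validation"
    DEPLOYMENT = "deployment"

def validate_and_deploy(fl_accuracy: float,
                        individual_accuracy: float,
                        deployed_accuracy: Optional[float]) -> Tuple[dict, bool]:
    """Sketch of a client's validation step and deployment decision.

    fl_accuracy:         final federated model evaluated on the local test set
    individual_accuracy: locally trained model (without FL) on the same test set
    deployed_accuracy:   model currently in production, or None if none exists
    """
    report = {
        "stage": LifecycleStage.VALIDATION.value,
        "fl_accuracy": fl_accuracy,
        "individual_accuracy": individual_accuracy,
        "fl_beneficial": fl_accuracy >= individual_accuracy,
    }
    # Deploy the new federated model only if it surpasses the current one.
    deploy = deployed_accuracy is None or fl_accuracy > deployed_accuracy
    return report, deploy

report, deploy = validate_and_deploy(0.93, 0.90, 0.88)
print(report, deploy)  # FL was beneficial and outperforms the deployed model
```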

Lifecycle Dashboard
The lifecycle dashboard is a representation approach to convey the status of FL systems by visualizing data sent from the FL server. Figure 1 shows how the lifecycle dashboard integrates with an existing IFL system. The dashboard and FL server communicate through a common library that defines a standardized interface for exchanging data. This interface is designed to be generic and adaptable for different industries while supporting IFL-specific information, where clients send information about their industrial assets. For example, in the healthcare industry, where hospitals want to work on a common ML problem without sharing sensitive data, clients could provide other domain-specific information, such as the type of disease or the number of affected patients. By accepting all kinds of numeric metrics, the dashboard's visualizations are model agnostic. Clients can decide which metrics to publish based on their active tasks, including supervised, unsupervised, and reinforcement learning use cases. The dashboard is divided into the previously mentioned IFL lifecycles and visualizes each stage to make processes more transparent. It allows navigating from a general overview to a more fine-grained perspective, matching the abstraction levels defined by the FL server. For the IFL system, this corresponds to populations, cohorts, and clients.
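As a concrete illustration of such a standardized interface, the dataclasses below sketch a possible schema for the data the FL server could push to the dashboard; the type names and fields are hypothetical and do not correspond to the actual common library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ClientUpdate:
    """One client's report for a single communication round."""
    client_id: str
    communication_round: int
    participated: bool
    metrics: Dict[str, float] = field(default_factory=dict)   # e.g., {"accuracy": 0.91, "loss": 0.23}
    contribution: Optional[float] = None                      # aggregation weight, if known
    asset_info: Dict[str, str] = field(default_factory=dict)  # domain-specific, e.g., {"asset_type": "pump"}

@dataclass
class CohortSnapshot:
    """Data the FL server pushes to the dashboard for one cohort."""
    population_id: str
    cohort_id: str
    lifecycle_stage: str  # "client_management" | "training" | "validation" | "deployment"
    current_round: int
    updates: List[ClientUpdate] = field(default_factory=list)
```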
As shown in Figure 2(a), the dashboard displays an overview that gives insights into the system's structure. By showing the cohorts of a selected population, users can better understand how knowledge is shared between clients. In addition, the average accuracy of the population is shown, giving a quick overview of how the population is currently performing. As all clients in a population work on the same ML problem (e.g., condition monitoring of pumps), this page can also include a more detailed description of the task, including which FL algorithm is used. On the map, users can see markers for the location of clients, with the color representing the cohort. More detailed information about a specific cohort is shown in Figure 2(b). By displaying the current training round and the latest training metrics, users can quickly determine the training's progression and how well clients performed. Users can also select clients to get more details about their current lifecycle stage and previous training and validation performance. For example, machine operators working on site can use this information to identify a client's status and debug possible anomalies.

The main components for the training lifecycle stage are shown in Figures 2(c) and (d). Similar to the representation approach proposed by Li et al., 14 this boxplot graph visualizes the model performance of clients in a cohort during training. The y-axis displays the elapsed training time in communication rounds, while the x-axis shows the boxplots of different metrics. Besides allowing data scientists to identify client outliers quickly, these boxplots also visualize the average client performance over time. Additionally, multiple clients can be selected to highlight and compare their behavior across communication rounds, including whether they trained and returned their local model parameters. The bars on the left represent how much individual clients contributed to the global model. While it depends on the specific use case and the FL server configuration, model contributions of clients are often not of equal importance. Clients with limited data access should often contribute less to the global model than clients with vast amounts of data. However, dataset size is not always the right weighting criterion. The FL server can also consider other contribution metrics, like data quality, during model parameter averaging. The yellow line on the graph indicates how many clients participated in each communication round. This information can help identify network problems, for example, if the yellow line suddenly drops and fewer clients participate in a specific communication round.
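As an illustration of how the dashboard could derive the boxplot statistics for one metric in a single communication round, consider the following sketch; the function is a hypothetical helper, not part of the prototype.

```python
import numpy as np

def boxplot_stats(values):
    """Five-number summary and outliers for one metric across the clients
    of a cohort in a single communication round."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {
        "min": float(np.min(values)), "q1": float(q1), "median": float(median),
        "q3": float(q3), "max": float(np.max(values)), "outliers": outliers,
    }

# Accuracy of all participating clients in one round; the last value is an outlier.
print(boxplot_stats([0.82, 0.85, 0.84, 0.86, 0.83, 0.61]))
```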
In Figure 2(e), the validation chart for evaluating the FL performance is shown. The accuracy of the FL model and the individually trained model is displayed for each participating client. While minor differences between the individual and federated model are expected, more significant differences could indicate that the client does not belong in this cohort. However, before moving the client to its own cohort, data scientists would need to check how much this client influenced the performance of the other clients in the cohort. While this client might be better off training alone, it could also provide valuable knowledge to other clients, in which case the whole cohort's performance would suffer significantly if the client were moved to its own cohort. As all clients in a population work on the same ML problem, performance differences between models could be attributed to external factors like lighting conditions, unwanted vibrations, or slight differences in the underlying data distributions. A machine operator on site needs to investigate these factors to evaluate if there are differences in the environmental conditions and whether they can be adapted.
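The chart's underlying comparison can be sketched as follows: for each client, the accuracy of the FL model is compared with that of the individually trained model, and clients whose individual model is clearly better are flagged for inspection. The function name, input format, and tolerance are illustrative assumptions.

```python
def validation_overview(results, tolerance=0.02):
    """Flag clients whose individually trained model beats the FL model.

    results: {client_id: (fl_accuracy, individual_accuracy)}
    Returns the flagged clients and how much worse the FL model performed.
    """
    flagged = {}
    for client_id, (fl_acc, ind_acc) in results.items():
        delta = fl_acc - ind_acc
        if delta < -tolerance:
            flagged[client_id] = delta
    return flagged

print(validation_overview({
    "pump-01": (0.93, 0.90),
    "pump-02": (0.88, 0.94),  # individually better -> candidate for re-cohorting
    "pump-03": (0.91, 0.91),
}))
```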
As shown in Figure 2(f), the lifecycle dashboard also visualizes which clients have deployed their models. The dashboard allows users to download models of cohorts and specific clients for each communication round on their local machines. Seeing how the model changes over time can help data scientists reason about performance differences and allows them to compare clients' models.

EVALUATION
We evaluated the lifecycle dashboard prototype to determine if the different visualizations helped make the FL process more transparent and understandable for industrial stakeholders. This section outlines the experimental design and gathered results.

Experimental Design
We conducted semistructured expert interviews with a representative sample of future users to gather qualitative feedback on the presented information in the dashboard, as shown in Figure 2. In Table 1, the participants of the study are listed. We chose experts from different professions and backgrounds interested in using FL in their projects to evaluate if the prototype could support their unique workflows.
The interviews lasted between 40 and 60 min and were held online, allowing us to record them for later analysis. We introduced the participants to the IFL system and gave them access to the lifecycle dashboard.
As shown in Figure 2, the prototype visualized synthetically generated data representing a smart condition monitoring use case for industrial pumps. Two cohorts with a total of 19 clients participated in the system. The FL server was configured to use the FedAvg algorithm for both cohorts separately and to ask all available clients to train and return their local model parameters in each communication round. The clients worked on a supervised learning task, specifically an anomaly detection problem on time series data, to detect unusual working conditions, such as high temperatures or sudden temperature changes. Both cohorts finished their training in 20 communication rounds, with each client collecting metrics like accuracy, loss, and latency. Only two clients would have performed better using their individually trained model instead of the FL model in this scenario. After giving the participants a quick overview of the use case, we asked them to share their screens and complete several tasks in the dashboard, including the following:
1) What problem do the clients work on, and how is the FL system structured? What is the current status of the clients?
2) What is the training status of the first cohort (progress, number of clients, metrics)? Are there any clients that consistently under- or overperformed?
3) Did FL improve the performance of clients compared to individual training? What steps should be taken to improve the overall performance?
4) Which clients have deployed their models?
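For illustration, the evaluation scenario described above could be summarized in a configuration sketch like the following; the structure and field names are hypothetical and do not reflect the actual IFL server configuration format.

```python
# Hypothetical description of the synthetic evaluation scenario.
experiment_config = {
    "use_case": "condition_monitoring_pumps",
    "ml_problem": "anomaly_detection_time_series",
    "fl_algorithm": "FedAvg",      # applied to each cohort separately
    "cohorts": 2,
    "total_clients": 19,
    "clients_per_round": "all",    # every available client trains in each round
    "communication_rounds": 20,
    "client_metrics": ["accuracy", "loss", "latency"],
    "data": "synthetic",
}
```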
The participants could freely navigate the dashboard to find the information required to complete the tasks they were given. The intent behind their actions and possible ambiguities during the tasks were captured using a think-aloud protocol. Finally, the industry experts were asked about future application scenarios and the types of assets and industrial processes that FL could support. We evaluated the transcribed interviews using thematic analysis 18 and categorized the results into overarching themes matching the presented FL lifecycles. These central themes were then subdivided into smaller groups, in which similar statements and feedback from participants were collected. The prototype-related results contain the feedback we gathered concerning the dashboard and its visualizations, while the domain-related results contain information about industrial use cases and their specific requirements.

Prototype-Related Results
All the participants were able to determine the system's structure, which populations and cohorts were present, and the current lifecycle status of the FL model. The prototype also successfully conveyed information about specific clients, such as their associated assets and the latest collected metrics. After the participants had familiarized themselves with how the training graph in Figure 2(c) visualizes information, their feedback was consistently positive. Participant P2 highlighted the functionality to see how much clients contributed during training, while participant P6 noted, "[The training graph] has everything that a data scientist would like to have." While interviewees with a data science background recognized the dashboard's system debugging abilities, like identifying outliers, they expressed an additional need to inspect the trained models. Without access to training data, the participants had difficulty discovering why clients underperformed and examining models for edge cases. This demand represents a significant challenge in debugging models in FL systems, as training data stay local to the clients and cannot be accessed from the server. Participant P4 suggested that clients could provide contact information or links for connecting to edge devices directly if customers allow them to do so. Because the lifecycle dashboard was designed to meet the needs of different stakeholders, some visualizations have a higher information density and are tailored to specific stakeholders. Participants with a strong background in data science were able to determine underperforming clients in Figure 2(c), while others had difficulties interpreting how outliers are visualized. Regarding this complexity, one participant mentioned the following.
The goal of such a dashboard does not have to be that it is intuitive for all people [...], but for precisely whoever operates it (P5).
For the validation lifecycle step, interviewees were asked to interpret the validation graph in Figure 2(e). The participants were able to detect the clients who individually performed better than with their FL model and were also able to derive the next actions for improving the shared model, for example, after recognizing that the third client would be better off not participating in FL.
Other interviewees also responded positively to the validation graph, stating, "Above all, I find it very interesting [...] whether FL actually improved something (P1)." Additionally, all participants were able to determine which clients deployed their models.

Domain-Related Results
Different promising use cases for industrial FL were discussed, including data centers, energy communities, and smart grids. Participant P1 also highlighted that not sending training data to a central server could benefit use cases in facility and building management. Regarding the potential of FL in the industry, participant P7 stated the following.
One of the most important aspects [of FL] with our use cases is that you can learn much faster due to the [data] heterogeneity. Currently, it sometimes takes us forever to learn something, even though we have 1000 units worldwide with 30 very similar units (P7).
The interviewee was mainly referring to the learning process of their ML models. The participants were also aware of the challenges and problems associated with industrial FL.
There are many applications where the same hardware and automation devices are used, but the actual application is entirely different (P6).
While some FL prerequisites can be automatically checked by the server, like matching asset types and similar data distributions, finding similar application areas requires a semantic understanding of the specific use cases. One participant noted that some processes are too dynamic and individual for a successful FL implementation, even with matching application areas.
The major problem is finding the use cases and models. I could only imagine FL or condition monitoring [...] with very scalable and similar processes and with processes that are constantly running [...], like temperatures, pressures, and vibrations (P3).
All characteristics and aspects must be considered before deciding if an industrial process is a suitable candidate for FL.

DISCUSSION
The results of the qualitative analysis show improvements in the understandability of FL systems, and the feedback gathered from the interviewed experts was overall favorable. Visualizing the different FL model lifecycle stages helped convey the system's current state and increased transparency. The validation graph of the lifecycle dashboard, in particular, was well received by the experts, as it could support the improvement of FL systems by helping to decide which clients should participate.
The study also outlined a significant challenge data scientists face when debugging FL systems. We evaluated how they could detect outlying clients using the training graph in Figure 2(c). However, as training data are not shared with the FL server, the debugging possibilities are limited after detecting client anomalies. There are several different ways to improve the debugging process. If customers are unwilling to share training data with the FL provider, they must use their own resources to debug their clients' models locally. Depending on the performance of these clients, the FL provider could also exclude them from future communication rounds if no improvements are noticeable or assign them to different cohorts. These measures would prevent a deteriorating shared model for other customers. If customers agree to let the FL provider access clients' data, the lifecycle dashboard can provide endpoints to let users remotely connect to the clients. This would enable a data scientist working for the FL provider to download the clients' data and debug the model on their local machine. Alternatively, they could use data visualizations if the FL clients provide them. In both cases, no business-sensitive client data are shared between different customers. In addition, visualizing the statistical properties of the clients' data distributions could also help with the debugging process.
The interviewed experts saw high potential for future use in industrial applications. The greater transparency provided by the lifecycle dashboard could support the rollout of FL systems and help identify possible problems with participating clients sooner. However, one challenge for service engineers is finding industrial processes that are suitable candidates for FL. Applications that try to improve the understanding of such systems must provide detailed information about processes, how they work, and how they change over time. Only with this information can service engineers make informed decisions about whether FL is beneficial for specific use cases.

LIMITATIONS AND OUTLOOK
In this article, we interviewed seven experts to receive immediate feedback on our prototype during its early development phase. More debugging options would support data scientists in building better-performing models and enable deeper insights into how the system could be optimized. Currently, the developed prototype focuses on improving the understandability of FL systems but only provides limited interaction options to users. Broader interactivity could also positively contribute to the shared model's performance, for example, by allowing users to exclude underperforming clients or move them to different cohorts. Besides the explainability aspect of FL systems, future research could also explore how FL can be applied to other industries. Different use cases, deployment approaches, business models, and collaboration partners that would benefit from FL could be evaluated. This evaluation would also help discover new requirements for the lifecycle dashboard and its utilization in different environments.
After integrating their feedback about debugging and interaction possibilities, we will survey 30+ customers, product managers, data scientists, and FL experts. In addition, we plan on gathering insightful metrics regarding the human-centric aspects of the dashboard, including the mental workload (NASA-TLX), system usability, user experience, and technology acceptance scores. Furthermore, alpha rollouts in a living lab will enable us to integrate findings and feedback from users exposed to our lifecycle dashboard showing real-world data from a condition monitoring use case for several weeks.

CONCLUSION
In this article, we presented the lifecycle dashboard as a representation approach to convey the status of FL systems to industrial stakeholders. We identified the visualization of lifecycle stages as a crucial element to support the widespread adoption of FL in the industry by letting users gain a better understanding of and trust in FL systems. Our interviews demonstrated that our prototype increased the transparency of FL, and experts indicated strong support for future industrial use cases.