Beyond building predictive models: TwinOps in biomanufacturing

— On the wave of more and more manufacturers embracing the pervasive mission to build digital twins, the biopharmaceutical industry also envisions a significant paradigm shift towards digitalisation and an intelligent factory where bioprocesses continuously learn from data to optimise and control productivity. While extensive efforts are made to build and combine the best mechanistic and data-driven models, there has not yet been a complete digital twin application in pharma. One of the main reasons is that production deployment becomes more complex given the possible impact such digital technologies could have on vaccine products and ultimately on patients. To address current technical challenges and fill regulatory gaps, this paper explores best practices for TwinOps in biomanufacturing – from experiment to GxP validation – and discusses approaches to oversight and compliance that could work with these best practices towards building bioprocess digital twins at scale.

BIOPROCESS DIGITAL TWINS are virtual representations of the complete bioprocess chain from raw materials to drug products, aiming to predict critical quality attributes (CQAs) and performance attributes (PAs) across all upstream, downstream, and secondary operations. The ultimate goal is to solve all bioprocessing tasks by providing next-best-actions, confidence of predictions, and financial impacts. Many manufacturing industries already have twins with a certain degree of integration with the control of physical systems [1,2]; however, medical and biotech companies are struggling to productionise digital twin applications [3]. Most of the published research in life science R&D has been focusing on simulations and "hybrid models" [4] derived by combining expert knowledge with data-driven analysis that leverages experimental data and Process Analytics Technology (PAT) data [5]. Advanced machine-learning (ML), or artificial intelligence (AI), techniques are also being employed to alleviate the challenges of sparse and small data sets through, e.g., data augmentation, ensemble learning [6], and self-supervised learning (SSL) [7]. Could we already call such predictive models "digital twins"? [8]. Besides the off-line use of hybrid models, standards for integration and validation of twin technologies are key enablers towards a GxP-compliant digital twin product. In a few words, GxP compliance refers to a series of "Good Practices", defined by the Food and Drug Administration (FDA) or the European Medicines Agency (EMA), that guide quality in regulated industries [9]. Wu et al. [10] state that most FDA-approved AI devices have been tested and validated only in retrospective studies, while none of the high-risk devices were evaluated in prospective studies.
While there is no doubt about digital twins' value in reducing costs, increasing compliance, and optimising production of drug products [7], less clear is the DevOps architecture required to build, deploy, and maintain digital twins in strongly regulated industries such as life science [8]. Of interest is the concept of TwinOps [11], which combines practices of DevOps and model-based engineering to tackle the physical system modeling aspect. In biomanufacturing, TwinOps would ensure a high-quality target profile of the final drug product, thus adhering to the Quality by Design (QbD) paradigm [4,5]. Minerva et al. [12] propose a generic architectural model for digital twins to use in different industries. Cloud-based digital twin platforms are available on the market [13]. The concern is whether they suit the complex data and model requirements along with the strict regulations characterising Pharma 4.0. To the best of our knowledge, best practices for orchestration, validation, and compliance checks of automated data and model pipelines for an effective TwinOps lifecycle are at a very early stage in the pharmaceutical industry. To address these strong regulatory needs from a technology perspective, this paper outlines TwinOps requirements and tailored considerations for digital twin applications in biomanufacturing. We conclude that proper TwinOps will bring bioprocess digital twins, as well as the whole of Pharma 4.0, to the next level.

TWINOPS LIFECYCLE
For DevOps and automation, context is key: there is no "one size fits all" [14]. The high complexity of bioprocesses and strict GxP regulations steer TwinOps lifecycle customisation in pharma (Fig. 1).

Experiment
The experimental phase starts with understanding the physical process, i.e., the complex dynamics of fed-batch cell cultures in bioreactors, to enable its mapping to the digital product. A lack of correlation between process definition and product definition leads to twin solutions based on "ideal definitions" of the process, which turn into limited twin performance in production [15].
During experimentation, initial data acquisition from past process samples or through Design of Experiments (DoE) techniques allows parameter fitting and validation of initial models against actual measurements.
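To illustrate the parameter-fitting step, the sketch below fits a single growth-rate parameter of a toy exponential-growth model to hypothetical offline biomass measurements. The data values, the model form, and the grid-search approach are illustrative assumptions, not the article's actual models.

```python
# Sketch: fit a growth-rate parameter mu to offline biomass samples
# from a DoE run. Model, data, and fitting method are illustrative.
import math

def simulate_biomass(x0, mu, t):
    """Toy exponential-growth model X(t) = X0 * exp(mu * t)."""
    return x0 * math.exp(mu * t)

def fit_mu(samples, x0):
    """Grid-search the growth rate minimising squared error
    between simulated and measured biomass."""
    best_mu, best_err = None, float("inf")
    for i in range(1, 200):
        mu = i / 1000.0          # candidate rates 0.001 .. 0.199 per hour
        err = sum((simulate_biomass(x0, mu, t) - x) ** 2
                  for t, x in samples)
        if err < best_err:
            best_mu, best_err = mu, err
    return best_mu

# Hypothetical measurements: (time in h, biomass in g/L)
train = [(0.0, 1.0), (10.0, 2.72), (20.0, 7.39)]
mu_hat = fit_mu(train, x0=1.0)
```

In practice, a least-squares optimiser would replace the grid search, and the fitted model would be validated against held-out measurements before feeding the twin.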

DevOps
Many companies are facing the challenge of unlocking the value of data through data science technologies [16,17]. This is because DevOps concepts are not adopted from the initial stages of data acquisition and modeling. In biomanufacturing, making data-driven ML components, simulators, and mechanistic models work together is not trivial. TwinOps are DevOps practices extended to hybrid models, their integration, and their validation.
TwinOps must ensure management and integrity of training data (from offline databases, DoE, simulations, real-time sensors, and PAT data), cover model retraining needs that differ across twin product components (e.g., the state estimator, optimiser, and controller of the bioprocess), and address concerns of model transparency and explainability. In fact, proper DevOps is also a key enabler for the validation phase.
Challenges in orchestrating and deploying data ingestion pipelines from multiple sources and testing model releases in an automated fashion using Continuous Integration and Continuous Delivery (CI/CD) pipelines are discussed in Section 2.
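As one concrete shape such automated release testing could take, the sketch below shows a promotion gate a CI/CD pipeline might run before deploying a retrained twin component. The metric names, values, and lower-is-better convention are illustrative assumptions.

```python
# Sketch of a CI/CD release gate for a retrained twin component.
# Metric names and thresholds are illustrative, not prescribed.

def promote_candidate(current_metrics, candidate_metrics,
                      max_regression=0.0):
    """Promote only if the candidate matches or beats the current
    model on every tracked error metric (lower is better)."""
    for name, current_value in current_metrics.items():
        candidate_value = candidate_metrics.get(name)
        if candidate_value is None:      # missing metric fails the gate
            return False
        if candidate_value > current_value + max_regression:
            return False
    return True

current   = {"rmse_titer": 0.41, "rmse_viability": 0.09}
candidate = {"rmse_titer": 0.38, "rmse_viability": 0.08}
deploy = promote_candidate(current, candidate)   # gate passes here
```

In a GxP setting, the gate's verdict and the underlying metrics would also be logged as evidence for the validation stage.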

Validation
Validating data and models is a mandatory step for deploying applications in regulated industries such as life science. Digital twin products must be GxP compliant. This means that, before production releases, newly generated data and models must go through an 'approval gate' that checks the GxP compliance of the bioprocess digital twin application.
The use of TwinOps practices plays a crucial role in overcoming quality control and validation challenges. It enables explainability through automated reporting, monitoring scripts for data and model drifts, and pipelines to score models against predefined business KPIs.
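A minimal sketch of such a drift-monitoring script is shown below: recent sensor readings are compared against a validated reference window, and the model is flagged for review when the mean shifts beyond a band. The pH data and the 3-sigma rule are illustrative assumptions.

```python
# Sketch of an automated data-drift check for the validation stage.
# Reference window, recent window, and 3-sigma band are illustrative.
import statistics

def drift_detected(reference, recent, n_sigmas=3.0):
    """Flag drift when the recent mean leaves the band
    reference_mean +/- n_sigmas * reference_stdev."""
    mu = statistics.mean(reference)
    sigma = statistics.stdev(reference)
    return abs(statistics.mean(recent) - mu) > n_sigmas * sigma

ph_reference = [7.0, 7.1, 6.9, 7.05, 6.95]   # hypothetical golden window
alarm = drift_detected(ph_reference, [7.5, 7.6, 7.4])
```

Production monitoring would typically use more robust tests (e.g., on distributions rather than means), but the gating pattern is the same.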

BIOPROCESS DIGITAL TWINS IN PRODUCTION
Deploying twins for practical use is an industrial concern that goes largely unaddressed in research papers. With respect to the level of integration of digital twins, three kinds of twins have been identified in the literature: 'claimed twin', 'described twin', and 'actual twin'. Half of the surveyed works described a 'digital model' or 'digital shadow', although the authors claimed a digital twin in their papers [2].
A major issue is the lack of clarity and specificity of digital twin concepts across different industry fields. Adopting a domain-specific perspective, we propose a breakdown of the complexities arising from bioprocess digital twins in production. To fix ideas, Figure 2 depicts a bioreactor along the production line that the digital twin user (DT user) can control through an industrial PC (IPC) system, performing actions suggested by the twin (in the case of indirect control). Manual control is a temporary solution we implement to demonstrate the reliability of the twin and accumulate data before moving to a fully automated process. The ultimate state is a fully integrated digital twin that continuously gathers and delivers data back to where it is needed in the value chain, using a closed-loop monitoring and decision-making process to directly control the bioreactor.
In general, the two widely adopted components of a bioprocess digital twin are: a state estimator, which takes current conditions from sensors and process parameters to predict the process outcome; and a controller, which is triggered when the predicted process outcome is not optimal with respect to the production target. Once triggered, the controller uses an optimiser to evaluate possible actions and perform the best one, accounting for physical process constraints and business criteria, towards steering production back to the optimal target. How can we build such predictive capabilities and orchestrate the components of the bioprocess digital twin?
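The estimator/controller orchestration just described can be sketched as a single decision step. Every model and number below is a toy placeholder (a made-up titer-vs-temperature relationship, a hypothetical setpoint constraint), meant only to show the control flow, not a production component.

```python
# Minimal sketch of one twin decision step: estimate the outcome,
# trigger the optimiser only if the prediction leaves the target band.
# All models, setpoints, and numbers are illustrative placeholders.

def state_estimator(sensors):
    """Toy outcome model: predicted titer drops with temperature
    deviation from a hypothetical 37 C setpoint."""
    return 10.0 - 2.0 * abs(sensors["temp_c"] - 37.0)

def optimiser(candidate_actions):
    """Pick the feasible setpoint whose simulated outcome is best,
    respecting a hypothetical physical constraint (30-40 C)."""
    feasible = [a for a in candidate_actions if 30.0 <= a <= 40.0]
    return max(feasible, key=lambda a: state_estimator({"temp_c": a}))

def twin_step(sensors, target=9.5):
    predicted = state_estimator(sensors)
    if predicted >= target:
        return None                    # in spec: no action suggested
    return {"set_temp_c": optimiser([35.0, 36.5, 37.0, 38.0])}

action = twin_step({"temp_c": 38.2})   # a perturbed run triggers control
```

With indirect control, the returned action would be a suggestion shown to the DT user on the IPC; with closed-loop control, it would be written back to the bioreactor directly.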

Data ingestion pipelines
There are two main types of dynamic data sources: data from real-time sensors and data from PLCs. Such data is usually collected every few seconds or minutes and ingested from PAT software into the data lake. Sensor and control data history mainly comprises in-spec events, i.e., samples around the optimal production target and control parameters set at production specifications. This means that if we retrain the state estimator with production data, it gets more and more precise around the optimum while it loses confidence when quality is outside specification ranges.
However, the main goal of the twin is controlling the bioprocess in case of out-of-spec events. How to keep production at target in case of a temperature perturbation? How to handle a pipe's loss of pressure? Initial data collection is done via DoE. Since lab experiments are prohibitively expensive, experimental data is small in size and seldom refreshed. Given the manual-entry nature of this data, a data format validation pipeline is recommended as a first sanity check before versioned storage into the data lake. Experimental data primarily serves for fitting and validating the mechanistic models and simulators developed to replicate the bioprocess. Such complex simulation models are then used to generate larger datasets of simulated out-of-spec events, constituting the main data source to train the twin's ML components. Automated ingestion of simulated data into the data lake is often a challenging task since it requires support for third-party simulation software.
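The format-validation sanity check recommended above can be as simple as a schema-and-range check run before the versioned write to the data lake. The column names, ranges, and rows below are illustrative assumptions about what a manually entered DoE record might look like.

```python
# Sketch of a format-validation step for manually entered DoE data.
# Column names, ranges, and example records are illustrative.

EXPECTED_COLUMNS = {"run_id", "temp_c", "ph", "titer_g_l"}
RANGES = {"temp_c": (25.0, 42.0), "ph": (6.0, 8.0), "titer_g_l": (0.0, 50.0)}

def validate_record(record):
    """Return a list of human-readable issues; empty means valid."""
    issues = []
    missing = EXPECTED_COLUMNS - record.keys()
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
    for col, (lo, hi) in RANGES.items():
        value = record.get(col)
        if value is not None and not (lo <= value <= hi):
            issues.append(f"{col}={value} outside [{lo}, {hi}]")
    return issues

good = {"run_id": "DOE-07", "temp_c": 37.0, "ph": 7.1, "titer_g_l": 8.4}
bad  = {"run_id": "DOE-08", "temp_c": 97.0, "ph": 7.1}  # typo + missing col
```

Records that fail the check would be quarantined with their issue list for manual review, keeping the data lake version history clean.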

Hybrid model pipelines
The main characteristic of bioprocess digital twins is the adoption of so-called "hybrid models" [4], which help replicate complex dynamics, e.g., of cell metabolism within the bioreactor, through expert-based models while leveraging ML to learn and timely predict such dynamic patterns over time.
For off-line use, dynamic simulators serve to replicate abnormal events along the bioprocess chain and explore the solution space (process outcome) by varying process parameters. Simulated data from the mechanistic model is used to train the twin to react on-line when the bioprocess is out of control. Such sophisticated models are often computationally slow, thus their use for real-time control is limited. ML models run on top of mechanistic models to overcome these computational challenges and confer on the twin adequate predictive power with respect to the actual bioprocess.
Making hybrid models work together requires challenging retraining, validation, and deployment cycles performed by multiple automated CI/CD pipelines. For example, what happens when a new mechanistic model is built? We need a validation pipeline to assess simulation accuracy against process data. Qualified simulation data can then be used to retrain the ML components, which are redeployed if they outperform the currently deployed ML models. It is worth noting that automating the deployment of the new simulation model itself might suffer limitations depending on the simulation software or source code compatibility with the twin platform.
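The qualify-then-retrain cycle above can be sketched end to end: a new mechanistic simulator is accepted only if it matches plant measurements, and only then is its output used to train the ML surrogate. The quadratic "simulator", the lookup-table "surrogate", and all tolerances are stand-ins for the real hybrid components.

```python
# Sketch of the simulator-qualification and surrogate-retraining cycle.
# The simulator, surrogate, data, and tolerance are all illustrative.

def simulator(temp_c):
    """Hypothetical mechanistic model output (titer vs temperature)."""
    return 10.0 - 0.5 * (temp_c - 37.0) ** 2

def qualify_simulator(process_data, tol=0.5):
    """Accept the simulator only if it matches plant measurements
    within a tolerance at every sampled operating point."""
    return all(abs(simulator(t) - y) <= tol for t, y in process_data)

def train_surrogate(temps):
    """'Train' a fast lookup surrogate on simulated scenarios, so
    the slow mechanistic model is not needed at control time."""
    table = {t: simulator(t) for t in temps}
    return lambda t: table[min(table, key=lambda k: abs(k - t))]

plant_data = [(36.0, 9.6), (37.0, 10.1), (38.0, 9.4)]  # hypothetical
surrogate = None
if qualify_simulator(plant_data):
    surrogate = train_surrogate([35.0, 36.0, 37.0, 38.0, 39.0])
```

In a real pipeline the surrogate would be a learned regressor and the qualification report would be archived as validation evidence before redeployment.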

Continuous validation
As biomanufacturing processes tend to encounter fluctuations after production starts, validating model outcomes in a continuous way is essential for having a compliant twin. By implementing automated testing of data and models, we can ensure each batch stays within the limits of a golden batch, whose variability is twin-controlled. However, many described twins in the biopharma space lack actual feedback loops to production systems. All events during production must be recorded, the information analysed, and the main findings pushed back to TwinOps in an automated feedback loop.
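The golden-batch check and its feedback loop can be sketched as follows: each batch trajectory is tested against golden limits, and any violations are returned as findings to push back to TwinOps. The parameters, limits, and hourly readings are illustrative assumptions.

```python
# Sketch of an automated golden-batch check producing findings for
# the TwinOps feedback loop. Limits and readings are illustrative.

GOLDEN_LIMITS = {"ph": (6.8, 7.4), "do_percent": (30.0, 60.0)}

def batch_findings(batch_readings):
    """Return out-of-limit events as (hour, parameter, value) tuples."""
    findings = []
    for hour, readings in enumerate(batch_readings):
        for param, (lo, hi) in GOLDEN_LIMITS.items():
            value = readings[param]
            if not (lo <= value <= hi):
                findings.append((hour, param, value))
    return findings

batch = [{"ph": 7.1, "do_percent": 45.0},
         {"ph": 7.2, "do_percent": 28.0},   # DO dips below golden limit
         {"ph": 7.0, "do_percent": 40.0}]
findings = batch_findings(batch)
```

Each finding would be recorded with its batch record and analysed, closing the automated feedback loop described above.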
Testing required for regulatory approval is intensive, and hardcoded data for unit tests might not be sufficient. Testing protocols must include tests by ML models to detect anomalies, as well as tests by humans to demonstrate how well the twin applies human factors engineering throughout. When developing, testing, and manufacturing GxP-regulated products, the most important GxP targets are the integrity, reproducibility, and traceability of data and models. Consequently, implementing continuous validation of GxP controls turns into achieving continuous compliance of digital twins.

Automated twins
In biomanufacturing, the end-to-end process goes from raw materials to the vaccines themselves. This means we are looking at thousands of intertwined models for our ambitious bioprocess digital twin. To make this happen, one must heavily rely on automated modeling approaches that use a mixture of SSL methods [7], hyperparameter tuning, and model searches over defined, repeatable patterns. This way, we can with limited effort run a kind of automated proof-of-concept ("Auto PoC" in Figure 2) to determine whether data quality and volume are sufficient to create a baseline model. This baseline model can then be deployed to the production platform without human interaction, as all validation and reporting steps are automated as well. On the one hand, such a level of automation ensures full compliance with almost no effort. On the other hand, the "Auto PoC" step can be reiterated every time new or more data comes in, to help decide on model retraining. Based on our experience in implementing TwinOps along the ongoing developments of bioprocess digital twins, Table 1 provides an estimation of feasible automation targets by model type from a technology perspective. Increasing twin automation clearly means envisioning a "twin core product" that is more real-world data-centric than based on time-consuming mechanistic models that need subject-matter experts, and on simulated data that does not reflect real-world variance and noise.
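The "Auto PoC" gate can be sketched as a two-step check: verify that data volume and completeness clear a threshold, then run a small automated model search and return a baseline only if one is found. The thresholds, the constant-predictor "model search", and the dataset are illustrative assumptions standing in for the SSL and hyperparameter-tuning machinery.

```python
# Sketch of an "Auto PoC" gate: a data-sufficiency check followed by
# a tiny automated model search. All thresholds and the dataset are
# illustrative; the constant predictor stands in for real AutoML.

def data_sufficient(rows, min_rows=20):
    """Gate on volume and completeness before any modeling effort."""
    return len(rows) >= min_rows and all(None not in r for r in rows)

def auto_poc(rows):
    """Return a baseline model (here: the best constant predictor
    found by grid search) if data passes the gate, else None."""
    if not data_sufficient(rows):
        return None
    ys = [y for _, y in rows]
    candidates = [min(ys) + i * (max(ys) - min(ys)) / 20
                  for i in range(21)]
    best = min(candidates, key=lambda c: sum((y - c) ** 2 for y in ys))
    return lambda x: best

rows = [(t, 5.0 + 0.01 * t) for t in range(30)]   # hypothetical samples
model = auto_poc(rows)   # enough data: a baseline model is returned
```

Rerunning `auto_poc` whenever new data arrives gives exactly the reiteration behaviour described above: an insufficient dataset yields no model, a grown one yields a deployable baseline.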

CONCLUSION AND WAY FORWARD
The novelty of this paper lies in adopting a technology-focused lens to analyse digital twin requirements for the biomanufacturing industry. We believe this shift from the general twin concepts of previous studies to a domain-specific perspective is necessary to break down the complexity of bioprocessing operations.
Future work is needed to further prove the use of automation and DevOps through well-defined use cases along the bioprocess chain, such as fermentation in bioreactors, downstream purification steps, as well as formulation and filling to ensure potency, long shelf lives, and stability of drug products.
In fact, adopting TwinOps practices from day zero when developing bioprocess digital twins for any given use case will enable shaping the "twin core product". In tech terms, the core product aims at gathering domain-specific procedures and templates for continuous GxP compliance that hold across other use cases. The considerations and best practices for compliant production deployment covered by this article, to a certain extent and level of detail, will help move the whole of Pharma 4.0 forward.
Besides building predictive models, we encourage biotech companies to put more effort into identifying technical gaps and unblocking the use of DevOps and automation practices to achieve challenging regulatory targets towards compliant digital twin products.