A Probabilistic Domain-knowledge Framework for Nosocomial Infection Risk Estimation of Communicable Viral Diseases in Healthcare Personnel: A Case Study for COVID-19

Hospital-acquired infections of communicable viral diseases (CVDs) pose a tremendous challenge to healthcare workers globally. Healthcare personnel (HCP) face a persistent risk of hospital-acquired infection and, consequently, higher rates of morbidity and mortality. We propose a domain knowledge-driven infection risk model to quantify the risk for individual HCP and the population-level risk for a healthcare facility. For individual-level risk estimation, a time-variant infection risk model is proposed to capture the transmission dynamics of CVDs. At the population level, the infection risk is estimated using a Bayesian network model constructed from three feature sets: individual-level factors, engineering control factors, and administrative control factors. The sensitivity analyses indicated that the uncertainty in the individual infection risk can be attributed to two variables: the number of close contacts and the viral transmission probability. Model validation was carried out for the transmission probability model, the individual-level risk model, and the population-level risk model using a coronavirus disease (COVID-19) case study. For the first, multivariate logistic regression was applied to cross-sectional data from the UK, yielding an AIC value of 7317.70 and a 10-fold cross-validation accuracy of 78.23%. For the second, we collected laboratory-confirmed COVID-19 cases of HCP in different occupations. The occupation-specific risk evaluation suggested that the highest-risk occupations were registered nurses, medical assistants, and respiratory therapists, with estimated risks of 0.0189, 0.0188, and 0.0176, respectively. To validate the population-level risk model, the infection risk in Texas and California was estimated. The proposed model can inform PPE allocation and safety plans for HCP.


I. INTRODUCTION
Nosocomial infections (i.e., hospital-acquired infections) of communicable viral diseases (CVDs) (e.g., influenza virus, hepatitis A virus, and rotavirus infections) have posed huge challenges to public health organizations and the functioning of healthcare systems globally [1], especially during the Coronavirus disease (COVID-19) outbreak in 2019.
This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health (NIH) under Award Number U54GM128729.
Hospitals have seen an increasing number of CVD outbreaks over the last decade, with negative impacts on morbidity and mortality among patients and healthcare workers [2,3]. Among those affected by nosocomial infections, healthcare personnel (HCP) experience the highest risk [4,5] because of direct or indirect contact with infected patients and virus-contaminated surfaces. Subsequently, these workers may spread the virus to uninfected patients, coworkers, and family members. In addition, containment and preventive measures in hospital settings usually overlook asymptomatic individuals and "super spreader" events [6,7]. Therefore, mitigating and preventing nosocomial infections is an urgent and important task for lowering the risk of HCP contracting CVDs.
Modeling of nosocomial HCP infections in hospitals has been based on mathematical models that qualitatively capture the dynamics of CVDs and the effects of different control measures [8,9] when only limited data are available. One traditional mathematical model of disease spread is the compartmental SEIR (Susceptible-Exposed-Infected-Recovered) model [10]. It divides a population into four compartments or subgroups (susceptible, exposed, infected, and recovered individuals) and employs deterministic ordinary differential equations to model the spread of a CVD. The literature contains many variants of this model (e.g., the SIS, SIRD, MSIR, and MSEIR models). These models treat the population as homogeneous, without interactions between individuals (e.g., patients and HCP); therefore, they fail to capture the individual contact process and the effects of individual risk and protective factors [11]. To overcome the limitations of the classic models, complex systems approaches using cellular automata (CA) theory have been proposed to model location-specific dynamics of susceptible populations and the probabilistic nature of disease transmission [12,13]. The major drawback of CA models is their inability to adequately characterize the spatiotemporal information of individuals' movements and interactions [14]. Agent-based modeling (ABM) was proposed to address the limitations of CA models by accounting for the movement of individual disease carriers and the contact network of people [15]. Although the ABM approach can capture the spread of a CVD in a spatial region (e.g., a hospital) over time and estimate the risk of viral infection, it requires a large amount of information on individuals' movements and incurs a high computational cost. Moreover, individuals' movements are highly restricted in hospital settings, especially for patients who have tested positive for infectious diseases.
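As a point of reference, the deterministic SEIR dynamics described above can be sketched in a few lines of code. The sketch assumes the textbook formulation with a transmission rate beta, an incubation rate sigma, and a recovery rate gamma, integrated with a simple forward-Euler step; these symbols and all numerical values are illustrative assumptions and are not parameters estimated in this study.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, r0, days, dt=0.1):
    """Integrate the textbook SEIR equations with a forward-Euler step."""
    n = s0 + e0 + i0 + r0                      # total (closed) population
    s, e, i, r = float(s0), float(e0), float(i0), float(r0)
    trajectory = [(0.0, s, e, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        new_exposed = beta * s * i / n         # flow S -> E
        new_infectious = sigma * e             # flow E -> I
        new_recovered = gamma * i              # flow I -> R
        s += dt * (-new_exposed)
        e += dt * (new_exposed - new_infectious)
        i += dt * (new_infectious - new_recovered)
        r += dt * new_recovered
        trajectory.append((k * dt, s, e, i, r))
    return trajectory


# Hypothetical parameter values chosen only to make the example run.
trajectory = simulate_seir(beta=0.5, sigma=1 / 5, gamma=1 / 7,
                           s0=990, e0=5, i0=5, r0=0, days=60)
final_time, s_end, e_end, i_end, r_end = trajectory[-1]
```

Because every individual in a compartment is treated identically, the sketch makes the homogeneity limitation noted above concrete: there is no place in these equations for an individual HCP's contacts or protective factors.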
Quantitative models have also been used as an alternative to mathematical models to quantify the effects of protective or risk factors on the infection risk of HCP over time. These models capture the disease transmission dynamics within the hospital, HCP-related risk factors of infection, and other patients and HCP as sources of infection [16]. Here, variables are treated as time-dependent. Two classes of quantitative models, namely measures of association and statistical survival analysis, have been proposed to estimate HCP infection risk. Measure-of-association approaches quantify the relationship between the exposed and diseased HCP groups by using the adjusted odds ratio (aOR), risk difference (RD), and relative risk (RR) as the risk measures [5,17-20]. To capture the changes in HCP characteristics and infection risk over time, survival analysis models are used to estimate the HCP infection risk and the expected time until a viral infection occurs [21,22]. Although time-dependent variables have been considered in the survival analysis models, the stochastic nature of epidemiological dynamics and individual interactions has not been investigated. Estimating the risk of nosocomial infection for HCP is important for answering epidemiological questions in hospital settings and for informing PPE allocation, safety plans for HCP, and staffing strategies.
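The measures of association mentioned above can be illustrated with a small calculation on a 2x2 exposure-by-outcome table. The sketch below computes the crude relative risk, risk difference, and odds ratio; it does not compute an adjusted odds ratio, which would require a regression model with covariates. The function name and the counts are hypothetical and are not taken from any of the cited studies.

```python
def association_measures(exposed_cases, exposed_total,
                         unexposed_cases, unexposed_total):
    """Crude RR, RD, and OR from a 2x2 exposure-by-outcome table."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    odds_exposed = exposed_cases / (exposed_total - exposed_cases)
    odds_unexposed = unexposed_cases / (unexposed_total - unexposed_cases)
    return {
        "RR": risk_exposed / risk_unexposed,   # relative risk
        "RD": risk_exposed - risk_unexposed,   # risk difference
        "OR": odds_exposed / odds_unexposed,   # crude (unadjusted) odds ratio
    }


# Hypothetical counts: 30 of 100 exposed HCP infected vs. 10 of 100 unexposed.
measures = association_measures(30, 100, 10, 100)
```

These measures summarize a single static comparison, which is why the survival analysis models discussed next are needed when risk changes over time.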
To address the above research gaps, this paper proposes a probabilistic domain-knowledge model of the infection risk of CVDs for HCP. The proposed model is formulated for infection risk estimation at both the individual and population levels with respect to three modes of transmission: 1) direct contact of susceptible HCP with infectious individuals, including patients and coworkers; 2) airborne viruses; and 3) contaminated equipment and surfaces. The individual-level risk model is built on the population grouping of the SEIR model, with consideration of time-varying confounders, to capture the dynamic transmission mechanism of contagious diseases. At the population level, three subsets of features, which are introduced in Sub-section II.B, are constructed and represented by a Bayesian network [23], from which the probability of transmission from patients to HCP is estimated. The main contributions of this paper are 1) a novel time-variant infection risk analysis model to characterize the dynamics of the disease exposure risk in HCP over time and 2) an individual-specific and domain-knowledge-driven infection risk model to quantify the complexities of HCP's infection risk. The remainder of the manuscript is organized as follows: Section II elaborates the proposed model, its formulation, and its validation; Section III presents the results, including the sensitivity analysis and the COVID-19 case study; and Sections IV and V provide the discussion and conclusions.

II. METHODOLOGIES
The proposed framework consists of two sub-models: (1) an individual-level infection risk model that quantifies the risk of infection of an HCP, and (2) a population-level infection risk indicator model that estimates the infection risk under working conditions at a medical facility. The output of the first sub-model serves as an input for the estimation of the population infection risk in the second sub-model. Other inputs, such as engineering control and administrative control factors, are also considered in the estimation of the population risk.

A. Individual-level infection risk model
The individual infection risk model aims to quantify the potential risk of infection for a healthcare worker whose job functions require working in proximity to patients and who is therefore subject to nosocomial infection. The proposed individual-level infection risk model is formulated using the population grouping approach of the compartmental SEIR model [10], in which the population is divided into compartments: Susceptible ($S$), Exposed ($E$), Infectious ($I$), and Recovered ($R$). However, susceptible ($S$) and recovered ($R$) individuals cannot transmit the virus during the length of a hospital stay; hence, we do not consider these compartments in our model. Moreover, we do not assume that recovery confers immunity to reinfection on patients released from isolation. HCP coworkers have also been shown to contribute significantly to virus spread within the healthcare setting once they contract a virus [21,24]. To capture the virus transmission mechanism, a healthcare worker group is added to model HCP-to-HCP transmission, and the infectious individuals are further classified into two sub-groups: the infection-confirmed group and the infection-suspected group. Infection-confirmed individuals are those who have lab-confirmed infections (e.g., individuals who have tested positive for COVID-19 by a polymerase chain reaction (PCR) test), and the infection-suspected group includes individuals who are suspected of having the infection because they developed symptoms but have never been tested for the disease. In total, four groups (exposed, infection-confirmed, infection-suspected, and healthcare worker) are considered to model the individual HCP infection risk. We denote the potential infection risk of an HCP at a location (e.g., a hospital) over a time period [...]. When [...] is constant, the viral transmission mechanism is modelled as a binomial process, with [...] binomial processes in total. The sequence of contacts of an HCP, ordered by time, is superscripted by a person index and a compartment index as follows: [...]
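To make the binomial-process idea concrete, the following minimal sketch computes the probability that an HCP is infected by at least one of a series of contacts. It is not a reproduction of the model's equations: it assumes that contacts are independent, that each compartment has a constant per-contact transmission probability, and that the risk is the complement of escaping infection in every contact. The function name, the compartment labels, and all numbers are hypothetical.

```python
def individual_risk(contacts):
    """Probability of at least one transmission over independent contacts.

    contacts maps a compartment label to a pair
    (number_of_contacts, per_contact_transmission_probability).
    """
    prob_no_infection = 1.0
    for n_contacts, p_transmit in contacts.values():
        if not 0.0 <= p_transmit <= 1.0:
            raise ValueError("transmission probability must lie in [0, 1]")
        if n_contacts < 0:
            raise ValueError("number of contacts must be non-negative")
        # Escaping infection in every contact with this compartment.
        prob_no_infection *= (1.0 - p_transmit) ** n_contacts
    return 1.0 - prob_no_infection


# Hypothetical contact counts and probabilities for the four groups.
example_contacts = {
    "exposed": (2, 0.002),
    "infection_confirmed": (5, 0.010),
    "infection_suspected": (3, 0.005),
    "healthcare_worker": (4, 0.001),
}
example_risk = individual_risk(example_contacts)
```

Under these assumptions the risk grows with both the number of contacts and the per-contact probability, the two inputs that the sensitivity analysis later singles out.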

B. Population risk indicator model
The population risk indicator quantifies the potential viral infection risk associated with a hospital or clinic over the time period $[t_1 : t_2]$. The population risk is interpreted as the probability that an HCP contracts the disease under working conditions at a given place, given the information about the individual-level infection risk of all HCP at that place and the external factors. At this level, external factors from engineering and administrative controls within the hospital are considered.
These are the factors that affect the population-level infection risk apart from the individual-level risk. Representative examples of engineering controls are high-efficiency air filtration, ventilation rates at the workplace, and infection isolation rooms for aerosol-generating procedures. Administrative controls include formal HCP training on personal protective equipment (PPE), training on risk factors, and resources to promote personal hygiene. The population risk is computed using a logistic function [...], whose arguments are the vector of individual infection risk estimates for all HCP at the facility, a scaling parameter, and the vector of engineering control and administrative control factors. An abbreviated notation is used for the function in (9). This function can be simply formulated as a linear regression model [...] with its own model parameters. Alternatively, the population risk can be estimated using a Bayesian network when domain knowledge describing the relationships between the control factors and the infection risk at the population and individual levels is available. Here, the Bayesian network model [26] is employed to incorporate the domain knowledge that influences the virus spread. The network is formulated from three subsets of factors identified in the literature as affecting the risk of infection: 1) individual-level factors, 2) engineering control factors, and 3) administrative control factors (see Fig. 1). Individual-level factors include patient characteristics (e.g., time from exposure to symptom onset, clinical severity of patients), HCP-dependent factors (e.g., PPE sufficiency level, close contacts with patients, exposure level to infection, working hours per week), and intervention-related risks (e.g., endotracheal intubation, high-flow nasal cannula (HFNC)).
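The logistic formulation of the population risk can be illustrated with a short sketch. Because the equation itself is not reproduced here, the code is an illustration under stated assumptions rather than the model's exact formula: it assumes that the individual risk estimates enter through their mean, that the inner function is a linear combination with an intercept, and that a scaling parameter multiplies the linear score before the logistic transform. All names, weights, and values are hypothetical.

```python
import math


def population_risk(individual_risks, control_factors, weights,
                    intercept=0.0, scale=1.0):
    """Logistic transform of a linear score built from risks and controls."""
    mean_risk = sum(individual_risks) / len(individual_risks)
    features = [mean_risk] + list(control_factors)
    if len(weights) != len(features):
        raise ValueError("one weight is needed per feature")
    linear_score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-scale * linear_score))


# Hypothetical facility: three HCP risk estimates and two control factors
# (e.g., a ventilation score and a PPE-training score, both in [0, 1]).
facility_risk = population_risk(
    individual_risks=[0.019, 0.018, 0.012],
    control_factors=[0.8, 0.6],
    weights=[50.0, -2.0, -1.5],   # assumed signs: controls lower the risk
    intercept=-3.0,
)
```

The logistic transform guarantees an output between 0 and 1, so the result can be read as a probability regardless of the scale of the inputs.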
External factors consist of engineering control factors (e.g., high-efficiency air filtration, ventilation rates, airborne infection isolation rooms) and administrative control factors (e.g., formal HCP training on PPE and disease risk factors, resources to promote personal hygiene). The individual-level, engineering control, and administrative control factors are each denoted by their own symbol [...]. Hence, using the chain rule of the Bayesian network [27], the risk is estimated as [...], where the expression involves a probability function and an indicator variable [...].
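The chain rule of a Bayesian network factorizes the joint distribution into the product of each node's probability conditioned on its parents. The sketch below applies this to a deliberately small network with three binary parent nodes (one per factor subset) and one infection node, and marginalizes over the parents. The network structure, the independence of the parent nodes, the node names, and every probability value are assumptions made for illustration; they are not the network of Fig. 1.

```python
from itertools import product


def marginal_infection_probability(priors, cpt):
    """Marginalize P(infection) over independent binary parent nodes.

    priors: dict mapping parent name -> P(parent = 1).
    cpt:    dict mapping a tuple of parent states -> P(infection | parents).
    """
    names = list(priors)
    total = 0.0
    for states in product((0, 1), repeat=len(names)):
        # Chain rule: P(infection | parents) times the product of parent priors.
        parent_probability = 1.0
        for name, state in zip(names, states):
            parent_probability *= priors[name] if state == 1 else 1.0 - priors[name]
        total += cpt[states] * parent_probability
    return total


# Hypothetical binary nodes; state 1 denotes an unfavourable condition.
priors = {"individual": 0.3, "engineering": 0.2, "administrative": 0.4}
# Hypothetical conditional table: risk rises with each unfavourable parent.
cpt = {states: 0.01 + 0.02 * sum(states)
       for states in product((0, 1), repeat=3)}
population_risk_estimate = marginal_infection_probability(priors, cpt)
```

A practical advantage of this representation is that the conditional tables can be set from domain knowledge when data are scarce and revised as data accumulate.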

III. RESULTS

A. Sensitivity analysis using simulated data
Variance-based sensitivity analysis was used to investigate the uncertainty in the HCP's potential infection risk caused by the variance of the input variables. [...] for each contact raised to a higher value; hence, the probabilities collectively contributed to the value of the risk.
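A variance-based sensitivity analysis of this kind can be sketched with a first-order sensitivity index, i.e., the share of the output variance explained by one input alone, Var(E[Y | X_i]) / Var(Y). The example below estimates these indices with a simple double-loop Monte Carlo scheme. It assumes a stand-in risk function (the probability of at least one transmission over independent contacts) and hypothetical input ranges; neither the function nor the ranges are taken from the simulated data used in this study.

```python
import random
from statistics import fmean, pvariance


def risk(n_contacts, p_transmit):
    """Stand-in risk: at least one transmission in n independent contacts."""
    return 1.0 - (1.0 - p_transmit) ** n_contacts


def sample_contacts(rng):
    return rng.randint(1, 20)          # hypothetical range of close contacts


def sample_probability(rng):
    return rng.uniform(0.0, 0.05)      # hypothetical transmission probability


def first_order_indices(n_outer=500, n_inner=200, seed=0):
    """Estimate Var(E[Y | X_i]) / Var(Y) for each input by double-loop sampling."""
    rng = random.Random(seed)
    total_variance = pvariance(
        [risk(sample_contacts(rng), sample_probability(rng))
         for _ in range(n_outer * 20)]
    )
    # Fix the number of contacts, average the risk over the probability.
    means_given_contacts = []
    for _ in range(n_outer):
        n = sample_contacts(rng)
        means_given_contacts.append(
            fmean(risk(n, sample_probability(rng)) for _ in range(n_inner)))
    # Fix the probability, average the risk over the number of contacts.
    means_given_probability = []
    for _ in range(n_outer):
        p = sample_probability(rng)
        means_given_probability.append(
            fmean(risk(sample_contacts(rng), p) for _ in range(n_inner)))
    return {
        "contacts": pvariance(means_given_contacts) / total_variance,
        "transmission_probability": pvariance(means_given_probability) / total_variance,
    }


indices = first_order_indices()
```

The double-loop estimator is easy to read but sample-hungry; dedicated estimators are usually preferred when the model is expensive to evaluate.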

B. Model validation using COVID-19 case study
Data sets of HCP with COVID-19 were used to validate the proposed model. Access to these data sources can be provided upon request or obtained via the cited references. The validation was performed on three main components: the viral transmission probability model, the individual-level infection risk model, and the population-level risk model. The occupational risk of COVID-19 infection for HCP, the interim guidance on risk assessment and the universal PPE policy issued by the CDC [41], and the risk factors for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) transmission in hospital settings identified in previous studies were also used to develop the model for the case study.

1) Contributing factors associated with nosocomial COVID-19 infection in healthcare workers
The major factors resulting in high risk for HCP are 1) exposure to COVID-19 patients without appropriate PPE, 2) involvement in aerosol-generating procedures and other interventions performed by physicians or nurses, and 3) contact with patients and colleagues during the incubation period. Many studies have suggested that there is a significant association between PPE use and infection risk and that masks are the measure most consistently associated with reduced risk. A similar association was observed for other PPE, such as gowns, gloves, and eye protection. Other exposures and treatment practices (e.g., involvement in intubation, patient care, or contact with secretions) were found to be associated with increased infection risk for HCP [28,29]. Finally, even under a universal PPE policy, a high risk of infection among HCP also arises from contact with asymptomatic patients and colleagues who are in the early phase of viral infection [19]. The risk factors for SARS-CoV-2 transmission in hospital settings identified by previous studies [17,23] were also included to develop the model.

2) Related work of HCP infection risk for COVID-19
Different regression models, including logistic, log-binomial, and Poisson regression, have been used with the defined risk measures to estimate the viral infection risk among HCP groups [18-20,30-37]. Statistical survival analysis models have also been used to estimate the HCP's risk of contracting SARS-CoV-2 and the expected time until viral infection occurs. Shah et al. [22] modeled hospital admission of healthcare workers with COVID-19 using Cox regression and conditional logistic regression. Nguyen et al. [21] assessed the COVID-19 infection risk among healthcare workers relative to the general community by examining the effect of PPE on risk. They also used the Cox proportional hazards model to calculate multivariate-adjusted hazard ratios (HRs) of a positive test. However, the major limitations of these models are that 1) individual-specific characteristics, e.g., occupation, type of PPE used, experience level, and duration of exposure to COVID-19 patients, are not considered [21,22], and 2) the simple formalism of the models, without time-varying stochastic transmission, oversimplifies the complex contagion mechanism of SARS-CoV-2.

3) Data description
Data collected from multiple sources (e.g., COVID-19 transmission databases, health surveys/questionnaires, U.S. Department of Labor databases, and a cross-sectional study of UK-based healthcare workers) are summarized in Table I.

4) Model variable selection
Variables from recent findings on SARS-CoV-2, as introduced in Sub-section III.B.1, were used to select the features. For the viral transmission probability model, we included the covariates [...], which are significant factors suggested by the original cross-sectional study [45]. To validate the individual-level infection risk model, the U.S. Department of Labor O*Net database was employed to quantify the risk score for healthcare-related occupations, taking into account virus exposure time and duration and the working environment. For the population-level risk model, the PPE sufficiency level, the regional infection risk, and the hospitalization data of HCP were selected to estimate the population-level infection risk in California and Texas medical centers [40,41] and to implement a surrogate method for model validation. The description of these variables is summarized in Table 1S in the Supplementary material.

5) Model validation of viral transmission probability estimation using multivariate logistic regression
To validate the logistic regression introduced in Sub-section II.A, we considered different protective and risk factors for COVID-19 in the data set of UK-based healthcare workers [45] and modelled the association between these covariates and the COVID-19 infection status using multivariable logistic regression. The data set provides 6,263 responses, in which a composite outcome was present in 1,806 (29.4%) HCP, of whom 49 (0.8%) were admitted to hospital, 459 (7.5%) tested positive for SARS-CoV-2, and 1,776 (28.9%) self-isolated. The covariates included in the model are reported in Sub-section III.B.4. The estimated coefficients and their significance are shown in Table II. The model goodness-of-fit was further assessed by the Akaike information criterion (AIC) and 10-fold cross-validation. The AIC value for the above model was 7317.70, compared with 7449.75 for the null model. The 10-fold cross-validation accuracy was 78.23%, which indicates relatively good performance on held-out data.
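The two goodness-of-fit checks used here can be sketched in a few lines. The AIC is 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood, and a lower value indicates a better trade-off between fit and complexity; the reported values place the fitted model 132.05 AIC units below the null model. The cross-validation helper below is a generic illustration: the majority-label classifier and the synthetic labels are hypothetical stand-ins, and a real analysis would plug in the fitted logistic regression and the study data instead.

```python
import random


def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_parameters - 2 * log_likelihood


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k shuffled, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_accuracy(features, labels, fit, predict, k=10, seed=0):
    """Average the per-fold accuracy of a classifier over k folds."""
    fold_accuracies = []
    for train, test in k_fold_indices(len(labels), k, seed):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in test)
        fold_accuracies.append(correct / len(test))
    return sum(fold_accuracies) / len(fold_accuracies)


# AIC values reported in the text: null model minus fitted model.
delta_aic = 7449.75 - 7317.70


# Hypothetical stand-in classifier that always predicts the majority label.
def fit_majority(train_features, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, feature_row):
    return model


toy_labels = [1] * 80 + [0] * 20          # synthetic labels, not study data
toy_features = [None] * len(toy_labels)   # unused by the stand-in classifier
toy_accuracy = cross_validated_accuracy(toy_features, toy_labels,
                                        fit_majority, predict_majority)
```

The stand-in also shows why accuracy should be read against the class balance: a classifier that ignores the covariates already scores the majority-class proportion.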

6) Model validation of the individual-level infection risk
To validate the infection risk model at the individual level, six occupations were considered using the U.S. Department of Labor O*Net database [42]. We also introduced a new variable, the occupation-specific risk score, to account for the differences in infection risk among occupations. The score was computed as [...], where the expression uses the maximum working hours per week among the six occupations and a scaling parameter. The variables entering the score are described in Table 1S in the Supplementary material. Because of the limited longitudinal data, our strategy was to validate the individual infection risk model using hypothesized scenarios of different occupational settings. In particular, we made four main assumptions: 1) the individual risk is the same for every individual working under the same conditions (e.g., same occupation); 2) all patients are confirmed cases, i.e., there is only one compartment; 3) the probabilities of viral transmission from all patients are the same for each occupation; and 4) the estimated probability of viral transmission from confirmed infectious patients over $[t_1 : t_2]$ is equal to the occupation-specific risk score divided by the maximum score among the six occupations, which guarantees that the estimate lies between 0 and 1.

Consequently, (3) is reduced to: [...]
Lastly, the total number of contacts was fixed at 5, and the value of [...] was set to 20. Next, the risk was estimated using (11), and the results are summarized in Table III. The results of the individual-level model indicated a strong positive association between the estimated risk [...]

7) Model validation of the population-level infection risk
The population-level infection risk was validated based on the total number of confirmed COVID-19 cases among HCP reported to the CDC. The number of positive COVID-19 cases among HCP in the US up to April 9, 2020, is presented in Fig. 4. According to Fig. 4, there was a strong association between the number of positive cases among non-HCP and the number of cases among HCP by date of symptom onset. In addition, the risk of infection among HCP was closely related to the total number of positive tests among HCP and the patient loads that HCP needed to handle. At the population level, we used the following selected features: [...]. These features are described in Table 1S in the Supplementary material. Based on (8), the population-level risk estimation was reduced to a regression equation with equal weights assigned to each variable: [...]. The population-level infection risk model was validated using COVID-19 data from health centers in Texas and California and other relevant sources, as presented in Sub-section III.B.3 and Table I. The accessible HCP COVID-19 data for Texas and California were the PPE sufficiency level, the total number of hospitalizations, and the percentage of ICU beds available. We therefore assumed that the distributions of the other variables, and the expected value of the risk function over them, were the same for both states. The expected value of the risk function was computed using (13) (see Table IV).

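[Table IV. Columns: Features, Texas, California. Only the first feature row, "Time from symptom onset to hospitalization", survives; the remaining rows and all values are not recoverable.]
The distributions of [...] and [...] are estimated from [38,39]. The estimated risk values for Texas and California were 0.0084 and 0.0132, respectively.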

IV. DISCUSSION
In our sensitivity analysis, we focused on only two key variables: the viral transmission probability and the number of close contacts between HCP and patients. Specifically, the sensitivity of the infection risk to these input variables was measured by the amount of variance caused by changing the inputs. We divided our analysis into two parts: 1) the measure of sensitivity of [...]. Surprisingly, advanced age, being a smoker or ex-smoker within one year, and having regular exposure to aerosol-generating procedures performed on COVID-19 patients decreased the infection risk. These results seem counter-intuitive at first, but they likely reflect confounding: it was shown that HCP working directly with suspected or confirmed COVID-19 patients tended to be more cautious and self-aware in clinical environments [46]. They therefore had sufficient self-protection and took containment measures, whereas healthcare workers in departments not dedicated to communicable viral diseases, who were potentially exposed to contagious viruses, did not have sufficient training on how to use PPE and deal with infectious diseases and lacked access to PPE and isolation equipment [47]. However, the model has several limitations. First, because we did not have access to information on HCP contact with patients and coworkers, we treated the estimated viral transmission probability as a measure averaged over all individuals. Second, the data were gathered using surveys and questionnaires, which are subject to selection and recall bias. Third, the use of a composite outcome (including HCP with COVID-19 symptoms, HCP exposed to risk factors, and lab-confirmed HCP infections) may have resulted in overestimation or underestimation of the infection risk.
We validated the individual-level infection risk model by implementing it with the two-parameter regression equation and estimating the individual risk for six occupations. The results depend heavily on the pre-defined parameters, which can be estimated in healthcare settings when data are available. It has been shown that healthcare workers and nurses are frequently in close contact with COVID-19 patients, which increases their risk of acquiring the SARS-CoV-2 virus [48]. Because HCP can acquire infection through various pathways apart from direct patient care, such as exposure to colleagues, family members, or people in the community, the time-varying risk estimation in the model can inform decisions on screening HCP for COVID-19 before workplace entry. The individual risk model can be made more specific to better capture the transmission dynamics, e.g., by incorporating the quantification of indoor airborne infection risks using a probabilistic framework [49].
For model validation at the population level, we considered two case studies to estimate the risk of infection of HCP in Texas and California. Both states have a high number of lab-confirmed SARS-CoV-2 patients. The average numbers of hospitalizations in Texas and California were 16,843 cases/day and 4,219 cases/day, respectively. However, the infection risk in Texas was 0.0084, which was lower than the risk in California (0.0132). This was mainly due to the differences in patient load per HCP per day and in the two states' PPE sufficiency levels. From Table IV, the average PPE sufficiency level in California was only 0.744, as opposed to 0.9355 in Texas, and the average percentage of ICU beds available per 100,000 people in Texas was significantly higher than that in California, which implies heavier patient loads in California. The model also relies on some important assumptions: 1) close contacts with COVID-19 patients are independent and there is no viral transmission among HCP, and 2) protective/risk factors are well-defined and sufficient to estimate the risk of infection.

V. CONCLUSION AND FUTURE WORK
This paper proposed a time-variant infection risk analysis model to characterize the dynamics of the disease infection risk in HCP over time, together with an individual-specific and domain-knowledge-driven infection risk model to quantify the complexities of HCP's risk of CVDs in healthcare settings. The infection risk for HCP was estimated at both the individual and population levels. The individual-level risk model was built on the population grouping concept of the well-established epidemiological SEIR model, with consideration of time-varying confounders, to capture the dynamic transmission mechanism of contagious diseases. At the population level, three subsets of features were constructed and represented by a Bayesian network, from which the probability of viral transmission from patients to HCP was estimated. To validate our methods for the COVID-19 case study, we incorporated data from multiple sources in the US, the UK, and Taiwan, which contain information about potential factors that affect the COVID-19 transmission mechanism, together with domain knowledge of similar contagious diseases such as SARS and MERS from relevant studies. Because the individual-level model is occupation-specific and individualized, it can capture how the infection risk varies over time under the individual time-varying confounders, and it can also account for the intrinsically stochastic transmission mechanisms. At the population level, the Bayesian network formalism can accommodate limited-data scenarios and can update its parameters as more data become available. The results of the two case studies are interpretable at the population level: they showed that the infection risk in California is higher than in Texas because of heavier patient loads and a shortage of PPE.
Our model addresses a major limitation of the CDC's interim guideline for risk assessment, namely that it is inadequate for quantifying the infection risk of an individual HCP. The model could support PPE allocation and safety plans for HCP and enhance crisis-level staffing strategies in facilities with staffing shortages. Longitudinal study designs are required to collect more COVID-19 data among HCP to validate the proposed model properly. Future work will involve 1) validating the model assumptions when sufficient data are available, 2) modifying and reformulating the model if the assumptions are violated (e.g., the independence assumption, or the emergence of a newly vaccinated population), and 3) validating the model with other related case studies of communicable viral diseases.