Federated-PCA on Vertical-Partitioned Data

In the cross-silo federated learning setting, one kind of data partition is by features, the so-called vertical federated learning (i.e. feature-wise federated learning) [23], which applies to multiple datasets that share the same sample ID space but have different feature spaces. An image dataset can likewise be partitioned by labels. To improve the model performance of the isolated parties based on the feature-wise (i.e. label-wise) results, the most effective approach is to federate the model results of the isolated parties. However, it is a non-trivial task to let the participating parties share their model results without violating their data privacy. In this paper, within the framework of principal component analysis (PCA), we propose a Federated-PCA machine learning approach, in which PCA reduces the dimensionality of the sample data of all parties and extracts the principal component features to improve the efficiency of subsequent training. This process does not reveal the original data of any party. The federated system helps all parties build a common profit strategy, and under this mechanism the identity and status of every party are the same. Comparing the federated results of the isolated parties with the result of the unseparated party over multiple sets of experiments, we find that the results of these two settings are close, and that the proposed method can effectively improve the trained model performance of most participating parties.


Introduction
Federated learning (FL) was first proposed by Google in 2016 [12]. Its main idea is to build machine-learning models based on datasets that are distributed across multiple devices while preventing data leakage, learning a shared model by aggregating locally computed updates via a central coordinating server. There is growing interest in applying FL to other applications, including those that involve only a few reliable clients or multiple organizations collaborating to train models. In essence, FL is a machine learning setting where multiple entities (clients) collaborate in solving a machine learning problem under the coordination of a central server or service provider. Each client's raw data is stored locally and is not exchanged or transferred; instead, focused updates intended for immediate aggregation are used to achieve the learning objective [11].
In general, there are two FL settings, i.e. cross-device and cross-silo. In the former, the data is assumed to be partitioned by samples, i.e. horizontal FL [23]. By contrast, in the cross-silo setting, in addition to partitioning by samples, partitioning by features is of practical relevance [11]; this is also called vertical FL (VFL) [23] or, interchangeably, feature-wise FL. In particular, image datasets can also be partitioned by labels. This setting applies to cases where two datasets share the same sample ID space but differ in feature space.
The cross-silo setting is relevant when several companies or organizations share incentives to train a model on all of their data but cannot share the data directly. This may be due to confidentiality constraints, or to legal constraints even within a single company that cannot centralize its data across different geographical regions. For example, consider two different companies in the same city, one a bank and the other an e-commerce company. Their user sets are likely to contain most of the residents of the area, so the intersection of their user spaces is large. However, since the bank records the users' revenue, expenditure behavior, and credit rating, while the e-commerce company retains the users' browsing and purchasing history, their feature spaces are very different. Suppose both parties need a prediction model for product purchases based on user and product information [23]; this is a typical VFL scenario. Other cases worth considering include medical centers that have a large user intersection but hold different types of personal medical image information, and shopping malls (online or offline) that have a large user intersection but retain different shopping records. In this paper, we focus on the cross-silo setting with data partitioned by features (i.e. vertical-partitioned data), under which we explore the learning of principal component analysis (PCA), a fundamental and very useful machine learning model. The basic idea of PCA is to reduce the dimensionality of a dataset containing a large number of interrelated variables while retaining as much as possible of the variation present in the data. This reduction is achieved by transforming the original variables into a new set of variables, the principal components, which are uncorrelated and ordered so that the first few retain most of the data variation [16].
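For illustration, this dimensionality reduction can be sketched in a few lines via eigendecomposition of the covariance matrix; a minimal sketch, with all names and data illustrative:

```python
import numpy as np

def pca(X, k):
    # center the features, eigendecompose the covariance matrix,
    # and keep the k eigenvectors with the largest eigenvalues
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    V = eigvecs[:, order]                   # p x k projection matrix
    return Xc @ V, eigvals[order], V

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # 100 samples, 5 features
Z, lam, V = pca(X, 2)
print(Z.shape)                              # (100, 2): samples in the new basis
```

The columns of V are the uncorrelated, variance-ordered directions described above; projecting onto them compresses each sample from 5 to 2 dimensions.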
This paper considers how all parties can use the model parameters of the other parties to improve their model performance without revealing their raw datasets. Focusing on vertical-partitioned data, we propose an approach, named Federated-PCA, which can significantly improve the models of all parties and the final results for all labels in an image dataset. Our main contributions are: (1) proposing a federated PCA learning method that performs feature-wise (i.e. label-wise) feature extraction and data compression on vertically partitioned data and then obtains the joint PCA results of all parties; (2) building two model protocols: a central collaborative server and a fully decentralized (i.e. peer-to-peer) model framework.
The rest of the paper is organized as follows. Section 2 gives an overview of related work. Section 3 introduces the framework of the proposed Federated-PCA machine learning. Section 4 shows the experimental results to demonstrate the effectiveness of the proposed approach. Finally, we give concluding remarks in Section 5.

Related Work
Most of the existing work is based on cross-device federated learning. In the cross-device setting, the data is assumed to be partitioned by examples. Different from cross-device federated learning, in the cross-silo setting, the data is assumed to be partitioned by samples and features. In particular, for the case of cross-silo FL with data partitioned by features, it may or may not involve a central server as a neutral party, and clients would exchange specific intermediate results rather than model parameters to assist in calculating the other parties' gradients [21]. In this setting, the application of techniques such as Secure Multiparty Computation or Homomorphic Encryption has been presented to limit the amount of information other participants can infer from observing the training process. The downside of this approach is that the training algorithm is typically dependent on the type of machine learning objective being pursued. Currently, the existing algorithms include trees [5], linear and logistic regression [23,10], and neural networks [14].
Privacy preservation is one of the most important problems in federated learning. The existing solutions in privacy-preserving learning are mainly based on Secure Multi-party Computation (SMC), Differential Privacy, and Homomorphic Encryption. The privacy definition of federated learning can be classified into two categories: global privacy and local privacy [13]. Global privacy requires that the model updates generated at each round be private to all untrusted third parties other than the central server, while local privacy further requires that the updates be private to the server as well. Privacy-preserving machine learning algorithms have been proposed for vertically partitioned data, including cooperative statistical analysis [7], association rule mining [20], secure linear regression [20,18], classification [7], and gradient descent [21]. Recently, [10,17] proposed a vertical federated learning scheme to train a privacy-preserving logistic regression model. The authors studied the effect of entity resolution on the learning performance and applied a Taylor approximation to the loss and gradient functions so that homomorphic encryption can be adopted for privacy-preserving computations.
Current works that aim to improve the privacy of federated learning typically build upon classical cryptographic protocols such as SMC [4,9] and differential privacy [1,3,8,15]. The SMC protocol is introduced to protect individual model updates [4]: the central server cannot see any local update but can still observe the exact aggregated results at each round. SMC is a lossless method and can retain the original accuracy with a very high privacy guarantee; however, the resulting method incurs significant extra communication cost. Other works [8,15] apply differential privacy to federated learning and offer global differential privacy. These approaches have several hyperparameters that should be chosen carefully because of their impact on communication and accuracy, although [19] has presented adaptive gradient clipping strategies to help alleviate this issue. Besides, differential privacy can be combined with model compression techniques to reduce communication and obtain privacy benefits simultaneously [1].
In contrast with the existing work, we focus on principal component extraction and privacy protection through Federated-PCA learning. Our approach shares only intermediate parameters without directly sharing any party's original data. The detailed privacy-preserving PCA framework is presented in the next section.

The Proposed PCA on Privacy-Preserving Vertical-Partitioned Data

Proposed Method: Federated-PCA
FL aims at achieving a common profit strategy for all parties by sharing parameters. To this end, one feasible idea is to average the parameters shared by all parties. Following this idea, we first try to directly average the eigenvectors shared by all parties; a plausible justification is that such averaging can improve the overall performance of the model. Alternatively, after considering the relationship between eigenvalues and eigenvectors, we propose that the party with the largest eigenvalue weight among all participants should have a more direct effect on the final federated result. Based on these ideas, the FL can be achieved in two ways. The first is to add a trusted third-party coordination agent that helps all parties complete the model training more efficiently; this coordinator cannot access the data of any party, which is important for privacy protection. The other way, applicable when the number of participating parties is small (as in the two-party case) or the third party is not trustworthy, is a fully decentralized PCA learning method that builds a jointly constructed model, provided that all parties involved are honest. Accordingly, we set up two Federated-PCA model protocols: a central collaborative server (Mode 1) and a fully decentralized (peer-to-peer) learning model (Mode 2) for these two ways, respectively.
Mode 1: Each party sends its locally computed principal component parameters to the central collaborative server, which cannot access the raw data of any party. After the new federated parameters are computed, this federated parameter set is returned to all parties, and each party then calculates locally based on this result. The framework of this model is shown in Figure 1.
The steps of this method are as follows:
- Step 1 (local training): each party performs PCA locally: it first obtains the covariance matrix S and, based on S, computes the corresponding eigenvalues λ_i and eigenvectors v_i, i ∈ {1, ..., k}, keeping the k largest eigenvalue/eigenvector pairs;
- Step 2 (model integration): the central server receives the eigenvalues (λ_i)_k, i ∈ {1, ..., k}, of each party, calculates the weight W_i that each party occupies in the federated model, combines the received eigenvectors into the shared projection vector V accordingly, and broadcasts V to all parties.
The advantage of this model is that it is relatively efficient; the disadvantage is that it relies on the help of a third-party agency.
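The two steps of Mode 1 can be sketched as follows. This is a minimal simulation, not the full protocol: it assumes all parties hold the same feature space (as in the label-wise image setting), the weights W_i are taken proportional to each party's selected eigenvalues as described above, and all names and data are illustrative. Sign alignment and re-orthonormalization are practical additions, since eigenvectors are only defined up to sign.

```python
import numpy as np

def local_pca(X, k):
    # Step 1 (party side): top-k eigenpairs of the local covariance matrix
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]

def federate(shares):
    # Step 2 (server side): eigenvalue-weighted combination of eigenvectors
    lam0, v0 = shares[0]
    totals = np.array([lam.sum() for lam, _ in shares])
    W = totals / totals.sum()                 # weight of each party
    V = np.zeros_like(v0)
    for w, (_, v) in zip(W, shares):
        # flip columns whose sign disagrees with the first party's basis
        signs = np.sign(np.sum(v * v0, axis=0))
        signs[signs == 0] = 1.0
        V += w * v * signs
    Q, _ = np.linalg.qr(V)                    # re-orthonormalize the shared basis
    return W, Q

rng = np.random.default_rng(1)
# two parties holding the same 4-dimensional feature space, k = 2 (illustrative)
shares = [local_pca(rng.normal(size=(50, 4)), 2) for _ in range(2)]
W, V = federate(shares)                       # V is then broadcast to all parties
```

Only (λ_i, v_i) cross party boundaries here; the raw data matrices never leave the parties, which mirrors the privacy property claimed for the protocol.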
Mode 2: All parties share only their principal component parameters with each other and compute the results locally. The framework of this model is shown in Figure 2. The steps of this method are as follows:
- Step 1 (local training): as in Mode 1, each party performs PCA locally: it obtains the covariance matrix S, computes the corresponding eigenvalues λ_i and eigenvectors v_i, i ∈ {1, ..., k}, and keeps the k largest eigenvalue/eigenvector pairs;
- Step 2 (peer-to-peer integration): each party sends its selected eigenvalue/eigenvector pairs to the other parties, and each party then computes the weights and the shared projection vector locally from all received parameters.
This model circumvents the need for a third-party agency, which further reduces external risks for the participants, especially in two-party scenarios. However, every party has to compute the federated principal components independently, so the computing efficiency is greatly affected.

Algorithm
We propose a standardized representation of Federated-PCA learning. Suppose we have a data matrix S = [f_1 | f_2 | ··· | f_n] involving n data parties, where each party holds p variables (i.e. features or labels) and q samples (i.e. users), f_i ∈ X^{p×n}; we assume that the number of samples q is the same for all parties while the features X^p of each party are different. Each party has its own database X = {x_1, x_2, ..., x_p} ∈ R^{q×p}; in particular, for an image dataset, x_i ∈ R^{m×m×q}, where m is the pixel size. The final extracted principal component result is V = [v_1, v_2, ..., v_k], where k is the number of largest principal components selected. S is the covariance matrix, S_{q,i} = XX^T, i = 1, 2, ..., N, j = 1, 2, ..., q, where N is the number of parties. Let (λ_i)_k = (λ_{1k}, λ_{2k}, ..., λ_{qk}) be the eigenvalues selected by each party; we then have

W_i = (λ_i)_k / Σ_{j=1}^{N} (λ_j)_k,    V = Σ_{i=1}^{N} W_i V_i,

where W_i is the weight relating the eigenvectors of each party to the shared feature vector, and V is the shared projection feature vector.
To achieve Mode 1 and Mode 2 stated in Section 3.1, the corresponding algorithms are given in Algorithm 1 and 2, respectively.
Algorithm 1: Federated-PCA learning with central collaborative server
⇒ Run on party i
Input: data {x_1, x_2, ..., x_p} ∈ R^{q×p} belonging to party i
Output: principal eigenvalues λ_i and eigenvectors V_i
1. Compute the local covariance matrix S and its top-k eigenvalues and eigenvectors;
2. Send (λ_{q,i}, v_{q,i})_k to the central collaborative server.
⇒ Run on central collaborative server
Input: number of parties N
Output: the weights W_i, i ∈ {1, 2, ..., N}, and the shared feature vector V
3. Receive λ_i, V_i from the N parties;
4. for i = 1 to N do compute W_i and accumulate W_i V_i into V;
5. Broadcast the shared feature vector V to all parties.

Algorithm 2: Fully decentralized Federated-PCA learning
⇒ Run on party i
Input: data {x_1, x_2, ..., x_p} ∈ R^{q×p} belonging to party i; number of parties N
Output: final PCA learning results computed locally
1. Compute the local covariance matrix S and its top-k eigenvalues and eigenvectors;
2. Send (λ_{q,i}, v_{q,i})_k to the other parties;
3. for j = 1 to N do receive (λ_{q,j}, v_{q,j})_k, compute W_j, and accumulate W_j V_j into the local shared vector V.

Experiments
This section will empirically evaluate our proposed method.

Datasets
In our experiments, we first used structured datasets from different domains and split each of them feature-wise in different ways. We used a total of 10 datasets for comparative testing; information on the ten datasets is given in Table 1. We then used the image dataset Fashion-MNIST [22], summarized in Table 2. We re-split this dataset by label, divided it into training and test sets, and performed comparison experiments on the unsplit and the split data.

Comparative Results
In our setting, each party obtains a set of eigenvectors and eigenvalues. We choose the top-k principal components that cover roughly 90% of the explained variance ratio, and the number of principal components is determined by the party demanding the largest number of principal components. In particular, after performing PCA label-wise on the Fashion-MNIST dataset [22], the explained variance ratio already covers almost 70% when k = 1.
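The selection of k by explained variance ratio can be sketched as follows; the 90% threshold matches the setting above, and the function name and sample eigenvalues are illustrative:

```python
import numpy as np

def choose_k(eigvals, threshold=0.90):
    # smallest k whose top-k eigenvalues cover `threshold` of the total variance
    lam = np.sort(np.asarray(eigvals))[::-1]
    ratio = np.cumsum(lam) / lam.sum()        # cumulative explained variance ratio
    return int(np.searchsorted(ratio, threshold) + 1)

print(choose_k([5.0, 3.0, 1.5, 0.5]))         # -> 3, since (5+3)/10 = 0.8 < 0.9
```

With several parties, each would run this locally and the federation would use the largest k any party requires, as described above.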
We conducted a series of comparative experiments, splitting the datasets in various forms. In total, we set up four comparative experimental groups: undivided party, isolated-PCA party, Federated-PCA, and federated-avg PCA. Comparing the four settings on the different split datasets, the ranges of bias rates between the final results on the unsplit data and the isolated-PCA and Federated-PCA final results are summarized in Table 3.
Table 2. Labels and example images in the Fashion-MNIST dataset.
Besides, the final experimental improvement on the College dataset was not very obvious. We therefore re-analyzed this dataset and checked the correlation coefficients between features across all datasets. We found that the correlation coefficients of the first few features were too large (close to 1), so we screened out these features and preprocessed the dataset; after rerunning the experiment, the results were significantly improved. The comparison results are given in Figure 6.
According to the experimental results, the proposed approach significantly improves the final results on almost all datasets. Compared with the higher bias rate of the normal results, the bias-rate range of the Federated-PCA results stays basically within 10%, which demonstrates the validity and value of this method. The image dataset likewise performs very well: Figure 4 clearly shows that the final image after Federated-PCA is almost the same as the final image obtained from the unsplit label image data. To further verify that the final label-wise image result extracts more features, we performed a logistic regression classification test. In contrast to the setting where each party uses the standard PCA method independently, Table 4 shows that the classification accuracy rate is greatly improved, to approximately 100%, after using the Federated-PCA method, which helps each party further perform its local training tasks.

Concluding Remarks
In this paper, based on vertically partitioned data, our proposed Federated-PCA can improve the models of almost all parties. The comparative experiments make clear that the model results of the federated parties and of the unsplit party are close. This method can therefore help data stakeholders address problems such as large amounts of data and privacy-preserving sharing. In reality, various kinds of data exhibit non-linear relationships, and the standard PCA method has difficulty finding a good representative direction, especially for image data, most of which is non-linear. To further serve more complex and higher-dimensional image datasets, such as medical datasets, as mentioned in [11], obtaining better models by extracting features of different categories is very valuable for medical centers and hospitals. For example, a patient may go to one medical clinic for a pathology test and to another for radiology picture archiving; in this situation, the features of one sample are partitioned over two clinics regulated by HIPAA [2]. In the future, we will perform Federated-KPCA learning on different kinds of non-linear data. As the privacy protection of patient data is one of our first considerations, our work will continue to share model parameters between parties within a privacy-protection system. During the experiments, the improvement was not obvious for some structured datasets; we believe the correlation between different features is the main reason for the poor final results, so later experiments will add correlation analysis to the dataset preprocessing stage. Besides, considering the large number of missing values for certain features, we will investigate whether such a column can be deleted first, following the eigenvalue calculation method proposed in [6], with the original eigenvalues calculated from the remaining sub-matrices.
This method may help us estimate the eigenvalues of missing features based on the values of the other existing features. The future holds many challenges; for example, external adversaries may target the central collaborative server and the parties, and the framework of this threat model is shown in Figure 7. Another line of future work is to improve the verification mechanism for each party, e.g. by establishing a screening mechanism, so as to ensure that every party is a safe and effective participant. Besides, we can adjust the model parameters based on the results provided by each party to further improve the performance of the model.