Learning Canonical Embeddings for Unsupervised Shape Correspondence With Locally Linear Transformations

We present a new approach to unsupervised shape correspondence learning between pairs of point clouds. We make the first attempt to adapt the classical locally linear embedding algorithm (LLE)—originally designed for nonlinear dimensionality reduction—for shape correspondence. The key idea is to find dense correspondences between shapes by first obtaining high-dimensional neighborhood-preserving embeddings of low-dimensional point clouds and subsequently aligning the source and target embeddings using locally linear transformations. We demonstrate that learning the embedding using a new LLE-inspired point cloud reconstruction objective results in accurate shape correspondences. More specifically, the approach comprises an end-to-end learnable framework of extracting high-dimensional neighborhood-preserving embeddings, estimating locally linear transformations in the embedding space, and reconstructing shapes via divergence measure-based alignment of probability density functions built over reconstructed and target shapes. Our approach enforces embeddings of shapes in correspondence to lie in the same universal/canonical embedding space, which eventually helps regularize the learning process and leads to a simple nearest neighbors approach between shape embeddings for finding reliable correspondences. Comprehensive experiments show that the new method makes noticeable improvements over state-of-the-art approaches on standard shape correspondence benchmark datasets covering both human and nonhuman shapes.


INTRODUCTION
The shape correspondence learning problem is fundamental to geometry processing and computer vision and has been used as a key component in many downstream applications such as deformation modeling [1], texture mapping [2], and medical imaging [3], to name a few.
Dense correspondences between a pair of shapes can be established by measuring the similarities of extracted feature descriptors. Traditional approaches have identified a set of geometric feature descriptors, including extrinsic and intrinsic descriptors [4], [5], [6], [7], [8], [9]. However, these handcrafted descriptors often lead to inaccurate and time-consuming solutions. More recently, we have seen the emergence of data-driven approaches built upon modern machine learning techniques that learn the optimal features directly from massive shape pair datasets [10], [11], [12], [13], [14], [15]. However, the major drawback here is the need for supervised learning, which relies on a sufficient number of labeled training pairs with high-quality ground truth correspondences, which are known to be scarce and difficult to obtain. By contrast, unsupervised approaches [16], [17], [18], [19], [20] seek to remove the dependency on ground truth correspondence by employing autoencoder-inspired architectures, where they construct the deformation between a pair of shapes and leverage point reconstruction to learn suitable features for measuring the similarities between shapes. However, most of them suffer from the nontrivial optimization of the deformation and reconstruction, thus often requiring additional regularization or constraints, e.g., cycle consistency [21] and local smoothness [19], and usually achieving limited generalization performance.
In this paper, we present a new unsupervised learning framework for shape correspondence between pairs of point clouds. Inspired by the classical locally linear embedding (LLE) algorithm [22], initially used for nonlinear dimensionality reduction, we make the first attempt to adapt this concept for shape correspondence. LLE exploits the fact that manifolds are locally Euclidean and approximately preserves the geometry within local neighborhoods. In a similar manner, because a point cloud shape is a sampled version of a smooth manifold, it is desirable to learn shape embeddings capable of capturing the underlying structure of the manifold. Our method achieves this by learning high-dimensional neighborhood-preserving embeddings of low-dimensional shapes such that nearby and corresponding points fall close to each other both in the high-dimensional embedding space and in the low-dimensional input space.
Another point of departure is that in recent approaches like functional maps, non-rigid deformations between shapes are expected to become linear transformations once shapes are projected into a higher-dimensional embedding space. Essentially, point-to-point correspondences between shapes are generalized as a linear map between the corresponding function spaces [23]. It is worth emphasizing that our approach is fundamentally different from the functional map-based approaches [14], [23]. These approaches interpret the basis as the embedding for each shape and represent the mapping between a pair of shapes as a change-of-basis matrix, applying a global linear transformation to every shape point. By contrast, our approach treats maps between shapes as locally linear transformations between embeddings, where each shape point has its own linear transformation computed from local neighboring regions. The locally linear transformations succeed in identifying the underlying structure of the shape manifold by enforcing embeddings of shapes in correspondence to lie in the same universal/canonical embedding space, which eventually helps regularize the learning process and leads to a simple nearest neighbors approach for finding reliable correspondences.

Fig. 1. Given the source and target shapes (X and Y) as input: (1) We first project the low-dimensional point cloud coordinates into the high-dimensional embedding space; (2) We align feature embeddings of X and Y by computing the optimal locally linear transformation that can best cross-reconstruct each feature embedding of X using its top-K nearest neighbors (red color) in the embedding space of Y; (3) We associate the embeddings of these neighbors to their original point cloud coordinates and further leverage the reconstruction weights of the optimal transformation to reconstruct a shape Ŷ sharing the same point indices as X (both in blue color); and (4) The embedding network can be optimized by minimizing the divergence between Ŷ and Y.
We achieve our goals through the following steps, all driven by the idea of marrying LLE and the construction of high-dimensional nonlinear embeddings of point cloud shapes (Figure 1). Assume we have a nonlinear embedding structure taking a low-dimensional point cloud and returning a high-dimensional embedding, whose weights we seek to learn. Given two shapes whose correspondence we seek, in the first step, we attempt to cross-reconstruct each source point (in the embedding space) from its nearest neighbors in the target point cloud embedding. This step mirrors the first stage of LLE. Next, we take the obtained reconstruction weights and cross-reconstruct a point cloud shape (in the original low-dimensional space) from the same set of nearest neighbors in the target point cloud (again in the original low-dimensional space). We thereby obtain a reconstructed point cloud in one-to-one correspondence with the source point cloud but based on the nearest neighbors in the target point cloud. We then minimize a suitable divergence measure between the cross-reconstructed and target point clouds with respect to the unknown weights of the nonlinear embedding. Minimization of the cross-reconstruction error minimizes the distance between the original and reconstructed point clouds, and minimization of the divergence measure brings the cross-reconstruction and target point clouds into register. In this way, we build an unsupervised shape correspondence engine capable of end-to-end learning of nonlinear universal embeddings of shape point clouds.

Contributions
In summary, our contributions are:
• A new perspective on finding dense correspondences between shapes as locally linear transformations in a high-dimensional embedding space, as a superior way to regularize the embeddings of shapes in correspondence by forcing them to lie in the same canonical embedding space;
• An unsupervised shape correspondence learning framework for extracting nonlinear shape embeddings that preserve distances within local neighborhoods, estimating locally linear transformations in the embedding space, and reconstructing shapes via the alignment of probability density functions (PDFs) built over reconstructed and target shapes;
• A divergence measure for bringing the cross-reconstructed and target shape PDFs into register, which shows improved performance over the popular Chamfer distance (CD) and Earth mover's distance (EMD) measures;
• A significant improvement over existing state-of-the-art methods on standard benchmarks covering both human and nonhuman shapes.
Comprehensive experiments show that the new method makes substantial improvements while showing strong model generalization across datasets with efficient training and inference. More importantly, the proposed idea could be useful for matching problems in other modalities such as images and meshes, as well as cross-modality matching problems such as images to point clouds, and it is a promising approach for other tasks requiring the application of manifold learning concepts.

RELATED WORK
Shape Correspondence and Matching. Early efforts at representing correspondence (before the deep learning era) used the inexact weighted graph matching formulation, with a permutation matrix and outlier handling for correspondence representation [24], [25], [26], later followed by simultaneous pose and correspondence estimation. Simpler (but not necessarily better) methods such as iterative closest point (ICP) [27] and Chamfer matching (CM) [28] started seeing deployment, followed by the emergence of Earth mover's distance (EMD) [29] and transportation-based distance measures. Simultaneously, soft correspondence approaches via softassign [30], [31], [32], for both linear and quadratic assignment, alternately estimated the transformations and updated the explicit point-to-point correspondence. Coherent Point Drift (CPD) [33] is similar to Robust Point Matching (RPM) [32] but used Gaussian radial basis functions (GRBF) instead of thin-plate splines (TPS) for nonrigid deformations. RPM-L2E [34], [35] leveraged the L2E estimator for estimating transformations bootstrapped from the shape context [36]. Later, point cloud density estimation approaches [37], [38], [39], [40], [41] appeared, coupling distances between density functions optimized w.r.t. the unknown spatial transformation without establishing explicit point correspondence. Representative approaches include KC [38], GMMReg [39], [40], and CS [41], to name a few.
The functional map was introduced in the pioneering work of Ovsjanikov et al. [23] for solving non-rigid shape matching by avoiding the direct estimation of point-to-point correspondence and instead modeling linear transformations between the functional spaces of shapes, followed by subsequent extensions such as [17], [42], [43], [44], [45], [46]. Recently, Diff-FMaps [14] interpreted the eigendecomposition of the Laplace-Beltrami operator (LBO) as higher-dimensional embeddings of shapes. More importantly, it demonstrated that learning a canonical embedding is a nontrivial problem and that splitting correspondence learning into two parts (invariant embedding + linear transformation) is beneficial for regularizing the embedding learning in challenging settings. Meanwhile, we have also witnessed progress in unsupervised functional map approaches [17], [21], [47] that consider structural penalties on the inferred maps, e.g., bijectivity or orthogonality.
The spatial approach is another direction in related work. 3D-CODED [16] and Elementary Structures [48] matched deformable shapes by jointly encoding shapes and correspondences via deforming templates. CorrNet3D [18] exploited DGCNN [49] to project shapes into a high-dimensional feature space and enforced unsupervised feature learning by constructing a symmetric deformer for point cloud reconstruction. Trappolini et al. [50] proposed a transformer-based framework to efficiently estimate the transformation between point cloud shapes. DPC [19] demonstrated a self- and cross-reconstruction framework to learn the latent affinity via a simplified point reconstruction, which is completely different from existing encoder-decoder frameworks [16], [18] that regress ordered point clouds to determine matching points. DPC normalized the similarity of each feature embedding's K nearest neighbors via softmax and cross-reconstructed a shape using the corresponding input points and similarity scores. An additional mapping loss and self-reconstruction have to be included to impose smoothness constraints; otherwise, the performance of DPC drops significantly (see their ablation study on design choices). In this paper, we focus on learning LLEs capable of capturing the underlying structure of the shape manifold, with the main goal of finding a proper design of the embedding by leveraging local neighborhood relations. Our method simultaneously learns the embedding and the optimal locally linear transformation using the LLE-inspired first-stage method and point reconstruction. Consequently, the cross-reconstruction of the source shape using target points gets an implicit regularization via its close relationship to the reconstruction of the source's high-dimensional embedding counterparts. We conjecture that this is exactly what is missing in DPC and hence explains our superior performance using just cross-reconstruction.
Shape Descriptor and Feature Learning. The shape analysis community has actively investigated extracting descriptors and feature maps from shapes to capture geometric properties around the neighborhoods of points of interest. A detailed discussion of classic hand-crafted descriptors can be found in [51], [52].
Early attempts focused on invariance under a global spatial transformation, e.g., a rigid motion, as shown in shape context [36], spin images [4], and multiscale local features [53]. The community then extended these to the nonrigid case by considering geodesic distances and conformal factors [54], [55]. Later, diffusion geometry [56] established invariant metrics based on the eigenvalues and eigenvectors of the Laplace-Beltrami operator obtained from shapes, showing significantly more robustness compared to the geodesic counterparts [57], [58]. Follow-ups include the global point signature (GPS) [7], heat kernel signature (HKS) [8], and wave kernel signature (WKS) [9].
Data-driven feature descriptors have shown advantages over hand-crafted features in robustness and efficiency. The bag-of-features (BoF) descriptor extracted frequency histograms of geometric words from shapes [59]. In [60], a robust and invariant point signature was learned from the contextual 3D neighborhood information of salient points for shape matching. Recently, the community has started to extract deep features from shapes in a data-driven fashion. For example, GCNN captured invariant shape features from triangular meshes [61]. PointNet [62] showed that the learned features could be used to compute shape correspondences. FMNet [44] leveraged a Siamese residual network [63] for descriptor learning. SplineCNN [64] introduced a novel convolution operator based on B-splines to filter the geometric input efficiently. DeepGFM [65] used KPConv [66], a classic point cloud convolutional filter, to extract robust shape features.
Our approach belongs to the data-driven approaches. To the best of our knowledge, it is the first attempt to adapt the classic LLE algorithm for unsupervised shape correspondence learning in an end-to-end learnable framework. Once such discriminative feature representations are learned, the correspondence is determined via nearest neighbor search.

LOCALLY LINEAR EMBEDDING
Before describing the proposed approach, we provide background on the classic LLE framework [22]. Given input point features, LLE has three steps: first, it identifies the K nearest neighbors of each point; second, it uses least squares to compute the weights for reconstructing each point from its nearest neighbors; and third, it reuses the same set of weights to compute the embedding of each point in a low-dimensional space. Specifically, let X = {x_i ∈ R^D | i = 1, ..., N} denote the input point set. LLE builds a K nearest neighbors (KNN) graph over X by measuring the pairwise Euclidean distance and removing self-loops from the graph. Denoting x_{il} ∈ R^D as the l-th nearest neighbor of point x_i, we obtain the LLE reconstruction weights by solving

\min_{W} \sum_{i=1}^{N} \Big\| x_i - \sum_{l=1}^{K} w_{il}\, x_{il} \Big\|_2^2, \quad \text{s.t.} \quad \sum_{l=1}^{K} w_{il} = 1, \quad (1)

where W := [w_1, ..., w_N]^T ∈ R^{N×K} denotes the reconstruction weights and w_i := [w_{i1}, ..., w_{iK}]^T ∈ R^K denotes the weights associated with the KNN neighbors {x_{il}}_{l=1}^K for reconstructing point x_i. The sum-to-one weight constraint Σ_{l=1}^K w_{il} = 1 leads to a specific property: the weights for each point x_i are invariant to rotations, translations, and rescalings of that point and its nearest neighbors [22]. The optimal W can be found by solving a constrained least squares problem, as detailed in Appendix A. Given the optimal W, LLE then finds the lower-dimensional embeddings of the input points by solving a sparse eigenvalue problem [22].
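As an illustration, the first two LLE steps (neighbor selection and the constrained least-squares weights) can be sketched as below. This is our own minimal NumPy sketch, not code from the paper; the function name and the regularizer `gamma` (playing the role of the γ discussed later) are our choices:

```python
import numpy as np

def lle_weights(X, K=4, gamma=1e-3):
    """Reconstruction weights of classic LLE (steps 1-2), one row per point.

    X: (N, D) array. Returns W: (N, K) weights and idx: (N, K) neighbor
    indices. gamma regularizes the local Gram matrix for stability.
    """
    N = X.shape[0]
    # pairwise squared distances; skipping column 0 of argsort removes self-loops
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, 1:K + 1]      # K nearest neighbors
    W = np.zeros((N, K))
    for i in range(N):
        Z = X[idx[i]] - X[i]                      # centered neighbors, (K, D)
        G = Z @ Z.T + gamma * np.eye(K)           # regularized Gram matrix
        w = np.linalg.solve(G, np.ones(K))
        W[i] = w / w.sum()                        # enforce sum-to-one
    return W, idx
```

The sum-to-one normalization is what yields the translation invariance of the weights noted above: shifting the whole point set leaves both the neighbor graph and the weights unchanged.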

LTENET
Having briefly summarized LLE, we introduce our novel approach to unsupervised shape correspondence learning called LTENet (Locally Linear Transformation based Embedding Networks). To the best of our knowledge, this is the first attempt at introducing an LLE-inspired algorithm that represents maps between pairs of shapes as locally linear transformations while simultaneously deploying an LLE shape reconstruction objective to optimize nonlinear embeddings towards the same universal/canonical space. In this section, we first define the shape correspondence problem. We then introduce the dovetailing of locally linear transformations with novel LLE point cloud reconstructions for learning both the optimal transformation and the embedding. Finally, we describe the divergence measure between the cross-reconstructed and target point clouds. The pipeline of LTENet is summarized in Figure 2.

Problem Definition and Objectives
Let point clouds X := [x_1, ..., x_N]^T ∈ R^{N×3} and Y := [y_1, ..., y_N]^T ∈ R^{N×3} denote the source and target shapes, respectively, where x_i, y_j ∈ R^3 and N is the number of points. Our goal is to find a point-to-point correspondence or map T_XY : X → Y such that every point x_i in X has its corresponding point y_{j*} := T_XY(x_i) in Y, where 1 ≤ i, j* ≤ N. Inspired by LLE [22], the proposed LTENet leads to suitable embeddings for shape correspondence that preserve the local configurations of nearest neighbors. From a high-level perspective, it shares a similar spirit with the LBO, which relies on preserving distances between nearby points. While the LBO is usually constructed over point cloud coordinates, LTENet operates on nonlinear feature embeddings obtained from deep neural networks, which are more robust and efficient to compute. It is worth mentioning that the proposed approach directly takes raw point cloud coordinates as input without any point connectivity information.

Optimal Locally Linear Transformations
Given X and Y, we extract their nonlinear feature embeddings F_X, F_Y ∈ R^{N×D}, respectively, via a neural network F. This allows us to transition from the point cloud coordinates (R^{N×3}) to a higher-dimensional embedding space (R^{N×D}), where finding shape correspondence is more likely to be successful, as demonstrated by the functional map paradigm [23].
To align F_X and F_Y, we must find a transformation between them while also jointly optimizing F to obtain suitable embeddings for estimating shape correspondence. Recall that Equation (1) in LLE approximates the original input points using the LLE reconstruction weights and nearest neighbors. Denote F_Ŷ ∈ R^{N×D} as the feature embeddings obtained by applying the transformation to F_Y to achieve the alignment, i.e., F_X ≈ F_Ŷ. Given all f_i^X ∈ F_X and f_j^Y ∈ F_Y, we implement the transformation by first considering

\min_{W^{XY}} \sum_{i=1}^{N} \Big\| f_i^X - \sum_{l \in N_Y(f_i^X)} w_{il}\, f_l^Y \Big\|_2^2, \quad \text{s.t.} \quad \sum_{l} w_{il} = 1, \quad (2)

where W^{XY} := [w_1, ..., w_N]^T ∈ R^{N×K} denotes the cross-reconstruction weights and N_Y(f_i^X) denotes the index set of the top-K nearest neighbors of f_i^X selected from F_Y, and

f_i^Ŷ := \sum_{l \in N_Y(f_i^X)} w_{il}\, f_l^Y. \quad (3)

As observed from Equations (2) and (3), we make two unique modifications compared to Equation (1).

Fig. 2. Pipeline of LTENet: … (2) select top-K neighbors for each feature embedding f_i^X based on the cosine similarity between F_X and F_Y; (3) estimate the locally linear transformations following Equations (2) and (3) to best reconstruct F_X using F_Y, denoted as F_Ŷ; (4) reconstruct a shape Ŷ following Equation (6); (5) learn the embedding network F via minimizing the divergence D_CS(P(Ŷ), P(Y)); and (6) determine the correspondence using nearest neighbors between embeddings.
Instead of each point picking nearest neighbors from its own point set, we conduct a cross-reconstruction such that the feature embeddings F_X of the source shape select nearest neighbors from F_Y in the target shape, which enforces embeddings of shapes in correspondence to lie in the same universal/canonical embedding space.
Intuitively, Equation (2) represents the spatial transformation between shapes in the input point space, e.g., a non-rigid transformation, as an equivalent locally linear transformation between F_X and F_Y. An optimal W^{XY} implies that F_X has been properly reconstructed by F_Ŷ using F_Y. Following LLE [22], [67], the optimal W^{XY} can be found by solving a constrained least squares problem:

w_i = \frac{(G_i^{XY} + \gamma I)^{-1} \mathbf{1}}{\mathbf{1}^T (G_i^{XY} + \gamma I)^{-1} \mathbf{1}}, \quad (4)

where I is the identity matrix and 1 ∈ R^{K×1} is the vector with all elements equal to one. G_i^{XY} denotes the Gram matrix defined as

G_i^{XY} = (f_i^X \mathbf{1}^T - \eta_i^Y)^T (f_i^X \mathbf{1}^T - \eta_i^Y), \quad (5)

where η_i^Y ∈ R^{D×K} stacks the feature embeddings of the K neighbors of f_i^X found in F_Y. Adding γI in Equation (4) leads to numerically stable solutions by avoiding the possible singularity of G_i^{XY} [67], [68]. This also links our work to robust LLE with an ℓ2 norm-based regularization (see Appendix A for details).
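For concreteness, the closed-form weights of Equations (4) and (5) admit a short batched PyTorch sketch. This is our own illustrative code, not the paper's implementation; the function name, `K`, and `gamma` defaults are assumptions:

```python
import torch

def cross_lle_weights(FX, FY, K=8, gamma=1e-3):
    """Locally linear transformation weights in closed form (cf. Eqs. 4-5).

    FX, FY: (N, D) embeddings. For each f_i^X, pick its top-K cosine
    neighbors in FY, build the regularized Gram matrix, solve
    (G + gamma*I) w = 1, and normalize to satisfy sum-to-one.
    Returns W: (N, K) and neighbor indices idx: (N, K).
    """
    FXn = torch.nn.functional.normalize(FX, dim=1)
    FYn = torch.nn.functional.normalize(FY, dim=1)
    sim = FXn @ FYn.T                              # (N, N) cosine similarity
    idx = sim.topk(K, dim=1).indices               # top-K neighbors in FY
    eta = FY[idx]                                  # (N, K, D) neighbor features
    Z = FX.unsqueeze(1) - eta                      # f_i^X 1^T - eta_i^Y
    G = Z @ Z.transpose(1, 2)                      # (N, K, K) Gram matrices
    G = G + gamma * torch.eye(K).unsqueeze(0)      # numerical stabilization
    ones = torch.ones(FX.shape[0], K, 1)
    w = torch.linalg.solve(G, ones).squeeze(-1)    # (N, K)
    W = w / w.sum(dim=1, keepdim=True)             # sum-to-one constraint
    return W, idx
```

Because the solve is batched over points, the whole step remains differentiable with respect to the embeddings, which is what allows end-to-end training.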

Learning Canonical Embeddings
The cross-reconstruction weights W^{XY} allow us to represent F_X in terms of F_Y. However, this does not necessarily imply that W^{XY} will lead to a better embedding network F suitable for shape correspondence. To see this, observe that the closed-form expression of W^{XY} depends only on the Gram matrix G^{XY}, where each G_i^{XY} is constructed from the feature difference between f_i^X and η_i^Y. Therefore, the optimal W^{XY} essentially relies on F_X and F_Y: it is optimal for reconstructing F_X using F_Y, but not for shape correspondence unless an additional optimization step is applied.
To properly train the embedding network, we associate the high-dimensional embeddings with the original low-dimensional point cloud coordinates. Specifically, observing Equation (3), we can reuse the same coefficients to reconstruct the low-dimensional point ŷ_i ∈ R^3 for each f_i^Ŷ, which gives

\hat{y}_i = \sum_{l \in N_Y(f_i^X)} w_{il}\, y_l, \quad (6)

where y_l ∈ Y is the point associated with f_l^Y. We then obtain Ŷ := [ŷ_1, ..., ŷ_N]^T ∈ R^{N×3}, interpreted as the linearly reconstructed shape for F_Ŷ in the basis elements (point coordinates) of Y. The indices of Ŷ are in exact one-to-one correspondence with the indices of X. Also, ŷ_i can be understood as a soft correspondence of x_i because the indices of {y_l | l ∈ N_Y(f_i^X)} are the same indices as the top-K nearest neighbors of f_i^X. Because these neighbors are selected from F_Y with the highest similarity to f_i^X, the point y_l associated with each neighbor is a candidate matching point of x_i. If each point x_i finds its approximate matching point ŷ_i, we would expect Ŷ to become similar to Y as the training progresses (Figure 3). To this end, we can train F by solving

\min_{F} \; D(\hat{\mathcal{Y}}, \mathcal{Y}), \quad (7)

where D(•, •) defines a dissimilarity measure. Because Ŷ and X share the same point indices, which differ from those of Y, we do not have an obvious one-to-one correspondence between Ŷ and Y; e.g., ŷ_1 probably does not correspond to y_1, and so on.

LLE vs. LTENet. LLE finds the embedded vector for each input point by solving an expensive eigenvalue problem (a projection from the high-dimensional input space to the low-dimensional embedding space). In contrast, our approach approximates source points via shape reconstruction using linear combinations of nearest-neighbor target point coordinates Y and the (closed-form) weights W^{XY} obtained from first-stage LLE. Unlike LLE, which focuses on identifying suitable low-dimensional embeddings for high-dimensional input data, LTENet establishes dense correspondences between shapes by forcing a pair of shapes to lie on the same manifold. This is achieved by our fully differentiable LTENet framework, which pushes their embeddings towards a locally linearly invariant space via maximizing the similarity between a shape and its reconstructed counterpart. LTENet demonstrates a principled approach by adopting the central concept of classic LLE for shape correspondence. Next, we introduce a suitable distance measure D(•, •) for end-to-end training.
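The gather-and-sum of Equation (6) reduces to a few tensor operations; a minimal sketch (ours, with the weights and neighbor indices assumed precomputed):

```python
import torch

def reconstruct_shape(Y, W, idx):
    """Eq. (6): y_hat_i = sum_l w_il * y_l over the neighbor set of f_i^X.

    Y: (N, 3) target points, W: (N, K) weights, idx: (N, K) indices of each
    source embedding's neighbors inside F_Y. Row i of the result shares the
    point index of source point x_i, i.e., it is a soft correspondence.
    """
    neighbors = Y[idx]                             # (N, K, 3) candidate matches
    return (W.unsqueeze(-1) * neighbors).sum(dim=1)
```

Note that the output inherits the ordering of the source shape, which is exactly why it can be compared against the target only through a permutation-invariant divergence.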

Implicit Correspondence Learning via the Alignment of PDFs
Most unsupervised approaches [19] adopt the popular CD and EMD measures to reconstruct point clouds, which are sensitive to outliers or computationally intensive. Point clouds are quite often nothing but discrete samples of underlying continuous shapes and surfaces. Therefore, we instead represent point clouds as probability density functions and seek to minimize a divergence between the reconstructed and original shapes. Formally, given the shape X = [x_1, ..., x_N]^T, we represent an arbitrary point x by the kernel (Parzen) density estimate (KDE) of the PDF using an arbitrary kernel function K(•):

P(\mathcal{X})(x) = \frac{1}{N} \sum_{i=1}^{N} K\Big(\frac{x - x_i}{\sigma}\Big), \quad (8)

where σ is the bandwidth parameter. We choose the Gaussian kernel as the kernel function due to its nice properties: it is symmetric, positive definite, and its value approaches zero as the point x moves away from the kernel center, with a decay rate controlled by σ.
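The Parzen estimate of Equation (8) with a Gaussian kernel is a one-liner over pairwise distances; a sketch (ours; the bandwidth default is an assumption):

```python
import math
import torch

def kde(query, points, sigma=0.05):
    """Parzen density estimate of Eq. (8) with an isotropic 3-D Gaussian kernel.

    query: (M, 3) evaluation locations, points: (N, 3) shape samples.
    Returns (M,) density values; sigma is the bandwidth.
    """
    d2 = torch.cdist(query, points) ** 2           # (M, N) squared distances
    norm = (2 * math.pi * sigma ** 2) ** 1.5       # 3-D Gaussian normalizer
    return torch.exp(-d2 / (2 * sigma ** 2)).mean(dim=1) / norm
```

As expected of a density, the estimate is large near the samples and decays to zero far from the cloud at a rate set by σ.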
Inspired by [38], [39], [41], [69], we adopt the Cauchy-Schwarz (CS) divergence [70], denoted as D_CS(q, p), to measure the similarity between two density functions:

D_{CS}(q, p) = -\log \frac{\big(\int q(x)\, p(x)\, dx\big)^2}{\int q(x)^2\, dx \int p(x)^2\, dx}, \quad (9)

which is symmetric for any two PDFs q and p and satisfies 0 ≤ D_CS < ∞, where the minimum is attained iff q(x) = p(x).
We substitute the Gaussian kernel PDF estimators for Y and Ŷ into q(x) and p(x) and make straightforward manipulations based on the convolution theorem for Gaussian functions (see the detailed derivation in Appendix B), which gives

D_{CS}(P(\hat{\mathcal{Y}}), P(\mathcal{Y})) = -2 \log \sum_{i,j} G_{\sqrt{2}\sigma}(\hat{y}_i - y_j) + \log \sum_{i,i'} G_{\sqrt{2}\sigma}(\hat{y}_i - \hat{y}_{i'}) + \log \sum_{j,j'} G_{\sqrt{2}\sigma}(y_j - y_{j'}), \quad (10)

where G_{√2σ}(•) is a Gaussian kernel with bandwidth √2σ arising from the convolution of two Gaussians, and the constant normalization factors cancel across the three terms. Later, we will show that CS leads to better performance compared to the CD and EMD objectives by handling outliers using the Gaussian kernels [71]. Specifically, the Gaussian kernel G mitigates oversensitivity to outliers by suppressing large distances between reference and reconstructed shape points. In CD and EMD, these large distances due to outliers negatively impact model training, leading to degraded performance. It is worth mentioning that CS is closely related to graph cuts and Mercer kernel theory [71].

Implementation of the CS loss. The CS divergence loss can be implemented in PyTorch with a few lines of code. To handle numerical issues, we leverage the Log-Sum-Exp trick, as shown in Algorithm 1.

Algorithm 1
The CS divergence implemented in PyTorch.
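A minimal sketch of such a loss, consistent with Equations (9) and (10), is given below. This is our own illustrative reimplementation, assuming an isotropic Gaussian kernel with a shared bandwidth and letting the constant normalization factors cancel across the three terms:

```python
import torch

def cs_divergence(Y_hat, Y, sigma=0.05):
    """Cauchy-Schwarz divergence between Gaussian KDEs of two point clouds.

    Uses the Gaussian convolution identity (two sigma-kernels convolve into a
    sqrt(2)*sigma kernel, hence the 4*sigma^2 in the exponent) and logsumexp
    for numerical stability. Constant factors cancel between the three terms.
    """
    def log_gram(A, B):
        d2 = torch.cdist(A, B) ** 2
        return torch.logsumexp(-d2.flatten() / (4 * sigma ** 2), dim=0)
    return -2 * log_gram(Y_hat, Y) + log_gram(Y_hat, Y_hat) + log_gram(Y, Y)
```

The divergence is zero for identical clouds and grows as the cross term between the two clouds shrinks, which is the behavior the training objective exploits.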

The Training Objective
Similar to Equations (2) and (6), which use W^{XY} to reconstruct F_Ŷ and Ŷ, we compute the reconstruction weights W^{YX} to approximate the original input shape X, which results in the reconstructed shape X̂ := [x̂_1, ..., x̂_N]^T ∈ R^{N×3}.
In addition, we approximate the original Y using W^{YY} and Y, a self-reconstruction process yielding the approximate shape Ỹ := [ỹ_1, ..., ỹ_N]^T ∈ R^{N×3}. The approximation of X is similarly expressed as X̃ := [x̃_1, ..., x̃_N]^T ∈ R^{N×3}. The final training objective is defined as

L = \lambda_{cross} \big[ D(\hat{\mathcal{Y}}, \mathcal{Y}) + D(\hat{\mathcal{X}}, \mathcal{X}) \big] + \lambda_{self} \big[ D(\tilde{\mathcal{Y}}, \mathcal{Y}) + D(\tilde{\mathcal{X}}, \mathcal{X}) \big] + \lambda_{reg} \big[ E_r(\mathcal{X}, \hat{\mathcal{Y}}) + E_r(\mathcal{Y}, \hat{\mathcal{X}}) \big], \quad (11)

where λ_cross, λ_self, and λ_reg are hyperparameters balancing the different losses and D(•, •) is the CS objective in Equation (10). E_r(•, •) is an optional smoothness term defined as the mapping loss of [19], which encourages points in Ŷ (or X̂) to remain close if their one-to-one corresponding points in X (or Y) are close to each other, with a hyperparameter α configured by following [19]. E_r(Y, X̂) is similarly defined.
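The weighting in the objective above is a plain linear combination; a schematic helper (ours, illustrative; the smoothness term E_r is omitted since its exact form follows [19]):

```python
def total_loss(D, X, Y, X_hat, Y_hat, X_tilde, Y_tilde,
               lam_cross=1.0, lam_self=1.0):
    """Cross- and self-reconstruction parts of the training objective.

    D is any differentiable point-cloud divergence (the CS divergence in the
    paper); hatted shapes are cross-reconstructions, tilded ones are
    self-reconstructions. The optional smoothness term is left out here.
    """
    cross = D(Y_hat, Y) + D(X_hat, X)      # cross-reconstruction alignment
    self_ = D(Y_tilde, Y) + D(X_tilde, X)  # self-reconstruction consistency
    return lam_cross * cross + lam_self * self_
```

Keeping `D` as a parameter makes it easy to swap in CD or EMD for the ablations discussed later.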

Test Phase
In the test phase, we obtain the correspondence for each source point x_i by selecting the point from the target shape whose embedding is the nearest neighbor of x_i's embedding based on the cosine similarity. This gives

T_{XY}(x_i) = y_{j^*}, \quad j^* = \arg\max_{j} \frac{\langle f_i^X, f_j^Y \rangle}{\| f_i^X \|_2 \, \| f_j^Y \|_2}.
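The test-phase rule above is a single argmax over a cosine similarity matrix; a sketch (ours, with assumed function and variable names):

```python
import torch

def match(FX, FY, Y):
    """Test-phase correspondence: each source embedding is matched to the
    target point whose embedding is its cosine nearest neighbor.

    FX, FY: (N, D) embeddings of source and target; Y: (N, 3) target points.
    Returns the matched points and their indices j* in Y.
    """
    FXn = torch.nn.functional.normalize(FX, dim=1)
    FYn = torch.nn.functional.normalize(FY, dim=1)
    j_star = (FXn @ FYn.T).argmax(dim=1)    # (N,) indices into Y
    return Y[j_star], j_star
```

No Sinkhorn iterations or post-processing are needed at test time, which is what keeps inference fast.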

Summary
We have presented the LTENet framework for unsupervised shape correspondence learning, which unifies nonlinear embeddings, LLE transformations in the embedding space, point cloud reconstruction, and implicit correspondence learning with the CS divergence. We consider the following analogy for LTENet: CS divergences with LLE transformations on top are to shape correspondence as Kullback-Leibler (KL) divergences with linear classifiers on top are to classification. By doing so, we are able to learn universal feature embeddings where correspondences are directly obtained using nearest neighbors. This is also analogous to the open-set classification problem, where we handle samples of unseen classes by comparing feature distances between these samples and nearest-neighbor trained examples of seen classes.

EXPERIMENTS
In this section, we compare LTENet against recent state-of-the-art approaches on several well-established datasets for shape correspondence, and we conduct ablation studies.

Experimental Setup
Datasets. Following [18], [19], we conduct experiments on standard datasets covering both human and nonhuman shapes. For human shapes, we use the large-scale SURREAL dataset [16] prepared by 3D-CODED [16], which leverages SMPL [72] to generate a total of 230,000 samples. We select arbitrary shapes as training pairs from SURREAL. We then evaluate on the challenging SHREC-19 [73], containing 430 non-rigid shape pairs generated from 44 real human scans. For nonhuman shapes, we adopt SMAL [74] and TOSCA [75] for training and evaluation, respectively. SMAL provides a 3D articulated parametric model for animals. We create a training set of 10,000 shapes by generating 2,000 samples under each animal category. We pair arbitrary shapes of the same category. Similarly, we consider 41 animal figures out of the total 80 objects in TOSCA to match species in SMAL and generate 286 test shape pairs accordingly.

Evaluation metrics. A common evaluation metric is the geodesic distance error, which assumes a known point adjacency matrix that is unavailable in point clouds. Instead, we follow [19] to calculate the correspondence error as

err = \frac{1}{N} \sum_{i=1}^{N} \big\| T_{XY}(x_i) - T^{gt}_{XY}(x_i) \big\|_2,

where T_XY(x_i) and T^gt_XY(x_i) denote the predicted and ground truth correspondences of point x_i w.r.t. Y, and ‖•‖_2 is the ℓ2 norm of a vector. Additionally, we use the error tolerance ε = r/dist_max coupled with a tolerance radius r, where dist_max = max{‖y_i − y_j‖_2, ∀i, ∀j} denotes the maximum of all pairwise point distances in Y. The correspondence accuracy under ε is defined as

acc(\varepsilon) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big( \big\| T_{XY}(x_i) - T^{gt}_{XY}(x_i) \big\|_2 < r \big),

where 1(•) is the indicator function. We set ε to different values between 0% and 20%.

Implementation details. The proposed LTENet is not limited to a specific model architecture for the embedding network F.
We followed [18], [19] to adopt a variant of DGCNN [49] as F, whose core component is the popular EdgeConv operator that builds a dynamic graph over points for learning the feature embeddings. We refer the reader to [49] for more details. Our models were implemented in PyTorch [76]. We used the AdamW optimizer [77] with an initial learning rate of 0.0003, momentum of 0.9, and weight decay of 0.0005. We used a cosine decay learning rate scheduler for 300 epochs with 10 epochs of linear warmup. We trained models with a batch size of 8 on a server equipped with AMD EPYC Rome microprocessors and NVIDIA A100 GPUs.

Baseline methods. We consider recent state-of-the-art unsupervised shape correspondence learning approaches (DPC [19] and CorrNet3D [18]) as competitive baselines. We also compare against supervised approaches, including Diff-FMaps [14], 3D-CODED [16], and Elementary Structures [48], and mesh-based approaches, including the unsupervised SURFMNet [17] and the supervised GeoFMNet [65].
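The two evaluation metrics described above reduce to a few lines; a sketch (ours, with `pred` and `gt` denoting the predicted and ground-truth matched target points):

```python
import numpy as np

def corr_error(pred, gt):
    """Average Euclidean correspondence error: mean distance between the
    predicted and ground-truth matched target points, both (N, 3)."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def corr_accuracy(pred, gt, Y, eps):
    """Fraction of points matched within the tolerance radius r = eps * dist_max,
    where dist_max is the largest pairwise distance in the target shape Y."""
    dist_max = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1).max()
    return float((np.linalg.norm(pred - gt, axis=1) < eps * dist_max).mean())
```

Normalizing the radius by the shape diameter makes the accuracy comparable across shapes of different scales.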

Results on Human Datasets
We explored two training and evaluation settings. For a fair comparison, we followed DPC [19] and trained our models on the first 2,000 shapes out of the total 230,000 samples in SURREAL, evaluating on the official 430 SHREC pairs (SURREAL/SHREC). We also trained models on random pairs generated from SHREC and evaluated on the same test pairs (SHREC/SHREC).

Quantitative evaluation. Table 1 summarizes the acc at 1% error tolerance, which indicates near-perfect correspondence matching, and the average correspondence error err. On SHREC/SHREC, we achieved competitive performance against DPC. On SURREAL/SHREC, we outperformed all baseline models. Specifically, SURFMNet [17] and GeoFMNet [65] achieved impressive performance; however, they require the expensive computation of the LBO basis and complex test-time post-processing [27], [78]. Our method achieved approximately 5× and 2.5× the accuracy of SURFMNet and GeoFMNet, respectively, while showing run-time inference speed comparable to DPC [19], which is about 100× faster than SURFMNet and GeoFMNet. Diff-FMaps [14] suffers from over-fitting on training samples without exploiting shape priors, e.g., local smoothness. CorrNet3D [18] shows improvements over 3D-CODED [16] and Elementary Structures [48] but requires nontrivial optimization in its Sinkhorn-inspired DeSmooth module and decoder, which limits its generalization performance. DPC [19] is the current state-of-the-art method, learning the latent affinity via a simplified point reconstruction. Our LTENet achieved the best acc of 20.7% and the lowest err.

Results on the Nonhuman Datasets
We trained models on the SMAL dataset and evaluated on the unseen TOSCA dataset, which contains animal objects with diverse poses (SMAL/TOSCA).

Quantitative evaluation. As shown in Table 1, LTENet achieved the best performance on SMAL/TOSCA in terms of the acc at 1% tolerance and err. The significant pose and shape differences between SMAL and TOSCA affect 3D-CODED and Elementary Structures, which rely on a single standard template, e.g., a standing cat; they struggle to handle shapes in different categories and various poses in the TOSCA test pairs. The proposed LTENet achieves an acc of 38.1% at 1% error tolerance, an increase of 4.3% in absolute accuracy over DPC's best acc of 33.8%. The detailed correspondence accuracy under different error tolerance values can be found in Figure 5 (left), which shows that our method obtains a substantial improvement over other methods.

Qualitative evaluation. Figure 5 (right) provides visual results on the TOSCA test pairs, which verify that our method generates more accurate correspondence predictions. More results can be found in the appendix.

Model Robustness under Presence of Noise
We investigate the robustness of the learned embeddings by perturbing the test dataset with Gaussian noise in the SMAL/TOSCA setting, which is particularly challenging because the noise corrupts the underlying shape structure. Specifically, we select DPC as the competitive baseline. For the test samples from the TOSCA dataset, we add Gaussian noise with zero mean and different standard deviations, i.e., 0.001, 0.005, and 0.01, to the source shapes.
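The perturbation protocol can be sketched in a few lines; the function name and the fixed seed below are our own choices for reproducibility, not taken from the paper.

```python
import numpy as np

def perturb(points, sigma, seed=0):
    """Add zero-mean Gaussian noise with std sigma to an (N, 3) point cloud."""
    rng = np.random.default_rng(seed)
    return points + rng.normal(0.0, sigma, size=points.shape)
```

Applying this with sigma in {0.001, 0.005, 0.01} to each source shape reproduces the three noise levels evaluated in Figure 6.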
As can be seen in Figure 6, our approach outperforms the state-of-the-art DPC in terms of correspondence accuracy and shows comparable performance in correspondence error. Our method demonstrates moderate resilience against noise; the added noise substantially reduces correspondence accuracy under small error tolerances, e.g., less than 5%. A similar experiment conducted in the SURREAL/SHREC setting can be found in the appendix.

Comparisons between DPC and LTENet
Our LTENet highlights a novel approach to learning locally linear shape embeddings capable of capturing the underlying structure of the shape manifold, which we achieve by marrying LLE with the construction of high-dimensional neighborhood-preserving shape embeddings. We built our architecture following the self- and cross-reconstruction framework in DPC [19]. Though both seek to learn good embeddings, our LTENet encourages the best locally linear alignment between shape embeddings without ambiguity via the closed-form expression of the reconstruction weights. Table 2 demonstrates this benefit by summarizing our key results and additional results from the appendix of DPC. We clarify that E_r (the smoothness term) is the same mapping loss as in DPC. Because DPC lacks a mechanism to enforce local linearity of embeddings, E_r is required in DPC for better performance; without this term, DPC suffers a significant drop from an acc of 17.7% to 11.4%. Our LTENet significantly outperforms DPC under the same setting by enforcing suitable manifold learning for shape correspondence.
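For readers unfamiliar with the closed-form reconstruction weights from classical LLE that this comparison hinges on, a minimal sketch is shown below. The regularization of the local Gram matrix is a common stabilization choice and the variable names are ours, not taken from the paper's implementation.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Closed-form LLE weights reconstructing x from its K nearest neighbors.

    x: (D,) query point; neighbors: (K, D) its K nearest neighbors.
    """
    diff = x[None, :] - neighbors                       # (K, D) displacements
    C = diff @ diff.T                                   # local Gram matrix
    # Regularize the (possibly singular) Gram matrix for numerical stability.
    C += reg * np.trace(C) * np.eye(len(neighbors))
    w = np.linalg.solve(C, np.ones(len(neighbors)))     # solve C w = 1
    return w / w.sum()                                  # weights sum to one
```

Because the weights come from solving a small linear system, they are unique (given the regularizer), which is the "without ambiguity" property the text refers to.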

Combining Locally Linear Embeddings with Functional Map-inspired Globally Linear Transformations
An interesting exploration is to understand the performance gap between the learned embeddings from our LTENet and their optimal embeddings transformed via globally linear transformations computed from the ground-truth correspondence. Inspired by the functional map paradigm [14], [23], where
shape embeddings are related by a linear transformation, given the fixed point-to-point correspondence matrix $\Pi_{XY}$ and a pair of shapes $X$ and $Y$, we treat our learned embeddings $F_X$ and $F_Y$ as the fixed bases and retrieve the optimal linear transformation $A_{XY} = ((F_X)^{\dagger} \Pi_{XY} F_Y)^T$ (see the detailed derivation in Appendix C). Under both the SURREAL/SHREC and SMAL/TOSCA settings, we evaluate (1) the matching using the learned embeddings $F_X$ and $F_Y$; (2) the matching estimated by finding nearest neighbors between $F_X A_{XY}^T$ and $F_Y$; and (3) the matching obtained by replacing our embeddings with DPC's embeddings and following the evaluations in (1) and (2).
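The retrieval of the optimal linear transformation can be sketched with the Moore-Penrose pseudoinverse. The formula follows the text; the synthetic data in the usage below is our own.

```python
import numpy as np

def optimal_linear_map(F_x, F_y, Pi):
    """A_XY = ((F_X)^† Π_XY F_Y)^T.

    F_x, F_y: (N, D) shape embeddings; Pi: (N, N) point-to-point
    correspondence (permutation) matrix. Returns A_XY of shape (D, D).
    """
    return (np.linalg.pinv(F_x) @ Pi @ F_y).T
```

With this map, matching reduces to nearest neighbors between `F_x @ A.T` and `F_y`, since `F_x @ A.T` approximates `Pi @ F_y` in the least-squares sense.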
The results are summarized in Figure 7. Interestingly, we observe consistent improvements after applying the additional linear transformation (Opt) to LTENet in both SURREAL/SHREC and SMAL/TOSCA. DPC achieves better performance in SURREAL/SHREC with the optimal linear transformation; however, it suffers a significant performance drop in SMAL/TOSCA. This suggests a potential mismatch problem: the embedding obtained from DPC is suitable for shape correspondence, but it is not necessarily suitable for use as the basis in a functional map framework. Accordingly, it implies that DPC's shape embeddings are not related by a linear transformation in its higher-dimensional embedding space. Though our approach does not suffer from such a mismatch problem, we acknowledge that only a limited study was conducted on standard datasets covering human and nonhuman shapes. Future work should further investigate the mismatch problem by harmonizing universal embeddings and linear transformations, ideally in an unsupervised learning setting, which is promising given the great progress made in unsupervised functional map learning [17], [21], [47].

Evaluation on Real-world Data
Inspired by [18], [19], we examine the generalization and robustness of our model by visualizing correspondence predictions between a pair of real-scanned shapes from the Scan the World project collection [79], shown in Figure 7(c). We create the test point clouds by randomly sampling 1024 points from each original 3D shape model. Despite the significant differences in pose and the challenging non-isometry between these two shapes, our approach shows its resilience by producing reliable correspondence results.

Ablation Study
We conduct an ablation study to evaluate the comparative effectiveness of the different components in LTENet under controlled experiments. All ablated versions were trained and evaluated following the SURREAL/SHREC setting.
The effect of embedding dimension. Given X and Y, we extract their nonlinear feature embeddings $F_X, F_Y \in \mathbb{R}^{N \times D}$, respectively, via a neural network F. The embedding dimension D should be adjusted to balance between overfitting and underfitting while remaining efficient. Table 5 demonstrates that the model with D = 512 gives good generalization performance while being efficient; we used this value for LTENet thereafter.

DISCUSSION
Our experimental results showed that LTENet achieves superior performance compared to state-of-the-art unsupervised shape correspondence methods. We attribute this to a learning mechanism capable of capturing the underlying structure of the shape manifold and fully exploiting the Euclidean geometry within its local neighborhoods.
Our approach encourages the best locally linear alignment between shape embeddings without ambiguity via the closed-form expression of the reconstruction weights. The local linearity used in our approach leads to implicit regularization and universal/canonical embeddings for a pair of shapes in correspondence. We demonstrated the performance gap of our learned embeddings with and without additional functional map-inspired globally linear transformations, and showed consistent improvements made by the additional linear transformations. We also observed a mismatch: embeddings learned by an unsupervised shape correspondence method are not necessarily suitable for use as basis embeddings in the classic functional map framework.

Limitations and Future Work
It has been demonstrated that shape correspondence approaches struggle to disambiguate shape symmetries [19], [80], [81], [82]. In our work, we also observed this symmetry issue: it leads to noisy training signals by wrongly matching components between shapes, e.g., associating the left hand of one human with the right hand of another due to their opposite orientations. Future work should handle symmetry by exploiting shape priors to impose additional regularization or constraints on embedding learning. Many extensions of LLE, such as modified locally linear embedding (MLLE) [83], LLE with geodesic distances [84], and LLE with penalty functions [68], have been proposed to further improve LLE; it is promising to adapt these advanced designs to shape correspondence for better performance. The discovered mismatch problem suggests that learning embeddings suitable for use as bases in the functional map framework is a promising direction toward a unified shape correspondence framework, particularly in the unsupervised learning setting.
In this work, we focused on the matching problem between point cloud shapes. In the future, we plan to extend our method to matching problems in other modalities, e.g., images and meshes, and to cross-modality matching problems, e.g., images to point clouds.

CONCLUSIONS
We have presented a novel approach to unsupervised shape correspondence learning between pairs of point clouds. LTENet is unique in that it introduces an LLE-inspired algorithm that represents maps between shapes as locally linear transformations in high-dimensional embedding spaces and leads to the learning of universal/canonical embeddings for shapes in correspondence. The embedding learning is driven by minimizing a suitable divergence measure between the LLE cross-reconstructions of source and target point clouds. Remarkably, LTENet achieves state-of-the-art performance on standard benchmark datasets while showing strong model generalization across datasets with efficient training and inference.

Fig. 2 .
Fig. 2. The pipeline overview of LTENet: (1) extract nonlinear shape embeddings $F_X$ and $F_Y$, given X and Y; (2) select the top-K neighbors for each feature embedding $f^X_i$ based on the cosine similarity between $F_X$ and $F_Y$; (3) estimate the locally linear transformations following Equations 2 and 3 to best reconstruct $F_X$ using $F_Y$, denoted as $F_{\hat{Y}}$; (4) reconstruct a shape $\hat{Y}$ following Equation 6; (5) learn the embedding network F by minimizing the divergence $D_{CS}(P(\hat{Y}), P(Y))$; and (6) determine the correspondence using nearest neighbors between embeddings.
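Step (6) of the pipeline can be sketched as a simple nearest-neighbor lookup under cosine similarity. This is a minimal illustration with placeholder embeddings, not the paper's implementation.

```python
import numpy as np

def nn_correspondence(F_x, F_y):
    """F_x: (N, D), F_y: (M, D) embeddings -> (N,) match indices into Y."""
    # Normalize rows so the inner product equals cosine similarity.
    Fx = F_x / np.linalg.norm(F_x, axis=1, keepdims=True)
    Fy = F_y / np.linalg.norm(F_y, axis=1, keepdims=True)
    # Each source point matches its most similar target embedding.
    return (Fx @ Fy.T).argmax(axis=1)
```

The same similarity matrix also supplies the top-K neighbors used in step (2), by taking the K largest entries per row instead of only the argmax.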

Fig. 3 .
Fig. 3. Visualization of reconstructed point clouds of the proposed LTENet. We obtain the reconstructed shapes in cross-reconstruction ($\hat{X}$ and $\hat{Y}$) and self-reconstruction ($\tilde{X}$ and $\tilde{Y}$) from models at different training epochs. Starting from a random initialization of the embedding network, it is clear that the reconstructed shapes get closer to the source and target shapes X and Y as training progresses.

Fig. 4 .
Fig. 4. (Left) The correspondence accuracies under different error tolerance values in the SURREAL/SHREC setting. Our method achieves better performance compared to the state-of-the-art DPC. (Right) Visual examples of SHREC test pairs. DPC contains outlier matches, e.g., wrongly matching hands to thighs or feet to hands. LTENet generates more accurate and smoother predictions.

With an err of 5.8 cm, LTENet significantly exceeds the accuracy of DPC by 17.0% and reduces the error by 4.9%. The accuracies in Figure 4 (left) indicate that we achieved a clear improvement, especially for almost-perfect matching with $\epsilon < 5\%$.

Qualitative evaluation. We provide visual examples in Figure 4 (right), showing the clear improvement made by LTENet (see more results in the appendix).

Fig. 5 .
Fig. 5. (Left) The correspondence accuracies under different error tolerance values in the SMAL/TOSCA setting. Our method substantially improves the correspondence accuracies under all tolerance values. (Right) Visual examples of TOSCA test pairs. DPC suffers from prediction errors caused by the difficulty of distinguishing between left and right or rear and front legs. Our method generates more accurate correspondence predictions, closer to the ground-truth correspondence maps.

Fig. 6 .
Fig. 6. The evaluation of correspondence prediction of TOSCA test point clouds in the SMAL/TOSCA setting with additional noise. From (a) to (c), we gradually add stronger Gaussian noise with zero mean and larger standard deviations, i.e., 0.001, 0.005, 0.01, to the source shapes. Our method demonstrates its resilience against noise.

Fig. 7 .
Fig. 7. (a) The performance gap between the learned embeddings from our approach and the transformed embeddings using optimal linear transformations in the SURREAL/SHREC setting. Adding the optimal linear transformation to the learned embeddings significantly boosts the performance of DPC and our approach and leads to nearly perfect correspondence matching. (b) The performance gap evaluated similarly in the SMAL/TOSCA setting. Adding the optimal linear transformation is beneficial to our approach while being harmful to DPC, which suggests a potential mismatch problem: the embedding obtained from DPC is suitable for shape correspondence, but it is not necessarily suitable for use as the basis in a functional map framework where shape embeddings are related by a linear transformation. (c) A qualitative example of correspondence predictions between a pair of real-world scans. Despite the significant differences in pose and the presence of challenging non-isometry, our approach shows its resilience by producing reliable correspondence results.

Fig. 8 .
Fig. 8. (Left) Model performance with different numbers of nearest neighbors; (Right) model performance with different kernel bandwidths. Choosing a suitable bandwidth or number of nearest neighbors leads to better performance.

Fig. 9 .
Fig. 9. Visual examples of SHREC test pairs. The experiment follows the SURREAL/SHREC setting. The shape correspondence mappings are color-coded. In the first six rows, we compare our LTENet against the state-of-the-art DPC [19] and show the clear improvement made by our LTENet in generating more accurate correspondence predictions. The last row shows a typical failure example of LTENet and DPC [19]. Handling symmetry and rotation of shapes remains a challenging problem and requires future investigation.

Fig. 10 .
Fig. 10. The evaluation of correspondence prediction of SHREC test point clouds in the SURREAL/SHREC setting with additional noise. From (a) to (c), we gradually add stronger Gaussian noise with zero mean and larger standard deviations, i.e., 0.01, 0.05, 0.1, to the source shapes.

Fig. 11 .
Fig. 11. Visual examples of TOSCA test pairs. The experiment follows the SMAL/TOSCA setting. The shape correspondence mappings are color-coded. The proposed LTENet generates more accurate correspondence predictions compared to DPC.

TABLE 1
Accuracy and error. The proposed LTENet achieves state-of-the-art shape correspondence performance, indicated by the correspondence accuracy at 1% tolerance (acc, in percent) and the average correspondence error (err, in centimeters).

TABLE 3 Ablation study on model choices.
We conduct the ablation study under the SURREAL/SHREC setting. Columns: Method; $\lambda_{self} = 1$; $\lambda_{cross} = 1$; $\lambda_{reg} = 10$; acc ↑; err ↓.
Choices of the kernel bandwidth σ. As our CS objective is closely related to a fixed-bandwidth KDE with Gaussian kernels, it is important to choose a suitable bandwidth fitting the underlying data distribution of the training dataset: either too large or too small a bandwidth degrades performance (Figure 8, right).
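A simplified reading of this objective, a Cauchy-Schwarz divergence between fixed-bandwidth Gaussian KDEs built over two point sets, can be sketched as follows. The closed-form cross terms and the default bandwidth are our assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def gauss_cross(A, B, sigma):
    """Proportional to the integral of the product of two Gaussian KDEs.

    Uses the closed form: the convolution of two Gaussians with std sigma
    is a Gaussian with variance 2*sigma^2 (normalizers cancel in the ratio).
    """
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (4.0 * sigma ** 2)).mean()

def cs_divergence(X, Y, sigma=0.05):
    """D_CS(P(X), P(Y)) = -log( <p,q>^2 / (<p,p> <q,q>) ) >= 0."""
    pq = gauss_cross(X, Y, sigma)
    pp = gauss_cross(X, X, sigma)
    qq = gauss_cross(Y, Y, sigma)
    return -np.log(pq ** 2 / (pp * qq))
```

The divergence is zero only when the two densities coincide, and the bandwidth sigma directly controls how sharply point-set differences are penalized, which is why its choice matters in the ablation above.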

TABLE 4 Ablation study on the training sample size.
We conduct the ablation study under the SURREAL/SHREC setting.

TABLE 5 Ablation study on the effects of embedding dimension.
We conduct the study under the SURREAL/SHREC setting.

The number of nearest neighbors K. Similarly, we can set different numbers of nearest neighbors for the locally linear transformations, i.e., K = 5, 10, 20, 49. Figure 8 (left) demonstrates that K = 10 leads to better performance.

The impact of training sample size. We trained models by increasing the training sample size from 2,000 to 230,000. Table 4 summarizes the results of training with different numbers of samples and shows that 2,000 samples are sufficient. A subtle difference is the slightly increased err when training with 230,000 samples, which we suspect is due to more symmetric and rotated samples being included in training, thus creating noisy training signals.