Marker Displacement Method Used in Vision-Based Tactile Sensors—From 2-D to 3-D: A Review

The vision-based tactile sensor has proven to be a promising device for sensing tactile information. In such sensors, the marker displacement method (MDM) is the most common method for representing and extracting contact information. It uses the position field and displacement field of a marker array to characterize the original tactile information and achieves multimodal tactile perception by further processing this original information. This article is the first to classify MDM into three typical categories from the perspective of dimensionality: 2-D MDM, 2.5-D MDM, and 3-D MDM. A comparison study is presented with a focus on the principles, characteristics, applications, and distinctions of these three methods. The latest literature is reviewed in support of this analysis. Finally, a summary of these three categories is presented as a helpful reference.


I. INTRODUCTION
ROBOTICISTS are working to develop robots with the ability to perform in unstructured, complex, and changing environments [1], [2]. To achieve greater flexibility and robustness, a robot must be able to perceive, recognize, and understand the environment. Although the success of computer vision has led to significant improvements in robot perception under unstructured natural conditions [3], [4], the perception capability of robots in direct contact with the environment still needs to be improved. Robotic vision cannot provide sufficient perceptual information when visual perception is impaired or the scale of interaction is too small. In such cases, another means of robot sensing is needed as a complement.
Inspired by the physiology of human touch, robotic tactile perception has garnered much research attention [5]. Tactile information has been confirmed to complement visual information and plays a significant role in robotic perception and manipulation tasks [6], [7]. Over recent years, many different types of tactile sensors have been developed, and some of them have been successfully used in multisensory feedback systems for robots and are constantly expanding their application scenarios [8], [9], [10], [11]. Among them, vision-based tactile sensors (also called visuotactile sensors) have emerged as a promising solution for robotic contact perception [12], [13], [14], [15], [16]. They belong to the camera-based type of optical tactile sensors [17]. Unlike flexible tactile sensing arrays, such sensors use a soft elastomer as the contact medium and obtain tactile information from the contact surface through image acquisition devices and postprocessing algorithms. They have drawn interest due to their simple structure, easy preparation, and the ability to obtain rich tactile information through high-resolution visual images.
Vision-based tactile sensors are valued for their capability of sensing multimodal tactile information. By constructing a mechanical model of the soft elastomer and designing the corresponding postprocessing algorithm, multiple tactile sensing functions can be achieved, e.g., surface shape recognition [18], force-field measurement [19], contact region estimation [20], and slippage detection [21]. Notably, in current studies, reconstruction and recognition are commonly achieved through a mapping from the deformation information to each contact modality [12], [16]. Therefore, the deformation of the contact surface is typically regarded as the original tactile information that can be used to reconstruct other contact features.
Fig. 1. [...] [17], [19], [22], [23], [24], [25], [36], [27], [28]. The soft elastomer with a marker pattern serves as the contact part with external objects. Markers in the pattern move when the elastomer deforms, and their movement can be captured by cameras through the designed optical system. Further processing of the displacement information allows for multimodal tactile reconstruction.
Several approaches have been put forward to visualize the contact deformation, chief among these being the marker displacement method (MDM), as shown in Fig. 1. MDM uses the displacement of markers, placed on the surface or inside of the soft elastomer, to reflect the deformation [12]. When the soft elastomer is deformed by external contact, the markers move accordingly. The movement of these markers can be captured by cameras and processed by subsequent algorithms. In particular, vision-based tactile sensors based on MDM have the potential to enable analytical modeling of all kinds of contact mechanics characteristics, provided that suitable algorithms can be designed. Researchers have developed a number of vision-based tactile sensors, representative examples of which include FingerVision [22], GelSlim [19], GelSight [24], GelStereo [25], GelForce [27], and TacTip [17]. These sensors all adopt MDM, mainly or partially, as the approach to contact information representation and extraction.
An in-depth understanding of MDM will help researchers to use it in visuotactile sensors. However, it is noteworthy that the existing research does not strictly distinguish between the different types of such methods. Researchers generally agree that since the solution of contact information is a hyperstatic problem, it is always possible to uniquely determine different types of contact properties as long as enough low-dimensional original information is available. However, when MDM is revisited from the level of dimensionality, the mechanistic differences between existing approaches become apparent. The complete original tactile information is a 3-D field. For MDM, true 3-D measurement can be achieved only when the 3-D information of each marker can be determined completely and reliably. Moreover, since different application scenarios have different requirements for tactile perception, the corresponding MDM shows different properties and performance. Without differentiation, there is a lack of systematic and comprehensive guidance for analyzing and optimizing the characteristics of MDM and for selecting the most suitable method.
Based on related work, we divide the existing MDMs into three categories according to the dimensionality of the deformation information (the details are introduced in Sections II-IV).

A. 2-D MDM
A single camera can acquire 2-D tactile images as the raw material for information extraction. Since contact in both the tangent plane and the normal direction affects the 2-D displacement of the markers in the camera's view, it is possible to extract 3-D tactile information from 2-D tactile images alone. This method requires extracting the 2-D positions of the marker array and using an information fusion approach to obtain 2-D or 3-D contact characteristics.

B. 2.5-D MDM
Since the style and layout of the markers can be changed, other indirect features different from coordinates can be used to represent depth information (e.g., the marker size and shape on the image plane). Thus, by extracting both the 2-D displacement field and such features from the tactile images, pseudo-3-D (2.5-D) deformation can be measured to achieve 2-D or 3-D tactile perception.

C. 3-D MDM
Binocular or multiview imaging makes it convenient to acquire 3-D tactile images through stereo vision and therefore enables the reconstruction of fine 3-D contact information. It can be achieved by adding cameras or by designing the optical system accordingly. This method can provide accurate 3-D coordinates for each sampling point.
This article recategorizes the existing MDMs as the 2-D MDM, 2.5-D MDM, and 3-D MDM, and presents a comparison study on these three different types of MDM. Our discussion focuses on the MDM and therefore does not include other methods commonly used in visuotactile sensing systems (e.g., the photometric stereo-based reflective membrane method [29]). For sensors that use multiple sensing principles at the same time (e.g., GelSight [24] and GelSlim [30]), we only discuss the part of them that is related to MDM. The main contributions of this article include the following.
1) For the first time, we adopt a new analysis based on dimensionality and classify the existing MDM into 2-D MDM, 2.5-D MDM, and 3-D MDM. Based on the latest literature, we review the principles, characteristics, and applications of the three methods in detail. The analysis in this article focuses on the underlying mechanisms of these approaches, the differences between them, and the characteristics that result from such distinctions.
2) We summarize in a comparative manner the basic features, advantages, and disadvantages of the three methods and the scenarios to which each is applicable. Compared with previous reviews, we mainly focus on the differences in the essence of the methods rather than on specific sensor designs. This work can provide a valuable reference for researchers who are interested in applying MDM in fields such as vision-based tactile sensors.

II. 2-D MARKER DISPLACEMENT METHOD
The main feature of the 2-D MDM is the single-camera measurement. It obtains tactile information directly from a single-camera image, as shown in Fig. 2(a). When the soft elastomer with a marker pattern is deformed by external loads, the movement of markers with the deformation can be photographed with a single camera (preferably an orthographic projection). Thus, the 2-D coordinate information of each marker can be recorded in the camera's image coordinate system.
Usually, the obtained 2-D coordinate information is treated as a coupled whole for postprocessing. With single-camera measurement, it is impossible to directly obtain all three coordinate values of each sampling point. Therefore, rather than considering the specific value of each marker, researchers prefer to operate on the tactile image as a whole in the form of fields and to build the mapping from the 2-D displacement (or coordinate) field to other tactile features.
From the perspective of dimensionality, the single-camera measurement method relies only on the 2-D displacement information of the marker array in the camera image space, i.e., the 2-D deformation information. Therefore, we refer to this method as the 2-D MDM.

A. Principles of 2-D MDM
Taking the physical model shown in Fig. 2(b) as an example, we summarize the basic principles of 2-D MDM from related work to realize tactile retrographic sensing. Let the 3-D position of a marker point $P_i^k$ in the sensor coordinate system be $P_i^k(x_i^k, y_i^k, z_i^k)$, and its 2-D position in the image coordinate system be $p_i^k(u_i^k, v_i^k)$, at the $k$th camera frame.
The extrinsic parameter matrix of the camera is $H_e$ and the intrinsic parameter matrix is $H_i$. According to the well-known camera model, the relationship between the image coordinates $p_i^k(u_i^k, v_i^k)$ and the spatial coordinates $P_i^k(x_i^k, y_i^k, z_i^k)$ can be expressed, in the form of homogeneous coordinates, as
$$ s \begin{bmatrix} u_i^k \\ v_i^k \\ 1 \end{bmatrix} = H_i H_e \begin{bmatrix} x_i^k \\ y_i^k \\ z_i^k \\ 1 \end{bmatrix} \tag{1} $$
where $s$ is a scaling factor. If the camera aberration is ignored, $s$ is numerically equal to the vertical distance from the point $P_i^k$ to the center of the camera. Let the 2-D image coordinates of $P_i^{k-1}$ be $p_i^{k-1}(u_i^{k-1}, v_i^{k-1})$ at the $(k-1)$th camera frame. The 2-D displacements $(\Delta u_i^k, \Delta v_i^k)$ in the image coordinate system and the spatial displacements $(\Delta x_i^k, \Delta y_i^k, \Delta z_i^k)$ in the sensor coordinate system are then related by applying (1) to the two consecutive frames and taking the difference.
Equation (1) relates the 3-D spatial coordinates to the 2-D image coordinates. However, this relationship is not a one-to-one mapping. In other words, solving for the position information with three unknown quantities from two image coordinates is an underdetermined problem. As long as the number of independent equations equals or exceeds the number of unknown variables, i.e., the number of markers is at least $3n/2$, it is still possible to inversely recover the 3-D tactile information of $n$ discrete points from 2-D displacement information. Therefore, $2N$ sets of 3-D coordinates can be obtained from the 2-D coordinates of $3N$ markers if a proper matrix $H_{X \to U}$ is chosen, and the relationship between the 2-D displacement and the 3-D displacement is described analogously by a matrix $H_{\Delta X \to \Delta U}$:
$$ U = H_{X \to U} X, \qquad \Delta U = H_{\Delta X \to \Delta U} \Delta X $$
where $U$ and $\Delta U$ stack the 2-D coordinates and displacements of the $3N$ real marker points, and $X$ and $\Delta X$ stack the 3-D coordinates and displacements of the $2N$ points to be recovered.
The $2N$ sets of 3-D coordinates obtained do not necessarily belong to the real marker points. We can refer to them as "virtual" marker points. This means that by using the 2-D coordinates of $3N$ real marker points, we can obtain the 3-D coordinates of $2N$ virtual marker points. Although these marker points are not real, they constitute a displacement field that can reflect the contact deformation, thus allowing the sensor to obtain pseudo-3-D contact information. $H_{X \to U}$ and $H_{\Delta X \to \Delta U}$ are invertible matrices reflecting the relationship between 2-D and 3-D information, which are determined by the selected positions of the virtual marker points and by the internal and external parameters of the camera. Moreover, when the ratio between the number of 2-D coordinates obtained and the number of 3-D coordinates required exceeds a critical value, the problem becomes an overdetermined one. In this case, $H_{X \to U}$ can be calculated using least squares [31].
The above discussion shows that it is theoretically possible to reconstruct 3-D contact characteristics by obtaining redundant 2-D coordinate information (this does not mean that the mapping relationship must be constructed through a matrix). Since such 3-D information obtained by 2-D MDM is indirectly confirmed, we can call it pseudo-3-D information. Besides, in many cases, the acquisition of 2-D contact properties alone is sufficient (e.g., slip and rotation measurement [32], [33]). For this class of problems, the dimensionality of the original tactile information can be reduced, and only 2-D position and displacement fields obtained directly are required to be extracted. The idea of reconstructing 2-D contact properties based on 2-D coordinate information is similar to the above 3-D approach.
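The non-uniqueness of the projection in (1) can be illustrated with a minimal Python sketch. The intrinsic values (a 500-pixel focal length and a principal point at (320, 240)), the identity extrinsic matrix, and the helper names `matmul` and `project` are assumptions chosen for illustration only and do not correspond to any sensor discussed here. Two different 3-D points on the same camera ray yield the same pixel, which is why a single marker image cannot fix all three coordinates.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def project(point_xyz, h_i, h_e):
    """Apply s*[u, v, 1]^T = H_i H_e [x, y, z, 1]^T and return (u, v, s)."""
    x, y, z = point_xyz
    homogeneous = [[x], [y], [z], [1.0]]
    su, sv, s = (row[0] for row in matmul(matmul(h_i, h_e), homogeneous))
    return su / s, sv / s, s


# Assumed intrinsics: focal length 500 px, principal point (320, 240).
H_I = [[500.0, 0.0, 320.0],
       [0.0, 500.0, 240.0],
       [0.0, 0.0, 1.0]]
# Assumed extrinsics: the sensor frame coincides with the camera frame.
H_E = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0]]

near = (2.0, 1.0, 20.0)  # a marker position in mm
far = (3.0, 1.5, 30.0)   # a different point on the same camera ray

u1, v1, s1 = project(near, H_I, H_E)
u2, v2, s2 = project(far, H_I, H_E)
print((u1, v1, s1), (u2, v2, s2))
```

In this setup the scaling factor `s` returned for each point equals its depth along the optical axis, while the pixel coordinates coincide, so the missing dimension has to be supplied by redundancy across markers or by a learned or modeled mapping.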

B. Technologies and Implementation
Section II-A introduced the basis of acquiring 2-D or pseudo-3-D tactile properties using 2-D displacement information (or position information) in 2-D MDM. In addition, two technical problems need to be solved to implement this method in sensors.
1) Extract the 2-D Deformation: There are two frequently used approaches to extract the deformation: obtain each marker's displacement (or coordinate) through recognition and tracking, or directly use the 2-D tactile image as the input for end-to-end learning.
a) Marker recognition and tracking: In a real sensor, the markers are a series of objects with a certain size [16]. The GelSight sensor used ink dots as markers [24], the GelForce sensor embedded markers of two colors (red and blue) [27], and the TacTip sensor used pin-shaped markers distributed on the inner wall [34]. Existing studies regard the geometric center of these markers as the target of position and displacement measurement. Therefore, the commonly used method is to extract the area covered by the marker's image using algorithms such as blob detection and to calculate the 2-D position of the blob's centroid [18], [24], as shown in Fig. 3(a). Liu et al. [35] proposed a learning-based marker localization network called Marknet, which improved the precision compared with the traditional detection method.
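The centroid-extraction step can be sketched as follows. This is a simplified stand-in for blob detection rather than the procedure of any cited sensor: it assumes the tactile image has already been thresholded into a binary array, uses 4-connectivity, and the function name `marker_centroids` and the parameter `min_area` are hypothetical.

```python
from collections import deque


def marker_centroids(binary, min_area=1):
    """Return (u, v) centroids of 4-connected foreground blobs.

    `binary` is a list of rows of 0/1 values; u is the column index
    and v is the row index.
    """
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    centroids = []
    for v0 in range(height):
        for u0 in range(width):
            if not binary[v0][u0] or seen[v0][u0]:
                continue
            # Breadth-first flood fill collects all pixels of one blob.
            queue = deque([(u0, v0)])
            seen[v0][u0] = True
            pixels = []
            while queue:
                u, v = queue.popleft()
                pixels.append((u, v))
                for du, dv in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    un, vn = u + du, v + dv
                    inside = 0 <= un < width and 0 <= vn < height
                    if inside and binary[vn][un] and not seen[vn][un]:
                        seen[vn][un] = True
                        queue.append((un, vn))
            if len(pixels) >= min_area:
                # The centroid is the mean pixel position of the blob.
                cu = sum(p[0] for p in pixels) / len(pixels)
                cv = sum(p[1] for p in pixels) / len(pixels)
                centroids.append((cu, cv))
    return centroids


image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
centroids = marker_centroids(image)
print(centroids)
```

Raising `min_area` is one simple way to reject isolated noise pixels before the centroids are passed on to tracking.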
Besides, the markers identified are unorganized point sets, and the displacement of these points needs to be obtained by tracking the same marker points in consecutive frame images. Currently, three types of tracking methods are mainly used.
Type 1: An easy-to-implement approach is to search near a marker point of the previous frame to find the marker point with the closest distance to it in the current frame [18], [27], as shown in Fig. 3(b). The above approach is based on the fact that the marker position changes very little in the adjacent frame images. When the distance between two points is less than a threshold value, these two points can be considered identical, and a matching relationship can be established to achieve marker tracking.
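A minimal sketch of this nearest-neighbor matching is given below. It is a greedy variant written for illustration, not the implementation used in [18] or [27]; the function name `track_nearest` and the threshold value are assumptions.

```python
import math


def track_nearest(previous, current, max_distance):
    """Match each previous-frame marker to its nearest current-frame marker.

    Returns a dict {previous_index: current_index}. A previous marker stays
    unmatched when no unused current marker lies within `max_distance`.
    """
    matches = {}
    used = set()
    for i, (pu, pv) in enumerate(previous):
        best_j, best_d = None, max_distance
        for j, (cu, cv) in enumerate(current):
            if j in used:
                continue
            d = math.hypot(cu - pu, cv - pv)
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None:
            matches[i] = best_j
            used.add(best_j)
    return matches


prev_pts = [(10.0, 10.0), (30.0, 10.0), (50.0, 10.0)]
curr_pts = [(31.0, 11.0), (10.5, 10.0), (80.0, 10.0)]  # unordered detections
matches = track_nearest(prev_pts, curr_pts, max_distance=5.0)
displacements = {
    i: (curr_pts[j][0] - prev_pts[i][0], curr_pts[j][1] - prev_pts[i][1])
    for i, j in matches.items()
}
print(matches, displacements)
```

In this example the third marker has moved farther than the threshold and is therefore left unmatched, which mirrors the failure of this approach when inter-frame displacements are large.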
Type 2: Related studies have also achieved nonrigid and rigid matching for active tracking by an orderly organization of the markers. Ito et al. [20] used the regularity of marker array in spatial layout to assign an identification number to each marker. Therefore, each point could be tracked even if the markers moved quickly or the displacements of them were large. Choi and Tahara [36] used the virtual marker method to integrate 2-D point cloud information, thus estimating 3-D contact locations. However, since the arrangement of the marker array can also be affected by the deformation, such methods may also fail under heavy load. Different from them, Li et al. [26] proposed the continuous marker pattern (CMP) to build a physical association in the texture space between each measured marker point. Compared with discrete marker patterns, the CMP method can support rigid point-set registration, but it may also increase the difficulty of marker recognition.
Type 3: The array of markers can be replaced with a higher density of scattered spots, which makes it suitable for effective methods such as optical flow and dense inverse search [28], [37]. Du et al. [37] used a random color pattern to characterize the deformation and estimated depth information based on the dense optical flow method and Gaussian density feature extraction. Wang et al. [38] used particle image velocimetry (PIV) to process tactile images with semimarkers. The Viko sensor used random pixel markers and the dense optical flow method to obtain contact areas and shear forces [39], [40]. Random color patterns and the corresponding algorithms can also be applied to other scenes with perception requirements, such as the robot leg system [41] and the microlens array (MLA)-based electronic skin [42].
b) End-to-end prediction of tactile images: Since the marker points are in dynamic motion during contact, the marker recognition and tracking process is often affected by many factors. For example, markers can be lost if they move out of the camera frame or overlap with each other under large distortion, and external light leakage or internal lighting reflections can reduce the reliability of recognition and tracking. This issue remains a common problem in the vision-based sensors community.
A feasible idea is to directly predict deformation information from tactile images in an end-to-end manner. The convolutional neural network (CNN) method is a typical example [43]. It does not need explicit processing steps to detect and track markers but directly uses tactile images with feature information as the training data. There are differences in the information present and feature selection, e.g., GelSight can have both marker and depth information on the same tactile image, while the tactile images of TacTip only have markers. By designing the appropriate network architectures, end-to-end prediction methods are able to obtain 2-D and 3-D deformation information directly and even other types of tactile properties.
Compared with marker recognition and tracking, the learning-based approaches are valued for their robustness. When markers are lost or move out of the image frame, the standard marker recognition and tracking method may struggle, while an end-to-end prediction method such as a CNN can still give acceptable results. However, their major limitation is the requirement for large amounts of training data, which may lengthen the sensor's manufacturing cycle and reduce the algorithm's generality.
Since the above two methods are usually combined in existing sensors to meet different requirements, we will not strictly distinguish them in the following introduction.
2) Build the Mappings Between 2-D Deformation and 3-D Contact Characteristics: In the discussion of Section II-A, the process of mapping from 2-D coordinate information to 3-D coordinate information is expressed in the form of matrices. In practical applications, the mapping results can be replaced with other types of contact characteristics (e.g., distributed force), and the mapping relationship H X→U does not need to be linear or explicit. A tactile sensor can have corresponding sensory functions by constructing a mapping relationship from a 2-D displacement field (or 2-D tactile image) to specific 3-D tactile information similar to (5) or (6). Such tasks are usually achieved through model-based analytical approaches (e.g., by constructing finite-element models of deformation [44], [45]) and are also suited to be combined with machine learning techniques [17], [24], [46].
GelSight and TacTip are the most widely studied sensors that apply (or partially apply) 2-D MDM. The GelSight sensors mainly use reflective surfaces and the photometric stereo method to measure contact geometry and can also carry printed markers to measure force and moment [24]. For GelSight versions with markers, multilayer neural networks were used to process the marker array for obtaining the relationship between 2-D displacements and 3-D forces [47], [48]. In this process, the deformation reflected by the markers was converted into force and slip information through 2-D MDM [49], which was then fed into the visuotactile learning models as part of the input. In addition, model-based approaches were also used to handle perception tasks under dynamic contact. Kolamuri et al. [33] proposed a rotation measurement algorithm to detect rotational patterns and displacement, which helped to improve grasp stability. Huang et al. [50] proposed a physics-inspired model to deal with the problem of liquid oscillation. Experiments showed that this model could estimate liquid properties under dynamic contact with high precision.
The TacTip sensors adopted a biomimetic design by emulating the human fingertip's internal structure [34]. The soft elastomer of TacTip was embedded with nodular pins (mimicking the dermal papillae and intermediate ridge structure of the dermal-epidermal boundary), and contact information perception could be realized through 2-D MDM processing of pin images. In the existing research, a series of probabilistic classifiers, training models, and control strategies were used to achieve 3-D real-time tactile interaction [17]. For example, Lepora [51] used the support vector machine (SVM) classifier, the Gaussian process regression (GPR) model [52], and CNNs [53] to complete different tactile perception tasks. Such technology was further integrated into the shadow modular grasper, where tactile features were extracted from high-dimensional original images through multivariable linear models [54]. Besides, some new designs have been developed based on the standard TacTip. For example, the miniaturized TacTip was integrated into the Pisa/IIT SoftHand and could use the structural similarity index measure (SSIM) and the CNN to obtain contact feedback, thereby achieving closed-loop control of the robot hand [55]. TactTip-SoftH shows that vision-based tactile sensors can reach a size comparable to the human fingertip. The T-MO robotic hand used a design similar to TacTip to develop underactuated fingers with tactile perception and used SVMs to realize slip detection and other functions [56], [57].
In other relevant studies, the construction of such mappings is commonly realized through modeling and learning. The GelForce sensor calculated the planar displacements by matrix relations, thus compensating for the lost dimension [27]. Fang et al. [58] used a back propagation (BP) neural network to obtain the displacements of marker arrays for calculating 3-D force vectors. Zhang et al. [59] developed the FingerVision sensor (different from the FingerVision sensor proposed by Yamaguchi and Atkeson [22]) and used convolutional long short-term memory (LSTM) networks to achieve slip detection. Besides, they used the Helmholtz-Hodge decomposition algorithm to resolve the pattern of marker displacement, thus enabling the detection of contact force and slip [60]. Zhang et al. [61] proposed a shape detection and correction algorithm based on the k-nearest neighbor (KNN) algorithm used in FingerVision, which can guarantee almost 100% recognition accuracy under given experimental conditions. Sferrazza and D'Andrea [62] and Sferrazza et al. [63] used markers randomly distributed in different depth layers to reconstruct the force distribution information through optical flow, finite element, and deep learning methods. The DelTact sensor proposed a 3-D contact reconstruction method using 2-D tactile images by convex optimization modeling of contact geometry and projection relationship [64]. In addition, technologies from other fields are also being migrated into the field of visuotactile sensing (e.g., neuromorphic vision-based sensing [65], [66]). We believe that such attempts will provide new ideas for constructing perception mappings and further expand the applicable scenarios of 2-D MDM.

C. Related Applications
Based on the above discussion, 2-D MDM can derive the deformation information from 2-D displacement fields of real markers for a variety of modalities, including contact spatial surface [18], force distribution [63], slip field measurement [32], rotation measurement [33], geometric features detection [53], and dynamic tactile sensing [50], among others. Using these obtained contact characteristics, visuotactile sensors can provide tactile information of different dimensions and focuses for tasks including robot grasping, operation, and active measurement [15]. Since 2-D MDM widely adopts data-driven approaches, such processes can be implemented in two steps. First, tactile features and benchmarks are extracted from 2-D tactile images through learning or modeling methods. Then, these results are used as input for the control of tasks such as grasping and operation.
In the existing research, Song et al. designed a control strategy to change the sensors' perception characteristics to deal with different stages of the operation process. They applied this strategy to skewering deformable food using the GelSight and FingerVision sensors. Wang et al. [68] proposed a SwingBot robot integrated with GelSight. They extracted physical features from tactile information through end-to-end self-supervised learning, thus achieving accurate swing-up manipulation. Wilson et al. [69] designed a two-finger robot gripper. The gripper was equipped with multiple GelSight sensors and could realize grasping and in-hand operation through the use of shear and tensile stress. She et al. [70] input the cable pose, contact area, and force estimation information obtained by GelSight into a pose controller and a grasping controller, thus realizing the task of following a dangling cable. Lepora and Lloyd [53] and Lloyd and Lepora [71] proposed a novel pose-based robot servo and pushing method using TacTip. They used CNN methods to predict the 3-D pose of the object's edge or surface relative to the tactile sensor. The key aspect of the control system was to make the expected pose insensitive to the shear deformation of the sensor. Besides, the tactile features acquired by sensors based on 2-D MDM were also directly used in tasks such as object characteristics measurement [72], [73].

D. Discussion
Due to the reduced requirements for optical systems and marker patterns, 2-D MDM is relatively simple, stable, and highly flexible. For applications where only 2-D tactile information needs to be utilized, 2-D MDM can be performed at a high speed and with low hardware expense. In tasks such as shape detection, the array of markers can be replaced with a higher density of scattered spots to satisfy effective methods such as dense inverse search [28], [37]. Moreover, by choosing the appropriate machine learning techniques (e.g., CNN), 2-D MDM methods can perform well under specific operating conditions. In such cases, the measurement effectiveness depends mainly on the selection of features, the quality of the raw data, and the design of the learning framework.
However, since the depth information of each marker cannot be directly obtained, the quality of the measurement results depends heavily on the algorithms and the models. Postprocessing cannot fully compensate for the missing dimension of the original information, which could lead to significant errors between the measured and true values in some cases. In other words, it is difficult to guarantee the accuracy of 2-D MDM at the level of fine details. From this point of view, 2-D MDM methods tend to be more suitable for measuring overall contact properties (e.g., concentrated forces and moments [67], [74]). In addition, a different mapping model needs to be trained to reconstruct the tactile information for each modality, such as morphology, slip, and contact force distribution. This can cause high sensor overhead and low generality, and the interpretability and generalization performance need to be improved [15].

III. 2.5-D MARKER DISPLACEMENT METHOD
In 2-D MDM, markers reflect the deformation information of single-point sampling. The usual processing is to extract the geometric center (i.e., the marker point) of each marker's image in the camera plane and calculate the position and displacement of the markers for subsequent use (as introduced in Section II-B).
The 2.5-D MDM is likewise a type of MDM that uses a monocular camera but adopts information supplements on this basis. This method uses some selected features to indirectly represent the depth information while acquiring the 2-D position information of the markers, as shown in Fig. 4(a). In other words, the 2-D displacement field is used to reflect the horizontal information of the markers in the sensor coordinate system, and the depth information obtained from the indirect measurement is used to reflect the vertical information. The camera converts these two parts of information into a tactile image at the same time during the shooting process. In this way, the information supplement method can indirectly realize the 3-D information measurement of discrete sampling points.
From the perspective of dimensionality, the information supplement method relies not only on the 2-D displacement information of the markers but also on the features that represent the depth information. Such feature quantities can indirectly characterize the z-direction position and displacement but have a certain degree of information loss and distortion. It means that this method can only reflect the third dimension of the deformation information to a certain extent. Therefore, we refer to this method as the 2.5-D MDM.

A. Selection of Indirect Features
The core of the information supplementation method is to select and implement depth information representation. Based on the characteristics of the visuotactile sensor, the selected feature quantity preferably has the following properties.
1) The feature can be associated with each marker. In order to obtain full 3-D field information, all markers should provide a specific feature bound to themselves. In other words, this feature quantity should be an endowed property of each marker, such as its shape, size, and color.
2) The relationship between the change in feature quantity and the longitudinal displacement is uniquely determined. Each marker's feature quantity can change with the elastic deformation of the material when the external load acts on the soft base component. Generally, a unique and stable analysis model is necessary for characterizing the depth information (e.g., the marker's image size changes approximately linearly with the indentation, and the marker eccentricity changes with the sine of the surface angle). More complex nonlinear models can be constructed using data-driven approaches (such as CNN).
3) The information characterization of this feature can be effectively implemented through the preparation process and algorithm design. Since the tactile perception task mainly occurs in small-scale contact scenarios, the markers are usually small and densely arranged. Therefore, the selection of features needs to consider the feasibility of fabrication and the difficulty of recognition.
The existing research [75], [76], [77], [78], [79], [80], [81], [82] mainly uses the size change of each marker's image in the camera plane as the indirect feature (the details are introduced in Section III-C). When the sensor's soft elastomer is deformed under contact, the markers move accordingly. Take the marker $p$ as an example: for loads acting in the horizontal direction (such as tangential force and torque), the main displacement direction of marker $p$ is parallel to the camera plane, and thus, the change in the marker's imaging dimension is not distinct. For the normal force, the marker $p$ moves toward the camera and may increase in size due to elastic stretching (from $d_1$ to $d_2$), making the marker's imaging dimension on the image plane larger (from $x_1$ to $x_2$).
Another typical feature is the change in the marker shape. For circular markers, when extrusion or shearing occurs, the image of a marker on the camera plane is distorted into an ellipse. Therefore, the change in shape eccentricity can provide information about the skin angle, namely, the spatial gradient of the indentation field.
The above discussion shows that the size or shape variation of the markers in the image is a suitable feature for reflecting the deformation information in the z-direction. Since there are few representative works using other indirect features in the field of visuotactile sensing, we mainly introduce the method based on the geometric changes of the marker image in the following subsections (the other methods are shown in Section III-D).
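As an illustration of the two geometric features discussed above, the following minimal sketch computes the imaged size and the eccentricity of a single marker from the pixels of its segmented blob. It is not taken from any of the cited sensors: the function name `marker_features`, the use of second-order image moments, and the equivalent-circle diameter are all assumptions chosen for simplicity, and the blob is assumed to be already segmented.

```python
import math

def marker_features(pixels):
    """Return (area, equivalent_diameter, eccentricity) of one marker blob.

    `pixels` is a list of (x, y) pixel coordinates belonging to the blob
    (hypothetical input; segmentation is assumed to be done elsewhere).
    """
    n = len(pixels)
    if n == 0:
        raise ValueError("empty blob")
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    # Second-order central moments, normalized by the blob area.
    mxx = sum((x - cx) ** 2 for x, _ in pixels) / n
    myy = sum((y - cy) ** 2 for _, y in pixels) / n
    mxy = sum((x - cx) * (y - cy) for x, y in pixels) / n
    # Eigenvalues of the 2x2 moment matrix scale with the squared semi-axes.
    half_trace = (mxx + myy) / 2.0
    root = math.sqrt(((mxx - myy) / 2.0) ** 2 + mxy ** 2)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 0.0)
    ecc = math.sqrt(1.0 - lam_minor / lam_major) if lam_major > 0 else 0.0
    # Diameter of the circle with the same pixel area.
    diameter = 2.0 * math.sqrt(n / math.pi)
    return n, diameter, ecc
```

Tracking the diameter over time gives a size feature, while the eccentricity is zero for a circular image and grows as the image becomes elliptical.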

B. Principles of 2.5-D MDM
Unlike 2-D MDM, 2.5-D MDM depends on indirect features related to the markers. Therefore, the type of marker determines the information available from the marker displacement. According to the attachment between the markers and the soft elastomer, we divide the existing markers into three categories.
1) Marker spheres that are embedded in the soft elastomer (e.g., GelForce [27]). Such markers are indirectly connected to the skin, and their deformation is usually ignored. We can denote them as rigid markers.
2) Marker dots that are prepared by printing or etching (e.g., GelSight [24]). Such markers can move and deform with the squeezing and shearing of the skin, and we can call them deformable markers.
3) Markers that are attached by stiff pins to the sensor's soft skin (e.g., TacTip [34]), which can be referred to as pin-attached markers.
For the rigid markers and the deformable markers, the geometric change of each marker's image is used in 2.5-D MDM, as shown in Fig. 4(b). An ideal camera with focal length $f$ is placed directly below the marker layer. The marker $p$ lies at a horizontal distance $r$ from the main optical axis of the lens and at a normal distance $h$ from the lens. A load $F_N$ acting along the vertical direction on the contact surface deforms the soft elastomer and displaces $p$ along the normal direction by a small amount $\Delta h$. We assume that the displacement of the marker $p$ in the horizontal direction is zero. Let the effective dimensions of $p$ be $d_1$ and $d_2$ before and after the deformation occurs, respectively. According to the pinhole imaging model,
$$x_1 = \frac{f d_1}{h}, \qquad x_2 = \frac{f d_2}{h - \Delta h} \tag{7}$$
where $x_1$ and $x_2$ denote the imaging dimension of $p$ in the camera plane before and after the deformation, respectively. We define the relative change rate of the imaging dimension as $\alpha$, and therefore,
$$\alpha = \frac{x_2 - x_1}{x_1} = \frac{d_2}{d_1} \cdot \frac{h}{h - \Delta h} - 1. \tag{8}$$
To derive the relationship between $d_1$ and $d_2$, we consider the different marker attributes.
Type 1: The case of rigid markers is shown in Fig. 4(c). Let the diameter of the marker sphere $p$ be $d$; the apparent dimension of $p$ before and after the deformation then follows from the viewing geometry in Fig. 4(c). Assuming that the marker size is small enough relative to the distance of the marker from the camera (i.e., $d \ll h$ and $r$), $d_1$ and $d_2$ can be expressed in the simplified form (11). Since the contact deformation is small (i.e., $\Delta h \ll h$ and $r$), the relative change rate (12) is obtained by substituting (11) into (8).
Type 2: The case of deformable markers is shown in Fig. 4(d). Let the diameter of the marker dot $p$ be $d$. Initially, the dimension of $p$ is its original value, $d_1 = d$. As the load is applied, elastic stretching increases the marker's actual size during the movement. We can use a uniaxial compression model to represent the deformation of the marker $p$ [75]. Under the assumption of linear elasticity, the ratio of transverse strain to longitudinal strain is determined by Poisson's ratio, and thus,
$$\frac{d_2 - d_1}{d_1} = \nu \frac{\Delta h}{t} \tag{13}$$
where $\nu$ denotes Poisson's ratio and $t$ denotes the effective thickness of the soft elastomer. Therefore, the relative change rate can be calculated by substituting (13) into (8) as
$$\alpha = \left(1 + \nu \frac{\Delta h}{t}\right) \frac{h}{h - \Delta h} - 1 \approx \left(\frac{1}{h} + \frac{\nu}{t}\right) \Delta h. \tag{14}$$
Combining (12) and (14), the displacement of the marker $p$ in the vertical direction and the change rate of its dimension approximately satisfy a linear relationship ($A$ is a scaling factor)
$$\Delta h \approx A \alpha. \tag{15}$$
Equation (15) is the basis of the 2.5-D MDM based on the image size change. It states that the change in the distance between the marker and the camera is approximately proportional to the change in the area covered by the marker in the camera image space. Therefore, at each camera frame $k$, we can obtain the 3-D displacement of marker $i$ by tracking its 2-D movement $(\Delta u_i^k, \Delta v_i^k)$ in the camera image and the rate of change of its geometry $\alpha_i^k$, according to the function (16), in which $f$ expresses the camera's focal length.
Thus, the 3-D position field can be calculated by adding the obtained position changes with the initial coordinates of each marker precalibrated.
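The following minimal sketch shows how the linear size-change model of (15) could be turned into a pseudo-3-D displacement for one marker. It is an illustration under stated assumptions, not the function (16) itself: the names `relative_size_change`, `marker_displacement_3d`, and `scale_a` are hypothetical, the scale factor is assumed to come from calibration, and the lateral back-projection uses a first-order pinhole approximation at the rest distance `h`.

```python
def relative_size_change(x1, x2):
    """Relative change rate of the imaged marker size, (x2 - x1) / x1."""
    if x1 <= 0:
        raise ValueError("reference size must be positive")
    return (x2 - x1) / x1

def marker_displacement_3d(du, dv, x1, x2, h, f, scale_a):
    """Pseudo-3-D displacement (dx, dy, dz) of one marker.

    du, dv  : image-plane displacement of the marker center (pixels)
    x1, x2  : imaged marker size before and after contact (pixels)
    h       : rest distance between marker and camera (assumed known)
    f       : focal length expressed in pixels
    scale_a : calibrated factor of the linear depth model (assumption)
    """
    alpha = relative_size_change(x1, x2)
    dz = scale_a * alpha      # linear depth model: normal displacement
    dx = du * h / f           # first-order pinhole back-projection
    dy = dv * h / f
    return dx, dy, dz
```

For a marker whose physical size does not change, the pinhole model gives a size change of roughly the normal displacement divided by `h` for small displacements, so `scale_a` close to `h` is a natural starting value before calibration.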
For the pin-attached markers, the geometric changes of the markers themselves are ignored, but the pins' leverage effect can amplify the indentation into shear, as shown in Fig. 4(b).
Type 3: This is the case of pin-attached markers. Let the length of the pins be $l$ and the tangential displacement of marker $p$ under the normal force $F_N$ be $\Delta r$. According to [17], when the surface deformation is small, the horizontal coordinate of the pin's root is unchanged, and the tangential displacement of the marker $p$ can be calculated as
$$\Delta r = l \sin\theta = l \sin\left(\tan^{-1}\frac{dz}{dr}\right) \tag{17}$$
where $dz/dr$ expresses the surface gradient of the deformed skin and $\theta$ expresses the gradient angle. Under the small-angle approximation, (17) can be written as
$$\Delta r \approx l \frac{dz}{dr}. \tag{18}$$
Equation (18) means that the marker's tangential displacement (i.e., the shear strain of the skin) excited by normal extrusion is proportional to the surface gradient, and the amplification factor is the length of the pin. Therefore, the indentation information is amplified into shear information by the leverage of the pin. Although the normal and tangential strains are reflected in the same plane, they can still be identified since they produce different displacement patterns (normal: dipole or multipole field; tangential: uniform field) [17]. This method is closer to 2-D MDM in information processing, but the source of its indirect feature is the attribute of the pins (not coordinate values). Thus, we also include this method in the scope of 2.5-D MDM.
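To give a feeling for when the small-angle form is adequate, the short sketch below evaluates the exact pin relation, pin length times sin(arctan(dz/dr)), next to its linearized form, pin length times dz/dr. The function names are hypothetical and the numerical values are illustrative only.

```python
import math

def pin_tip_displacement(pin_length, surface_gradient):
    """Exact tangential displacement: l * sin(arctan(dz/dr))."""
    return pin_length * math.sin(math.atan(surface_gradient))

def pin_tip_displacement_small_angle(pin_length, surface_gradient):
    """Linearized tangential displacement: l * dz/dr."""
    return pin_length * surface_gradient

if __name__ == "__main__":
    for gradient in (0.05, 0.1, 0.3, 1.0):
        exact = pin_tip_displacement(2.0, gradient)
        approx = pin_tip_displacement_small_angle(2.0, gradient)
        print(f"dz/dr={gradient:4.2f}  exact={exact:.4f}  linear={approx:.4f}")
```

The two forms agree to within about 1% for gradients up to roughly 0.1 and diverge noticeably for steep surface gradients, where the linear form overestimates the displacement.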

C. Technologies and Implementation
The 2.5-D MDM and 2-D MDM have the same ideas for processing 2-D displacement fields, including three parts: marker recognition, marker tracking, and mapping relationship construction. Since Section II-B has discussed the above process in detail, we only focus on the acquisition of normal displacement by indirect features.
1) Obtain the Indirect Features of Markers: Conventional marker preparation methods include printing or filling [76], embedding [77], and 3-D printing or casting [78]. Since the images of the markers, prepared by the above methods, can be directly reflected in the camera plane, a simple way is to obtain the indirect features according to (15) without other processing. The FingerVision sensor approximated the normal force by the size change of the markers (one of the methods) [77], [79]. The transformation from the original deformation to the 3-D force was achieved by three conversion coefficients, which was similar to (16). The F-TOUCH sensor used the change of marker area across the camera image to indicate the z-direction displacements of markers [80]. The force evaluation based on this method has been revealed to be more accurate than that of GelSight.
However, the size change of the directly formed image is usually small. In order to obtain a significant image size change rate and reduce the proportion of error in the measurement, the existing research has explored modifications to the optical system and the marker preparation. Guo et al. [81] adopted the depth from defocus (DFD) method to determine the distance between each marker and the camera. They used convex lens imaging to determine the spot size of the markers directly. The ChromaTouch sensor used two layers of semitransparent markers to judge the compression deformation from the change in the mixed color content [76], [82]. This approach could enhance the feature variation and support postprocessing based on the light spectrum. The FingerVision with whiskers measured the normal force using whisker markers that could easily deform [78] (similar to the design of TacTip [17]), thus improving the measurement sensitivity.
2) Supplement Depth Information by Other Means: Since MDM was often used in combination with other types of visuotactile sensing approaches, features obtained by such methods could be used to supplement depth information. In the existing studies, a common method was to obtain the contact geometry by means of photometric stereo, so as to determine the z-coordinates of the markers. For example, Ito et al. [83] allowed the LED light to pass through red-colored water (for scattering and absorption) and then calculated the depression depth of the touchpad surface according to the relationship between the image pixel's color value, the illumination, and the distance. The obtained 3-D shape of the touchpad was used to calculate the normal component of each marker's displacement, which was then used to judge whether a marker was a sticking dot or a slipping dot [20]. GelSlim 2.0 [84] and GelSlim 3.0 [19] used the gel deformation in the z-direction to indicate the normal displacements of the markers. This enabled GelSlim to reconstruct the 3-D distributed force field using the inverse finite-element method (iFEM) based on the 3-D displacement field. Although the above methods do not conform to the properties mentioned in Section III-A, we also regard them as 2.5-D MDM under extended semantics.
Besides, an enlightening idea is to use the marker size and shape change to help predict the geometric information through learning-based approaches (e.g., the CNN method). In the tactile image, the variation in marker size can represent the indentation field, and the eccentricity of the marker can give information about the spatial gradient of the indentation field. Although different from the idea of converting indirect features into displacement, this method also belongs to the category of 2.5-D MDM.

D. Related Applications
Using 2.5-D MDM, the pseudo-3-D coordinates and displacements of the marker points can be obtained from the image information to achieve multimodal perception directly. The existing works include curvature measurements [82], contact process tracking [77], force distribution measurements [84], and multiaxis force measurements [80], among others. For 2-D tactile information such as slip fields [79], [85], the reconstruction method of 2.5-D MDM does not differ from that of 2-D MDM. As for 3-D tactile information, such as force distribution [84], 2.5-D MDM can use the pseudo-3-D deformation information to achieve the reconstruction by the iFEM. For example, suppose that the stiffness matrix $H_{X \to F}$ of the soft elastomer can be obtained using calibration or learning technologies. In that case, the contact force distribution can be derived in a manner similar to (6).
The application scenario of 2.5-D MDM is basically the same as that of 2-D MDM. For the first two types of markers (rigid and deformable), Yamaguchi and Atkeson [77] applied the FingerVision sensor to the Baxter robot as the end-effector, and the system could stably cut vegetables. Using the obtained contact force, torque, and FingerVision's unique proximity vision, different behaviors could be performed (e.g., adaptive capture, handover, and in-hand manipulation [79], [86]). The above work was mainly completed by analyzing tactile behaviors for grasping and manipulation tasks and designing appropriate control strategies [22]. Based on FingerVision, Belousov et al. [87] developed a controller library that contained rich control strategies and tactile skills and completed two challenging tasks: distinguishing objects with different characteristics and architectural assembly.
For the third type of marker, only the TacTip sensor currently uses such a design. Section III-B has introduced that the pin-attached markers can transfer the indentation in the normal direction to the horizontal direction. On this basis, Cramphorn et al. [88] proposed a Voronoi-based method to reconstruct key tactile features. They used the Voronoi tessellation principle to generate cells with each marker point as the centroid and computed the area of the bounded cells. The area change of each Voronoi cell could reflect the normal deformation, while the displacement of its centroid could reflect the tangential deformation. This method can reconstruct normal force, shear, and contact area without training a classifier or regressor and has been applied to perceive and visualize mid-air haptics [89].

E. Discussion
Compared with 2-D MDM, the indirect features introduced in 2.5-D MDM can complement the normal measurement results, thus making the measured 3-D deformation closer to the ground truth. For example, due to the ability to obtain the pseudo-3-D coordinate field of the marker array, 2.5-D MDM can obtain a relatively complete and rich contact morphology. Therefore, 2.5-D MDM has a high potential for object geometry identification, feature measurement, and contact area determination. Using the obtained 3-D displacement field, 2.5-D MDM can achieve dense normal force distribution reconstruction, which is harder to achieve with 2-D MDM.
In addition to reflecting the depth information of each sampling point, another advantage of indirect features based on imaging size changes is that they are relatively simple to implement. In general, the imaging size of the markers always varies during contact, so the pixel area of each marker can supply depth information without adding any other devices or changing the hardware design. This makes it straightforward to develop 2.5-D MDM sensors from existing 2-D MDM sensors.
However, the use of indirect features implies a complex preparation process and detection algorithm, which makes the sensor more difficult to fabricate and may introduce more error terms. In addition, the z-directional displacement of markers does not strictly satisfy a stable relationship with the change rate of dimensions.
1) For the rigid markers, (12) shows that the proportionality between $\Delta h$ and $\alpha$ is related to the horizontal position $r$, which indicates that the linear relationship can only be approximated when the horizontal displacement of the marker is small compared to the normal distance.
2) For the deformable markers, (13) holds only under the assumptions of uniaxial compression and linear elasticity. In fact, the soft elastomers' shape and boundary conditions are more complicated, and it is usually impossible to construct analytical expressions.
3) For the pin-attached marker, although indentation features are converted into more sensitive shear features according to (18), the two features have merged and the skin shearing might distort the reconstructed indentation. Therefore, the decoupling and calculation difficulty of normal information estimation may increase.
The above discussion implies that the accurate and reliable detection of 3-D information in 2.5-D MDM may also call for the assistance of appropriate machine learning methods. Therefore, 2.5-D MDM is better regarded as a pseudo-3-D measurement method with low hardware cost, and it cannot be classified as 3-D MDM.

IV. 3-D MARKER DISPLACEMENT METHOD
According to the discussions in Sections II and III, depth information is important for the multimodal tactile perception of vision-based tactile sensors. Although artificial intelligence techniques have been widely used in the enhancement and applications of tactile sensors, there is no denying that more complete dimensional information makes data-driven methods more effective.
Both 2-D MDM and 2.5-D MDM use only a single camera. It makes sense in vision-based tactile sensor designs since the sensors need to be compact for integration into robotic systems. However, we note that in recent years, some research efforts using the multicamera system have gradually emerged [23], [26], [90], [91], [92], [93], [94], [95], [96], [97] (the details are introduced in Section IV-B). These sensors employed two or more cameras to form the parallax based on stereo vision principles, thus directly measuring the 3-D coordinates and displacements of each marker, as shown in Fig. 5(a).
From the perspective of dimensionality, we collectively refer to such approaches as the 3-D MDM. More generally, we can also include the visuotactile sensing techniques using depth cameras [99], [100] or time-of-flight (ToF) cameras [101] into 3-D MDM. In this section, we detail the principles of 3-D MDM and the applications in vision-based tactile sensors and discuss the possible development prospects.

A. Principles of 3-D MDM
Stereo vision is not a novel concept. In nature, primates, including humans, possess a pair of eyes. They are at the front of the head and have a large area of overlapping visual fields [102]. We can refer to it as the common view area. In the common view area, parallax is formed when both eyes simultaneously see a feature position in the physical space. The presence of parallax allows humans to use binocular image signals to obtain depth information of feature points, thus creating a sense of stereo. In computer vision, stereo vision has also been widely developed. By simulating the principles of human eye vision and combining camera models, triangulation, and depth map methods, we can use two cameras to obtain the distance between the object and the camera, thus enabling, for example, the application of 3-D morphometry [103].
We present the measurement principle of 3-D MDM using binocular vision as an example. In general, binocular vision-based tasks employ stereo rectification, which enables stereo matching based on epipolar geometry constraints [104]. However, for most vision-based tactile sensors based on MDM, the targets that need to be recognized in stereo are the distinctive markers. We only need to identify these feature points without acquiring a full-image depth map. Therefore, we adopt a more flexible approach to obtain the 3-D coordinates of each marker, as shown in Fig. 5(b). Select the marker $P$, located in the common view area of camera L and camera R, as the detection target. Assume that cameras L and R are triggered synchronously, that lens distortion is negligible, and that the intrinsic matrices are $I_L$ and $I_R$. The positions of the camera centers in the world coordinate system are $O_L(x_{ol}, y_{ol}, z_{ol})$ and $O_R(x_{or}, y_{or}, z_{or})$. At the $k$th camera frame, the position of $P$ in the world coordinate system is $P^k(x_p^k, y_p^k, z_p^k)$, and the 2-D coordinates of its image points in the two camera images are $p_l^k(u_l^k, v_l^k)$ and $p_r^k(u_r^k, v_r^k)$.
Therefore, the light vectors $\overrightarrow{O_L P^k}$ and $\overrightarrow{O_R P^k}$ can be calculated from the image points as in (19), where $sl_p^k$ and $sr_p^k$ are unknown scaling factors and $f_l$ and $f_r$ are the camera focal lengths. The vectors can also be expressed as in (20). Together, (19) and (20) provide six equations in only five unknowns. The system is overdetermined, but this result is acceptable: since the two rays do not intersect exactly at point $P$ in the nonideal case, we can use the extra information to correct the error [26]. A possible approach is to pick the common normal $P_l^k P_r^k$ of $O_L P^k$ and $O_R P^k$ and use the midpoint of $P_l^k P_r^k$ as the measured position of the marker $P^k$ [97]. As shown in Fig. 5(c), let the light vectors actually point to $P_l^k$ and $P_r^k$. Then, $P^k$ can be calculated as in (21), where $t_l$ and $t_r$ are parameters determined by the orthogonality relation (22) [26].
The angle between the vectors $\overrightarrow{O_L P_l^k}$ and $\overrightarrow{O_R P_r^k}$ is the viewing angle difference between the two cameras. According to the principle of binocular vision, the closer the viewing angle difference is to 90°, the smaller the effect of measurement error on the common normal $P_l^k P_r^k$. Thus, the determined marker position is more accurate. This is the reason for the preferred viewing angle difference of 90° in the optical path design [23].
The above discussion shows that with the use of binocular cameras, the intersection of a pair of optical rays can be used to locate a given marker $P$, thus enabling reliable 3-D coordinate measurements. When the number of cameras exceeds two, we can obtain multiple sets of optical rays pointing to the marker. The multicamera system can lead to richer data, thus providing the possibility of fitting more accurate results using methods such as energy optimization.
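The midpoint-of-common-normal idea described above can be sketched as follows. This is a generic closest-point computation between two rays, not the exact formulation of the cited works: the function name `triangulate_midpoint` is hypothetical, each ray is assumed to be given by a camera center and a direction vector already expressed in the world coordinate system, and the returned gap (the length of the common normal) is included only as a convenient consistency check.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _point_on_ray(origin, direction, t):
    return tuple(o + t * d for o, d in zip(origin, direction))

def triangulate_midpoint(o_l, d_l, o_r, d_r):
    """Midpoint of the common normal between rays o_l + t_l*d_l and o_r + t_r*d_r.

    Returns (midpoint, gap), where gap is the length of the common normal.
    """
    w = _sub(o_l, o_r)
    a, b, c = _dot(d_l, d_l), _dot(d_l, d_r), _dot(d_r, d_r)
    d, e = _dot(d_l, w), _dot(d_r, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel")
    # Parameters that make the connecting segment orthogonal to both rays.
    t_l = (b * e - c * d) / denom
    t_r = (a * e - b * d) / denom
    p_l = _point_on_ray(o_l, d_l, t_l)
    p_r = _point_on_ray(o_r, d_r, t_r)
    midpoint = tuple((x + y) / 2.0 for x, y in zip(p_l, p_r))
    gap = math.sqrt(_dot(_sub(p_l, p_r), _sub(p_l, p_r)))
    return midpoint, gap
```

A large gap for a given marker suggests a wrong left-right correspondence or a calibration problem, so it can serve as a simple rejection criterion.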

B. Technologies and Implementation
Different from 2-D MDM and 2.5-D MDM, 3-D MDM-based sensors need to meet the hardware and software requirements of stereo vision. In this section, we mainly discuss the existing research on the establishment of the stereo vision system.
1) Build the Stereo Vision System: A common way to build a stereo vision system in the vision-based tactile sensor is to replace the monocular camera with a stereo camera. Zhang et al. [90] proposed a tactile sensor that can measure 3-D displacement fields using binocular vision. Kakani et al. [91] captured the translation and rotation of the markers with a stereo camera. The GelStereo sensor also used a stereo camera and the related stereo matching algorithm to calculate the contact depth information [25].
Another approach is to use multiple monocular cameras and let them shoot at different positions and angles of view. Muscularis [93], TacLINK [94], and IoTouch [95] used two cameras, which were arranged at the top and bottom of the robot link, to capture the 3-D displacement of the marker array on the inner wall. Such designs were also used in the ProTac sensor with proximity perception function [96]. The Tac3D sensor [23], [97] realized simultaneous measurement of two virtual cameras from different angles through a monocular camera, which was achieved by arranging mirrors to construct two optical paths.
Compared with the former, the latter method can expand the perception area of the tactile sensor and reduce the impact of occlusion on the measurement. Still, the camera trigger may not be synchronized if different physical cameras are used. A feasible solution will be introduced later in Section IV-C.
2) Establish the Stereo Vision Matching: In Sections II-B and III-C, we have introduced the three processes of 2-D MDM and 2.5-D MDM: marker recognition, marker tracking, and mapping construction. In 3-D MDM, due to the use of multiple cameras (usually two), it is also necessary to match the marker points identified in different camera images (i.e., image registration).
A worthwhile approach is to narrow the search range when matching feature points by epipolar geometry constraints [90], [91]. As shown in Fig. 3(c), let $p_l$ and $p_r$ be the imaging positions of the marker point $P$ in the two image planes. Define the intersections of the baseline $O_l O_r$ with the two image planes as the epipoles $e_l$ and $e_r$. The lines $p_l e_l$ and $p_r e_r$ are called the epipolar lines corresponding to $p_l$ and $p_r$, respectively. When the camera poses are fixed, the positions of the epipoles in the image planes are constant. Therefore, if the position of $p_l$ is determined, $p_r$ must lie on the line $p_r e_r$ in which the epipolar plane $O_l O_r p_l$ intersects the image plane of the right camera. This constraint is expressed by (23), in which $E$ is called the fundamental matrix and $K_l$ and $K_r$ denote the projection matrices of the two cameras.
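A minimal sketch of how an epipolar constraint narrows the matching search is given below. It assumes that a 3x3 matrix acting directly on homogeneous pixel coordinates is already available from calibration (named `fundamental` here, a hypothetical parameter), and it simply picks the right-image candidate closest to the epipolar line of a left-image marker. The function names and the distance threshold are illustrative assumptions.

```python
import math

def epipolar_line(fundamental, p_left):
    """Line (a, b, c) in the right image for a left-image point (u, v)."""
    x = (p_left[0], p_left[1], 1.0)
    return tuple(sum(fundamental[i][j] * x[j] for j in range(3)) for i in range(3))

def epipolar_distance(fundamental, p_left, p_right):
    """Pixel distance from p_right to the epipolar line of p_left."""
    a, b, c = epipolar_line(fundamental, p_left)
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(a * p_right[0] + b * p_right[1] + c) / norm

def best_match(fundamental, p_left, candidates, max_dist):
    """Index of the candidate nearest to the epipolar line, or None."""
    best_index, best_dist = None, max_dist
    for i, cand in enumerate(candidates):
        dist = epipolar_distance(fundamental, p_left, cand)
        if dist <= best_dist:
            best_index, best_dist = i, dist
    return best_index
```

With a regular marker grid, several markers can fall near the same epipolar line, so in practice this test would be combined with an ordering or neighborhood rule rather than used alone.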
For visuotactile sensing, some 3-D MDM-based sensors did not adopt the epipolar geometry constraints but used a simplified matching idea: first, track the path of each marker point, and then, associate the points obtained from the left and right camera measurements with each other according to their ordinal numbers [23], [94]. This requires that the algorithm can rely on the arrangement rule of the markers to sort the marker points in the tracking phase [25].
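The ordinal-number matching idea can be sketched as follows for a rectangular marker grid. This is an illustration only: the helper names `sort_grid` and `match_by_order` are hypothetical, the number of columns is assumed to be known, and the sorting assumes that the deformation is small enough for the grid rows to remain separable by their vertical image coordinate.

```python
def sort_grid(points, n_cols):
    """Order (u, v) points row by row, left to right within each row."""
    by_v = sorted(points, key=lambda p: p[1])
    ordered = []
    for start in range(0, len(by_v), n_cols):
        row = by_v[start:start + n_cols]
        ordered.extend(sorted(row, key=lambda p: p[0]))
    return ordered

def match_by_order(left_points, right_points, n_cols):
    """Pair left and right detections that share the same ordinal number."""
    if len(left_points) != len(right_points):
        raise ValueError("both views must contain the same number of markers")
    return list(zip(sort_grid(left_points, n_cols),
                    sort_grid(right_points, n_cols)))
```

Once the initial ordering is fixed, frame-to-frame tracking can keep each marker's index, so the sorting step does not have to be repeated under large deformation.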
In addition, the marker-set pattern used in 3-D MDM could be replaced with a dense speckle layer (similar to the dense scattered spots of 2-D MDM in Section II-B). In [92], the marker pattern in GelStereo was updated to a semitransparent color pattern. Li et al. [98] used the dense speckle layer to provide matching features. Such designs could provide more texture information and accommodate new approaches such as self-supervised disparity estimation. The stereo matching of these sensors was usually achieved by using the 3-D digital image correlation (3-D-DIC) algorithm.

C. Virtual Stereo Vision System
The approach presented in Section IV-A is the theoretical basis for 3-D MDM. It enables the successful application of stereo vision-based 3-D inspection methods in vision-based tactile sensors. Currently, the miniaturization trend of imaging equipment has significantly promoted the compactness of vision-based tactile sensors with a monocular camera. However, in a visuotactile sensing system using stereo vision, the image-matching task requires synchronous shooting of the cameras. Adding a consumer binocular camera or an automatic trigger module will increase the overall size. Thus, compared with 2-D MDM and 2.5-D MDM, 3-D MDM additionally needs to resolve the tradeoff between synchronous triggering and compactness.
Virtual stereo vision is a promising solution that is easy to implement. This technique was first applied in the 1980s to study the motion mechanism of bubbles [106] and gradually evolved to 3-D trajectory reconstruction focusing on measurement accuracy [107], [108]. Virtual stereo vision is capable of implementing stereo vision with a single camera. It uses a single physical camera and a mirror reflection system to create two or even more virtual cameras and, therefore, can achieve stereo vision measurement, as shown in Fig. 6(a).
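The geometric idea behind a mirror-based virtual camera can be illustrated by reflecting the physical camera center across a planar mirror: the virtual camera sits at the mirror image of the real one. The sketch below assumes ideal planar mirrors and uses hypothetical names; it is not the optical design of any particular sensor.

```python
import math

def reflect_point(point, plane_point, plane_normal):
    """Mirror `point` across the plane through `plane_point` with the given normal."""
    length = math.sqrt(sum(n * n for n in plane_normal))
    if length == 0.0:
        raise ValueError("plane normal must be nonzero")
    unit = tuple(n / length for n in plane_normal)
    # Signed distance from the point to the mirror plane.
    dist = sum((p - q) * n for p, q, n in zip(point, plane_point, unit))
    return tuple(p - 2.0 * dist * n for p, n in zip(point, unit))

if __name__ == "__main__":
    camera = (0.0, 0.0, 0.0)
    # Two symmetric mirrors produce two symmetric virtual camera centers.
    left_virtual = reflect_point(camera, (-1.0, 0.0, 0.0), (1.0, 0.0, 1.0))
    right_virtual = reflect_point(camera, (1.0, 0.0, 0.0), (-1.0, 0.0, 1.0))
    print(left_virtual, right_virtual)
```

The distance between the two virtual centers plays the role of the stereo baseline, which is why the mirror placement directly affects the achievable viewing angle difference.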
The Tac3D sensor was the first to introduce virtual stereo vision to the field of visuotactile sensing. The optical structure of Tac3D 1.0 is shown in Fig. 6(b). By reflecting the left and right light, the physical camera could be mirrored into two symmetric virtual cameras, and the image planes of the virtual cameras are shown in Fig. 6(c). Thus, each marker's stereo image pairs could be obtained simultaneously in the same physical camera view (equivalent to having two cameras shoot at different angles simultaneously). This approach reduced the number of cameras while still achieving 3-D measurements and did not require a synchronization controller to achieve simultaneous triggering, thus reducing sensor weight and cost. In addition, since the light path is folded, longer light paths can be obtained in tight spaces. Tac3D 2.0 [26], [97] further optimized the optical path structure and reduced the sensor size, achieving a compactness close to that of a single-camera vision-based tactile sensor of the same size, as shown in Fig. 6(d).
In summary, the virtual stereo vision system is expected to contribute to the tradeoff between structural compactness and synchronous triggering in 3-D MDM. The compact design of the multiview virtual camera system could also be realized by systematically optimizing the optical path structure. However, virtual stereo vision causes a portion of the camera field of view to be lost. Related works should consider the tradeoff between the effective field of view and the size of the optical system.

D. Related Applications
Three-dimensional MDM can provide relatively reliable 3-D coordinate and displacement fields. The information obtained by 3-D MDM is close to the ground truth, thus providing the possibility of high-quality multimodal perception. The existing works include friction coefficient measurement [23], 3-D geometry reconstruction [26], contact position estimation [91], force distribution measurement [94], and slip field measurement [109], among others. The reconstruction method of 3-D MDM for 3-D tactile information is similar to that of 2.5-D MDM, which is also achieved by constructing the mapping relationship between the 3-D displacement field (or coordinate field) and 3-D contact properties.
Although there are only a few relevant research works on 3-D MDM, existing studies have shown that this method has great application potential. The TacLINK [94] and IoTouch [95] sensors based on iFEM and CNN could be used for robot parts with different shapes, and therefore, it was possible to deploy them over large areas. This technology could be extended to dexterous manipulation, human-computer interaction, and other fields. The ProTac robotic link combined 3-D MDM based on a DNN model with proximity sensing technology, and the proposed design could be extended to other types of tactile sensing elements and further to robot arm applications [96]. GelStereo uses point cloud registration [25] and neural networks [92], [105] to handle different application scenarios, including in-hand object localization and insertion, and adaptive capture. The above applications illustrate the potential of 3-D MDM in robotic grasping and manipulation tasks.

E. Discussion
The major feature of 3-D MDM is that it can obtain the original information in all dimensions. This means that 3-D MDM can directly infer the indentation depth field, thus reducing the demands on the algorithm. Since the original 3-D information obtained has higher confidence, 3-D MDM can further improve the detail accuracy of the contact geometry and force distribution field with the same mapping strategy or network architecture. In addition, the learning-based techniques commonly used in 2-D MDM and 2.5-D MDM can also be applied in 3-D MDM. By constructing models that map from 3-D displacement fields to other tactile properties, 3-D MDM has the potential to reconstruct richer mechanical contact properties.
In addition to achieving 3-D information acquisition, adopting multiple cameras can also expand the perception range of the sensor (i.e., effective measurement format) [16]. For example, the OmniTact sensor had multiple micro-cameras built in to shoot the gel-based skin from different directions and angles to detect multidirectional deformations and achieve global perception [110]. Trueeb et al. [111] designed a vision-based robotic skin with four small embedded cameras arranged in an array to expand the sensing area without using additional reflective components. This means that the multicamera system can not only be applied to sensors based on 3-D MDM but also stimulate the innovative development of visuotactile sensing using 2-D MDM and 2.5-D MDM.
Since tactile sensors usually need to be integrated into intelligent systems, robots, and the Internet of Things (IoT), high compactness is always needed. When the number of cameras increases, the sensor structure becomes bulky. In addition, the requirement of higher imaging range and distance in multivision has increased the difficulty of sensor integration. Due to the investment in smartphone technology and the development of camera miniaturization, the available camera modules have reached the millimeter-level size, and vision-based tactile sensors of fingertip size are becoming available [55], [110]. For the research of using stereo vision in vision-based tactile sensors, we recommend customizing the lens (with short focal length and wide angle of view) and the small printed circuit board (PCB) to mount them. In addition, virtual stereo vision and other technologies can offer inspiration to reduce the number of synchronous triggers and improve the compactness.
Besides, the complex optical path structure could introduce other types of errors, such as the refraction error of light passing between different materials. Several studies on optical system optimization in 3-D MDM have emerged recently: Tac3D 2.0 considered compensation for the elastomer refraction effect in the sensor calibration [97]. The GelStereo sensor built a refraction model of the multimedia optical path [92]. Ma et al. [112] proposed the BVTS model for correcting refraction effects. We hope that future research will systematically investigate the design and optimization of optical models and reduce the hardware performance requirements of the corresponding models. Such work will help improve the applicability of 3-D MDM.
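The source of the refraction error mentioned above can be sketched with the vector form of Snell's law, which bends a viewing ray as it crosses an interface between two media. This is a generic textbook formulation, not the specific model used in [92], [97], or [112]. The function name `refract` and the refractive indices (1.0 for air, 1.4 for the elastomer) are illustrative assumptions.

```python
import math


def refract(direction, normal, n1, n2):
    """Refract a unit ray direction at a flat interface (vector form of Snell's law).

    direction: unit vector of the incoming ray, travelling from medium 1 to medium 2.
    normal:    unit surface normal pointing back into medium 1 (against the ray).
    n1, n2:    refractive indices of medium 1 and medium 2.
    Returns the refracted unit direction, or None on total internal reflection.
    """
    cos_i = -sum(d * n for d, n in zip(direction, normal))  # cosine of incidence angle
    ratio = n1 / n2
    sin_t_sq = ratio * ratio * (1.0 - cos_i * cos_i)        # squared sine of refraction angle
    if sin_t_sq > 1.0:
        return None                                         # total internal reflection
    cos_t = math.sqrt(1.0 - sin_t_sq)
    return tuple(ratio * d + (ratio * cos_i - cos_t) * n
                 for d, n in zip(direction, normal))


# Assumed indices for illustration: air (1.0) into a transparent elastomer (1.4).
N_AIR, N_ELASTOMER = 1.0, 1.4
incoming = (math.sin(math.radians(30.0)), 0.0, math.cos(math.radians(30.0)))
interface_normal = (0.0, 0.0, -1.0)  # points back toward the camera side

bent = refract(incoming, interface_normal, N_AIR, N_ELASTOMER)
```

Triangulating with the unrefracted ray `incoming` rather than `bent` places a marker at the wrong depth, which is why a calibration step or an explicit refraction model is needed.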

V. CONCLUSION
This article presents a detailed study and a categorization of the MDM. MDM is a technique commonly used in the field of visuotactile sensing. By using a camera to photograph the markers prepared on the sensor contact elastomer, a tactile image containing the position change of the markers can be obtained, and the tactile information can be further obtained by postprocessing and analyzing the tactile image.
For MDM-based visuotactile sensors, the deformation of the elastomer is usually considered the original tactile information and is characterized in the form of 2-D or 3-D position information (coordinates and displacement) of the marker array. Because it obtains the exact position of the anchor points on the contact surface, MDM can directly track the deformation process of the contact surface by interpolation and therefore has an advantage over other methods of representing tactile information when measuring dynamic tactile information (e.g., dynamic process and distributed force measurements).
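As a minimal sketch of how a sparse marker displacement field can be interpolated to an arbitrary point on the contact surface, the snippet below uses inverse-distance weighting. This scheme is chosen here only for brevity and is an assumption, since the works reviewed do not prescribe a particular interpolation method. The function name and the sample marker data are likewise hypothetical.

```python
def interpolate_displacement(query, markers, displacements, power=2.0, eps=1e-12):
    """Inverse-distance-weighted interpolation of marker displacements.

    query:         (x, y) surface point at which the displacement is wanted.
    markers:       list of (x, y) marker positions in the reference frame.
    displacements: list of displacement tuples (2-D or 3-D), one per marker.
    """
    qx, qy = query
    weight_sum = 0.0
    acc = [0.0] * len(displacements[0])
    for (mx, my), disp in zip(markers, displacements):
        d2 = (qx - mx) ** 2 + (qy - my) ** 2
        if d2 < eps:
            return tuple(disp)            # query coincides with a marker
        w = 1.0 / d2 ** (power / 2.0)     # weight decays with distance**power
        weight_sum += w
        for k, comp in enumerate(disp):
            acc[k] += w * comp
    return tuple(a / weight_sum for a in acc)


# Four markers on a unit cell with assumed 3-D displacements (mm).
markers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
displacements = [(0.0, 0.0, 0.0), (0.0, 0.0, -1.0), (0.0, 0.0, -1.0), (0.0, 0.0, -2.0)]

center = interpolate_displacement((0.5, 0.5), markers, displacements)
```

At the cell center all four markers are equidistant, so the result is their mean displacement. Evaluating the function on a dense grid for each frame gives a time series of the deformed surface.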
In addition, the use of MDM for multimodal tactile perception has high potential since it uses controllable data and algorithms to construct mapping models based on the original tactile information. Researchers can use the latest machine learning techniques and physical models to extract various types of tactile information from the original displacement field (or coordinate field).
The complexity of tactile perception stems in large part from the richness of the information involved. Different dimensions of tactile information, including 2-D and 3-D information, should be acquired when dealing with various problems. To help researchers choose appropriate research approaches, we categorize MDM, for the first time, into 2-D MDM, 2.5-D MDM, and 3-D MDM based on the dimensionality perspective (see Table I).
Two-dimensional MDM is one of the more commonly used forms of MDM. It relies only on a monocular camera to acquire the marker array's 2-D displacement field (and coordinate field) and uses well-designed mechanical models or data-driven learning techniques to achieve information fusion, thus reconstructing the tactile information. This approach has relatively high flexibility and low hardware cost and is suitable for 2-D tactile perception and 3-D concentrate characterization. Since the original tactile information is only 2-D, it can be challenging to obtain the 3-D contact distribution using 2-D MDM. However, this leaves broad room for exploration in research using more advanced end-to-end learning technologies.
The 2.5-D MDM supplements 2-D MDM with selected indirect features reflecting the location of the markers in the third dimension (i.e., depth information). This method has similar properties to 2-D MDM in measuring 2-D tactile information while also enabling the acquisition of 3-D contact distribution properties using a pseudo-3-D displacement field. Because the third dimension is not measured directly, the 3-D field quantity obtained by this method still has some error compared with the ground truth. It is helpful to design learning networks that are sensitive to the indirect features to improve the quality of 3-D information measurement.
Three-dimensional MDM employs a multicamera system and can achieve tactile perception using the stereo vision method common in computer vision technology. The 3-D displacement (or position) field obtained by 3-D MDM has a high quality even without relying on learning technologies. Therefore, 3-D MDM can be considered as having a relatively high upper performance limit. However, the main obstacles constraining this approach are the oversized structure and the additional errors. Researchers are trying to optimize it with camera miniaturization, imaging system design, multimedia refraction model design, and so on to promote the application of 3-D MDM.
We believe that the MDM will continue to improve with the further development of vision-based tactile sensors. Future research will use more advanced computer vision and image processing technology to further improve the performance of MDM. In addition, recent research has demonstrated the potential of MDM-based visuotactile sensing technology in reflecting human touch behavior and interface phenomena. Li et al. [98] proposed a novel 3-D MDM to measure traction stress with high spatial and temporal resolution and studied the evolution of adhesion stress and the creep mechanism of snails. Based on TacTip, Pestell et al. [113] built SA-I and RA-I bionic tactile channels and used 2-D MDM to construct artificial tactile signals that closely resemble real tactile afferent activity recorded from monkeys on the same stimuli. With the introduction of the RA-II channel, this multimodal perception system could draw on natural touch to inspire artificial texture perception [114]. We expect MDM-based visuotactile sensing technology to attract more attention in interdisciplinary applications across robotics, physics, and biology.