Evaluation of Communication and Human Response Latency for (Human) Teleoperation

We previously introduced a novel mixed reality teleguidance system dubbed human teleoperation (David Black et al., 2023 and Black and Salcudean, 2023), in which a human (expert) leader and a human (novice) follower are tightly coupled through mixed reality and haptics. Our first evaluation of human teleoperation is in the context of tele-ultrasound, in which a sonographer or radiologist’s gestures are copied by a remote novice to carry out an ultrasound examination. In this paper, a communication system suitable for implementation of human teleoperation is presented and characterized in various network conditions, over Ethernet, Wi-Fi, 4G LTE, and 5G. To obtain a full understanding of latency in the system, the human response time is additionally characterized through a series of step response tests with 11 volunteers. The step responses were obtained by tracking the position of, and force exerted by, the human hand in response to a change in the mixed reality target. Different rendering methods were evaluated. The round-trip communication latency is 40 ± 10 ms over 5G, and down to 1 ± 0.6 ms over Ethernet for typical throughputs. The human response time to a step change in position depends on the step magnitude, but is between 485 and 535 ms, while the reaction time to a change in force is 150 to 200 ms. Both lag times are greatly decreased when tracking a smooth motion. Thus, we demonstrate that the system is network agnostic and can achieve good teleoperation performance and secure, low latency communication in appropriate network conditions. This brings the human teleoperation concept a step closer to human trials in a clinical environment, and the presented tools and concepts are applicable to any high-performance teleoperation system, and especially to mixed reality guidance.


David G. Black, Dragan Andjelic, and Septimiu E. Salcudean, Life Fellow, IEEE

One particularly relevant procedure to which telehealth can be applied is ultrasound (US). This is useful not only for remote or under-resourced communities [5], [6], but also for Focused Assessment with Sonography in Trauma (FAST) examinations of trauma patients in ambulances [7], for elderly patients in care homes for whom mobility is difficult [8], for COVID-19 patients [9], [10], and even for patients in hospitals when radiologists have to cover call in several hospitals at once. Remote training of sonographers is another popular application [11], [12]. Point of Care Ultrasound (POCUS) is becoming increasingly popular [13]. Existing approaches to tele-ultrasound include robotic teleoperation as well as multimedia applications that combine verbal and graphical guidance on a smartphone or tablet application.
Robotic US systems can provide high precision, low latency, and haptic feedback [14], [15], [16], [17]. One system has demonstrated clinical utility in trials [18], and much recent work has focused on autonomous robotic US [19], [20], [21], [22], [23], [24]. Good reviews of robotic ultrasound systems are found in [21], [25], [26]. Despite the large body of literature in this field, the issues of safe human-robot interaction and guaranteed robust autonomy remain difficult, especially from a regulatory perspective. Further limitations include restricted workspaces, time-consuming set-up, large physical size that prevents use in ambulances, and cost, especially compared to inexpensive US systems. The questions of cost and complex setup and maintenance in particular make it difficult to deploy such systems in small communities where they are needed.
Conversely, systems sold by Clarius Mobile Health Corp., Butterfly Network, and Philips use a portable US probe with images and video conferencing available via a cloud interface on a mobile phone or tablet application. Though these systems are inexpensive and flexible, the desired probe pose and force are given verbally or with some overlays of arrows or pointers on the US image, which is very inefficient, leading to high latency and low precision. These systems are designed more for expert review of images captured by a capable sonographer than for guidance of a novice.
Robotic teleoperation and video conference-based teleguidance fall on either end of a spectrum from performance to ease of use and deployment, leaving a large gap for solutions that are flexible and easy to use, yet also precise and efficient. In a previous paper [1], we introduced a novel concept of "Human Teleoperation" through mixed reality (MR) which bridges this gap. In this control framework, the human follower is controlled as a flexible, cognitive robot such that both the input and the actuation are carried out by people, but with near robot-like latency and precision. This allows teleguidance that is more precise, more intuitive, and lower in latency than verbal guidance, yet more flexible, inexpensive, and accessible than robotic teleoperation.
2576-3202 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Augmented reality takes the real environment and augments it by adding visual information to it [27]. This augmentation can take place within the real environment itself or on a video stream or similar. Mixed reality is a subset of augmented reality in which the visual cues are overlaid onto the real, physical environment in an optically transparent headset such as the Microsoft HoloLens 2, Magic Leap 2, Nreal Light, and the Meta Quest Pro [28]. The ability to project 3D information seamlessly into the real environment is the primary enabling technology for human teleoperation, as this information can be used to guide a novice follower while he/she interacts with the environment, for example for an ultrasound exam.
Human trials were carried out in [29] to investigate this ability of humans to act as a robot in tracking an input MR signal, showing promising performance. However, these tests were performed with the expert and follower sides connected over WiFi, on a fast network. Introducing communication latency can have a strong negative effect on teleoperation performance. Kaber and Zhang found that performance decreases above 150 ms delay for haptic tasks in teleoperation [30], while Jay and Hubbold determined that delays of 69 ms in visual feedback and 187 ms in haptic feedback are disruptive to a user manipulating a haptic device [31]. The same group later found that delays of 25 ms affected specific haptic tasks for collaborative virtual environments [32], but that the effect of delays is highly task-dependent. Though users seem to be more sensitive to delays in visual rather than haptic feedback, when a haptic delay is perceived, performance drops much faster than if the delay is visual [30].
Thus, latency is key for almost all aspects of the teleoperation. The sonographer relies on visual feedback from the ultrasound images, the haptic feedback, and the video stream from the mixed reality headset to decide where to move and what force to apply. Delays in any of the data lead to a very unintuitive experience. Masuda et al. achieved telerobotic ultrasound latency of < 1 second [33], compared to previous experiments where they had 4-5 seconds of latency [34], which they described as "very stressful" [33]. However, even delays of 1 second in force control can cause instability. Niemeyer and Slotine proposed the use of wave variables to maintain stability for time-delayed force reflecting teleoperation [35], which have since been improved for time-varying delays [36], using disturbance observers [37], time domain passivity control [38], μ-synthesis [39], and more.
Given the profound effects of time delays on performance, stability, and controller design, it is important to minimize and then measure and characterize these delays in any system. Since human teleoperation is a human-in-the-loop system, however, the delays associated with the human response time are also critical and should be evaluated.
Therefore, in this paper we present a communication system which uses a secure, high-speed, network-agnostic Web Real Time Communication (WebRTC) interface, described in Section II-A. Section II-D describes a number of tests that were performed on the communication system to characterize its performance in different network conditions, including latency tests over Ethernet, WiFi, 4G LTE, and 5G, all with various signal conditions. To our knowledge, WebRTC has not previously been used or tested for teleoperation systems, and no other remote ultrasound system in the literature presents a detailed communication system design or a thorough characterization thereof in the different network conditions to which it may practically be exposed. Though presented in the context of human teleoperation, this is very generally applicable to any teleoperation system or collaborative MR application. The data channels shown here can be replaced by other ones, and the tests are independent of what type of data is being sent.
The user-related time delays in the human-in-the-loop system are evaluated through step response tests with 11 subjects. To perform these tests, this paper describes improvements to the prototype developed in [1], to take direct force input from the expert. A dummy ultrasound probe for the follower with 6-axis force/torque sensing and 6-DOF position and orientation (pose) tracking is also described (Section II-B). A novel visual control system for the forces was developed, as described in Section II-C. The tests are significant in showing that the perceptual and cognitive delay in the human subjects is within the range observed in prior work (cited above) to allow successful human teleoperation. The measured values provide a starting point for control system design and for optimizing the human teleoperation system response. To the best of our knowledge, no other study has explored the human response when guided by an MR interface. Further tests of the visual force control and human tracking ability are presented in [29].

B. Human Teleoperation
The human teleoperation system (described in [1], [2], [40]) is being developed for hand-over-hand remote guidance of procedures such as US. It consists of the follower/patient side in a remote community and the expert side in a medical center, which communicate over the Internet. The follower, who need not have any US experience, wears an MR headset (Microsoft HoloLens 2) which projects a virtual US transducer into the follower's scene. The expert sonographer controls the virtual probe in real time using a haptic controller (Touch X, 3D Systems, Inc.) to input the desired pose and force. The follower tracks the virtual probe's motion with his/her real probe on the patient. The expert, in real time, receives the US images, a video stream of the patient with the virtual and real probes in position (called an MR capture), and is in verbal communication with the follower. Additionally, the follower sends a spatial mesh of the patient, generated by the HoloLens 2, to the expert. This provides the expert-to-follower coordinate transform. The mesh is also rendered haptically as a virtual fixture for the haptic device, giving the expert the sensation that they are physically interacting with the tissue. Alternatively, measured forces and/or US probe pose can be fed back directly in a bilateral teleoperation architecture. Some possible architectures are described in Section I-A above. These traditional methods differ from human teleoperation only in that they use a robotic arm rather than a human one.
The system is shown in Fig. 1.
The effectiveness of the approach was demonstrated first using a WebSocket server and Robot Operating System (ROS) on a local wireless network (WLAN), showing a large improvement in accuracy and completion time compared to existing teleguidance methods [1]. The following sections describe and characterize a new communication architecture with different throughputs and network conditions, and present tests of the human contribution to latency in this human-in-the-loop system.

A. Communication
The US images, video feed, and spatial meshes require a large bandwidth, while haptic feedback and MR teleoperation necessitate very low latencies for stable, transparent, and intuitive teleoperation. An accounting of the required throughputs is shown in Table I. The follower side is heavily biased towards uplink, although available uplink bandwidth is usually smaller than downlink. Thus, bandwidth is particularly important in this system.
A WebRTC-based system [41], [42] is more suitable to meet these requirements and support tele-US at large distances. This framework provides a direct peer-to-peer connection between the expert and follower, thus removing server-related delays.
Data is sent either over data channels (general data) or media channels (encoded video streams), which are built upon the Stream Control Transmission Protocol (SCTP) [43] and the Real-time Transport Protocol (RTP) [44], respectively. RTP is built on top of UDP (User Datagram Protocol) but is highly optimized for real-time video communication, while SCTP can act much like UDP with some improved features. As with UDP, dropped packets can be ignored for maximum performance. In this case there is no guarantee that packets sent in a specific order will arrive in the same order. However, a second configuration is available which guarantees chronological ordering while not retransmitting dropped packets. Finally, full acknowledgement and retransmission can be configured as well, leading to more TCP-like behaviour. These settings are tested in Section II-D. Generally, however, given the high-performance application, the higher speeds of UDP are preferable to the reliability of TCP. Dropped packets are quickly replaced with new information, and local consistency checks are in place.
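The three delivery modes described above can be summarized in a short sketch. It assumes that they map onto the `ordered` and `maxRetransmits` options of the standard WebRTC `RTCDataChannelInit` dictionary. The `ChannelConfig` class, the mode names, and the assumption that the reliable mode is also ordered (the WebRTC default) are illustrative and are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChannelConfig:
    """Hypothetical container mirroring two RTCDataChannelInit options."""
    ordered: bool                   # deliver messages in the order they were sent
    max_retransmits: Optional[int]  # None means retransmit until delivered


# UDP-like: dropped packets are ignored and arrival order is not guaranteed.
UNORDERED = ChannelConfig(ordered=False, max_retransmits=0)
# Ordered: chronological ordering is kept, dropped packets are not resent.
ORDERED = ChannelConfig(ordered=True, max_retransmits=0)
# TCP-like: full acknowledgement and retransmission (assumed ordered here).
RELIABLE = ChannelConfig(ordered=True, max_retransmits=None)


def retransmits(cfg: ChannelConfig) -> bool:
    """Return True if the configuration ever resends a dropped packet."""
    return cfg.max_retransmits is None or cfg.max_retransmits > 0


for name, cfg in [("unordered", UNORDERED), ("ordered", ORDERED), ("reliable", RELIABLE)]:
    print(f"{name:9s} ordered={cfg.ordered} retransmits={retransmits(cfg)}")
```

Only the last configuration can stall the channel while waiting for a lost packet, which is the behaviour that the reliability tests are intended to expose.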
A further benefit of WebRTC is that it uses several existing sub-protocols such as Session Description Protocol (SDP) and Interactive Connectivity Establishment (ICE) to establish an optimal connection between two peers over any network and through any router NAT (Network Address Translation) scheme or firewall. This is achieved by having both peers connect to a signaling server and exchange SDP information. Based on this information, they can automatically discover an efficient route through the different network hops between the peers. Once connectivity is established, all data is sent directly peer to peer, and the signaling server is no longer needed. In addition, WebRTC uses Datagram Transport Layer Security (DTLS) to ensure the connection is authenticated and encrypted. Thus, all transported information is secure, making it ideal for this medical application.
We have implemented WebRTC-based communication for the human teleoperation system, using separate data and media channels for each of the rows in Table I, in addition to two control channels that exchange occasional commands. The signaling server is implemented in Python and runs on a password-protected Web server hosted on Heroku, a cloud platform. All SDP data is securely encrypted before being sent to the server, and is decrypted by the other peer. The flexibility of WebRTC allows the system to work without any modification over Ethernet, Wi-Fi including enterprise networks in universities or hospitals, 4G LTE, or 5G.
In collaboration with Rogers Communications, we have set up an antenna which connects to mobile networks and to a Wi-Fi router, which in turn connects to the HoloLens via Wi-Fi or to a PC via Ethernet, thus allowing the HoloLens or PC to communicate over the mobile network. A diagram showing the setup is in Fig. 4. The University of British Columbia was the first campus in North America to be equipped by Rogers with a non-standalone (NSA) sub-6 GHz 5G network, allowing the system to be tested over 4G and 5G. The 5G network in particular holds promise for achieving the required bandwidth and latency, and provides additional features such as multi-access edge computing (MEC), allowing costly computations to be outsourced at very low latency to a server at the base station. Furthermore, 5G can utilize a mm-wave band, leading to vastly improved latencies and throughputs. Testing the benefits of both MEC and mm-wave will constitute future work.

B. Instrumented Test Probe
In order to complete the teleoperation system, force and pose feedback are required from the real ultrasound probe. The measured force is compared to the desired one in order to generate the visual force indicator for the follower to track. Similarly, the measured pose can be compared to the desired one to produce a feedback signal, or it can be used in conjunction with the measured force to estimate the mechanical tissue impedance to feed back to the expert's haptic device. In [29] and Section III-C, the measurements are used to characterize human performance in the system.
To implement pose sensing, several options were explored. An inertial measurement unit (IMU) can provide accelerometer and gyroscope readings which give a good orientation estimate but are subject to large drift and not feasible for position tracking. Optical tracking using an NDI Polaris or similar device is fast and accurate and was tested with our system. However, it loses tracking when the reflective markers are occluded, which happens often during an ultrasound exam. Initial work on a similar infrared-marker-based optical tracking system using the HoloLens IR sensor was carried out in [45]. However, this suffers from some of the same occlusion problems, and it was found that the HoloLens 2 tracking was only accurate to about 3-4 mm and had a relatively low update rate which was not sufficient for this application. Sensor fusion with optical tracking and IMU data has also been explored [46], but adds complexity. We instead utilized an electromagnetic tracking system (NDI driveBAY) which does not rely on line-of-sight and is accurate to about 1.4 mm and 0.5°. With a readout rate of up to 420 Hz and very small size, it is ideal for this application.
The electromagnetic sensor includes a small sensing element (Fig. 2a), and a transmitter (Fig. 2e), which also defines the sensing coordinate frame. ArUco markers [47] are included in known positions on the transmitter, allowing the HoloLens to accurately determine its pose in the HoloLens frame, thus providing the transform from measured force coordinates to desired force coordinates.
For force sensing, an ATI Nano25 6-axis force/torque sensor was used for its high precision (0.02-0.06 N), reliability, and small size. This can be installed between a 3D printed shell and the ultrasound probe as done in [14], [48], [49], [50], [51]. For the tests presented here and in [29], it was instead installed at the tip of a 3D printed dummy ultrasound probe to ensure best possible accuracy. The instrumented dummy ultrasound probe is shown in Fig. 2.
Both sensors are connected to a PC, referred to as the sensor PC, which communicates the readings to the HoloLens via WebRTC, over the local WiFi.

C. Mixed Reality for Pose and Force Tracking
The primary premise of human teleoperation is efficient tracking of pose and force using MR overlays. The speed and accuracy for tracking step changes is presented in Section III-C. For these tests, a virtual ultrasound probe was projected into the follower's field of view, as shown in Fig. 3. The follower's goal is to align his/her probe as well as possible with the virtual one, thus matching the desired pose. In some lighting conditions, the virtual probe can occlude the real one, leading to increased position error. Thus, in [29], the effectiveness of a full probe rendering was compared to a scheme in which the central part of the virtual probe was removed. Additionally, the full probe's opacity can be adjusted. For the step response tests in this study, a full probe rendering was used.
Forces are also an important part of US imaging, determining which structures are visible, and ensuring they are not too deformed. To achieve force tracking, several visual force rendering methods were developed and tested in [29]. The two most promising schemes are shown in Fig. 3. First, the expert applies their desired force to the haptic device. Forces in US are typically between 0 and 20 N [49], but the haptic device is limited to 8 N. Thus, the forces are scaled down, which has the added benefit of decreasing the load on sonographers, who are known to suffer from increased incidence of musculoskeletal injury [52]. The follower's measured forces are then compared to the desired ones to generate an error signal. The virtual US probe then either changes color continuously between blue, green, and red (Fig. 3d) or an error-bar grows continuously towards or away from the patient and changes color (Fig. 3c) to indicate too little force, good force, or too much force respectively. The error-bar approach in particular is shown to be very effective [29], and is used in the step response tests.
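As a rough illustration of the force pipeline described above, the sketch below scales an expert's haptic-device force to a desired probe force and classifies the follower's force error. The linear 8 N to 20 N mapping, the 1 N tolerance, the discrete color classes (the actual rendering varies continuously), and all function names are assumptions made for illustration; they are not the authors' implementation.

```python
def scale_expert_force(device_force_n: float,
                       device_max_n: float = 8.0,
                       us_max_n: float = 20.0) -> float:
    """Map a force on the haptic device (0 to 8 N) to a desired probe force
    (0 to 20 N), assuming a simple linear scaling."""
    clamped = min(max(device_force_n, 0.0), device_max_n)
    return clamped * us_max_n / device_max_n


def force_indicator(desired_n: float, measured_n: float,
                    tolerance_n: float = 1.0) -> tuple[float, str]:
    """Return the signed force error and a color class for the overlay.

    blue = too little force, green = good force, red = too much force.
    The tolerance band is a hypothetical parameter.
    """
    error = measured_n - desired_n
    if error < -tolerance_n:
        return error, "blue"
    if error > tolerance_n:
        return error, "red"
    return error, "green"


desired = scale_expert_force(4.0)        # expert presses with 4 N on the device
print(desired)                           # desired probe force in newtons
print(force_indicator(desired, 8.0))     # follower presses too lightly
print(force_indicator(desired, 10.5))    # follower is within tolerance
```

In an error-bar rendering, the signed error would set the length and direction of the bar, while the color class would set its color.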
In most teleoperation tasks involving contact, forces and positions are controlled in orthogonal subspaces; i.e., forces are controlled normal to the surface being contacted, and positions are controlled in the two tangent directions [53]. This applies well to ultrasound procedures too, so the step response tests were performed on a flat, rigid surface with forces normal to the surface and motions tangent to it.

D. Communication System Latency Tests
While the human-computer interaction and human tracking performance are characterized in [29], and the force feedback and practical use in a clinical environment will be evaluated carefully in future work, this paper focuses on latency, both in the communication system and in the human response time.We performed a number of tests to determine system performance over different networks and in various conditions.
To perform these tests, the human teleoperation system was modified to send synthetic data of a specific size, generated randomly and sent at a set rate. Thus, the throughput could be adjusted. A diagram of the test setup is shown in Fig. 4. The data was communicated constantly for 4 minutes during each test. A separate data channel with timing packets was set up to measure the latency. In fact, the round trip time (RTT) was measured instead of latency because clock drift between two devices can easily be of the same scale as the communication latency, making direct measurement of latency impractical. Instead, a specific procedure was devised to cancel out the clock drift, as follows.
Every 50 ms, a microsecond-resolution, 64-bit timestamp, $t_1$, is measured on the follower side and sent to the expert side. Immediately upon receipt of the message, the expert side measures its own timestamp, $t_2$, appends it to the packet, and prepares to send it back. Directly before sending, another timestamp, $t_3$, is measured and appended. When the follower receives the response, it immediately measures a fourth timestamp, $t_4$. The RTT can then be calculated as

$$\mathrm{RTT} = (t_4 - t_1) - (t_3 - t_2). \tag{1}$$

If we consider a clock drift of $\delta t$, and denote times in the follower clock with a prime (e.g., $t'$), then in the follower clock, $t'_2 = t_2 + \delta t$ and $t'_3 = t_3 + \delta t$. The RTT from (1) then becomes

$$\mathrm{RTT} = (t'_4 - t'_1) - \big((t'_3 - \delta t) - (t'_2 - \delta t)\big) = (t'_4 - t'_1) - (t'_3 - t'_2). \tag{2}$$

Thus, the clock drift is effectively canceled out. For an approximate latency figure, one can take RTT/2. Since most networks are faster for downlink than uplink, however, this is not necessarily a good approximation, so we use RTT for the remainder of the paper.
Different amounts of data were sent over different WebRTC channels to simulate the data in the teleoperation. In total, 9 different throughputs were tested, each in 7 different network conditions. These are outlined in Tables II and III, respectively. During testing, SINR values varied randomly by a few points. The different conditions were achieved by testing in different locations. The expert side was stationary in one building, while the follower side was moved to a lab two buildings down for some of the tests.
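The four-timestamp exchange used for the RTT measurement can be illustrated with a minimal sketch. The function name, the simulated link delays, and the clock offset below are hypothetical values chosen only to show that a constant offset between the two clocks does not affect the result.

```python
def round_trip_time(t1: int, t2: int, t3: int, t4: int) -> int:
    """RTT in microseconds from the four timestamps.

    t1 and t4 are read on the follower clock, t2 and t3 on the expert clock.
    Subtracting (t3 - t2) removes the expert's processing time, and because
    t2 and t3 share the same clock, any constant offset between the two
    clocks cancels.
    """
    return (t4 - t1) - (t3 - t2)


def simulate(offset_us: int) -> int:
    """Simulate one exchange with a given expert-clock offset (hypothetical)."""
    uplink_us, processing_us, downlink_us = 20_000, 500, 15_000
    t1 = 1_000_000                            # follower clock at send
    t2 = t1 + uplink_us + offset_us           # expert clock at receipt
    t3 = t2 + processing_us                   # expert clock just before reply
    t4 = t1 + uplink_us + processing_us + downlink_us  # follower clock at receipt
    return round_trip_time(t1, t2, t3, t4)


# The same RTT (uplink + downlink) is recovered regardless of the offset.
for offset in (0, 123_456_789, -42):
    print(offset, simulate(offset))
```

Note that the sketch also shows why RTT/2 is only an approximate one-way latency: the uplink and downlink delays need not be equal.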
No test was performed in poor 5G signal because the network automatically switched to 4G in this case. Indeed, in an NSA network, as was available for our tests, a given user equipment (UE) device connects to the nearest base station via a relatively static LTE primary carrier, but may dynamically send data over 4G or 5G, depending on the UE's current throughput and latency requirements. It is not known or controllable by the user which network the data is ultimately sent over. However, in our testing, data was sent at a high rate and throughput and we moved to a location where the 5G carrier had a far higher signal-to-interference-plus-noise ratio (SINR) than the LTE. Thus, with the markedly faster results, it can safely be assumed that the data was sent over 5G. In the 5G tests, the LTE SINR was around 2-3 dB. To test 4G latency, the antenna was configured not to connect to 5G. Note that it was not possible to configure the antenna to connect only to 5G, as no standalone (SA) 5G network was available. The Ethernet and WiFi tests in Table III refer to the follower being connected directly to the Internet via Ethernet or WiFi. The expert PC is always connected via Ethernet. Fig. 4 shows the path taken by data between the expert and follower. Notice that when communicating over 4G or 5G with the HoloLens 2, there is first a hop over WiFi; i.e., the HoloLens 2 connects to the RF antenna via WiFi. To establish the delay associated with this hop, an equivalent C# program was written for the sensor PC, which was attached to the antenna via Ethernet. All mobile network tests in Table III were carried out over Ethernet, and then an additional set of tests was performed to determine the added latency from the WiFi hop. The latency over Ethernet was also measured, representing the best reasonably achievable performance.
When data is sent over a data channel, it is first added to the channel's send queue, which tends to fill if there is network congestion, leading to packet delays. We therefore experimented with splitting the US channel, which had a large throughput, into two smaller channels. This test was repeated at 2.17 Mbps and 4.17 Mbps in medium 5G conditions (Table III) to determine the effect. Furthermore, as mentioned in Section II-A, the underlying transmission protocol can be configured as reliable (TCP-like retransmission of dropped packets), ordered (packets guaranteed to arrive in order), or neither. We performed tests at 46.1 kbps and 2.17 Mbps throughput in medium 5G conditions (Table III) using each mode to determine the effect on latency.

E. Human Response Tests
Finally, initial results in [1] showed that the system's latency was limited not by the communication latency, but rather by the reaction time of the follower. To quantify this carefully, a series of step response tests was carried out for force and position tracking of the follower. Using the experimental setup described in [29], n = 11 healthy volunteers aged 20-64 (mean age 32) were asked to track the virtual US probe as fast as possible with the instrumented probe from Section II-B. The step response consisted simply of a series of step changes in pose or force. The directions of the position jumps and the interval between steps were randomized to avoid the subject learning and anticipating where or when the next step would occur. Each input signal was generated on the expert side and sent via the described communication system, over WiFi, to the follower, where it was rendered by the MR headset and tracked by the follower. For force, the error-bar rendering was used (Fig. 3), and the follower held the dummy probe against a rigid table. The desired forces were normal to the surface. All desired and measured positions and forces were logged on the HoloLens with timestamps to allow precise comparison.
All of the step response tests were performed at a constant amplitude of 10 cm and 11 N for position and force respectively. However, we expect the response to be slightly different at different amplitudes. This was tested in a few subjects by having them perform the step response tests at four different input amplitudes: 2.5, 5, 10, and 15 cm, and 3, 6, 12, and 18 N. The tests all occurred indoors in the same lighting conditions, and the user group was diverse to avoid bias (various ages, male and female, various professions and backgrounds). The Kolmogorov-Smirnov test was used to measure statistical significance.
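For readers unfamiliar with the test, the following is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic, assuming that the two-sample form is the one used to compare groups of measurements (the text does not specify). The function name and the sample values are hypothetical toy data, and the p-value computation is omitted.

```python
def ks_two_sample_statistic(a: list[float], b: list[float]) -> float:
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both empirical CDFs past every observation equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Toy samples (not measured data): fully separated, identical, and overlapping.
print(ks_two_sample_statistic([1, 2, 3], [4, 5, 6]))
print(ks_two_sample_statistic([1, 2, 3], [1, 2, 3]))
print(ks_two_sample_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A statistic near 1 indicates that the two samples barely overlap, while a statistic near 0 indicates nearly identical distributions.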

A. Communication System Throughput and Latency
The results of the tests in good network conditions are found in Table IV, showing the difference between Ethernet, WiFi, 4G, and 5G. With Ethernet, it would even be possible to maintain a 1 kHz control loop for bilateral teleoperation with force feedback. The WiFi link adds about 4-6 ms delay on top of the Ethernet but is still very fast. The mobile networks are slower but still fast, with 5G being about 5-10 ms faster than 4G. Both become significantly slower as throughput grows very large, but remain for the most part below the thresholds for human perception of delays in haptic or visual feedback cited in Section I-A. Further tests in medium to poor network conditions are shown in Fig. 5.
In these tests the network demonstrates similar behaviour to good conditions for low throughputs. However, at about 1 Mbps the RTT makes a sudden, large jump of more than an order of magnitude, then stays relatively constant. In medium 5G, the jump occurs later, at about 2 Mbps. Fig. 6 compares the different networks and network conditions for the two distinct cases, before and after the jump in RTT. We see that at low throughputs, good LTE and medium 5G (NR) have similar performance, but that the medium 5G has a long tail towards large RTTs as throughput increases, though the median remains relatively low. Good 5G has a significantly better performance at all throughputs than LTE (p < 0.001), and within LTE, variance decreases with improving signal condition.

B. Queuing and Reliability
The large latencies are in part due to long queueing delays in the send buffers as network congestion increases. A test was performed to determine if splitting the US channel into two would improve performance by adding a second send buffer. The results are shown in Table V. Interestingly, splitting leads to significantly worse performance, and this does not even take into account the overhead of synchronizing packets and recombining the image on the expert side. Finally, the effect of packet reliability was tested and is shown in Table VI and Fig. 7. At small throughputs there is relatively little difference, though reliable packets are still significantly slower than the other two. At large throughputs, however, there is a marked difference. As seen in Fig. 7, the reliable mode leads to massive delays which are unacceptable in this system. Conversely, ordered and unordered means remain very similar, though with ordered packets the median is lower and the variance is much larger, as seen by the relatively large number of outliers. Thus, as hypothesized in Section II-A, UDP-like communication without retransmission is best for this system, where low latency and high data rate are key.

C. Human Step Response
In total, 418 step responses, plotted in Fig. 8, were measured and analyzed. We define reaction time (RT) as the time delay between the initiation of the desired step and when the user starts moving, specifically when the second derivative is maximum. We define rise time to be the time taken between starting the motion and finishing it, i.e., when the second derivative is minimum. The reaction times, rise times, and steady-state errors are listed in Table VII.
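The two definitions above can be sketched in a few lines. This is a minimal illustration rather than the analysis code used for Table VII: it assumes a uniformly sampled, noise-free response oriented so that the step is positive, and the function names and the raised-cosine test signal are hypothetical. Measured data would likely need smoothing before differentiating twice.

```python
import math

def second_difference(y, dt):
    """Central-difference estimate of the second derivative at interior samples."""
    return [(y[i + 1] - 2.0 * y[i] + y[i - 1]) / (dt * dt)
            for i in range(1, len(y) - 1)]

def step_metrics(y, dt, step_index=0):
    """Return (reaction_time, rise_time) in seconds for a sampled response.

    Motion onset is taken at the maximum of the second derivative and
    motion end at its minimum, following the definitions in the text.
    """
    acc = second_difference(y, dt)
    # acc[k] belongs to sample k + 1 of y, hence the index shift.
    onset = max(range(len(acc)), key=acc.__getitem__) + 1
    end = min(range(len(acc)), key=acc.__getitem__) + 1
    return (onset - step_index) * dt, (end - onset) * dt

def synthetic_step(n_samples, onset, duration, amplitude):
    """Noise-free raised-cosine move from 0 to amplitude (hypothetical signal)."""
    y = []
    for i in range(n_samples):
        if i <= onset:
            y.append(0.0)
        elif i >= onset + duration:
            y.append(amplitude)
        else:
            phase = math.pi * (i - onset) / duration
            y.append(0.5 * amplitude * (1.0 - math.cos(phase)))
    return y

# A 10 cm move sampled at 100 Hz that starts 0.5 s after the step command
# (sample 0) and lasts 0.3 s.
response = synthetic_step(150, onset=50, duration=30, amplitude=0.10)
rt, rise = step_metrics(response, dt=0.01)
print(f"reaction time: {rt:.2f} s, rise time: {rise:.2f} s")
```

Because the second difference is taken on discrete samples, the detected onset and end each fall about one sample inside the true motion interval, so the estimates are accurate only to within a few sample periods.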
The RTs in the down steps are faster on average than on the up steps (Table VII). In the tests, the probe was moved from a central position to a point 10 cm away in a random direction, then always back to the central position. Similarly, the force always returned to the same low value. Despite differences in the interval between steps, returning to a more familiar position or force decreased the reaction time significantly (p = 0.047 for position, p < 0.001 for force). This implies that part of the reaction time involves processing in which direction to move, or how hard to press.
On the steps up in particular, the fastest responses also have larger overshoot and more oscillation. Clearly, these users adopted more aggressive, higher-gain controllers. All such users were young (< 25). Indeed, there is a positive correlation between age and RT in the step responses, with correlation coefficient 0.5 (p < 0.001). As expected, older participants reacted more slowly. However, there was no similar correlation between age and rise time (correlation 0.06), so the limitation appears to be cognitive, not physical.
The force RTs were much faster than the position RTs (p < 0.001), likely because no motion was required, only a change in force. Conversely, rise times were much slower for forces (p < 0.001), because the followers had to rely entirely on the visual feedback rather than having an intuitive feel for whether they had achieved the desired value, unlike for the position tracking. This is discussed more in Section IV.
The step responses were repeated in a few subjects at four different input amplitudes, to determine the effect of step magnitude. At larger magnitudes, the rise time is of course larger, since the person has to move further (correlation 0.98, p = 0.018). However, the RT is also slower (correlation 0.93, p = 0.07). These are shown in Table VIII and agree well with Fitts's Law, which describes a relation between reaction time and motion amplitude [54]. The same trend is not present in the force step response, where no physical motion was required.
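For reference, the law is most commonly written in the form below, where it is stated for movement time. The symbols are the conventional ones and are not taken from this paper: $MT$ is the movement time, $D$ the distance to the target, $W$ the target width, and $a$, $b$ empirically fitted constants.

```latex
% Fitts's original formulation; a and b are fitted per task and subject.
MT = a + b \log_2\!\left(\frac{2D}{W}\right)
```

For a fixed target width, the logarithmic term grows with $D$, which is consistent with the longer times observed at larger step amplitudes.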

IV. DISCUSSION AND CONCLUSION
In the original system described in [1], latency tests on the ROS WebSocket implementation showed on average 11.4 ms delay for pose and force transmission over a local network. In contrast, the new system has delays of 1.07 ± 0.57 ms over Ethernet or 5.80 ± 3.30 ms over WiFi on average for the same local network. In addition, the latency of the video data over WebRTC is between 100-200 ms whereas previously with Windows Device Portal it was found to be ≥ 4 seconds, and with ROS it was infeasibly slow. Furthermore, the new system can run remotely over the Internet, is secure, and works on mobile networks as well. It achieves RTTs of 38-67 ms over 4G LTE and 27-70 ms over 5G, depending on data throughput. The presented system is thus a major improvement over the original prototype.
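To put these RTTs in context, they can be set against the 150 to 550 ms span of human response times from the step response tests. The sketch below is a rough budget only: adding the full RTT to the human response time is a simplifying assumption, not the latency model of this paper, and the function names are hypothetical.

```python
# RTT figures quoted in the text (ms), as (low, high) ranges.
RTT_MS = {
    "Ethernet": (1.07, 1.07),
    "WiFi": (5.80, 5.80),
    "4G LTE": (38.0, 67.0),
    "5G": (27.0, 70.0),
}
# Span of human response times from the step response tests (ms).
HUMAN_RESPONSE_MS = (150.0, 550.0)

def human_share(rtt_ms, human_ms):
    """Fraction of (RTT + human response) attributable to the human."""
    return human_ms / (rtt_ms + human_ms)

def worst_case_share(network):
    """Smallest human share: highest RTT paired with the fastest response."""
    _, rtt_high = RTT_MS[network]
    return human_share(rtt_high, HUMAN_RESPONSE_MS[0])

for name in RTT_MS:
    print(f"{name}: human share of total >= {worst_case_share(name):.0%}")
```

Even when the highest quoted RTT is paired with the fastest human response, the human accounts for more than half of the total under this simple additive assumption.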
From the results we can conclude that the communication should be run without retransmission, and likely without ordering guarantees. High-volume data channels should not be split, although the US images require further consideration (see below). For optimal performance, WiFi should be used when possible, or even Ethernet via a USB-C adapter on the HoloLens 2. WiFi adds 4-6 ms latency over the 1 ms latency achieved by Ethernet. Both 5G and 4G offer high performance as well when needed, though more care is required. In worse network connectivity, some parts of the system, for example the video conference, may have to be turned off, and the video and US quality should be adjusted dynamically. This is already the case for the video stream, but it needs to be implemented for the US. Only in poor LTE connectivity, with SINR below about 4 dB, is teleoperation with transmission of reasonable-quality US images infeasible. In all cases, for good network conditions the teleoperation latency is strongly dominated by the human response time, which is between 150-550 ms. For poor network conditions and large throughputs, however, this relation can reverse, which leads to very unintuitive teleoperation. This condition should be avoided.
The step response RTs match well with previously proposed values for human visual system RT. Badau et al. describe three different RTs (simple RT, recognition RT, and cognitive RT), which have significantly different values [55]. Simple RT is for tasks where the subject sees an indicator and pushes a button, whereas in recognition RT the user has to recognize a specific object among a collection of shapes and locate and click on the object. This explains the difference between the force and pose RTs in Section III-C. Force is a simple RT: the user sees the error-bar change and pushes down, which happens very fast. On the other hand, pose involves a recognition RT and is thus slower: the user has to recognize which direction the probe moved in, and follow it. In this way, much of the processing of where to move occurs before initiating the motion for pose tracking, while for force tracking deciding how hard to press occurs during the motion. Hence, the rise time for pose is much faster than for force.
It was also found that younger users were faster and in some cases adopted a more aggressive controller with overshoot. This precisely mirrors what is found in [56] and [57].
Finally, Carlton argues that the reaction time approach to studying processing delays is not appropriate when visual information constitutes feedback from continuous motion [54]. This suggests that better performance can be expected during teleoperation when motions are relatively smooth and continuous, as opposed to the large steps shown here. Indeed, the tracking delays we measured for continuous motions and force sequences in [29] are much shorter than the discrete reaction times from the step responses (Table VII), and are more realistic representations of an ultrasound exam. Nonetheless, the step response tests presented here represent a worst-case response time, which is important to know. Furthermore, the response time and accuracy are dependent on the rendering method used to show the desired position and force. In [29] we tested four different rendering schemes and present here step response results using only the best two. However, better schemes likely exist, so the results presented here constitute a baseline.
Although the transmission protocol decided upon in the above tests achieves its performance by ignoring dropped packets, there is still retransmission at a lower level. The mobile network itself can run in acknowledge or nonacknowledge mode, in which dropped or corrupt packets are retransmitted or not, respectively. The Rogers network used in the tests runs using a default "best effort" quality of service (QoS), which includes acknowledge mode. As it is a public network, we were unable to change this or test its effect. Furthermore, again since it is a public network, the tests were subject to the amount of traffic currently loading the network from students, faculty, and staff on the university campus. For this reason, all tests were performed early in the morning when few students were present. However, configuring the network to treat packets from this critical medical application with a different QoS (i.e., with higher priority and without acknowledge mode) would further increase performance and reliability. Though performance was already sufficient for human teleoperation, and thus no special QoS configuration is required, the performance of haptic feedback could likely benefit.
A limitation of this study is that all the network tests were carried out on a single network, which is subject to certain configurations as explained above. Different networks in different locations will lead to slightly different performance. Similarly, the expert PC was connected to the Internet via an institutional enterprise network, which likely adds some latency. Further, the human study was limited. Though the volunteers represented a mix of sexes and ages, further tests should be performed on a larger, more diverse population of novices and a specialized population of sonographers. While the tests presented here aimed to ascertain human performance limitations, specific performance tests should be carried out for ultrasound, using realistic motion ranges from sonographers and radiologists for standardized exams [58]. Additionally, the effect of increased communication latency on the teleoperation should be tested, including the follower's ability to track and the expert's ability to guide the task effectively.
Currently, the US images are streamed with JPEG compression from the Clarius C3HD3 device to the sensor PC using the ClariusCast API. From here, they are forwarded to the expert and follower via WebRTC. However, as seen in the results, the large throughput required for this can seriously affect the communication latency. Sending individual JPEG images is highly inefficient, especially considering that the US image does not change much from frame to frame. Thus, future work will investigate sending difference images between frames, with video encoding such as H.264 or VP8 and variable quality depending on the connection. This will dramatically reduce the required throughput.
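The benefit of difference images can be illustrated with a small, self-contained experiment. The sketch below is not the planned implementation: lossless `zlib` stands in for JPEG or a video codec, the frames are synthetic noise rather than US images, and all function names are hypothetical. It compares the compressed size of a full frame with that of its difference from the previous frame.

```python
import random
import zlib

def make_frame(width, height, seed):
    """Synthetic 8-bit grayscale frame filled with pseudo-random noise."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(width * height))

def perturb(frame, n_changes, seed):
    """Copy of a frame with a small number of pixels overwritten."""
    rng = random.Random(seed)
    out = bytearray(frame)
    for _ in range(n_changes):
        out[rng.randrange(len(out))] = rng.randrange(256)
    return bytes(out)

def encode_delta(prev, cur):
    """Compress the per-pixel difference (mod 256) between two frames."""
    diff = bytes((c - p) % 256 for p, c in zip(prev, cur))
    return zlib.compress(diff)

def decode_delta(prev, payload):
    """Rebuild the current frame from the previous frame and a delta payload."""
    diff = zlib.decompress(payload)
    return bytes((p + d) % 256 for p, d in zip(prev, diff))

frame0 = make_frame(128, 128, seed=1)
frame1 = perturb(frame0, n_changes=200, seed=2)

full_payload = zlib.compress(frame1)
delta_payload = encode_delta(frame0, frame1)
assert decode_delta(frame0, delta_payload) == frame1

print(f"full frame: {len(full_payload)} bytes, delta: {len(delta_payload)} bytes")
```

Because most difference values are zero when consecutive frames are nearly identical, the delta payload is much smaller than the independently compressed frame. Real US frames and lossy codecs will behave differently, so the ratio here is indicative only.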
Future work will also involve performing human trials with patients in the community and expert sonographers at Vancouver General Hospital to establish the practicality of the system. We are also developing miniaturized force sensing transducers which can be integrated in a low profile shell on an US probe to provide force feedback without disrupting the ultrasound imaging [59]. Using the measured forces we can study stable and transparent force reflection in bilateral teleoperation under time delays imposed by the human response time. Furthermore, the human-computer interface can be optimized, and reinforcement learning for autonomous US guidance can be explored. This constitutes an exciting avenue for autonomy since there is no possibility of dangerous or unpredictable robot actions as the AI would control only the virtual probe.
To our knowledge, WebRTC has not been used in the context of telerobotics or tightly coupled teleguidance, to which it is well suited. The architecture and tests presented in this paper are very general and can thus be a benchmark or reference for others building telerobotics applications. They also show what performance can be expected at certain bandwidths or signal conditions. The human tests can also inform the design of any AR/MR/VR system that involves human interaction.

Fig. 1. System Architecture. k is a scaling factor for the force while T is the transform from expert to follower coordinates, obtained from the mesh. The force feedback (dotted lines) has not yet been implemented. More details are in Section I-B.

Fig. 2. Instrumented dummy US probe (c) for tests, including pose sensing (a) and force sensing at the tip (d). The pose sensor is shown next to a thumb tack for scale. Both sensors connect to a PC (b), and the electromagnetic transmitter (e) has ArUco markers for registration.

Fig. 3. Follower side showing the follower wearing a HoloLens 2 (a), the holographic user interface (b), and the different force rendering schemes: Error-bar (c), and Color (d).

Fig. 4. Test setup for testing the communication system. POE = Power Over Ethernet. The WiFi router can either be connected to the mobile network or directly to the Internet via a wired connection. The HoloLens and sensor PC can similarly connect to the router via WiFi or Ethernet.

Fig. 6. RTT with large and small throughput in different network conditions.

TABLE I. UPLINK AND DOWNLINK THROUGHPUT ON THE FOLLOWER SIDE, WHICH IS MOBILE AND THUS MORE BANDWIDTH LIMITED. ALL THE DATA IS CONSTANTLY SENT, FOR AN APPROXIMATE THROUGHPUT OF 6.81 MBPS, EXCEPT FOR THE MESH DATA WHICH IS ONLY SENT RARELY ON DEMAND. IMPROVING THE US STREAMING WILL DECREASE THE REQUIRED BANDWIDTH VERY SUBSTANTIALLY. THE TIMING CHANNEL IS USED TO CALCULATE LATENCY AS DESCRIBED IN SECTION II-D

TABLE II. LATENCY TESTS WERE PERFORMED WITH THESE THROUGHPUTS. THE LAST 5 ROWS TEST POSSIBLE SIZES OF US STREAM, WHERE 6.81 MBPS CONSTITUTES SENDING THE US WITH JUST JPEG COMPRESSION

TABLE III. LATENCY TESTS WERE PERFORMED IN THESE NETWORK CONDITIONS TO SIMULATE CONDITIONS THAT WOULD BE ENCOUNTERED IN THE FIELD. SINR = SIGNAL TO INTERFERENCE PLUS NOISE RATIO, RSRP = REFERENCE SIGNAL RECEIVED POWER, RSSI = RECEIVED SIGNAL STRENGTH INDICATOR

TABLE IV. RTT (MS) VERSUS THROUGHPUT IN GOOD SIGNAL CONDITIONS FOR DIFFERENT NETWORKS

TABLE V. EFFECT OF SPLITTING HIGH-VOLUME CHANNEL INTO TWO

TABLE VI. EFFECT OF PACKET RELIABILITY ON RTT. ALL PAIRINGS ARE SIGNIFICANTLY DIFFERENT WITH p < 0.001 EXCEPT NONE AND ORDERED FOR LOW THROUGHPUT

TABLE VII. STEP RESPONSE RESULTS

Fig. 7. Effect of packet reliability on RTT for 2.17 Mbps throughput in 5G with medium signal quality.