\section{Concept}
\label{principle}

The visuo-haptic texture rendering system is based on:
\begin{enumerate}[label=(\arabic*)]
\item a real-time interaction loop between the finger movements and a coherent visuo-haptic feedback simulating the sensation of a touched texture,
\item a precise registration of the \VE with its real counterpart, and
\item a modulation of the signal frequency by the estimated finger speed with phase matching.
\end{enumerate}

\figref{diagram} shows the interaction loop diagram and \eqref{signal} the definition of the vibrotactile signal.
The system consists of three main components: the pose estimation of the tracked real elements, the visual rendering of the \VE, and the vibrotactile signal generation and rendering.
\figwide{diagram}{Diagram of the visuo-haptic texture rendering system.}[
\setstretch{1.2}
Fiducial markers attached to the voice-coil actuator and to the augmented surfaces to be tracked are captured by a camera.
The positions and rotations (the poses) ${}^c\mathbf{T}_i$, $i=1..n$ of the $n$ defined markers in the camera frame $\poseFrame{c}$ are estimated, then filtered with an adaptive low-pass filter.
These poses are used to move and display the virtual model replicas aligned with the \RE.
A collision detection algorithm detects a contact of the virtual hand with the virtual textures.
If so, the velocity of the finger marker ${}^c\dot{\mathbf{X}}_f$ is estimated using a discrete derivative of the position and adaptive low-pass filtering, then transformed into the texture frame $\poseFrame{t}$.
The vibrotactile signal $r$ is generated by modulating the (scalar) finger velocity ${}^t\hat{\dot{X}}_f$ in the texture direction with the texture period $\lambda$ (\eqref{signal}).
The signal is sampled at \qty{48}{\kilo\hertz} and sent to the voice-coil actuator via an audio amplifier.
All computation steps except signal sampling are performed at \qty{60}{\hertz} and in separate threads to parallelize them.
]
\section{Description of the System Components}
\label{system_components}

\subsection{Pose Estimation}
\label{pose_estimation}
A \qty{2}{\cm} AprilTag fiducial marker \cite{wang2016apriltag} is glued to the top of the actuator (\figref{device}) to track the finger pose with a camera (StreamCam, Logitech), which is placed above the experimental setup and captures \qtyproduct{1280 x 720}{px} images at \qty{60}{\hertz} (\figref{apparatus}).
Other markers are placed on the real surfaces to be augmented (\figref{setup}) in order to estimate the relative position of the finger with respect to these surfaces.
Contrary to similar work, using vision-based pose estimation both frees the hand movements and allows any real surface to be augmented.

A camera external to the \AR headset with a marker-based technique is employed to provide accurate and robust pose estimation with a constant view of the markers \cite{marchand2016pose}.
We denote by $\pose{c}{T}{i}$, $i=1..n$, the homogeneous transformation matrix that defines the position and rotation of the $i$-th marker out of the $n$ defined markers in the camera frame $\poseFrame{c}$, \eg the finger pose $\pose{c}{T}{f}$ and the augmented surface pose $\pose{c}{T}{s}$.
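As an illustration of this step, the short Python sketch below recovers the pose of a single marker from its detected image corners with OpenCV's \texttt{solvePnP}; the corner detection, camera intrinsics and marker bookkeeping are assumed to be given, and this is not the code of the actual implementation.
\begin{verbatim}
import numpy as np
import cv2

TAG_SIZE = 0.02  # 2 cm AprilTag edge length, in metres

# 3D corner coordinates of the tag in its own frame (z = 0 plane).
OBJECT_POINTS = 0.5 * TAG_SIZE * np.array(
    [[-1, 1, 0], [1, 1, 0], [1, -1, 0], [-1, -1, 0]], dtype=np.float64)

def marker_pose(image_corners, camera_matrix, dist_coeffs):
    """Return the 4x4 homogeneous pose cT_i of a marker in the camera frame.

    image_corners: (4, 2) array of detected tag corners in pixels,
    e.g. from an AprilTag detector (not shown here).
    """
    ok, rvec, tvec = cv2.solvePnP(OBJECT_POINTS,
                                  image_corners.astype(np.float64),
                                  camera_matrix, dist_coeffs)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = tvec.ravel()
    return T
\end{verbatim}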
To reduce the noise in the pose estimation while maintaining good responsiveness, the 1€ filter \cite{casiez2012filter} is applied: a low-pass filter with an adaptive cut-off frequency, specifically designed for human motion tracking.
The filtered pose is denoted as $\pose{c}{\hat{T}}{i}$.
The optimal filter parameters were determined using the method of \textcite{casiez2012filter}, with a minimum cut-off frequency of \qty{10}{\hertz} and a slope of \num{0.01}.
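The 1€ filter itself can be summarised by the following Python sketch for one scalar component (\eg one coordinate of a marker position); the derivative cut-off of 1~Hz is the default value suggested by \textcite{casiez2012filter}, and the other parameters are those given above.
\begin{verbatim}
import math

class OneEuroFilter:
    """1-euro filter for one scalar signal sampled at a fixed rate."""
    def __init__(self, rate=60.0, min_cutoff=10.0, beta=0.01, d_cutoff=1.0):
        self.dt = 1.0 / rate
        self.min_cutoff, self.beta, self.d_cutoff = min_cutoff, beta, d_cutoff
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        # Smoothing factor of a first-order low-pass filter.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / self.dt)

    def __call__(self, x):
        if self.x_prev is None:
            self.x_prev = x
            return x
        # The filtered derivative of the signal sets the adaptive cut-off.
        dx = (x - self.x_prev) / self.dt
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        # Low-pass filter the signal itself with the adaptive cut-off.
        a = self._alpha(cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
\end{verbatim}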
The velocity (without angular velocity) of the finger marker, denoted as $\pose{c}{\dot{X}}{f}$, is estimated using the discrete derivative of the position.
It is then filtered with another 1€ filter with the same parameters, and denoted as $\pose{c}{\hat{\dot{X}}}{f}$.
Finally, this filtered finger velocity is transformed into the augmented surface frame $\poseFrame{s}$ to be used in the vibrotactile signal generation, as $\pose{s}{\hat{\dot{X}}}{f} = \pose{s}{T}{c} \, \pose{c}{\hat{\dot{X}}}{f}$.
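A minimal sketch of this velocity estimation is given below, reusing the \texttt{OneEuroFilter} of the previous sketch and assuming that, as a direction vector, the velocity is transformed by the rotation part of $\pose{s}{T}{c}$ only; the names are illustrative.
\begin{verbatim}
import numpy as np

RATE = 60.0  # pose estimation rate (Hz)

def finger_velocity_in_surface_frame(p_prev, p_curr, cT_s, velocity_filters):
    """p_prev, p_curr: filtered finger positions in the camera frame, shape (3,)
    cT_s: 4x4 pose of the surface marker in the camera frame
    velocity_filters: three OneEuroFilter instances, one per component"""
    # Discrete derivative of the filtered position (no angular velocity).
    v_cam = (p_curr - p_prev) * RATE
    # Second 1-euro filter, applied per component.
    v_cam = np.array([f(v) for f, v in zip(velocity_filters, v_cam)])
    # The inverse of a rotation matrix is its transpose: sR_c = (cR_s)^T.
    sR_c = cT_s[:3, :3].T
    return sR_c @ v_cam
\end{verbatim}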
\subsection{Virtual Environment Registration}
\label{virtual_real_registration}

\comans{JG}{The registration process between the external camera, the finger, surface and HoloLens could have been described in more detail. Specifically, it could have been described clearer how the HoloLens coordinate system was aligned (e.g., by also tracking the fiducials on the surface and or finger).}{This has been better described.}
Before a user interacts with the system, it is necessary to design a \VE that will be registered with the \RE during the experiment.
Each real element tracked by a marker is modelled virtually, \eg the hand and the augmented surface (\figref{device}).
In addition, the pose and size of the virtual textures are defined on the virtual replicas.
Prior to any usage, it is necessary to register the \VE with the \RE.
First, the coordinate system of the headset is manually aligned with that of the external camera by the experimenter, using a raycast from the headset to the origin point of the camera (the center of the black rectangle in the box in \figref{apparatus}).
This resulted in a \qty{\pm .5}{\cm} spatial alignment error between the \RE and the \VE.
While this was sufficient for our use cases, other methods can achieve better accuracy if needed \cite{grubert2018survey}.
Registering the coordinate systems of the camera and the headset thus allows the marker poses estimated with the camera to be used to display, in the headset, the virtual models aligned with their real-world counterparts.
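Concretely, denoting by $\pose{h}{T}{c}$ the transformation from the camera frame $\poseFrame{c}$ to the headset frame obtained from this manual alignment (a notation introduced here only for illustration), each virtual replica $i$ is displayed in the headset at the pose
\begin{equation*}
\pose{h}{T}{i} = \pose{h}{T}{c} \, \pose{c}{\hat{T}}{i} .
\end{equation*}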
\comans{JG}{A description if and how the offset between the lower side of the fingertip touching the surface and the fiducial mounted on the top of the finger was calibrated / compensated is missing}{This has been better described.}
An additional calibration is performed to compensate for the offset between the finger contact point and the estimated marker pose \cite{son2022effect}.
The current user then places their index finger on the origin point; the respective poses of the finger marker and of the origin are known from the attached fiducial markers.
The transformation between the marker pose of the finger and the finger contact point can then be estimated and compensated with an inverse transformation.
This makes it possible to detect whether the calibrated real finger touches a virtual texture using a collision detection algorithm (Nvidia PhysX).
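For illustration, this offset compensation amounts to the following transform algebra (Python/numpy sketch; the notation $\pose{c}{T}{o}$ for the tracked origin point and the function names are introduced here, not taken from the implementation):
\begin{verbatim}
import numpy as np

def calibrate_fingertip_offset(cT_f, cT_o):
    """cT_f: 4x4 pose of the finger marker while the fingertip rests
    on the origin point, in the camera frame.
    cT_o: 4x4 pose of the origin point, in the camera frame.
    Returns fT_tip, the constant transform from marker to contact point."""
    return np.linalg.inv(cT_f) @ cT_o

def fingertip_pose(cT_f, fT_tip):
    """Pose of the actual contact point, used for collision detection."""
    return cT_f @ fT_tip
\end{verbatim}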
In our implementation, the \VE is designed with Unity (v2021.1) and the Mixed Reality Toolkit (v2.7)\footnoteurl{https://learn.microsoft.com/windows/mixed-reality/mrtk-unity}.
The visual rendering is achieved using the Microsoft HoloLens~2, an \OST-\AR headset with a \qtyproduct{43 x 29}{\degree} \FoV, a \qty{60}{\Hz} refresh rate, and self-localisation capabilities.
A \VST-\AR or a \VR headset could have been used as well.

\subsection{Vibrotactile Signal Generation and Rendering}
\label{texture_generation}
A voice-coil actuator (HapCoil-One, Actronika) is used to display the vibrotactile signal, as it allows the frequency and amplitude of the signal to be controlled independently over time, covers a wide frequency range (\qtyrange{10}{1000}{\Hz}), and outputs the signal accurately with relatively low acceleration distortion\footnote{HapCoil-One specific characteristics are described in its data sheet: \url{https://tactilelabs.com/wp-content/uploads/2023/11/HapCoil_One_datasheet.pdf}}.
The voice-coil actuator is encased in a \ThreeD printed plastic shell and firmly attached to the middle phalanx of the user's index finger with a Velcro strap, to enable the fingertip to directly touch the environment (\figref{device}).
The actuator is driven by a class D audio amplifier (XY-502 / TPA3116D2, Texas Instruments).
The amplifier is connected to the audio output of a computer that generates the signal using the WASAPI driver in exclusive mode and the NAudio library\footnoteurl{https://github.com/naudio/NAudio}.

The represented haptic texture is a 1D series of parallel virtual grooves and ridges, similar to the real linear grating textures manufactured for psychophysical roughness perception studies \secref[related_work]{roughness}.
It is generated as a square wave audio signal $r$, sampled at \qty{48}{\kilo\hertz}, with a texture period $\lambda$ and an amplitude $A$, similar to \eqref[related_work]{grating_rendering}.
Its frequency is the ratio of the absolute filtered (scalar) finger velocity $\dot{x} = \poseX{s}{|\hat{\dot{X}}|}{f}$ to the texture period $\lambda$ \cite{friesen2024perceived}.
As the finger moves horizontally on the texture, only the $X$ component of the velocity is used.
This velocity modulation strategy is necessary as the finger position is estimated at a far lower rate (\qty{60}{\hertz}) than the audio signal (unlike high-fidelity force-feedback devices \cite{unger2011roughness}).

When a new finger velocity $\dot{x}\,(t_j)$ is estimated at time $t_j$, the phase $\phi\,(t_j)$ of the signal $r$ also needs to be adjusted to ensure continuity of the signal.
In other words, the sampling of the audio signal runs at \qty{48}{\kilo\hertz}, while its frequency and phase are updated at the far lower rate of \qty{60}{\hertz}, whenever a new finger velocity is estimated.
A sample $r(t_j, t_k)$ of the audio signal at sampling time $t_k$, with $t_k \geq t_j$, is thus given by:
\begin{subequations}
\label{eq:signal}
\begin{align}
r(t_j, t_k) & = A\, \text{sgn} \left( \sin \left( 2 \pi \frac{\dot{x}\,(t_j)}{\lambda} t_k + \phi(t_j) \right) \right) & \label{eq:signal_speed} \\
\phi(t_j) & = \phi(t_{j-1}) + 2 \pi \frac{\dot{x}\,(t_{j-1}) - \dot{x}\,(t_j)}{\lambda} t_j & \label{eq:signal_phase}
\end{align}
\end{subequations}
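A compact way to implement \eqref{signal} is to hold the instantaneous frequency $\dot{x}\,(t_j)/\lambda$ constant between two velocity updates and to accumulate the phase sample by sample, which is equivalent to the phase matching of \eqref{signal_phase}. The following Python sketch illustrates this; it is not the actual Unity/NAudio implementation, and the constants are only examples.
\begin{verbatim}
import numpy as np

SAMPLE_RATE = 48000    # audio sampling rate (Hz)
AMPLITUDE = 0.5        # signal amplitude A (illustrative)
LAMBDA = 0.002         # texture period lambda in metres (illustrative)

class GratingSignal:
    """Phase-continuous square-wave rendering of the virtual grating."""
    def __init__(self):
        self.phase = 0.0   # accumulated phase phi
        self.freq = 0.0    # instantaneous frequency x_dot / lambda

    def set_finger_speed(self, x_dot):
        """Called at 60 Hz with the absolute filtered finger speed (m/s)."""
        self.freq = abs(x_dot) / LAMBDA

    def next_block(self, n_samples):
        """Generate the next audio samples; accumulating the phase keeps
        the signal continuous when the frequency changes (phase matching)."""
        steps = 2.0 * np.pi * self.freq / SAMPLE_RATE * np.arange(1, n_samples + 1)
        phases = self.phase + steps
        self.phase = phases[-1] % (2.0 * np.pi)
        return AMPLITUDE * np.sign(np.sin(phases))
\end{verbatim}
In this sketch, \texttt{set\_finger\_speed} would be called from the \qty{60}{\hertz} tracking loop and \texttt{next\_block} from the \qty{48}{\kilo\hertz} audio callback, that is, blocks of 800 samples per velocity update.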
This rendering preserves the sensation of a constant spatial frequency of the virtual texture while the finger moves at various speeds, which is crucial for the perception of roughness \cite{klatzky2003feeling,unger2011roughness}.
The phase matching avoids sudden changes in the actuator movement that would otherwise affect the texture perception in an uncontrolled way (\figref{phase_adjustment}) and, contrary to previous work \cite{asano2015vibrotactile,ujitoko2019modulating}, it enables a free exploration of the texture by the user with no constraints on the finger speed.
A square wave is chosen to get a rendering closer to a real grating texture with the sensation of crossing edges \cite{ujitoko2019modulating}, and because the roughness perception of sine wave textures has been shown not to reproduce the roughness perception of real grating textures \cite{unger2011roughness}.
A square wave also makes it possible to render the low signal frequencies that occur when the finger moves slowly or the texture period is large, as the actuator cannot render a pure sine wave signal below \qty{\approx 20}{\Hz} with sufficient amplitude to be perceived.

The vibrotactile texture is described and rendered in this chapter as a 1D signal by integrating the finger movement relative to the texture along a single direction, but it can easily be extended to a two-dimensional texture by generating a second signal for the orthogonal direction and summing the two signals in the rendering \cite{girard2016haptip}.
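As a purely illustrative sketch, such a two-dimensional rendering could reuse the \texttt{GratingSignal} class from the previous example, with one grating per direction and the two signals summed:
\begin{verbatim}
class Grating2D:
    """Two orthogonal 1D gratings whose signals are summed (illustrative)."""
    def __init__(self):
        self.grating_x = GratingSignal()
        self.grating_y = GratingSignal()

    def set_finger_velocity(self, v_x, v_y):
        self.grating_x.set_finger_speed(v_x)
        self.grating_y.set_finger_speed(v_y)

    def next_block(self, n_samples):
        # Halve the sum to keep the output within the amplitude range.
        return 0.5 * (self.grating_x.next_block(n_samples)
                      + self.grating_y.next_block(n_samples))
\end{verbatim}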
\fig[0.68]{phase_adjustment}{
Change in frequency of a sinusoidal signal with and without phase matching.
}[
Phase matching ensures continuity and avoids glitches in the rendering of the signal.
A sinusoidal signal is shown here for clarity, but a different waveform will give a similar effect.
]
\section{System Latency}
\label{latency}

As shown in \figref{diagram} and described above, the system includes various haptic and visual sensors and rendering devices linked by software processes for image processing, \ThreeD rendering and audio generation.
Because the chosen \AR headset is a standalone device (like most current \AR/\VR headsets) and cannot directly control the sound card and haptic actuator, the image capture, pose estimation and audio signal generation steps are performed on an external computer.
All computation steps run in separate threads to parallelize them and reduce latency, and are synchronized with the headset via a local network and the ZeroMQ library\footnoteurl{https://zeromq.org/}.
This complex assembly inevitably introduces latency, which must be measured.

The rendering system provides a user with two interaction loops between the movements of their hand and the visual (loop 1) and haptic (loop 2) feedback.
Measurements are reported as mean $\pm$ standard deviation, when known.
The end-to-end latency from finger movement to feedback is measured at \qty{36 \pm 4}{\ms} in the haptic loop and \qty{43 \pm 9}{\ms} in the visual loop.
Both result from the latencies in image capture (\qty{16 \pm 1}{\ms}), marker pose estimation (\qty{2 \pm 1}{\ms}) and network communication (\qty{4 \pm 1}{\ms}).
The haptic loop also includes the voice-coil latency of \qty{15}{\ms} (as specified by the manufacturer\footnotemark[1]), whereas the visual loop includes the latencies in \ThreeD rendering (\qty{16 \pm 5}{\ms}, at 60 frames per second) and display (\qty{5}{\ms}).
The total haptic latency is below the \qty{60}{\ms} detection threshold for delays in vibrotactile feedback \cite{okamoto2009detectability}.
The total visual latency can be considered slightly high, yet it is typical for \AR rendering involving vision-based pose estimation \cite{knorlein2009influence}.

The two filters also introduce a constant lag between the finger movement and the estimated position and velocity, measured at \qty{160 \pm 30}{\ms}.
With respect to the real hand position, it causes a distance error in the displayed virtual hand position, and thus a delay in the triggering of the vibrotactile signal.
This error is proportional to the speed of the finger, \eg the distance error is \qty{1.2 \pm .2}{\cm} when the finger moves at \qty{7.5}{\cm\per\second}.