%With a vibrotactile actuator attached to a hand-held device or directly on the finger, it is possible to simulate virtual haptic sensations as vibrations, such as texture, friction or contact vibrations \cite{culbertson2018haptics}.
%
%We describe a system for rendering vibrotactile roughness textures in real time, on any real surface, touched directly with the index fingertip, with no constraints on hand movement and using a simple camera to track the finger pose.
%
%We also describe how to pair this tactile rendering with an immersive \AR or \VR headset visual display to provide a coherent visuo-haptic augmentation of the \RE.
\section{Concept}
\label{principle}
The visuo-haptic texture rendering system is based on:
\begin{enumerate}[label=(\arabic*)]
\item a real-time interaction loop between the finger movements and a coherent visuo-haptic feedback simulating the sensation of a touched texture,
\item a precise alignment of the \VE with its real counterpart, and
\item a modulation of the signal frequency by the estimated finger speed, with phase matching.
\end{enumerate}
\figref{diagram} shows the interaction loop diagram and \eqref{signal} the definition of the vibrotactile signal.
The system consists of three main components: the pose estimation of the tracked real elements, the visual rendering of the \VE, and the vibrotactile signal generation and rendering.
\figwide{diagram}{Diagram of the visuo-haptic texture rendering system.}[
\setstretch{1.2}
Fiducial markers attached to the voice-coil actuator and to the augmented surfaces are captured by a camera.
The positions and rotations (the poses) ${}^c\mathbf{T}_i$, $i=1..n$ of the $n$ defined markers in the camera frame $\poseFrame{c}$ are estimated, then filtered with an adaptive low-pass filter.
These poses are used to move and display the virtual model replicas aligned with the \RE.
A collision detection algorithm detects contact between the virtual hand and the virtual textures.
On contact, the velocity of the finger marker ${}^c\dot{\mathbf{X}}_f$ is estimated using a discrete derivative of the position and adaptive low-pass filtering, then transformed into the texture frame $\poseFrame{t}$.
The vibrotactile signal $r$ is generated by modulating the (scalar) finger velocity ${}^t\hat{\dot{X}}_f$ in the texture direction with the texture period $\lambda$ (\eqref{signal}).
The signal is sampled at \qty{48}{\kilo\hertz} and sent to the voice-coil actuator via an audio amplifier.
All computation steps except signal sampling are performed at \qty{60}{\hertz}, in separate threads to parallelize them.
]
\section{Description of the System Components}
\label{system_components}

\subsection{Pose Estimation}
\label{pose_estimation}
A \qty{2}{\cm} AprilTag fiducial marker \cite{wang2016apriltag} is glued to the top of the actuator (\figref{device}) to track the finger pose with a camera (StreamCam, Logitech), placed above the experimental setup and capturing \qtyproduct{1280 x 720}{px} images at \qty{60}{\hertz} (\figref{apparatus}).
Other markers are placed on the real surfaces to be augmented (\figref{setup}) to estimate the position of the finger relative to these surfaces.
Contrary to similar work, vision-based pose estimation both leaves the hand movements unconstrained and makes it possible to augment any real surface.

A camera external to the \AR headset, combined with a marker-based technique, is employed to provide accurate and robust pose estimation with a constant view of the markers \cite{marchand2016pose}.
We denote by $\pose{c}{T}{i}$, $i=1..n$ the homogeneous transformation matrix that defines the position and rotation of the $i$-th marker out of the $n$ defined markers in the camera frame $\poseFrame{c}$, \eg the finger pose $\pose{c}{T}{f}$ and the augmented surface pose $\pose{c}{T}{s}$ in the camera frame.
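For illustration, this pose estimation step can be sketched in Python as follows; a minimal sketch assuming the \texttt{pupil-apriltags} bindings and OpenCV, with placeholder camera intrinsics standing in for the actual calibration.
\begin{verbatim}
import cv2
from pupil_apriltags import Detector

# Placeholder intrinsics (fx, fy, cx, cy) from a prior calibration.
CAMERA_PARAMS = (900.0, 900.0, 640.0, 360.0)
TAG_SIZE = 0.02  # 2 cm markers, as in the setup

detector = Detector(families="tag36h11")
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

ok, frame = cap.read()
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
# Detect the markers and estimate their 6-DoF poses in the camera frame.
for tag in detector.detect(gray, estimate_tag_pose=True,
                           camera_params=CAMERA_PARAMS,
                           tag_size=TAG_SIZE):
    R, t = tag.pose_R, tag.pose_t  # rotation and translation of cT_i
\end{verbatim}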
To reduce the noise in the pose estimation while maintaining good responsiveness, the 1€ filter \cite{casiez2012filter}, a low-pass filter with an adaptive cut-off frequency specifically designed for human motion tracking, is applied.
The filtered pose is denoted as $\pose{c}{\hat{T}}{i}$.
The optimal filter parameters were determined using the method of \textcite{casiez2012filter}, with a minimum cut-off frequency of \qty{10}{\hertz} and a slope of \num{0.01}.

The linear velocity of the finger marker (the angular velocity is not used), denoted as $\pose{c}{\dot{X}}{f}$, is estimated using the discrete derivative of the position.
It is then filtered with another 1€ filter with the same parameters, and denoted as $\pose{c}{\hat{\dot{X}}}{f}$.
Finally, this filtered finger velocity is transformed into the augmented surface frame $\poseFrame{s}$ to be used in the vibrotactile signal generation, such that $\pose{s}{\hat{\dot{X}}}{f} = \pose{s}{T}{c} \, \pose{c}{\hat{\dot{X}}}{f}$.
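The filtering step can be sketched as follows: a minimal Python transcription of the 1€ filter of \textcite{casiez2012filter}, applied per pose component, using the parameters above; the derivative cut-off of \qty{1}{\hertz} is the default of the original paper, assumed here.
\begin{verbatim}
import math

class OneEuroFilter:
    """Adaptive low-pass filter (Casiez et al., 2012)."""
    def __init__(self, min_cutoff=10.0, beta=0.01, d_cutoff=1.0):
        self.min_cutoff, self.beta, self.d_cutoff = min_cutoff, beta, d_cutoff
        self.x_prev, self.dx_prev, self.t_prev = None, 0.0, None

    @staticmethod
    def alpha(cutoff, dt):
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / dt)

    def __call__(self, x, t):
        if self.x_prev is None:
            self.x_prev, self.t_prev = x, t
            return x
        dt = t - self.t_prev
        # Discrete derivative of the signal, low-pass filtered at d_cutoff.
        dx = (x - self.x_prev) / dt
        a_d = self.alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Adaptive cut-off: smooth at low speed, responsive at high speed.
        a = self.alpha(self.min_cutoff + self.beta * abs(dx_hat), dt)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev, self.t_prev = x_hat, dx_hat, t
        return x_hat
\end{verbatim}
The finger velocity is obtained in the same way: a discrete derivative of the filtered position, passed through a second instance of this filter.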
\subsection{Virtual Environment Alignment}
\label{virtual_real_alignment}

%To be able to compare virtual and augmented realities, we then create a \VE that closely replicates the real one.
Before a user interacts with the system, it is necessary to design a \VE that will be registered with the \RE during the experiment.
Each real element tracked by a marker is modelled virtually, \eg the hand and the augmented surface (\figref{device}).
In addition, the pose and size of the virtual textures are defined on the virtual replicas.
During the experiment, the system uses the marker pose estimates to align the virtual models with their real-world counterparts. %, according to the condition being tested.
This makes it possible to detect whether the finger touches a virtual texture using a collision detection algorithm (Nvidia PhysX), and to display the virtual elements and textures in real time, aligned with the \RE, in the considered \AR or \VR headset.
In our implementation, the \VE is designed with Unity and the Mixed Reality Toolkit (MRTK)\footnoteurl{https://learn.microsoft.com/windows/mixed-reality/mrtk-unity}.
The visual rendering is achieved using the Microsoft HoloLens~2, an \OST-\AR headset with a \qtyproduct{43 x 29}{\degree} \FoV, a \qty{60}{\Hz} refresh rate, and self-localisation capabilities.
A \VST-\AR or a \VR headset could have been used as well.
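As an illustration of this alignment step, the following minimal Python sketch composes the homogeneous transforms involved; the helper names and the world-frame camera pose are hypothetical, the actual implementation relying on Unity and PhysX.
\begin{verbatim}
import numpy as np

def homogeneous(R, t):
    """Build a 4x4 homogeneous transform from rotation and translation."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, np.ravel(t)
    return T

def align_replica(wT_c, cT_i):
    """Pose of a virtual replica in the world frame: wT_i = wT_c . cT_i,
    where wT_c is the (calibrated) camera pose in the headset world frame
    and cT_i the filtered marker pose in the camera frame."""
    return wT_c @ cT_i

def velocity_in_surface_frame(sT_c, cXdot_f):
    """Map the finger velocity into the surface frame: only the rotation
    applies, since a velocity is a free vector."""
    return sT_c[:3, :3] @ cXdot_f
\end{verbatim}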
\subsection{Vibrotactile Signal Generation and Rendering}
\label{texture_generation}
A voice-coil actuator (HapCoil-One, Actronika) is used to display the vibrotactile signal, as it allows the frequency and amplitude of the signal to be controlled independently over time, covers a wide frequency range (\qtyrange{10}{1000}{\Hz}), and outputs the signal accurately with relatively low acceleration distortion\footnote{HapCoil-One specific characteristics are described in its data sheet: \url{https://tactilelabs.com/wp-content/uploads/2023/11/HapCoil_One_datasheet.pdf}}.
The voice-coil actuator is encased in a \ThreeD printed plastic shell and firmly attached to the middle phalanx of the user's index finger with a Velcro strap, to enable the fingertip to directly touch the environment (\figref{device}).
The actuator is driven by a class~D audio amplifier (XY-502 / TPA3116D2, Texas Instruments). %, which has proven to be an effective type of amplifier for driving moving-coil actuators \cite{mcmahan2014dynamic}.
The amplifier is connected to the audio output of a computer that generates the signal using the WASAPI driver in exclusive mode and the NAudio library\footnoteurl{https://github.com/naudio/NAudio}.
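The audio path can be sketched as follows; a minimal Python stand-in for the NAudio/WASAPI setup, assuming the \texttt{sounddevice} library and using a fixed \qty{100}{\Hz} square wave as a placeholder for the velocity-modulated signal defined below.
\begin{verbatim}
import numpy as np
import sounddevice as sd

RATE = 48_000  # audio sampling rate (Hz)
t0 = 0.0

def callback(outdata, frames, time_info, status):
    """Fill each audio buffer with the vibrotactile signal; here a
    fixed 100 Hz square wave as a placeholder."""
    global t0
    t = t0 + np.arange(frames) / RATE
    outdata[:, 0] = 0.5 * np.sign(np.sin(2 * np.pi * 100.0 * t))
    t0 += frames / RATE

with sd.OutputStream(samplerate=RATE, channels=1, callback=callback):
    sd.sleep(1000)  # stream for one second
\end{verbatim}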
The represented haptic texture is a 1D series of parallel virtual grooves and ridges, similar to the real linear grating textures manufactured for psychophysical roughness perception studies \secref[related_work]{roughness}. %\cite{friesen2024perceived,klatzky2003feeling,unger2011roughness}.
It is generated as a square wave audio signal $r$, sampled at \qty{48}{\kilo\hertz}, with a texture period $\lambda$ and an amplitude $A$, similar to \eqref[related_work]{grating_rendering}.
Its frequency is the ratio of the absolute filtered (scalar) finger velocity $\dot{x} = \poseX{s}{|\hat{\dot{X}}|}{f}$ to the texture period $\lambda$ \cite{friesen2024perceived}.
As the finger moves horizontally on the texture, only the $X$ component of the velocity is used.
This velocity modulation strategy is necessary because the finger position is estimated at a far lower rate (\qty{60}{\hertz}) than the audio signal (unlike high-fidelity force-feedback devices \cite{unger2011roughness}).
%As the finger position is estimated at a far lower rate (\qty{60}{\hertz}), the filtered finger (scalar) position ${}^t\hat{X}_f$ in the texture frame $\poseFrame{t}$ cannot be directly used. % to render the signal if the finger moves fast or if the texture period is small.
%
%The best strategy instead is to modulate the frequency of the signal as a ratio of the filtered finger velocity ${}^t\hat{\dot{\mathbf{X}}}_f$ and the texture period $\lambda$ \cite{friesen2024perceived}.
%
When a new finger velocity $\dot{x}\,(t_j)$ is estimated at time $t_j$, the phase $\phi\,(t_j)$ of the signal $r$ must also be adjusted to ensure the continuity of the signal.
In other words, the sampling of the audio signal runs at \qty{48}{\kilo\hertz}, while its frequency and phase are updated at a far lower rate of \qty{60}{\hertz}, whenever a new finger velocity is estimated.
A sample $r(t_j, t_k)$ of the audio signal at sampling time $t_k$, with $t_k \geq t_j$, is thus given by:
\begin{subequations}
\label{eq:signal}
\begin{align}
r(t_j, t_k) & = A\, \text{sgn} \left( \sin \left( 2 \pi \frac{\dot{x}\,(t_j)}{\lambda} t_k + \phi(t_j) \right) \right) & \label{eq:signal_speed} \\
\phi(t_j) & = \phi(t_{j-1}) - 2 \pi \frac{\dot{x}\,(t_j) - \dot{x}\,(t_{j-1})}{\lambda} t_j & \label{eq:signal_phase}
\end{align}
\end{subequations}
This rendering preserves the sensation of a constant spatial frequency of the virtual texture while the finger moves at various speeds, which is crucial for the perception of roughness \cite{klatzky2003feeling,unger2011roughness}.
%
%Note that the finger position and velocity are transformed from the camera frame $\poseFrame{c}$ to the texture frame $\poseFrame{t}$, with the $x$ axis aligned with the texture direction.
%
%However, when a new finger position is estimated at time $t_j$, the phase $\phi(t_j)$ needs to be adjusted as well with the frequency to ensure a continuity in the signal as described in \eqref{signal_phase}.
%
The phase matching avoids sudden changes in the actuator movement that would otherwise affect the texture perception in an uncontrolled way (\figref{phase_adjustment}) and, contrary to previous work \cite{asano2015vibrotactile,ujitoko2019modulating}, it enables a free exploration of the texture by the user with no constraints on the finger speed.
A square wave is chosen to obtain a rendering closer to a real grating texture, with the sensation of crossing edges \cite{ujitoko2019modulating}, and because the roughness perception of sine wave textures has been shown not to reproduce that of real grating textures \cite{unger2011roughness}.
A square wave also makes it possible to render the low signal frequencies that occur when the finger moves slowly or the texture period is large, as the actuator cannot render a pure sine wave below approximately \qty{20}{\Hz} with sufficient amplitude to be perceived.
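This generation step can be sketched as follows; a minimal Python sketch of \eqref{signal}, where the class name, amplitude and texture period are illustrative assumptions, and where sampling would run in an audio callback such as the one sketched above.
\begin{verbatim}
import numpy as np

RATE = 48_000  # audio sampling rate (Hz)

class GratingSynth:
    """Square-wave grating with velocity modulation and phase matching."""
    def __init__(self, wavelength=0.002, amplitude=1.0):
        self.wavelength = wavelength  # texture period lambda (m)
        self.amplitude = amplitude    # signal amplitude A
        self.freq = 0.0               # current frequency x_dot / lambda (Hz)
        self.phase = 0.0              # current phase offset phi (rad)

    def set_velocity(self, x_dot, t_j):
        """On each 60 Hz update, change the frequency and match the phase
        so that the waveform stays continuous at update time t_j."""
        new_freq = abs(x_dot) / self.wavelength
        self.phase += 2.0 * np.pi * (self.freq - new_freq) * t_j
        self.freq = new_freq

    def samples(self, t_start, n):
        """Generate n samples of r starting at time t_start (s)."""
        t_k = t_start + np.arange(n) / RATE
        return self.amplitude * np.sign(
            np.sin(2.0 * np.pi * self.freq * t_k + self.phase))
\end{verbatim}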
The vibrotactile texture is described and rendered in this chapter as a 1D signal, by integrating the finger movement relative to the texture along a single direction, but it is easily extended to a two-dimensional texture by generating a second signal for the orthogonal direction and summing the two signals in the rendering \cite{girard2016haptip}.
\fig[0.68]{phase_adjustment}{
Change in frequency of a sinusoidal signal with and without phase matching.
}[
Phase matching ensures continuity and avoids glitches in the rendering of the signal.
A sinusoidal signal is shown here for clarity, but a different waveform gives a similar effect.
]
\section{System Latency}
\label{latency}

As shown in \figref{diagram} and described above, the system includes various haptic and visual sensors and rendering devices linked by software processes for image processing, \ThreeD rendering and audio generation.
Because the chosen \AR headset is a standalone device (like most current \AR/\VR headsets) and cannot directly control the sound card and haptic actuator, the image capture, pose estimation and audio signal generation steps are performed on an external computer.
All computation steps run in separate threads to parallelize them and reduce latency, and are synchronized with the headset via a local network and the ZeroMQ library\footnoteurl{https://zeromq.org/}.
This complex assembly inevitably introduces latency, which must be measured.
The rendering system provides a user with two interaction loops between the movements of their hand and the visual (loop 1) and haptic (loop 2) feedback.
Measurements are reported as mean $\pm$ standard deviation, when known.
The end-to-end latency from finger movement to feedback is measured at \qty{36 \pm 4}{\ms} in the haptic loop and \qty{43 \pm 9}{\ms} in the visual loop.
Both result from the latency in image capture (\qty{16 \pm 1}{\ms}), marker pose estimation (\qty{2 \pm 1}{\ms}) and network communication (\qty{4 \pm 1}{\ms}).
The haptic loop also includes the voice-coil latency (\qty{15}{\ms}, as specified by the manufacturer\footnotemark[1]), whereas the visual loop includes the latency in \ThreeD rendering (\qty{16 \pm 5}{\ms}, at 60 frames per second) and display (\qty{5}{\ms}).
The total haptic latency is below the \qty{60}{\ms} detection threshold in vibrotactile feedback \cite{okamoto2009detectability}.
The total visual latency can be considered slightly high, yet it is typical for an \AR rendering involving vision-based pose estimation \cite{knorlein2009influence}.
The two filters also introduce a constant lag between the finger movement and the estimated position and velocity, measured at \qty{160 \pm 30}{\ms}.
It causes a distance error between the real and the displayed virtual hand positions, and thus a delay in the triggering of the vibrotactile signal.
This error is proportional to the speed of the finger, \eg a distance error of \qty{12 \pm 2.3}{\mm} when the finger moves at \qty{75}{\mm\per\second} ($\qty{0.16}{\s} \times \qty{75}{\mm\per\second} = \qty{12}{\mm}$).
%and of the vibrotactile signal frequency with respect to the finger speed.%, that is proportional to the speed of the finger.