Update xr-perception chapter

This commit is contained in:
2024-09-27 09:23:52 +02:00
parent c89299649e
commit 8a85b14d3b
7 changed files with 177 additions and 133 deletions


@@ -1,6 +1,6 @@
%With a vibrotactile actuator attached to a hand-held device or directly on the finger, it is possible to simulate virtual haptic sensations as vibrations, such as texture, friction or contact vibrations \cite{culbertson2018haptics}.
%
In this section, we describe a system for rendering vibrotactile roughness texture in real time, on any tangible surface, touched directly with the index fingertip, with no constraints on hand movement and using a simple camera to track the finger pose.
In this section, we describe a system for rendering vibrotactile roughness textures in real time, on any tangible surface, touched directly with the index fingertip, with no constraints on hand movement and using a simple camera to track the finger pose.
%
We also describe how to pair this tactile rendering with an immersive \AR or \VR headset visual display to provide a coherent, multimodal visuo-haptic augmentation of the real environment.
@@ -11,22 +11,22 @@ The visuo-haptic texture rendering system is based on
%
\begin{enumerate*}[label=(\arabic*)]
\item a real-time interaction loop between the finger movements and a coherent visuo-haptic feedback simulating the sensation of a touched texture,
\item a precise alignement of the virtual environment with its real counterpart,
\item and a modulation of the signal frequency by the estimated finger speed with a phase matching.
\item a precise alignment of the virtual environment with its real counterpart, and
\item a modulation of the signal frequency by the estimated finger speed with a phase matching.
\end{enumerate*}
%
\figref{diagram} shows the diagram of the interaction loop and \eqref{signal} the definition of the vibrotactile signal.
\figref{diagram} shows the interaction loop diagram and \eqref{signal} the definition of the vibrotactile signal.
%
The system is composed of three main components: the pose estimation of the tracked real elements, the visual rendering of the virtual environment, and the vibrotactile signal generation and rendering.
The system consists of three main components: the pose estimation of the tracked real elements, the visual rendering of the virtual environment, and the vibrotactile signal generation and rendering.
\figwide[1]{diagram}{Diagram of the visuo-haptic texture rendering system. }[
Fiducial markers attached to the voice-coil actuator and to tangible surfaces to track are captured by a camera.
The positions and rotations (the poses) ${}^c\mathbf{T}_i$, $i=1..n$ of the $n$ defined markers in the camera frame $\mathcal{F}_c$ are estimated, then filtered with an adaptive low-pass filter.
%These poses are transformed to the AR/VR headset frame $\mathcal{F}_h$ and applied to the virtual model replicas to display them superimposed and aligned with the real environment.
%These poses are transformed to the \AR/\VR headset frame $\mathcal{F}_h$ and applied to the virtual model replicas to display them superimposed and aligned with the real environment.
These poses are used to move and display the virtual model replicas aligned with the real environment.
A collision detection algorithm detects a contact of the virtual hand with the virtual textures.
If so, the velocity of the finger marker ${}^c\dot{\mathbf{X}}_f$ is estimated using the discrete derivative of the position and adaptive low-pass filtering, then transformed into the texture frame $\mathcal{F}_t$.
The vibrotactile signal $s_k$ is generated by modulating the finger velocity ${}^t\hat{\dot{X}}_f$ in the texture direction with the texture period $\lambda$ (\eqref{signal}).
The vibrotactile signal $s_k$ is generated by modulating the (scalar) finger velocity ${}^t\hat{\dot{X}}_f$ in the texture direction with the texture period $\lambda$ (\eqref{signal}).
The signal is sampled at 48~kHz and sent to the voice-coil actuator via an audio amplifier.
All computation steps except signal sampling are performed at 60~Hz and in separate threads to parallelize them.
]
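To make the pipeline of the caption above concrete, the following is a minimal sketch of the \qty{60}{\hertz} interaction loop, written as a method of a hypothetical C\# controller class: all class, field and method names are illustrative and are not taken from the actual implementation; the \qty{48}{\kilo\hertz} signal sampling runs in a separate audio thread and only reads the parameters set here.
\begin{verbatim}
// Illustrative 60 Hz interaction loop (hypothetical names); fields such as
// camera, markers, virtualScene and textureSignal are assumed to be set up
// elsewhere in the controller class.
void UpdateLoop()
{
    var image = camera.Capture();                               // grab a frame
    foreach (var marker in markers)
    {
        var rawPose = poseEstimator.Estimate(image, marker.Id); // ^c T_i
        marker.Pose = marker.PoseFilter.Apply(rawPose);         // 1 euro filter
    }
    virtualScene.ApplyPoses(markers);             // align the virtual replicas

    if (virtualScene.FingerTouchesTexture(out var texture))     // collision
    {
        // filtered finger velocity, expressed in the texture frame F_t
        var velocity = velocityEstimator.Update(fingerMarker.Pose.Position);
        float vx = texture.ToTextureFrame(velocity).x;          // texture axis
        textureSignal.SetFingerSpeed(vx, texture.Period);       // f = |vx| / lambda
    }
    else
    {
        textureSignal.Mute();                     // no contact, no vibration
    }
}
\end{verbatim}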
@@ -47,19 +47,21 @@ The system is composed of three main components: the pose estimation of the trac
A fiducial marker (AprilTag) is glued to the top of the actuator (\figref{device}) to track the finger pose with a camera (StreamCam, Logitech), which is placed above the experimental setup and captures \qtyproduct{1280 x 720}{px} images at \qty{60}{\hertz} (\figref{apparatus}).
%
Other markers are placed on the tangible surfaces to augment to estimate the relative position of the finger with respect to the surfaces (\figref{setup}).
Other markers are placed on the tangible surfaces to be augmented (\figref{setup}). % to estimate the relative position of the finger with respect to the surfaces
%
Contrary to similar work, which either constrained the hand to a constant speed to keep the signal frequency constant \cite{asano2015vibrotactile,friesen2024perceived} or used mechanical sensors attached to the hand \cite{friesen2024perceived,strohmeier2017generating}, vision-based tracking both frees the hand movements and allows any tangible surface to be augmented.
%
A camera external to the AR/VR headset with a marker-based technique is employed to provide accurate and robust tracking with a constant view of the markers \cite{marchand2016pose}.
A camera external to the \AR/\VR headset with a marker-based technique is employed to provide accurate and robust tracking with a constant view of the markers \cite{marchand2016pose}.
%
To reduce the noise the pose estimation while maintaining a good responsiveness, the 1€ filter \cite{casiez2012filter} is applied.
We denote ${}^c\mathbf{T}_i$, $i=1..n$ the homogeneous transformation matrix that defines the position and rotation of the $i$-th marker out of the $n$ defined markers in the camera frame $\mathcal{F}_c$, \eg the finger pose ${}^c\mathbf{T}_f$ and the texture pose ${}^c\mathbf{T}_t$.
%
It is a low-pass filter with an adaptive cutoff frequency, specifically designed for tracking human motion.
To reduce the noise in the pose estimation while maintaining good responsiveness, the 1€ filter \cite{casiez2012filter} is applied: a low-pass filter with an adaptive cutoff frequency, specifically designed for human motion tracking.
%
The filtered pose is denoted as ${}^c\hat{\mathbf{T}}_i$.
%
The optimal filter parameters were determined using the method of \textcite{casiez2012filter}, with a minimum cutoff frequency of \qty{10}{\hertz} and a slope of \num{0.01}.
%
The velocity of the marker is estimated using the discrete derivative of the position and an other 1€ filter with the same parameters.
The (linear) velocity of the marker, denoted as ${}^c\dot{\mathbf{X}}_i$, is estimated using the discrete derivative of the position and another 1€ filter with the same parameters.
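For reference, a compact single-channel version of this filter could look like the C\# sketch below (class and member names are ours; the filter is applied independently to each pose component, and the derivative is here estimated from the previous filtered value, a common variant of the published algorithm).
\begin{verbatim}
using System;

// Minimal one-channel 1 euro filter sketch (Casiez et al., 2012).
// Parameters used in this work: minCutoff = 10 Hz, beta (slope) = 0.01.
class OneEuroFilter
{
    readonly float minCutoff, beta, dCutoff;
    float? xHat;        // previous filtered value
    float dxHat;        // previous filtered derivative

    public OneEuroFilter(float minCutoff, float beta, float dCutoff = 1f)
    {
        this.minCutoff = minCutoff; this.beta = beta; this.dCutoff = dCutoff;
    }

    static float Alpha(float cutoff, float dt)
    {
        float tau = 1f / (2f * (float)Math.PI * cutoff);  // filter time constant
        return 1f / (1f + tau / dt);                      // smoothing factor
    }

    public float Filter(float x, float dt)                // dt = 1/60 s here
    {
        if (xHat == null) { xHat = x; return x; }

        // filtered derivative of the signal, used to adapt the cutoff
        float dx = (x - xHat.Value) / dt;
        float aD = Alpha(dCutoff, dt);
        dxHat = aD * dx + (1f - aD) * dxHat;

        // cutoff grows with speed: smooth when slow, responsive when fast
        float cutoff = minCutoff + beta * Math.Abs(dxHat);
        float a = Alpha(cutoff, dt);
        xHat = a * x + (1f - a) * xHat.Value;
        return xHat.Value;
    }
}
\end{verbatim}
For example, filtering the $x$ coordinate of the finger marker at \qty{60}{\hertz} would call \texttt{new OneEuroFilter(10f, 0.01f).Filter(x, 1f/60f)} at each frame.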
To be able to compare virtual and augmented realities, we then create a virtual environment that closely replicates the real one.
%Before a user interacts with the system, it is necessary to design a virtual environment that will be registered with the real environment during the experiment.
@@ -74,11 +76,11 @@ This allows to detect if a finger touches a virtual texture using a collision de
In our implementation, the virtual hand and environment are designed with Unity and the Mixed Reality Toolkit (MRTK).
%
The visual rendering is achieved using the Microsoft HoloLens~2, an OST-AR headset with a \qtyproduct{43 x 29}{\degree} field of view (FoV), a \qty{60}{\Hz} refresh rate, and self-localisation capabilities.
The visual rendering is achieved using the Microsoft HoloLens~2, an \OST-\AR headset with a \qtyproduct{43 x 29}{\degree} \FoV, a \qty{60}{\Hz} refresh rate, and self-localisation capabilities.
%
It was chosen over VST-AR because OST-AR only adds virtual content to the real environment, while VST-AR streams a real-time video capture of the real environment \cite{macedo2023occlusion}.
It was chosen over \VST-\AR because \OST-\AR only adds virtual content to the real environment, while \VST-\AR streams a real-time video capture of the real environment \cite{macedo2023occlusion}.
%
Indeed, one of our objectives (\secref{experiment}) is to directly compare a virtual environment that replicates a real one. %, rather than a video feed that introduces many supplementary visual limitations.
Indeed, one of our objectives (\secref{experiment}) is to directly compare a virtual environment that replicates a real one, rather than a video feed that introduces many supplementary visual limitations \cite{kim2018revisiting,macedo2023occlusion}.
%
To simulate a \VR headset, a cardboard mask (with holes for sensors) is attached to the headset to block the view of the real environment (\figref{headset}).
@@ -89,49 +91,64 @@ A voice-coil actuator (HapCoil-One, Actronika) is used to display the vibrotacti
%
The voice-coil actuator is encased in a 3D printed plastic shell and firmly attached to the middle phalanx of the user's index finger with a Velcro strap, to enable the fingertip to directly touch the environment (\figref{device}).
%
The actuator is driven by a Class D audio amplifier (XY-502 / TPA3116D2, Texas Instrument). %, which has proven to be an effective type of amplifier for driving moving-coil \cite{mcmahan2014dynamic}.
The actuator is driven by a class D audio amplifier (XY-502 / TPA3116D2, Texas Instruments). %, which has proven to be an effective type of amplifier for driving moving-coil \cite{mcmahan2014dynamic}.
%
The amplifier is connected to the audio output of a computer that generates the signal using the WASAPI driver in exclusive mode and the NAudio library.
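As an illustration of this audio path, the sketch below shows how such a \qty{48}{\kilo\hertz} mono output could be opened with NAudio and WASAPI in exclusive mode; the \texttt{TextureSignalProvider} class is a hypothetical signal source (sketched later, after the description of \eqref{signal}), and the latency value and sample format are assumptions rather than the actual settings.
\begin{verbatim}
using System;
using NAudio.CoreAudioApi;
using NAudio.Wave;

// Hypothetical output setup: 48 kHz mono float stream to the amplifier.
var signal = new TextureSignalProvider();             // ISampleProvider (sketch)
using var output = new WasapiOut(AudioClientShareMode.Exclusive, 10); // ~10 ms
output.Init(new SampleToWaveProvider(signal));        // float samples to device
output.Play();
Console.ReadLine();                                   // keep playing until Enter
// If the device rejects 32-bit float in exclusive mode, a 16-bit converter
// (SampleToWaveProvider16) would be needed instead.
\end{verbatim}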
The represented haptic texture is a series of parallel virtual grooves and ridges, similar to real grating textures manufactured for psychophysical roughness perception studies \cite{friesen2024perceived,klatzky2003feeling,unger2011roughness}.
%
It is generated as a square wave audio signal, sampled at \qty{48}{\kilo\hertz}, with a period $\lambda$ (usually in the millimetre range) and an amplitude $A$.
It is generated as a square wave audio signal $s_k$, sampled at \qty{48}{\kilo\hertz}, with a period $\lambda$ and an amplitude $A$.
%
A sample $s_k$ of the audio signal at sampling time $t_k$ is given by:
Its frequency is the ratio of the absolute filtered (scalar) finger velocity $|{}^t\hat{\dot{X}}_f|$, transformed into the texture frame $\mathcal{F}_t$, to the texture period $\lambda$ \cite{friesen2024perceived}.
%
As the finger is moving horizontally on the texture, only the $x$ component of the velocity is used.
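%
For instance (values chosen only for illustration), a finger moving at $|{}^t\hat{\dot{X}}_f| = \qty{75}{\mm\per\second}$ across a texture of period $\lambda = \qty{1.5}{\mm}$ yields a square wave of frequency $75 / 1.5 = \qty{50}{\hertz}$.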
%
%This velocity modulation strategy is necessary as the finger position is estimated at a far lower rate (\qty{60}{\hertz}) than the audio signal.
%
%
%As the finger position is estimated at a far lower rate (\qty{60}{\hertz}), the filtered finger (scalar) position ${}^t\hat{X}_f$ in the texture frame $\mathcal{F}_t$ cannot be directly used. % to render the signal if the finger moves fast or if the texture period is small.
%
%The best strategy instead is to modulate the frequency of the signal as a ratio of the filtered finger velocity ${}^t\hat{\dot{\mathbf{X}}}_f$ and the texture period $\lambda$ \cite{friesen2024perceived}.
%
When a new finger velocity ${}^t\hat{\dot{X}}_{f,j}$ is estimated at time $t_j$, the phase $\phi_j$ of the signal $s$ also needs to be adjusted to ensure the continuity of the signal.
%
In other words, the sampling of the audio signal runs at \qty{48}{\kilo\hertz}, while its frequency and phase are updated at a far lower rate of \qty{60}{\hertz}, when a new finger velocity is estimated.
%
A sample $s_k$ of the audio signal at sampling time $t_k$, with $t_k \geq t_j$, is thus given by:
%
\begin{subequations}
\label{eq:signal}
\begin{align}
s(x_{f,j}, t_k) & = A \text{\,sgn} ( \sin (2 \pi \frac{\dot{x}_{f,j}}{\lambda} t_k + \phi_j) ) & \label{eq:signal_speed} \\
\phi_j & = \phi_{j-1} + 2 \pi \frac{x_{f,j} - x_{f,{j-1}}}{\lambda} t_k & \label{eq:signal_phase}
s_k(x_{f,j}, t_k) & = A\, \text{sgn} ( \sin (2 \pi \frac{|\dot{X}_{f,j}|}{\lambda} t_k + \phi_j) ) & \label{eq:signal_speed} \\
\phi_j & = \phi_{j-1} + 2 \pi \frac{x_{f,j} - x_{f,{j-1}}}{\lambda} t_k & \label{eq:signal_phase}
\end{align}
\end{subequations}
%
This is a common rendering method for vibrotactile textures, with well-defined parameters, that has been employed to modify perceived haptic roughness of a tangible surface \cite{asano2015vibrotactile,konyo2005tactile,ujitoko2019modulating}.
%This is a common rendering method for vibrotactile textures, with well-defined parameters, that has been employed to modify perceived haptic roughness of a tangible surface \cite{asano2015vibrotactile,konyo2005tactile,ujitoko2019modulating}.
%
As the finger position is estimated at a far lower rate (\qty{60}{\hertz}) than the audio signal, the finger position $x_f$ cannot be directly used to render the signal if the finger moves fast or if the texture period is small.
%As the finger position is estimated at a far lower rate (\qty{60}{\hertz}) than the audio signal, the finger position $x_f$ cannot be directly used to render the signal if the finger moves fast or if the texture period is small.
%
The best strategy instead is to modulate the frequency of the signal $s$ as a ratio of the finger velocity $\dot{x}_f$ and the texture period $\lambda$ \cite{friesen2024perceived}.
%The best strategy instead is to modulate the frequency of the signal $s$ as a ratio of the finger velocity $\dot{x}_f$ and the texture period $\lambda$ \cite{friesen2024perceived}.
%
This is important because it preserves the sensation of a constant spatial frequency of the virtual texture while the finger moves at various speeds, which is crucial for the perception of roughness \cite{klatzky2003feeling,unger2011roughness}.
This rendering preserves the sensation of a constant spatial frequency of the virtual texture while the finger moves at various speeds, which is crucial for the perception of roughness \cite{klatzky2003feeling,unger2011roughness}.
%
Note that the finger position and velocity are transformed from the camera frame $\mathcal{F}_c$ to the texture frame $\mathcal{F}_t$, with the $x$ axis aligned with the texture direction.
%Note that the finger position and velocity are transformed from the camera frame $\mathcal{F}_c$ to the texture frame $\mathcal{F}_t$, with the $x$ axis aligned with the texture direction.
%
However, when a new finger position is estimated at time $t_j$, the phase $\phi_j$ needs to be adjusted as well with the frequency to ensure a continuity in the signal as described in \eqref{signal}.
%However, when a new finger position is estimated at time $t_j$, the phase $\phi_j$ needs to be adjusted as well with the frequency to ensure a continuity in the signal as described in \eqref{signal_phase}.
%
This approach avoids sudden changes in the actuator movement thus affecting the texture perception in an uncontrolled way (\figref{phase_adjustment}) and, contrary to previous work \cite{asano2015vibrotactile,friesen2024perceived}, it enables no constraints a free exploration of the texture by the user with no constraints on the finger speed.
The phase matching avoids sudden changes in the actuator movement that would otherwise affect the texture perception in an uncontrolled way (\figref{phase_matching}) and, contrary to previous work \cite{asano2015vibrotactile,friesen2024perceived}, it enables a free exploration of the texture by the user with no constraints on the finger speed.
%
Finally, as \textcite{ujitoko2019modulating}, a square wave is chosen over a sine wave to get a rendering closer to a real grating texture with the sensation of crossing edges, and because the roughness perception of sine wave textures has been shown not to reproduce the roughness perception of real grating textures \cite{unger2011roughness}.
Finally, a square wave is chosen to get a rendering closer to a real grating texture, with the sensation of crossing edges \cite{ujitoko2019modulating}, and because sine wave textures have been shown not to reproduce the roughness perception of real grating textures \cite{unger2011roughness}.
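%
As an illustration of how \eqref{signal} could be realised in the audio callback, the hedged C\# sketch below (class and member names are ours, and the amplitude value is an assumption) generates the square wave as a NAudio \texttt{ISampleProvider}: instead of recomputing the phase offset explicitly, it accumulates the instantaneous phase at every sample, which produces the same frequency modulation while guaranteeing the continuity that the phase matching of \eqref{signal_phase} is meant to ensure.
\begin{verbatim}
using System;
using NAudio.Wave;

// Hypothetical square-wave texture signal as a NAudio ISampleProvider.
class TextureSignalProvider : ISampleProvider
{
    public WaveFormat WaveFormat { get; } =
        WaveFormat.CreateIeeeFloatWaveFormat(48000, 1);   // 48 kHz, mono

    volatile float frequency;      // |vx| / lambda, updated at 60 Hz
    float amplitude = 0.5f;        // A (assumed value)
    double phase;                  // accumulated phase, in radians

    // Called from the 60 Hz loop when a new finger velocity is estimated.
    public void SetFingerSpeed(float speedX, float period)
        => frequency = Math.Abs(speedX) / period;

    public void Mute() => frequency = 0f;  // no contact

    public int Read(float[] buffer, int offset, int count)
    {
        double dt = 1.0 / WaveFormat.SampleRate;
        for (int i = 0; i < count; i++)
        {
            phase += 2.0 * Math.PI * frequency * dt;      // phase accumulation
            buffer[offset + i] = frequency > 0f
                ? amplitude * Math.Sign(Math.Sin(phase))  // square wave: A sgn(sin)
                : 0f;
        }
        return count;
    }
}
\end{verbatim}
%
In this form, a change of frequency at \qty{60}{\hertz} only changes the phase increment from the next audio sample on, so no discontinuity can occur.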
%
%And secondly, to be able to render low frequencies that occurs when the finger moves slowly or the texture period is large, as the actuator cannot render frequencies below \qty{\approx 20}{\Hz} with enough amplitude to be perceived with a pure sine wave signal.
%
The tactile texture is described and rendered in this work as a one-dimensional signal, by integrating the finger movement relative to the texture along a single direction, but it is easily extended to a two-dimensional texture by generating a second signal for the orthogonal direction and summing the two signals in the rendering.
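%
As a sketch of that extension (the notation for the orthogonal component is ours), each direction would drive its own square wave and the rendered sample would be their sum:
\begin{equation*}
s_k^{xy} = A \,\text{sgn} \left( \sin \left( 2 \pi \frac{|\dot{X}_{f,j}|}{\lambda_x} t_k + \phi^x_j \right) \right) + A \,\text{sgn} \left( \sin \left( 2 \pi \frac{|\dot{Y}_{f,j}|}{\lambda_y} t_k + \phi^y_j \right) \right)
\end{equation*}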
\fig[0.7]{phase_matching}{
Change in frequency of a sinusoidal signal with (light green) and without phase matching (in dark green).
Change in frequency of a sinusoidal signal with and without phase matching.
}[
The phase matching ensures a continuity in the signal and avoids glitches in the rendering of the signal.
A sinusoidal signal is show here for clarity, but a different waveform, such as a square wave, will give a similar effect.
Phase matching ensures continuity and avoids glitches in the rendering of the signal.
A sinusoidal signal is shown here for clarity, but a different waveform will give a similar effect.
]
\section{System Latency}
@@ -139,7 +156,7 @@ The tactile texture is described and rendered in this work as a one dimensional
%As shown in \figref{diagram} and described above, the system includes various haptic and visual sensors and rendering devices linked by software processes for image processing, 3D rendering and audio generation.
%
Because the chosen \AR headset is a standalone device (like most current AR/VR headsets) and cannot directly control the sound card and haptic actuator, the image capture, pose estimation and audio signal generation steps are performed on an external computer.
Because the chosen \AR headset is a standalone device (like most current \AR/\VR headsets) and cannot directly control the sound card and haptic actuator, the image capture, pose estimation and audio signal generation steps are performed on an external computer.
%
All computation steps run in separate threads to parallelize them and reduce latency, and are synchronised with the headset via a local network and the ZeroMQ library.
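%
For concreteness, this synchronisation could be organised as in the sketch below, using NetMQ (a C\# ZeroMQ implementation); the port, address, topic name and the text encoding of the poses are assumptions made for the example, as are the \texttt{markerId} and \texttt{position} placeholders, and they do not reflect the actual message format.
\begin{verbatim}
using NetMQ;
using NetMQ.Sockets;

// Tracking / audio PC side: publish the filtered poses at 60 Hz.
var publisher = new PublisherSocket();
publisher.Bind("tcp://*:5556");                        // port chosen for the example
// In the 60 Hz loop, after pose filtering (markerId, position are placeholders):
publisher.SendMoreFrame("pose")
         .SendFrame($"{markerId};{position.x};{position.y};{position.z}");

// Headset (Unity) side: subscribe and apply the poses to the virtual replicas.
var subscriber = new SubscriberSocket();
subscriber.Connect("tcp://192.168.0.10:5556");         // tracking PC address (example)
subscriber.Subscribe("pose");
string topic = subscriber.ReceiveFrameString();        // "pose"
string payload = subscriber.ReceiveFrameString();      // parse, then move the replica
\end{verbatim}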
%
@@ -166,4 +183,3 @@ With respect to the real hand position, it causes a distance error in the displa
This error is proportional to the finger speed, \eg the distance error is \qty{12 +- 2.3}{\mm} when the finger moves at \qty{75}{\mm\per\second}.
%
%and of the vibrotactile signal frequency with respect to the finger speed.%, that is proportional to the speed of the finger.
%