|
|
|
|
@@ -19,11 +19,11 @@ The system consists of three main components: the pose estimation of the tracked
|
|
|
|
|
|
|
|
|
|
\figwide{diagram}{Diagram of the visuo-haptic texture rendering system. }[
|
|
|
|
|
Fiducial markers attached to the voice-coil actuator and to the tracked augmented surfaces are captured by a camera.
|
|
|
|
|
The positions and rotations (the poses) ${}^c\mathbf{T}_i$, $i=1..n$ of the $n$ defined markers in the camera frame $\mathcal{F}_c$ are estimated, then filtered with an adaptive low-pass filter.
|
|
|
|
|
%These poses are transformed to the \AR/\VR headset frame $\mathcal{F}_h$ and applied to the virtual model replicas to display them superimposed and aligned with the \RE.
|
|
|
|
|
The positions and rotations (the poses) ${}^c\mathbf{T}_i$, $i=1..n$ of the $n$ defined markers in the camera frame $\poseFrame{c}$ are estimated, then filtered with an adaptive low-pass filter.
|
|
|
|
|
%These poses are transformed to the \AR/\VR headset frame $\poseFrame{h}$ and applied to the virtual model replicas to display them superimposed and aligned with the \RE.
|
|
|
|
|
These poses are used to move and display the virtual model replicas aligned with the \RE.
|
|
|
|
|
A collision detection algorithm detects a contact of the virtual hand with the virtual textures.
|
|
|
|
|
If so, the velocity of the finger marker ${}^c\dot{\mathbf{X}}_f$ is estimated using discrete derivative of position and adaptive low-pass filtering, then transformed onto the texture frame $\mathcal{F}_t$.
|
|
|
|
|
If so, the velocity of the finger marker ${}^c\dot{\mathbf{X}}_f$ is estimated using the discrete derivative of its position and adaptive low-pass filtering, then transformed into the texture frame $\poseFrame{t}$.
|
|
|
|
|
The vibrotactile signal $s_k$ is generated by modulating the (scalar) finger velocity ${}^t\hat{\dot{X}}_f$ in the texture direction with the texture period $\lambda$ (\eqref{signal}).
|
|
|
|
|
The signal is sampled at 48~kHz and sent to the voice-coil actuator via an audio amplifier.
|
|
|
|
|
All computation steps except signal sampling are performed at 60~Hz and in separate threads to parallelize them.
|
|
|
|
|
@@ -37,14 +37,18 @@ The system consists of three main components: the pose estimation of the tracked
|
|
|
|
|
|
|
|
|
|
A \qty{2}{\cm} AprilTag fiducial marker \cite{wang2016apriltag} is glued to the top of the actuator (\figref{device}) to track the finger pose with a camera (StreamCam, Logitech) placed above the experimental setup and capturing \qtyproduct{1280 x 720}{px} images at \qty{60}{\hertz} (\figref{apparatus}).
|
|
|
|
|
Other markers are placed on the real surfaces to be augmented (\figref{setup}) in order to estimate the position of the finger relative to these surfaces.
|
|
|
|
|
Contrary to similar work, using vision-based tracking allows both to free the hand movements and to augment any real surface.
|
|
|
|
|
A camera external to the \AR headset with a marker-based technique is employed to provide accurate and robust tracking with a constant view of the markers \cite{marchand2016pose}.
|
|
|
|
|
We denote ${}^c\mathbf{T}_i$, $i=1..n$ the homogeneous transformation matrix that defines the position and rotation of the $i$-th marker out of the $n$ defined markers in the camera frame $\mathcal{F}_c$, \eg the finger pose ${}^c\mathbf{T}_f$ and the texture pose ${}^c\mathbf{T}_t$.
|
|
|
|
|
Contrary to similar work, using vision-based pose estimation both frees the hand movements and allows any real surface to be augmented.
|
|
|
|
|
|
|
|
|
|
A camera external to the \AR headset with a marker-based technique is employed to provide accurate and robust pose estimation with a constant view of the markers \cite{marchand2016pose}.
|
|
|
|
|
We denote by $\pose{c}{T}{i}$, $i=1..n$, the homogeneous transformation matrix that defines the position and rotation of the $i$-th marker out of the $n$ defined markers in the camera frame $\poseFrame{c}$, \eg the finger pose $\pose{c}{T}{f}$ and the augmented surface pose $\pose{c}{T}{s}$.
|
|
|
|
|
|
|
|
|
|
To reduce the noise in the pose estimation while maintaining good responsiveness, the 1€ filter \cite{casiez2012filter} is applied: a low-pass filter with an adaptive cut-off frequency, specifically designed for human motion tracking.
|
|
|
|
|
The filtered pose is denoted as ${}^c\hat{\mathbf{T}}_i$.
|
|
|
|
|
The filtered pose is denoted as $\pose{c}{\hat{T}}{i}$.
|
|
|
|
|
The optimal filter parameters were determined using the method of \textcite{casiez2012filter}, with a minimum cut-off frequency of \qty{10}{\hertz} and a slope of \num{0.01}.
|
|
|
|
|
The velocity (without angular velocity) of the marker, denoted as ${}^c\dot{\mathbf{X}}_i$, is estimated using the discrete derivative of the position and another 1€ filter with the same parameters.
|
|
|
|
|
|
|
|
|
|
The velocity (without angular velocity) of the finger marker, denoted as $\pose{c}{\dot{T}}{f}$, is estimated using the discrete derivative of the position.
|
|
|
|
|
It is then filtered with another 1€ filter with the same parameters, and denoted as $\pose{c}{\hat{\dot{T}}}{f}$.
|
|
|
|
|
Finally, this filtered finger velocity is transformed into the augmented surface frame $\poseFrame{s}$ to be used in the vibrotactile signal generation, such that $\pose{s}{\hat{\dot{T}}}{f} = \pose{s}{T}{c} \, \pose{c}{\hat{\dot{T}}}{f}$, where $\pose{s}{T}{c}$ is the inverse of the augmented surface pose $\pose{c}{T}{s}$.
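The filtering and velocity estimation steps above can be sketched as follows. This is a minimal illustration, not the implementation used in the system: the class and function names are hypothetical, the derivative cut-off `d_cutoff` of \qty{1}{\hertz} is an assumed value that the text does not specify, and only the rotation part of the surface pose is applied to the velocity, on the assumption that a linear velocity is a free vector unaffected by translation.

```python
import math


class OneEuroFilter:
    """Scalar 1€ filter: a low-pass filter whose cut-off grows with speed."""

    def __init__(self, rate_hz, min_cutoff=10.0, beta=0.01, d_cutoff=1.0):
        self.rate_hz = rate_hz        # sampling rate of the input (60 Hz here)
        self.min_cutoff = min_cutoff  # minimum cut-off frequency (Hz), from the text
        self.beta = beta              # slope of the cut-off vs. speed, from the text
        self.d_cutoff = d_cutoff      # ASSUMPTION: derivative cut-off, not in the text
        self.x_prev = None            # previous filtered value
        self.dx_prev = 0.0            # previous filtered derivative

    def _alpha(self, cutoff):
        # Smoothing factor of a first-order low-pass filter.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        te = 1.0 / self.rate_hz
        return 1.0 / (1.0 + tau / te)

    def __call__(self, x):
        if self.x_prev is None:       # first sample: nothing to smooth yet
            self.x_prev = x
            return x
        # Derivative w.r.t. the previous filtered value, then smoothed.
        dx = (x - self.x_prev) * self.rate_hz
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Faster motion raises the cut-off, which reduces lag.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat


def finger_velocity_in_surface_frame(p_prev, p_curr, rate_hz, R_cs, filters):
    """Discrete derivative in the camera frame, filtered, then rotated.

    R_cs is the 3x3 rotation part of the surface pose in the camera frame;
    its transpose (its inverse) maps camera-frame vectors to the surface frame.
    """
    v_c = [(c - p) * rate_hz for p, c in zip(p_prev, p_curr)]
    v_c_hat = [f(v) for f, v in zip(filters, v_c)]
    return [sum(R_cs[k][i] * v_c_hat[k] for k in range(3)) for i in range(3)]
```

One filter instance is kept per axis, for example `filters = [OneEuroFilter(60.0) for _ in range(3)]`, so that each velocity component keeps its own state between frames.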
|
|
|
|
|
|
|
|
|
|
\subsection{Virtual Environment Alignment}
|
|
|
|
|
\label{virtual_real_alignment}
|
|
|
|
|
@@ -68,30 +72,30 @@ The voice-coil actuator is encased in a \ThreeD printed plastic shell and firmly
|
|
|
|
|
The actuator is driven by a class D audio amplifier (XY-502 / TPA3116D2, Texas Instruments). %, which has proven to be an effective type of amplifier for driving moving-coil \cite{mcmahan2014dynamic}.
|
|
|
|
|
The amplifier is connected to the audio output of a computer that generates the signal using the WASAPI driver in exclusive mode and the NAudio library\footnoteurl{https://github.com/naudio/NAudio}.
|
|
|
|
|
|
|
|
|
|
The represented haptic texture is a series of parallel virtual grooves and ridges, similar to real grating textures manufactured for psychophysical roughness perception studies \secref[related_work]{roughness}. %\cite{friesen2024perceived,klatzky2003feeling,unger2011roughness}.
|
|
|
|
|
It is generated as a square wave audio signal $s_k$, sampled at \qty{48}{\kilo\hertz}, with a period $\lambda$ and an amplitude $A$, similar to \eqref[related_work]{grating_rendering}.
|
|
|
|
|
Its frequency is a ratio of the absolute finger filtered (scalar) velocity ${}^t\hat{\dot{|X|}}_f$, transformed into the texture frame $\mathcal{F}_t$, and the texture period $\lambda$ \cite{friesen2024perceived}.
|
|
|
|
|
As the finger is moving horizontally on the texture, only the $x$ component of the velocity is used.
|
|
|
|
|
The represented haptic texture is a 1D series of parallel virtual grooves and ridges, similar to the real linear grating textures manufactured for psychophysical roughness perception studies \secref[related_work]{roughness}. %\cite{friesen2024perceived,klatzky2003feeling,unger2011roughness}.
|
|
|
|
|
It is generated as a square wave audio signal $r$, sampled at \qty{48}{\kilo\hertz}, with a texture period $\lambda$ and an amplitude $A$, similar to \eqref[related_work]{grating_rendering}.
|
|
|
|
|
Its frequency is the ratio of the absolute filtered (scalar) finger velocity $x_f = \pose{s}{|\hat{\dot{T}}|}{f}$ to the texture period $\lambda$ \cite{friesen2024perceived}.
|
|
|
|
|
As the finger is moving horizontally on the texture, only the $X$ component of the velocity is used.
|
|
|
|
|
This velocity modulation strategy is necessary because the finger position is estimated at a far lower rate (\qty{60}{\hertz}) than the rate at which the audio signal is sampled (unlike high-fidelity force-feedback devices \cite{unger2011roughness}).
|
|
|
|
|
|
|
|
|
|
%As the finger position is estimated at a far lower rate (\qty{60}{\hertz}), the filtered finger (scalar) position ${}^t\hat{X}_f$ in the texture frame $\mathcal{F}_t$ cannot be directly used. % to render the signal if the finger moves fast or if the texture period is small.
|
|
|
|
|
%As the finger position is estimated at a far lower rate (\qty{60}{\hertz}), the filtered finger (scalar) position ${}^t\hat{X}_f$ in the texture frame $\poseFrame{t}$ cannot be directly used. % to render the signal if the finger moves fast or if the texture period is small.
|
|
|
|
|
%
|
|
|
|
|
%The best strategy instead is to modulate the frequency of the signal as a ratio of the filtered finger velocity ${}^t\hat{\dot{\mathbf{X}}}_f$ and the texture period $\lambda$ \cite{friesen2024perceived}.
|
|
|
|
|
%
|
|
|
|
|
When a new finger velocity ${}^t\hat{\dot{X}}_{f,j}$ is estimated at time $t_j$, the phase $\phi_j$ of the signal $s$ needs also to be adjusted to ensure a continuity in the signal.
|
|
|
|
|
When a new finger velocity $x_f\,(t_j)$ is estimated at time $t_j$, the phase $\phi$ of the signal $r$ must also be adjusted to ensure the continuity of the signal.
|
|
|
|
|
In other words, the audio signal is sampled at \qty{48}{\kilo\hertz}, while its frequency and phase are updated at the far lower rate of \qty{60}{\hertz}, each time a new finger velocity is estimated.
|
|
|
|
|
A sample $s_k$ of the audio signal at sampling time $t_k$, with $t_k \geq t_j$, is thus given by:
|
|
|
|
|
A sample $r(x_f, t_j, t_k)$ of the audio signal at sampling time $t_k$, with $t_k \geq t_j$, is thus given by:
|
|
|
|
|
\begin{subequations}
|
|
|
|
|
\label{eq:signal}
|
|
|
|
|
\begin{align}
|
|
|
|
|
s_k(x_{f,j}, t_k) & = A\, \text{sgn} ( \sin (2 \pi \frac{|\dot{X}_{f,j}|}{\lambda} t_k + \phi_j) ) & \label{eq:signal_speed} \\
|
|
|
|
|
\phi_j & = \phi_{j-1} + 2 \pi \frac{x_{f,j} - x_{f,{j-1}}}{\lambda} t_k & \label{eq:signal_phase}
|
|
|
|
|
r(x_f, t_j, t_k) & = A\, \text{sgn} ( \sin (2 \pi \frac{x_f\,(t_j)}{\lambda} t_k + \phi(t_j) ) ) & \label{eq:signal_speed} \\
|
|
|
|
|
\phi(t_j) & = \phi(t_{j-1}) + 2 \pi \frac{x_f\,(t_j) - x_f\,(t_{j-1})}{\lambda} t_k & \label{eq:signal_phase}
|
|
|
|
|
\end{align}
|
|
|
|
|
\end{subequations}
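The two-rate scheme described above (frequency updated at \qty{60}{\hertz}, samples produced at \qty{48}{\kilo\hertz}) can be illustrated with a phase accumulator. This is a commonly used alternative formulation chosen here for clarity, not necessarily the one implemented by the authors: instead of recomputing a phase offset at each velocity update, the running phase is carried from one block of samples to the next, which keeps the square wave continuous when the frequency changes. The function name `render_block` and the block size are assumptions.

```python
import math

SAMPLE_RATE = 48_000  # audio sampling rate (Hz), from the text
TRACK_RATE = 60       # finger velocity update rate (Hz), from the text


def render_block(speed, period, amplitude, phase,
                 n_samples=SAMPLE_RATE // TRACK_RATE,
                 sample_rate=SAMPLE_RATE):
    """Render one block of a square wave for one velocity estimate.

    speed  : finger velocity along the texture direction (m/s)
    period : spatial period of the texture (m)
    phase  : running phase (rad) carried over from the previous block
    Returns the samples and the phase to pass to the next block.
    """
    freq = abs(speed) / period                 # temporal frequency (Hz)
    step = 2.0 * math.pi * freq / sample_rate  # phase increment per sample
    samples = []
    for _ in range(n_samples):
        # Sign of the sine gives the square wave; a zero sine is mapped to +A.
        samples.append(amplitude if math.sin(phase) >= 0.0 else -amplitude)
        phase = (phase + step) % (2.0 * math.pi)
    return samples, phase
```

For instance, a finger moving at \qty{0.1}{\m\per\s} over a texture period of \qty{2}{\mm} gives a \qty{50}{\hertz} square wave; chaining calls as `block, phase = render_block(speed, period, A, phase)` at each velocity update changes the frequency without introducing a jump in the waveform.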
|
|
|
|
|
|
|
|
|
|
This rendering preserves the sensation of a constant spatial frequency of the virtual texture while the finger moves at various speeds, which is crucial for the perception of roughness \cite{klatzky2003feeling,unger2011roughness}.
|
|
|
|
|
%
|
|
|
|
|
%Note that the finger position and velocity are transformed from the camera frame $\mathcal{F}_c$ to the texture frame $\mathcal{F}_t$, with the $x$ axis aligned with the texture direction.
|
|
|
|
|
%Note that the finger position and velocity are transformed from the camera frame $\poseFrame{c}$ to the texture frame $\poseFrame{t}$, with the $x$ axis aligned with the texture direction.
|
|
|
|
|
%
|
|
|
|
|
%However, when a new finger position is estimated at time $t_j$, the phase $\phi_j$ needs to be adjusted as well with the frequency to ensure a continuity in the signal as described in \eqref{signal_phase}.
|
|
|
|
|
%
|
|
|
|
|
@@ -119,10 +123,10 @@ This complex assembly inevitably introduces latency, which must be measured.
|
|
|
|
|
The rendering system provides a user with two interaction loops between the movements of their hand and the visual (loop 1) and haptic (loop 2) feedback.
|
|
|
|
|
Measurements are reported as mean $\pm$ standard deviation, when the latter is known.
|
|
|
|
|
The end-to-end latency from finger movement to feedback is measured at \qty{36 \pm 4}{\ms} in the haptic loop and \qty{43 \pm 9}{\ms} in the visual loop.
|
|
|
|
|
Both are the result of latency in image capture \qty{16 \pm 1}{\ms}, marker tracking \qty{2 \pm 1}{\ms} and network communication \qty{4 \pm 1}{\ms}.
|
|
|
|
|
Both are the result of latency in image capture \qty{16 \pm 1}{\ms}, marker pose estimation \qty{2 \pm 1}{\ms} and network communication \qty{4 \pm 1}{\ms}.
|
|
|
|
|
The haptic loop also includes the voice-coil latency \qty{15}{\ms} (as specified by the manufacturer\footnotemark[1]), whereas the visual loop includes the latency in \ThreeD rendering \qty{16 \pm 5}{\ms} (60 frames per second) and display \qty{5}{\ms}.
|
|
|
|
|
The total haptic latency is below the \qty{60}{\ms} detection threshold in vibrotactile feedback \cite{okamoto2009detectability}.
|
|
|
|
|
The total visual latency can be considered slightly high, yet it is typical for an \AR rendering involving vision-based tracking \cite{knorlein2009influence}.
|
|
|
|
|
The total visual latency can be considered slightly high, yet it is typical for an \AR rendering involving vision-based pose estimation \cite{knorlein2009influence}.
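As a quick consistency check, the mean component latencies reported above can be summed per loop and compared with the measured end-to-end values. The sketch below assumes that the stages run strictly in sequence and that their mean latencies simply add, which the text does not state explicitly; the variable names are illustrative.

```python
# Mean component latencies in milliseconds, taken from the text.
capture, pose_estimation, network = 16, 2, 4
voice_coil = 15              # manufacturer specification
rendering, display = 16, 5

# ASSUMPTION: stages are sequential, so mean latencies add up.
shared = capture + pose_estimation + network
haptic_sum = shared + voice_coil
visual_sum = shared + rendering + display

# Measured end-to-end values (mean, standard deviation) from the text.
haptic_measured, haptic_sd = 36, 4
visual_measured, visual_sd = 43, 9

print(f"haptic: summed {haptic_sum} ms, measured {haptic_measured} ms")
print(f"visual: summed {visual_sum} ms, measured {visual_measured} ms")
```

Under this assumption, both sums fall within one standard deviation of the corresponding measured end-to-end latency.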
|
|
|
|
|
|
|
|
|
|
The two filters also introduce a constant lag between the finger movement and the estimated position and velocity, measured at \qty{160 \pm 30}{\ms}.
|
|
|
|
|
This lag causes a position error between the displayed virtual hand and the real hand, and thus a delay in the triggering of the vibrotactile signal.
|
|
|
|
|
|