Correct vhar system equations

This commit is contained in:
2024-11-12 18:40:55 +01:00
parent 3328d0a3c1
commit 592c0e90df
10 changed files with 39 additions and 35 deletions

View File

@@ -11,12 +11,12 @@ However, this method has not yet been integrated in an \AR context, where the us
In this chapter, we propose a \textbf{system for rendering visual and haptic virtual textures that augment real surfaces}.
It is implemented with an immersive \OST-\AR headset (Microsoft HoloLens~2) and a wearable vibrotactile (voice-coil) device worn on the outside of the finger (not covering the fingertip, \secref[related_work]{vhar_haptics}).
The visuo-haptic augmentations can be \textbf{viewed from any angle} and \textbf{explored freely with the bare finger}, as if they were real textures.
To ensure both real-time and reliable renderings, the hand and the real surfaces are tracked using a webcam and marker-based tracking.
To ensure both real-time and reliable renderings, the hand and the real surfaces are tracked using a webcam and marker-based pose estimation.
The haptic textures are rendered as a vibrotactile signal representing a patterned grating texture that is synchronized with the finger movement on the augmented surface.
\noindentskip The contributions of this chapter are:
\begin{itemize}
\item The rendering of virtual vibrotactile roughness textures representing a patterned grating texture in real time from free finger movements and using vision-based tracking.
\item The real-time rendering of virtual vibrotactile roughness textures representing a patterned grating texture from free finger movements, using vision-based finger pose estimation.
\item A system providing coherent visuo-haptic texture augmentations of the \RE in a direct touch context using an immersive \AR headset and wearable haptics.
\end{itemize}
@@ -26,7 +26,7 @@ The haptic textures are rendered as a vibrotactile signal representing a pattern
\begin{subfigs}{setup}{Visuo-haptic texture rendering system setup.}[][
\item HapCoil-One voice-coil actuator, with a fiducial marker on top, attached to the middle phalanx of the user's index finger.
\item Our implementation of the system using a Microsoft HoloLens~2, a webcam for tracking the hand and the real surfaces, and an external computer for processing the tracking data and rendering the haptic textures.
\item Our implementation of the system using a Microsoft HoloLens~2, a webcam for estimating the poses of the hand and of the real surfaces, and an external computer for processing the tracking data and rendering the haptic textures.
]
\subfigsheight{60mm}
\subfig{device}

View File

@@ -19,11 +19,11 @@ The system consists of three main components: the pose estimation of the tracked
\figwide{diagram}{Diagram of the visuo-haptic texture rendering system.}[
Fiducial markers attached to the voice-coil actuator and to the augmented surfaces to be tracked are captured by a camera.
The positions and rotations (the poses) ${}^c\mathbf{T}_i$, $i=1..n$ of the $n$ defined markers in the camera frame $\mathcal{F}_c$ are estimated, then filtered with an adaptive low-pass filter.
%These poses are transformed to the \AR/\VR headset frame $\mathcal{F}_h$ and applied to the virtual model replicas to display them superimposed and aligned with the \RE.
The positions and rotations (the poses) $\pose{c}{T}{i}$, $i=1..n$, of the $n$ defined markers in the camera frame $\poseFrame{c}$ are estimated, then filtered with an adaptive low-pass filter.
%These poses are transformed to the \AR/\VR headset frame $\poseFrame{h}$ and applied to the virtual model replicas to display them superimposed and aligned with the \RE.
These poses are used to move and display the virtual model replicas aligned with the \RE.
A collision detection algorithm detects contact between the virtual hand and the virtual textures.
If so, the velocity of the finger marker ${}^c\dot{\mathbf{X}}_f$ is estimated using discrete derivative of position and adaptive low-pass filtering, then transformed onto the texture frame $\mathcal{F}_t$.
If so, the velocity of the finger marker $\pose{c}{\dot{T}}{f}$ is estimated using a discrete derivative of the position and adaptive low-pass filtering, then transformed into the augmented surface frame $\poseFrame{s}$.
The vibrotactile signal $r$ is generated by modulating the (scalar) finger velocity $x_f = \pose{s}{|\hat{\dot{T}}|}{f}$ in the texture direction with the texture period $\lambda$ (\eqref{signal}).
The signal is sampled at 48~kHz and sent to the voice-coil actuator via an audio amplifier.
All computation steps except signal sampling are performed at 60~Hz and in separate threads to parallelize them.
@@ -37,14 +37,18 @@ The system consists of three main components: the pose estimation of the tracked
A \qty{2}{\cm} AprilTag fiducial marker \cite{wang2016apriltag} is glued to the top of the actuator (\figref{device}) to track the finger pose with a camera (StreamCam, Logitech), which is placed above the experimental setup and captures \qtyproduct{1280 x 720}{px} images at \qty{60}{\hertz} (\figref{apparatus}).
Other markers are placed on the real surfaces to be augmented (\figref{setup}), in order to estimate the relative position of the finger with respect to these surfaces.
Contrary to similar work, using vision-based tracking allows both to free the hand movements and to augment any real surface.
A camera external to the \AR headset with a marker-based technique is employed to provide accurate and robust tracking with a constant view of the markers \cite{marchand2016pose}.
We denote ${}^c\mathbf{T}_i$, $i=1..n$ the homogenous transformation matrix that defines the position and rotation of the $i$-th marker out of the $n$ defined markers in the camera frame $\mathcal{F}_c$, \eg the finger pose ${}^c\mathbf{T}_f$ and the texture pose ${}^c\mathbf{T}_t$.
In contrast with similar work, using vision-based pose estimation both leaves the hand movements unconstrained and allows any real surface to be augmented.
A camera external to the \AR headset, combined with a marker-based technique, is employed to provide accurate and robust pose estimation with a constant view of the markers \cite{marchand2016pose}.
We denote by $\pose{c}{T}{i}$, $i=1..n$, the homogeneous transformation matrix that defines the position and rotation (the pose) of the $i$-th of the $n$ defined markers in the camera frame $\poseFrame{c}$, \eg the finger pose $\pose{c}{T}{f}$ and the augmented surface pose $\pose{c}{T}{s}$.
To reduce the noise in the pose estimation while maintaining good responsiveness, the 1€ filter \cite{casiez2012filter} is applied: a low-pass filter with an adaptive cut-off frequency, specifically designed for human motion tracking.
The filtered pose is denoted as ${}^c\hat{\mathbf{T}}_i$.
The filtered pose is denoted as $\pose{c}{\hat{T}}{i}$.
The optimal filter parameters were determined using the method of \textcite{casiez2012filter}, with a minimum cut-off frequency of \qty{10}{\hertz} and a slope of \num{0.01}.
The velocity (without angular velocity) of the marker, denoted as ${}^c\dot{\mathbf{X}}_i$, is estimated using the discrete derivative of the position and another 1€ filter with the same parameters.
The linear velocity of the finger marker (its angular velocity is not used), denoted $\pose{c}{\dot{T}}{f}$, is estimated as the discrete derivative of its position.
It is then filtered with another 1€ filter with the same parameters, and denoted as $\pose{c}{\hat{\dot{T}}}{f}$.
Finally, this filtered finger velocity is transformed into the augmented surface frame $\poseFrame{s}$, where it is used for the vibrotactile signal generation, as $\pose{s}{\hat{\dot{T}}}{f} = (\pose{c}{T}{s})^{-1} \, \pose{c}{\hat{\dot{T}}}{f}$.
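The following is a minimal Python sketch of this filtering and change-of-frame step, for illustration only: the class and function names, the per-axis scalar filtering, the \qty{1}{\hertz} derivative cut-off (the filter's usual default), and the use of only the rotation part of $\pose{c}{T}{s}$ to transform the velocity vector are assumptions of the sketch, not the actual implementation.
\begin{verbatim}
import numpy as np

class OneEuroFilter:
    """Minimal 1-euro filter (Casiez et al., 2012) for one scalar channel."""
    def __init__(self, min_cutoff=10.0, beta=0.01, d_cutoff=1.0):
        self.min_cutoff, self.beta, self.d_cutoff = min_cutoff, beta, d_cutoff
        self.x_hat = None    # last filtered value
        self.dx_hat = 0.0    # last filtered derivative

    @staticmethod
    def _alpha(cutoff, dt):
        # Smoothing factor of a first-order low-pass filter at this cut-off.
        r = 2.0 * np.pi * cutoff * dt
        return r / (r + 1.0)

    def __call__(self, x, dt):
        if self.x_hat is None:
            self.x_hat = x
            return x
        dx = (x - self.x_hat) / dt
        self.dx_hat += self._alpha(self.d_cutoff, dt) * (dx - self.dx_hat)
        cutoff = self.min_cutoff + self.beta * abs(self.dx_hat)  # adaptive cut-off
        self.x_hat += self._alpha(cutoff, dt) * (x - self.x_hat)
        return self.x_hat

pos_filters = [OneEuroFilter() for _ in range(3)]   # one filter per position axis
vel_filters = [OneEuroFilter() for _ in range(3)]   # one filter per velocity axis
prev_pos = None

def finger_velocity_in_surface_frame(p_cf, T_cs, dt=1.0 / 60.0):
    """p_cf: raw finger-marker position in the camera frame (3-vector).
    T_cs: 4x4 pose of the surface marker in the camera frame.
    Returns the filtered finger velocity expressed in the surface frame."""
    global prev_pos
    p_hat = np.array([f(p, dt) for f, p in zip(pos_filters, p_cf)])
    if prev_pos is None:
        prev_pos = p_hat
        return np.zeros(3)
    v_c = (p_hat - prev_pos) / dt                    # discrete derivative
    prev_pos = p_hat
    v_hat = np.array([f(v, dt) for f, v in zip(vel_filters, v_c)])
    R_sc = T_cs[:3, :3].T                            # rotation part of (T_cs)^-1
    return R_sc @ v_hat                              # velocity in the surface frame
\end{verbatim}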
\subsection{Virtual Environment Alignment}
\label{virtual_real_alignment}
@@ -68,30 +72,30 @@ The voice-coil actuator is encased in a \ThreeD printed plastic shell and firmly
The actuator is driven by a class D audio amplifier (XY-502 / TPA3116D2, Texas Instrument). %, which has proven to be an effective type of amplifier for driving moving-coil \cite{mcmahan2014dynamic}.
The amplifier is connected to the audio output of a computer that generates the signal using the WASAPI driver in exclusive mode and the NAudio library\footnoteurl{https://github.com/naudio/NAudio}.
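As an illustration only, an analogous \qty{48}{\kilo\hertz} low-latency output can be opened in Python with the sounddevice library; this is a sketch of a comparable configuration, not the actual NAudio (C\#) code, and the callback is a placeholder filled by the signal generator sketched after \eqref{signal}.
\begin{verbatim}
import numpy as np
import sounddevice as sd   # PortAudio-based output, used here for illustration

FS = 48_000  # audio sampling rate (Hz)

def callback(outdata, frames, time, status):
    # Placeholder: filled by the grating signal generator (see the later sketch).
    outdata[:] = np.zeros((frames, 1), dtype='float32')

# WASAPI exclusive mode, analogous to the configuration described above.
stream = sd.OutputStream(samplerate=FS, channels=1, dtype='float32',
                         callback=callback,
                         extra_settings=sd.WasapiSettings(exclusive=True))
stream.start()
\end{verbatim}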
The represented haptic texture is a series of parallels virtual grooves and ridges, similar to real grating textures manufactured for psychophysical roughness perception studies \secref[related_work]{roughness}. %\cite{friesen2024perceived,klatzky2003feeling,unger2011roughness}.
It is generated as a square wave audio signal $s_k$, sampled at \qty{48}{\kilo\hertz}, with a period $\lambda$ and an amplitude $A$, similar to \eqref[related_work]{grating_rendering}.
Its frequency is a ratio of the absolute finger filtered (scalar) velocity ${}^t\hat{\dot{|X|}}_f$, transformed into the texture frame $\mathcal{F}_t$, and the texture period $\lambda$ \cite{friesen2024perceived}.
As the finger is moving horizontally on the texture, only the $x$ component of the velocity is used.
The represented haptic texture is a 1D series of parallel virtual grooves and ridges, similar to the real linear grating textures manufactured for psychophysical roughness perception studies \secref[related_work]{roughness}. %\cite{friesen2024perceived,klatzky2003feeling,unger2011roughness}.
It is generated as a square wave audio signal $r$, sampled at \qty{48}{\kilo\hertz}, with a texture period $\lambda$ and an amplitude $A$, similar to \eqref[related_work]{grating_rendering}.
Its frequency is the ratio of the absolute filtered (scalar) finger velocity $x_f = \pose{s}{|\hat{\dot{T}}|}{f}$ to the texture period $\lambda$ \cite{friesen2024perceived}.
As the finger moves horizontally across the texture, only the $X$ component of the velocity is used.
This velocity modulation strategy is necessary as the finger position is estimated at a far lower rate (\qty{60}{\hertz}) than the audio signal (unlike high-fidelity force-feedback devices \cite{unger2011roughness}).
%As the finger position is estimated at a far lower rate (\qty{60}{\hertz}), the filtered finger (scalar) position ${}^t\hat{X}_f$ in the texture frame $\mathcal{F}_t$ cannot be directly used. % to render the signal if the finger moves fast or if the texture period is small.
%As the finger position is estimated at a far lower rate (\qty{60}{\hertz}), the filtered finger (scalar) position ${}^t\hat{X}_f$ in the texture frame $\poseFrame{t}$ cannot be directly used. % to render the signal if the finger moves fast or if the texture period is small.
%
%The best strategy instead is to modulate the frequency of the signal as a ratio of the filtered finger velocity ${}^t\hat{\dot{\mathbf{X}}}_f$ and the texture period $\lambda$ \cite{friesen2024perceived}.
%
When a new finger velocity ${}^t\hat{\dot{X}}_{f,j}$ is estimated at time $t_j$, the phase $\phi_j$ of the signal $s$ needs also to be adjusted to ensure a continuity in the signal.
When a new finger velocity $x_f\,(t_j)$ is estimated at time $t_j$, the phase $\phi$ of the signal $r$ must also be adjusted to ensure the continuity of the signal.
In other words, the sampling of the audio signal runs at \qty{48}{\kilo\hertz}, while its frequency and phase are updated at a far lower rate of \qty{60}{\hertz}, whenever a new finger velocity is estimated.
A sample $s_k$ of the audio signal at sampling time $t_k$, with $t_k >= t_j$, is thus given by:
A sample $r(x_f, t_j, t_k)$ of the audio signal at sampling time $t_k$, with $t_k \geq t_j$, is thus given by:
\begin{subequations}
\label{eq:signal}
\begin{align}
s_k(x_{f,j}, t_k) & = A\, \text{sgn} ( \sin (2 \pi \frac{|\dot{X}_{f,j}|}{\lambda} t_k + \phi_j) ) & \label{eq:signal_speed} \\
\phi_j & = \phi_{j-1} + 2 \pi \frac{x_{f,j} - x_{f,{j-1}}}{\lambda} t_k & \label{eq:signal_phase}
r(x_f, t_j, t_k) & = A\, \text{sgn} ( \sin (2 \pi \frac{x_f\,(t_j)}{\lambda} t_k + \phi(t_j) ) ) & \label{eq:signal_speed} \\
\phi(t_j) & = \phi(t_{j-1}) + 2 \pi \frac{x_f\,(t_{j-1}) - x_f\,(t_j)}{\lambda} t_j & \label{eq:signal_phase}
\end{align}
\end{subequations}
This rendering preserves the sensation of a constant spatial frequency of the virtual texture while the finger moves at various speeds, which is crucial for the perception of roughness \cite{klatzky2003feeling,unger2011roughness}.
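The following minimal Python sketch illustrates \eqref{signal}; the amplitude and texture period values and the class name are placeholders, not those of the actual implementation, and the phase update keeps the argument of the sine continuous whenever the frequency changes.
\begin{verbatim}
import numpy as np

FS = 48_000           # audio sampling rate (Hz)
AMP = 0.5             # amplitude A (illustrative)
WAVELENGTH = 0.002    # texture period lambda in metres (illustrative)

class GratingSignal:
    """Square-wave grating signal with velocity-modulated frequency."""
    def __init__(self):
        self.freq = 0.0    # current frequency x_f / lambda
        self.phase = 0.0   # current phase phi
        self.t = 0.0       # signal time (s)

    def set_speed(self, x_f):
        """Called at ~60 Hz with the filtered finger speed (m/s); keeps the
        argument 2*pi*f*t + phi continuous when the frequency changes."""
        new_freq = abs(x_f) / WAVELENGTH
        self.phase += 2.0 * np.pi * (self.freq - new_freq) * self.t
        self.freq = new_freq

    def next_block(self, n):
        """Returns the next n audio samples r(t_k) at FS."""
        t_k = self.t + np.arange(n) / FS
        self.t += n / FS
        return AMP * np.sign(np.sin(2.0 * np.pi * self.freq * t_k + self.phase))
\end{verbatim}
In the audio callback sketched earlier, each block of samples would be produced by \texttt{next\_block(frames)}, while \texttt{set\_speed} would be called from the \qty{60}{\hertz} pose estimation thread with the filtered finger speed.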
%
%Note that the finger position and velocity are transformed from the camera frame $\mathcal{F}_c$ to the texture frame $\mathcal{F}_t$, with the $x$ axis aligned with the texture direction.
%Note that the finger position and velocity are transformed from the camera frame $\poseFrame{c}$ to the texture frame $\poseFrame{t}$, with the $x$ axis aligned with the texture direction.
%
%However, when a new finger position is estimated at time $t_j$, the phase $\phi_j$ needs to be adjusted as well with the frequency to ensure a continuity in the signal as described in \eqref{signal_phase}.
%
@@ -119,10 +123,10 @@ This complex assembly inevitably introduces latency, which must be measured.
The rendering system provides the user with two interaction loops between the movements of their hand and the visual (loop 1) and haptic (loop 2) feedback.
Measurements are reported as mean $\pm$ standard deviation when known.
The end-to-end latency from finger movement to feedback is measured at \qty{36 \pm 4}{\ms} in the haptic loop and \qty{43 \pm 9}{\ms} in the visual loop.
Both are the result of latency in image capture \qty{16 \pm 1}{\ms}, markers tracking \qty{2 \pm 1}{\ms} and network communication \qty{4 \pm 1}{\ms}.
Both result from latencies in image capture \qty{16 \pm 1}{\ms}, marker pose estimation \qty{2 \pm 1}{\ms}, and network communication \qty{4 \pm 1}{\ms}.
The haptic loop also includes the voice-coil latency \qty{15}{\ms} (as specified by the manufacturer\footnotemark[1]), whereas the visual loop includes the latency in \ThreeD rendering \qty{16 \pm 5}{\ms} (60 frames per second) and display \qty{5}{\ms}.
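Assuming these component latencies simply add up, the budgets are consistent with the end-to-end measurements: $16 + 2 + 4 + 15 = 37$~ms for the haptic loop, within the measured \qty{36 \pm 4}{\ms}, and $16 + 2 + 4 + 16 + 5 = 43$~ms for the visual loop.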
The total haptic latency is below the \qty{60}{\ms} detection threshold in vibrotactile feedback \cite{okamoto2009detectability}.
The total visual latency can be considered slightly high, yet it is typical for an \AR rendering involving vision-based tracking \cite{knorlein2009influence}.
The total visual latency can be considered slightly high, yet it is typical for an \AR rendering involving vision-based pose estimation \cite{knorlein2009influence}.
The two filters also introduce a constant lag between the finger movement and the estimated position and velocity, measured at \qty{160 \pm 30}{\ms}.
This lag causes a position error of the displayed virtual hand with respect to the real hand position, and thus a delay in the triggering of the vibrotactile signal.

View File

@@ -5,7 +5,7 @@
In this chapter, we designed and implemented a system for rendering virtual visuo-haptic textures that augment a real surface.
Directly touched with the fingertip, the perceived roughness of the surface can be increased using a wearable vibrotactile voice-coil device mounted on the middle phalanx of the finger.
We adapted the 1D sinusoidal grating rendering method, common in the literature but not yet integrated in a direct touch context, for use with vision-based tracking of the finger and paired it with an immersive \AR headset.
We adapted the 1D sinusoidal grating rendering method, common in the literature but not yet integrated in a direct touch context, for use with vision-based pose estimation of the finger and paired it with an immersive \AR headset.
Our wearable visuo-haptic augmentation system enables any real surface to be augmented with a minimal setup.
It also allows free exploration of the textures, as if they were real (\secref[related_work]{ar_presence}), by letting the user view them from different viewpoints and touch them with the bare finger, without constraints on hand movements.