Time Analysis of Pulse-Based Face Anti-Spoofing in...
Transcript of Time Analysis of Pulse-Based Face Anti-Spoofing in...
Time Analysis of Pulse-based Face Anti-Spoofing in Visible and NIR
Javier Hernandez-Ortega, Julian Fierrez, Aythami Morales, Pedro Tome
Biometrics and Data Pattern Analytics - BiDA Lab
Universidad Autonoma de Madrid, Madrid, Spain
[email protected], [email protected], [email protected], [email protected]
Abstract
In this paper we study Presentation Attack Detection
(PAD) in face recognition systems against realistic arti-
facts such as 3D masks or good quality of photo attacks.
In recent works, pulse detection based on remote photo-
plethysmography (rPPG) has shown to be a effective coun-
termeasure in concrete setups, but still there is a need for
a deeper understanding of when and how this kind of PAD
works in various practical conditions. Related works ana-
lyze full video sequences (usually over 60 seconds) to dis-
tinguish between attacks and legitimate accesses. However,
existing approaches may not be as effective as it has been
claimed in the literature in time variable scenarios. In this
paper we evaluate the performance of an existent state-of-
the-art PAD scheme based on rPPG when analyzing short-
time video sequences extracted from a longer video.
Results are reported using the 3D Mask Attack Database
(3DMAD), and a self-collected dataset called Heart Rate
Database (HR), including different video durations, spec-
trum bands, resolutions and frame rates.
Several conclusions can be drawn from this work: a)
PAD performance based on rPPG varies significantly with
the length of the analyzed video, b) rPPG information ex-
tracted from short-time sequences (over 5 seconds) can be
discriminant enough for performing the PAD task, c) in gen-
eral, videos using the NIR band perform better than those
using the RGB band, and d) the temporal resolution is more
valuable for rPPG signal extraction than the spatial resolu-
tion.
1. Introduction
Face recognition is one of the most extended biometric
traits together with fingerprint and iris. Biometric systems
based on human faces have interesting properties that dif-
ferentiate them from those based on fingerprint and iris, e.g.
the possibility to acquire the facial information at a distance,
and its non-intrusiveness [28]. At present, face is one of the
biometric traits with the highest economical and social im-
pact, being included in high impact products like Face ID
technology from Apple1. Due to this high level of deploy-
ment, attacks against face recognition systems have become
a real threat. Regarding that, it is worth noting that the fac-
tors that make face an interesting trait for person recogni-
tion, i.e. images can be taken at distance in a non-intrusive
way, also make it specially vulnerable to attackers who may
easily get and use facial biometric information in an illicit
manner.
In presentation attacks, an assailant presents to the sen-
sor an artifact for trying to impersonate a genuine user [15].
There are different spoofs and artifacts that a face recogni-
tion system may confront, and the same anti-spoofing tech-
nique may not be useful against all of them, each situation
normally needing a specific countermeasure. Techniques
for countermeasuring those attacks are also known as Pre-
sentation Attack Detection (PAD) methods [19].
Additionally to the ease of getting information of the
real users, face recognition systems are known to respond
weakly to presentation attacks for a long time [15, 17], and
are easily spoofed, for example using one of these three cat-
egories of attacks [11]: i) using a photograph of the user to
impersonate [24]; ii) using a video of the user to imperson-
ate (aka replay attack) [7]; and iii) building and using a 3D
model of the enrolled persons face, for example an hyperre-
alistic mask [9].
There are a high number of PAD measures in the liter-
ature for trying to deal with those three (and others) types
of presentation attacks [19, 11, 23]. For example, texture-
based techniques perform the analysis of the facial texture
to discover unnatural characteristics that may be related to
presentation attacks [7, 12, 1]. This type of approaches may
be useful to detect photo-attacks, video-attacks, and also
low quality mask-attacks. The major drawback of texture-
based presentation attack detection is that high resolution
input images are required in order to extract fine details
from the faces. These countermeasures will not work prop-
erly with bad illumination conditions that make the captured
images to have bad quality in general.
1https://support.apple.com/en-my/HT208108
1657
Another type of PAD techniques use the depth informa-
tion for detecting the spoofs. In photo and replay attacks the
artifacts are 2D surfaces that can be detected using depth
information. This depth information can be captured by
specialized sensors, like Microsoft Kinect2 or Intel Real-
Sense3. Recent works have shown that it is even possible
to extract depth information from a single RGB image [16].
Nevertheless, this type of countermeasures becomes ineffi-
cient when dealing with 3D presentation attack artifacts like
realistic masks.
Specifically when dealing with 3D mask attacks, in
which the attacker manages somehow to build a highly rea-
listic 3D reconstruction of a genuine face and presents it to
the sensor-camera [9, 13], it becomes difficult to find effec-
tive countermeasures due to the high realism of the spoofs.
As has been said before, the use of depth information be-
comes inefficient as the artifacts present the same volume
of a real face, and the texture information is also useless
when dealing with hyper-realistic masks that imitate the real
texture of the human skin. Social media and video stream-
ing web sites (e.g. Facebook and YouTube) contain a huge
amount of facial recordings of people, making easy to ac-
cess to the information required to manufacture 3D masks
or another face spoofs. Additionally, over the Internet there
exist a variety of online services (e.g. “ThatsMyFace”4) for
ordering a highly realistic 3D mask at an affordable cost
(200-300 USD).
Due to the easiness to perform this type of sophisticated
attack, efficient countermeasures against 3D mask attacks
are nowadays highly relevant. In this scenario, detecting
pulse from face videos using remote photoplethysmography
(rPPG) has proved to be an effective countermeasure against
3D mask attacks [18]. Even though rPPG-based PAD is
quite promising, all the current approaches still have their
limitations. They usually consist in taking a complete video
sequence for extracting its rPPG signal. In the databases
employed in published studies, the length of the video se-
quences uses to be long enough to give the rPPG system
sufficient data for making a robust estimation without ana-
lyzing the time interval size. Nevertheless, not in all situa-
tions will be possible to have a long video sequence (e.g. 1
minute) for making the analysis. In order to perform a low
latency/short-time study of the rPPG signal, where the se-
lected length of the videos is as short as possible, it becomes
necessary to study the performance when varying the video
length.
On the other hand, even when working with favorable
conditions: long enough videos, good illumination, high re-
solution, perfect face detection and tracking, etc, the pulse
2https://developer.microsoft.com/en-us/windows/kinect3https://www.intel.com/content/www/us/en/architecture-and-
technology/realsense-overview.html4http://thatsmyface.com/
detection algorithms must deal with variable scenarios, e.g.
an attacker that puts on a mask in the middle of the video,
in which case existing approaches may not be able to give
a consistent estimation of pulse and/or presentation attack
probability. In such cases, a short-time approach is more
adequate. Additional problems arise when these algorithms
are applied to real scenarios where they may also have to
deal with other factors like bad illumination conditions or
failures in the face detection module, making their perfor-
mance to drop significantly. With a short-time analysis of
the rPPG signal, the frames without a properly detected
face could be discarded without affecting the global perfor-
mance.
In this paper we: i) present a new dataset of photo at-
tacks with long video sequences, in visible (RGB) and Near
InfraRed (NIR); ii) study the performance of video-based
pulse detection depending on the length of the video se-
quences, both in an existing benchmark (3DMAD) and our
new dataset; and iii) simulate and test pulse-based PAD in a
scenario in which the attacking conditions vary over time.
The rest of this paper is organized as follows: Section 2
introduces Remote Photoplethysmography and summarizes
works that use it for pulse detection and/or face PAD. Sec-
tion 3 describes the proposed system. Section 4 describes
the employed databases and the experimental protocol. Sec-
tion 5 shows the results obtained. Finally, concluding re-
marks are drawn in Section 6.
2. Remote Photoplethysmography
Plethysmography refers to techniques for measuring the
changes in the volume of blood through human vessels.
This information can be used to estimate parameters such as
heart rate, arterial pressure, blood glucose level, or oxygen
saturation levels. The variant called Photoplethysmography
(PPG) includes low-cost and noninvasive techniques asso-
ciated with imagery and the optical properties of the human
body [2]. For example, human heart rate can be measured
detecting the periodic changes between oxygenated and de-
oxygenated blood through the veins.
Recently, related studies have proven that it is possible
to measure the changes in the amount of oxygenated blood
through facial video sequences [22]. These techniques are
called Remote Photoplethysmography (rPPG) and their op-
erating principle consists in looking for slight changes in
the skin color at video recordings using signal processing
methods. When applying this technique to a 3D mask at-
tack or a photo print attack, the estimated pulse signal is
highly different from a genuine pulse signal [18].
Table 1 summarizes related works in rPPG, from where
we can see that most research in this area use self-collected
datasets not publicly available. One of the few public
datasets available for 3D mask PAD is 3DMAD, which con-
tains RGB videos of genuine users and of 3D mask attacks.
658
Method Type of Images Database used Video Length Parameter Estimated Performance
Garbey et al. 2007 [14] Thermal self-collected 120 secs Heart Rate Accuracy = 99%
Poh et al. 2011 [22] RGB self-collected 60 secs. Heart Rate RMSE = 5.63%
Tasli et al. 2014 [26] RGB self-collected 90 secs. Heart Rate MAE = 4.2%
Chen et al. 2014 [6] Hyperspectral self-collected 30-60 secs Stress Level Qualitative
McDuff et al. 2014 [20] Multiband (RGBCO) self-collected 120 secs Heart Rate Correlation = 1.0
Chen et al. 2016 [5] RGB + NIR self-collected 90 secs Heart Rate RMSE = 1.65%
Li et al. 2016 [18] RGB 3DMAD and self-collected 10 secs Face PAD EER = 4.71%
Present Work RGB & NIR 3DMAD and self-collected 10 & 60 secs Face PAD EER = 25% & 0%
Table 1: Related works that use different types of images to implement rPPG for pulse extraction and related tasks like face
Presentation Attack Detection or stress detection. For our system, performances obtained with RGB and NIR videos are
shown separately.
We decided to use 3DMAD to compare our results with
[18]. We also decided to acquire a supplementary dataset
to have larger RGB videos (1 minute) compared to the ones
from 3DMAD (only 10 seconds), what will help us with
our target of performing a time analysis of the PAD per-
formance. Our dataset also contains information from the
NIR spectrum band in order to allow us to compare perfor-
mances between both bands.
3. Proposed System
In this section we describe the general scheme of the pro-
posed framework. The purpose of the system is the follow-
ing: given a facial video sequence of a person, our system
decides if the video comes from a real face or if, on the
contrary, it is really a presentation attack.
As shown in Fig. 1, our system consists in three main
stages. The first stage processes a video, extracting tempo-
ral windows and measuring raw rPPG signals. The second
stage, extracts discriminant pulse-related features from the
rPPG signals. The third and last stage performs the match-
ing task comparing models of presentation attacks and real
faces with the extracted features. Our approach is largely
based in [18]. We chose it as reference because its excellent
performance for PAD shown in 3DMAD. There are slight
differences between the system from [18] and ours, and
between the process followed with the 3DMAD database
and with the HR database (our own self-collected database).
The differences are described below.
3.1. rPPG Signal Generation
The input of this stage is a temporal window extracted
from the original video. The window length T is config-
urable in order to perform a time dependent analysis. Each
video can be processed as a whole or extracting smaller
video sequences. The rPPG signal extraction stage is di-
vided into three steps, namely: face detection, ROI selec-
tion and tracking, and rPPG signal generation.
3.1.1 Face detection
For each RGB windowed video (or NIR if available) we per-
form face detection at the first frame using the Matlab im-
plementation of the Viola-Jones algorithm [29]. This algo-
rithm is known to perform reasonably well and in real time
when dealing with frontal faces, as in our case. There are
three possible outputs from the face detector: 1) One face is
detected and the detector returns a bounding box that con-
tains the face location. 2) Multiple faces are detected and
the bounding boxes of all them are returned. Sometimes,
there are false positive face detections and bounding boxes
without a real face inside them are returned. In these cases,
we decided to keep only the larger bounding box because in
the majority of cases it contains the real face. 3) No face is
detected so there is no output. This case is called Failed To
Acquire (FTA). The FTA rate of this specific work is 0% for
both databases as they contain good quality faces.
3.1.2 Face region selection and tracking
After the recognition stage, we selected a facial Region of
Interest. An example of this region (nose and cheeks) can
be seen in Fig. 1. Our ROI is different to [18], since they se-
lected a bigger ROI that includes also the mouth and chin.
We decided to use a smaller region because it is less af-
fected by objects like hats, glasses, beards or mustaches.
The next step consists of detecting corners inside that re-
gion for tracking them over time using the Kanade–Lucas–
Tomasi algorithm [27], also implemented in Matlab.
3.1.3 rPPG signal extraction
For the ROI of each frame from the video segment, we cal-
culate its raw rPPG value as the average intensity of the pix-
els inside the region. This sum is made separately for the
three available channels of the RGB images (Red, Green
and Blue) giving us three different rPPG values for each
RGB video segment (one value for NIR recordings). The
rPPG signal generation is executed at every frame of the
659
Video Sequence
Face
DetectionROI Selection
and Tracking
rPPG Signal
Extraction
rPPG
Postprocessing
Window Size
Selection
rPPG SIGNAL GENERATION
Liveness
ScoreDecision
Real Face
Presentation
Attack
Frequency
AnalysisSVM
Face/Mask
Templates
FEATURE EXTRACTION MATCHER
PRESENTATION ATTACK DETECTION SYSTEM
[Pr,Rr,Pg,Rg,Pb,Rb]
f
T
t
t
PSD
Figure 1: Architecture of the proposed pulse-based face presentation attack detection. Given a facial video, the face is
detected and rPPG related features are extracted from the ROI in order to obtain an individual score of each video segment.
Then, the video segment is classified as an attack or a legitimate access based on its score considering a database of real faces
and mask attacks.
video segment, being its final output a temporal signal with
the raw rPPG values for the windowed recording.
3.2. Feature Extraction
3.2.1 rPPG signal postprocessing
The raw rPPG signal contains not only the light variations
due to the human pulse but also the mean environmental
illumination, changes in that environmental levels and noise
from other sources. Due to all those factors, a postproces-
sing stage is necessary.
The postprocessing method consists of three filters:
• Detrending filter [25]: this temporal filter is em-
ployed for reducing the stationary part of the rPPG
signal, i.e. eliminating the contribution from environ-
mental light and reducing the slow changes in the rPPG
level that are not part of the expected pulse signal.
• Moving-average filter: this filter is designed to elim-
inate the random noise on the rPPG signal. That noise
may be caused by imperfections on the sensor and in-
accuracies in the capturing process. This filter consists
in a moving average of the rPPG values (size 3).
• Band-pass filter: a regular human heart rate uses to
be between 40-240 beats per minute (bpm), which cor-
responds to signals with frequencies between 0.6 and
4 Hz approximately. All the rPPG frequency compo-
nents outside that range are unlikely to correspond to
the real pulse signal so they are discarded.
3.2.2 Frequency Domain Analysis
The input to this stage is a clean rPPG signal that should be
a robust estimation of the changes in skin tone due to the
blood level evolution. At this point, if we want to extract
the corresponding heart rate value for a video sequence, the
highest frequency peak inside the normal frequency range
should be selected. In our case, we did not want to extract
the exact value of the heart rate since our task is to distin-
guish between real faces and mask attacks. With that target
in mind, we have selected discriminant features from the
final rPPG signal’s spectrum.
The features we have decided to use are also the ones
from [18]. We transformed the signal from the spatial do-
main to the frequency domain using FFT and estimated its
power spectral density (PSD) distribution. An example of a
PSD can be seen in the feature extraction stage in Fig. 1.
From the PSD we selected the maximum power response
as the first feature P , and the ratio of P to the total power
in the 0.6 - 4 Hz frequency range as a second feature R.
This pair of features [P,R] have been extracted for the three
color channels in the case of the RGB videos, and for the
unique channel in the NIR case, resulting in feature vectors
of size 6 and 2, respectively.
3.3. Match Score Generation
The last block of the presentation attack detection sys-
tem is the comparator. Like in [18] we use Support Vec-
tor Machines (SVM) as our classifier. A Support Vector
Machine is a supervised algorithm based in a representa-
tion of the examples as points in a multidimensional space.
The training process consists in choosing a hyperplane that
maximizes the distance from it to the nearest data point of
each class. With this type of classifiers, it is interesting to
have the input data represented in a high-dimensional space
of uncorrelated features, what gives more freedom to find a
hyperplane with a minimum distance large enough to obtain
high classification rates. We have used the 6-dimensional
660
(or 2-dimensional for NIR videos) feature vectors for train-
ing and testing the SVMs.
4. Databases and Experimental Protocol
4.1. Databases
In our experiments we have used two different databases:
the first dataset, the 3D Mask Attack Database (3DMAD)
has been used in related works of 3D mask PAD. We de-
cided to use 3DMAD to enable direct comparison with
related studies. The second dataset is a self-collected
database, called from now on HR (Heart Rate database).
The purpose of this dataset is to complement the results ob-
tained with 3DMAD with images of higher resolution, extra
spectrum bands and longer duration.
• The 3D Mask Attack Database (3DMAD) [8] is a
dataset collected and distributed by the Idiap Research
Institute. It contains frontal-view controlled record-
ings of 17 different users acquired using Microsoft
Kinect. For each user there are 3 different sessions,
with 5 videos for session. Two of the sessions include
legitimate user videos with a time delay of 2 weeks
between recordings, while the remaining session con-
sists of a 3D mask attack using the mask of each cor-
responding user.
The masks used for the attacks were obtained from
ThatsMyFace.com5, an online service that allows their
clients to create wearable hard-plastic realistic masks
with holes at the eyes and nostrils. To produce each
mask only 3 images of each user are needed (1 frontal
and 2 lateral). The pictures that were employed to cre-
ate each mask presented in the database are also in-
cluded in the 3DMAD corpus.
The duration of each recording is 10 seconds, cap-
tured at 30 frames per second, resulting in 300 frames
per video. The Kinect sensor makes possible to cap-
ture RGB and Depth images at the same time, so the
database contains both information for each recorded
frame. It also contains annotated eye positions for each
frame of the RGB videos.
Summarizing, 3DMAD is formed by 17 users × 3 ses-
sions/user × 5 videos/session × 300 frames/video =
76,500 RGB and 76,500 Depth frames, all of them
with a resolution of 640×480 pixels. One-third of
these frames (25,500) correspond to mask attacks and
two-thirds (51,000) to legitimate access attempts.
In our experiments we have only used the RGB infor-
mation of the database in order to compare our results
5http://www.thatsmyface.com/
Figure 2: HR Database. The figure shows some samples
from the HR database. The dataset contains information
from both RGB (left) and NIR sensors (right). The sam-
ples are splited into legitimate users (up) and photo attacks
(down).
with the ones in [18]. The Depth frames are not em-
ployed in this study, but they could be processed to
perform more robust face detection and tracking.
• The Heart Rate Database (HR) is a dataset self-
collected by our research group. It has not been re-
leased yet as we are now enlarging it for subsequent
research in this area. We will make it public at a
later stage when we finish the collection effort. It
is a database collected for complementing existing
databases like 3DMAD.
3DMAD is a large dataset but it presents some limita-
tions for performing a study like the one presented in
the present paper: i) it only has one type of face spoof-
ing artifacts; ii) the length of the videos is short, only
10 seconds; iii) it does not contain information of extra
spectrum bands, only depth information that is useless
for rPPG and that has been recorded using only one
type of sensor. To overcome these limitations, we cap-
tured HR.
For the preliminary experiments repeated in the
present paper, we use frontal-view controlled record-
ings of 10 different users. We have captured facial
RGB videos with a reflex digital camera NIKON
D5200 with 1920×1080 resolution. The database also
contains facial NIR videos, captured simultaneously to
the RGB video using a NIR camera with 1032×770
resolution. See Fig. 2 for an example of the database
images.
661
The RGB and NIR videos are synchronized in order
to develop experiments that can take advantages of the
multiband information. In this work we compare both
types of videos.
The videos have a duration of 60 seconds, being cap-
tured at 25 frames per second for the RGB camera, and
30 frames per second for the NIR camera, resulting in
1,800 and 1,500 frames per video, respectively.
HR also contains face presentation attacks, in this case
not using 3D realistic faces, but using HQ color print-
ings of the attacked faces (see Fig. 2). This way are
able to measure the performance of face PAD based on
pulse detection with other type of easy-to-create spoof-
ing artifacts different to 3D masks. For each user we
have recorded two sessions: one legitimate access and
one photo print face attack.
Summarizing, HR contains recordings of: 10 users × 2
sessions/user × 1 videos/session × 60 seconds/video =
30,000 RGB and 36,000 NIR frames, from which one-
half (15,000 and 18,000 frames) correspond to attacks
and the other half to legitimate access attempts.
4.2. Experimental Protocol
4.2.1 3DMAD database
From the rPPG signal we extracted the six-dimensional fea-
ture vector (Pr, Rr, Pg, Rg, Pb, Rb) for each one of the fol-
lowing temporal window sizes: from 1 to 10 seconds (1 sec-
ond steps). For the Support Vector Machines we used linear
kernels with fixed cost parameter C = 1000 similarly to
[18].
The whole dataset is divided into legitimate samples as
the positive class and attack samples as the negative class.
For training and testing the classifier, we use a Leave-One-
Out Cross-Validation (LOOCV) protocol: for each one of
the 17 users in the 3DMAD database, we use all his fea-
ture vectors for testing against a SVM model that has been
trained with all the samples from the remaining 16 users.
The metric used to report results is the Equal Error Rate
(EER in %). EER refers to the value where the Impostor At-
tack Presentation Match Rate (IAPMR, percentage of pre-
sentation attacks classified as real) and the False Non-Match
Rate (FNMR, percentage of real faces classified as fake)
are equal6. For each window size, the EER has been cal-
culated independently for the 17 subjects (each one of the
LOOCV iterations). The 17 individual results are then av-
eraged to produce a single performance (mean and standard
deviation).
6As error measures we have mentioned IAPMR and FNMR as defined
and discussed by Galbally et al. [12]. Modifying the Decision Threshold
on the right of Fig. 1 until those error rates are equal we obtain the Pre-
sentation Attack Equal Error Rate, PAEER, defined and discussed in [12].
Here we follow [12] using PAEER to evaluate the presentation attacks, but
calling it as EER for simplicity.
4.2.2 HR database
For the experiments with HR we used a slightly different
experimental protocol in order to compare the information
from RGB and NIR videos. We distinguish between: i)
using only RGB videos, and ii) using only NIR videos.
For the scenario (i) we followed exactly the same exper-
imental protocol than the one explained for 3DMAD, with
the obvious changes in the total number of users and videos
and the possible sizes of the temporal windows, since the
higher duration of the recordings allowed us to have also
into account windows sizes of 20, 40 and 60 seconds.
For the scenario (ii) we also followed the same experi-
mental protocol but this time the feature vector only con-
sists of two features (P,R) as the NIR images only have
one color channel.
5. Results
The results obtained on 3DMAD and HR RGB are sum-
marized in Table 2. The EER has been computed for di-
fferent window sizes. The performance of the presenta-
tion attack detection becomes higher when increasing the
length of the processed video sequence. For short video
sequences (from 1 second to 5 seconds) the system shows
almost random behavior, close to 50% EER, improving for
longer videos. When dealing with those short recordings,
the rPPG sequence does not have enough complete pulse
cycles for extracting robustly the features from the signal
spectrum. As can be seen in Table 2, Li et al. [18] obtained
much lower EER results than ours when working with the
3DMAD database. Compared to their work our approach
is much simpler (as can be seen in Section 3.1.2). The fa-
cial ROI extracted in our experiments is smaller and our
extraction method is less robust to movement and illumina-
tion changes, affecting the final results. Despite that fact,
our target, which was analyzing to what extent the length of
the video affects to the final performance, is achieved.
Comparing the EER results obtained from 3DMAD data
with those obtained with HR, there is a gap between per-
formances, specially when using longer videos, achieving
always better performance when working with the 3DMAD
database. Though our self-collected dataset has been cap-
tured with higher resolution than 3DMAD (1900×1080
pixels vs 640×480 pixels), the frame rate is lower in HR
(25 fps vs 30 fps). This means that even though the detected
ROIs will contain a higher number of pixels, i.e. more spa-
cial information, for each temporal window, the videos will
contain less frames, i.e. less temporal information. This re-
sult indicates that for extracting a robust rPPG signal from
facial videos, the temporal resolution is more relevant than
the spatial resolution.
In Table 3 a comparison between the EERs obtained with
RGB and NIR videos on the HR database is shown. In this
662
Video Length T [s] 1 2 3 4 5 6 7 8 9 10 10 (Li et al. [18])
3DMADMean EER [%] 42.8 45.0 37.8 40.7 33.1 29.7 25 26.1 24.1 22.1 4.71
Std EER [%] 5.0 5.9 8.6 9.8 10.8 18.1 14.5 15.2 11.9 10.3 -
HRMean EER [%] 46.9 45.7 46.5 42.1 42.1 40.1 34.1 36.4 37.3 40.1 -
Std EER [%] 3.9 5.1 3.9 8.1 9.5 10.2 12.7 11.8 11.7 9.6 -
Table 2: EER of our implemented face PAD on 3DMAD and HR databases. The study has been performed changing the
length of the video sequences analyzed. The table also compares our results with [18] on 3DMAD. Values in %. Highlighted
in bold are the best EER results for each database.
Video Length T [s] 1 2 5 10 20 30 40 50 60
RGB videosMean EER [%] 46.9 45.7 42.1 40.1 40.0 40.0 36.6 30.0 25.0
Std EER [%] 3.9 5.1 9.5 9.6 14.0 21.1 20.5 25.8 26.3
NIR videosMean EER [%] 42.4 41.7 38.4 30.9 30.0 16.6 5.0 0.0 0.0
Std EER [%] 5.9 6.4 10.8 13.5 18.8 17.5 15.8 0.0 0.0
Table 3: EER of our implemented face PAD on HR database. The table shows EER values for video sequences of
increasing length T and also results using NIR videos. Values in %. Highlighted in bold are the best EER for each band.
case, longer video sequences (up to 60 seconds) have been
analyzed thanks to the higher duration of the recordings.
The overall performance obtained with NIR videos is much
better than in the RGB case, achieving a 0% EER with 50
seconds of window size.
In the process of extracting the rPPG information from
the facial videos, there are several factors that can affect
the final performance: head motion, ROI location, light
changes, frame rate, resolution or noise sources. The NIR
camera employed for acquiring the video sequences have a
higher frame rate (30 fps), and it adds less noise to the final
images thanks to the higher quality of its sensor. The NIR
spectrum band is also more robust to environmental light
variations than the RGB bands.
Once again, as in the comparison between the perfor-
mance obtained with HR and 3DMAD, the NIR videos have
lower spatial resolution than the RGB, but higher temporal
resolution, reinforcing the hypothesis that the spatial reso-
lution is not as critical as the temporal resolution in order
to achieve high performance when working with rPPG. It
is also remarkable the fact that the feature vector extracted
from NIR videos is only 2-dimensional versus the higher
dimensionality (6 features) from the RGB recordings.
Finally, Fig. 3 shows the temporal evolution of the anti-
spoofing liveness scores in a variable attack scenario (the
higher the scores the lower the estimated probability of pre-
sentation attack). In this scenario we wanted to simulate
a situation in which an imaginary attacker puts on and re-
moves the mask (or the printed HQ photograph) several
times during the same recording. Extracting the features
using the full video may not be enough for discerning be-
tween an attack and a real access in this situation due to
the intra-video variability. Using a short-time/low latency
approach, the PAD output will be able to evolve over the
video, which justifies the usefulness of a short-time rPPG
analysis. It can be seen how the liveness scores decrease
when the attacker puts on the mask and viceversa. In Fig.
3a the video windows analyzed are shorter (T = 5 seconds)
than the length of those in Fig. 3b and Fig. 3c, due to the
lower performance obtained with HR compared to 3DMAD
(see Table 2). Our future research will be oriented to inves-
tigating practical trade-offs towards short-time pulse detec-
tion and related liveness scores, and their integration in this
kind of time-variant attacking scenarios.
6. Conclusions
We analyzed time effects of Presentation Attack Detec-
tion (PAD) against face biometrics based on video pulse
detection (remote PhotoPlethysmoGraphy, rPPG). We an-
alyzed the performance of pulse-based face PAD using two
different databases, one public (3DMAD) and one self-
collected (HR), with various spectrum bands (RGB and
NIR), frame rates and resolutions. We also discussed a pos-
sible time-variant attack scenario in which the advantages
of a short-time rPPG analysis can be exploited.
While time holistic methods may fail to adapt their
decisions in variable situations, we advocate for short-time
PAD able to deal with quick changes in the attacking sce-
nario. For that, the video sequences must have a minimum
length in order to obtain a robust PAD score, and we have
provided some evidence on the performance of pulse-based
face PAD for small video durations.
663
0 5 10 15 20 25 30
Window Number
-1.5
-1
-0.5
0
0.5
1
1.5
Score
Antispoofing Liveness Scores, window size 5 seconds
Mask Flag
Liveness score
Decision Threshold
NO MASK
MASK
(a) Scores from 3DMAD RGB videos.
0 2 4 6 8 10 12 14 16 18
Window Number
-1.5
-1
-0.5
0
0.5
1
1.5
Score
Antispoofing Liveness Scores, window size 15 seconds
Mask Flag
Liveness score
Decision Threshold
NO MASK
MASK
(b) Scores from HR RGB videos.
0 2 4 6 8 10 12 14
Window Number
-1.5
-1
-0.5
0
0.5
1
1.5
Score
Antispoofing Liveness Scores, window size 15 seconds
Mask Flag
Liveness score
Decision ThresholdNO MASK
MASK
(c) Scores from HR NIR videos.
Figure 3: Temporal evolution of the scores in a variable attack scenario. The attacker puts on and removes the mask
several times inside the video. Results using data from 3DMAD and HR are shown. In (a) the video sequences analyzed are
shorter (5 seconds) than the length in (b) and (c), due to the lower performance obtained on HR compared to 3DMAD.
Our implemented rPPG PAD method works better with
higher frame rate recordings in the NIR spectrum band.
This is due to the temporal changing nature of the rPPG
signal, that favors the temporal resolution against the spa-
tial resolution, and also due to the higher robustness against
illumination variations of the NIR spectrum band.
This is the first in-depth research of the temporal de-
pendence of pulse detection for PAD. The proposed short-
time analysis has potential to be generalized into many real-
world use case scenarios in which a low latency analysis of
the video sequence is necessary.
Future work includes: 1) Improving the baseline sys-
tem for getting lower EER with short videos (e.g. using
video magnification techniques [3]). 2) Temporal integra-
tion of individual scores for performing continuous PAD
[21]. 3) Capturing a larger database with a higher number
of users, more variate spoofing artifacts and maybe employ-
ing sensors that combine RGB and NIR information (like
Intel Real-Sense) for improved PAD based on multiple evi-
dences [4, 10]. And 4) accomplishing a more in depth study
of the performance when changing spatial and temporal re-
solution.
7. Acknowledgments
This work was supported in part by Accenture,
project CogniMetrics from MINECO/FEDER under
Grant TEC2015-70627-R, project Neurometrics (CEAL-
AL/2017-13) from UAM-Banco Santander, and the
COST Action CA16101 (Multi-Foresee). The work of J.
Hernandez-Ortega was supported by a Ph.D. Scholarship
from Universidad Autonoma de Madrid.
References
[1] A. Agarwal, R. Singh, and V. M. Face anti-spoofing using
Haralick features. In Proc. IEEE 8th International Con-
ference on Biometrics Theory, Applications and Systems
(BTAS), 2016. 1
[2] J. Allen. Photoplethysmography and its application in clini-
cal physiological measurement. Physiological measurement,
28(3):R1, 2007. 2
[3] S. Bharadwaj, T. I. Dhamecha, M. Vatsa, and R. Singh. Com-
putationally Efficient Face Spoofing Detection with Motion
Magnification. In Procs. of the IEEE Conference on Com-
puter Vision and Pattern Recognition Workshops, CVRW,
pages 105–110, 2013. 8
[4] B. Biggio, G. Fumera, G. L. Marcialis, and F. Roli. Statistical
Meta-Analysis of Presentation Attacks for Secure Multibio-
metric Systems. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 39(3):561–575, 2017. 8
[5] J. Chen, Z. Chang, Q. Qiu, X. Li, G. Sapiro, A. Bronstein,
and M. Pietikainen. Realsense = real heart rate: Illumina-
tion invariant heart rate estimation from videos. In Proc.
Image Processing Theory Tools and Applications (IPTA).
IEEE Press, 2016. 3
[6] T. Chen, P. Yuen, M. Richardson, G. Liu, and Z. She. De-
tection of psychological stress using a hyperspectral imag-
ing technique. IEEE Transactions on Affective Computing,
5(4):391–405, 2014. 3
[7] I. Chingovska, A. Anjos, and S. Marcel. On the effective-
ness of local binary patterns in face anti-spoofing. In Proc.
Biometrics Special Interest Group (BIOSIG). IEEE, 2012. 1
[8] N. Erdogmus and S. Marcel. Spoofing in 2D face recognition
with 3D masks and anti-spoofing with Kinect. In Proc. IEEE
Intl. Conf. on Biometrics: Theory, Applications and Systems,
2013. 5
[9] N. Erdogmus and S. Marcel. Spoofing face recognition with
3D masks. IEEE Transactions on Information Forensics and
Security, 9(7):1084–1097, 2014. 1, 2
[10] J. Fierrez, A. Morales, R. Vera-Rodriguez, and D. Camacho.
Multiple classifiers in biometrics. part 2: Trends and chal-
lenges. Information Fusion, 44:103–112, November 2018.
8
664
[11] J. Galbally, S. Marcel, and J. Fierrez. Biometric antispoof-
ing methods: A survey in face recognition. IEEE Access,
2:1530–1552, 2014. 1
[12] J. Galbally, S. Marcel, and J. Fierrez. Image quality assess-
ment for fake biometric detection: Application to iris, fin-
gerprint, and face recognition. IEEE Transactions on Image
Processing, 23(2):710–724, 2014. 1, 6
[13] J. Galbally and R. Satta. Three-dimensional and two-and-
a-half-dimensional face recognition spoofing using three-
dimensional printed models. IET Biometrics, 5:83–91, 2016.
2
[14] M. Garbey, N. Sun, A. Merla, and I. Pavlidis. Contact-free
measurement of cardiac pulse based on the analysis of ther-
mal imagery. IEEE Transactions on Biomedical Engineer-
ing, 54(8):1418–1426, 2007. 3
[15] A. Hadid, N. Evans, S. Marcel, and J. Fierrez. Biometrics
systems under spoofing attack: an evaluation methodology
and lessons learned. IEEE Signal Processing Magazine,
32(5):20–30, 2015. 1
[16] A. S. Jackson, A. Bulat, V. Argyriou, and G. Tzimiropou-
los. Large pose 3D face reconstruction from a single image
via direct volumetric CNN regression. In Proc. IEEE In-
ternational Conference on Computer Vision (ICCV), pages
1031–1039, 2017. 2
[17] L. Li, P. L. Correia, and A. Hadid. Face recognition under
spoofing attacks: countermeasures and research directions.
IET Biometrics, 7:3–14(11), January 2018. 1
[18] X. Li, J. Komulainen, G. Zhao, P.-C. Yuen, and
M. Pietikainen. Generalized face anti-spoofing by detect-
ing pulse from face videos. In International Conference on
Pattern Recognition (ICPR), pages 4244–4249. IEEE, 2016.
2, 3, 4, 5, 6, 7
[19] S. Marcel, M. S. Nixon, J. Fierrez, and N. Evans. Handbook
of Biometric Anti-Spoofing. Springer, 2018. 1
[20] D. McDuff, S. Gontarek, and R. W. Picard. Improvements in
remote cardiopulmonary measurement using a five band dig-
ital camera. IEEE Transactions on Biomedical Engineering,
61(10):2593–2601, 2014. 3
[21] V. M. Patel, R. Chellappa, D. Chandra, and B. Barbello.
Continuous User Authentication on Mobile Devices: Recent
progress and remaining challenges. IEEE Signal Processing
Magazine, 33(4):49–61, 2016. 8
[22] M.-Z. Poh, D. J. McDuff, and R. W. Picard. Advancements
in noncontact, multiparameter physiological measurements
using a webcam. IEEE Transactions on Biomedical Engi-
neering, 58(1):7–11, 2011. 2, 3
[23] R. Ramachandra and C. Busch. Presentation attack detec-
tion methods for face recognition systems: A comprehensive
survey. ACM Comput. Surv., 50(1), 2017. 1
[24] X. Tan, Y. Li, J. Liu, and L. Jiang. Face liveness detection
from a single image with sparse low rank bilinear discrimina-
tive model. In Proc. ECCV, pages 504–517. Springer, 2010.
1
[25] M. P. Tarvainen, P. O. Ranta-aho, and P. A. Karjalainen. An
advanced detrending method with application to HRV anal-
ysis. IEEE Transactions on Biomedical Engineering, 2002.
4
[26] H. E. Tasli, A. Gudi, and M. den Uyl. Remote PPG based vi-
tal sign measurement using adaptive facial regions. In Proc.
IEEE International Conference on Image Processing (ICIP),
pages 1410–1414. IEEE, 2014. 3
[27] C. Tomasi and T. Kanade. Detection and tracking of point
features. International Journal of Computer Vision, 1991. 3
[28] P. Tome, J. Fierrez, R. Vera-Rodriguez, and M. Nixon. Soft
Biometrics and their Application in Person Recognition at a
Distance. IEEE Transactions on Information Forensics and
Security, 9(3):464–475, 2014. 1
[29] P. Viola and M. Jones. Rapid object detection using a boosted
cascade of simple features. In Proc. IEEE Computer Soci-
ety Conference on Computer Vision and Pattern Recognition
(CVPR), 2001. 3
665