Computer Science Computer Vision and Pattern Recognition

Face recognition and analysis

Description

This cluster of papers focuses on various techniques for face recognition and analysis, including deep learning approaches, facial landmark detection, age estimation, 3D face reconstruction, facial expression analysis, and pose estimation. The papers cover topics such as metric learning, feature learning, and the application of convolutional neural networks to address challenges in face recognition.

Keywords

Face Recognition; Facial Landmark Detection; Deep Learning; Age Estimation; 3D Face Reconstruction; Facial Expression Analysis; Pose Estimation; Feature Learning; Metric Learning; Convolutional Neural Networks

Pushed by big data and deep convolutional neural networks (CNNs), the performance of face recognition is becoming comparable to that of humans. Using private large-scale training datasets, several groups achieve very high performance on LFW, i.e., 97% to 99%. While there are many open-source implementations of CNNs, no large-scale face dataset is publicly available. The current situation in the field of face recognition is that data is more important than algorithms. To solve this problem, this paper proposes a semi-automatic way to collect face images from the Internet and builds a large-scale dataset containing about 10,000 subjects and 500,000 images, called CASIA-WebFace. Based on the database, we use an 11-layer CNN to learn discriminative representations and obtain state-of-the-art accuracy on LFW and YTF. The publication of CASIA-WebFace will attract more research groups to this field and accelerate the development of face recognition in the wild.
Predicting face attributes in the wild is challenging due to complex face variations. We propose a novel deep learning framework for attribute prediction in the wild. It cascades two CNNs, LNet and ANet, which are fine-tuned jointly with attribute tags, but pre-trained differently. LNet is pre-trained by massive general object categories for face localization, while ANet is pre-trained by massive face identities for attribute prediction. This framework not only outperforms the state-of-the-art with a large margin, but also reveals valuable facts on learning face representation. (1) It shows how the performances of face localization (LNet) and attribute prediction (ANet) can be improved by different pre-training strategies. (2) It reveals that although the filters of LNet are fine-tuned only with image-level attribute tags, their response maps over entire images have strong indication of face locations. This fact enables training LNet for face localization with only image-level annotations, but without face bounding boxes or landmarks, which are required by all attribute recognition works. (3) It also demonstrates that the high-level hidden neurons of ANet automatically discover semantic concepts after pre-training with massive face identities, and such concepts are significantly enriched after fine-tuning with attribute tags. Each attribute can be well explained with a sparse linear combination of these concepts.
We propose a new approach for estimating the positions of facial key points with three levels of carefully designed convolutional networks. At each level, the outputs of multiple networks are fused for robust and accurate estimation. Thanks to the deep structures of convolutional networks, global high-level features are extracted over the whole face region at the initialization stage, which helps to locate key points with high accuracy. This has two advantages. First, the texture context information over the entire face is utilized to locate each key point. Second, since the networks are trained to predict all the key points simultaneously, the geometric constraints among key points are implicitly encoded. The method can therefore avoid local minima caused by ambiguity and data corruption in difficult image samples due to occlusions, large pose variations, and extreme lighting. The networks at the following two levels are trained to locally refine initial predictions, and their inputs are limited to small regions around the initial predictions. Several network structures critical for accurate and robust facial point detection are investigated. Extensive experiments show that our approach outperforms state-of-the-art methods in both detection accuracy and reliability.
This paper proposes to learn a set of high-level feature representations through deep learning, referred to as Deep hidden IDentity features (DeepID), for face verification. We argue that DeepID can be effectively learned through challenging multi-class face identification tasks, whilst they can be generalized to other tasks (such as verification) and new identities unseen in the training set. Moreover, the generalization capability of DeepID increases as more face classes are to be predicted at training. DeepID features are taken from the last hidden layer neuron activations of deep convolutional networks (ConvNets). When learned as classifiers to recognize about 10,000 face identities in the training set and configured to keep reducing the neuron numbers along the feature extraction hierarchy, these deep ConvNets gradually form compact identity-related features in the top layers with only a small number of hidden neurons. The proposed features are extracted from various face regions to form complementary and over-complete representations. Any state-of-the-art classifiers can be learned based on these high-level representations for face verification. 97.45% verification accuracy on LFW is achieved with only weakly aligned faces.
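As a rough illustration of the "last hidden layer as identity feature" idea, the sketch below trains a toy PyTorch ConvNet for multi-class identification and then reuses its penultimate activations as features for verification; the architecture, sizes, and data are invented for the example and are not the paper's ConvNets.

```python
# Minimal sketch: train for identification, reuse the last hidden layer as the feature.
import torch
import torch.nn as nn

class TinyFaceNet(nn.Module):
    def __init__(self, num_identities: int, feat_dim: int = 160):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.hidden = nn.Linear(64, feat_dim)            # identity features live here
        self.classifier = nn.Linear(feat_dim, num_identities)

    def forward(self, x, return_features: bool = False):
        h = torch.relu(self.hidden(self.backbone(x)))
        return h if return_features else self.classifier(h)

model = TinyFaceNet(num_identities=10000)
faces = torch.randn(8, 3, 55, 47)                        # toy batch of face crops
logits = model(faces)                                     # used with cross-entropy during training
features = model(faces, return_features=True)            # 160-d features reused for verification
```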
In the Fall of 2000, we collected a database of more than 40,000 facial images of 68 people. Using the Carnegie Mellon University 3D Room, we imaged each person across 13 different poses, under 43 different illumination conditions, and with four different expressions. We call this the CMU pose, illumination, and expression (PIE) database. We describe the imaging hardware, the collection procedure, the organization of the images, several possible uses, and how to obtain the database.
Recognizing faces in unconstrained videos is a task of mounting importance. While obviously related to face recognition in still images, it has its own unique characteristics and algorithmic requirements. Over the years several methods have been suggested for this problem, and a few benchmark data sets have been assembled to facilitate its study. However, there is a sizable gap between the actual application needs and the current state of the art. In this paper we make the following contributions. (a) We present a comprehensive database of labeled videos of faces in challenging, uncontrolled conditions (i.e., 'in the wild'), the 'YouTube Faces' database, along with benchmark, pair-matching tests. (b) We employ our benchmark to survey and compare the performance of a large variety of existing video face recognition techniques. Finally, (c) we describe a novel set-to-set similarity measure, the Matched Background Similarity (MBGS). This similarity is shown to considerably improve performance on the benchmark tests.
This paper addresses the problem of Face Alignment for a single image. We show how an ensemble of regression trees can be used to estimate the face's landmark positions directly from a sparse subset of pixel intensities, achieving super-realtime performance with high quality predictions. We present a general framework based on gradient boosting for learning an ensemble of regression trees that optimizes the sum of square error loss and naturally handles missing or partially labelled data. We show how using appropriate priors exploiting the structure of image data helps with efficient feature selection. Different regularization strategies and their importance in combating overfitting are also investigated. In addition, we analyse the effect of the quantity of training data on the accuracy of the predictions and explore the effect of data augmentation using synthesized data.
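The cascade described above follows the standard cascaded-regression form (notation here is for exposition, not quoted from the paper):

\[
\hat{S}^{(t+1)} = \hat{S}^{(t)} + r_t\big(I, \hat{S}^{(t)}\big), \qquad t = 0, \ldots, T-1,
\]

where \(I\) is the image, \(\hat{S}^{(t)}\) is the current estimate of the landmark positions, and each regressor \(r_t\) is a gradient-boosted ensemble of regression trees fit to the residuals \(S - \hat{S}^{(t)}\) on the training set, using differences of pixel intensities indexed relative to \(\hat{S}^{(t)}\) as features.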
In 2000, the Cohn-Kanade (CK) database was released for the purpose of promoting research into automatically detecting individual facial expressions. Since then, the CK database has become one of the most widely used test-beds for algorithm development and evaluation. During this period, three limitations have become apparent: 1) While AU codes are well validated, emotion labels are not, as they refer to what was requested rather than what was actually performed, 2) The lack of a common performance metric against which to evaluate new algorithms, and 3) Standard protocols for common databases have not emerged. As a consequence, the CK database has been used for both AU and emotion detection (even though labels for the latter have not been validated), comparison with benchmark algorithms is missing, and use of random subsets of the original database makes meta-analyses difficult. To address these and other concerns, we present the Extended Cohn-Kanade (CK+) database. The number of sequences is increased by 22% and the number of subjects by 27%. The target expression for each sequence is fully FACS coded and emotion labels have been revised and validated. In addition to this, non-posed sequences for several types of smiles and their associated metadata have been added. We present baseline results using Active Appearance Models (AAMs) and a linear support vector machine (SVM) classifier using a leave-one-out subject cross-validation for both AU and emotion detection for the posed data. The emotion and AU labels, along with the extended image data and tracked landmarks, will be made available July 2010.
We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a deep sparse autoencoder on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200×200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting from these learned features, we trained our network to recognize 22,000 object categories from ImageNet and achieve a leap of 70% relative improvement over the previous state-of-the-art.
Images containing faces are essential to intelligent vision-based human-computer interaction, and research efforts in face processing include face recognition, face tracking, pose estimation and expression recognition. However, many reported methods assume that the faces in an image or an image sequence have been identified and localized. To build fully automated systems that analyze the information contained in face images, robust and efficient face detection algorithms are required. Given a single image, the goal of face detection is to identify all image regions which contain a face, regardless of its 3D position, orientation and lighting conditions. Such a problem is challenging because faces are non-rigid and have a high degree of variability in size, shape, color and texture. Numerous techniques have been developed to detect faces in a single image, and the purpose of this paper is to categorize and evaluate these algorithms. We also discuss relevant issues such as data collection, evaluation metrics and benchmarking. After analyzing these algorithms and identifying their limitations, we conclude with several promising directions for future research.
We present a generative appearance-based method for recognizing human faces under variation in lighting and viewpoint. Our method exploits the fact that the set of images of an object in fixed pose, but under all possible illumination conditions, is a convex cone in the space of images. Using a small number of training images of each face taken with different lighting directions, the shape and albedo of the face can be reconstructed. In turn, this reconstruction serves as a generative model that can be used to render (or synthesize) images of the face under novel poses and illumination conditions. The pose space is then sampled and, for each pose, the corresponding illumination cone is approximated by a low-dimensional linear subspace whose basis vectors are estimated using the generative model. Our recognition algorithm assigns to a test image the identity of the closest approximated illumination cone. Test results show that the method performs almost without error, except on the most extreme lighting directions.
A method is presented for the representation of (pictures of) faces. Within a specified framework the representation is ideal. This results in the characterization of a face, to within an error bound, by a relatively low-dimensional vector. The method is illustrated in detail by the use of an ensemble of pictures taken for this purpose.
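A minimal sketch of this low-dimensional characterization idea using a principal-component (eigenpicture) decomposition on a toy ensemble; the data, image size, and number of retained components are illustrative only, not values from the paper.

```python
# Sketch: characterize each face by a small coefficient vector over "eigenpictures".
import numpy as np

faces = np.random.rand(200, 64 * 64)           # 200 vectorized face images (toy data)
mean_face = faces.mean(axis=0)
X = faces - mean_face                           # center the ensemble

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 40                                          # keep a relatively small number of components
basis = Vt[:k]                                  # "eigenpictures"

coeffs = X @ basis.T                            # each face -> k-dimensional vector
reconstruction = coeffs @ basis + mean_face     # approximate face within an error bound
err = np.linalg.norm(reconstruction - faces) / np.linalg.norm(faces)
print(f"relative reconstruction error with k={k}: {err:.3f}")
```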
As a recently proposed technique, sparse representation based classification (SRC) has been widely used for face recognition (FR). SRC first codes a testing sample as a sparse linear combination of all the training samples, and then classifies the testing sample by evaluating which class leads to the minimum representation error. While the importance of sparsity is much emphasized in SRC and many related works, the use of collaborative representation (CR) in SRC is ignored by most literature. However, is it really the l1-norm sparsity that improves the FR accuracy? This paper is devoted to analyzing the working mechanism of SRC, and indicates that it is the CR, not the l1-norm sparsity, that makes SRC powerful for face classification. Consequently, we propose a very simple yet much more efficient face classification scheme, namely CR based classification with regularized least square (CRC_RLS). The extensive experiments clearly show that CRC_RLS has very competitive classification results, while it has significantly less complexity than SRC.
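A rough sketch of the CRC_RLS scheme described above on synthetic data: the query is coded over all training samples with a closed-form regularized least-squares solution, then assigned to the class with the smallest (coefficient-normalized) class-wise residual. Dimensions and the regularization value are illustrative.

```python
# Sketch: collaborative representation with regularized least squares, then
# classification by class-wise residual.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 60))             # dictionary: 60 training faces, 256-d each
labels = np.repeat(np.arange(10), 6)       # 10 subjects, 6 samples per subject
y = X[:, 7] + 0.05 * rng.normal(size=256)  # query close to a sample of subject 1

lam = 0.01
P = np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T   # precomputable projector
alpha = P @ y                                                  # collaborative code over all samples

scores = []
for c in range(10):
    idx = labels == c
    resid = np.linalg.norm(y - X[:, idx] @ alpha[idx])         # class-wise reconstruction error
    scores.append(resid / (np.linalg.norm(alpha[idx]) + 1e-12))
print("predicted subject:", int(np.argmin(scores)))
```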
The key challenge of face recognition is to develop effective feature representations for reducing intra-personal variations while enlarging inter-personal differences. In this paper, we show that it can be well solved with deep learning and using both face identification and verification signals as supervision. The Deep IDentification-verification features (DeepID2) are learned with carefully designed deep convolutional networks. The face identification task increases the inter-personal variations by drawing DeepID2 features extracted from different identities apart, while the face verification task reduces the intra-personal variations by pulling DeepID2 features extracted from the same identity together, both of which are essential to face recognition. The learned DeepID2 features can be well generalized to new identities unseen in the training data. On the challenging LFW dataset [11], 99.15% face verification accuracy is achieved. Compared with the best previous deep learning result [20] on LFW, the error rate has been significantly reduced by 67%.
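A commonly cited form of the two supervisory signals (the margin parameterization here is illustrative and may differ in detail from the paper): the identification signal is the usual cross-entropy over identities, while the verification signal on a feature pair \((f_i, f_j)\) with pair label \(y_{ij}\) is

\[
\mathrm{Verif}(f_i, f_j, y_{ij}) =
\begin{cases}
\tfrac{1}{2}\,\lVert f_i - f_j \rVert_2^2, & y_{ij} = 1 \ (\text{same identity}),\\
\tfrac{1}{2}\,\max\big(0,\; m - \lVert f_i - f_j \rVert_2\big)^2, & y_{ij} = -1 \ (\text{different identities}),
\end{cases}
\]

so identification pushes features of different identities apart while verification pulls features of the same identity together.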
In modern face recognition, the conventional pipeline consists of four stages: detect => align => represent => classify. We revisit both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derive a face representation from a nine-layer deep neural network. This deep network involves more than 120 million parameters using several locally connected layers without weight sharing, rather than the standard convolutional layers. Thus we trained it on the largest facial dataset to-date, an identity labeled dataset of four million facial images belonging to more than 4,000 identities. The learned representations coupling the accurate model-based alignment with the large facial database generalize remarkably well to faces in unconstrained environments, even with a simple classifier. Our method reaches an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the current state of the art by more than 27%, closely approaching human-level performance.
The capacity to estimate the head pose of another person is a common human ability that presents a unique challenge for computer vision systems. Compared to face detection and recognition, which have been the primary foci of face-related vision research, identity-invariant head pose estimation has fewer rigorously evaluated systems or generic solutions. In this paper, we discuss the inherent difficulties in head pose estimation and present an organized survey describing the evolution of the field. Our discussion focuses on the advantages and disadvantages of each approach and spans 90 of the most innovative and characteristic papers that have been published on this topic. We compare these systems by focusing on their ability to estimate coarse and fine head pose, highlighting approaches that are well suited for unconstrained environments.
We describe a new method of matching statistical models of appearance to images. A set of model parameters control modes of shape and gray-level variation learned from a training set. We construct an efficient iterative matching algorithm by learning the relationship between perturbations in the model parameters and the induced image errors.
Many computer vision problems (e.g., camera calibration, image alignment, structure from motion) are solved through a nonlinear optimization method. It is generally accepted that 2nd order descent methods are the most robust, fast and reliable approaches for nonlinear optimization of a general smooth function. However, in the context of computer vision, 2nd order descent methods have two main drawbacks: (1) The function might not be analytically differentiable and numerical approximations are impractical. (2) The Hessian might be large and not positive definite. To address these issues, this paper proposes a Supervised Descent Method (SDM) for minimizing a Non-linear Least Squares (NLS) function. During training, the SDM learns a sequence of descent directions that minimizes the mean of NLS functions sampled at different points. In testing, SDM minimizes the NLS objective using the learned descent directions without computing the Jacobian nor the Hessian. We illustrate the benefits of our approach in synthetic and real examples, and show how SDM achieves state-of-the-art performance in the problem of facial feature detection. The code is available at www.humansensing.cs.cmu.edu/intraface.
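The learned update can be written compactly (symbols chosen here for exposition, assuming SIFT-like features \(\phi\) extracted around the current landmark estimate):

\[
x_k = x_{k-1} + R_{k-1}\,\phi(x_{k-1}) + b_{k-1},
\]

where each pair \((R_{k-1}, b_{k-1})\) is a generic descent direction and bias learned by linear regression so that, averaged over the training samples, the update moves the current estimate toward the ground-truth minimum; at test time no Jacobian or Hessian is required.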
This paper presents a method for face recognition across variations in pose, ranging from frontal to profile views, and across a wide range of illuminations, including cast shadows and specular reflections. To account for these variations, the algorithm simulates the process of image formation in 3D space, using computer graphics, and it estimates 3D shape and texture of faces from single images. The estimate is achieved by fitting a statistical, morphable model of 3D faces to images. The model is learned from a set of textured 3D scans of heads. We describe the construction of the morphable model, an algorithm to fit the model to images, and a framework for face identification. In this framework, faces are represented by model parameters for 3D shape and texture. We present results obtained with 4,488 images from the publicly available CMU-PIE database and 1,940 images from the FERET database.
We present a neural network-based upright frontal face detection system. A retinally connected neural network examines small windows of an image and decides whether each window contains a face. The system arbitrates between multiple networks to improve performance over a single network. We present a straightforward procedure for aligning positive face examples for training. To collect negative examples, we use a bootstrap algorithm, which adds false detections into the training set as training progresses. This eliminates the difficult task of manually selecting nonface training examples, which must be chosen to span the entire space of nonface images. Simple heuristics, such as using the fact that faces rarely overlap in images, can further improve the accuracy. Comparisons with several other state-of-the-art face detection systems are presented, showing that our system has comparable performance in terms of detection and false-positive rates.
In this paper, a new technique for modeling textured 3D faces is introduced. 3D faces can either be generated automatically from one or more photographs, or modeled directly through an intuitive user interface. Users are assisted in two key problems of computer aided face modeling. First, new face images or new 3D face models can be registered automatically by computing dense one-to-one correspondence to an internal face model. Second, the approach regulates the naturalness of modeled faces, avoiding faces with an "unlikely" appearance. Starting from an example set of 3D face models, we derive a morphable face model by transforming the shape and texture of the examples into a vector space representation. New faces and expressions can be modeled by forming linear combinations of the prototypes. Shape and texture constraints derived from the statistics of our example faces are used to guide manual modeling or automated matching algorithms. We show 3D face reconstructions from single images and their applications for photo-realistic image manipulations. We also demonstrate face manipulations according to complex parameters such as gender, fullness of a face or its distinctiveness.
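The morphable-model construction above amounts to expressing any face as a linear combination of example heads (a standard paraphrase of the model, not the paper's exact notation):

\[
S_{\text{model}} = \bar{S} + \sum_i \alpha_i\, s_i, \qquad
T_{\text{model}} = \bar{T} + \sum_i \beta_i\, t_i,
\]

where \(\bar{S}, \bar{T}\) are the average shape and texture, \(s_i, t_i\) are principal components of the example shapes and textures, and priors on the coefficients \(\alpha, \beta\) derived from the example statistics keep generated faces plausible.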
We present a novel approach for real-time facial reenactment of a monocular target video sequence (e.g., Youtube video). The source sequence is also a monocular video stream, captured live with a commodity webcam. Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. To this end, we first address the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling. At run time, we track facial expressions of both source and target video using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between source and target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, we convincingly re-render the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination. We demonstrate our method in a live setup, where Youtube videos are reenacted in real time.
Over the past few years, there has been an increased interest in automatic facial behavior analysis and understanding. We present OpenFace - an open source tool intended for computer vision and machine learning researchers, the affective computing community and people interested in building interactive applications based on facial behavior analysis. OpenFace is the first open source tool capable of facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. The computer vision algorithms which represent the core of OpenFace demonstrate state-of-the-art results in all of the above mentioned tasks. Furthermore, our tool is capable of real-time performance and is able to run from a simple webcam without any specialist hardware. Finally, OpenFace allows for easy integration with other applications and devices through a lightweight messaging system.
Deep metric learning has gained much popularity in recent years, following the success of deep learning. However, existing frameworks of deep metric learning based on contrastive loss and triplet loss often suffer from slow convergence, partially because they employ only one negative example while not interacting with the other negative classes in each update. In this paper, we propose to address this problem with a new metric learning objective called multi-class N-pair loss. The proposed objective function firstly generalizes triplet loss by allowing joint comparison among more than one negative example - more specifically, N-1 negative examples - and secondly reduces the computational burden of evaluating deep embedding vectors via an efficient batch construction strategy using only N pairs of examples, instead of (N+1) x N. We demonstrate the superiority of our proposed loss to the triplet loss as well as other competing loss functions for a variety of tasks on several visual recognition benchmarks, including fine-grained object recognition and verification, image clustering and retrieval, and face verification and identification.
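For reference, the objective is usually written in the following form (a standard statement of the loss; the symbols are not quoted from the paper). Given an anchor embedding \(f_i\), its positive \(f_i^+\), and the positives of the other \(N-1\) classes acting as negatives:

\[
\mathcal{L}_{N\text{-pair}} = \frac{1}{N}\sum_{i=1}^{N}\log\Big(1 + \sum_{j \ne i}\exp\big(f_i^{\top} f_j^{+} - f_i^{\top} f_i^{+}\big)\Big),
\]

so every update compares the anchor against N-1 negatives at once instead of a single one.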
Face recognition has made extraordinary progress owing to the advancement of deep convolutional neural networks (CNNs). The central task of face recognition, including face verification and identification, involves face feature discrimination. However, the traditional softmax loss of deep CNNs usually lacks the power of discrimination. To address this problem, recently several loss functions such as center loss, large margin softmax loss, and angular softmax loss have been proposed. All these improved losses share the same idea: maximizing inter-class variance and minimizing intra-class variance. In this paper, we propose a novel loss function, namely large margin cosine loss (LMCL), to realize this idea from a different perspective. More specifically, we reformulate the softmax loss as a cosine loss by L2 normalizing both features and weight vectors to remove radial variations, based on which a cosine margin term is introduced to further maximize the decision margin in the angular space. As a result, minimum intra-class variance and maximum inter-class variance are achieved by virtue of normalization and cosine decision margin maximization. We refer to our model trained with LMCL as CosFace. Extensive experimental evaluations are conducted on the most popular public-domain face recognition datasets such as MegaFace Challenge, YouTube Faces (YTF) and Labeled Faces in the Wild (LFW). We achieve the state-of-the-art performance on these benchmarks, which confirms the effectiveness of our proposed approach.
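The large margin cosine loss takes the following commonly quoted form (scale \(s\) and margin \(m\) are hyperparameters; features and class weights are L2-normalized so that \(\cos\theta_{j,i} = W_j^{\top} x_i\)):

\[
L_{\mathrm{lmc}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,s(\cos\theta_{y_i,i} - m)}}{e^{\,s(\cos\theta_{y_i,i} - m)} + \sum_{j \ne y_i} e^{\,s\cos\theta_{j,i}}},
\]

i.e. the margin is subtracted directly from the target-class cosine rather than applied to the angle.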
We present an algorithm for simultaneous face detection, landmarks localization, pose estimation and gender recognition using deep convolutional neural networks (CNN). The proposed method, called HyperFace, fuses the intermediate layers of a deep CNN using a separate CNN followed by a multi-task learning algorithm that operates on the fused features. It exploits the synergy among the tasks, which boosts up their individual performances. Additionally, we propose two variants of HyperFace: (1) HyperFace-ResNet, which builds on the ResNet-101 model and achieves significant improvement in performance, and (2) Fast-HyperFace, which uses a high recall fast face detector for generating region proposals to improve the speed of the algorithm. Extensive experiments show that the proposed models are able to capture both global and local information in faces and perform significantly better than many competitive algorithms for each of these four tasks.
This paper addresses the deep face recognition (FR) problem under the open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space. However, few existing algorithms can effectively achieve this criterion. To this end, we propose the angular softmax (A-Softmax) loss that enables convolutional neural networks (CNNs) to learn angularly discriminative features. Geometrically, A-Softmax loss can be viewed as imposing discriminative constraints on a hypersphere manifold, which intrinsically matches the prior that faces also lie on a manifold. Moreover, the size of the angular margin can be quantitatively adjusted by a parameter m. We further derive specific m to approximate the ideal feature criterion. Extensive analysis and experiments on Labeled Faces in the Wild (LFW), YouTube Faces (YTF) and MegaFace Challenge 1 show the superiority of A-Softmax loss in FR tasks.
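In the A-Softmax formulation, the target-class term \(\cos\theta_{y_i}\) is replaced by a monotonic function of \(m\theta_{y_i}\) (this is the usual statement of the loss; the piecewise construction is needed because \(\cos(m\theta)\) alone is not monotonic over \([0,\pi]\)):

\[
L = -\frac{1}{N}\sum_{i}\log\frac{e^{\,\lVert x_i\rVert\,\psi(\theta_{y_i,i})}}{e^{\,\lVert x_i\rVert\,\psi(\theta_{y_i,i})} + \sum_{j \ne y_i} e^{\,\lVert x_i\rVert\cos\theta_{j,i}}},
\qquad
\psi(\theta) = (-1)^k \cos(m\theta) - 2k,\ \ \theta \in \Big[\tfrac{k\pi}{m}, \tfrac{(k+1)\pi}{m}\Big],
\]

with \(k \in \{0, \ldots, m-1\}\). The margin here multiplies the angle, in contrast to the later additive-margin losses.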
Face detection is one of the most studied topics in the computer vision community. Much of this progress has been made possible by the availability of face detection benchmark datasets. We show that there is a gap between current face detection performance and the real world requirements. To facilitate future face detection research, we introduce the WIDER FACE dataset, which is 10 times larger than existing datasets. The dataset contains rich annotations, including occlusions, poses, event categories, and face bounding boxes. Faces in the proposed dataset are extremely challenging due to large variations in scale, pose and occlusion. Furthermore, we show that the WIDER FACE dataset is an effective training source for face detection. We benchmark several representative detection systems, providing an overview of state-of-the-art performance, and propose a solution to deal with large scale variation. Finally, we discuss common failure cases that are worth further investigation.
In this paper, we introduce a new large-scale face dataset named VGGFace2. The dataset contains 3.31 million images of 9131 subjects, with an average of 362.6 images for each subject. Images are downloaded from Google Image Search and have large variations in pose, age, illumination, ethnicity and profession (e.g. actors, athletes, politicians). The dataset was collected with three goals in mind: (i) to have both a large number of identities and also a large number of images for each identity; (ii) to cover a large range of pose, age and ethnicity; and (iii) to minimise the label noise. We describe how the dataset was collected, in particular the automated and manual filtering stages to ensure a high accuracy for the images of each identity. To assess face recognition performance using the new dataset, we train ResNet-50 (with and without Squeeze-and-Excitation blocks) Convolutional Neural Networks on VGGFace2, on MS-Celeb-1M, and on their union, and show that training on VGGFace2 leads to improved recognition performance over pose and age. Finally, using the models trained on these datasets, we demonstrate state-of-the-art performance on the IJB-A and IJB-B face recognition benchmarks, exceeding the previous state-of-the-art by a large margin. The dataset and models are publicly available.
One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that can enhance the discriminative power. Centre loss penalises the distance between deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in the angular space and therefore penalises the angles between deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to its exact correspondence to geodesic distance on a hypersphere. We present arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, which include a new large-scale image database with trillions of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead. To facilitate future research, the code has been made available.
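A minimal PyTorch sketch of an additive angular margin applied to normalized logits, in the spirit described above; the scale s and margin m are illustrative defaults, not claims about the paper's exact training configuration.

```python
# Sketch: add a margin to the target-class angle, then scale and feed to cross-entropy.
import torch
import torch.nn.functional as F

def additive_angular_margin_logits(features, weight, labels, s=64.0, m=0.5):
    """features: (B, d), weight: (C, d), labels: (B,) integer class indices."""
    cosine = F.normalize(features) @ F.normalize(weight).t()        # cos(theta)
    theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, weight.size(0)).bool()
    cosine_margin = torch.cos(theta + m)                             # margin added on the angle
    logits = torch.where(target, cosine_margin, cosine)
    return s * logits

B, d, C = 8, 512, 1000
feats, W = torch.randn(B, d), torch.randn(C, d)
labels = torch.randint(0, C, (B,))
loss = F.cross_entropy(additive_angular_margin_logits(feats, W, labels), labels)
print(loss.item())
```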
Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors. Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128 bytes per face. On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result [15] by 30% on both datasets.
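A minimal sketch of a triplet loss on L2-normalized embeddings, in the spirit of the description above; the margin value and the random embeddings are illustrative only.

```python
# Sketch: pull anchor-positive pairs together and push anchor-negative pairs
# apart by at least a margin, on unit-norm embeddings.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    a = F.normalize(anchor)                 # embeddings constrained to the unit hypersphere
    p = F.normalize(positive)
    n = F.normalize(negative)
    d_ap = (a - p).pow(2).sum(dim=1)        # squared Euclidean distances
    d_an = (a - n).pow(2).sum(dim=1)
    return F.relu(d_ap - d_an + margin).mean()

emb = lambda: torch.randn(16, 128)          # stand-ins for 128-d face embeddings
print(triplet_loss(emb(), emb(), emb()).item())
```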
Face detection and alignment in unconstrained environments are challenging due to various poses, illuminations and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, we propose a deep cascaded multi-task framework which exploits the inherent correlation between them to boost up their performance. In particular, our framework adopts a cascaded structure with three stages of carefully designed deep convolutional networks that predict face and landmark location in a coarse-to-fine manner. In addition, in the learning process, we propose a new online hard sample mining strategy that can improve the performance automatically without manual sample selection. Our method achieves superior accuracy over the state-of-the-art techniques on the challenging FDDB and WIDER FACE benchmarks for face detection, and the AFLW benchmark for face alignment, while keeping real-time performance.
In this paper, we propose a conceptually simple and geometrically interpretable objective function, i.e. additive margin Softmax (AM-Softmax), for deep face verification. In general, the face verification task can be viewed as a metric learning problem, so learning large-margin face features whose intra-class variation is small and inter-class difference is large is of great importance in order to achieve good performance. Recently, Large-margin Softmax and Angular Softmax have been proposed to incorporate the angular margin in a multiplicative manner. In this work, we introduce a novel additive angular margin for the Softmax loss, which is intuitively appealing and more interpretable than the existing works. We also emphasize and discuss the importance of feature normalization in the paper. Most importantly, our experiments on LFW BLUFR and MegaFace show that our additive margin softmax loss consistently performs better than the current state-of-the-art methods using the same network architecture and training dataset. Our code has also been made available at https://github.com/happynear/AMSoftmax
This paper investigates how far a very deep neural network is from attaining close to saturating performance on existing 2D and 3D face alignment datasets. To this end, we make the following 5 contributions: (a) we construct, for the first time, a very strong baseline by combining a state-of-the-art architecture for landmark localization with a state-of-the-art residual block, train it on a very large yet synthetically expanded 2D facial landmark dataset and finally evaluate it on all other 2D facial landmark datasets. (b) We create a guided-by-2D-landmarks network which converts 2D landmark annotations to 3D and unifies all existing datasets, leading to the creation of LS3D-W, the largest and most challenging 3D facial landmark dataset to date (~230,000 images). (c) Following that, we train a neural network for 3D face alignment and evaluate it on the newly introduced LS3D-W. (d) We further look into the effect of all "traditional" factors affecting face alignment performance like large pose, initialization and resolution, and introduce a "new" one, namely the size of the network. (e) We show that both 2D and 3D face alignment networks achieve performance of remarkable accuracy which is probably close to saturating the datasets used. Training and testing code as well as the dataset can be downloaded from https://www.adrianbulat.com/face-alignment/.
An increasing number of scholars, policymakers and grassroots communities argue that artificial intelligence (AI) research, and computer-vision research in particular, has become the primary source for developing and powering mass surveillance [1-7]. Yet, the pathways from computer vision to surveillance continue to be contentious. Here we present an empirical account of the nature and extent of the surveillance AI pipeline, showing extensive evidence of the close relationship between the field of computer vision and surveillance. Through an analysis of computer-vision research papers and citing patents, we found that most of these documents enable the targeting of human bodies and body parts. Comparing the 1990s to the 2010s, we observed a fivefold increase in the number of these computer-vision papers linked to downstream surveillance-enabling patents. Additionally, our findings challenge the notion that only a few rogue entities enable surveillance. Rather, we found that the normalization of targeting humans permeates the field. This normalization is especially striking given patterns of obfuscation. We reveal obfuscating language that allows documents to avoid direct mention of targeting humans, for example, by normalizing the practice of referring to humans as 'objects' to be studied without special consideration. Our results indicate the extensive ties between computer-vision research and surveillance.
In recent years, deep learning techniques have become increasingly prominent in face recognition tasks, particularly through the extraction and classification of face vectors. These vectors enable the inference of demographic attributes such as gender, age, and ethnicity. This study introduces a gender classification approach based solely on face vectors, avoiding the use of traditional machine learning algorithms. Face embeddings were generated using three popular models: dlib, ArcFace, and FaceNet512. For classification, the Average Neural Face Embeddings (ANFE) technique was applied by calculating distances between vectors. To improve gender recognition performance for Asian individuals, a new dataset was created by scraping facial images and related metadata from AsianWiki. The experimental evaluations revealed that ANFE models based on ArcFace achieved classification accuracies of 93.1% for Asian women and 90.2% for Asian men. In contrast, the models utilizing dlib embeddings performed notably lower, with accuracies dropping to 76.4% for women and 74.3% for men. Among the tested models, FaceNet512 provided the best results, reaching 97.5% accuracy for female subjects and 94.2% for males. Furthermore, this study includes a comparative analysis between ANFE and other commonly used gender classification methods.
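A toy sketch of the "average embedding plus distance" classification idea described above; the random vectors stand in for real face embeddings, and the per-class means play the role of the averaged embeddings.

```python
# Sketch: average the embeddings of each class, then label a new face by the
# nearer class centroid.
import numpy as np

rng = np.random.default_rng(1)
female_embeddings = rng.normal(loc=0.3, size=(100, 512))   # stand-ins for model embeddings
male_embeddings = rng.normal(loc=-0.3, size=(100, 512))

avg_female = female_embeddings.mean(axis=0)                 # averaged embedding per class
avg_male = male_embeddings.mean(axis=0)

def classify(embedding):
    d_f = np.linalg.norm(embedding - avg_female)
    d_m = np.linalg.norm(embedding - avg_male)
    return "female" if d_f < d_m else "male"

print(classify(rng.normal(loc=0.3, size=512)))              # expected: "female"
```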
Ha Thu Le | International Journal of Oral and Maxillofacial Surgery
D. Shukla | International Journal of Scientific Research in Engineering and Management
Keywords: Artificial Intelligence, deepfake technology, Generative Adversarial Networks (GAN), Detection System, Detection Accuracy, User accessibility, Digital content verification. Abstract: In recent years, the rise of deepfake technology has raised significant concerns regarding the authenticity of digital content. Deepfakes, which are synthetic media created using advanced artificial intelligence techniques, can mislead viewers and pose risks to personal privacy, public trust, and social discourse. The proposed system focuses on developing a Generative Adversarial Network (GAN)-based deepfake detection system that aims to identify manipulated images and videos accurately and efficiently. The importance of this research lies in its potential to enhance digital content verification, ultimately restoring trust in media across various sectors, including news, entertainment, and social media. The proposed approach utilizes GANs to both generate synthetic deepfake samples for training and serve as the basis for the detection engine. By focusing solely on GANs, the system leverages their unique capabilities to create a model that is adaptable to evolving deepfake generation techniques. The architecture of the system includes a user-friendly frontend, a robust backend, and a powerful detection engine, all integrated seamlessly to ensure real-time processing and analysis of media files.
A. Ng Yi Qi, Yaxin Weng, Yu Fan Sim, et al. | International Journal of Oral and Maxillofacial Surgery
International Research Journal of Modernization in Engineering Technology and Science
Background: In computer vision and image processing, face recognition is an increasingly popular field of research that identifies similar faces in a picture and assigns a suitable label. It is one of the desired detection techniques employed in forensics for criminal identification. Methods: This study explores a face recognition system for monozygotic twins utilizing three widely recognized feature descriptor algorithms: Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), and Oriented FAST and Rotated BRIEF (ORB), with region-specific facial landmarks. These landmarks were extracted from 468 points detected through the MediaPipe framework, which enables simultaneous recognition of multiple faces. Quantitative similarity metrics served as inputs for four classification methods: Support Vector Machine (SVM), eXtreme Gradient Boost (XGBoost), Light Gradient Boost Machine (LGBM), and Nearest Centroid (NC). The effectiveness of these algorithms was tested and validated using the challenging ND Twins and 3D TEC datasets, among the most difficult datasets for 2D and 3D face recognition research, from the University of Notre Dame. Results: Testing with the ND Twins and 3D TEC datasets revealed significant performance differences. Results demonstrated that 2D facial images achieved notably higher recognition accuracy than 3D images. The 2D images produced accuracies of 88% (SVM), 83% (LGBM), 83% (XGBoost), and 79% (NC). In contrast, the 3D TEC dataset yielded lower accuracies of 74%, 72%, 72%, and 70% with the same classifiers. Conclusion: The hybrid feature extraction approach proved most effective, with maximum accuracy rates reaching 88% for 2D facial images and 74% for 3D facial images. This work contributes significantly to forensic science by enhancing the reliability of facial recognition systems when confronted with the indistinguishable facial characteristics of monozygotic twins.
International Journal of Intelligent Engineering and Systems
Face segmentation is a critical component in new media post-production, enabling the precise separation of facial regions from complex backgrounds at the pixel level. With the increasing demand for flexible and efficient segmentation solutions across diverse media scenarios, such as variety shows, period dramas, and other productions, there is a pressing need for adaptable methods that can perform reliably under varying conditions. However, existing approaches primarily depend on fully supervised learning, which requires extensive manual annotation and incurs high labor costs. To overcome these limitations, we propose a novel weakly supervised face segmentation framework that leverages large-scale vision models to automatically generate high-quality pseudo-labels. These pseudo-labels are then used to train segmentation networks in a dual-model architecture, where two complementary models collaboratively enhance segmentation performance. Our method significantly reduces the reliance on manual labeling while maintaining competitive accuracy. Extensive experiments demonstrate that our approach not only improves segmentation precision and efficiency but also streamlines post-production workflows, lowering human effort and accelerating project timelines. Furthermore, this framework reduces reliance on annotations in the field of weakly supervised learning for facial image processing in new media post-production scenarios.
Maya Sai Gopala Krishna | International Journal of Scientific Research in Engineering and Management
Image stylization has become a prominent area in computer vision, enabling creative transformations of real-world photos into artistic formats like sketches and cartoons. This paper introduces StyleWeb, a web-based application that allows users to convert ordinary images into stylized outputs in real time. The system integrates two core approaches: sketch generation using traditional image processing techniques and cartoonization powered by a pre-trained AnimeGANv3 model converted into ONNX format for optimized performance. Developed using the Django framework, StyleWeb offers a secure and responsive backend for efficient image processing and delivery. Through a simple interface, users can upload images, choose a desired style, and receive processed results with minimal delay. The platform bridges modern deep learning methods with practical web deployment, providing an accessible solution for stylized image generation suitable for artists, educators, and casual users alike. StyleWeb showcases how AI-driven creativity can be brought to everyday applications through thoughtful system design and implementation.
This project proposes an innovative, AI-driven face recognition system designed for effective crowd surveillance and missing person identification during the large-scale Simhastha Ujjain religious gathering. With over 50 million participants expected, traditional manual identification methods prove inefficient and time-consuming. To address this, our system leverages real-time video feed analysis using deep learning techniques to distinguish between known (enrolled) and unknown individuals. Detected individuals are annotated with green boxes for recognized faces and red boxes for unrecognized ones, maintaining a high confidence threshold of 0.9 to ensure detection accuracy and minimize false positives. The system is architected with dual user modules: one for the general public and another for police and administrative officials. The public-facing portal enables users to report missing persons, enroll facial data for family members, and monitor case statuses. Enrolled faces are stored securely in an encrypted database. The administrative interface provides advanced tools for surveillance officers to upload and analyze video feeds. The core recognition pipeline is powered by MobileNet for efficient and lightweight feature extraction, Principal Component Analysis (PCA) for reducing the dimensionality of the extracted feature vectors, and K-Nearest Neighbors (KNN) for classifying faces based on proximity in the feature space.
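A compact sketch of the recognition pipeline described above (feature vectors, then PCA, then KNN) using scikit-learn; the random features stand in for MobileNet embeddings, and the component and neighbour counts are illustrative.

```python
# Sketch: dimensionality reduction followed by nearest-neighbour classification
# over enrolled identities.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 1280))      # stand-ins for MobileNet feature vectors
y_train = rng.integers(0, 20, size=200)     # 20 enrolled identities
X_query = rng.normal(size=(5, 1280))        # features from faces detected in the video feed

clf = make_pipeline(PCA(n_components=64), KNeighborsClassifier(n_neighbors=3))
clf.fit(X_train, y_train)
print(clf.predict(X_query))                 # predicted enrolled identity per query face
```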
The rapid spread of infectious diseases, such as COVID-19, has highlighted the critical need for reliable and efficient face mask detection systems. This study proposes a novel parallel hybrid convolutional neural network (CNN) architecture that integrates VGG16 and MobileNetV2 to enhance feature extraction and classification accuracy. By leveraging advanced parallel architectures, the proposed model optimizes feature learning and parameter efficiency, outperforming conventional deep learning (DL) frameworks. To evaluate its effectiveness, we compare our hybrid model with established transfer learning (TL) architectures, including VGG16, MobileNetV2, and ResNet50. A comprehensive class-wise performance analysis was conducted to assess the model's robustness in detecting both masked and unmasked faces. Experimental results demonstrate that our hybrid architecture achieves superior performance, with an accuracy of 99.49%, precision of 98.95%, recall of 100%, and an F1-score of 99.48%. These results indicate a significant improvement over traditional TL models. The high accuracy and reliability of our model suggest its practical viability for real-time face mask detection in public health monitoring and disease prevention efforts. The proposed approach offers a powerful and effective solution for enforcing mask compliance, thereby contributing to global efforts in mitigating infectious disease transmission.
Kalimur Rahman, Md. Shahriar Hussain, Shaik Ayaan et al. | International Journal of Information Technology and Computer Engineering
The "Automated Attendance System Using OpenCV, Face and Iris Detection" is an advanced solution designed to replace outdated and insecure traditional attendance systems. By leveraging modern technologies such as OpenCV, … The "Automated Attendance System Using OpenCV, Face and Iris Detection" is an advanced solution designed to replace outdated and insecure traditional attendance systems. By leveraging modern technologies such as OpenCV, artificial intelligence (AI), and Internet of Things (IoT) integration, the system ensures accurate, secure, and user-friendly attendance tracking. It utilizes real-time face detection and recognition to identify individuals, while incorporating eye-blink detection as a liveness check to prevent spoofing attempts using photos or videos. OpenCV serves as the core computer vision engine for detecting facial features, while AI-driven models improve recognition accuracy and adaptability. Eye aspect ratio (EAR) calculations help confirm the user’s physical presence by detecting natural blinking patterns. In secure environments, iris detection adds an extra biometric layer for identity verification. The system can be implemented on IoT-enabled devices such as Raspberry Pi for portability and real-time processing. This solution is ideal for educational institutions, corporate offices, and smart environments. It automates attendance logging, generates reports, and synchronizes data with centralized systems. The integration of computer vision and AI not only improves efficiency but also ensures security and scalability, making this system a powerful step forward in smart attendance technology.
Xin Wang, Ting Tsai, Li Lin et al. | ACM Transactions on Multimedia Computing, Communications, and Applications
Generative Adversarial Networks (GANs) have enabled the creation of highly authentic facial images, which are increasingly used in deceptive social media profiles and other forms of disinformation, resulting in serious consequences. Significant progress has been made in developing GAN-generated face detection systems to identify these fake images. This study offers a comprehensive review of recent advancements in GAN-generated face detection, focusing on techniques that detect facial images generated by GAN models. We categorize detection methods into three groups: (1) deep learning-based approaches, (2) physics-based methods, and (3) physiology-based methods. We summarize key concepts in each category, connecting them to relevant implementations, datasets, and evaluation metrics. Additionally, we provide a comparative analysis between automated detection and human visual performance to highlight the strengths and weaknesses of both approaches. Furthermore, we review related surveys, including detecting morphed faces, manipulated faces, DeepFake, and faces generated by diffusion models. Finally, we discuss unresolved challenges and suggest potential directions for future research.
International Journal of Advanced Trends in Computer Science and Engineering
Fingerprints are unique and stay the same throughout a person's life, which makes them reliable for identifying individuals. Interestingly, this uniqueness might also help in determining a person's blood group. In our project, we explore a non-invasive method of detecting blood types using fingerprint patterns. Instead of taking blood samples, we analyse fingerprint features like ridge frequency and spatial patterns using tools like Gabor filters and image processing techniques. We collect fingerprints from individuals along with their known blood groups, then use deep learning, especially Convolutional Neural Networks (CNNs), to find patterns that may link fingerprint traits to blood types. This approach could make blood group detection faster, safer, and easier, which would be helpful in medical emergencies, forensic investigations, and even everyday healthcare.
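As an illustration of the Gabor-filter step referred to above, the sketch below applies a small bank of oriented Gabor kernels to a grayscale fingerprint image and summarizes each response with simple statistics. The kernel size, wavelength, and number of orientations are assumptions; in practice these responses (or the filtered images themselves) would feed the CNN or another classifier.

```python
# Sketch of ridge-oriented Gabor filtering for fingerprint feature extraction.
import cv2
import numpy as np

def gabor_features(gray, ksize=31, sigma=4.0, lambd=10.0, gamma=0.5, n_orientations=8):
    """gray: single-channel fingerprint image; returns a 2*n_orientations feature vector."""
    feats = []
    for i in range(n_orientations):
        theta = i * np.pi / n_orientations        # orientation of the ridge filter
        kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma, 0,
                                    ktype=cv2.CV_32F)
        response = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kernel)
        feats.extend([response.mean(), response.std()])  # per-orientation statistics
    return np.array(feats)
```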
Identifying gender from facial images is important in many real-world contexts. Biometric technology, such as facial recognition, has become an integral part of various applications, including fraud detection, cybersecurity protection, and consumer behavior analysis. With advances in artificial intelligence, especially Convolutional Neural Networks (CNNs), computers can now identify gender from facial images with a high level of accuracy. Although challenges remain, such as variations in pose, facial expression, and lighting conditions, CNNs can overcome these obstacles. This study uses the CelebA dataset, which consists of 122,000 facial images of both men and women. The dataset has been processed to maintain a balanced number of samples for each gender class, resulting in a total of 101,568 samples. The data is divided into training, validation, and test sets, with 80% used for training and the remaining 20% split between validation and testing. Eight different CNN architectures are applied: VGG16, VGG19, MobileNetV2, ResNet-50, ResNet-50 V2, Inception V3, Inception ResNet V2, and AlexNet. Although previous research has shown the potential of CNN architectures for various classification tasks, these studies often suffer from overfitting on large datasets, which can reduce model accuracy. This study applies dropout and hyperparameter tuning to address overfitting and optimize model performance. The training results indicate that ResNet-50, ResNet-50 V2, and Inception V3 achieved the highest accuracy of 98%, while VGG16, VGG19, MobileNetV2, and AlexNet achieved accuracies between 95% and 97%. Performance evaluation using confusion matrices, precision, recall, and F1-score demonstrates excellent performance.
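The split described above (80% training, with the remaining 20% divided between validation and test) can be reproduced with a stratified two-stage split, as in the sketch below. The function assumes lists of image paths and binary gender labels and is independent of which of the eight CNN architectures is trained afterwards.

```python
# Sketch of a stratified 80/10/10 train/validation/test split.
from sklearn.model_selection import train_test_split

def split_dataset(image_paths, labels, seed=42):
    # 80% train vs 20% holdout, then split the holdout 50/50 into validation and test,
    # stratifying by gender so both classes stay balanced in every subset.
    x_train, x_hold, y_train, y_hold = train_test_split(
        image_paths, labels, test_size=0.20, stratify=labels, random_state=seed)
    x_val, x_test, y_val, y_test = train_test_split(
        x_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```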
This paper examines methods utilizing Convolutional Neural Networks (CNNs) and facial landmark detection with the dlib library to determine psychological personality types described by the MBTI and MMPI scales, based on analysing 72,000 photographs (profile and frontal views). The authors create an image dataset, categorized by combinations of MBTI personality types, to evaluate algorithm performance. The aim of this work is to develop a classification model capable of predicting personality typology from images, employing the MBTI and MMPI scales. The methods include an original dataset structured into directories corresponding to dichotomy combinations, specialized neural networks (VGG, FaceNet), and face feature extraction algorithms based on FaceToFate. To assess the performance of the developed algorithms, the paper uses Accuracy, Recall, Precision, and F1-score metrics. In the neural network approach, binary classification models are implemented: frontal-view images serve as input for distinguishing between the classes E/I, N/S, and T/F, and binary models are trained separately for the first three dichotomy letters using frontal orientation and for the fourth letter using profile orientation. This combination results in an average accuracy of approximately 65%. Additionally, another model is trained to predict the first three letters using embedding vectors generated by a pre-trained FaceNet architecture as inputs, achieving an average accuracy of roughly 33%. Convolutional neural networks are also applied to train models on MMPI categories, yielding an average accuracy of about 20%. The second approach uses key facial landmarks extracted by the pre-trained dlib framework. Coordinates of facial points are used to compute measurements of facial features such as nose length, eye shape, lip width, and face contours. These extracted features represent the first three dichotomy letters based on frontal images, and this method achieves an average accuracy of around 30%. The findings show that, when determining MBTI dichotomies from images while accounting for face orientation, the best performance is obtained with the binary classification models.
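The landmark-based measurements mentioned above (nose length, eye shape, lip width, face contours) can be derived directly from dlib's 68-point layout. The sketch below shows one plausible set of such measurements normalized by face width; the index choices follow the standard 68-point convention, and the exact feature set is an assumption rather than the authors' definition.

```python
# Sketch of geometric feature extraction from dlib 68-point facial landmarks.
import numpy as np

def geometric_features(pts):
    """pts: (68, 2) array of dlib facial landmarks for one frontal face."""
    face_width = np.linalg.norm(pts[16] - pts[0])       # jaw corner to jaw corner
    nose_length = np.linalg.norm(pts[33] - pts[27])     # nose tip to nose bridge
    lip_width = np.linalg.norm(pts[54] - pts[48])       # mouth corner to mouth corner
    left_eye_width = np.linalg.norm(pts[39] - pts[36])
    right_eye_width = np.linalg.norm(pts[45] - pts[42])
    # Normalizing by face width makes the measurements comparable across image scales.
    return np.array([nose_length, lip_width, left_eye_width, right_eye_width]) / face_width
```

A feature vector like this would then be fed to an ordinary classifier (one binary model per dichotomy), mirroring the second approach described in the abstract.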