Face Hallucination


Given the critical role of image interpretation in modern intelligent systems, Face Hallucination (FH) has attracted growing attention in recent years. FH fuels a wide spectrum of applications in authentication, surveillance monitoring, law enforcement, security control, biometrics, digital entertainment, services rendered only to a legitimate user, age synthesis and estimation, forensic art, electronic customer relationship management, and cosmetology.

In most surveillance imaging frameworks, the camera is far from the objects of interest in the scene, so the captured objects are usually of low resolution. For automatic face recognition and identification, it is therefore essential to improve the resolution of the captured faces.

FH refers to both face super-resolution (FSR) and face sketch-photo synthesis (FSPS), because they share a similar intrinsic mathematical model: both infer an image lying in one image space from its counterpart lying in another space.

Although FSR and FSPS share a similar framework, this does not mean that methods which work well for FSR also work well for FSPS, and vice versa; applying FSR techniques directly to FSPS may not achieve good performance, and vice versa. The likely reason is that, although down-sampling and blurring are the main factors distinguishing low-resolution from high-resolution images, the two still have similar texture and intensity expressions, whereas sketches and photos have quite different texture expressions.


In cases where low-resolution face images are acquired by live surveillance cameras at a distance, FH techniques can be used to enhance the low-resolution images and to transform sketches to photos (and photos to sketches) for subsequent use.

In [16], Liu et al. argued that a successful face-hallucination algorithm should meet the following three constraints:

1. Sanity constraint: the target HR image should be very close to the input LR image when smoothed and down-sampled (a minimal check of this constraint is sketched after this list).

2. Global constraint: the target HR image should have the common characteristics of human faces, e.g., possessing a mouth and a nose, being symmetrical, etc.

3. Local constraint: the target HR image should have the specific characteristics of the original LR face image, with photorealistic local features.
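
As a quick illustration of the sanity constraint (1), the following Python sketch re-degrades a candidate HR face (Gaussian smoothing followed by decimation) and measures how far the result drifts from the LR input. The degradation parameters (sigma, scale) and the function names are illustrative assumptions, not part of any particular paper.

```python
# Minimal sketch of checking the sanity constraint: the hallucinated HR face,
# once smoothed and down-sampled, should stay close to the observed LR input.
# `scale` and `sigma` are assumed degradation parameters for illustration only.
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_and_downsample(hr_img, scale=4, sigma=1.2):
    """Gaussian smoothing followed by decimation by `scale` (the assumed degradation)."""
    return gaussian_filter(hr_img, sigma=sigma)[::scale, ::scale]

def sanity_error(hr_estimate, lr_input, scale=4, sigma=1.2):
    """Mean squared difference between the re-degraded HR estimate and the LR input."""
    lr_simulated = smooth_and_downsample(hr_estimate, scale, sigma)
    return float(np.mean((lr_simulated - lr_input) ** 2))

if __name__ == "__main__":
    hr = np.random.rand(128, 128)                  # stand-in for a hallucinated HR face
    lr = smooth_and_downsample(hr)                 # a perfectly consistent LR observation
    print("sanity error:", sanity_error(hr, lr))   # ~0 when the constraint holds
```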

Reconstruction-based methods estimate a high-resolution image from a sequence of blurred and down-sampled low-resolution images; they have inherent limitations on how far the magnification factor can be increased.

Diagram of face hallucination

Learning-based methods explore mapping relations between high- and low-resolution image pairs to infer high-resolution images from their low-resolution counterparts.

Diagram of different face hallucination approaches

Existing FH methods can be grouped into four categories: Bayesian inference approaches, sub-space learning approaches, a combination of Bayesian inference and subspace learning approaches, and sparse representation-based approaches.

Diagram of subspace learning framework

Subspace learning refers to the technique of finding a subspace R^m embedded in a high-dimensional space R^n (n > m).
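
To make the idea concrete, here is a minimal PCA-based sketch of subspace learning: vectorized faces living in R^n are projected onto an m-dimensional subspace spanned by the top principal directions. The random data, the dimension m, and the function names are assumptions used only for illustration.

```python
# Minimal subspace-learning sketch via PCA: find an m-dimensional subspace of R^n
# (m < n) that captures most of the variance of vectorized face images.
import numpy as np

def pca_subspace(X, m):
    """Return the mean face and the top-m principal directions of X (rows = samples in R^n)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)   # rows of Vt are orthonormal directions
    return mean, Vt[:m]

def project(x, mean, basis):
    """Coordinates of a face x in the learned m-dimensional subspace."""
    return basis @ (x - mean)

if __name__ == "__main__":
    X = np.random.rand(200, 32 * 32)          # 200 vectorized 32x32 faces (random stand-ins)
    mean, basis = pca_subspace(X, m=20)
    print(project(X[0], mean, basis).shape)   # (20,): the low-dimensional representation
```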

In one approach (Liu et al. 2001, 2007a), principal component analysis is first applied to obtain an initial global face image, and a MAP-MRF model is subsequently exploited to compute the local face image.

In Liu et al. (2005a), an LLE-based method is first used to generate an initial estimate. Then, by exploiting the proposed tensor model, whose modes consist of person identity, patch position, patch style (sketch or photo) and patch features, the high-frequency residual is inferred under the Bayesian MAP framework, on the assumption that a sketch-photo patch pair shares the same tensor representation parameter. By adding these two parts, a photo with much more detailed information can be synthesized from the input sketch.

Liu et al. (2005a) improved the accuracy to 88% by adopting kernel-based nonlinear discriminant analysis (Mika et al. 1999) as the dimension-reduction algorithm.

Liu et al. (2007b) applied a two-step procedure to photo synthesis from an input sketch.

An image is decomposed into two parts: the low- and middle-frequency information, and the high-frequency information, so the MAP objective function is solved by a two-step sequential solution. In the first step, the low- and middle-frequency information is estimated by solving a least-squares problem. In the second step, the high-frequency information is compensated by a non-parametric patch-learning process. Combining these two parts, the target high-resolution image, including its facial expression, is computed.
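
A minimal sketch of this two-step idea is given below: the low- and middle-frequency content is obtained as a regularized least-squares fit in a learned linear subspace, and the high-frequency residual is then compensated by a nearest-neighbour patch lookup. The degradation operator, subspace, and all sizes are random stand-ins; the actual objective and solver in the cited work differ in detail.

```python
# Sketch of a two-step solution: (1) least-squares estimate of low/middle-frequency
# content in a linear (e.g. PCA) subspace, (2) non-parametric compensation of the
# high-frequency residual by nearest-neighbour lookup in a training set.
# All operators and sizes are illustrative assumptions.
import numpy as np

def global_step(lr, A, basis, mean, lam=0.1):
    """Solve min_c ||A(mean + basis.T @ c) - lr||^2 + lam*||c||^2 for the coarse HR estimate."""
    M = A @ basis.T
    c = np.linalg.solve(M.T @ M + lam * np.eye(M.shape[1]), M.T @ (lr - A @ mean))
    return mean + basis.T @ c

def residual_step(lr_query, lr_patches, hf_patches):
    """Return the high-frequency residual whose associated LR patch best matches the query."""
    idx = np.argmin(((lr_patches - lr_query) ** 2).sum(axis=1))
    return hf_patches[idx]

if __name__ == "__main__":
    n, m, k = 64, 8, 50
    A = np.random.rand(16, n)                        # stand-in degradation operator (HR -> LR)
    basis, mean = np.random.rand(m, n), np.random.rand(n)
    lr = np.random.rand(16)
    coarse = global_step(lr, A, basis, mean)         # low/middle-frequency estimate
    detail = residual_step(lr, np.random.rand(k, 16), np.random.rand(k, n))
    print((coarse + detail).shape)                   # (64,): combined HR estimate
```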

Recently, sparse representation has achieved great progress in computer vision (Wright et al. 2010) and data analysis (Zhou and Tao 2013). In particular, methods have been proposed for image reconstruction, obtaining state-of-the-art results (Mairal et al. 2008a,b). Yang et al. (2008b) applied the sparse representation model with a coupled learning process to face image super-resolution and achieved good results. Yang et al.'s method (2008b) is not the end of the application of sparse representation to FH, since it exploits less prior knowledge than face images actually provide; effectively exploring the sparsity of face images therefore remains an interesting open problem.

Sparse representation-based approaches promise better performance and efficient signal modeling. Moreover, they avoid the need for image registration or alignment, and it is no longer necessary to estimate the blurring operator used in down-sampling the original high-resolution image.

The evaluation for FH can be subjective quality assessment or objective quality assessment. To some extent, face recognition rate can also be seen as an objective image quality assessment metric because it measures the similarity of the query image to images in the gallery.

Gunturk et al. (2003) performed eigenface (Turk and Pentland 1991) recognition experiments on some real video sequences containing 68 people, collected from the CMU PIE database (Sim et al. 2002). They achieved an accuracy of 44% using the low-resolution images, compared to 74% using their hallucinated high-resolution face images.

Wang and Tang (2005) conducted direct correlation-based face recognition on 490 face images of 295 subjects in the XM2VTS database (each subject has two images from two different sessions). They found that the recognition accuracy fluctuates only slightly when the down-sampling factor is not too large (not larger than five in the paper). When the resolution is reduced further (i.e., a larger down-sampling factor), the hallucinated high-resolution face images improve face recognition performance compared to directly using the low-resolution images. They also pointed out that the improvement in face recognition accuracy is not as significant as that in visual quality.

Sparse representation of a signal is based on the assumption that most or all signals can be represented as a linear combination of only a small number of elementary signals, called atoms, from an overcomplete dictionary.

I_l = DBI_h, where I_h is the original HR image, B denotes the blurring operator, and D denotes the down-sampling operator that produces the LR observation I_l.

Therefore, the purpose of SR is to recover as much of the information lost in the down-sampling process as possible. Since the reconstruction process remains ill-posed, different priors can be used to guide and constrain the reconstruction results.
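
The following tiny experiment illustrates why the problem is ill-posed: two different HR images can map to exactly the same LR observation under decimation, so a prior is required to choose among the consistent candidates. Pure decimation without a blur kernel is an assumption made only to keep the example short.

```python
# Ill-posedness illustration: distinct HR images with an identical LR observation.
import numpy as np

def downsample(img, scale=4):
    """Decimation-only degradation (blur omitted for brevity)."""
    return img[::scale, ::scale]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hr_a = rng.random((64, 64))
    hr_b = hr_a.copy()
    hr_b[1::4, :] += 0.5        # modify only rows the decimation never samples
    print("identical LR observations:", np.allclose(downsample(hr_a), downsample(hr_b)))  # True
```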

In recent years, the sparse representation model (SRM) has been used as the prior model, and has shown promising results in image super-resolution.

Compared with other conventional methods, sparse representation can usually offer a better performance, with its capacity for efficient signal modeling [21].

The sparse representation of signals has already been applied in many fields, such as object recognition [22,23], text categorization [24], signal classification [21], etc.

Finding the sparsest solution of (3) has been shown to be NP-hard, and it is even difficult to approximate [25]. However, some recent results [26,27] indicate that if the vector ω in (3) is sparse enough, then the problem can be solved efficiently by minimizing the l1-norm instead.

[25] E. Amaldi, V. Kann, On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems, Theoretical Computer Science 209 (1998) 237–260.

[26] D.L. Donoho, For most large underdetermined systems of linear equations, the minimal l1-norm solution is also the sparsest solution, Communications on Pure and Applied Mathematics 59 (2006) 797–829.

[27] E. Candes, J. Romberg, T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Communications on Pure and Applied Mathematics 59 (2006) 1207–1223.

In addition, the optimization problem of the l1-norm can be solved in polynomial time [28,29].

[28] D.L. Donoho, Y. Tsaig, I. Drori, J.-L. Starck, Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit, preprint, December 2007.

[29] S.J. Wright, R.D. Nowak, M.A.T. Figueiredo, Sparse reconstruction by separable approximation, in: Proceedings of ICASSP 2008, May 2008, pp. 3373–3376.

In practice, enforcing Eq. (4) exactly may lead to a sparse representation of the observed signal, in terms of the training data in A, that is not accurate.

Lagrange multipliers offer an equivalent formulation, as shown in the following equation:

ω̂ = arg min_ω (1/2)‖Aω − ψ‖₂² + λ‖ω‖₁

where λ ∈ R+ is a regularization parameter which balances the sparsity of the solution and the fidelity of the approximation to ψ.

This is actually a typical convex-optimization problem, and it can be efficiently solved using the method of Large-Scale l1-Regularized Least Squares (L1LS) [30].

[30] X. Mei, H. Ling, D.W. Jacobs, Sparse representation of cast shadows via L1-regularized least squares, in: Proceedings of the International Conference on Computer Vision (ICCV), 2009, pp. 583–590.
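
To show the shape of this optimization concretely, here is a dependency-free ISTA (iterative soft-thresholding) sketch for the same l1-regularized least-squares objective. The text above points to the L1LS solver [30]; ISTA is used here only because it fits in a few lines, and the dictionary, signal, and parameters are random stand-ins.

```python
# ISTA sketch for argmin_w 0.5*||A w - psi||_2^2 + lam*||w||_1.
import numpy as np

def ista(A, psi, lam=0.1, n_iter=500):
    """Iterative soft-thresholding for the l1-regularized least-squares problem."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the smooth term's gradient
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = w - A.T @ (A @ w - psi) / L        # gradient step on the quadratic term
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding (prox of l1)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((64, 256))         # overcomplete dictionary (atoms as columns)
    w_true = np.zeros(256)
    w_true[[3, 70, 141]] = [1.0, -2.0, 0.5]    # a sparse ground-truth coefficient vector
    w_hat = ista(A, A @ w_true, lam=0.05)
    print("large coefficients at:", np.flatnonzero(np.abs(w_hat) > 0.1))
```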


The proposed method, denoted as ScSR, is based on the idea of sparse signal representation whereby the linear relationships among HR training signals can be accurately recovered from their low-dimensional projections.

The structures of LR images are used to form a sparse prior model, which is then employed to reconstruct the HR images or HR patches.

The differences between ScSR and our method are that ScSR represents image patches as a sparse linear combination of elements from an appropriately chosen overcomplete dictionary, while in our method, a pixel is represented as a sparse linear combination of elements from its neighboring pixels.

In ScSR, the method assumes that image patches can be well represented as a sparse linear combination of elements from a specific dictionary, and a pair of HR–LR dictionaries is constructed to force LR–HR patches to have the same sparse coefficients.
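
A minimal sketch of this coupled-dictionary idea follows: the LR patch is sparse-coded over an LR dictionary D_l, and the same coefficient vector is applied to the HR dictionary D_h to synthesize the HR patch. The dictionaries here are random stand-ins (a real system learns them jointly from LR–HR training pairs), and the small ISTA solver from the earlier sketch is reused.

```python
# Coupled-dictionary sketch: shared sparse coefficients across D_l and D_h.
import numpy as np

def sparse_code(D, x, lam=0.1, n_iter=300):
    """ISTA for argmin_a 0.5*||D a - x||^2 + lam*||a||_1 (same routine as the earlier sketch)."""
    L = np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - x) / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return a

def hallucinate_patch(lr_patch, D_l, D_h, lam=0.1):
    """Code the LR patch over D_l, then reuse the coefficients with D_h."""
    return D_h @ sparse_code(D_l, lr_patch, lam)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    D_l = rng.standard_normal((25, 512))      # 5x5 LR patches, 512 atoms (illustrative sizes)
    D_h = rng.standard_normal((100, 512))     # 10x10 HR patches, same atom indexing
    print(hallucinate_patch(rng.standard_normal(25), D_l, D_h).shape)   # (100,)
```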

Image SR approaches can be divided into two categories. One is the conventional approach, also widely known as multi-image SR [2–5] or regularization-based SR, which reconstructs an HR image from a sequence of LR images of the same scene. These algorithms mainly employ regularization models to solve the ill-posed image SR problem, using artificially defined smoothness constraints as priors.

The other approach is single-frame SR [6–11], also called learning-based SR or example-based SR. These methods generate an HR image from a single LR image using information learned from a set of LR–HR training image pairs, and obtain the prior constraints between the HR images and the corresponding LR images through a learning process.


Figures: the implementation procedure of the face-hallucination framework; learning the sparse local-pixel structure of a face patch from example HR faces; general framework of an example-based SR algorithm.

Paper: From local pixel structure to global image super-resolution: a new face hallucination framework

Authors: Y. Hu, K.M. Lam, G. Qiu, T. Shen

Year: 2011

A three-stage face-hallucination framework was proposed in [11], which is called Local-Pixel Structure to Global Image Super-Resolution (LPS-GIS).

1 – In Stage 1, k pairs of example faces with a pixel structure similar to that of the input LR face are selected from a training dataset using k-Nearest Neighbors (KNN); a minimal selection sketch follows this list.

2 – The selected faces are then warped using optical flow, so that the corresponding target HR image can be reconstructed more accurately.

3 – In Stage 2, the LPS-GIS method learns the face structures, which are represented as coefficients using a standard Gaussian function; the learned coefficients are updated according to the warping errors.

4 – In Stage 3, LPS-GIS constrains the revised face structures, namely the revised coefficients, to the input LR face, and then reconstructs the target HR image using an iterative method.
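
Below is a minimal sketch of the Stage 1 selection step: the k HR example faces whose LR counterparts are closest (in plain Euclidean distance) to the input LR face are retrieved. The distance measure, dataset layout, and sizes are illustrative assumptions; the paper additionally warps the selected faces with optical flow, which is not shown here.

```python
# Stage 1 sketch: k-nearest-neighbour selection of example faces for an LR input.
import numpy as np

def select_examples(lr_face, lr_training, hr_training, k=5):
    """Return the k HR training faces whose LR counterparts best match the input LR face."""
    dists = np.linalg.norm(lr_training - lr_face.ravel(), axis=1)   # distance to each LR example
    nearest = np.argsort(dists)[:k]
    return hr_training[nearest]

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    lr_training = rng.random((300, 32 * 32))     # 300 LR example faces (random stand-ins)
    hr_training = rng.random((300, 128 * 128))   # their HR counterparts
    examples = select_examples(rng.random((32, 32)), lr_training, hr_training, k=5)
    print(examples.shape)                        # (5, 16384): faces passed on to the warping stage
```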

A face-hallucination framework is then proposed which utilizes the sparse local-pixel structure as the prior model in the reconstruction of HR faces.

ScSR seeks a sparse representation for each patch of the LR input image from an LR overcomplete dictionary; the coefficients of this representation are then used to generate the HR target image using the HR overcomplete dictionary. One important process in ScSR is therefore the training of two dictionaries, for the LR and HR image patches. In our method, central pixels replace patches, and only the HR dictionary is needed; it is constructed directly from the HR example faces.
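
To illustrate the pixel-level idea, the sketch below learns, for one pixel position, sparse weights that express the centre pixel of the HR example faces as a combination of its surrounding pixels; those weights form the local structure that would then be imposed on the interpolated input. The neighbourhood radius, the ISTA solver, and the regularization are assumptions for illustration and not the paper's exact formulation.

```python
# Sparse local-pixel-structure sketch: learn weights mapping a pixel's neighbours
# to the pixel itself, using the HR example faces as training samples.
import numpy as np

def sparse_weights(N, c, lam=0.05, n_iter=300):
    """ISTA for argmin_w 0.5*||N w - c||^2 + lam*||w||_1 (neighbours -> centre pixel)."""
    L = np.linalg.norm(N, 2) ** 2
    w = np.zeros(N.shape[1])
    for _ in range(n_iter):
        z = w - N.T @ (N @ w - c) / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return w

def local_structure(examples, i, j, r=2):
    """Sparse weights over the (2r+1)^2 - 1 neighbours of pixel (i, j), learned from example faces."""
    patches = examples[:, i - r:i + r + 1, j - r:j + r + 1].reshape(len(examples), -1)
    centre = patches.shape[1] // 2
    return sparse_weights(np.delete(patches, centre, axis=1), patches[:, centre])

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    examples = rng.random((5, 128, 128))          # 5 warped HR example faces (random stand-ins)
    w = local_structure(examples, i=64, j=64)
    print(w.shape)                                # (24,): sparse weights over the 5x5 neighbourhood
```

With only a handful of example faces, the system relating neighbours to the centre pixel is underdetermined, which is exactly why a sparsity prior on the weights is useful.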

The experimental results have demonstrated that our proposed framework is competitive and can achieve superior performance compared to other state-of-the-art face-hallucination methods.

The superior performance of our algorithm is mainly due to the fact that the example faces provide both holistic and pixel-wise information for reconstructing the target HR facial images, and that the sparse local-pixel structures of the target HR faces can be estimated more accurately from the example faces using sparse representation. Our proposed method can maintain the impressive capability of inferring fine facial details and generating plausible HR facial images even when the input face images are of very low resolution.

Paper: Face hallucination based on sparse local-pixel structure

Authors: Yongchao Li, Cheng Cai, Guoping Qiu, Kin-Man Lam

Year: 2013

Key Idea: High-resolution patches have a sparse linear representation with respect to a compact learned overcomplete dictionary of patches randomly sampled from similar images.

Steps:

1 – The input LR image is first interpolated, using conventional methods, to the size of the target HR image.

2 – The interpolated LR image, a blurry image lacking high-frequency information, is then used as the initial estimate of the target HR image.

3 – The input LR image is also divided into overlapping or non-overlapping image patches.

4 – The example-based framework uses these image patches to find the best-matched examples by searching a training dataset of LR–HR image pairs.

5 – The selected HR examples are then employed to learn the HR information that serves as the prior constraints.

6 – Finally, the learned HR information and the interpolated input image are combined to estimate the target HR image (a compact sketch of this pipeline follows the list).
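
The sketch below strings the listed steps together in a deliberately simplified form: pixel-repetition interpolation stands in for bicubic, non-overlapping patches are used, and matching is a plain nearest-neighbour search over random stand-in training patches. It is meant only to show the data flow of an example-based SR pipeline, not any specific published algorithm.

```python
# Compact example-based SR pipeline: interpolate, match patches against a training
# set, and add the corresponding HR detail back onto the initial estimate.
import numpy as np

def upscale(lr, scale=4):
    """Naive interpolation (pixel repetition) used as the initial HR estimate."""
    return np.kron(lr, np.ones((scale, scale)))

def best_match(patch, lr_examples):
    """Index of the LR training patch closest to the query patch."""
    return int(np.argmin(((lr_examples - patch.ravel()) ** 2).sum(axis=1)))

def example_based_sr(lr, lr_examples, hr_examples, scale=4, p=4):
    est = upscale(lr, scale)
    P = p * scale
    for y in range(0, lr.shape[0] - p + 1, p):          # non-overlapping LR patches
        for x in range(0, lr.shape[1] - p + 1, p):
            idx = best_match(lr[y:y + p, x:x + p], lr_examples)
            est[y * scale:y * scale + P, x * scale:x * scale + P] += hr_examples[idx].reshape(P, P)
    return est

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    lr = rng.random((16, 16))
    lr_examples = rng.random((100, 16))                 # 100 LR patches of size 4x4 (stand-ins)
    hr_examples = 0.1 * rng.random((100, 256))          # their 16x16 HR detail layers
    print(example_based_sr(lr, lr_examples, hr_examples).shape)   # (64, 64)
```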

Pros:

1 – The experimental results have demonstrated that our proposed framework is competitive and can achieve superior performance compared to other state-of-the-art face-hallucination methods.

2 – Fine facial details and plausible high-resolution output can be recovered from very low-resolution input.

3 – Sparse representations are more accurate.

4 – Our code is available online: http://as.nwsuaf.edu.cn/fhsr.html .

5 – avoids the need for image registration, which is an ill-conditioned problem

6 – avoids the need to estimate the unknown blurring operators

7 – Takes the global face into consideration

8 – Two dictionaries will not be fully coupled, allowing much flexibility for synthesis.

Cons:

Differences:

* utilizes the sparse local-pixel structure as the prior model.

* needs one dictionary instead of two, compared to Yang et al.'s ScSR [2].

* central pixels replace patches.

* a pixel is represented by a sparse linear combination of elements from its neighboring pixels, not from the dictionary as in Yang et al.'s ScSR.

* the dictionary is constructed using the neighboring pixels of the missing pixels.

* optical flow is applied to make the learning process more accurate.

* Wang et al. 2012 [1] proposed a novel semi-coupled dictionary learning approach (the two dictionaries are not fully coupled, allowing much flexibility for synthesis).

[1] Wang et al., Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis, 2012.

[2] Yang et al., Image super-resolution via sparse representation, 2010.

In [44], Dong et al. also proposed an image interpolation method based on sparse representation, abbreviated as NARM-SRM-NL. In this method, a nonlocal autoregressive model (NARM) is proposed and taken as the data-fidelity term in the sparse representation model (SRM). The patches in the estimated HR image are reconstructed from their nonlocal neighboring patches. The method assumes that nonlocal similar patches in an image have similar coding coefficients over the same overcomplete dictionary; these coefficients are then embedded into the SRM and NARM to reconstruct the HR images.

In NARM-SRM-NL, the sparse model assumes that an image patch can have many similar patches among its nonlocal neighbors, and a local PCA dictionary is used to adaptively span the sparse domain for signal representation.
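
As a rough illustration of the nonlocal ingredient, the sketch below approximates a patch by a least-squares-weighted combination of its most similar nonlocal patches, which is the flavour of relationship NARM encodes. The search set, patch size, number of neighbours, and regularization are all illustrative assumptions; the full method couples this with the SRM data-fidelity term and local PCA dictionaries.

```python
# Nonlocal sketch: approximate a patch from its k most similar nonlocal patches.
import numpy as np

def nonlocal_weights(patch, candidates, k=8, eps=1e-3):
    """Pick the k candidates closest to `patch` and fit ridge-regularized least-squares weights."""
    d = ((candidates - patch) ** 2).sum(axis=1)
    idx = np.argsort(d)[:k]
    B = candidates[idx].T                                    # columns are the similar patches
    w = np.linalg.solve(B.T @ B + eps * np.eye(k), B.T @ patch)
    return idx, w

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    candidates = rng.random((500, 49))        # 7x7 patches gathered from a search window (stand-ins)
    patch = candidates[0] + 0.01 * rng.random(49)
    idx, w = nonlocal_weights(patch, candidates)
    approx = candidates[idx].T @ w            # nonlocal autoregressive-style approximation
    print("approximation error:", float(np.linalg.norm(approx - patch)))
```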
