HyperStyle: complete face inversion using a hypernetwork

2022-05-15 07:32:05 Ericam_


HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing

  • CVPR 2022
  • Related to StyleGAN inversion



Inverting real images into StyleGAN's latent space is a well-studied problem. However, existing methods still perform only moderately on real-world scenes, because there is an inherent trade-off between reconstruction and editability: the latent-space regions that can accurately represent a real image typically suffer from degraded semantic control.

Recently, some works have mitigated this trade-off by fine-tuning the generator so that the target image is embedded in a well-behaved, editable region of the latent space. However, this fine-tuning scheme requires a lengthy optimization for every new image.

In this work, we bring this approach into the realm of encoder-based inversion and propose HyperStyle, a hypernetwork that learns to adjust StyleGAN's weights. A naive modulation approach would require training a hypernetwork with over 3 billion parameters; through careful network design, the parameter count is reduced to be in line with existing encoders. HyperStyle's reconstructions are comparable to those of latent optimization techniques, while retaining the near-real-time inference of an encoder. Finally, we demonstrate HyperStyle's effectiveness on several applications beyond the inversion task, including editing images outside the training domain.

To take advantage of pretrained models, most works avoid changing the generator's weights when performing inversion.

Some works do explore this direction:

  • Tuning the generator for each image to obtain a more accurate inversion
  • Inverting BigGAN by sampling random noise vectors, selecting the one that best matches the real image, and then progressively optimizing the generator weights
  • First obtaining a latent code through inversion (which approximately reconstructs the target image), then fine-tuning the generator weights to recover image-specific details

However, these methods require lengthy optimization, typically several minutes per image. By contrast, HyperStyle trains a single hypernetwork on a large collection of images, which can then invert any given image in near real time.

Summary: more accurate inversion + good editability + near real time (encoder-like)

Network structure

[Figure: HyperStyle network structure]

(1) First, the image is passed through an encoder to produce an initial inversion.

(2) The original input and the initial inversion are concatenated and fed into the HyperStyle network. The input therefore has 6 channels, and the ResNet backbone outputs a 16*16*512 feature map.

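import torch

# x: 3*h*w (input image), y_hat: 3*h*w (initial inversion); the batch dimension comes first
x_input = torch.cat([x, y_hat], dim=1)  # concatenate along the channel dimension -> 6*h*w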

The ResNet backbone is structured as follows:

# One convolution layer + 4 ResNet-34 stages
self.conv1 = nn.Conv2d(opts.input_nc, 64, kernel_size=7, stride=2, padding=3, bias=False)
self.bn1 = BatchNorm2d(64)
self.relu = PReLU(64)

resnet_basenet = resnet34(pretrained=True)
blocks = [
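
The 16*16*512 output size quoted in step (2) can be checked by tracing the spatial resolution through the layers above. The sketch below is only arithmetic, under two assumptions that are not stated in this post: the input is 256*256, and there is no max-pooling between conv1 and the first ResNet stage. The helper names are made up for illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_backbone(input_size=256):
    """Return the spatial size after conv1 and after each of the 4 ResNet-34 stages."""
    # conv1 from the snippet above: kernel_size=7, stride=2, padding=3
    sizes = [conv_out(input_size, kernel=7, stride=2, padding=3)]
    # ResNet-34 stages: the first keeps the resolution, the next three halve it.
    for stride in (1, 2, 2, 2):
        # The first 3x3 conv of each stage (padding 1) carries the stride.
        sizes.append(conv_out(sizes[-1], kernel=3, stride=stride, padding=1))
    return sizes


# Output channels of conv1 and of the four ResNet-34 stages.
channels = [64, 64, 128, 256, 512]

sizes = trace_backbone(256)
print(sizes)                      # [128, 128, 64, 32, 16]
print(sizes[-1], channels[-1])    # 16 512
```

Under these assumptions the last stage ends at a 16*16 map with 512 channels, which matches the figure given above.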

(3) The Refinement Block refines the backbone features into the predicted parameters, but giving every layer its own block involves too many parameters during training. To solve this problem, a Shared Refinement Block is introduced:

[Figure: Shared Refinement Block]

Two fully connected layers are shared across the non-toRGB layers whose kernels have size 3*3*512*512. This substantially reduces the number of trainable parameters.

P.S. 3*3*512*512 means: kernel size * kernel size * input depth * output depth

The specific layers can be found in: blocks

Why are the weights not shared for the toRGB layers?

Because experiments showed that doing so harms the GAN's editing ability.
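
To make the 3*3*512*512 notation and the benefit of sharing more concrete, here is a small arithmetic sketch. It rests on assumptions: that offsets are predicted once per channel pair rather than per individual weight (my reading of the paper, not stated in this post), and `HEAD_PARAMS` and `NUM_LAYERS` are purely hypothetical numbers chosen for illustration, not values from HyperStyle.

```python
def conv_weight_count(kernel_size, in_depth, out_depth):
    """Number of weights in a conv kernel: k * k * input depth * output depth."""
    return kernel_size * kernel_size * in_depth * out_depth


def head_cost(head_params, num_layers, shared):
    """Total parameters of the prediction heads, shared across layers or one per layer."""
    return head_params if shared else head_params * num_layers


full_kernel = conv_weight_count(3, 512, 512)   # every weight of a 3*3*512*512 kernel
per_channel = conv_weight_count(1, 512, 512)   # assumption: one offset per channel pair

# Hypothetical numbers, for illustration only.
HEAD_PARAMS = 1_000_000
NUM_LAYERS = 10

print(full_kernel)                 # 2359296
print(full_kernel // per_channel)  # 9
print(head_cost(HEAD_PARAMS, NUM_LAYERS, shared=False))  # 10000000
print(head_cost(HEAD_PARAMS, NUM_LAYERS, shared=True))   # 1000000
```

The point is only the scaling: a head that is shared costs the same no matter how many generator layers it serves, while per-layer heads grow linearly with the number of layers.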

(4) The generator is divided into 3 levels: coarse, medium, and fine. Each level controls a different level of detail in the generated image. Because the initial inversion tends to capture the coarse details already, the hypernetwork's output is applied only to the medium and fine generator layers.

(5) Iterative refinement: the generator weights are updated over several passes to obtain a more accurate inversion.
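
Steps (4) and (5) can be illustrated with a toy loop. This is a minimal sketch, not the HyperStyle implementation: the "generator" has one scalar weight per level, `toy_hypernetwork` is a hand-written stand-in for the learned network, and the multiplicative update theta * (1 + delta) with offsets accumulated across passes is an assumption based on my reading of the paper.

```python
# Toy generator: three "layers", each with a single scalar weight.
BASE_WEIGHTS = {"coarse": 1.0, "medium": 2.0, "fine": 3.0}
REFINED_LEVELS = ("medium", "fine")  # coarse layers are left untouched


def generate(offsets):
    """Toy generator: each layer outputs its modulated weight theta * (1 + delta)."""
    return {name: w * (1.0 + offsets.get(name, 0.0)) for name, w in BASE_WEIGHTS.items()}


def toy_hypernetwork(target, current):
    """Stand-in for the learned hypernetwork: compares the target with the
    current reconstruction and proposes offsets that close half of the gap."""
    return {
        name: 0.5 * (target[name] - current[name]) / BASE_WEIGHTS[name]
        for name in REFINED_LEVELS
    }


def invert(target, num_steps=5):
    """Iteratively refine the offsets; returns (offsets, error after each pass)."""
    offsets = {name: 0.0 for name in REFINED_LEVELS}
    current = generate(offsets)  # plays the role of the initial inversion
    errors = []
    for _ in range(num_steps):
        step = toy_hypernetwork(target, current)
        for name in REFINED_LEVELS:
            offsets[name] += step[name]  # accumulate offsets across passes
        current = generate(offsets)      # re-generate with the updated weights
        errors.append(sum((target[n] - current[n]) ** 2 for n in target))
    return offsets, errors


target = {"coarse": 1.0, "medium": 2.4, "fine": 2.7}
offsets, errors = invert(target)
print(errors)  # the squared error shrinks with every pass
```

In this toy setup each pass feeds the current reconstruction back in, only the medium and fine weights move, and the reconstruction error decreases monotonically.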

Training Losses

Because training is similar to training an encoder, a pixel-wise L2 loss and the LPIPS perceptual loss are used.

For the facial domain, an identity-based similarity loss is additionally applied (computed with a pretrained face recognition network).

For non-facial domains, a MoCo-based similarity loss is used instead.

The final objective is:

$$\mathcal{L}_{2}(x, \hat{y})+\lambda_{\text{LPIPS}} \mathcal{L}_{\text{LPIPS}}(x, \hat{y})+\lambda_{\text{sim}} \mathcal{L}_{\text{sim}}(x, \hat{y})$$
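
As a rough sketch of how such a weighted objective is assembled, the function below combines the three terms. The perceptual and similarity terms are passed in as callables because they depend on pretrained networks that are out of scope here; `lpips_fn`, `sim_fn`, and the weights used in the example are placeholders for illustration, not the paper's settings.

```python
def l2_loss(x, y_hat):
    """Pixel-wise L2 term: mean squared error over flattened pixel values."""
    return sum((a - b) ** 2 for a, b in zip(x, y_hat)) / len(x)


def total_loss(x, y_hat, lpips_fn, sim_fn, lambda_lpips, lambda_sim):
    """L2 + lambda_LPIPS * L_LPIPS + lambda_sim * L_sim."""
    return (
        l2_loss(x, y_hat)
        + lambda_lpips * lpips_fn(x, y_hat)
        + lambda_sim * sim_fn(x, y_hat)
    )


# Usage with constant stand-ins for the pretrained-network losses.
x = [0.0, 0.0]
y_hat = [1.0, 1.0]
loss = total_loss(
    x, y_hat,
    lpips_fn=lambda a, b: 0.5,   # placeholder for an LPIPS distance
    sim_fn=lambda a, b: 0.25,    # placeholder for an identity / MoCo similarity loss
    lambda_lpips=2.0,
    lambda_sim=4.0,
)
print(loss)  # 3.0
```

In a real training setup the two callables would wrap the LPIPS network and the face recognition or MoCo feature extractor, depending on the domain.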


We introduce a novel StyleGAN inversion method, HyperStyle, which builds on recent advances in hypernetworks to achieve optimization-level reconstructions at encoder-like inference times. In a sense, HyperStyle learns to efficiently optimize the generator for a given target image. Doing so mitigates the reconstruction-editability trade-off and enables the effective use of existing editing techniques on a wide range of inputs. In addition, HyperStyle generalizes remarkably well, even to out-of-domain images that neither the hypernetwork nor the generator saw during training. Looking forward, further broadening generalization away from the training domain is highly desirable. This includes robustness to unaligned images and unstructured domains. The former may be addressed by StyleGAN3, while the latter will likely require training on a richer set of images.

Generated results

Comparison of inversion quality:

[Figure]
[Figure]

Comparison of editability:

[Figure]

[Figure]

