Knowledge Distillation SOTA Paper Review

Paper Review 2022. 5. 30. 14:59

1. Knowledge review (CVPR 2021)

https://arxiv.org/pdf/2104.09044v1.pdf

Considered ~ an important problem.
-> Existing methods transfer knowledge with a feature transformation plus a loss between features at the same level. But comparing only the fourth stage outputs with each other is a "bottleneck in the whole knowledge distillation framework".
Solved it using the following hint.
-> Uses cross-stage connection paths to additionally pass the teacher's old knowledge (low-level features from earlier stages) to each stage of the student. (uses multiple layers in the teacher to supervise one layer in the student)
~ remains an open question.
-> How to extract useful information from the teacher's multi-level information and how to transfer it to the student.
(The paper proposes two methods, attention based fusion and hierarchical context loss, but this is still an open problem.)
~ is lacking.
->
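
To make the "multiple teacher layers supervise one student layer" idea concrete, here is a toy sketch. It assumes every stage's feature is already a vector of the same length, so no transformation or fusion is needed; real features differ in shape, and the paper's attention based fusion and hierarchical context loss are not reproduced here. The function names and the numbers are made up for illustration.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def same_stage_loss(student_feats, teacher_feats):
    """Baseline: student stage i is compared only with teacher stage i."""
    return sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))


def review_pairs(num_stages):
    """(student stage, teacher stage) pairs: stage i looks back at teacher stages 0..i."""
    return [(i, j) for i in range(num_stages) for j in range(i + 1)]


def review_loss(student_feats, teacher_feats):
    """Toy review-style loss: each student stage is supervised by its own
    teacher stage and by all earlier (lower-level) teacher stages."""
    return sum(
        mse(student_feats[i], teacher_feats[j])
        for i, j in review_pairs(len(student_feats))
    )


# Made-up 2-dimensional features for four stages.
teacher_feats = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, -1.0]]
student_feats = [[0.1, 0.9], [0.8, 1.2], [1.5, 0.5], [2.5, -0.5]]
pairs = review_pairs(len(student_feats))

print(pairs)
print(same_stage_loss(student_feats, teacher_feats))
print(review_loss(student_feats, teacher_feats))
```

With four stages the baseline uses 4 comparisons while the review-style pairing uses 10, which is where the extra supervision comes from.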

2. SSKD (ECCV 2020)

https://arxiv.org/pdf/2006.07114v2.pdf

Considered ~ an important problem.
-> Current methods use only the knowledge obtained from a single task, classification. But "knowledge is highly task-specific, such knowledge may only reflect a single facet of the complete knowledge encapsulated in a cumbersome network". So the paper additionally uses knowledge obtained through self-supervision.
Solved it using the following hint.
-> Runs a self-supervision task on both the teacher and the student to obtain contrastive predictions, and adds a loss that compares those predictions to the existing KD loss.
(SS loss: similarity matrix disparity + output disparity)
~ remains an open question.
->
~ is lacking.
-> Does not seem to include the hierarchical knowledge seen in Knowledge review.

3. GLD (ICCV 2021)

https://openaccess.thecvf.com/content/ICCV2021/papers/Kim_Distilling_Global_and_Local_Logits_With_Densely_Connected_Relations_ICCV_2021_paper.pdf

Considered ~ an important problem.
-> Global logit distillation after GAP (global average pooling) carries no spatial information about the input. This loses the "rich relational information across contexts of an input image".
Solved it using the following hint.
-> Transfers the local logits at the penultimate layer, plus the relationships among them, to the student.
(global logit: classifier output logit
local logit: output of the classifier that takes the input of local features divided from the global feature)
Both the global and the local logits have shape 1x1xc.
~ remains an open question.
-> Applying various kinds of spatial pooling to GLD could probably improve performance further.
~ is lacking.
->
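
A small sketch of the global vs. local logit definitions above. It assumes the feature map is split into four equal quadrants and that the same linear classifier is applied to every pooled local feature; both choices, and all names and numbers, are illustrative assumptions rather than the paper's configuration. The relation term between local logits is not covered.

```python
def gap(region):
    """Global average pooling over a region given as rows of C-dim cells."""
    cells = [cell for row in region for cell in row]
    channels = len(cells[0])
    return [sum(cell[c] for cell in cells) / len(cells) for c in range(channels)]


def classify(feature, weight, bias):
    """Linear classifier: one logit per class."""
    return [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weight, bias)]


def split_quadrants(fmap):
    """Divide an H x W feature map into four (H/2) x (W/2) local regions."""
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [[row[c:c + w] for row in fmap[r:r + h]] for r in (0, h) for c in (0, w)]


# Made-up 4 x 4 feature map with 2 channels, and a 3-class classifier.
fmap = [[[float(r + c), float(r * c)] for c in range(4)] for r in range(4)]
weight = [[1.0, 0.5], [-1.0, 2.0], [0.0, 1.0]]
bias = [0.1, 0.0, -0.2]

# Global logit: pool the whole map, then classify (spatial layout is gone).
global_logit = classify(gap(fmap), weight, bias)

# Local logits: pool each region separately, then classify each one.
local_logits = [classify(gap(q), weight, bias) for q in split_quadrants(fmap)]

print(global_logit)
for logit in local_logits:
    print(logit)
```

In this toy setup the global logit is exactly the average of the four local logits, so the local logits keep per-region differences that the global logit averages away.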

4. CRCD (CVPR 2021)

https://arxiv.org/pdf/2103.16367v1.pdf

Considered ~ an important problem.
-> Improving how well inter-sample relation information is transferred.
Using contrastive learning as is destroys the geometry among samples that are close in latent space.
(CRD: the student logit only tries to move closer to the matching teacher logit)
Solved it using the following hint.
-> Defines a new cross-space relation between two samples, measures that relation in the teacher, then uses it to supervise the student.
(CRCD: the student logit is supervised so that it preserves the structure around the teacher's logit)
~ remains an open question.
-> Structural information of deep representation can be better exploited during distillation (what kind of knowledge really needs to be preserved in the distillation).
~ is lacking.
-> It seems to compare only output logits. What about additionally using lower-level features?

5. WSL (ICLR 2021)

https://arxiv.org/pdf/2102.00650v1.pdf

Considered ~ an important problem.
-> The teacher's output, the soft label, acts as both supervision and regularization. But how much bias and variance change when distilling with those soft labels is unclear.
Solved it using the following hint.
-> Observed the bias-variance tradeoff experimentally.
In the process, found that there are specific samples that raise bias and lower variance (which hurts training),
and that standard KD does not handle them well. Proposes weighted soft labels that reduce the influence of those samples.
"we rethink the soft labels for distillation from a bias-variance tradeoff perspective"
~ remains an open question.
->
~ is lacking.
->

(Related work: Revisiting knowledge distillation via label smoothing regularization is the paper to read first.)

6. Revisiting knowledge distillation via label smoothing regularization (CVPR 2020 oral)

https://openaccess.thecvf.com/content_CVPR_2020/papers/Yuan_Revisiting_Knowledge_Distillation_via_Label_Smoothing_Regularization_CVPR_2020_paper.pdf

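Considered ~ an important problem.
-> Even when distilling from (a teacher with very low acc, self distillation, or a student teaching the teacher), the trained model still performs well.
Solved it using the following hint.
-> KD is a special case of label smoothing regularization. As the temperature t goes up, q gets closer to uniform, which acts as regularization.
Soft labels both convey categorical information and act as regularization.
~ remains an open question.
->
~ is lacking.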
->
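
A minimal sketch of the temperature effect noted above. It assumes q denotes the teacher's temperature-softened softmax output (the note does not define q); the logits and helper names are made up for illustration.

```python
import math


def softmax_with_temperature(logits, t):
    """Softmax of logits / t, computed stably by subtracting the max."""
    scaled = [z / t for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gap_to_uniform(probs):
    """Largest absolute deviation from the uniform distribution."""
    u = 1.0 / len(probs)
    return max(abs(p - u) for p in probs)


teacher_logits = [5.0, 2.0, 0.5, -1.0]  # made-up teacher logits
temperatures = (1.0, 4.0, 20.0, 100.0)

gaps = []
for t in temperatures:
    q = softmax_with_temperature(teacher_logits, t)
    gaps.append(gap_to_uniform(q))
    print(t, [round(p, 3) for p in q])

print(gaps)
```

The deviation from uniform shrinks as t grows, so at high temperature the soft target behaves much like the uniform component of a smoothed label.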

Think about this together with fidelity.

7. Does knowledge distillation really work? (NeurIPS 2021 poster)
https://arxiv.org/pdf/2106.05945.pdf

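Considered ~ an important problem.
-> Even when the teacher and the student have the same architecture, their predictive distributions are very different.
Solved it using the following hint.
-> Proposes fidelity (which measures teacher-student agreement) and confirms that it is not directly proportional to student acc.
~ remains an open question.
-> Tried to raise fidelity but could not..!
~ is lacking.
-> Shouldn't true fidelity compare the label distributions (soft predictions) rather than the argmax labels?
The validity of the GAN augmentation.
Tried hard to raise fidelity but did not manage to.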
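
To illustrate the difference between the argmax-based agreement and the distribution-based comparison raised above, here is a small sketch. The function names and the probability tables are made up, and the average KL divergence is just one possible way to compare soft predictions.

```python
import math


def top1_agreement(teacher_probs, student_probs):
    """Fraction of samples where teacher and student pick the same argmax class."""
    def argmax(p):
        return max(range(len(p)), key=p.__getitem__)

    matches = sum(argmax(t) == argmax(s) for t, s in zip(teacher_probs, student_probs))
    return matches / len(teacher_probs)


def mean_kl(teacher_probs, student_probs, eps=1e-12):
    """Average KL(teacher || student) over samples; lower means closer distributions."""
    total = 0.0
    for t, s in zip(teacher_probs, student_probs):
        total += sum(ti * math.log((ti + eps) / (si + eps)) for ti, si in zip(t, s))
    return total / len(teacher_probs)


teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.4, 0.35, 0.25]]
# student_a tracks the teacher's distributions closely but flips one argmax.
student_a = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.3, 0.45, 0.25]]
# student_b matches every argmax but is almost uniform everywhere.
student_b = [[0.34, 0.33, 0.33], [0.33, 0.34, 0.33], [0.34, 0.33, 0.33]]

for name, student in (("a", student_a), ("b", student_b)):
    print(name, top1_agreement(teacher, student), mean_kl(teacher, student))
```

Here student_b wins on top-1 agreement while student_a is closer in KL, so the two notions of fidelity can rank students differently.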

8. What Knowledge Gets Distilled in KD (May 31, 2022)

https://arxiv.org/pdf/2205.16004v1.pdf

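Considered ~ an important problem.
-> What is dark knowledge, and, through experiments, how far can knowledge actually be transferred?
Solved it using the following hint.
-> Compares the agreement among T, an independent student, and distilled students (KD, hint, CRD).
Uses the agreement between T and the distilled S to find out how far knowledge is transferred and to learn more about dark knowledge.
~ remains an open question.
-> Why color jitter invariance and crop invariance are not achieved with hint learning (if a paper that improves on hint learning does preserve the invariance, analyze which part was critical) (+ KD and CRD do preserve the invariance, but how they do so is also an open problem) -> For KD, statistically analyzing the values after GAP might reveal this.
The upper bound on the knowledge that can be distilled for each student architecture.
Building a tailored algorithm that transfers only specific properties of the teacher.
~ is lacking.
-> Hint learning seems too old a paper to represent feature-based KD.
The figures give no reference for whether a given value is high or low (there are too few yardsticks to interpret them from the reader's own understanding).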

9. On the Efficacy of KD (ICCV 2019)

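Considered ~ an important problem.
-> A larger teacher does not guarantee student acc. A small student has a limited capacity to mimic.
Sets up two hypotheses: 1. KD loss and accuracy are not correlated. 2. The student cannot mimic the teacher.
Solved it using the following hint.
-> Addressing 1.:
Comparing the independent S with the distilled S, optimizing the KD loss hurts accuracy toward the end of training.
A small student lacks the capacity to optimize the KD loss and the CE loss at the same time, so it sacrifices the CE loss. (the authors' assumption)
So proposes Early stopping KD, which applies the KD loss only early in training and uses only the CE loss afterwards.

Addressing 2.: not solved.
~ remains an open question.
-> 2.
~ is lacking.
-> Does not discuss the label smoothing effect whereby distilling even from a teacher with 9% acc raises performance.
In the end fails to raise student acc with a larger T (proposes sequential KD, but it has no effect).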
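
The early stopping schedule described above can be sketched as a loss that drops the KD term after a cutoff epoch. The names alpha and stop_epoch and their values are hypothetical; the note does not state the weighting or the cutoff.

```python
def kd_weight(epoch, stop_epoch, alpha):
    """Weight on the KD term: alpha before stop_epoch, 0 from stop_epoch on."""
    return alpha if epoch < stop_epoch else 0.0


def eskd_loss(ce_loss, kd_loss, epoch, stop_epoch, alpha=0.9):
    """Total loss under an early-stopped KD schedule.

    Early epochs mix CE and KD; later epochs fall back to CE alone.
    """
    w = kd_weight(epoch, stop_epoch, alpha)
    return (1.0 - w) * ce_loss + w * kd_loss


# Hypothetical 5-epoch run where KD is switched off from epoch 3.
schedule = [kd_weight(epoch, stop_epoch=3, alpha=0.5) for epoch in range(5)]
print(schedule)
print(eskd_loss(ce_loss=1.0, kd_loss=2.0, epoch=0, stop_epoch=3, alpha=0.5))
print(eskd_loss(ce_loss=1.0, kd_loss=2.0, epoch=3, stop_epoch=3, alpha=0.5))
```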

10. Knowledge distillation from a stronger teacher (May 22, 2022)
(a good paper to use as a baseline)

Considered ~ an important problem.
-> Exactly matching predictions with KL div is bad for acc (the stronger the teacher, the larger the T-S prediction discrepancy).
Uses Pearson correlation instead of KL div.
(what we really care about is preserving the preference (relative ranks of predictions) by the teacher, instead of recovering the absolute values accurately.)
Solved it using the following hint.
-> Uses inter- and intra-class correlation.
Reason for using intra correlation: For the images from the same class, the intrinsic intra-class variance of the semantic similarities is actually also informative. It indicates the prior from the teacher that which one is more reliable to cast in this class.
(+ Recent KD work reportedly just uses a temperature of 1, because raising the temperature brings in the label smoothing effect.)
~ remains an open question.
-> Using similarity (correlation) metrics other than Pearson.
~ is lacking.
-> Uses only output logits.
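
A rough sketch of replacing exact matching with a Pearson correlation distance, applied per sample across classes (inter) and per class across samples (intra). This is an illustration under my own assumptions, not the authors' implementation: the function names, the eps term, and the probability tables are made up, and the weighting between the two terms is left out.

```python
import math


def pearson(u, v, eps=1e-8):
    """Pearson correlation between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv + eps)


def inter_class_loss(student, teacher):
    """Mean of (1 - correlation) over rows: each row is one sample's class probabilities."""
    return sum(1.0 - pearson(s, t) for s, t in zip(student, teacher)) / len(student)


def intra_class_loss(student, teacher):
    """Same distance over columns: each column is one class's probabilities across the batch."""
    return inter_class_loss(list(zip(*student)), list(zip(*teacher)))


teacher = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.3, 0.5]]
# A softer student that keeps the teacher's relative preferences exactly
# (each probability is an affine function of the teacher's).
student_soft = [[0.5 * p + 0.5 / 3 for p in row] for row in teacher]
# A student whose class ranking is reversed within every sample.
student_bad = [list(reversed(row)) for row in teacher]

print(inter_class_loss(student_soft, teacher), intra_class_loss(student_soft, teacher))
print(inter_class_loss(student_bad, teacher), intra_class_loss(student_bad, teacher))
```

The softer student gets a near-zero loss even though its absolute values differ from the teacher's, which is the relaxation the quoted sentence describes; an exact-matching loss such as KL div would still penalize it.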
