Summary of Yann LeCun's Facebook post. Self-supervised learning: NLP vs VISION

2021. 3. 8. 14:40 · Deep Learning

Yann LeCun's opinion posted on Facebook (link)

 

https://www.facebook.com/yann.lecun/posts/10157551937352143

Summary

 

  • SSL methods differ between NLP and vision.
    • Text
      • Text is a discrete signal, which makes it easy to represent uncertainty in a prediction.
      • Therefore, architectures and training paradigms that predict or reconstruct the signal work well.
        • Example: filling in the blanks in "The (blank) chases the (blank) in the savanna." The vocabulary is enormous and the fill-in-the-blank task carries a lot of uncertainty, but it is still possible to produce a list of all possible words, together with a probability estimate of each word's appearance at that location.
      • For example, a denoising encoder-decoder.
    • Vision
      • Images are high-dimensional continuous signals.
      • Representing uncertainty is much harder.
      • Therefore, architectures that do not predict or reconstruct work better.
      • For example, joint embedding methods (a.k.a. Siamese nets).
  • In the past,
    • joint embedding architectures required an expensive "contrastive learning" procedure.
  • These days, however, there are several methods that do not require contrastive samples.
    • These methods learn better, more general, and more robust representations.
    • They outperform pure supervised learning.
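The joint-embedding (Siamese) idea above can be sketched very minimally: two augmented views of the same input pass through a shared encoder, and the objective pulls their embeddings together. Everything here (the toy linear encoder, noise as a stand-in for augmentation) is a hypothetical illustration, not the actual method in any of these papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "encoder" shared by both branches of the Siamese network.
W = rng.normal(size=(8, 4))  # maps 8-dim inputs to 4-dim embeddings

def encode(x):
    z = x @ W
    return z / np.linalg.norm(z)  # L2-normalize the embedding

def augment(x):
    # Tiny additive noise as a stand-in for real image augmentations.
    return x + 0.05 * rng.normal(size=x.shape)

x = rng.normal(size=8)  # one "image"
z1, z2 = encode(augment(x)), encode(augment(x))

# Joint-embedding objective: make the two views' embeddings agree.
similarity = float(z1 @ z2)  # cosine similarity (both vectors are unit-norm)
loss = 1.0 - similarity      # minimized when the embeddings match
```

Note that nothing here predicts or reconstructs pixels; the loss lives entirely in embedding space, which is the point of the joint-embedding family. The non-contrastive methods mentioned above differ in how they keep this objective from collapsing to a constant embedding (e.g., stop-gradients or redundancy-reduction terms), which this sketch omits.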

 

  1. The energy function seems to be what is called a similarity function here.
  2. NLP vs Vision:
    1. To better understand this challenge, we first need to understand the prediction uncertainty and the way it’s modeled in NLP compared with CV. In NLP, predicting the missing words involves computing a prediction score for every possible word in the vocabulary. While the vocabulary itself is large and predicting a missing word involves some uncertainty, it’s possible to produce a list of all the possible words in the vocabulary together with a probability estimate of the words’ appearance at that location. Typical machine learning systems do so by treating the prediction problem as a classification problem and computing scores for each outcome using a giant so-called softmax layer, which transforms raw scores into a probability distribution over words. With this technique, the uncertainty of the prediction is represented by a probability distribution over all possible outcomes, provided that there is a finite number of possible outcomes.
    2. In CV, on the other hand, the analogous task of predicting “missing” frames in a video, missing patches in an image, or missing segment in a speech signal involves a prediction of high-dimensional continuous objects rather than discrete outcomes. There are an infinite number of possible video frames that can plausibly follow a given video clip. It is not possible to explicitly represent all the possible video frames and associate a prediction score to them. In fact, we may never have techniques to represent suitable probability distributions over high-dimensional continuous spaces, such as the set of all possible video frames.
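The softmax-over-vocabulary mechanism described in point 1 can be sketched with toy numbers. The vocabulary and the raw scores below are made up purely for illustration:

```python
import numpy as np

# Hypothetical candidates and raw scores (logits) for the first blank in
# "The (blank) chases the (blank) in the savanna" -- illustrative values only.
vocab = ["lion", "cheetah", "car", "gazelle", "idea"]
logits = np.array([3.1, 2.7, -1.0, 0.5, -2.0])

def softmax(x):
    """Turn raw scores into a probability distribution over the vocabulary."""
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()

probs = softmax(logits)
# The uncertainty of the prediction is now explicit: every word in the
# (finite) vocabulary gets a probability, and the probabilities sum to 1.
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

This is exactly what is impossible in the continuous case: for video frames or image patches there is no finite list of outcomes to enumerate, so no analogous softmax distribution can be written down.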
