Summary of Yann LeCun's Facebook post. Self-supervised learning: NLP vs VISION

2021. 3. 8. 14:40 · Deep Learning

Yann LeCun's opinion posted on Facebook (link)

 

https://www.facebook.com/yann.lecun/posts/10157551937352143

Summary

 

  • SSL methods differ between NLP and vision.
    • Text
      • Text is a discrete signal, which makes it easy to represent uncertainty in a prediction.
      • Therefore, architectures and training paradigms that predict or reconstruct the signal work well.
        • Example: filling in the blanks in "The (blank) chases the (blank) in the savanna." The vocabulary is enormous and the fill-in-the-blank task carries a lot of uncertainty, but it is still possible to produce a list of all possible words, together with a probability estimate of each word's appearance at that location.
      • For example, a denoising encoder-decoder.
    • Vision
      • Images are high-dimensional continuous signals.
      • Representing uncertainty is much harder.
      • Therefore, architectures that do not predict or reconstruct work better.
      • For example, joint embedding methods (a.k.a. Siamese nets).
  • In the past,
    • joint embedding architectures required an expensive "contrastive learning" procedure.
  • These days, however, there are several methods that do not require contrastive samples.
    • These methods learn better, more general, and more robust representations.
    • They outperform pure supervised learning.
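The joint-embedding (Siamese) idea above can be sketched very minimally: two augmented views of the same input pass through a shared encoder, and the objective pulls their embeddings together. Everything here (the toy linear encoder, noise as a stand-in for augmentation) is a hypothetical illustration, not the actual method in any of these papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "encoder" shared by both branches of the Siamese network.
W = rng.normal(size=(8, 4))  # maps 8-dim inputs to 4-dim embeddings

def encode(x):
    z = x @ W
    return z / np.linalg.norm(z)  # L2-normalize the embedding

def augment(x):
    # Tiny additive noise as a stand-in for real image augmentations.
    return x + 0.05 * rng.normal(size=x.shape)

x = rng.normal(size=8)  # one "image"
z1, z2 = encode(augment(x)), encode(augment(x))

# Joint-embedding objective: make the two views' embeddings agree.
similarity = float(z1 @ z2)  # cosine similarity (both vectors are unit-norm)
loss = 1.0 - similarity      # minimized when the embeddings match
```

Note that nothing here predicts or reconstructs pixels; the loss lives entirely in embedding space, which is the point of the joint-embedding family. The non-contrastive methods mentioned above differ in how they keep this objective from collapsing to a constant embedding (e.g., stop-gradients or redundancy-reduction terms), which this sketch omits.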

 

  1. The energy function seems to be what is called a similarity function here.
  2. NLP vs Vision:
    1. To better understand this challenge, we first need to understand the prediction uncertainty and the way it’s modeled in NLP compared with CV. In NLP, predicting the missing words involves computing a prediction score for every possible word in the vocabulary. While the vocabulary itself is large and predicting a missing word involves some uncertainty, it’s possible to produce a list of all the possible words in the vocabulary together with a probability estimate of the words’ appearance at that location. Typical machine learning systems do so by treating the prediction problem as a classification problem and computing scores for each outcome using a giant so-called softmax layer, which transforms raw scores into a probability distribution over words. With this technique, the uncertainty of the prediction is represented by a probability distribution over all possible outcomes, provided that there is a finite number of possible outcomes.
    2. In CV, on the other hand, the analogous task of predicting “missing” frames in a video, missing patches in an image, or missing segment in a speech signal involves a prediction of high-dimensional continuous objects rather than discrete outcomes. There are an infinite number of possible video frames that can plausibly follow a given video clip. It is not possible to explicitly represent all the possible video frames and associate a prediction score to them. In fact, we may never have techniques to represent suitable probability distributions over high-dimensional continuous spaces, such as the set of all possible video frames.
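The softmax-over-vocabulary mechanism described in point 1 can be sketched with toy numbers. The vocabulary and the raw scores below are made up purely for illustration:

```python
import numpy as np

# Hypothetical candidates and raw scores (logits) for the first blank in
# "The (blank) chases the (blank) in the savanna" -- illustrative values only.
vocab = ["lion", "cheetah", "car", "gazelle", "idea"]
logits = np.array([3.1, 2.7, -1.0, 0.5, -2.0])

def softmax(x):
    """Turn raw scores into a probability distribution over the vocabulary."""
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()

probs = softmax(logits)
# The uncertainty of the prediction is now explicit: every word in the
# (finite) vocabulary gets a probability, and the probabilities sum to 1.
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

This is exactly what is impossible in the continuous case: for video frames or image patches there is no finite list of outcomes to enumerate, so no analogous softmax distribution can be written down.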
