Computer vision (CV) and natural language processing (NLP) are two distinct subfields of artificial intelligence. The former concerns how computers can understand digital images and videos, while the latter deals with the interaction between computers and human language. One key difference between CV and NLP tasks is that NLP relies heavily on sequence modeling. However, recent studies suggest that images can be processed in much the same way as words.
Convolutional architectures remain dominant in computer vision. One limitation of these architectures, however, is that a convolution only sees a local neighborhood of pixels, so the model struggles to capture the context of, and connections among, distant parts of an image. If image patches are to play the role of words, this limitation is particularly damaging, because the meaning of a word relies heavily on its context. For this reason, NLP approaches the problem very differently, focusing on contextualization.
In 2021, researchers from Google showed that the Transformer, the state-of-the-art model architecture in NLP, could also be applied to computer vision. An image can be split into fixed-size patches, each of which is linearly embedded and augmented with a position embedding. Every patch can then be treated as a word, and a sequence of patches as a sentence. It turns out that computer vision and natural language processing are more interconnected than we used to think.
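To make the patch-to-word analogy concrete, here is a minimal PyTorch sketch of the patch-embedding step. The module name `PatchEmbedding` is hypothetical, and the hyperparameters (224x224 images, 16x16 patches, 768-dimensional embeddings) are assumptions borrowed from a common Vision Transformer configuration rather than anything specified above:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch: split an image into fixed-size patches, linearly embed
    each patch, and add a learnable position embedding.
    Hyperparameters are illustrative assumptions, not a fixed spec."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution extracts non-overlapping patches and
        # linearly projects them in a single step.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # One learnable position embedding per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)                  # (batch, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)
        return x + self.pos_embed         # each patch is now a "word" vector


# A 224x224 image becomes a "sentence" of 196 patch tokens.
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The resulting sequence of patch vectors can be fed to a standard Transformer encoder exactly as a sequence of word embeddings would be, which is the sense in which an image is treated as a sentence.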