Title: ViT
Tags: [“Deep Learning”, “CV”]

Vision Transformer (ViT): How Does AI’s “Farsighted Eye” See Pictures?

Imagine how you and I identify whether a picture shows a cat, a dog, or a car. Our brains quickly scan the whole image, pick out key features, and combine them into a holistic perception. Enabling machines to do the same has long been a central goal of Artificial Intelligence, and of Computer Vision in particular.

For a long time, Convolutional Neural Networks (CNNs) dominated image processing. A CNN acts like a “nearsighted” detective: it examines local regions layer by layer, first picking up basic features such as edges and textures, then gradually combining these small features into larger ones (eyes, noses), and finally arriving at a recognition of the whole object. CNNs perform well on many tasks, but they have a limitation: because their design focuses on local feature extraction, they can struggle to understand complex relationships between distant elements in an image, much like a person who only looks down at a book and misses the full view of the surroundings.

However, in 2020, researchers at Google brought a “vision revolution”: the Vision Transformer, or ViT for short. It boldly “transplanted” the Transformer model, originally built for processing text, into image understanding, giving AI a “farsighted eye” that can take in the whole picture at a glance and grasp the connections between all of its elements.

What Is a Transformer? The Evolution from Language to Vision

Before diving into ViT, let’s briefly look at its “predecessor”, the Transformer model. The Transformer was originally designed to process natural language (the words we speak or write). Its core innovation is “Self-Attention”.

You can imagine a sentence as a pearl necklace. When we understand the sentence, the meaning of each word (each pearl) is not isolated; it is influenced by the other words around it. For example, the word “apple” refers to a brand in “Apple phone” and to a fruit in “eat an apple”. The Transformer’s self-attention mechanism lets the model “pay attention” to all the other words in the sentence while processing each word, and adjust its understanding of the current word according to their importance. It can capture very long-range dependencies, which is especially powerful for long texts. A tiny numerical sketch of the mechanism follows below.
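
To make this concrete, here is a toy sketch of single-head scaled dot-product self-attention in Python with NumPy. The three “tokens”, their 4-dimensional embeddings, and the weight matrices are random placeholders, not values from any trained model; the point is only the mechanism itself: every token’s output is a weighted mix of all tokens’ values.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """Single-head self-attention: every token attends to every token."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # queries, keys, values
        scores = Q @ K.T / np.sqrt(K.shape[-1])          # scaled pairwise similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
        return weights @ V, weights                      # context-mixed outputs, attention map

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 4))                          # 3 toy tokens, 4-dim embeddings
    Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
    out, attn = self_attention(X, Wq, Wk, Wv)
    print(attn.round(2))  # row i shows how much token i "pays attention" to each token

Each row of the printed matrix sums to 1: these are exactly the “attention weights” described above, just computed on random numbers.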

ViT’s disruptive contribution is a simple yet bold idea: if the Transformer is so good at understanding the order of and relationships between words, why not treat an image as a kind of “sequence” as well?

How ViT “Sees” Pictures: A Four-Step “Puzzle Master”

To let a Transformer process images, ViT makes a few ingenious modifications. Let’s walk through its workflow with a “puzzle master” analogy:

  1. Disassembling the Picture: Cutting the Image into “Puzzle Pieces”
    Imagine a magnificent landscape painting in front of you. The first thing ViT does is cut this painting evenly into many small squares, just like a jigsaw puzzle. These small squares are called “image patches”, and each one has a fixed size, such as 16x16 pixels. In this way, one large picture is converted into an ordered sequence of small image patches. The step is like cutting every page of a book into strips of the same size so they can be processed uniformly.

  2. Encoding “Puzzle Pieces”: Giving Each Piece a “Digital Identity”
    Cutting alone is not enough; the machine cannot directly understand these patches. ViT therefore flattens each patch and passes it through a learned linear projection, producing what is called a “linear embedding”. This “digital identity” is a vector of numbers that condenses the patch’s visual information such as color, texture, and shape. It is like taking an “ID photo” of each puzzle piece and converting it into a digital code the machine can work with.

  3. Adding “Position Information”: Remembering the “Seat” of Each Piece
    We now have a set of digitally encoded puzzle pieces, but the Transformer itself has no built-in sense of order: it does not know which piece belongs in the top-left corner and which in the bottom-right. To solve this, ViT adds a “positional embedding” to each encoded patch. This is like writing the original coordinates (say, row 3, column 5) on the back of each puzzle piece, so the Transformer knows where in the picture each piece came from. Steps 1 to 3 are sketched in code right after this list.

  4. Transformer Encoder: Global Analysis by the “Master Brain”
    Once this preparation is done, the sequence of position-aware patch embeddings is fed into the core of the Transformer: the encoder. The self-attention layers stacked inside it now get to work:

    • Global Association of “You in Me, Me in You”: When the encoder processes a specific image block (e.g., the leaf part of a tree in the painting), it does not view this leaf in isolation. Through the self-attention mechanism, the encoding of this leaf will examine the encodings of all other image blocks (such as the trunk, distant mountains, grass on the ground) and assign different “attention weights” based on their importance to understanding the “leaf”. For example, it will find that the “trunk” is most closely related to the “leaf”, while the “distant mountains” are weakly related. This mechanism allows the model to establish complex relationships between all elements in the image and capture global contextual information. This is like a team meeting where everyone listens carefully to others’ views when speaking and combines them to form a more comprehensive view.

    • Deep Feature Integration: After multiple layers of self-attention and Feed-Forward Networks, the digital identity of each patch becomes richer and more meaningful. The patches are no longer isolated sets of pixels but “high-level features” that integrate contextual information from the entire picture.
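
As a concrete illustration of steps 1 to 3, here is a minimal PyTorch sketch that cuts one 224x224 RGB image into 16x16 patches, projects each flattened patch to an embedding, and adds a learned positional embedding. The function name `patchify` and the sizes are illustrative choices, not code from the ViT paper; real implementations typically fuse steps 1 and 2 into a single strided convolution.

    import torch
    import torch.nn as nn

    def patchify(img, patch=16):
        """Split a (C, H, W) image into a sequence of flattened patches."""
        C, H, W = img.shape
        img = img.reshape(C, H // patch, patch, W // patch, patch)
        img = img.permute(1, 3, 0, 2, 4)               # (rows, cols, C, patch, patch)
        return img.reshape(-1, C * patch * patch)      # (num_patches, patch_dim)

    img = torch.randn(3, 224, 224)                     # step 1: one RGB image -> 14 x 14 = 196 patches
    patches = patchify(img)                            # shape (196, 768), since 3 * 16 * 16 = 768

    embed_dim = 768
    to_embedding = nn.Linear(patches.shape[1], embed_dim)                   # step 2: the "digital identity"
    pos_embedding = nn.Parameter(torch.zeros(patches.shape[0], embed_dim))  # step 3: positions, learned in training

    tokens = to_embedding(patches) + pos_embedding     # (196, 768): the sequence the encoder consumes
    print(tokens.shape)                                # torch.Size([196, 768])

The result is exactly what step 4 consumes: an ordered sequence of 196 position-aware patch embeddings.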

Finally, ViT reads out a special “class token”: an extra learnable token that is prepended to the patch sequence and, after passing through the encoder, has absorbed information from every patch. Its final state is fed into a simple classifier (usually a fully connected layer), which outputs the category prediction for the image, such as “this is a cat” or “this is a car”. A minimal sketch of this stage follows below.
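
Here is a minimal, hypothetical sketch of step 4 plus this read-out stage, with random numbers standing in for the patch embeddings from the previous sketch so the snippet runs on its own. Note that PyTorch’s stock encoder layer differs in details (post-norm, ReLU) from the pre-norm, GELU blocks of the ViT paper, and a real ViT-Base stacks 12 such layers; this only shows the overall structure.

    import torch
    import torch.nn as nn

    embed_dim, num_classes = 768, 1000
    tokens = torch.randn(196, embed_dim)                       # stand-in for the patch embeddings above

    cls_token = nn.Parameter(torch.zeros(1, embed_dim))        # the extra learnable "class token"
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
        num_layers=2)                                          # a stack of self-attention blocks
    head = nn.Linear(embed_dim, num_classes)                   # simple fully connected classifier

    seq = torch.cat([cls_token, tokens], dim=0).unsqueeze(0)   # (1, 197, 768): class token goes first
    encoded = encoder(seq)                                     # every token attends to every other token
    logits = head(encoded[:, 0])                               # classify from the class token alone
    print(logits.shape)                                        # torch.Size([1, 1000]) class scores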

Advantages and Challenges of ViT:

Advantages:

  • Global Vision, Long-Range Dependencies: The core advantage of ViT lies in the self-attention mechanism enabling it to capture long-range dependencies between different regions in an image, which is very beneficial for understanding complex scenes and object contexts.
  • Higher Generalization Ability: When trained on massive amounts of data, ViT generalizes better than CNNs, learning more powerful and more general visual representations.
  • Potential for Fusion with Other Modalities: Since Transformer itself is a general architecture for processing sequence data, this makes it easier for ViT to fuse with data from other modalities such as text and audio in the future, building more powerful multi-modal AI models.

Challenges:

  • Data Hunger: ViT needs massive amounts of training data to reach its potential; with insufficient data it often underperforms CNNs. Typically, ViT is pre-trained on a large-scale dataset (such as JFT-300M or ImageNet-21K) and then fine-tuned on the specific task.
  • High Computational Cost: Self-attention is computationally expensive, especially on high-resolution images, where its compute and memory consumption far exceed those of a CNN with a comparable number of parameters (a back-of-the-envelope sketch follows right after this list).
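
To see why, here is a back-of-the-envelope Python sketch under the article’s assumption of fixed 16x16 patches: the number of tokens grows linearly with image area, and the attention matrix computed in every layer grows with the square of the token count.

    def attention_matrix_size(side, patch=16):
        tokens = (side // patch) ** 2            # number of image patches, i.e. tokens
        return tokens, tokens ** 2               # entries in one self-attention matrix

    for side in (224, 512, 1024):
        tokens, entries = attention_matrix_size(side)
        print(f"{side}x{side} image -> {tokens:5d} tokens -> {entries:>10,} attention entries")

Going from a 224x224 to a 1024x1024 input multiplies the token count by about 21 and the attention matrix by more than 400, which is the practical motivation behind windowed or hierarchical variants such as the Swin Transformer mentioned below.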

Latest Progress and Applications of ViT:

Since ViT was proposed, it has quickly become a research hotspot in computer vision and has spawned many variants and improved models, such as the Swin Transformer and MAE, which address some of the computational-efficiency and data-dependency issues while preserving ViT’s core idea.

Currently, ViT and its variants are widely used in:

  • Image Classification, Object Detection, Semantic Segmentation: In these basic visual tasks, ViT has surpassed many traditional CNN models, achieving SOTA (State-Of-The-Art) performance.
  • Medical Image Analysis: Assisting doctors in diagnosing diseases, such as identifying lesion areas in X-rays or CT scans.
  • Autonomous Driving: Helping vehicles understand complex road environments and identify pedestrians, vehicles, and traffic signs.
  • Multi-Modal Learning: Combined with large language models to achieve Image Captioning and Text-to-Image Generation, such as generative AI models like Midjourney and DALL-E.
  • Video Understanding: Processing video frame sequences to achieve behavior recognition, event detection, and other tasks.

In short, the emergence of ViT is a milestone in computer vision. It proves that the Transformer architecture is not limited to text but can also shine in image processing. It is like equipping AI with a pair of “farsighted eyes” that take in the whole scene, a solid and important step toward machines that understand and perceive our rich visual world. In the future, as model efficiency improves and more general-purpose data becomes available, ViT and its family will show their potential in even more fields.


References:
Vision Transformers in Autonomous Driving. [Online]. Available: https://github.com/topics/vision-transformers-for-autonomous-driving.
How DALL-E, MidJourney, Stable Diffusion & Other AI Image Generators Work. [Online]. Available: https://www.mage.ai/blog/how-ai-image-generators-work/.
Vision Transformers are scaling up for video and 3D. [Online]. Available: https://huggingface.co/papers/2301.07727.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. [Online]. Available: https://arxiv.org/abs/2010.11929.