DeepSeek's New Vision Feature: How It's Revolutionizing AI's Understanding of the Visual World

Introduction: Why Is DeepSeek's New Vision Feature a Quantum Leap?
From Text to Image: The Evolution of Multimodal AI
How Does It Work? Diving into the Architecture of the Image Recognition Feature
Brainstorming: Redefining How the Model "Thinks"
Why Is This Feature a Top Priority for DeepSeek?
Risk Assessment: The Challenges Facing This Technology
The Roadmap: Expected Future Developments
The Numbers Speak: Performance Comparison with Leading Models
Practical Example: How the Feature Works in a Real-World Scenario
Case Study: Analyzing a Complex Image Step by Step
Advantages and Limitations: A Balanced View
Roleplay: Imagine You Are the Model
Frequently Asked Questions
The Future: What Lies Ahead?
References and Sources

Introduction: Why Is DeepSeek's New Vision Feature a Quantum Leap?

In a world where AI development is accelerating at an unprecedented pace, the ability to "see" and understand the visual world has long been one of the greatest challenges for researchers and developers. This new feature from DeepSeek, officially launched on June 18, 2026, represents the culmination of months of intensive effort and a milestone in the company's journey toward building an integrated AI model capable of understanding the world in all its complexity[reference:0][reference:1].

What distinguishes this feature is not simply that it recognizes images, but the new approach it takes to how the model "thinks." Rather than only describing what is visible, it tries to mimic how a person reasons about a complex image: connecting elements to one another, extracting implicit meanings, and making decisions based on what is seen.

This new DeepSeek feature did not emerge from a vacuum; it is the product of a long research and development effort that began with purely text-based language models and has now extended to image understanding. It is the boldest step yet in the company's strategy to build a comprehensive artificial intelligence capable of competing with the biggest players in the field, such as GPT-4o and Gemini[reference:2]. In this article, we look at the feature in depth, from its historical roots to its architecture, and finally to its expected impact on the future of human-machine interaction.

From Text to Image: The Evolution of Multimodal AI

AI's path toward understanding images has not been sudden; it is the culmination of decades of research in computer vision and natural language processing. Early convolutional neural networks (CNNs) were largely limited to simple tasks such as recognizing handwritten digits. Deeper architectures such as ResNet then made it possible to classify images across a thousand or more categories with remarkable accuracy.

The real challenge, however, lay in connecting what the eye sees with what language understands. Models like CLIP from OpenAI addressed this by learning a shared representation for images and text at scale, paving the way for today's vision-language models. These models could understand images in a linguistic context, but they were still far from true "visual thinking."

This new DeepSeek feature represents the next generation of these models, going beyond merely linking images to text to attempting to understand the logical and causal relationships between elements within a single image. This development was made possible by tremendous advances in architectural design, the availability of massive amounts of data, and increasing computational power. DeepSeek has leveraged all these developments and added its own unique touch: "Thinking with Visual Primitives," a concept we will explore in detail in the following sections[reference:3][reference:4].

How Does It Work? Diving into the Architecture of the Image Recognition Feature

To understand how this new DeepSeek feature works, we must look at the architecture behind it, which represents a quantum leap in how images are processed. The feature is built on DeepSeek V4-Flash as its foundational model, a massive model with a total of 284 billion parameters, activating only 13 billion during inference, thanks to its Mixture of Experts (MoE) architecture that ensures high efficiency in resource usage[reference:5][reference:6].
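The efficiency claim can be made concrete with a little arithmetic. The following is a minimal sketch that uses only the two parameter counts quoted above; the function name is illustrative and does not come from DeepSeek.

```python
def active_fraction(active_billion: float, total_billion: float) -> float:
    """Share of the model's parameters that take part in one inference step."""
    return active_billion / total_billion


# Figures quoted in the article: 13B active out of 284B total.
share = active_fraction(13, 284)
print(f"{share * 100:.1f}% of parameters active per token")
```

Under these figures, fewer than one parameter in twenty is used for any given token, which is where the resource savings of a Mixture of Experts design come from.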

The process begins by converting the image into a series of visual tokens using an optimized Vision Transformer (ViT) developed specifically by DeepSeek for this purpose. This transformer does not merely divide the image into small patches as traditional models do; it also compresses the result, significantly reducing the number of tokens required to represent the image. For example, a 756×756 pixel image typically generates 2,916 image patch tokens. A 3×3 spatial compression reduces these to 324 tokens, and a Compressed Sparse Attention (CSA) mechanism then compresses the KV cache by a further factor of 4, ultimately leaving only 81 visual KV entries[reference:7][reference:8].
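The token counts in this example can be reproduced step by step. The sketch below is only an arithmetic illustration, not DeepSeek's implementation. The patch size of 14 pixels is an assumption, chosen because it is the value that makes a 756×756 image yield 2,916 patches; the function and parameter names are hypothetical.

```python
def visual_token_budget(width, height, patch=14, spatial=3, kv_ratio=4):
    """Return (patch_tokens, compressed_tokens, kv_entries) for one image.

    patch=14 is an assumption, not a documented value: it is the patch size
    for which 756x756 pixels gives the 2,916 patches quoted in the article.
    """
    if width % patch or height % patch:
        raise ValueError("this sketch only handles sizes divisible by the patch size")
    cols, rows = width // patch, height // patch
    patch_tokens = cols * rows
    # 3x3 spatial compression: each 3x3 block of patches becomes one token.
    compressed_tokens = (cols // spatial) * (rows // spatial)
    # The KV cache is described as being compressed by a further factor of 4.
    kv_entries = compressed_tokens // kv_ratio
    return patch_tokens, compressed_tokens, kv_entries


print(visual_token_budget(756, 756))  # (2916, 324, 81)
```

The intermediate value of 324 is not stated in the sources but follows directly from dividing 2,916 by 9.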

This compression, roughly 7,056-fold when measured from the 571,536 raw pixels of a 756×756 image down to 81 KV entries, is not merely a space-saving measure; it is the essence of DeepSeek's philosophy in dealing with images. Instead of overwhelming the model with a huge amount of unnecessary detail, the visual essence of the image is extracted, allowing the model to focus on what truly matters. Other models consume hundreds or thousands of tokens for a single image: for comparison, Claude Sonnet 4.6 requires about 870 tokens for the same image size, and Gemini-3-Flash requires about 1,100[reference:9].
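The 7,056-fold figure appears to correspond to raw pixels divided by final KV entries. The short sketch below checks that reading and relates the competitor token counts to the same 81 entries. The names are illustrative. Comparing another model's input tokens with DeepSeek's KV entries is only a rough, assumed equivalence, since the two are not the same unit.

```python
def compression_ratio(width: int, height: int, kv_entries: int) -> float:
    """Raw pixels per remaining KV entry."""
    return (width * height) / kv_entries


ratio = compression_ratio(756, 756, 81)
print(ratio)  # 7056.0

# Token counts quoted in the article for the same image size.
reported_tokens = {"Claude Sonnet 4.6": 870, "Gemini-3-Flash": 1100}

# How many times larger each reported count is than 81 (rough comparison only).
relative = {name: round(count / 81, 1) for name, count in reported_tokens.items()}
print(relative)
```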

After converting the image into these compressed tokens, they are fed into the DeepSeek V4-Flash base language model, which processes them alongside the user's text input. Here lies the real innovation: rather than treating visual tokens as secondary information, they are integrated into the thinking process itself, where the model can refer to specific locations in the image using precise coordinates, as if "pointing" to a particular part of it[reference:10].

Brainstorming: Redefining How the Model "Thinks"

The most innovative element of this new DeepSeek feature is what the company calls "Thinking with Visual Primitives." This concept represents a paradigm shift in how multimodal models deal with images. Traditionally, models relied on describing the image in natural language within the Chain of Thought, but this approach suffers from a fundamental problem: the "Reference Gap"[reference:11][reference:12].

The Reference Gap is the phenomenon where the model describes something using vague phrases like "the big thing on the left" or "the red area in the middle," leading to inaccuracies in understanding, especially in scenes crowded with details. This new DeepSeek feature solves this problem by integrating spatial coordinates (points or bounding boxes) directly into the thinking process itself, not just in the final output[reference:13][reference:14].
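A toy example may help show why a verbal reference is weaker than a coordinate. In the sketch below, the scene, its labels, and the 1000-unit image width are all invented for illustration. A phrase like "the cup on the left" matches two objects, while a bounding box matches exactly one.

```python
# Hypothetical scene: three cups, boxes given as (x1, y1, x2, y2).
scene = [
    {"label": "cup", "box": (40, 300, 120, 380)},
    {"label": "cup", "box": (60, 100, 150, 190)},
    {"label": "cup", "box": (700, 320, 780, 400)},
]


def left_half(objects, width=1000):
    """Objects whose horizontal center lies in the left half of the image."""
    return [o for o in objects if (o["box"][0] + o["box"][2]) / 2 < width / 2]


def by_box(objects, box):
    """Objects identified by an exact bounding box."""
    return [o for o in objects if o["box"] == box]


print(len(left_half(scene)))                   # 2: the phrase is ambiguous
print(len(by_box(scene, (40, 300, 120, 380))))  # 1: the box is not
```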

Imagine the model thinking as follows while analyzing an image: "Looking for a bear in the image, find a bear at coordinates [452,23,804,411], it's climbing a tree, so not on the ground. Looking to the bottom left, find another bear at [50,447,647,771], it's standing on a rock edge, this is what's needed." In this example, the coordinates are not just a final description, but thinking tools that help the model track exactly what it is looking at in each step of its reasoning[reference:15].
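Because the coordinates are embedded in ordinary text, a reader or a downstream tool can pull them back out of such a trace. The sketch below does this with a regular expression. It assumes the four numbers are the box corners in the order [x1, y1, x2, y2], which the sources do not state explicitly; the helper name is hypothetical.

```python
import re

# Matches four comma-separated integers in square brackets, e.g. [452,23,804,411].
BOX_PATTERN = re.compile(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")


def extract_boxes(trace: str):
    """Return every bracketed 4-integer group in the trace as a tuple of ints."""
    return [tuple(int(v) for v in match.groups()) for match in BOX_PATTERN.finditer(trace)]


trace = (
    "Looking for a bear in the image, find a bear at coordinates [452,23,804,411], "
    "it's climbing a tree, so not on the ground. Looking to the bottom left, "
    "find another bear at [50,447,647,771], it's standing on a rock edge, "
    "this is what's needed."
)

print(extract_boxes(trace))  # [(452, 23, 804, 411), (50, 447, 647, 771)]
```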

This method of thinking resembles the way people reason about a scene, connecting what they see with spatial locations in their mind without needing to describe everything in words. The intended result is a model that is more accurate in understanding complex scenes, less prone to errors caused by linguistic ambiguity, and more capable of handling tasks such as precise counting, understanding spatial relationships, and tracking objects in crowded scenes.

Why Is This Feature a Top Priority for DeepSeek?

The addition of this new DeepSeek feature was not merely a routine technical update; it is a strategic step reflecting the company's vision for the future of AI. In an era where multimodal models have become the gold standard, the absence of image understanding capability was a significant weakness for DeepSeek compared to competitors like OpenAI, Google, and Anthropic. Adding this feature bridges a fundamental gap in the company's product, making it competitive in a market where development is accelerating at an unprecedented pace[reference:16].

From a practical standpoint, this feature opens entirely new horizons for DeepSeek's applications. Instead of being limited to text processing, users can now upload images of documents, charts, maps, and even photographs, and receive detailed and accurate analysis. This makes the tool more useful in various fields such as education, scientific research, engineering, medicine, marketing, and many others[reference:17].

Furthermore, this feature represents a first step toward building a more comprehensive AI system capable of understanding the world in all its diversity. The ability to process images is the gateway to understanding video, augmented reality, and complex visual interactions. By laying this foundation now, DeepSeek positions itself to expand into these areas in the future and maintain its leadership in the AI race.

Moreover, this feature arrives at a sensitive time for the company, following a period of challenges related to team stability, with some prominent researchers in the multimodal field having left the company. The launch of this feature sends a reassuring message to the market and users that DeepSeek remains at the forefront of innovation and has the capability to compete at the highest levels[reference:18].

Risk Assessment: The Challenges Facing This Technology

Despite the significant achievement represented by this new DeepSeek feature, it is not without challenges and risks that must be considered. The first of these challenges is the limited knowledge base, as the model was trained on data up to 2025, meaning it may struggle to recognize products or objects that appeared after this date, and may confuse different models of new products[reference:19][reference:20].

Secondly, the model's performance remains unstable in some highly complex scenarios, such as images containing optical illusions, or scenes requiring precise counting of large numbers of similar objects. In these cases, the model may provide inaccurate or even contradictory answers, indicating gaps in its ability to handle certain types of visual challenges[reference:21][reference:22].

Thirdly, the model's capabilities remain relatively limited: it currently handles only static image understanding, without the ability to generate images, understand video, or perform creative transformations between different media. This puts it behind some competing models that offer a broader range of multimodal capabilities. The model also sometimes responds slowly or fails to process images during peak hours, affecting the user experience[reference:23][reference:24].

Fourthly, there are technical challenges related to the accuracy of coordinates in very fine-grained scenes, where the positioning accuracy may not be sufficient to handle microscopic details. The model's ability to generalize across different types of images and scenarios also needs improvement, as it may show varying performance across different categories of images[reference:25].

The Roadmap: Expected Future Developments

Looking to the future, this new DeepSeek feature is expected to undergo a series of continuous developments and improvements. The first of these will be in expanding the knowledge base, with DeepSeek likely to update training data more regularly to include up-to-date information and avoid the problem of knowledge obsolescence. This will enable the model to recognize new products and objects with greater accuracy[reference:26][reference:27].

Secondly, the algorithms used in highly complex scenarios are expected to see significant improvement, especially in areas such as handling optical illusions and precise counting. DeepSeek is already working on improving these aspects, and we are likely to see periodic updates that enhance the model's performance in these challenging tasks[reference:28][reference:29].

Thirdly, there are ambitious plans to expand the range of multimodal capabilities to include image generation, video understanding, and creative interactions between different media. This will make DeepSeek an integrated AI platform capable of competing with the biggest players in the market on all fronts. Improving system stability and its ability to handle heavy loads is also a top priority, to ensure a smooth user experience even during peak hours[reference:30][reference:31].

Fourthly, there is a trend toward making the "Thinking with Visual Primitives" mechanism more flexible, so that it does not require specific trigger words to activate, but works automatically whenever needed. This will make the model easier to use and more adaptable to different types of questions and scenarios[reference:32].

The Numbers Speak: Performance Comparison with Leading Models

To objectively evaluate this new DeepSeek feature, we must look at the numbers and statistics comparing its performance with leading models in the market. In precise counting tests, the DeepSeek model achieved a score of 89.2% on the Pixmo-Count benchmark, ahead of Gemini-3-Flash's 88.2%, and significantly ahead of GPT-5.4's 76.6% and Claude Sonnet 4.6's 68.7%[reference:33].

However, the largest gap appears in topological reasoning tasks, where the DeepSeek model outperformed competitors by a wide margin. In maze navigation tasks, DeepSeek scored 66.9%, while GPT-5.4 scored 50.6%, Gemini-3-Flash 49.4%, and Claude Sonnet 4.6 48.9%, a lead of roughly 16 to 18 percentage points. In path tracking tasks, DeepSeek scored 56.7% compared to GPT-5.4's 46.5%, a lead of about 10 points[reference:34].
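The margins can be checked directly from the quoted scores. The sketch below uses only the numbers reported in this section; the helper name is illustrative.

```python
def leads(scores: dict, model: str = "DeepSeek") -> dict:
    """Percentage-point lead of `model` over every other entry."""
    return {
        name: round(scores[model] - score, 1)
        for name, score in scores.items()
        if name != model
    }


# Scores as quoted in the article (percent).
maze = {"DeepSeek": 66.9, "GPT-5.4": 50.6, "Gemini-3-Flash": 49.4, "Claude Sonnet 4.6": 48.9}
path_tracking = {"DeepSeek": 56.7, "GPT-5.4": 46.5}
counting = {"DeepSeek": 89.2, "Gemini-3-Flash": 88.2, "GPT-5.4": 76.6, "Claude Sonnet 4.6": 68.7}

print(leads(maze))           # leads between 16.3 and 18.0 points
print(leads(path_tracking))  # 10.2 points
print(leads(counting))       # only 1.0 point over Gemini-3-Flash
```

The counting result is worth noting: the lead over the closest competitor is a single point, far smaller than the gaps in the topological tasks.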

In terms of computational efficiency, this feature stands out, consuming only about 90 tokens to process an 800×800 pixel image, while competing models consume hundreds of tokens for the same resolution[reference:35][reference:36]. Fewer tokens per image generally mean faster responses, lower operating cost, and more capacity to handle large volumes of requests. In practical applications, where speed and cost are decisive factors, this efficiency is a significant competitive advantage.

It is worth noting that these numbers represent the model's performance under ideal conditions, and results may vary in real-world applications depending on the nature, quality, and complexity of the images. Nevertheless, they provide a clear indication of where this feature stands among the best that current technology has to offer.

Practical Example: How the Feature Works in a Real-World Scenario

To understand how this new DeepSeek feature works in practice, let's imagine a real-world scenario. Suppose a user visits a museum and takes a photo of a mysterious artifact they know nothing about. They upload the image to DeepSeek and ask it to identify and analyze it[reference:37].

Initially, the model processes the image through the Vision Transformer (ViT), where it is divided into visual units, then significantly compressed using the advanced techniques developed by DeepSeek. After that, the model begins the "Thinking with Visual Primitives" process, identifying the main elements in the image, such as the shape of the artifact, colors, decorations, and any inscriptions present.

During the thinking process, the model generates a series of thoughts that combine linguistic description with spatial coordinates. For example: "I see a stone artifact at coordinates [120,45,380,290], its color is grayish-green, with raised carvings at [200,150,280,220] resembling cuneiform writing. The edges at [100,30,400,310] appear eroded, indicating the artifact's age. The geometric decorations at [150,180,350,260] match the ancient Babylonian pattern."

After this analytical process, the model provides a comprehensive answer to the user, including a detailed description of the artifact, an estimate of its age and cultural origin, and conclusions about its possible function. All of this is done in seconds, using a small number of tokens compared to what traditional models would have consumed. This example illustrates how this feature is not merely an image recognition tool, but a tool for deep understanding and intelligent analysis.

Case Study: Analyzing a Complex Image Step by Step

Let's delve deeper into a specific case study to see how this new DeepSeek feature handles a complex image. Imagine a busy street scene containing many competing elements: cars, pedestrians, traffic lights, shop signs, and trees. The question posed is: "How many red cars are in the image, and where are they relative to the intersection?"

Step One: The model scans the image using the Vision Transformer, identifying all prominent elements. During this step, the image is converted into a compressed token representation, while retaining the spatial information necessary for precise analysis.

Step Two: The model begins the "Thinking with Visual Primitives" process. It first identifies all cars in the image and records their coordinates: "Car at [45,230,120,310], car at [300,200,380,290], car at [520,240,600,330], car at [150,350,230,420], car at [680,320,760,410]." It then filters the red cars: "The car at [300,200,380,290] is red, and the car at [680,320,760,410] is also red."

Step Three: The model analyzes the spatial relationships, identifying the location of the intersection (for example, at coordinates [400,300,500,400]), then compares the positions of the red cars to it, treating the top of the image as north: "The first red car at [300,200,380,290] is northwest of the intersection, and the second at [680,320,760,410] is southeast of the intersection."

Step Four: The model delivers the final answer: "There are two red cars in the image, one northwest of the intersection, and the other southeast of the intersection." This answer is not merely a description, but the result of an organized thinking process, using spatial coordinates as reasoning tools, ensuring high accuracy in the final result.
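The four steps above can be sketched in a few lines of code. This is an illustration of the reasoning, not of how the model works internally. It assumes boxes are [x1, y1, x2, y2] with y growing downward and the top of the image treated as north. The article gives no color for the three non-red cars, so the value "other" is a placeholder, and all class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str
    color: str
    box: tuple  # (x1, y1, x2, y2); assumed corner format, y grows downward


def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def compass(box, reference):
    """Direction of `box` relative to `reference`, with image-up taken as north."""
    cx, cy = center(box)
    rx, ry = center(reference)
    north_south = "north" if cy < ry else "south" if cy > ry else ""
    east_west = "west" if cx < rx else "east" if cx > rx else ""
    return (north_south + east_west) or "same position"


# Step Two: the five cars from the case study ("other" is a placeholder color).
cars = [
    Detection("car", "other", (45, 230, 120, 310)),
    Detection("car", "red", (300, 200, 380, 290)),
    Detection("car", "other", (520, 240, 600, 330)),
    Detection("car", "other", (150, 350, 230, 420)),
    Detection("car", "red", (680, 320, 760, 410)),
]
intersection = (400, 300, 500, 400)

# Steps Two and Three: filter by color, then relate each box to the intersection.
red_cars = [car for car in cars if car.color == "red"]
directions = [compass(car.box, intersection) for car in red_cars]

# Step Four: the final answer.
print(len(red_cars), directions)  # 2 ['northwest', 'southeast']
```

Comparing box centers is a deliberately simple choice. The second car sits only slightly below the intersection's center, so a stricter rule with a tolerance band might call it "east" rather than "southeast".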

Advantages and Limitations: A Balanced View

Through our in-depth review of this new DeepSeek feature, we can summarize the advantages and limitations in a balanced manner. On the advantages side, resource efficiency stands out as one of the most important strengths, as the model consumes a very small number of tokens compared to competitors, meaning higher speed and lower cost. In the reported benchmarks, the model's accuracy in spatial reasoning tasks exceeds that of competing models by a wide margin, while its lead in counting is narrower (about one point over Gemini-3-Flash)[reference:38].

Another notable advantage is the "Thinking with Visual Primitives" philosophy, which gives the model a unique ability to track exactly what it is looking at, reducing errors caused by linguistic ambiguity[reference:39]. The independence of the image recognition feature from internet search also gives the user greater control over the source of information and avoids the distraction of integrating multiple sources[reference:40].

On the limitations side, a knowledge base that ends in 2025 is an obstacle to recognizing recent developments[reference:41]. The model's performance also remains unstable in the face of some difficult visual challenges such as illusions and very precise counting[reference:42]. Finally, the lack of image generation or video understanding capabilities puts the model behind some competitors that offer a broader range of multimodal services[reference:43].

Finally, it can be said that this feature represents a giant step forward, but it is still in its early stages, with wide scope for improvement and development. The balance between advantages and limitations indicates that DeepSeek is moving in the right direction, with full awareness of the challenges to be overcome in the coming stages.

Roleplay: Imagine You Are the Model

Let's try a unique roleplay exercise, imagining that we are the model itself while processing a complex image. Imagine you are the DeepSeek model, and you have just received a photo of a dense forest, with the question: "Are there any predators in this image?" Your thinking process begins calmly and methodically.

Step One in your thinking: "I scan the image, dividing it into a grid of visual units. I see many trees and leaves, but I focus on searching for specific patterns that might indicate the presence of animals. I use my prior knowledge that predators often hide among branches, or their eyes appear distinctly."

Step Two: "I identify a point of suspicion at coordinates [230,450,310,520], where there seems to be an irregular shape among the branches. I mentally zoom in on that area, and notice two shining eyes at [260,480,270,490]. This is a strong signal of an animal's presence."

Step Three: "I try to identify the type of animal. Based on the shape of the eyes and the visible fur at [240,470,290,510], it appears to be a tiger. I check for any other animals, and find another shape at [550,320,620,400], but after analysis I discover it's just a large rock. I confirm the presence of the tiger at [230,450,310,520]."

Step Four: "I formulate the answer: 'Yes, there are predators in this image. I have identified a tiger hiding among the trees in the lower-middle part of the image.' I attach precise coordinates for the tiger's location, so the user can easily locate it." This entire process takes place in seconds, yet reflects an advanced level of organized and precise thinking.
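One property of this trace can be checked mechanically: the supporting evidence (eyes, fur) lies inside the region that was first flagged as suspicious, while the rejected candidate (the rock) does not. The sketch below assumes the boxes are [x1, y1, x2, y2]; the function name is illustrative.

```python
def contains(outer, inner):
    """True if the `inner` box lies entirely within the `outer` box."""
    ox1, oy1, ox2, oy2 = outer
    ix1, iy1, ix2, iy2 = inner
    return ox1 <= ix1 and oy1 <= iy1 and ix2 <= ox2 and iy2 <= oy2


# Boxes taken from the roleplay trace above.
suspicious_region = (230, 450, 310, 520)
eyes = (260, 480, 270, 490)
fur = (240, 470, 290, 510)
rock = (550, 320, 620, 400)

print(contains(suspicious_region, eyes))  # True
print(contains(suspicious_region, fur))   # True
print(contains(suspicious_region, rock))  # False
```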

Frequently Asked Questions

Q: What is the difference between DeepSeek's image recognition feature and similar features in other models?

A: The fundamental difference lies in the "Thinking with Visual Primitives" philosophy, where the model integrates spatial coordinates into the thinking process itself, not just in the final output. This gives it superior accuracy in understanding spatial relationships and counting, with high efficiency in resource usage, consuming far fewer tokens than competitors[reference:44].

Q: Is the image recognition feature available to all users?

A: Yes, it was officially announced on June 18, 2026, and is now available to all users on both web and app platforms[reference:45]. However, some older versions of the app may still show an "Image understanding feature in internal testing" message[reference:46][reference:47].

Q: Can the model recognize text within images?

A: Yes, the model is capable of recognizing and extracting text from images, but the primary focus of the feature is understanding visual content and spatial relationships, not merely extracting text like traditional OCR tools[reference:48].

Q: What types of images can the model process?

A: The model can process a wide range of images, including photographs, charts, maps, scanned documents, and illustrations. However, it does not currently support video processing or image generation[reference:49].

Q: Does the image recognition feature work offline?

A: No, the feature requires an internet connection to work, as images are processed on DeepSeek's cloud servers. However, it does not rely on internet search by default, giving the user greater control over the source of information[reference:50].

The Future: What Lies Ahead?

Looking to the future, this new DeepSeek feature is expected to see tremendous developments in the coming years. The first of these will be in expanding the scope of visual understanding to include video, where the model will be able to analyze moving scenes and understand the temporal sequence of events. This will open new horizons in fields such as security surveillance, sports motion analysis, and complex social interactions.

Secondly, the model's capabilities are expected to expand to include image generation, transforming it into an integrated creative tool capable of creating new visual content based on text descriptions. This will make DeepSeek a direct competitor to systems like DALL-E and Midjourney, with the added advantage of deep contextual understanding.

Thirdly, the field of interaction between different media will see significant development, where the model will be able to seamlessly transition between text, image, audio, and video, creating rich interactive experiences that were not previously possible. Imagine being able to speak to the model about an image, and it responds with voice, and displays illustrative charts, all in one seamless interaction.

Fourthly, the model's efficiency is expected to improve continuously, with reduced response time and increased coordinate accuracy in fine-grained scenes. DeepSeek will also work on continuously expanding the knowledge base to include the latest information and avoid the problem of knowledge obsolescence[reference:51]. All these developments will make DeepSeek a comprehensive AI platform, capable of competing with the biggest players in the market on all fronts.

References and Sources

IT之家. (2026, June 18). DeepSeek 识图模式正式上线 App 和网页端. https://www.ithome.com/0/966/066.htm
科技日报. (2026, May 14). DeepSeek开放识图模式 AI装上了“赛博手指”. https://www.ncsti.gov.cn/kjdt/kjrd/202605/t20260514_246641.html
太平洋科技. (2026, May 1). DeepSeek公开新技术了！多模态模型技术报告公布：超越GPT-5.4. https://g.pconline.com.cn/x/2142/21428411.html
智东西. (2026, April 30). DeepSeek“开眼”背后的技术，公开了！ https://m.zhidx.com/p/555086.html
36氪. (2026, April 30). DeepSeek多模态技术范式公布，以视觉原语思考. https://36kr.com/p/3789208597372165
DeepTech深科技. (2026, April 29). DeepSeek多模态真的来了？识图模式已开始小范围灰度. http://www.163.com/dy/article/KRMS52V105119734.html
DeepSeek. (2026). Thinking with Visual Primitives (Technical Report). GitHub. https://github.com/deepseek-ai/Thinking-with-Visual-Primitives
观察者网. (2026, April 29). DeepSeek内测识图模式，中国头部模型公司全员“睁眼”. http://www.163.com/dy/article/KRN0UEEH051481US.html