On April 24, 2026, DeepSeek publicly released the V4 Pro flagship training model and the V4 Flash lightweight, high-efficiency inference model. The most significant industry shift in this release is a complete break from exclusive dependence on the NVIDIA CUDA ecosystem. On release day, DeepSeek completed native, in-depth adaptation to several mainstream domestic AI chips, becoming the first AI company in China to bring a general-purpose large model to market in step with domestic computing hardware and reshaping the software and hardware ecosystem of China's AI industry. For years, the vast majority of large-model training and inference frameworks worldwide have been built on the NVIDIA CUDA interface, creating a strong lock-in effect across software and hardware. Even when domestic AI chips such as Huawei Ascend, Cambricon, Hygon, and Biren meet the required compute specifications, migrating and adapting a model to them takes months or even years. These long deployment cycles have held back large-scale commercial use of domestic compute clusters, and the disconnect between software and hardware has become the core bottleneck to volume production of domestic AI chips.
For the V4 series, DeepSeek restructured the model architecture and developed its own fine-grained expert-parallel computing scheme. The models completed full end-to-end validation on two hardware platforms, NVIDIA GPUs and Huawei Ascend NPUs, so that code is compiled once and deployed seamlessly across multiple hardware platforms. At the low level, the models are natively optimized for the operator architecture of Ascend's CANN stack, which removes the need for an intermediate translation layer and improves operator invocation efficiency by 27%. At the same time, they remain fully compatible with existing CUDA code. Cloud service providers and AI application vendors can switch flexibly to domestic computing hardware without rewriting their business code, which greatly reduces the cost of ecosystem migration. The inference engine supports both CUDA and CANN and natively integrates FP4 and FP8 ultra-low-precision quantization. Inference memory usage falls by 58%, and the number of concurrent inference requests a single card can serve increases by 2.1 times. In high-frequency commercial scenarios such as ultra-long document parsing, intelligent customer service, code generation, and multimodal content generation, performance is on par with leading overseas solutions, and overall deployment cost-effectiveness is 32% higher.
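To make the precision figures concrete, the following minimal sketch estimates how much memory model weights occupy at different numeric precisions. It is an illustration only: the 100-billion parameter count is a hypothetical value chosen for the example, not a published V4 specification, and the function name is invented here. The sketch counts weights alone, so it does not reproduce the 58% figure above, which refers to overall inference memory usage.

```python
# Bits used to store one parameter at each precision (standard format widths).
BITS_PER_PARAM = {"fp16": 16, "fp8": 8, "fp4": 4}


def weight_memory_gib(num_params: int, precision: str) -> float:
    """Approximate weight storage in GiB, ignoring scale factors and metadata."""
    total_bits = num_params * BITS_PER_PARAM[precision]
    return total_bits / 8 / 2**30


if __name__ == "__main__":
    # Hypothetical parameter count, assumed purely for illustration.
    hypothetical_params = 100_000_000_000
    baseline = weight_memory_gib(hypothetical_params, "fp16")
    for precision in ("fp16", "fp8", "fp4"):
        size = weight_memory_gib(hypothetical_params, precision)
        saving = 1 - size / baseline
        print(f"{precision}: {size:.1f} GiB ({saving:.0%} smaller than fp16)")
```

Relative to FP16, weights stored in FP8 take half the space and weights stored in FP4 take a quarter, which is why low-precision quantization frees room for more concurrent requests on a single card.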
The first batch of adapted hardware covers six mainstream domestic products: the full Huawei Ascend 310 and Ascend 910 series of training and inference chips, Cambricon Siyuan (MLU) series accelerators, the Hygon DCU-3G general-purpose compute chip, Moore Threads general-purpose AI graphics cards, and the Biren Bili 166 training chip. All adaptation work was completed and commercially launched at the same time as the models, making this the industry's first "full-stack domestic deployment at model release." Previously, domestic chips typically lagged 6 to 12 months behind in adapting to large models. Cloud vendors' self-built domestic compute clusters could only take on customized, niche workloads and could not serve mainstream large-model businesses. Rack utilization stayed low, and the payback period on investment was prolonged. With this synchronized adaptation, domestic public cloud vendors can connect their self-built Ascend clusters directly to standardized general-purpose large-model services. Rack utilization is expected to rise from 35% to over 75%, which would greatly improve the commercial profitability of domestic computing infrastructure.
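The effect of the utilization change on unit economics can be sketched with simple arithmetic. The example below assumes that a cluster's fixed costs stay the same while utilization rises, which the article does not state; the function names are hypothetical.

```python
def utilized_rack_multiplier(before: float, after: float) -> float:
    """How many times more racks carry workloads after the utilization change."""
    return after / before


def fixed_cost_reduction_per_utilized_rack(before: float, after: float) -> float:
    """Fractional drop in fixed cost per utilized rack, assuming total fixed cost is unchanged."""
    return 1 - before / after


if __name__ == "__main__":
    before, after = 0.35, 0.75  # utilization figures quoted in the article
    print(f"Utilized racks: x{utilized_rack_multiplier(before, after):.2f}")
    print(
        "Fixed cost per utilized rack: "
        f"-{fixed_cost_reduction_per_utilized_rack(before, after):.0%}"
    )
```

Under that assumption, moving from 35% to 75% utilization roughly doubles the number of racks in productive use and cuts the fixed cost borne by each utilized rack by about half.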
From the perspective of industrial synergy, the two-way pairing of "domestic models and domestic chips" has formally entered the stage of large-scale deployment. In the past, domestic AI chip hardware kept iterating, but the upper-layer application ecosystem was missing: production capacity kept climbing while there were no downstream models to absorb it. Domestic large-model developers, meanwhile, relied heavily on overseas computing hardware, which kept compute procurement costs high and made it hard to commercialize models profitably. This two-way adaptation clears the bottlenecks on both sides. Model developers gain a low-cost, localized supply of compute, and chip makers gain standardized upper-layer application scenarios, forming a virtuous industrial cycle. Government and enterprise Xinchuang (IT application innovation) procurement, private AI deployments at central state-owned enterprises, and newly built local intelligent computing centers now all have complete domestic software and hardware alternatives and no longer need to purchase overseas GPU clusters.
According to data from commercial deployments, in retrieval over ultra-long documents of millions of words, the DeepSeek V4 Flash model running on the Ascend 910B chip delivers an average response latency of 128 ms per request and a concurrent load capacity 18% higher than that of the NVIDIA A10 chip in the same price range. In code generation, the FP8 quantization scheme more than doubles inference throughput and reduces the compute cost per token by 41%. For internet companies and government and enterprise organizations building their own private AI clusters, the overall three-year TCO (total cost of ownership) is 29% lower than that of overseas hardware solutions, a clear economic advantage.
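The relationship between throughput and per-token cost can be illustrated with a simple model in which cost per token equals the hourly cost of the hardware divided by the tokens produced per hour. This model and the function names are assumptions made for the example, not the method behind the article's figures. Under it, doubling throughput at an unchanged hourly cost would cut the per-token cost by 50%, so a 41% reduction alongside a 2x throughput gain would correspond to an hourly cost about 18% higher; the article does not say what accounts for the difference.

```python
def cost_per_token(hourly_cost: float, tokens_per_second: float) -> float:
    """Cost of one token under the simple model: hourly cost / tokens per hour."""
    return hourly_cost / (tokens_per_second * 3600)


def implied_hourly_cost_ratio(throughput_ratio: float, token_cost_ratio: float) -> float:
    """Hourly cost ratio (new / old) implied by throughput and per-token cost ratios."""
    return throughput_ratio * token_cost_ratio


if __name__ == "__main__":
    # Article figures: throughput more than doubles (2.0 used as a lower bound),
    # and per-token cost falls by 41%, i.e. a ratio of 0.59.
    ratio = implied_hourly_cost_ratio(throughput_ratio=2.0, token_cost_ratio=0.59)
    print(f"Implied hourly cost ratio: {ratio:.2f}")
```

Because the article says throughput "more than doubled," 2.0 is a lower bound, and the implied hourly cost ratio would be higher for larger throughput gains.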
At the market-segment level, inference has replaced training as the mainstream source of demand for AI chips. By 2026, inference will account for more than 70% of global AI compute spending. The lightweight V4 Flash model is well matched to the large volume of device-side, edge-side, and cloud-side inference demand, and the high-concurrency, low-power scenarios in which domestic AI chips perform best are attracting large orders. Several local intelligent computing center operators have simultaneously issued procurement notices that prioritize domestically produced Ascend machines for new compute racks and deploy models natively adapted by DeepSeek.
Obstacles to migrating the existing ecosystem remain in the short term. Many small and medium-sized AI application service providers have built their businesses on CUDA and face an upfront investment in hardware conversion when switching, and leading overseas model developers remain exclusively tied to the NVIDIA ecosystem. Even so, domestic government and enterprise procurement has a clear policy orientation toward localization, the economic advantages continue to grow, and the domestic software and hardware ecosystem now forms a closed loop. DeepSeek's adaptation sets a benchmark. Other leading domestic model developers, such as ERNIE Bot and Tongyi Qianwen, are expected to accelerate their own full-stack domestic adaptation, and the inflection point for large-scale shipments of domestic AI chips has clearly arrived.