The field of AI is evolving rapidly, and one important direction is how to design neural networks more efficiently and intelligently. Much as a master chef designs dishes or an architect designs buildings, building a high-performance neural network usually demands deep expertise, experience, and repeated trial and error. Differentiable Architecture Search (DARTS) was created precisely to automate this complex process.
1. What Is DARTS? AI's "Automatic Designer"
In artificial intelligence, and deep learning in particular, a neural network's "architecture" refers to its structure: how many layers it has, which operations each layer uses (convolution, pooling, activation functions, and so on), and how those operations are connected. Traditionally, architectures have been hand-designed by human experts based on experience, which is time-consuming, labor-intensive, and offers no guarantee of finding the best solution.
Imagine you own a restaurant and want to launch a new dish. You could hire an experienced chef (the human expert) to design the recipe. The chef picks ingredients and cooking methods from experience, then iterates many times before settling on a dish that tastes great. The process depends heavily on the chef's skill, and it is slow.
The goal of Neural Architecture Search (NAS) is to let AI do the chef's job itself, and DARTS is a particularly efficient and elegant method in the NAS family. It differs from earlier NAS methods (for example, those based on reinforcement learning or evolutionary algorithms), which typically have to try an enormous number of discrete architecture combinations and burn huge amounts of compute, like having a robot test every possible combination of ingredients and cooking techniques to find the best recipe.
The core idea of DARTS is to turn the originally discrete question of "which operation do I choose?" into a continuous one that can be fine-tuned. Instead of simply choosing between "add salt" and "add sugar", we can adjust the ratio precisely, say "0.3 parts salt and 0.7 parts sugar". Through this kind of "soft selection", DARTS can optimize the network's structure with the familiar gradient descent method, which greatly improves search efficiency.
2. How DARTS Works: The Birth of a "Fusion Dish"
To understand how DARTS achieves this soft selection, a "fusion dish" metaphor helps.
1. Building the "Super Kitchen": Defining the Search Space
First, we need a "super kitchen" containing every possible operation, which DARTS calls the search space. This space does not describe the entire neural network; it describes the internal structure of the network's basic building block, usually called a cell.
- Ingredients and cooking tools (the operation set): At each "cooking step" (a connection between nodes), we can pick a different way of handling the ingredients, for example dicing (3x3 convolution), slicing (5x5 convolution), blanching (max pooling), flash-frying (average pooling), or doing nothing at all (a skip connection that passes the input straight through). DARTS predefines 8 candidate operations to choose from; a sketch of such an operation set appears right after this list.
- The recipe skeleton (the cell): The aim is to design one core "recipe unit". A cell typically takes two inputs (say, the essence of the two previous dishes), runs them through a series of internal cooking steps, and finally produces one output. Stacking many copies of this cell builds the whole "banquet" (the complete neural network).
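To make the "ingredient list" concrete, here is a minimal PyTorch sketch of a candidate operation set. The eight operations follow the ones used in the original DARTS paper (separable and dilated convolutions rather than plain convolutions, 3x3 pooling, identity, and a "zero" op); the helper names such as `candidate_ops` are illustrative, not part of any official API, and the real implementation also handles stride-2 reduction cells, which this sketch omits.

```python
import torch
import torch.nn as nn

class Zero(nn.Module):
    """The 'none' operation: outputs zeros, effectively removing this edge."""
    def forward(self, x):
        return torch.zeros_like(x)

def sep_conv(C, kernel_size):
    """Depthwise-separable convolution (a stand-in for DARTS' sep_conv ops)."""
    return nn.Sequential(
        nn.ReLU(),
        nn.Conv2d(C, C, kernel_size, padding=kernel_size // 2, groups=C, bias=False),
        nn.Conv2d(C, C, 1, bias=False),
        nn.BatchNorm2d(C),
    )

def dil_conv(C, kernel_size, dilation=2):
    """Dilated depthwise-separable convolution."""
    return nn.Sequential(
        nn.ReLU(),
        nn.Conv2d(C, C, kernel_size, padding=dilation * (kernel_size // 2),
                  dilation=dilation, groups=C, bias=False),
        nn.Conv2d(C, C, 1, bias=False),
        nn.BatchNorm2d(C),
    )

def candidate_ops(C):
    """The eight candidate operations on one edge of a cell (C = channel count)."""
    return {
        "sep_conv_3x3": sep_conv(C, 3),
        "sep_conv_5x5": sep_conv(C, 5),
        "dil_conv_3x3": dil_conv(C, 3),
        "dil_conv_5x5": dil_conv(C, 5),
        "max_pool_3x3": nn.MaxPool2d(3, stride=1, padding=1),
        "avg_pool_3x3": nn.AvgPool2d(3, stride=1, padding=1),
        "skip_connect": nn.Identity(),   # "do nothing": pass the input straight through
        "none":         Zero(),          # drop this edge entirely
    }
```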
2. Mixing the "Magic Seasoning Packet": Continuous Relaxation
Traditional methods explicitly pick one operation from the menu at every cooking step. The trick in DARTS is to introduce a "magic seasoning packet": at any given step, instead of choosing a single operation, we blend all candidate operations with certain weights into one "mixed operation".
For example, at some step we do not choose "dicing" or "blanching"; we use a mix of "50% dicing + 30% blanching + 20% doing nothing". These percentages are the architecture parameters (α) in DARTS; they are continuous and can be fine-tuned.
In this way, the problem of making hard choices in a discrete space becomes a problem of adjusting ratios in a continuous space, and we end up with a "super recipe" (a supernet) that contains every possible structure at once.
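Below is a minimal sketch of that soft selection, reusing the hypothetical `candidate_ops` helper from the previous sketch. Each edge of the cell holds all candidate operations and blends their outputs using softmax-normalized architecture parameters, so "which operation" becomes a vector of continuous mixing ratios.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge of the cell: a softmax-weighted blend of every candidate operation."""
    def __init__(self, C):
        super().__init__()
        # All eight candidates live side by side; none is ever "chosen" during search.
        self.ops = nn.ModuleList(candidate_ops(C).values())

    def forward(self, x, alpha):
        # alpha: the architecture-parameter vector for this edge (length 8).
        # softmax turns raw scores into mixing ratios that sum to 1,
        # e.g. "50% dicing + 30% blanching + 20% doing nothing".
        weights = F.softmax(alpha, dim=-1)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# A whole cell keeps one alpha row per edge, e.g. 14 edges x 8 candidate ops,
# initialized near zero so every operation starts with roughly equal weight.
alpha = nn.Parameter(1e-3 * torch.randn(14, 8))
```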
3. "Taste First, Then Adjust": Bilevel Optimization
Given the magic seasoning packet and the super recipe, how does DARTS find the best ratios? It uses a two-step optimization strategy known as bilevel optimization:
- Inner optimization (tuning the dish itself): Imagine you cook a fusion dish using the current mixing ratios (the architecture parameters α). With the seasoning packet fixed, you quickly taste and adjust the "fine details of heat and timing" (the model weights w) so the dish tastes as good as possible at the "training table" (the training set).
- Outer optimization (tuning the seasoning packet): Once the dish tastes reasonable, you serve it at a separate "customer tasting table" (the validation set) and collect feedback. From the customers' comments you learn whether there is too little "dicing" or too much "blanching", and you go back and adjust the recipe of the magic seasoning packet (the architecture parameters α) so the next dish pleases the customers more.
The two steps alternate, just as a chef tastes and tweaks while cooking and revises the overall recipe based on feedback. When the ratios in the magic seasoning packet finally settle at their best values, we have the optimal structure for the recipe unit.
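Here is a rough sketch of that alternating loop, using the simpler first-order approximation of DARTS (the paper also derives a more expensive second-order update). It assumes a `supernet` built from MixedOp edges that accepts the external `alpha` tensor from the previous sketch, plus placeholder `train_loader` and `val_loader` data loaders; the hyperparameter values are merely typical ones, not a prescription.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Two optimizers: one for the ordinary weights w, one for the architecture parameters alpha.
# (alpha is assumed to live outside supernet.parameters(), as in the earlier sketch.)
w_optimizer = torch.optim.SGD(supernet.parameters(), lr=0.025,
                              momentum=0.9, weight_decay=3e-4)
alpha_optimizer = torch.optim.Adam([alpha], lr=3e-4,
                                   betas=(0.5, 0.999), weight_decay=1e-3)

for (x_train, y_train), (x_val, y_val) in zip(train_loader, val_loader):
    # Outer step: taste at the "customer table" and adjust the seasoning packet
    # (alpha) using the validation loss.
    alpha_optimizer.zero_grad()
    criterion(supernet(x_val, alpha), y_val).backward()
    alpha_optimizer.step()

    # Inner step: with the packet fixed, refine the cooking details (weights w)
    # using the training loss.
    w_optimizer.zero_grad()
    criterion(supernet(x_train, alpha), y_train).backward()
    w_optimizer.step()
```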
4. "Locking In" the Best Recipe: Discretization
When training ends and the architecture parameters (α) have stabilized, the weight of every sub-operation in each mixed operation is fixed. DARTS then keeps the sub-operation with the largest weight in each mixed operation, producing a concrete, discrete neural network structure. It is like finally deciding that "dicing" is the winner out of "50% dicing + 30% blanching".
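A minimal sketch of that read-off step is below, assuming the same hypothetical alpha layout as in the earlier sketches (one row per edge, one column per candidate operation). As in the original paper, the "none" operation is excluded when picking the winner; the full DARTS derivation additionally keeps only the two strongest incoming edges per node, which this sketch omits.

```python
import torch
import torch.nn.functional as F

OP_NAMES = ["sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
            "max_pool_3x3", "avg_pool_3x3", "skip_connect", "none"]

def discretize(alpha):
    """For each edge, keep only the candidate with the largest mixing weight."""
    weights = F.softmax(alpha, dim=-1).clone()   # turn scores into ratios per edge
    weights[:, OP_NAMES.index("none")] = 0.0     # an edge is never "nothing at all"
    return [OP_NAMES[int(i)] for i in weights.argmax(dim=-1)]

alpha = 1e-3 * torch.randn(14, 8)                # e.g. 14 edges, 8 candidate ops
print(discretize(alpha))                         # one surviving operation per edge
```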
3. Advantages and Challenges of DARTS
Advantages: fast and accurate
- High efficiency: Because the search can be optimized with gradient descent, DARTS is orders of magnitude faster than traditional black-box search methods and can find high-performance architectures within a few GPU-days (sometimes less).
Challenges: the road to a great dish is not smooth
- Performance collapse: Efficient as it is, DARTS sometimes suffers from "performance collapse". As the search proceeds, the discovered architecture tends to overuse skip connections (doing nothing and simply passing the data through), and the resulting model performs poorly. It is as if the magic seasoning packet drifts more and more toward "add nothing", and the final dish comes out bland. A simple way to monitor this drift is sketched right after this list.
- Memory consumption: Training a "super recipe" that holds all possible operations at once still requires a lot of memory.
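As a rough illustration, one common way to watch for collapse is simply to count how many edges of the derived cell end up as bare skip connections; DARTS+ (discussed below) reportedly uses a threshold of this kind as its early-stopping signal. The sketch reuses the hypothetical `discretize` helper and `alpha` tensor from the previous sketches, and the threshold value is only an example.

```python
def skip_connect_count(alpha):
    """Number of edges in the derived cell that collapse to a bare skip connection."""
    return sum(op == "skip_connect" for op in discretize(alpha))

# Illustrative early-stopping check, in the spirit of DARTS+: if the derived cell
# already contains more than, say, 2 skip connections, stop the architecture search.
if skip_connect_count(alpha) > 2:
    print("Too many skip connections - consider stopping the search here.")
```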
4. Recent Progress: Overcoming the Challenges, Toward More Robust Automated Design
Researchers have proposed many improvements that target DARTS's performance collapse. For example:
- DARTS+: introduces an early-stopping mechanism, like halting the adjustment in time once the magic seasoning packet starts to drift, so that over-optimization does not degrade performance.
- Fair DARTS: further analysis suggests that collapse happens partly because some operations (such as skip connections) enjoy an "unfair advantage" in the competition. Fair DARTS adjusts the optimization so that operations compete more fairly and encourages the architecture weights to move toward 0 or 1, yielding more robust architectures; a small sketch of the idea appears after this list.
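As a minimal sketch of the Fair DARTS idea (not the authors' exact code): the softmax gate, which forces all operations to compete for a fixed budget, is replaced by independent sigmoid gates, and an auxiliary "zero-one" loss pushes each gate toward 0 or 1 so that the later discretization step loses little information. The function names and the `strength` coefficient are illustrative.

```python
import torch

def fair_gates(alpha):
    """Independent sigmoid gate per operation: turning one op up no longer
    forces the others down, removing the skip connection's built-in advantage."""
    return torch.sigmoid(alpha)

def zero_one_loss(alpha, strength=1.0):
    """Auxiliary loss (added to the architecture objective) that rewards gates
    for moving away from 0.5, i.e. toward a clear keep/drop decision."""
    g = torch.sigmoid(alpha)
    return -strength * torch.mean((g - 0.5) ** 2)
```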
5. Conclusion
As the pioneering work on differentiable architecture search, DARTS moved neural network design from heavy manual labor toward intelligent automation. It has profoundly changed how AI models are developed, letting researchers and engineers explore better network structures faster and more efficiently. Despite challenges such as performance collapse, continuous improvement and innovation around DARTS and its descendants keep pushing the field forward, turning AI into an ever better "automatic designer" that builds stronger, more refined intelligent systems for us.