male-1: Welcome back to Byte-Sized Breakthroughs, the podcast where we dissect the latest advancements in machine learning. I’m your host, Alex Askwell, and today we have an incredibly exciting paper to discuss: 'Learning to learn by gradient descent by gradient descent'. Joining me are two experts, Dr. Paige Turner, the lead researcher on this work, and Prof. Wyd Spectrum, an authority in the field. Welcome, both of you. female-1: Thanks, Alex, happy to be here. female-2: Thanks for having me, Alex. male-1: Paige, let’s start with the big picture. What problem were you trying to solve, and why was it important? female-1: Certainly, Alex. In machine learning, we've seen a huge shift from hand-designed features to learning them automatically. However, the optimization algorithms themselves – the methods we use to train those models – are still largely designed by hand. This presents a problem because these hand-designed methods are often tailored for specific types of problems. We wanted to explore a meta-learning approach: could we learn the optimization algorithm itself? male-1: This raises a fundamental point, Paige, that many in the field grapple with. We spend a lot of effort hand-crafting optimization algorithms, often based on intuitions and heuristics. So, what's the issue with the traditional approach, Wyd? female-2: It's about generalizability, Alex. Traditional optimizers, like Gradient Descent, Momentum, Adam, RMSprop and Nesterov's Accelerated Gradient (NAG), are designed with certain assumptions in mind. For example, Adam is very popular because it has good performance in a lot of different settings, but it is certainly not optimal everywhere. The 'No Free Lunch' theorem states that no single optimization algorithm is universally superior across all optimization problems. Each method has strengths and weaknesses, and they may not perform well outside of their designed scope. 
Different research communities, working in different domains, end up designing very different algorithms, which further highlights this point. This leads us to consider whether we can learn a better optimization procedure tailored to a specific class of problems, and this paper addresses this point directly. male-1: So, rather than rely on hand-designed methods, you're proposing to learn an optimization algorithm. This sounds like a significant departure, Paige. Can you explain how this is done and what the main innovations in the paper are? female-1: Precisely, Alex. Instead of having a hand-designed algorithm that updates parameters of a function, we use a recurrent neural network, specifically a Long Short-Term Memory (LSTM) network, to parameterize the update rule. Think of it as an algorithm that learns to update other algorithms. The main innovations are: 1. **The meta-learning approach:** Casting the optimization algorithm design itself as a learning problem. We are trying to learn a general procedure that will perform well across a range of similar optimization problems. 2. **The coordinatewise LSTM optimizer:** We use a separate LSTM network for each parameter of the optimizee, meaning each parameter is updated using its own independent LSTM state. However, each LSTM uses the same weights. This allows us to scale this method to high dimensional problems because we do not need a huge LSTM network. It also makes the optimizer invariant to the order of parameters in the network, which is a nice property. 3. **BPTT for optimizer training:** We train the LSTM optimizer using Backpropagation Through Time (BPTT) by treating the whole optimization trajectory as the training signal. This allows us to train on partial trajectories rather than only the final optimization step, which is often inefficient. 4. 
**Transfer Learning as Generalization:** We approach transfer learning as a problem of generalization, which means that knowledge learned in one optimization task can be used on a new but similar optimization task. male-1: That's fascinating. So, you’re using LSTMs, a type of recurrent neural network, to learn to optimize another function. Wyd, can you comment on why LSTMs were chosen for this task? female-2: LSTMs are well-suited for this because they can learn temporal dependencies. In optimization, a good update to a parameter often depends on the history of gradients encountered, and an LSTM can naturally integrate this history through its hidden state. By learning how to update parameters using past gradients as information, the LSTM can essentially develop a sense of 'momentum', or other kinds of update rules, automatically, rather than having those rules specified by hand as they are in Adam or RMSprop. male-1: Paige, let's dive into the details of how this LSTM-based optimizer is trained. Can you explain the process? female-1: Certainly, Alex. The core idea is that we want the LSTM network to produce good updates to the parameters of a target function. We call the target function the 'optimizee', and the LSTM that does the optimization is the 'optimizer'. Here's the training procedure: 1. **Initialization:** We start by randomly initializing the parameters of both the LSTM optimizer (which we denote as φ) and the function we want to optimize, the optimizee (whose parameters we denote as θ). 2. **Sample a function:** We then sample a function from a distribution of similar functions, so the optimizer is trained on a distribution of problems rather than a single one. This mirrors how we train ordinary machine learning models: we expect them to perform well on new samples from the training distribution, not on data unlike anything they have seen. 3. **Optimization Trajectory:** We then let the LSTM optimizer update the optimizee for a set number of steps, denoted as T. At each step, we calculate the gradient of the optimizee function (∇f(θt)). 
We input this gradient and the previous hidden state of the LSTM, denoted as ht, into the LSTM. The LSTM outputs an update to the optimizee parameters, which we call gt. The update to the optimizee is θt+1 = θt + gt. The LSTM also produces a new hidden state, ht+1, which is fed into the LSTM at the next time step. 4. **Loss Calculation:** After running for T steps, we calculate the loss of the optimizer, which we denote as L(φ). We calculate L(φ) by taking the sum of the values of the function over the T steps of the trajectory. In the paper, we used equal weights for simplicity, meaning each function value contributed equally to the loss. However, in principle, we could also give more weight to some steps over others. 5. **Backpropagation:** We then compute the gradient of this loss, ∂L(φ)/∂φ, with respect to the optimizer's parameters (φ), using BPTT. This means we effectively propagate the gradient through the entire unrolled computation graph of the LSTM and the optimizee. 6. **Update Optimizer:** Finally, we update the parameters of the LSTM optimizer using gradient descent (specifically, using the Adam optimizer on the LSTM's parameters). This allows the optimizer itself to learn how to optimize functions. Crucially, in step 5, we drop the gradients flowing through the calculation of the gradient of the optimizee itself. That is we assume ∂∇t/∂φ = 0. This means we are avoiding calculation of second-order derivatives of the optimizee, which would be significantly more expensive to calculate. male-1: So, to clarify, the LSTM is essentially learning how to update the parameters of *another* function, and that learning process is guided by the loss function, which is evaluated over the optimization trajectory rather than only the final optimization value. Can you explain why the paper chose to include intermediate steps in the loss function, instead of only using the final step? female-1: That's right, Alex. 
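In symbols, the training procedure just described can be summarized roughly as follows. This is a sketch in notation assumed here for illustration: m stands for the LSTM optimizer with parameters φ, ∇t is shorthand for the optimizee gradient at step t, and wt are the per-step weights (all equal to one in the experiments discussed).

```latex
\begin{aligned}
\nabla_t &= \nabla_\theta f(\theta_t), \\
\begin{pmatrix} g_t \\ h_{t+1} \end{pmatrix} &= m(\nabla_t, h_t, \phi), \\
\theta_{t+1} &= \theta_t + g_t, \\
L(\phi) &= \mathbb{E}_f\!\left[\, \sum_{t=1}^{T} w_t \, f(\theta_t) \right], \qquad w_t = 1, \\
\frac{\partial \nabla_t}{\partial \phi} &= 0 \quad \text{(assumed, so no second derivatives of } f \text{ are needed)}.
\end{aligned}
```

Written this way, it is easy to see why the weights matter: setting every weight except the last to zero would leave only the final function value in the sum.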
If we only used the final step in the loss function, then only the final step of the trajectory would provide information for training the optimizer. This would make training with BPTT extremely inefficient. In the paper we use a weighted sum of the optimizee function at each time step, and for simplicity the weights were all set to one. In theory, by using different weights, we could guide the optimization process in different ways. male-1: Wyd, from your perspective, is this approach more computationally expensive than traditional optimization methods? female-2: It is, Alex. Training this LSTM optimizer requires significant computational resources, because you need to backpropagate through the entire optimization trajectory and then update the LSTM parameters. However, after the LSTM is trained you can use it to optimize many different functions. And, the experiments in the paper show that once trained, this approach can outperform traditional methods, which highlights its potential. In some sense you are trading off the computational cost during the training phase, for the computational savings once the optimizer is trained. male-1: Paige, you also mentioned coordinatewise LSTMs. Can you explain the significance of this architectural choice? female-1: Certainly, Alex. The coordinatewise LSTM optimizer is critical for scaling this approach to large models. Instead of having one large LSTM network that outputs updates for all parameters of the optimizee, we use a separate LSTM for each parameter of the optimizee. However, these LSTMs share the same parameters. This means that there is a single shared LSTM that operates on each parameter individually. This drastically reduces the number of parameters in our optimizer. This allows us to avoid a huge hidden state and an enormous number of parameters. Moreover, this makes the optimizer invariant to the order of parameters in the network, since the same update rule is used independently on each coordinate. 
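The sharing scheme can be illustrated with a minimal Python sketch. It rests on loud assumptions: `shared_cell` is a hypothetical two-parameter recurrent rule standing in for the LSTM, the values in `phi` are hand-picked rather than learned, and the toy objective is a separable quadratic. It shows only the structure: one shared parameter set, one private state per coordinate, and the resulting indifference to parameter order.

```python
import math

def shared_cell(grad, state, phi):
    """One step of a tiny recurrent rule for a single coordinate.
    Hypothetical stand-in for the LSTM; phi = (decay, step) is shared."""
    decay, step = phi
    new_state = decay * state + (1.0 - decay) * grad  # summary of past gradients
    update = -step * math.tanh(new_state)             # bounded update g_t
    return update, new_state

def coordinatewise_step(theta, grads, states, phi):
    """Apply the same cell (same phi) to every coordinate, each with its own state."""
    new_theta, new_states = [], []
    for th, g, h in zip(theta, grads, states):
        update, h_next = shared_cell(g, h, phi)
        new_theta.append(th + update)  # theta_{t+1} = theta_t + g_t
        new_states.append(h_next)
    return new_theta, new_states

def run(theta, grad_fn, phi, steps):
    """Run the optimizer for a fixed number of steps from zero-initialized states."""
    states = [0.0] * len(theta)
    for _ in range(steps):
        theta, states = coordinatewise_step(theta, grad_fn(theta), states, phi)
    return theta

def separable_grad(curvatures):
    """Gradient of the toy objective sum_i c_i * theta_i^2 (an assumption for the demo)."""
    return lambda theta: [2.0 * c * t for c, t in zip(curvatures, theta)]

phi = (0.9, 0.1)              # two shared parameters, whatever the dimension
curvatures = [1.0, 4.0, 0.25]
theta0 = [1.0, -2.0, 3.0]
perm = [2, 0, 1]

result = run(theta0, separable_grad(curvatures), phi, steps=50)
result_permuted = run([theta0[i] for i in perm],
                      separable_grad([curvatures[i] for i in perm]), phi, steps=50)

# Reordering the coordinates only reorders the result.
print(result_permuted == [result[i] for i in perm])
```

The size of `phi` does not grow with the number of optimizee parameters; only the per-coordinate state does, which is the scaling argument made here.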
Standard update rules like Adam and RMSprop are also coordinatewise, so we are essentially learning a complex coordinatewise update rule rather than defining one by hand. male-1: That makes sense. Now, let's talk about the experiments. You tested this approach on several tasks. Let's start with the quadratic functions. Can you walk us through what you did and the key results? female-1: Okay, Alex. We started by testing the approach on a simple class of synthetic 10-dimensional quadratic functions. We chose this class because it's relatively simple and we can easily control the complexity of the optimization problem. We generated random quadratic functions of the form f(θ) = ||Wθ - y||^2, where W was a random 10x10 matrix and y was a random 10-dimensional vector. The elements of both W and y were sampled i.i.d. from a Gaussian distribution. We then trained the LSTM optimizer on a distribution of these randomly sampled quadratic functions, and tested on newly sampled ones. Each function was optimized for 100 steps, and the optimizer was unrolled for 20 steps at a time during training. We did not use any preprocessing or postprocessing for this experiment. The key result here was that the learned LSTM optimizer significantly outperformed traditional optimizers such as SGD, Adam, RMSprop and NAG. Specifically, looking at the left plot of Figure 4 of the paper, the LSTM optimizer had a lower loss at the end of the optimization trajectory than all the baseline methods. male-1: Those are very strong results. What about the MNIST task? How did the learned optimizer perform when optimizing neural networks? female-1: For the MNIST experiment, the optimizee was a small MLP (multi-layer perceptron) with one hidden layer of 20 units and sigmoid activations. The goal was to classify the MNIST digits. We used a batch size of 128 to estimate the value of the cross-entropy objective function, and the gradients of this function. 
We then trained the LSTM optimizer to optimize the parameters of this base network and again tested it on different runs. Here we preprocessed the inputs and rescaled the outputs of the LSTM to improve numerical stability. Looking at the center plot of Figure 4, the LSTM optimizer significantly outperformed all of the hand-designed baseline methods. The final loss obtained using the LSTM was much lower than the loss obtained by other methods. The right panel shows that the LSTM continues to outperform the hand-designed methods if we allow it to run for longer than it was trained for. The LSTM was trained for 100 steps but was unrolled for 200 steps for this comparison, and the LSTM still performed better. male-1: This shows good transfer learning, meaning the optimizer can perform well on tasks it was not explicitly trained for. Did you see evidence of transfer learning in any other scenarios? female-1: Yes, Alex. We explored this systematically. We tried varying the architecture of the MLP. For example we tried doubling the number of hidden units to 40, adding a second hidden layer, and even changing the activation function. For the experiments with doubled units and adding a second layer, the LSTM generalized well. However, changing the activation function to ReLU resulted in a large performance drop, which suggests that the dynamics of the underlying optimization problem changed too much for the learned optimizer to perform well. We systematically studied these effects, and presented results in Figure 6 which shows that the learned LSTM optimizer performs well even when faced with different network architectures than it was trained on, as long as the network is not too different. male-1: That's an important detail, highlighting both the strength and potential fragility of the generalization. What about the CIFAR-10 experiments? 
female-1: In the CIFAR-10 experiments, we used a more complex optimizee: a convolutional network with three convolutional layers, max pooling, and a fully connected layer with 32 hidden units, along with ReLU activations and batch normalization. We found that the coordinatewise optimizer we used for the MNIST experiment wasn't sufficient. Instead, we split the LSTM optimizer into two separate LSTMs, one for the fully connected layers and another for the convolutional layers, using coordinatewise optimization. Each LSTM had a shared set of weights, but separate hidden states as before. In Figure 7, the left plot demonstrates that the LSTM optimizer outperforms the baselines on a held out test set. The other two plots in Figure 7 show that the LSTM optimizer also has excellent performance when used on subsets of the CIFAR dataset, denoted as CIFAR-5 and CIFAR-2. An additional optimizer, denoted as LSTM-sub, was trained only on the subset of the data, and we can see that the performance of this optimizer is comparable to the original LSTM optimizer. This indicates that transfer to a totally unseen dataset is possible. male-1: So even when the problem is quite complex, the LSTM optimizer exhibits superior learning performance and transfer learning? Wyd, does this match your expectations? female-2: It's quite impressive. The ability of the LSTM to outperform hand-designed optimizers in these diverse settings highlights the power of the meta-learning approach. It also shows how valuable it is to move away from human intuition for designing update rules, which can sometimes lead to methods which may not be ideal for every optimization task. The fact that this optimizer can be trained on some dataset, and then used to optimize a network on a completely disjoint dataset, is another important result. male-1: Let's move on to the neural art experiments. How did the LSTM optimizer perform in the context of style transfer? 
female-1: In the neural art experiments, we trained optimizers to perform artistic style transfer, where an image is styled according to the artistic style of another image. Each content and style image pair gives rise to a different optimization problem. The objective function is a combination of content loss, style loss, and regularization, as described in the Gatys et al., 2015 paper. We trained the optimizer using one style and 1800 content images. We then tested it on 100 new content images and new styles. Crucially, we trained the optimizer on 64x64 images and tested on 128x128 images. The results in Figure 8 show that the LSTM optimizer outperforms all standard optimizers when both style and resolution are the same as the training setting, and continues to perform well when these are changed. These are really impressive results, as we are seeing generalization to both new styles, and new image sizes, at the same time. You can find images that were styled by the LSTM in Figure 9 and also in Appendix C, which illustrates the visual quality of the results. male-1: Those results are striking. It seems this method displays a remarkable ability to generalize across several tasks and scenarios. Before we move on, can you explain a preprocessing step the authors made? female-1: Yes, Alex. A key aspect of training the LSTM optimizer was gradient preprocessing. Different parameters of the optimizee can have very different magnitudes of gradients and this makes training the optimizer more difficult. To overcome this, we preprocessed the gradients using a log transformation, that is described more fully in Appendix A. We provide the optimizer both with the log of the magnitude of the gradient and the sign. However, we also take special care to account for gradients that are very close to zero. 
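A minimal Python sketch of this kind of preprocessing follows. The conversation only states that the optimizer receives the log-magnitude and the sign, with special handling near zero; the exact piecewise form, the name `preprocess_gradient`, and the default p = 10 are assumptions made here for illustration and should be checked against Appendix A of the paper.

```python
import math

def preprocess_gradient(grad, p=10.0):
    """Map a scalar gradient to two inputs of comparable scale.

    Assumed form (verify against the paper's Appendix A):
      |grad| >= exp(-p): (log|grad| / p, sign(grad))
      otherwise:         (-1, exp(p) * grad)
    """
    threshold = math.exp(-p)
    if abs(grad) >= threshold:
        # Magnitude is compressed logarithmically; the sign is passed separately.
        return (math.log(abs(grad)) / p, math.copysign(1.0, grad))
    # Near zero the log would diverge, so fall back to a rescaled raw value.
    return (-1.0, math.exp(p) * grad)

# Gradients spanning many orders of magnitude land in a narrow input range.
for g in (1e3, 1.0, -1e-3, 1e-6, 0.0):
    print(g, preprocess_gradient(g))
```

Under these assumptions both outputs stay within a small range even when raw gradients differ by many orders of magnitude.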
The key idea here is that a neural network will naturally disregard small input signals when their magnitude is vastly different from that of other inputs, and this preprocessing step was our way of dealing with that. male-1: That makes sense. Before moving on to the limitations: in the appendix, it seems the authors also considered adding mechanisms that allow the coordinates of the LSTM optimizer to communicate with each other. Can you explain why they were motivated to try this, and what their findings were? female-1: Yes, Alex. The main part of the paper explored the coordinatewise LSTM, which is akin to a learned version of RMSprop or Adam. These are highly effective methods in practice; however, they do not take into account the correlations between parameters. In other words, the update to parameter 1 does not consider the information from the gradients seen by parameter 2. To explore how to do this, we first introduced Global Averaging Cells, or GACs. These are cells in the LSTM that pass along a signal that is averaged across all coordinates. By doing this, the network can learn an update based on the overall state of the optimization. We then went further and created the NTM-BFGS optimizer. This optimizer used an external memory in the style of a Neural Turing Machine. We used the LSTM as a controller to guide updates to this external memory. The memory is updated via read and write operations. We motivated this design by noting that BFGS can also be seen as a set of independent processes working coordinatewise, but communicating through an inverse Hessian approximation stored in memory. By learning the read and write operations we aimed to create a general version of the BFGS algorithm. We showed that the NTM-BFGS with one read and three write heads can in theory simulate the inverse Hessian BFGS update, if the LSTM controller is expressive enough. We also explored a low-rank version of this memory, denoted NTM-L-BFGS. 
Unfortunately, these architectures did not outperform the simpler coordinatewise LSTM. This means that more work is needed to create better methods that utilize parameter correlation information during optimization. male-1: That's very interesting. So the more complex method didn't outperform the simple approach, at least not in this particular case. Let's shift our focus to limitations. What are the main weaknesses of this approach, and what directions for future research do you see? female-1: Absolutely, Alex. While the results are promising, there are limitations. First, the learned optimizers are currently specialized to classes of problems. Their performance may degrade if applied to significantly different problems. Also, training the optimizers themselves is computationally expensive, and requires careful hyperparameter tuning of the Adam optimizer used to train them. The NTM-BFGS optimizer, as we mentioned, didn't consistently outperform the coordinatewise LSTM optimizer, which indicates that more research is needed to improve that approach. Furthermore, the generalization ability of the learned optimizer can be somewhat sensitive to the characteristics of the objective function. For example, the switch to ReLU activations on MNIST led to poor performance. The experiments so far have also been conducted on a limited set of tasks, so more exploration is needed. Our main experiments also only considered coordinatewise updates, which limits the interactions between parameter updates. The way forward is to explore more sophisticated optimizer architectures, potentially using memory-augmented networks, as well as to develop more robust optimizers that can generalize to a wider range of optimization problems. Reinforcement learning may also be a direction worth pursuing. Finally, there is also a need for better understanding of the theoretical properties of these optimizers, such as their convergence behavior. male-1: Wyd, what are your thoughts on the limitations and future directions? 
female-2: I agree with Paige. The paper clearly indicates the promise of the meta-learning approach for optimization. The limitations are in some sense expected given how new this line of research is. It is important to note that the presented method is not a universal solution, and it makes a number of assumptions about the class of optimization problems. The next steps would be to address the mentioned shortcomings, particularly in terms of generalization and robustness. It is also important to understand whether the coordinatewise nature of the approach is optimal, and more exploration is needed to figure out if it is a good idea to model dependencies between different parameters during optimization, which seems to be the next big challenge in this field. In terms of future directions, the incorporation of domain knowledge, more exploration of different memory update mechanisms and reinforcement learning are also important points. male-1: Okay, let’s move to the broader impact of this research. What are the potential real-world applications of learned optimizers? female-1: There are many exciting applications for this approach. Firstly, we could use learned optimizers for automated hyperparameter tuning, automatically adjusting learning rates and other parameters for base models, speeding up training. Secondly, we could develop specialized optimizers for specific problem domains, such as physics, chemistry, or engineering where the current hand-crafted optimizers may not be ideal. These specialized optimizers could be trained for a certain distribution of problems that are common in these fields. Furthermore, learned optimizers might be used to train more robust neural networks, which are less sensitive to noisy or adversarial inputs. Finally, this method is a good starting point for meta-learning frameworks, learning how to learn rather than training a specific model on a specific dataset, which would be a key step forward in machine learning. 
male-1: Wyd, how do you see this research impacting the broader field of machine learning? female-2: This work has the potential to revolutionize how we approach optimization in machine learning. Moving away from hand-crafted methods to data-driven approaches could lead to significant advancements in terms of both performance, and also in the automation of machine learning pipelines, which will certainly reduce the burden on ML practitioners. The meta-learning concept ties into ideas that exist in cognitive science and control theory, which means there are a lot of potential links to other fields that are yet to be fully explored. This idea that we can not only learn models, but we can also learn how to train models is likely to be very important moving forward. male-1: This has been an incredibly insightful discussion. Before we wrap up, Paige, any final thoughts or key takeaways you'd like to emphasize? female-1: Thank you, Alex. The key takeaway is that we can learn optimization algorithms, and by doing so, create specialized methods that outperform generic ones. The coordinatewise LSTM optimizer is a scalable method, and shows surprisingly strong generalization. We have demonstrated this on many tasks and have highlighted the potential of moving away from hand designed optimizers towards learning update rules from data. There is still a lot of work to be done, but I feel the field of learned optimizers is very promising. male-1: Thank you both for your incredibly insightful contributions. This has been an illuminating discussion of a very exciting paper. And thank you listeners for joining us on another episode of Byte-Sized Breakthroughs. Until next time.