author | SIPB | 2024-12-10 23:00:13 -0500 |
---|---|---|
committer | SIPB | 2024-12-10 23:00:13 -0500 |
commit | 75a6921af214a4d7157627524d916a5bda7d1406 | |
tree | 4ed7e096911b40e43e228e585355acc851becebd /blog.md | |
parent | 6affd742f11839ed0d09f620e73317f655ec568b |
Diffstat (limited to 'blog.md')
-rw-r--r-- | blog.md | 4 |
1 files changed, 2 insertions, 2 deletions
@@ -157,9 +157,9 @@ Finally, the last head will be in charge of noticing whether vertex 1 has reache
 The field of Singular Learning Theory (SLT; see Liam Carroll's Master's thesis "Phase Transitions in Neural Networks" for an introduction) aims to understand model training and loss-landscape geometry. In efforts to better understand the loss landscape of the shortest paths loss function according to the tokens used in our hand coded implementation of the shortest paths transformers, we decided to start at a good setting of the parameters, and then perturb the weights, and see if the model can subsequently achieve low loss. The intuition for why this is a good approach at measuring "how attractive of a loss basin" we have is that this experiment is similar to the Local Learning Coefficient from SLT. (see Lau, Edmund, Zach Furman, George Wang, Daniel Murfet, and Susan Wei. "The Local Learning Coefficient: A Singularity-Aware Complexity Measure"). We found that, perturbing the weights led to high loss, but gradient descent was able to recover low loss, indicating that the solution is somewhat "findable" by gradient descent.
 
 <div style="text-align:center">
-![perturb.png](perturb.png)
+![perturb.png](img/perturb.png)
 
-![perturb-loss.png](perturb-loss.png)
+![perturb-loss.png](img/perturb-loss.png)
 </div>
 
 ## Training