author | Anthony Wang | 2024-12-12 19:54:14 -0500 |
---|---|---|
committer | Anthony Wang | 2024-12-12 19:54:14 -0500 |
commit | 494af1d4f114a566a90e023d3322c5eb068505b3 (patch) | |
tree | 99566cf0c438601ab832f2a557115a78c93179a3 | |
parent | 3dfc72eecaa4a9e2252ae31d9d3d461a60ba573a (diff) |
Use \[ \] instead of $$ $$ for display math
-rw-r--r-- | content/posts/solving-shortest-paths-with-transformers.md | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/content/posts/solving-shortest-paths-with-transformers.md b/content/posts/solving-shortest-paths-with-transformers.md
index 94457c9..859f6e8 100644
--- a/content/posts/solving-shortest-paths-with-transformers.md
+++ b/content/posts/solving-shortest-paths-with-transformers.md
@@ -185,7 +185,7 @@ For our training run, we used the following specifications:
 | Optimizer | Adam |
 
 The number of bits required to store the model parameters in float32 is around $1.76\cdot10^6$. The number of possible graphs on 15 vertices generated using our procedure is approximately
-$$\frac{\binom{15}{2}^{15}}{15!} \approx 1.59\cdot10^{18}.$$
+\[\frac{\binom{15}{2}^{15}}{15!} \approx 1.59\cdot10^{18}.\]
 This is because there are $\binom{15}{2}$ choices for each of the 15 edges and we don't care about the order of the edges. This is only an approximation because some edges might be duplicated. Each graph has an answer between 1 and 15 which requires around 4 bits, so memorizing all the answers requires requires $4\cdot1.59\cdot10^{18} = 6.36\cdot10^{18}$ bits, which is $3.61\cdot10^{12}$ times larger than our model size. This implies that in order to get really low loss, our model needs to do something other than brute memorization.
 
 A single training run takes roughly three hours on a Radeon 7900 XTX graphics card.
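As a reviewer's sanity check on the arithmetic in the touched paragraph (not part of the commit), the estimate can be reproduced with a short Python sketch. The constant names and the helper `approx_graph_count` are hypothetical, and the model size of $1.76\cdot10^6$ bits is taken from the post as given rather than recomputed.

```python
from math import comb, factorial

# Figures quoted in the post; the names are illustrative assumptions.
MODEL_BITS = 1.76e6      # float32 parameter storage, taken as given
NUM_VERTICES = 15
NUM_EDGES = 15
BITS_PER_ANSWER = 4      # an answer between 1 and 15 fits in about 4 bits


def approx_graph_count(vertices: int, edges: int) -> float:
    """Approximate number of edge multisets: C(v, 2)^e / e!.

    This ignores duplicated edges, as the post itself notes.
    """
    return comb(vertices, 2) ** edges / factorial(edges)


graphs = approx_graph_count(NUM_VERTICES, NUM_EDGES)
memorization_bits = BITS_PER_ANSWER * graphs
ratio = memorization_bits / MODEL_BITS

print(f"graphs: {graphs:.3g}")
print(f"bits to memorize all answers: {memorization_bits:.3g}")
print(f"ratio to model size: {ratio:.3g}")
```

The printed values should land close to the post's $1.59\cdot10^{18}$, $6.36\cdot10^{18}$, and $3.61\cdot10^{12}$, so the commit changes only the display-math delimiters and leaves the numbers as they were.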