From 494af1d4f114a566a90e023d3322c5eb068505b3 Mon Sep 17 00:00:00 2001
From: Anthony Wang
Date: Thu, 12 Dec 2024 19:54:14 -0500
Subject: Use \[ \] instead of $$ $$ for display math

---
 content/posts/solving-shortest-paths-with-transformers.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/posts/solving-shortest-paths-with-transformers.md b/content/posts/solving-shortest-paths-with-transformers.md
index 94457c9..859f6e8 100644
--- a/content/posts/solving-shortest-paths-with-transformers.md
+++ b/content/posts/solving-shortest-paths-with-transformers.md
@@ -185,7 +185,7 @@ For our training run, we used the following specifications:
 | Optimizer | Adam |
 
 The number of bits required to store the model parameters in float32 is around $1.76\cdot10^6$. The number of possible graphs on 15 vertices generated using our procedure is approximately
-$$\frac{\binom{15}{2}^{15}}{15!} \approx 1.59\cdot10^{18}.$$
+\[\frac{\binom{15}{2}^{15}}{15!} \approx 1.59\cdot10^{18}.\]
 This is because there are $\binom{15}{2}$ choices for each of the 15 edges and we don't care about the order of the edges. This is only an approximation because some edges might be duplicated. Each graph has an answer between 1 and 15, which requires around 4 bits, so memorizing all the answers requires $4\cdot1.59\cdot10^{18} = 6.36\cdot10^{18}$ bits, which is $3.61\cdot10^{12}$ times larger than our model size. This implies that in order to get really low loss, our model needs to do something other than brute memorization.
 
 A single training run takes roughly three hours on a Radeon 7900 XTX graphics card.
-- 
cgit v1.2.3-70-g09d2
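
Not part of the patch itself, but the arithmetic in the edited paragraph is easy to sanity-check. Below is a minimal Python sketch; the constants (15 vertices and edges, ~4 bits per answer, $1.76\cdot10^6$ bits of model parameters) are taken directly from the post's text, and the variable names are illustrative only.

```python
from math import comb, factorial

# Approximate count of graphs the post's procedure can generate:
# binom(15, 2) choices for each of the 15 edges, divided by 15!
# because edge order doesn't matter (duplicates make this approximate).
num_graphs = comb(15, 2) ** 15 / factorial(15)
print(f"graphs: {num_graphs:.2e}")  # ~1.59e+18, matching the post

# Each answer is between 1 and 15, so ~4 bits per graph;
# memorizing every answer would then take roughly:
memorization_bits = 4 * num_graphs
print(f"memorization: {memorization_bits:.2e} bits")  # ~6.36e+18

# Compare against the ~1.76e6 bits of model parameters (float32):
model_bits = 1.76e6
print(f"ratio: {memorization_bits / model_bits:.2e}")  # ~3.61e+12
```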