From a24288e28c4b53fdd6467ed4eed626fa0586bf72 Mon Sep 17 00:00:00 2001
From: SIPB
Date: Mon, 2 Dec 2024 16:54:41 -0500
Subject: Latest copy of blog post and insane TSP
---
blog.md | 81 ++++++++++---
fine-tuning-2d-histogram.png | Bin 0 -> 8484 bytes
fine-tuning-loss.png | Bin 0 -> 15999 bytes
index.html | 166 +++++++++++++++++++++------
insane-shortest-paths.ipynb | 263 +++++++++++++++++++++++++++++++++++++++++++
test-2d-histogram.png | Bin 0 -> 10783 bytes
training-2d-histogram.png | Bin 0 -> 8764 bytes
training-loss.png | Bin 0 -> 16108 bytes
8 files changed, 457 insertions(+), 53 deletions(-)
create mode 100644 fine-tuning-2d-histogram.png
create mode 100644 fine-tuning-loss.png
create mode 100644 insane-shortest-paths.ipynb
create mode 100644 test-2d-histogram.png
create mode 100644 training-2d-histogram.png
create mode 100644 training-loss.png
diff --git a/blog.md b/blog.md
index 68de189..02a425e 100644
--- a/blog.md
+++ b/blog.md
@@ -1,5 +1,6 @@
---
build: pandoc blog.md --citeproc -s -o index.html
+mkzip: zip project.zip index.html *.png
title: "6.7960 Project: Investigating Off-Distribution Generalization of Transformers"
bibliography: blog.bib
link-citations: true
@@ -7,10 +8,6 @@ link-citations: true
-
-
Anthony Wang, Alek Westover, Kevin Zhao
@@ -19,35 +16,57 @@ Anthony Wang, Alek Westover, Kevin Zhao
## Abstract
-TODO: Probably should move this to the introduction instead of abstract?
+TODO
+
+## Introduction
+
+### Overview
+
LLMs have been advancing rapidly, and with that comes the concern of aligning models to output true and productive statements. One common approach is to have a human in the loop rewarding the model for true outputs (e.g. RLHF), but one drawback of this approach is that humans can be poor judges of truthfulness. As LLMs become more capable, there might not even exist experts who are good judges of whether the model's outputs, such as difficult mathematical proofs, are truthful. So, we'd like to propose a potential solution to this issue via **off-distribution generalization**: applying human-like intuition to solve problems not in the training dataset. Paul Christiano [proposed an experiment](https://www.alignmentforum.org/posts/BxersHYN2qcFoonwg/experimentally-evaluating-whether-honesty-generalizes?commentId=dsDA2BWpHPdgLvaXX) about shortest paths in a graph; our project is essentially to implement Christiano's proposed experiment. To the best of our knowledge, although there has been research on applying machine learning to various graph search problems [@10.5555/3666122.3666260], no one has run our exact experiment yet.
-Recently, LLMs have been developing very fast, and with that comes the concern of aligning the models to output true and productive statements. One common approach for ensuring this is to have a human in the loop rewarding the model for true outputs (e.g. RLHF), but one drawback to this problem is that humans can be poor judges of truthfulness. As LLMs become more capable, there might not even exist experts that are good judges of whether the model's outputs, such as difficult mathematical proofs, are truthful. So, we'd like to propose a potential solution to this issue via **off-distribution generalization** - applying human-like intuition to solve problems not in the dataset. To do so involves Graph Neural Networks (GNNs), which are important in many artificial intelligence applications, including the design of practical features. Paul Christiano [proposed an experiment](https://www.alignmentforum.org/posts/BxersHYN2qcFoonwg/experimentally-evaluating-whether-honesty-generalizes?commentId=dsDA2BWpHPdgLvaXX) about shortest paths in a graph; our project is essentially to implement Christiano's proposed experiment. To the best of our knowledge, although there has been research in applying machine learning for different variations of graph searches [@10.5555/3666122.3666260], no one has done our exact experiment yet.
+It is generally desirable for LLMs to output true statements. A common approach for ensuring this is to have a human in the loop rewarding the model for true outputs (e.g. RLHF); however, humans can be poor judges of truthfulness: we are subject to many cognitive biases and may fall back on superficial heuristics when judging it. A further challenge is that as LLMs grow more capable, there may not even exist experts who can correctly judge the accuracy and truthfulness of sophisticated outputs such as difficult mathematical proofs.
-# Introduction
+One approach to this problem is to reward an LLM for truthful behavior on simple inputs, and then hope that the LLM generalizes its truthful behavior to more complex inputs where humans cannot provide helpful labels. Deep learning models often perform remarkable feats of off-distribution generalization -- for instance, a model trained to transform hand-drawn sketches of cats into realistic images of cats might handle a sketch of a "cat" with three eyes in an intuitive way. We might hope that truthful generalization is the simplest behavior consistent with the training data and is therefore favored by an "Occam's razor" bias; this project aims to investigate that hope.
-## Overview
+COMMENT FROM KEVIN -- synthesize from introduction
+
+## Task
+
+We will use a synthetic task to test our hypothesis that models will generalize truthfully off-distribution. The synthetic task is computing the distance between various vertices in an input graph. Our experiment will have three parts:
+
+1. Pre-train a transformer to predict the distance between two fixed vertices $s,t$ on graphs with $n\in [8, 32)$ vertices.
+2. Fine-tune a transformer to predict the distances between $s,t'$ for any $t'$ which is on the shortest path from $s$ to $t$, but only do fine-tuning on graphs with $n\in [8,16)$ vertices.
+3. Test whether the transformer can accurately predict the distances between $s,t'$ for any $t'$ on the shortest path from $s$ to $t$ for graphs with $n\in [16,32)$ vertices.
## Related Work
+COMMENT FROM ALEK
+-- please remove all mentions of graph neural networks -- that is BS: there is no actual reason why you'd ever use a Neural network to solve shortest paths, the point of choosing a synthetic task is because there is a **simple ground truth** which makes it easy to evaluate whether or not our model is performing correctly. We'd also hoped that the simplicity of the task would make it more feasible to do with a limited compute budget, but apparently this task was too hard for our architecture.
+
+
There has been some research into the algorithmic optimization of GNNs and how they may solve real-world issues; however, none of the related work targets using generic machine learning methods to solve graph problems.
- Cappart et al. have researched the combinatorial optimization of GNNs and developed algorithms for related tasks, thus facilitating machine learning on such problems [@DBLP:journals/corr/abs-2102-09544]. Their results are mostly algorithmic, so we build on them by trading a bit of accuracy for much faster computation on such tasks.
- Tutsoy uses a graph-theory-based approach to model the epidemiological characteristics of infectious diseases, such as COVID-19 [@10.1109/TPAMI.2023.3256421]. We understand from his paper how optimization over graphs may also be useful in researching novel diseases.
-## Sectioning
+### Theory
-# Our Machine Learning Approach
+### Algorithm for Shortest Paths
-## Task
+The standard algorithm for finding the shortest path between a source vertex $u$ and a sink vertex $v$ in an unweighted graph is **breadth-first search (BFS)**. BFS maintains a mapping from visited vertices to their distances from $u$; each round processes every vertex first visited in the previous round and, for each such vertex, visits its unvisited neighbors. The algorithm terminates once either $v$ is visited or no new vertices are visited in a round.
-We will use a synthetic task to test our hypothesis that models will generalize truthfully off-distribution. The synthetic task is computing the distance between various vertices in an input graph. Our experiment will have three parts:
+We will use this algorithm to verify the accuracy of our machine learning approach. Given $V$ vertices and $E$ edges, the runtime of BFS is $O(V + E)$; a machine learning approach may be faster in practice through parallelism, although at the cost of much more memory.
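+For reference, here is a minimal Python sketch of this layer-by-layer BFS (illustrative code of our own; the names are not taken from the project's implementation):
+
+```python
+from collections import deque
+
+def bfs_distance(adj, u, v):
+    # adj: dict mapping each vertex to the set of its neighbors.
+    # Returns the distance from u to v, or None if v is unreachable.
+    dist = {u: 0}
+    frontier = deque([u])
+    while frontier:
+        x = frontier.popleft()
+        for y in adj[x]:
+            if y not in dist:
+                dist[y] = dist[x] + 1
+                if y == v:
+                    return dist[y]
+                frontier.append(y)
+    return None
+
+bfs_distance({1: {2}, 2: {1, 3}, 3: {2}}, 1, 3)  # 2
+```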
-1. Pre-train a transformer to predict the distance between two fixed vertices $s,t$ on graphs with $n\in [8, 32)$ vertices.
-2. Fine-tune a transformer to predict the distances between $s,t'$ for any $t'$ which is on the shortest path from $s$ to $t$, but only do fine-tuning on graphs with $n\in [8,16)$ vertices.
-3. Test whether the transformer can accurately predict the distances between $s,t'$ for any $t'$ on the shortest path from $s$ to $t$ for graphs with $n\in [16,32)$ vertices.
+### Potential Mathematical Approaches to Shortest Paths
+
+Another way to think about shortest paths is to use a *matrix* to record which vertices are connected. Given vertices numbered $1$ to $V$, we define the **adjacency matrix** $\textbf{M}$ of dimensions $V \times V$ as the matrix with $\textbf{M}_{i, j} = 1$ if vertices $i$ and $j$ are connected by an edge and $\textbf{M}_{i, j} = 0$ otherwise. Now, note that for all $k$, $(\textbf{M}+I)^k_{i, j} = 0$ if and only if there is no path from vertex $i$ to vertex $j$ of length at most $k$, because the $(i,j)$ entry of $(\textbf{M}+I)^k$ is a (weighted) count of the walks of length at most $k$ from $i$ to $j$ (adding $I$ lets a walk stay in place for a step). As a result, if the distance between vertices $i$ and $j$ is $d$, then $\min\left((\textbf{M}+I)^k_{i, j}, 1\right) = 1$ if $k \ge d$ and $\min\left((\textbf{M}+I)^k_{i, j}, 1\right) = 0$ if $k < d$.
+
+With this information, because the distance between any two vertices of a connected graph on $V$ vertices is at most $V-1$, the *distance* matrix is simply $$\textbf{D}_{i, j} = V - \sum_{k=0}^{V-1}\min\left((\textbf{M}+I)^k_{i, j}, 1\right).$$ Computing this takes $O(V)$ matrix multiplications and is easy to parallelize, although it uses far more memory than BFS.
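+As an illustration, here is a minimal NumPy sketch of this formula, assuming a connected graph (our own example code, not part of the project):
+
+```python
+import numpy as np
+
+def distance_matrix(M: np.ndarray) -> np.ndarray:
+    # M: V x V 0/1 adjacency matrix of a connected, undirected graph.
+    V = M.shape[0]
+    A = M + np.eye(V)                # M + I
+    P = np.eye(V)                    # (M + I)^0
+    total = np.zeros((V, V))
+    for k in range(V):               # k = 0, ..., V-1
+        total += np.minimum(P, 1.0)  # elementwise min((M + I)^k, 1)
+        P = P @ A
+    return V - total                 # D[i, j] = V - sum_k min((M + I)^k[i, j], 1)
+
+# For the path graph 1-2-3, the distance between the endpoints is 2:
+M = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
+print(distance_matrix(M)[0, 2])  # 2.0
+```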
-## Data
+## Our Machine Learning Approach
+
+### Data
We will represent an $n$ vertex, $m$ edge unweighted, undirected graph as a sequence of the endpoints of the $m$ edges, so $[a_1,b_1,a_2,b_2,\ldots,a_m,b_m]$ represents a graph with the edges $\{(a_i,b_i)\}$ for $1 \leq i \leq m$. We will pad all sequences to be the same length using the padding token 0.
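+For example, under this encoding the path graph with edges $(1,2)$ and $(2,3)$ becomes $[1,2,2,3,0,0,\ldots]$ after padding. A minimal sketch of the encoding (an illustrative helper of our own, not the project's actual data pipeline):
+
+```python
+def encode_graph(edges, pad_len):
+    # edges: list of (a, b) endpoint pairs with vertices numbered from 1; 0 is the padding token.
+    seq = [x for (a, b) in edges for x in (a, b)]
+    assert len(seq) <= pad_len
+    return seq + [0] * (pad_len - len(seq))
+
+encode_graph([(1, 2), (2, 3)], pad_len=8)  # [1, 2, 2, 3, 0, 0, 0, 0]
+```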
@@ -61,13 +80,39 @@ We have three separate datasets.
As a side note, we are also curious whether the transformer learns to generalize to different distributions of graphs, such as denser graphs or graphs with different properties. Time permitting, we will also investigate this.
-## Architecture
+### Architecture
We plan to use a standard transformer architecture. We will ensure that the number of layers in our transformer is at least the diameter of the graph. This guarantees that there is an extremely simple circuit --- namely BFS --- that the transformer could in principle learn in order to perform the task. If the transformer actually learns a simple circuit like this, it seems more likely to generalize well. This is also our intuition for why it should be possible to fine-tune on a small amount of data for finding shortest paths to vertices other than $2$ -- the model should already be computing these other distances as intermediate values in its computation of the distance to vertex $2$.
-## Positional Encodings
+### Embeddings
+
+TODO: fix this
In order to facilitate performing this task with limited computational resources, we plan to use custom positional encodings that give the model extra information about the structure of the problem, rather than the traditional sine/cosine positional encodings. (TODO: THIS IS OUTDATED) Specifically, our positional encodings are $v_1,v_1,v_2,v_2,\ldots,v_m,v_m,v_{m+1}$, where each $v_i$ is a random vector, so that distinct $v_i, v_j$ are nearly orthogonal with high probability. We concatenate these with the token encodings rather than adding them. This should make it easy for the model to assign large attention scores between the two tokens that correspond to the same edge.
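+A minimal sketch of this (now-outdated) scheme as described above, using our own illustrative names -- one random unit vector per edge index, repeated for both endpoint tokens and meant to be concatenated with the token embeddings:
+
+```python
+import torch
+
+def edge_positional_encodings(m: int, d: int) -> torch.Tensor:
+    # Random unit vectors in d dimensions are nearly orthogonal with high probability.
+    v = torch.randn(m + 1, d)
+    v = v / v.norm(dim=1, keepdim=True)
+    order = [i // 2 for i in range(2 * m)] + [m]   # v_1, v_1, ..., v_m, v_m, v_{m+1}
+    return v[order]                                # shape (2m + 1, d); concatenate with token embeddings
+```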
+### Explicit transformer formula for shortest paths
+
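+TODO: write this section. A working prototype -- a transformer whose attention weights are hand-coded so that each layer propagates reachability one step, i.e. simulates BFS -- is included with this commit as the `SillyTransformer` class in `insane-shortest-paths.ipynb`; this section should spell out that construction.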
+
+
+## Results
+
+### Initial Results
+
+We used a model dimension of 64, four layers, and two heads per layer. We used MSE loss, the Adam optimizer, a learning rate of 8e-4, and a batch size of 131,072 for 8000 unique randomly generated batches. Our final MSE loss was 0.35546875.
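+For concreteness, here is a minimal PyTorch sketch of this configuration (the embedding, readout, and tiny stand-in batch below are placeholders of our own; the real runs used a batch size of 131,072 for 8000 batches):
+
+```python
+import torch
+import torch.nn as nn
+
+d_model, n_layers, n_heads = 64, 4, 2
+layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
+encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
+readout = nn.Linear(d_model, 1)                     # scalar distance prediction
+opt = torch.optim.Adam(list(encoder.parameters()) + list(readout.parameters()), lr=8e-4)
+loss_fn = nn.MSELoss()
+
+x = torch.randn(32, 16, d_model)                    # stand-in for a batch of embedded graph sequences
+y = torch.randint(1, 8, (32,)).float()              # stand-in for the true distances
+pred = readout(encoder(x)[:, 0, :]).squeeze(-1)     # read the prediction off the first position
+loss = loss_fn(pred, y)
+opt.zero_grad(); loss.backward(); opt.step()
+```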
+
+![](training-loss.png)
+
+![](training-2d-histogram.png)
+
+### Fine Tuning
+
+After obtaining our initial results, we fine-tuned with a learning rate of 1e-5, again using MSE loss and the same batch size. Our final results are shown below.
+
+![](fine-tuning-loss.png)
+
+![](fine-tuning-2d-histogram.png)
+
+![](test-2d-histogram.png)
+
## References
diff --git a/fine-tuning-2d-histogram.png b/fine-tuning-2d-histogram.png
new file mode 100644
index 0000000..df45973
Binary files /dev/null and b/fine-tuning-2d-histogram.png differ
diff --git a/fine-tuning-loss.png b/fine-tuning-loss.png
new file mode 100644
index 0000000..7e95a06
Binary files /dev/null and b/fine-tuning-loss.png differ
diff --git a/index.html b/index.html
index 0126a3b..95d23ff 100644
--- a/index.html
+++ b/index.html
@@ -202,8 +202,9 @@ Generalization of Transformers
{xy,alekw,kevinmz}@mit.edu
Abstract
-TODO: Probably should move this to the introduction instead of
-abstract?
+TODO
+Introduction
+Overview
Recently, LLMs have been developing very fast, and with that comes
the concern of aligning the models to output true and productive
statements. One common approach for ensuring this is to have a human in
@@ -214,10 +215,7 @@ experts that are good judges of whether the model’s outputs, such as
difficult mathematical proofs, are truthful. So, we’d like to propose a
potential solution to this issue via off-distribution
generalization - applying human-like intuition to solve
-problems not in the dataset. To do so involves Graph Neural Networks
-(GNNs), which are important in many artificial intelligence
-applications, including the design of practical features. Paul
-Christiano proposed
an experiment about shortest paths in a graph; our project is
essentially to implement Christiano’s proposed experiment. To the best
@@ -226,32 +224,24 @@ learning for different variations of graph searches (Zang et al.
2024), no one has done our exact experiment yet.
-Introduction
-Overview
-
-There has been some research into the algorithmic optimization of
-GNNs and how they may solve real-world issues; however, none of the
-related work targets using generic machine learning methods to solve
-graph problems.
-
-Cappart et al. has researched more into the Combinatorial
-Optimization of GNNs and developed algorithms for related tasks, thus
-facilitating machine learning (Cappart et al. 2021). Their results are
-mostly algorithmic so we develop further by trading a bit of accuracy
-for much faster computation in such tasks.
-Tutsoy uses a graph-theory-based approach to model the
-epidemiological characteristics of infectious diseases, such as COVID-19
-(Tutsoy
-2023). We understand from his paper how GNN optimization may
-also be useful in researching novel diseases.
-
-Sectioning
-Our Machine Learning
-Approach
+It is generally desirable for LLMs to output true statements. A
+common approach for ensuring this is to have a human in the loop
+rewarding the model for true outputs (e.g. RLHF); however, humans can
+be poor judges of truthfulness: we are subject to many cognitive biases
+and may fall back on superficial heuristics when judging it. A further
+challenge is that as LLMs grow more capable, there may not even exist
+experts who can correctly judge the accuracy and truthfulness of
+sophisticated outputs such as difficult mathematical proofs.
+One approach to this problem is to reward an LLM for truthful
+behavior on simple inputs, and then hope that the LLM generalizes its
+truthful behavior to more complex inputs where humans cannot provide
+helpful labels. Deep learning models often perform remarkable feats of
+off-distribution generalization – for instance, a model trained to
+transform hand-drawn sketches of cats into realistic images of cats
+might handle a sketch of a “cat” with three eyes in an intuitive way.
+We might hope that truthful generalization is the simplest behavior
+consistent with the training data and is therefore favored by an
+“Occam’s razor” bias; this project aims to investigate that hope.
+COMMENT FROM KEVIN – synthesize from introduction
Task
We will use a synthetic task to test our hypothesis that models will
generalize truthfully off-distribution. The synthetic task is computing
@@ -276,7 +266,95 @@ between s, t′ for any
class="math inline">t for graphs with n ∈ [16, 32) vertices.
-
Data
+
+COMMENT FROM ALEK – please remove all mentions of graph neural
+networks – that is BS: there is no actual reason why you’d ever use a
+Neural network to solve shortest paths, the point of choosing a
+synthetic task is because there is a simple ground
+truth which makes it easy to evaluate whether or not our model
+is performing correctly. We’d also hoped that the simplicity of the task
+would make it more feasible to do with a limited compute budget, but
+apparently this task was too hard for our architecture.
+There has been some research into the algorithmic optimization of
+GNNs and how they may solve real-world issues; however, none of the
+related work targets using generic machine learning methods to solve
+graph problems.
+
+Cappart et al. has researched more into the Combinatorial
+Optimization of GNNs and developed algorithms for related tasks, thus
+facilitating machine learning (Cappart et al. 2021). Their results are
+mostly algorithmic so we develop further by trading a bit of accuracy
+for much faster computation in such tasks.
+Tutsoy uses a graph-theory-based approach to model the
+epidemiological characteristics of infectious diseases, such as COVID-19
+(Tutsoy
+2023). We understand from his paper how GNN optimization may
+also be useful in researching novel diseases.
+
+Theory
+Algorithm for Shortest Paths
+The standard algorithm to find the shortest path in a graph between a
+source numbered as u and sink
+numbered as v is
+breadth-first search (BFS). The BFS algorithm maintains
+a mapping of visited vertices to their distances with respect to u, and each run of the algorithm
+goes through all the vertices newly visited in the previous run, and for
+each vertex, visits any of its unvisited neighbors. The algorithm
+terminates once either v is
+visited or the set of newly visited vertices in a single run is
+empty.
+We will use this algorithm to verify the accuracy of our machine
+learning approach. Given V
+vertices and E edges, the
+runtime of this algorithm is thus O(V + E); however,
+a machine learning approach may do better in time through parallelism,
+although at the expense of using much more memory.
+Potential
+Mathematical Approaches to Shortest Paths
+Another way to think about shortest paths is to use a
+matrix to record which vertices are connected. Given vertices
+numbered 1 to V, we define the adjacency
+matrix M of dimensions V × V as the matrix with
+M_{i,j} = 1 if vertices i and j are connected by an edge and
+M_{i,j} = 0 otherwise. Now, note that for all k,
+(M + I)^k_{i,j} = 0 if and only if there is no path from vertex i to
+vertex j of length at most k, because the (i,j) entry of (M + I)^k is
+a (weighted) count of the walks of length at most k from i to j. As a
+result, if the distance between vertices i and j is d, then
+min((M + I)^k_{i,j}, 1) = 1 if k ≥ d and min((M + I)^k_{i,j}, 1) = 0
+if k < d.
+With this information, because the distance between any two vertices
+of a connected graph on V vertices is at most V − 1, the
+distance matrix is simply
+D_{i,j} = V − Σ_{k=0}^{V−1} min((M + I)^k_{i,j}, 1).
+Computing this takes O(V) matrix multiplications and is easy to
+parallelize, although it uses far more memory than BFS.
+Our Machine Learning
+Approach
+Data
We will represent an n
vertex, m edge unweighted,
undirected graph as sequence of the endpoints of the n ∈ [16, 32) instead.
generalize to different distributions of graphs, such as denser graphs
or graphs with different properties. Time permitting, we will also
investigate this.
-Architecture
+Architecture
We plan to use a standard transformer architecture. We will ensure
that the number of layers in our transformer is at least the diameter of
the graph. By doing this, we ensure that there is an extremely simple
@@ -334,7 +412,8 @@ vertices besides 2 – it seems like the
model should be computing these other distances as intermediate values
in its computation to find the distance to vertex 2.
-Positional Encodings
+Embeddings
+TODO: fix this
In order to facilitate performing this task with limited
computational resources, we plan to use custom-made positional encodings
that tell the model extra information about the structure of the
@@ -350,6 +429,23 @@ pair is nearly orthogonal with high probability. We will concatenate
these with the token encodings rather than adding them. This should let
the model easily have large attention scores between vertices
corresponding to a single edge.
+
+Results
+Initial Results
+We used a model dimension of 64, four layers, and two heads per
+layer. We used MSE loss, the Adam optimizer, a learning rate of 8e-4,
+and a batch size of 131,072 for 8000 unique randomly generated batches.
+Our final MSE loss was 0.35546875.
+
+
+Fine Tuning
+After receiving our initial results, we fine-tuned with a learning
+rate of 1e-5, also with MSE and the same batch size. Our final results
+are shown below.
+
+
+
References
diff --git a/insane-shortest-paths.ipynb b/insane-shortest-paths.ipynb
new file mode 100644
index 0000000..72846c2
--- /dev/null
+++ b/insane-shortest-paths.ipynb
@@ -0,0 +1,263 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "execution_state": "idle",
+ "id": "86ce5f44-94f6-43b0-a0d1-091b8134ffb6",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[set(), set(), {5, 6}, {4}, {3}, {2, 6}, {2, 5}]\n",
+ "[set(), {6}, set(), {4, 5, 6}, {3}, {3, 6}, {1, 3, 5}]\n",
+ "[set(), {4}, set(), {4, 5}, {1, 3}, {3, 6}, {5}]\n",
+ "[set(), {2, 6}, {1, 6}, {6}, set(), set(), {1, 2, 3}]\n",
+ "[set(), {3}, {3}, {1, 2, 5, 6}, {5}, {3, 4}, {3}]\n",
+ "[set(), {3, 6}, {4}, {1}, {2}, {6}, {1, 5}]\n",
+ "[set(), {2, 3}, {1, 3, 6}, {1, 2, 4}, {3}, set(), {2}]\n",
+ "[set(), {4}, set(), {4}, {1, 3, 5, 6}, {4}, {4}]\n",
+ "[set(), {3, 4, 5}, {6}, {1}, {1, 6}, {1}, {2, 4}]\n",
+ "[set(), {5, 6}, {6}, {6}, {5, 6}, {1, 4}, {1, 2, 3, 4}]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# -*- coding: utf-8 -*-\n",
+ "\"\"\"how-tsp-should-be.ipynb\n",
+ "\n",
+ "Automatically generated by Colab.\n",
+ "\n",
+ "Original file is located at\n",
+ " https://colab.research.google.com/drive/1InE1iW8ARzndPpvqH_9y22s81sOiHxPs\n",
+ "\"\"\"\n",
+ "\n",
+ "from tqdm import tqdm\n",
+ "import torch\n",
+ "import torch.nn as nn\n",
+ "import matplotlib as mpl\n",
+ "import matplotlib.pyplot as plt\n",
+ "from torch.utils.data import DataLoader, TensorDataset\n",
+ "\n",
+ "from math import sqrt\n",
+ "from collections import deque\n",
+ "import os\n",
+ "import random\n",
+ "import pickle\n",
+ "import ipdb\n",
+ "\n",
+ "# torch.manual_seed(30)\n",
+ "# random.seed(30)\n",
+ "torch.manual_seed(33)\n",
+ "random.seed(33)\n",
+ "\n",
+ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+ "# assert device.type == \"cuda\", \"CUDA is not available. Please check your GPU setup.\"\n",
+ "\n",
+ "NVTXS = 6\n",
+ "MAXDIST = NVTXS+1\n",
+ "AVGDEG = 2\n",
+ "SEQLEN = NVTXS + 1\n",
+ "HIDDENDIM = 4*NVTXS+2\n",
+ "\n",
+ "# 0: ANSFLAG\n",
+ "# 1:NVTXS+1 NBRS\n",
+ "# NVTXS+1: 2*NVTXS+1 REACH\n",
+ "# 2*NVTXS+1: 3*NVTXS+1 SELF\n",
+ "# -1 NOTANSFLAG\n",
+ "\n",
+ "START_REACH = NVTXS+1\n",
+ "START_OUT = 2*NVTXS+1\n",
+ "START_SELF = 3*NVTXS+1\n",
+ "SRC_FLAG_IDX = START_SELF\n",
+ "SOURCE = 1\n",
+ "TARGET = 2\n",
+ "ANS_FLAG_IDX = 0\n",
+ "NOTANS_FLAG_IDX = -1\n",
+ "\n",
+ "def print_everything(data):\n",
+ " print(\"NBRS\")\n",
+ " print(data[0, 1:, 1:1+NVTXS])\n",
+ " print(\"REACH\")\n",
+ " print(data[0, 1:, START_REACH:START_REACH+NVTXS])\n",
+ " print(\"ANSFLAG\")\n",
+ " print(data[0, :, 0])\n",
+ " print(\"MORE FLAGS\")\n",
+ " print(data[0, :, -1])\n",
+ " print(\"SELF\")\n",
+ " print(data[0, 1:, START_SELF:START_SELF+NVTXS])\n",
+ " print(\"OUT\")\n",
+ " print(data[0, 0, START_OUT:START_OUT+NVTXS])\n",
+ "\n",
+ "\n",
+ "def random_graph():\n",
+ " data = torch.zeros((SEQLEN, HIDDENDIM))\n",
+ "\n",
+ " for i in range(1,NVTXS+1):\n",
+ " data[i, START_SELF-1+i] = 1\n",
+ "\n",
+ " adj_list = [set() for _ in range(SEQLEN)]\n",
+ " indices = [random.randint(1, NVTXS) for _ in range(AVGDEG * NVTXS)]\n",
+ " for i in range(0, len(indices), 2):\n",
+ " u = indices[i]\n",
+ " v = indices[i + 1]\n",
+ " if u != v:\n",
+ " data[v,u] = 1\n",
+ " data[u,v] = 1\n",
+ " data[v,NVTXS+u] = 1\n",
+ " data[u,NVTXS+v] = 1\n",
+ " adj_list[u].add(v)\n",
+ " adj_list[v].add(u)\n",
+ "\n",
+ " data[0, ANS_FLAG_IDX] = 1\n",
+ " data[1:, NOTANS_FLAG_IDX] = 1\n",
+ "\n",
+ " # TODO: this is kind of a hack\n",
+ " data[0, START_REACH:START_REACH+NVTXS] = 1\n",
+ " return data, adj_list\n",
+ "\n",
+ "\"\"\"\n",
+ "input: G, represented as an adjacency list\n",
+ "output: distance from SOURCE to TARGET\n",
+ "\"\"\"\n",
+ "def SSSP(G):\n",
+ " dist = [MAXDIST for _ in G]\n",
+ " dist[SOURCE] = 0\n",
+ " frontier = deque()\n",
+ " frontier.append(SOURCE)\n",
+ " while len(frontier) > 0:\n",
+ " vtx = frontier.popleft()\n",
+ " for x in G[vtx]:\n",
+ " if dist[x] == MAXDIST:\n",
+ " dist[x] = 1 + dist[vtx]\n",
+ " frontier.append(x)\n",
+ " if x == TARGET:\n",
+ " return dist[TARGET]\n",
+ " return MAXDIST\n",
+ "\n",
+ "def mkbatch(size):\n",
+ " graphs1 = []\n",
+ " distance1 = []\n",
+ "\n",
+ " for i in range(size):\n",
+ " data, adj_list = random_graph()\n",
+ " dist = SSSP(adj_list)\n",
+ " graphs1.append(data)\n",
+ " distance1.append(dist)\n",
+ "\n",
+ " print(adj_list)\n",
+ "\n",
+ " data = torch.stack(graphs1)\n",
+ " labels = torch.tensor(distance1, dtype=torch.float16)\n",
+ " return data, labels\n",
+ "\n",
+ "\"\"\"\n",
+ "TODO: WRAP EVERYTHING in nn.Parameter(torch.zeros((1, HIDDENDIM)))\n",
+ "and then do my perturbing parameters experiment\n",
+ "\n",
+ "TODO:\n",
+ " USE activation magic to bring everything back to the 0/1 realm instead of possibly being 0/2 valued\n",
+ "\"\"\"\n",
+ "\n",
+ "class SillyTransformer(nn.Module):\n",
+ " def __init__(self):\n",
+ " super().__init__()\n",
+ " self.most_KQVs = []\n",
+ " for head in range(1,NVTXS+1):\n",
+ " Q = torch.zeros((2, HIDDENDIM))\n",
+ " Q[0, START_REACH-1+head] = 1000\n",
+ " Q[1, NOTANS_FLAG_IDX] = 1\n",
+ "\n",
+ " K = torch.zeros((2, HIDDENDIM))\n",
+ " K[0, head] = 1\n",
+ " K[1, ANS_FLAG_IDX] = 200\n",
+ "\n",
+ " V = torch.zeros((NVTXS,HIDDENDIM))\n",
+ " for i in range(NVTXS):\n",
+ " V[i, START_SELF+i] = 1\n",
+ "\n",
+ " self.most_KQVs.append((K, Q, V))\n",
+ "\n",
+ " self.weird_KQVs = []\n",
+ " for layer in range(NVTXS):\n",
+ " K = torch.zeros((3, HIDDENDIM))\n",
+ " K[0, NOTANS_FLAG_IDX] = -1000\n",
+ " K[0, SRC_FLAG_IDX] = +1100\n",
+ " K[1, NOTANS_FLAG_IDX] = -1000\n",
+ " K[1, NVTXS+TARGET] = +1100\n",
+ " K[1, ANS_FLAG_IDX] = -1100\n",
+ " K[2, ANS_FLAG_IDX] = 10\n",
+ "\n",
+ " Q = torch.zeros((3, HIDDENDIM))\n",
+ " Q[:, ANS_FLAG_IDX] = 1\n",
+ "\n",
+ " V = torch.zeros((NVTXS, HIDDENDIM))\n",
+ " V[layer, SRC_FLAG_IDX] = 1\n",
+ "\n",
+ " self.weird_KQVs.append((K, Q, V))\n",
+ "\n",
+ " def forward(self, src):\n",
+ " for layer in range(NVTXS):\n",
+ " allKQVs = [self.weird_KQVs[layer]] + self.most_KQVs\n",
+ " head_outputs = []\n",
+ " for (K, Q, V) in allKQVs:\n",
+ " ksrc = torch.matmul(src, K.unsqueeze(0).transpose(-2, -1))\n",
+ " qsrc = torch.matmul(src, Q.unsqueeze(0).transpose(-2, -1))\n",
+ " vsrc = torch.matmul(src, V.unsqueeze(0).transpose(-2, -1))\n",
+ "\n",
+ " scores = torch.matmul(qsrc, ksrc.transpose(-2, -1))\n",
+ " attention_weights = torch.softmax(scores, dim=-1)\n",
+ " head_output = torch.matmul(attention_weights, vsrc)\n",
+ " head_outputs.append(head_output)\n",
+ "\n",
+ " new_reaches = sum(head_outputs[1:])\n",
+ " BSZ = new_reaches.shape[0]\n",
+ "\n",
+ " nodelta_nbrs = torch.zeros((BSZ, SEQLEN, NVTXS+1))\n",
+ " morepadlol = torch.zeros((BSZ, SEQLEN, 1+NVTXS))\n",
+ "\n",
+ " DIFF = torch.cat((nodelta_nbrs, new_reaches, head_outputs[0], morepadlol), dim=2)\n",
+ " src += torch.cat((nodelta_nbrs, new_reaches, head_outputs[0], morepadlol), dim=2)\n",
+ " src[:, :, START_REACH:START_REACH+NVTXS] = 2*torch.sigmoid(src[:,:, START_REACH:START_REACH+NVTXS]*1000)-1\n",
+ "\n",
+ " # print(\"SRC\")\n",
+ " # print_everything(src)\n",
+ "\n",
+ " canreach = src[:,0,START_OUT:START_OUT+NVTXS]\n",
+ " # __import__('ipdb').set_trace()\n",
+ " final_output = 1+torch.sum(1-canreach,dim=1)\n",
+ " return final_output\n",
+ "\n",
+ "model = SillyTransformer()\n",
+ "model.to(device)\n",
+ "\n",
+ "data, labels = mkbatch(10)\n",
+ "assert torch.all(model(data) == labels)\n",
+ "\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/test-2d-histogram.png b/test-2d-histogram.png
new file mode 100644
index 0000000..75f2f6e
Binary files /dev/null and b/test-2d-histogram.png differ
diff --git a/training-2d-histogram.png b/training-2d-histogram.png
new file mode 100644
index 0000000..45d53a2
Binary files /dev/null and b/training-2d-histogram.png differ
diff --git a/training-loss.png b/training-loss.png
new file mode 100644
index 0000000..98a2a20
Binary files /dev/null and b/training-loss.png differ
--
cgit v1.2.3-70-g09d2