Diffstat (limited to 'index.html')
-rw-r--r--   index.html   381
1 files changed, 381 insertions, 0 deletions
diff --git a/index.html b/index.html
new file mode 100644
index 0000000..0126a3b
--- /dev/null
+++ b/index.html
@@ -0,0 +1,381 @@
+<!DOCTYPE html>
+<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
+<head>
+ <meta charset="utf-8" />
+ <meta name="generator" content="pandoc" />
+ <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
+ <title>6.7960 Project: Investigating Off-Distribution Generalization of Transformers</title>
+ <style>
+ html {
+ color: #1a1a1a;
+ background-color: #fdfdfd;
+ }
+ body {
+ margin: 0 auto;
+ max-width: 36em;
+ padding-left: 50px;
+ padding-right: 50px;
+ padding-top: 50px;
+ padding-bottom: 50px;
+ hyphens: auto;
+ overflow-wrap: break-word;
+ text-rendering: optimizeLegibility;
+ font-kerning: normal;
+ }
+ @media (max-width: 600px) {
+ body {
+ font-size: 0.9em;
+ padding: 12px;
+ }
+ h1 {
+ font-size: 1.8em;
+ }
+ }
+ @media print {
+ html {
+ background-color: white;
+ }
+ body {
+ background-color: transparent;
+ color: black;
+ font-size: 12pt;
+ }
+ p, h2, h3 {
+ orphans: 3;
+ widows: 3;
+ }
+ h2, h3, h4 {
+ page-break-after: avoid;
+ }
+ }
+ p {
+ margin: 1em 0;
+ }
+ a {
+ color: #1a1a1a;
+ }
+ a:visited {
+ color: #1a1a1a;
+ }
+ img {
+ max-width: 100%;
+ }
+ svg {
+ height: auto;
+ max-width: 100%;
+ }
+ h1, h2, h3, h4, h5, h6 {
+ margin-top: 1.4em;
+ }
+ h5, h6 {
+ font-size: 1em;
+ font-style: italic;
+ }
+ h6 {
+ font-weight: normal;
+ }
+ ol, ul {
+ padding-left: 1.7em;
+ margin-top: 1em;
+ }
+ li > ol, li > ul {
+ margin-top: 0;
+ }
+ blockquote {
+ margin: 1em 0 1em 1.7em;
+ padding-left: 1em;
+ border-left: 2px solid #e6e6e6;
+ color: #606060;
+ }
+ code {
+ font-family: Menlo, Monaco, Consolas, 'Lucida Console', monospace;
+ font-size: 85%;
+ margin: 0;
+ hyphens: manual;
+ }
+ pre {
+ margin: 1em 0;
+ overflow: auto;
+ }
+ pre code {
+ padding: 0;
+ overflow: visible;
+ overflow-wrap: normal;
+ }
+ .sourceCode {
+ background-color: transparent;
+ overflow: visible;
+ }
+ hr {
+ background-color: #1a1a1a;
+ border: none;
+ height: 1px;
+ margin: 1em 0;
+ }
+ table {
+ margin: 1em 0;
+ border-collapse: collapse;
+ width: 100%;
+ overflow-x: auto;
+ display: block;
+ font-variant-numeric: lining-nums tabular-nums;
+ }
+ table caption {
+ margin-bottom: 0.75em;
+ }
+ tbody {
+ margin-top: 0.5em;
+ border-top: 1px solid #1a1a1a;
+ border-bottom: 1px solid #1a1a1a;
+ }
+ th {
+ border-top: 1px solid #1a1a1a;
+ padding: 0.25em 0.5em 0.25em 0.5em;
+ }
+ td {
+ padding: 0.125em 0.5em 0.25em 0.5em;
+ }
+ header {
+ margin-bottom: 4em;
+ text-align: center;
+ }
+ #TOC li {
+ list-style: none;
+ }
+ #TOC ul {
+ padding-left: 1.3em;
+ }
+ #TOC > ul {
+ padding-left: 0;
+ }
+ #TOC a:not(:hover) {
+ text-decoration: none;
+ }
+ code{white-space: pre-wrap;}
+ span.smallcaps{font-variant: small-caps;}
+ div.columns{display: flex; gap: min(4vw, 1.5em);}
+ div.column{flex: auto; overflow-x: auto;}
+ div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
+ /* The extra [class] is a hack that increases specificity enough to
+ override a similar rule in reveal.js */
+ ul.task-list[class]{list-style: none;}
+ ul.task-list li input[type="checkbox"] {
+ font-size: inherit;
+ width: 0.8em;
+ margin: 0 0.8em 0.2em -1.6em;
+ vertical-align: middle;
+ }
+ .display.math{display: block; text-align: center; margin: 0.5rem auto;}
+ /* CSS for citations */
+ div.csl-bib-body { }
+ div.csl-entry {
+ clear: both;
+ margin-bottom: 0em;
+ }
+ .hanging-indent div.csl-entry {
+ margin-left:2em;
+ text-indent:-2em;
+ }
+ div.csl-left-margin {
+ min-width:2em;
+ float:left;
+ }
+ div.csl-right-inline {
+ margin-left:2em;
+ padding-left:1em;
+ }
+ div.csl-indent {
+ margin-left: 2em;
+ } </style>
+</head>
+<body>
+<header id="title-block-header">
+<h1 class="title">6.7960 Project: Investigating Off-Distribution
+Generalization of Transformers</h1>
+</header>
+<!-- Guidelines: https://www.dropbox.com/scl/fi/bet8enscln8ue36kd8t17/final_project_guidelines.pdf?rlkey=knd19cnumk51ho1y9crno56ib&e=2&dl=0 -->
+<!-- <div style="display: flex; justify-content: space-between;">
+
+<div style="flex: 1; margin: 5px; padding: 10px; border: 1px solid #ddd; text-align: center;"> -->
+<div style="text-align:center">
+<p>Anthony Wang, Alek Westover, Kevin Zhao</p>
+<p>{xy,alekw,kevinmz}@mit.edu</p>
+</div>
+<h2 id="abstract">Abstract</h2>
+<p>TODO: Probably should move this to the introduction instead of
+abstract?</p>
+<p>LLM capabilities have advanced rapidly in recent years, and with
+that comes the concern of aligning models to output true and productive
+statements. One common approach is to keep a human in the loop who
+rewards the model for true outputs (e.g. RLHF), but a drawback of this
+approach is that humans can be poor judges of truthfulness. As LLMs
+become more capable, there may not even exist experts who can judge
+whether the model’s outputs, such as difficult mathematical proofs, are
+truthful. We therefore propose a potential solution to this issue via
+<strong>off-distribution generalization</strong>: applying human-like
+intuition to solve problems not in the training data. Closely related
+work involves Graph Neural Networks (GNNs), which are important in many
+artificial intelligence applications, including the design of practical
+features. Paul Christiano <a
+href="https://www.alignmentforum.org/posts/BxersHYN2qcFoonwg/experimentally-evaluating-whether-honesty-generalizes?commentId=dsDA2BWpHPdgLvaXX">proposed
+an experiment</a> about shortest paths in a graph, and our project is
+essentially to implement Christiano’s proposed experiment. To the best
+of our knowledge, although there has been research on applying machine
+learning to different variations of graph search <span
+class="citation" data-cites="10.5555/3666122.3666260">(<a
+href="#ref-10.5555/3666122.3666260" role="doc-biblioref">Zang et al.
+2024</a>)</span>, no one has run our exact experiment yet.</p>
+<h1 id="introduction">Introduction</h1>
+<h2 id="overview">Overview</h2>
+<h2 id="related-work">Related Work</h2>
+<p>There has been some research into the algorithmic optimization of
+GNNs and how they may solve real-world issues; however, none of the
+related work targets using generic machine learning methods to solve
+graph problems.</p>
+<ul>
+<li><p>Cappart et al. have studied combinatorial optimization with
+GNNs and developed algorithms for related tasks, thus facilitating
+machine learning <span class="citation"
+data-cites="DBLP:journals/corr/abs-2102-09544">(<a
+href="#ref-DBLP:journals/corr/abs-2102-09544"
+role="doc-biblioref">Cappart et al. 2021</a>)</span>. Their results are
+mostly algorithmic, so we build on them by trading a bit of accuracy
+for much faster computation on such tasks.</p></li>
+<li><p>Tutsoy uses a graph-theory-based approach to model the
+epidemiological characteristics of infectious diseases, such as COVID-19
+<span class="citation" data-cites="10.1109/TPAMI.2023.3256421">(<a
+href="#ref-10.1109/TPAMI.2023.3256421" role="doc-biblioref">Tutsoy
+2023</a>)</span>. We understand from his paper how GNN optimization may
+also be useful in researching novel diseases.</p></li>
+</ul>
+<h2 id="sectioning">Sectioning</h2>
+<h1 id="our-machine-learning-approach">Our Machine Learning
+Approach</h1>
+<h2 id="task">Task</h2>
+<p>We will use a synthetic task to test our hypothesis that models will
+generalize truthfully off-distribution. The synthetic task is computing
+the distance between various vertices in an input graph. Our experiment
+will have three parts:</p>
+<ol type="1">
+<li>Pre-train a transformer to predict the distance between two fixed
+vertices <span class="math inline"><em>s</em>, <em>t</em></span> on
+graphs with <span class="math inline"><em>n</em> ∈ [8, 32)</span>
+vertices.</li>
+<li>Fine-tune a transformer to predict the distances between <span
+class="math inline"><em>s</em>, <em>t</em>′</span> for any <span
+class="math inline"><em>t</em>′</span> which is on the shortest path
+from <span class="math inline"><em>s</em></span> to <span
+class="math inline"><em>t</em></span>, but only do fine-tuning on graphs
+with <span class="math inline"><em>n</em> ∈ [8, 16)</span>
+vertices.</li>
+<li>Test whether the transformer can accurately predict the distances
+between <span class="math inline"><em>s</em>, <em>t</em>′</span> for any
+<span class="math inline"><em>t</em>′</span> on the shortest path from
+<span class="math inline"><em>s</em></span> to <span
+class="math inline"><em>t</em></span> for graphs with <span
+class="math inline"><em>n</em> ∈ [16, 32)</span> vertices.</li>
+</ol>
+<h2 id="data">Data</h2>
+<p>We will represent an <span class="math inline"><em>n</em></span>
+vertex, <span class="math inline"><em>m</em></span> edge unweighted,
+undirected graph as a sequence of the endpoints of the <span
+class="math inline"><em>m</em></span> edges, so <span
+class="math inline">[<em>a</em><sub>1</sub>, <em>b</em><sub>1</sub>, <em>a</em><sub>2</sub>, <em>b</em><sub>2</sub>, …, <em>a</em><sub><em>m</em></sub>, <em>b</em><sub><em>m</em></sub>]</span>
+represents a graph with the edges <span
+class="math inline">{(<em>a</em><sub><em>i</em></sub>, <em>b</em><sub><em>i</em></sub>)}</span>
+for <span class="math inline">1 ≤ <em>i</em> ≤ <em>m</em></span>. We
+will pad all sequences to be the same length using the padding token
+0.</p>
+<p>The full input to our model additionally appends the target vertex
+after the padding tokens. The model is tasked with predicting the length
+of the shortest path between vertex 1 and the target vertex <span
+class="math inline"><em>t</em></span>. If no such path exists, we define
+the length to be <span class="math inline"><em>n</em> + 1</span>, which
+represents infinity. For example, an input-output pair for our model
+could look like <span
+class="math inline">[1, 3, 3, 2, 0, 0, 0, 0, 2]</span> and <span
+class="math inline">2</span> respectively.</p>
+<p>We have three separate datasets; a data-generation sketch follows
+the list below.</p>
+<ul>
+<li><strong>Pre-train data</strong>: For each <span
+class="math inline"><em>n</em> ∈ [8, 32)</span>, we will generate
+several graphs on <span class="math inline"><em>n</em></span> vertices.
+We generate these graphs by inserting <span
+class="math inline">2<em>n</em></span> random edges into the graph. We
+always set the target vertex to be <span class="math inline">2</span>
+here.</li>
+<li><strong>Fine-tune data</strong>: For each <span
+class="math inline"><em>n</em> ∈ [8, 16)</span>, we will generate
+several graphs on <span class="math inline"><em>n</em></span> vertices.
+We generate these graphs by inserting <span
+class="math inline">2<em>n</em></span> random edges into the graph. We
+select the target vertex to be a random vertex on the shortest path from
+<span class="math inline">1</span> to <span
+class="math inline">2</span>.</li>
+<li><strong>Generalization testing data</strong>: The same as the
+fine-tune data, except we sample <span
+class="math inline"><em>n</em> ∈ [16, 32)</span> instead.</li>
+</ul>
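+<p>Below is a sketch of how examples for these datasets could be
+generated under the description above. The handling of duplicate edges
+and of graphs where vertex 2 is unreachable are our own assumptions,
+and labels come from <code>shortest_path_label</code> in the earlier
+sketch:</p>
+<pre><code>import random
+from collections import deque
+
+def random_graph(n):
+    """Sample an n-vertex graph by inserting 2n random edges between distinct vertices."""
+    return [tuple(random.sample(range(1, n + 1), 2)) for _ in range(2 * n)]
+
+def shortest_path_1_to_2(edges, n):
+    """Return one shortest path from vertex 1 to vertex 2 (as a list), or None."""
+    adj = {v: [] for v in range(1, n + 1)}
+    for a, b in edges:
+        adj[a].append(b)
+        adj[b].append(a)
+    parent = {1: None}
+    queue = deque([1])
+    while queue:
+        u = queue.popleft()
+        for w in adj[u]:
+            if w not in parent:
+                parent[w] = u
+                queue.append(w)
+    if 2 not in parent:
+        return None
+    path, v = [], 2
+    while v is not None:
+        path.append(v)
+        v = parent[v]
+    return path
+
+def make_example(n, finetune=False):
+    """Pre-train examples fix the target at 2; fine-tune/test examples pick it on a shortest path."""
+    edges = random_graph(n)
+    if not finetune:
+        return edges, 2
+    path = shortest_path_1_to_2(edges, n)
+    if path is None:               # assumption: resample when vertices 1 and 2 are disconnected
+        return make_example(n, finetune)
+    return edges, random.choice(path)
+
+# Pre-train: n drawn from [8, 32); fine-tune: n from [8, 16); generalization test: n from [16, 32).
+</code></pre>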
+<p>As a side note, we are also curious whether the transformer learns to
+generalize to different distributions of graphs, such as denser graphs
+or graphs with different properties. Time permitting, we will also
+investigate this.</p>
+<h2 id="architecture">Architecture</h2>
+<p>We plan to use a standard transformer architecture. We will ensure
+that the number of layers in our transformer is at least the diameter of
+the graph. This guarantees that there is an extremely simple circuit,
+namely BFS, that the transformer could in theory learn in order to
+perform the task. Note that if the transformer actually learns a simple
+circuit to perform this task, then it seems more likely to generalize
+well. This is also our intuition for why it should be possible to
+fine-tune on a small amount of data for finding shortest paths to other
+vertices besides <span class="math inline">2</span>: the model should
+already be computing these other distances as intermediate values in its
+computation of the distance to vertex <span
+class="math inline">2</span>.</p>
+<h2 id="positional-encodings">Positional Encodings</h2>
+<p>In order to facilitate performing this task with limited
+computational resources, we plan to use custom-made positional encodings
+that tell the model extra information about the structure of the
+problem, rather than the traditional sine/cosine positional encodings.
+(TODO: THIS IS OUTDATED) Specifically, our positional encodings are
+<span
+class="math inline"><em>v</em><sub>1</sub>, <em>v</em><sub>1</sub>, <em>v</em><sub>2</sub>, <em>v</em><sub>2</sub>, …, <em>v</em><sub><em>m</em></sub>, <em>v</em><sub><em>m</em></sub>, <em>v</em><sub><em>m</em> + 1</sub></span>
+where each <span
+class="math inline"><em>v</em><sub><em>i</em></sub></span> is a random
+vector so each <span
+class="math inline"><em>v</em><sub><em>i</em></sub>, <em>v</em><sub><em>j</em></sub></span>
+pair is nearly orthogonal with high probability. We will concatenate
+these with the token encodings rather than adding them. This should let
+the model easily have large attention scores between vertices
+corresponding to a single edge.</p>
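+<p>A small sketch of these encodings is below; the encoding dimension
+and the unit normalization are our own assumptions, and padding tokens
+are ignored for simplicity:</p>
+<pre><code>import torch
+
+def edge_positional_encodings(m, d_pos):
+    """Random unit vector per edge, repeated for both endpoints, plus one for the target token.
+
+    Independent random unit vectors in high dimension are nearly orthogonal with high
+    probability, so the two tokens of an edge share a code that no other token has.
+    """
+    v = torch.randn(m + 1, d_pos)
+    v = v / v.norm(dim=1, keepdim=True)
+    per_edge = v[:m].repeat_interleave(2, dim=0)  # v1, v1, v2, v2, ..., vm, vm
+    return torch.cat([per_edge, v[m:]], dim=0)    # append v_{m+1} for the target vertex
+
+# Concatenated (not added) to the token embeddings:
+# x = torch.cat([token_embeddings, edge_positional_encodings(m, d_pos)], dim=-1)
+</code></pre>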
+<h2 class="unnumbered" id="references">References</h2>
+<div id="refs" class="references csl-bib-body hanging-indent"
+data-entry-spacing="0" role="list">
+<div id="ref-DBLP:journals/corr/abs-2102-09544" class="csl-entry"
+role="listitem">
+Cappart, Quentin, Didier Chételat, Elias B. Khalil, Andrea Lodi,
+Christopher Morris, and Petar Velickovic. 2021. <span>“Combinatorial
+Optimization and Reasoning with Graph Neural Networks.”</span>
+<em>CoRR</em> abs/2102.09544. <a
+href="https://arxiv.org/abs/2102.09544">https://arxiv.org/abs/2102.09544</a>.
+</div>
+<div id="ref-10.1109/TPAMI.2023.3256421" class="csl-entry"
+role="listitem">
+Tutsoy, Onder. 2023. <span>“Graph Theory Based Large-Scale Machine
+Learning with Multi-Dimensional Constrained Optimization Approaches for
+Exact Epidemiological Modeling of Pandemic Diseases.”</span> <em>IEEE
+Trans. Pattern Anal. Mach. Intell.</em> 45 (8): 9836–45. <a
+href="https://doi.org/10.1109/TPAMI.2023.3256421">https://doi.org/10.1109/TPAMI.2023.3256421</a>.
+</div>
+<div id="ref-10.5555/3666122.3666260" class="csl-entry" role="listitem">
+Zang, Xiao, Miao Yin, Jinqi Xiao, Saman Zonouz, and Bo Yuan. 2024.
+<span>“GraphMP: Graph Neural Network-Based Motion Planning with
+Efficient Graph Search.”</span> In <em>Proceedings of the 37th
+International Conference on Neural Information Processing Systems</em>.
+NIPS ’23. Red Hook, NY, USA: Curran Associates Inc.
+</div>
+</div>
+</body>
+</html>