<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="pandoc" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  <title>6.7960 Project: Investigating Off-Distribution Generalization of Transformers</title>
  <style>
    html {
      color: #1a1a1a;
      background-color: #fdfdfd;
    }
    body {
      margin: 0 auto;
      max-width: 36em;
      padding-left: 50px;
      padding-right: 50px;
      padding-top: 50px;
      padding-bottom: 50px;
      hyphens: auto;
      overflow-wrap: break-word;
      text-rendering: optimizeLegibility;
      font-kerning: normal;
    }
    @media (max-width: 600px) {
      body {
        font-size: 0.9em;
        padding: 12px;
      }
      h1 {
        font-size: 1.8em;
      }
    }
    @media print {
      html {
        background-color: white;
      }
      body {
        background-color: transparent;
        color: black;
        font-size: 12pt;
      }
      p, h2, h3 {
        orphans: 3;
        widows: 3;
      }
      h2, h3, h4 {
        page-break-after: avoid;
      }
    }
    p {
      margin: 1em 0;
    }
    a {
      color: #1a1a1a;
    }
    a:visited {
      color: #1a1a1a;
    }
    img {
      max-width: 100%;
    }
    svg {
      height: auto;
      max-width: 100%;
    }
    h1, h2, h3, h4, h5, h6 {
      margin-top: 1.4em;
    }
    h5, h6 {
      font-size: 1em;
      font-style: italic;
    }
    h6 {
      font-weight: normal;
    }
    ol, ul {
      padding-left: 1.7em;
      margin-top: 1em;
    }
    li > ol, li > ul {
      margin-top: 0;
    }
    blockquote {
      margin: 1em 0 1em 1.7em;
      padding-left: 1em;
      border-left: 2px solid #e6e6e6;
      color: #606060;
    }
    code {
      font-family: Menlo, Monaco, Consolas, 'Lucida Console', monospace;
      font-size: 85%;
      margin: 0;
      hyphens: manual;
    }
    pre {
      margin: 1em 0;
      overflow: auto;
    }
    pre code {
      padding: 0;
      overflow: visible;
      overflow-wrap: normal;
    }
    .sourceCode {
     background-color: transparent;
     overflow: visible;
    }
    hr {
      background-color: #1a1a1a;
      border: none;
      height: 1px;
      margin: 1em 0;
    }
    table {
      margin: 1em 0;
      border-collapse: collapse;
      width: 100%;
      overflow-x: auto;
      display: block;
      font-variant-numeric: lining-nums tabular-nums;
    }
    table caption {
      margin-bottom: 0.75em;
    }
    tbody {
      margin-top: 0.5em;
      border-top: 1px solid #1a1a1a;
      border-bottom: 1px solid #1a1a1a;
    }
    th {
      border-top: 1px solid #1a1a1a;
      padding: 0.25em 0.5em 0.25em 0.5em;
    }
    td {
      padding: 0.125em 0.5em 0.25em 0.5em;
    }
    header {
      margin-bottom: 4em;
      text-align: center;
    }
    #TOC li {
      list-style: none;
    }
    #TOC ul {
      padding-left: 1.3em;
    }
    #TOC > ul {
      padding-left: 0;
    }
    #TOC a:not(:hover) {
      text-decoration: none;
    }
    code{white-space: pre-wrap;}
    span.smallcaps{font-variant: small-caps;}
    div.columns{display: flex; gap: min(4vw, 1.5em);}
    div.column{flex: auto; overflow-x: auto;}
    div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
    /* The extra [class] is a hack that increases specificity enough to
       override a similar rule in reveal.js */
    ul.task-list[class]{list-style: none;}
    ul.task-list li input[type="checkbox"] {
      font-size: inherit;
      width: 0.8em;
      margin: 0 0.8em 0.2em -1.6em;
      vertical-align: middle;
    }
    .display.math{display: block; text-align: center; margin: 0.5rem auto;}
    /* CSS for citations */
    div.csl-bib-body { }
    div.csl-entry {
      clear: both;
      margin-bottom: 0em;
    }
    .hanging-indent div.csl-entry {
      margin-left:2em;
      text-indent:-2em;
    }
    div.csl-left-margin {
      min-width:2em;
      float:left;
    }
    div.csl-right-inline {
      margin-left:2em;
      padding-left:1em;
    }
    div.csl-indent {
      margin-left: 2em;
    }  </style>
</head>
<body>
<header id="title-block-header">
<h1 class="title">6.7960 Project: Investigating Off-Distribution
Generalization of Transformers</h1>
</header>
<!-- Guidelines: https://www.dropbox.com/scl/fi/bet8enscln8ue36kd8t17/final_project_guidelines.pdf?rlkey=knd19cnumk51ho1y9crno56ib&e=2&dl=0 -->
<div style="text-align:center">
<p>Anthony Wang, Alek Westover, Kevin Zhao</p>
<p>{xy,alekw,kevinmz}@mit.edu</p>
</div>
<h2 id="abstract">Abstract</h2>
<p>LLMs have been developing very quickly, and with that comes
the concern of aligning the models to output true and productive
statements. One common approach for ensuring this is to have a human in
the loop rewarding the model for true outputs (e.g. RLHF), but one
drawback of this approach is that humans can be poor judges of
truthfulness. As LLMs become more capable, there might not even exist
experts who are good judges of whether the model’s outputs, such as
difficult mathematical proofs, are truthful. So, we’d like to propose a
potential solution to this issue via <strong>off-distribution
generalization</strong> — applying human-like intuition to solve
problems not in the training data. Our testbed is a graph problem of
the kind traditionally tackled with Graph Neural Networks (GNNs),
which are important in many artificial intelligence applications. Paul
Christiano <a
href="https://www.alignmentforum.org/posts/BxersHYN2qcFoonwg/experimentally-evaluating-whether-honesty-generalizes?commentId=dsDA2BWpHPdgLvaXX">proposed
an experiment</a> about shortest paths in a graph; our project is
essentially to implement Christiano’s proposed experiment. To the best
of our knowledge, although there has been research applying machine
learning to different variations of graph search <span
class="citation" data-cites="10.5555/3666122.3666260">(<a
href="#ref-10.5555/3666122.3666260" role="doc-biblioref">Zang et al.
2024</a>)</span>, no one has run our exact experiment yet.</p>
<h1 id="introduction">Introduction</h1>
<h2 id="overview">Overview</h2>
<h2 id="related-work">Related Work</h2>
<p>There has been some research into the algorithmic optimization of
GNNs and how they may solve real-world problems; however, none of this
work targets using generic machine learning methods, such as
transformers, to solve graph problems.</p>
<ul>
<li><p>Cappart et al. have researched combinatorial optimization with
GNNs and developed algorithms for related tasks, facilitating machine
learning approaches <span class="citation"
data-cites="DBLP:journals/corr/abs-2102-09544">(<a
href="#ref-DBLP:journals/corr/abs-2102-09544"
role="doc-biblioref">Cappart et al. 2021</a>)</span>. Their results are
mostly algorithmic, so we take the direction further by trading a bit
of accuracy for much faster computation on such tasks.</p></li>
<li><p>Tutsoy uses a graph-theoretic approach to model the
epidemiological characteristics of infectious diseases such as COVID-19
<span class="citation" data-cites="10.1109/TPAMI.2023.3256421">(<a
href="#ref-10.1109/TPAMI.2023.3256421" role="doc-biblioref">Tutsoy
2023</a>)</span>. His paper suggests how graph-based optimization may
also be useful in researching novel diseases.</p></li>
</ul>
<h2 id="sectioning">Sectioning</h2>
<h1 id="our-machine-learning-approach">Our Machine Learning
Approach</h1>
<h2 id="task">Task</h2>
<p>We will use a synthetic task to test our hypothesis that models will
generalize truthfully off-distribution. The synthetic task is computing
the distance between various vertices in an input graph. Our experiment
will have three parts (sketched in code after the list):</p>
<ol type="1">
<li>Pre-train a transformer to predict the distance between two fixed
vertices <span class="math inline"><em>s</em>, <em>t</em></span> on
graphs with <span class="math inline"><em>n</em> ∈ [8, 32)</span>
vertices.</li>
<li>Fine-tune a transformer to predict the distances between <span
class="math inline"><em>s</em>, <em>t</em>′</span> for any <span
class="math inline"><em>t</em>′</span> which is on the shortest path
from <span class="math inline"><em>s</em></span> to <span
class="math inline"><em>t</em></span>, but only do fine-tuning on graphs
with <span class="math inline"><em>n</em> ∈ [8, 16)</span>
vertices.</li>
<li>Test whether the transformer can accurately predict the distances
between <span class="math inline"><em>s</em>, <em>t</em>′</span> for any
<span class="math inline"><em>t</em>′</span> on the shortest path from
<span class="math inline"><em>s</em></span> to <span
class="math inline"><em>t</em></span> for graphs with <span
class="math inline"><em>n</em> ∈ [16, 32)</span> vertices.</li>
</ol>
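<p>In code, a minimal sketch of this three-phase loop might look as
follows (Python/PyTorch; the regression loss, learning rates, and the
<code>*_loader</code> names are our own placeholder assumptions, not
settled choices):</p>
<pre><code>import torch
import torch.nn as nn

def run_phase(model, loader, epochs=1, lr=1e-4):
    # One supervised phase: regress the shortest-path distance.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for tokens, dist in loader:
            opt.zero_grad()
            loss = loss_fn(model(tokens).squeeze(-1), dist.float())
            loss.backward()
            opt.step()

# Phase 1: pre-train; target vertex fixed to 2, n in [8, 32).
# run_phase(model, pretrain_loader, epochs=epochs_1)
# Phase 2: fine-tune on on-path targets t', but only n in [8, 16).
# run_phase(model, finetune_loader, epochs=epochs_2, lr=1e-5)
# Phase 3: evaluate on on-path targets t' with n in [16, 32).
</code></pre>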
<h2 id="data">Data</h2>
<p>We will represent an <span class="math inline"><em>n</em></span>
vertex, <span class="math inline"><em>m</em></span> edge unweighted,
undirected graph as a sequence of the endpoints of the <span
class="math inline"><em>m</em></span> edges, so <span
class="math inline">[<em>a</em><sub>1</sub>, <em>b</em><sub>1</sub>, <em>a</em><sub>2</sub>, <em>b</em><sub>2</sub>, …, <em>a</em><sub><em>m</em></sub>, <em>b</em><sub><em>m</em></sub>]</span>
represents a graph with the edges <span
class="math inline">{(<em>a</em><sub><em>i</em></sub>, <em>b</em><sub><em>i</em></sub>)}</span>
for <span class="math inline">1 ≤ <em>i</em> ≤ <em>m</em></span>. We
will pad all sequences to be the same length using the padding token
0.</p>
<p>The full input to our model will additionally add the target vertex
after the padding tokens. The model is tasked with predicting the length
of the shortest path between vertex 1 and the target vertex <span
class="math inline"><em>t</em></span>. If no such path exists, we define
the length to be <span class="math inline"><em>n</em> + 1</span> which
represents infinity. For example, an input-output pair for our model
could look like <span
class="math inline">[1, 3, 3, 2, 0, 0, 0, 0, 2]</span> and <span
class="math inline">2</span> respectively.</p>
<p>We have three separate datasets, sampled as sketched in the code
after this list.</p>
<ul>
<li><strong>Pre-train data</strong>: For each <span
class="math inline"><em>n</em> ∈ [8, 32)</span>, we will generate
several graphs on <span class="math inline"><em>n</em></span> vertices.
We generate these graphs by inserting <span
class="math inline">2<em>n</em></span> random edges into the graph. We
always set the target vertex to be <span class="math inline">2</span>
here.</li>
<li><strong>Fine-tune data</strong>: For each <span
class="math inline"><em>n</em> ∈ [8, 16)</span>, we will generate
several graphs on <span class="math inline"><em>n</em></span> vertices.
We generate these graphs by inserting <span
class="math inline">2<em>n</em></span> random edges into the graph. We
select the target vertex to be a random vertex on the shortest path from
<span class="math inline">1</span> to <span
class="math inline">2</span>.</li>
<li><strong>Generalization testing data</strong>: The same as the
fine-tune data, except we sample <span
class="math inline"><em>n</em> ∈ [16, 32)</span> instead.</li>
</ul>
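<p>The sampling above can be sketched as follows (illustrative Python;
how many graphs we draw per <span class="math inline"><em>n</em></span>
and how we handle fine-tune graphs with no 1–2 path are unspecified
details, so the retry loop below is our own assumption):</p>
<pre><code>import random
from collections import deque

def random_graph(n):
    # Insert 2n random edges among vertices 1..n (duplicate edges and
    # self-loops are possible under this simple scheme).
    return [(random.randint(1, n), random.randint(1, n)) for _ in range(2 * n)]

def shortest_path(edges, n, source=1, target=2):
    # BFS returning the vertices on a shortest source-target path, or None.
    adj = {v: set() for v in range(1, n + 1)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    parent, queue = {source: None}, deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                queue.append(w)
    if target not in parent:
        return None
    path = []
    while target is not None:
        path.append(target)
        target = parent[target]
    return path[::-1]

def pretrain_example():
    n = random.randint(8, 31)            # n in [8, 32)
    return random_graph(n), 2            # target is always vertex 2

def finetune_example(lo=8, hi=15):       # hi=31 gives the testing split
    while True:
        n = random.randint(lo, hi)
        edges = random_graph(n)
        path = shortest_path(edges, n)
        if path is not None:
            return edges, random.choice(path)  # random vertex on the 1-2 path
</code></pre>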
<p>As a side note, we are also curious whether the transformer learns to
generalize to different distributions of graphs, such as denser graphs
or graphs with different properties. Time permitting, we will also
investigate this.</p>
<h2 id="architecture">Architecture</h2>
<p>We plan to use a standard transformer architecture. We will ensure
that the number of layers in our transformer is at least the diameter of
the graph. By doing this, we ensure that there is an extremely simple
circuit — namely BFS — that the transformer could in theory learn to
perform the task. Note that if the transformer actually learns a simple
circuit to perform this task, then it seems more likely to generalize
well. This is also our intuition for why it should be possible to
fine-tune on a small amount of data for finding shortest paths to other
vertices besides <span class="math inline">2</span> – it seems like the
model should be computing these other distances as intermediate values
in its computation to find the distance to vertex <span
class="math inline">2</span>.</p>
<h2 id="positional-encodings">Positional Encodings</h2>
<p>In order to facilitate performing this task with limited
computational resources, we plan to use custom-made positional encodings
that tell the model extra information about the structure of the
problem, rather than the traditional sine/cosine positional encodings.
Specifically, our positional encodings are
<span
class="math inline"><em>v</em><sub>1</sub>, <em>v</em><sub>1</sub>, <em>v</em><sub>2</sub>, <em>v</em><sub>2</sub>, …, <em>v</em><sub><em>m</em></sub>, <em>v</em><sub><em>m</em></sub>, <em>v</em><sub><em>m</em> + 1</sub></span>
where each <span
class="math inline"><em>v</em><sub><em>i</em></sub></span> is a random
vector so each <span
class="math inline"><em>v</em><sub><em>i</em></sub>, <em>v</em><sub><em>j</em></sub></span>
pair is nearly orthogonal with high probability. We will concatenate
these with the token encodings rather than adding them. This should let
the model easily have large attention scores between vertices
corresponding to a single edge.</p>
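<p>A sketch of these encodings (the dimension and Gaussian scaling are
our placeholder choices; independent random Gaussian vectors are nearly
orthogonal with high probability):</p>
<pre><code>import torch

def edge_positional_encodings(m, d=32):
    # One random vector per edge, repeated so both endpoint tokens of an
    # edge share an encoding, plus one final vector for the target token.
    v = torch.randn(m + 1, d) / d ** 0.5         # near-orthogonal w.h.p.
    pos = torch.repeat_interleave(v[:m], 2, 0)   # v1, v1, ..., vm, vm
    return torch.cat([pos, v[m:]])               # append v_{m+1}

# Concatenated with (not added to) the token embeddings:
# x = torch.cat([token_embeddings, edge_positional_encodings(m)], dim=-1)
</code></pre>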
<h2 class="unnumbered" id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent"
data-entry-spacing="0" role="list">
<div id="ref-DBLP:journals/corr/abs-2102-09544" class="csl-entry"
role="listitem">
Cappart, Quentin, Didier Chételat, Elias B. Khalil, Andrea Lodi,
Christopher Morris, and Petar Velickovic. 2021. <span>“Combinatorial
Optimization and Reasoning with Graph Neural Networks.”</span>
<em>CoRR</em> abs/2102.09544. <a
href="https://arxiv.org/abs/2102.09544">https://arxiv.org/abs/2102.09544</a>.
</div>
<div id="ref-10.1109/TPAMI.2023.3256421" class="csl-entry"
role="listitem">
Tutsoy, Onder. 2023. <span>“Graph Theory Based Large-Scale Machine
Learning with Multi-Dimensional Constrained Optimization Approaches for
Exact Epidemiological Modeling of Pandemic Diseases.”</span> <em>IEEE
Trans. Pattern Anal. Mach. Intell.</em> 45 (8): 9836–45. <a
href="https://doi.org/10.1109/TPAMI.2023.3256421">https://doi.org/10.1109/TPAMI.2023.3256421</a>.
</div>
<div id="ref-10.5555/3666122.3666260" class="csl-entry" role="listitem">
Zang, Xiao, Miao Yin, Jinqi Xiao, Saman Zonouz, and Bo Yuan. 2024.
<span>“GraphMP: Graph Neural Network-Based Motion Planning with
Efficient Graph Search.”</span> In <em>Proceedings of the 37th
International Conference on Neural Information Processing Systems</em>.
NIPS ’23. Red Hook, NY, USA: Curran Associates Inc.
</div>
</div>
</body>
</html>