Diffstat (limited to 'index.html')
-rw-r--r-- | index.html | 381
1 file changed, 381 insertions, 0 deletions
diff --git a/index.html b/index.html new file mode 100644 index 0000000..0126a3b --- /dev/null +++ b/index.html @@ -0,0 +1,381 @@ +<!DOCTYPE html> +<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang=""> +<head> + <meta charset="utf-8" /> + <meta name="generator" content="pandoc" /> + <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" /> + <title>6.7960 Project: Investigating Off-Distribution Generalization of Transformers</title> + <style> + html { + color: #1a1a1a; + background-color: #fdfdfd; + } + body { + margin: 0 auto; + max-width: 36em; + padding-left: 50px; + padding-right: 50px; + padding-top: 50px; + padding-bottom: 50px; + hyphens: auto; + overflow-wrap: break-word; + text-rendering: optimizeLegibility; + font-kerning: normal; + } + @media (max-width: 600px) { + body { + font-size: 0.9em; + padding: 12px; + } + h1 { + font-size: 1.8em; + } + } + @media print { + html { + background-color: white; + } + body { + background-color: transparent; + color: black; + font-size: 12pt; + } + p, h2, h3 { + orphans: 3; + widows: 3; + } + h2, h3, h4 { + page-break-after: avoid; + } + } + p { + margin: 1em 0; + } + a { + color: #1a1a1a; + } + a:visited { + color: #1a1a1a; + } + img { + max-width: 100%; + } + svg { + height: auto; + max-width: 100%; + } + h1, h2, h3, h4, h5, h6 { + margin-top: 1.4em; + } + h5, h6 { + font-size: 1em; + font-style: italic; + } + h6 { + font-weight: normal; + } + ol, ul { + padding-left: 1.7em; + margin-top: 1em; + } + li > ol, li > ul { + margin-top: 0; + } + blockquote { + margin: 1em 0 1em 1.7em; + padding-left: 1em; + border-left: 2px solid #e6e6e6; + color: #606060; + } + code { + font-family: Menlo, Monaco, Consolas, 'Lucida Console', monospace; + font-size: 85%; + margin: 0; + hyphens: manual; + } + pre { + margin: 1em 0; + overflow: auto; + } + pre code { + padding: 0; + overflow: visible; + overflow-wrap: normal; + } + .sourceCode { + background-color: transparent; + overflow: visible; + } + hr { + background-color: #1a1a1a; + border: none; + height: 1px; + margin: 1em 0; + } + table { + margin: 1em 0; + border-collapse: collapse; + width: 100%; + overflow-x: auto; + display: block; + font-variant-numeric: lining-nums tabular-nums; + } + table caption { + margin-bottom: 0.75em; + } + tbody { + margin-top: 0.5em; + border-top: 1px solid #1a1a1a; + border-bottom: 1px solid #1a1a1a; + } + th { + border-top: 1px solid #1a1a1a; + padding: 0.25em 0.5em 0.25em 0.5em; + } + td { + padding: 0.125em 0.5em 0.25em 0.5em; + } + header { + margin-bottom: 4em; + text-align: center; + } + #TOC li { + list-style: none; + } + #TOC ul { + padding-left: 1.3em; + } + #TOC > ul { + padding-left: 0; + } + #TOC a:not(:hover) { + text-decoration: none; + } + code{white-space: pre-wrap;} + span.smallcaps{font-variant: small-caps;} + div.columns{display: flex; gap: min(4vw, 1.5em);} + div.column{flex: auto; overflow-x: auto;} + div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} + /* The extra [class] is a hack that increases specificity enough to + override a similar rule in reveal.js */ + ul.task-list[class]{list-style: none;} + ul.task-list li input[type="checkbox"] { + font-size: inherit; + width: 0.8em; + margin: 0 0.8em 0.2em -1.6em; + vertical-align: middle; + } + .display.math{display: block; text-align: center; margin: 0.5rem auto;} + /* CSS for citations */ + div.csl-bib-body { } + div.csl-entry { + clear: both; + margin-bottom: 0em; + } + .hanging-indent div.csl-entry { + margin-left:2em; + text-indent:-2em; + } + 
div.csl-left-margin {
  min-width:2em;
  float:left;
}
div.csl-right-inline {
  margin-left:2em;
  padding-left:1em;
}
div.csl-indent {
  margin-left: 2em;
}
</style>
</head>
<body>
<header id="title-block-header">
<h1 class="title">6.7960 Project: Investigating Off-Distribution
Generalization of Transformers</h1>
</header>
<!-- Guidelines: https://www.dropbox.com/scl/fi/bet8enscln8ue36kd8t17/final_project_guidelines.pdf?rlkey=knd19cnumk51ho1y9crno56ib&e=2&dl=0 -->
<!-- <div style="display: flex; justify-content: space-between;">
<div style="flex: 1; margin: 5px; padding: 10px; border: 1px solid #ddd; text-align: center;"> -->
<div style="text-align:center">
<p>Anthony Wang, Alek Westover, Kevin Zhao</p>
<p>{xy,alekw,kevinmz}@mit.edu</p>
</div>
<h2 id="abstract">Abstract</h2>
<p>TODO: Probably should move this to the introduction instead of the
abstract?</p>
<p>LLMs have been improving rapidly, and with that comes the concern of
aligning models to output true and productive statements. One common
approach is to keep a human in the loop who rewards the model for true
outputs (e.g. RLHF), but a drawback of this approach is that humans can
be poor judges of truthfulness. As LLMs become more capable, there may
not even exist experts who can judge whether a model’s outputs, such as
difficult mathematical proofs, are truthful. We would therefore like to
propose a potential solution to this issue via <strong>off-distribution
generalization</strong>: applying human-like intuition to solve problems
not in the training data. This goal is closely related to Graph Neural
Networks (GNNs), which are important in many artificial intelligence
applications, including the design of practical features. Paul
Christiano <a
href="https://www.alignmentforum.org/posts/BxersHYN2qcFoonwg/experimentally-evaluating-whether-honesty-generalizes?commentId=dsDA2BWpHPdgLvaXX">proposed
an experiment</a> about predicting shortest paths in a graph; our
project essentially implements Christiano’s proposed experiment. To the
best of our knowledge, although there has been research applying machine
learning to different variations of graph search <span
class="citation" data-cites="10.5555/3666122.3666260">(<a
href="#ref-10.5555/3666122.3666260" role="doc-biblioref">Zang et al.
2024</a>)</span>, no one has run our exact experiment yet.</p>
<h1 id="introduction">Introduction</h1>
<h2 id="overview">Overview</h2>
<h2 id="related-work">Related Work</h2>
<p>There has been some research into the algorithmic optimization of
GNNs and how they may solve real-world issues; however, none of the
related work targets using generic machine learning methods to solve
graph problems.</p>
<ul>
<li><p>Cappart et al. have researched combinatorial optimization with
GNNs and developed algorithms for related tasks, thus facilitating
machine learning <span class="citation"
data-cites="DBLP:journals/corr/abs-2102-09544">(<a
href="#ref-DBLP:journals/corr/abs-2102-09544"
role="doc-biblioref">Cappart et al. 2021</a>)</span>. Their results are
mostly algorithmic, so we develop further by trading a bit of accuracy
for much faster computation on such tasks.</p></li>
<li><p>Tutsoy uses a graph-theory-based approach to model the
epidemiological characteristics of infectious diseases, such as COVID-19
<span class="citation" data-cites="10.1109/TPAMI.2023.3256421">(<a
href="#ref-10.1109/TPAMI.2023.3256421" role="doc-biblioref">Tutsoy
2023</a>)</span>. We understand from his paper how GNN optimization may
also be useful in researching novel diseases.</p></li>
</ul>
<h2 id="sectioning">Sectioning</h2>
<h1 id="our-machine-learning-approach">Our Machine Learning
Approach</h1>
<h2 id="task">Task</h2>
<p>We will use a synthetic task to test our hypothesis that models will
generalize truthfully off-distribution. The synthetic task is computing
the distance between various vertices in an input graph. Our experiment
will have three parts, summarized in the sketch after the list:</p>
<ol type="1">
<li>Pre-train a transformer to predict the distance between two fixed
vertices <span class="math inline"><em>s</em>, <em>t</em></span> on
graphs with <span class="math inline"><em>n</em> ∈ [8, 32)</span>
vertices.</li>
<li>Fine-tune a transformer to predict the distances between <span
class="math inline"><em>s</em>, <em>t</em>′</span> for any <span
class="math inline"><em>t</em>′</span> on the shortest path from <span
class="math inline"><em>s</em></span> to <span
class="math inline"><em>t</em></span>, but only do fine-tuning on graphs
with <span class="math inline"><em>n</em> ∈ [8, 16)</span>
vertices.</li>
<li>Test whether the transformer can accurately predict the distances
between <span class="math inline"><em>s</em>, <em>t</em>′</span> for any
<span class="math inline"><em>t</em>′</span> on the shortest path from
<span class="math inline"><em>s</em></span> to <span
class="math inline"><em>t</em></span> for graphs with <span
class="math inline"><em>n</em> ∈ [16, 32)</span> vertices.</li>
</ol>
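<p>To make the three phases concrete, the following sketch shows one way
the experiment could be organized. The names and data structure below
are our own illustration, not part of Christiano’s proposal; they only
restate the vertex-count ranges and target-selection rules from the list
above.</p>
<pre><code class="python"># Illustrative summary of the three experiment phases (names are ours).
PHASES = {
    # predict dist(s, t) for the two fixed vertices s, t
    "pretrain": {"n_range": (8, 32), "target": "fixed vertex t"},
    # predict dist(s, t') for t' on the shortest s-t path, small graphs only
    "finetune": {"n_range": (8, 16), "target": "random vertex on the s-t shortest path"},
    # same targets as fine-tuning, but on larger graph sizes never fine-tuned on
    "test":     {"n_range": (16, 32), "target": "random vertex on the s-t shortest path"},
}</code></pre>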
<h2 id="data">Data</h2>
<p>We will represent an <span class="math inline"><em>n</em></span>-vertex,
<span class="math inline"><em>m</em></span>-edge unweighted, undirected
graph as a sequence of the endpoints of the <span
class="math inline"><em>m</em></span> edges, so <span
class="math inline">[<em>a</em><sub>1</sub>, <em>b</em><sub>1</sub>, <em>a</em><sub>2</sub>, <em>b</em><sub>2</sub>, …, <em>a</em><sub><em>m</em></sub>, <em>b</em><sub><em>m</em></sub>]</span>
represents a graph with the edges <span
class="math inline">{(<em>a</em><sub><em>i</em></sub>, <em>b</em><sub><em>i</em></sub>)}</span>
for <span class="math inline">1 ≤ <em>i</em> ≤ <em>m</em></span>. We
will pad all sequences to the same length using the padding token
0.</p>
<p>The full input to our model additionally appends the target vertex
after the padding tokens. The model is tasked with predicting the length
of the shortest path between vertex 1 and the target vertex <span
class="math inline"><em>t</em></span>. If no such path exists, we define
the length to be <span class="math inline"><em>n</em> + 1</span>, which
represents infinity. For example, an input-output pair for our model
could be the input <span
class="math inline">[1, 3, 3, 2, 0, 0, 0, 0, 2]</span> (the edges (1, 3)
and (3, 2), then padding, then the target vertex 2) and the output <span
class="math inline">2</span>, the length of the shortest path
1 → 3 → 2.</p>
<p>We have three separate datasets; a sketch of how a single example
could be generated and encoded follows the list.</p>
<ul>
<li><strong>Pre-train data</strong>: For each <span
class="math inline"><em>n</em> ∈ [8, 32)</span>, we will generate
several graphs on <span class="math inline"><em>n</em></span> vertices.
We generate these graphs by inserting <span
class="math inline">2<em>n</em></span> random edges into the graph. We
always set the target vertex to be <span class="math inline">2</span>
here.</li>
<li><strong>Fine-tune data</strong>: For each <span
class="math inline"><em>n</em> ∈ [8, 16)</span>, we will generate
several graphs on <span class="math inline"><em>n</em></span> vertices.
We generate these graphs by inserting <span
class="math inline">2<em>n</em></span> random edges into the graph. We
select the target vertex to be a random vertex on the shortest path from
<span class="math inline">1</span> to <span
class="math inline">2</span>.</li>
<li><strong>Generalization testing data</strong>: The same as the
fine-tune data, except we sample <span
class="math inline"><em>n</em> ∈ [16, 32)</span> instead.</li>
</ul>
<p>As a side note, we are also curious whether the transformer learns to
generalize to different distributions of graphs, such as denser graphs
or graphs with different properties. Time permitting, we will also
investigate this.</p>
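<p>The following Python sketch illustrates how a single example could be
generated and encoded under the scheme above. It is only an illustration
of the description, not our actual data-generation code; the helper
names, the plain-BFS implementation, and the fixed padded length are our
own choices.</p>
<pre><code class="python">import random
from collections import deque

PAD = 0  # padding token

def random_graph(n):
    """Insert 2n random edges among vertices 1..n (undirected, unweighted)."""
    return [tuple(random.sample(range(1, n + 1), 2)) for _ in range(2 * n)]

def bfs_distance(n, edges, source, target):
    """Length of the shortest source-target path; n + 1 stands for 'no path'."""
    adj = {v: [] for v in range(1, n + 1)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist, queue = {source: 0}, deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist.get(target, n + 1)

def encode(edges, target, seq_len):
    """Flatten edge endpoints, pad with 0s to seq_len, append the target vertex."""
    tokens = [v for edge in edges for v in edge]
    tokens += [PAD] * (seq_len - len(tokens))
    return tokens + [target]

# The example from the text: edges (1, 3) and (3, 2), target vertex 2, label 2.
edges = [(1, 3), (3, 2)]
print(encode(edges, target=2, seq_len=8))  # [1, 3, 3, 2, 0, 0, 0, 0, 2]
print(bfs_distance(4, edges, source=1, target=2))  # 2</code></pre>
<p>For the pre-train set the target would always be vertex 2, while for
the fine-tune and generalization sets one would instead pick a random
vertex on the recovered shortest path from 1 to 2.</p>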
<h2 id="architecture">Architecture</h2>
<p>We plan to use a standard transformer architecture. We will ensure
that the number of layers in our transformer is at least the diameter of
the graph. By doing this, we ensure that there is an extremely simple
circuit — namely BFS — that the transformer could in theory learn in
order to perform the task. Note that if the transformer actually learns
a simple circuit to perform this task, then it seems more likely to
generalize well. This is also our intuition for why it should be
possible to fine-tune on a small amount of data for finding shortest
paths to other vertices besides <span class="math inline">2</span>: it
seems like the model should already be computing these other distances
as intermediate values in its computation of the distance to vertex
<span class="math inline">2</span>.</p>
<h2 id="positional-encodings">Positional Encodings</h2>
<p>In order to make this task feasible with limited computational
resources, we plan to use custom-made positional encodings that tell the
model extra information about the structure of the problem, rather than
the traditional sine/cosine positional encodings. (TODO: THIS IS
OUTDATED) Specifically, our positional encodings are <span
class="math inline"><em>v</em><sub>1</sub>, <em>v</em><sub>1</sub>, <em>v</em><sub>2</sub>, <em>v</em><sub>2</sub>, …, <em>v</em><sub><em>m</em></sub>, <em>v</em><sub><em>m</em></sub>, <em>v</em><sub><em>m</em> + 1</sub></span>,
where each <span
class="math inline"><em>v</em><sub><em>i</em></sub></span> is a random
vector, so that each pair <span
class="math inline"><em>v</em><sub><em>i</em></sub>, <em>v</em><sub><em>j</em></sub></span>
is nearly orthogonal with high probability. We will concatenate these
with the token encodings rather than adding them. This should let the
model easily produce large attention scores between the two vertices of
a single edge.</p>
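<p>As an illustration of this scheme, the sketch below draws one random
vector per edge (plus one for the final target token), repeats each edge
vector for both endpoints, and concatenates the result with the token
embeddings. The use of PyTorch, the specific dimensions, and the choice
to treat padding positions as empty edge slots are our own assumptions
for illustration.</p>
<pre><code class="python">import torch

def edge_positional_encodings(m, d_pos):
    """Random near-orthogonal vectors v1,v1,...,vm,vm,v_{m+1}: one per edge,
    shared by both endpoints, plus one for the trailing target-vertex token."""
    v = torch.randn(m + 1, d_pos) / d_pos ** 0.5
    endpoints = v[:m].repeat_interleave(2, dim=0)  # v1,v1,v2,v2,...,vm,vm
    return torch.cat([endpoints, v[m:]], dim=0)    # shape (2m + 1, d_pos)

def embed(tokens, token_emb, d_pos):
    """Concatenate token embeddings with the positional vectors."""
    m = (tokens.shape[0] - 1) // 2  # 2m edge-endpoint slots plus the target token
    pos = edge_positional_encodings(m, d_pos)
    return torch.cat([token_emb(tokens), pos], dim=-1)

# Toy input from the Data section; sizes here are hypothetical.
tokens = torch.tensor([1, 3, 3, 2, 0, 0, 0, 0, 2])
token_emb = torch.nn.Embedding(num_embeddings=33, embedding_dim=32)
x = embed(tokens, token_emb, d_pos=16)  # shape (9, 48)</code></pre>
<p>Because the two endpoints of an edge share the same (nearly unique)
positional vector, an attention head can match them with a large dot
product, which is the property the paragraph above relies on.</p>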
<h2 class="unnumbered" id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent"
data-entry-spacing="0" role="list">
<div id="ref-DBLP:journals/corr/abs-2102-09544" class="csl-entry"
role="listitem">
Cappart, Quentin, Didier Chételat, Elias B. Khalil, Andrea Lodi,
Christopher Morris, and Petar Velickovic. 2021. <span>“Combinatorial
Optimization and Reasoning with Graph Neural Networks.”</span>
<em>CoRR</em> abs/2102.09544. <a
href="https://arxiv.org/abs/2102.09544">https://arxiv.org/abs/2102.09544</a>.
</div>
<div id="ref-10.1109/TPAMI.2023.3256421" class="csl-entry"
role="listitem">
Tutsoy, Onder. 2023. <span>“Graph Theory Based Large-Scale Machine
Learning with Multi-Dimensional Constrained Optimization Approaches for
Exact Epidemiological Modeling of Pandemic Diseases.”</span> <em>IEEE
Trans. Pattern Anal. Mach. Intell.</em> 45 (8): 9836–45. <a
href="https://doi.org/10.1109/TPAMI.2023.3256421">https://doi.org/10.1109/TPAMI.2023.3256421</a>.
</div>
<div id="ref-10.5555/3666122.3666260" class="csl-entry" role="listitem">
Zang, Xiao, Miao Yin, Jinqi Xiao, Saman Zonouz, and Bo Yuan. 2024.
<span>“GraphMP: Graph Neural Network-Based Motion Planning with
Efficient Graph Search.”</span> In <em>Proceedings of the 37th
International Conference on Neural Information Processing Systems</em>.
NIPS ’23. Red Hook, NY, USA: Curran Associates Inc.
</div>
</div>
</body>
</html>