
title: 6.7800 Project
author: Anthony Wang

In part 1 of this project, we used Python, MCMC, and bigram frequencies to decode a substitution cipher. However, this approach is slow (even with PyPy it's still too slow) and makes mistakes like "qust" instead of "just", since "j" and "q" are both so rare that bigram statistics can barely tell them apart. The solution?

Rust ❤️ large small language models! Hooray for buzzwords!

The basic idea is that language models are like n-gram tables but cooler. Specifically, we'd like to have a model that given a string with a missing character, can predict the probability distribution of which character should fill in that gap. For instance, if we give it "a monad is _ust a monoid in", it should predict "j". This enables us to catch and fix mistakes in a nearly correctly decoded ciphertext.

Architecture

  1. Embedding layer to 12 dimensions
  2. 12x33 convolution with 512 output channels, with the center stripe (the target character's position) masked
  3. ReLU
  4. 512x512 FC layer
  5. ReLU
  6. 512x512 FC layer
  7. ReLU
  8. 512x28 FC layer
  9. Softmax

Since we'd like to use information from both sides of the gap when predicting the target distribution, this model is not autoregressive and can't be used for text generation.

Training

I implemented this architecture using tch-rs (Rust bindings to PyTorch). The tricky part is that we need to mask the character that the model is trying to predict, to prevent it from cheating. My implementation applies two convolutions, one to the text on the left of the target character and one to the text on the right, and then adds the results together. I used " " as a padding token (probably a bad idea). For more details, see my code.
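Here's a rough sketch of how this fits together in tch-rs. The layer names, the 16-character half-windows, and the 28-symbol vocabulary are my shorthand for the description above, not a copy of the real code:

```rust
use tch::{nn, Tensor};

const VOCAB: i64 = 28;  // assumed: 26 letters + space + one punctuation symbol
const EMBED: i64 = 12;
const HALF: i64 = 16;   // the 33-wide window minus the masked center, split in two
const HIDDEN: i64 = 512;

struct GapModel {
    embed: nn::Embedding,
    conv_left: nn::Conv1D,
    conv_right: nn::Conv1D,
    fc1: nn::Linear,
    fc2: nn::Linear,
    out: nn::Linear,
}

impl GapModel {
    fn new(p: &nn::Path) -> Self {
        Self {
            embed: nn::embedding(p / "embed", VOCAB, EMBED, Default::default()),
            conv_left: nn::conv1d(p / "conv_l", EMBED, HIDDEN, HALF, Default::default()),
            conv_right: nn::conv1d(p / "conv_r", EMBED, HIDDEN, HALF, Default::default()),
            fc1: nn::linear(p / "fc1", HIDDEN, HIDDEN, Default::default()),
            fc2: nn::linear(p / "fc2", HIDDEN, HIDDEN, Default::default()),
            out: nn::linear(p / "out", HIDDEN, VOCAB, Default::default()),
        }
    }

    /// `left` and `right` are [batch, HALF] tensors of character indices on each side of the gap.
    /// Returns [batch, VOCAB] logits (softmax is folded into the loss during training).
    fn logits(&self, left: &Tensor, right: &Tensor) -> Tensor {
        // Embed, then move the embedding dimension into the channel position: [batch, EMBED, HALF].
        let l = left.apply(&self.embed).transpose(1, 2);
        let r = right.apply(&self.embed).transpose(1, 2);
        // Each convolution collapses its half-window to a single position; summing the two
        // is the "convolution with the center stripe masked".
        let x = (l.apply(&self.conv_left) + r.apply(&self.conv_right)).squeeze_dim(2);
        x.relu().apply(&self.fc1).relu().apply(&self.fc2).relu().apply(&self.out)
    }
}
```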

I trained the model on WikiText-103 for 20 epochs using an AMD Radeon 7900XTX and fine-tuned for an additional epoch on the Gutenberg dataset. (Note to self: for training on ROCm, run using LD_LIBRARY_PATH=.venv/lib/python3.13/site-packages/torch/lib:$LD_LIBRARY_PATH LIBTORCH_USE_PYTORCH=1 cargo r -r.) It achieves a final loss of 0.37, which means it puts $e^{-0.37} \approx 0.69$ probability on the correct character on average.
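A training step is then standard tch-rs boilerplate. This sketch assumes the GapModel above and leaves out the data loading for WikiText-103 and Gutenberg:

```rust
use tch::{nn, nn::OptimizerConfig, Device, Tensor};

/// One optimization step on a batch. `left` and `right` are [batch, HALF] tensors of
/// character indices around each gap, and `target` is the [batch] masked-out character.
fn train_step(model: &GapModel, opt: &mut nn::Optimizer,
              left: &Tensor, right: &Tensor, target: &Tensor) -> f64 {
    let logits = model.logits(left, right);
    // Cross-entropy on the logits; exp(-loss) is the average probability the model
    // puts on the correct character, which is where 0.37 -> 0.69 comes from.
    let loss = logits.cross_entropy_for_logits(target);
    opt.backward_step(&loss);
    loss.double_value(&[])
}

fn main() -> Result<(), tch::TchError> {
    // With a ROCm build of PyTorch, the GPU is exposed through the CUDA API.
    let device = Device::cuda_if_available();
    let vs = nn::VarStore::new(device);
    let model = GapModel::new(&vs.root());
    let mut opt = nn::Adam::default().build(&vs, 1e-3)?;
    // Batching over the corpus omitted; each batch just calls
    // train_step(&model, &mut opt, &left, &right, &target).
    Ok(())
}
```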

Model architecture diagram visualized using Netron

Decoding

For the no-breakpoint decoding, my algorithm first runs MCMC for 10000 iterations using bigram and trigram frequencies (computed from WikiText and Gutenberg) and then, if the decoded text's probability is high enough, switches to MCMC using the language model. The bigram and trigram step is necessary to get an almost-correct answer first, since the language model gets confused when many of the characters are wrong. The language model computes, for each character in the text, a distribution over which character should replace it, and from these I build a 28x28 matrix of how promising each possible swap is. My program repeats this whole process at most 50000 times using blazingly fast fearless concurrency and returns the answer with the lowest loss according to the language model.
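As a sketch of that swap matrix, here's one way to score swaps from the model's per-position distributions. This is an illustration of the idea, not necessarily how the real scoring works; `dists` stands in for the language model's output:

```rust
const ALPHABET: usize = 28;

/// `decoded[i]` is the current character index at position i, and `dists[i][c]` is the
/// model's probability that position i should actually be character c.
/// `score[a][b]` sums the log-probability of b at the positions currently decoded as a;
/// a large score[a][b] + score[b][a] suggests swapping a and b in the decryption key.
fn swap_scores(decoded: &[usize], dists: &[[f64; ALPHABET]]) -> [[f64; ALPHABET]; ALPHABET] {
    let mut score = [[0.0; ALPHABET]; ALPHABET];
    for (i, &a) in decoded.iter().enumerate() {
        for b in 0..ALPHABET {
            // Clamp to avoid ln(0) when the model is very confident against b.
            score[a][b] += dists[i][b].max(1e-9).ln();
        }
    }
    score
}
```

The MCMC step can then propose swaps weighted by these scores instead of picking two letters uniformly at random.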

To deal with breakpoints, my algorithm splits the input in half and decodes each half separately. If the first half was decoded correctly, it tries to decode the second half using the same permutation and uses the language model to detect when the text turns from coherent to gibberish. This narrows down the breakpoint location to within 20 characters, and then my algorithm brute-forces all those locations and picks the best one. Likewise, if the second half was decoded correctly, we repeat the same process. Sometimes it's possible for both halves to decode correctly or almost correctly if the breakpoint is near the middle.
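The coherent-to-gibberish detection boils down to watching the model's per-character loss: average it over a sliding window and report where it first blows up. Something like this, with the window size and threshold being illustrative rather than the actual values:

```rust
/// `losses[i]` is the model's negative log-probability of the character at position i.
/// Returns the approximate position where coherent text turns into gibberish, i.e.
/// where a windowed average of the loss first exceeds `threshold`.
fn find_breakpoint(losses: &[f64], window: usize, threshold: f64) -> Option<usize> {
    losses
        .windows(window)
        .position(|w| w.iter().sum::<f64>() / window as f64 > threshold)
        .map(|i| i + window / 2) // center of the first bad window
}
```

This narrows the breakpoint down to roughly a window's width, after which brute-forcing every remaining candidate position is cheap.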

To summarize, my approach is "repeat the algorithm an absurd number of times" + "fix typos with a CNN" + "lots of submissions to Gradescope to tune parameters".

Submitting on Gradescope

Since the Gradescope grading server doesn't have PyTorch installed, I had to bundle a copy of LibTorch with my submission. Initially I tried statically linking to LibTorch, but PyTorch doesn't distribute prebuilt static builds of LibTorch so I had to compile it myself. Note to self: DO NOT COMPILE PYTORCH! YOU WILL REGRET IT!

For static linking, use an Ubuntu 22.04 container (to avoid glibc version mismatch). Build PyTorch with USE_XNNPACK=OFF USE_MKLDNN=OFF USE_DISTRIBUTED=OFF USE_TENSORPIPE=OFF USE_CUDA=OFF BUILD_SHARED_LIBS=OFF python3 setup.py build, then copy over some random include stuff until the errors go away. Comment out dnnl_graph and XNNPACK in ~/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/torch-sys-0.20.0/build.rs. Then compile a (mostly) static binary using LIBTORCH=/work/pytorch-static/build LIBTORCH_STATIC=1 cargo b -r (musl is too much work) and run upx on the binary so it's not huge.

Surprisingly, I was able to compile PyTorch, but my build didn't have AVX2 support for some reason, so it was stupidly slow (as in 10x slower than the official dynamically linked builds of LibTorch). Instead, I realized it was much easier to just use dynamic linking, track down all the linked libraries, and submit them to Gradescope in one giant tarball. I also had to compile my code in an Ubuntu 22.04 container to avoid glibc version mismatches. Lastly, Gradescope has a 100MB upload limit, so I had to split up the tarball, which decode-cli then reconstructs.

Here is the code I used to generate my submission:

!include mkthingy.sh

Then decode-cli decompresses my files and sets LD_LIBRARY_PATH to use the linked libraries:

!include decode-cli

I also wrote several Python and Lean scripts for cleaning the datasets and other small tasks.