Last Edit: 2020-02-22

# Independent Study on Modern Deep Learning

Misc. Machine Learning Methodologies

This page serves as both research notes and a workspace for my independent study on Multitask Learning and other (meta) machine learning methodologies, advised by Pratik Chaudhari at the University of Pennsylvania. I'll be adding interesting code snippets/results here as well as a brief literature review of various research papers.

Disclaimer: the "Modern Deep Learning" name is meant to be a joke, since the focus of this independent study is on cool/interesting trends in deep learning research.

## ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Rajbhandari et al. 2020

This paper introduces a novel optimizer named the Zero Redundancy Optimizer (ZeRO), which aims to make it feasible and efficient to train model architectures that were previously out of reach due to memory limitations. Instead of the standard approach of replicating model states across the cluster, ZeRO partitions them across the data-parallel workers. Memory analysis shows that the optimizer can train a one trillion parameter model on 1024 GPUs with data parallelism degree $N_d = 1024$.
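The memory analysis reduces to simple arithmetic. For mixed-precision Adam training, the paper counts roughly 16 bytes per parameter of model state (2 for fp16 parameters, 2 for fp16 gradients, and 12 for the fp32 master copy, momentum, and variance); full ZeRO-DP partitioning divides this by $N_d$. The helper below is my own illustrative sketch of that calculation, not code from the paper:

```python
def per_gpu_memory_gb(num_params, n_d=1, partition=False):
    """Rough per-GPU memory (GiB) for model states under mixed-precision Adam.

    fp16 parameters (2 bytes) + fp16 gradients (2 bytes) +
    fp32 optimizer states: master params, momentum, variance (12 bytes)
    = 16 bytes per parameter in total.
    """
    total_bytes = 16 * num_params
    if partition:
        # ZeRO partitions all model states across the data-parallel group,
        # so each of the n_d GPUs holds roughly a 1/n_d slice.
        total_bytes /= n_d
    return total_bytes / 1024**3

# Standard data parallelism replicates the full 16 bytes/param on every GPU:
baseline = per_gpu_memory_gb(1e12)  # ~14.6 TiB per GPU, clearly infeasible
# With partitioning over 1024 GPUs, the per-GPU footprint drops 1024x:
sharded = per_gpu_memory_gb(1e12, n_d=1024, partition=True)  # ~14.6 GiB per GPU
```

The per-GPU model-state footprint falls from terabytes to something that fits on a single modern accelerator, which is the core of the trillion-parameter claim.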

Thanks, Microsoft!

## Exploring Randomly Wired Neural Networks for Image Recognition

Xie et al. (2019)

This paper explores different neural network architectures by generating random neural network wirings. This is done by defining a stochastic network generator that encapsulates Neural Architecture Search, then using classical random graph algorithms to wire the networks. The authors show that the generated networks have competitive performance on the ImageNet task.

### Network Generator

Network Generators define a family of possible wiring patterns. Network architectures can thereby be sampled according to a probability distribution, which is differentiably learnable.

Formally, the generator is a mapping $g: \Theta \rightarrow N$ where $\Theta$ is the parameter space and $N$ is the space of neural network architectures. As such, $g$ determines how the computational graph representing the neural network is wired. The given parameters $\theta \in \Theta$ specify meta-information about the network such as the number of layers, activation types, etc. The output of $g$ is symbolic: it doesn't return the weights of the network (which can be learned through standard differentiable training) but instead a representation of the network (e.g. flow of data and types of operations).

The network generator $g$ can be extended to include an additional argument $s$, which acts as a stochastic seed. Then, the generator $g(s, \theta)$ can be repeatedly called to generate a pseudo-random family of architectures.
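A minimal sketch of this interface (the function and parameter names are mine, not the paper's): the generator maps a seed and meta-parameters to a symbolic graph, while weights would be learned separately by ordinary training.

```python
import random

def network_generator(seed, theta):
    """Toy stochastic network generator g(s, theta).

    Returns a symbolic representation of an architecture (nodes and edges),
    not trained weights. `theta` carries meta-information such as the node
    count and edge probability.
    """
    rng = random.Random(seed)  # the seed s makes generation reproducible
    n, p = theta["num_nodes"], theta["edge_prob"]
    # Only allow edges i -> j with i < j so the computation graph is acyclic.
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < p]
    return {"nodes": list(range(n)), "edges": edges}

# The same (s, theta) pair always yields the same architecture:
g1 = network_generator(0, {"num_nodes": 8, "edge_prob": 0.3})
g2 = network_generator(0, {"num_nodes": 8, "edge_prob": 0.3})
assert g1 == g2
```

Calling it with seeds 0, 1, 2, ... generates the pseudo-random family of architectures described above.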

### Graphs to Neural Networks

The network generator produces a general graph: a set of nodes together with a set of edges connecting them. This general representation does not specify how the graph corresponds to a neural network; that mapping is a later post-processing step. The non-restrictiveness of the general graph allows the use of classical graph generation techniques from graph theory. In particular, the authors experiment with Erdős–Rényi (ER), Barabási–Albert (BA), and Watts–Strogatz (WS) models of graph generation.
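To make the three classical models concrete, here are heavily simplified stdlib-only sketches of each (real implementations, e.g. in networkx, handle more edge cases; parameter names are mine):

```python
import random

def erdos_renyi(n, p, rng):
    """ER: include each of the n*(n-1)/2 possible edges independently with prob p."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

def barabasi_albert(n, m, rng):
    """BA: grow the graph; each new node attaches to m existing nodes,
    preferring nodes of high degree (preferential attachment)."""
    edges = []
    targets = list(range(m))  # start from m initial nodes
    repeated = []             # each node appears once per incident edge
    for new in range(m, n):
        edges += [(t, new) for t in targets]
        repeated.extend(targets)
        repeated.extend([new] * m)
        # Sampling uniformly from `repeated` weights nodes by their degree.
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(repeated))
        targets = list(targets)
    return edges

def watts_strogatz(n, k, p, rng):
    """WS: ring lattice (each node linked to its k nearest neighbours),
    with each edge rewired to a random node with probability p."""
    edges = set()
    for i in range(n):
        for d in range(1, k // 2 + 1):
            j = (i + d) % n
            if rng.random() < p:  # rewire this edge to a random other node
                j = rng.choice([c for c in range(n) if c != i])
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)
```

ER gives uniformly random wiring, BA gives hub-dominated "scale-free" wiring, and WS interpolates between regular and random wiring, which is exactly the diversity of wiring priors the paper wants to compare.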

Edges are defined to represent data flow (i.e. sending a tensor of data from one node to another), and each node performs one of three kinds of operation:

1. Aggregation (e.g. weighted sum)
2. Transformation (e.g. non-linearity)
3. Distribution (e.g. copying data)
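A single node's forward pass can be sketched as aggregate, then transform, then distribute. This is my own simplified stand-in (the paper's nodes use a weighted sum with sigmoid-gated learnable weights followed by a ReLU-conv-BN triplet; here a bare ReLU on plain lists stands in for the conv):

```python
import math

def node_forward(inputs, weights, fan_out):
    """One random-wiring node: aggregate -> transform -> distribute.

    `inputs` are the tensors (here, plain lists) arriving on the in-edges.
    """
    # 1. Aggregation: weighted sum of the incoming tensors, with learnable
    #    weights squashed into (0, 1) by a sigmoid.
    gates = [1 / (1 + math.exp(-w)) for w in weights]
    agg = [sum(g * x[i] for g, x in zip(gates, inputs))
           for i in range(len(inputs[0]))]
    # 2. Transformation: a non-linearity (ReLU here, standing in for the
    #    paper's ReLU-conv-BN block).
    out = [max(0.0, v) for v in agg]
    # 3. Distribution: the same output tensor is copied onto every out-edge.
    return [list(out) for _ in range(fan_out)]

# Two in-edges with zero-initialised weights (sigmoid(0) = 0.5 each),
# distributing the result to two out-edges:
copies = node_forward([[1.0, -2.0], [3.0, 4.0]], [0.0, 0.0], fan_out=2)
# copies == [[2.0, 1.0], [2.0, 1.0]]
```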

### Experiments

For each generator, the authors sample 5 instances (generated by 5 random seeds), and train them from scratch. Networks are trained for roughly 100 epochs, using a half-period-cosine learning rate decay from an initial learning rate of 0.1 with a momentum of 0.9.
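The half-period cosine schedule mentioned above is just the first half of a cosine wave, taking the learning rate smoothly from its initial value to zero over training. A quick sketch (argument names are mine):

```python
import math

def half_period_cosine_lr(step, total_steps, base_lr=0.1):
    """Half-period cosine decay: base_lr at step 0, 0 at the final step."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

# base_lr at the start, half of it at the midpoint, zero at the end:
half_period_cosine_lr(0, 100)    # 0.1
half_period_cosine_lr(50, 100)   # ~0.05
half_period_cosine_lr(100, 100)  # ~0.0
```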

The authors note that every random generator yields decent accuracy. Furthermore, the variation among the random network instances is rather low, with a standard deviation in the range of 0.2% to 0.4%.

## Compositional Attention Networks for Machine Reasoning

Hudson and Manning, 2018

The authors design a novel fully differentiable neural network architecture that is capable of explicit and expressive reasoning. One primary goal of the paper is interpretability, without sacrificing the predictive performance of black box methods. Problems are decomposed into attention-based steps, and are solved using Memory, Attention, and Composition (MAC) sub-units. On the CLEVR dataset for visual reasoning, the model accomplishes a state-of-the-art 98.9% accuracy, using less data than competing models.
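The recurring MAC cell decomposes each reasoning step into a control unit (which attends over the question to pick the current sub-task), a read unit (which attends over the knowledge base, guided by the control and current memory), and a write unit (which integrates the retrieved information into memory). The sketch below is a heavily simplified dot-product-attention version of that loop, not the paper's exact parameterisation, which uses learned projections throughout:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mac_step(memory, control, question_words, knowledge):
    """One simplified MAC reasoning step over vector-valued states."""
    # Control unit: attend over the question words to pick the sub-task.
    c_att = softmax([dot(control, w) for w in question_words])
    new_control = [sum(a * w[i] for a, w in zip(c_att, question_words))
                   for i in range(len(control))]
    # Read unit: attend over knowledge-base items, guided by the new
    # control state and the current memory.
    r_att = softmax([dot(new_control, k) + dot(memory, k) for k in knowledge])
    retrieved = [sum(a * k[i] for a, k in zip(r_att, knowledge))
                 for i in range(len(memory))]
    # Write unit: integrate the retrieved information into memory
    # (a simple average here; the paper uses a learned update).
    new_memory = [0.5 * (m + r) for m, r in zip(memory, retrieved)]
    return new_memory, new_control
```

Because every operation is attention or a weighted average, the per-step attention maps can be inspected directly, which is where the interpretability comes from; stacking several such cells gives the multi-step reasoning chain.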

## Fine-tuning Transformer Models

Easy usage can be done through the GPT-2 Simple package by Max Woolf (https://github.com/minimaxir/gpt-2-simple).

Install using `pip3 install gpt-2-simple` and provide text for fine-tuning:

```python
import os
import gpt_2_simple as gpt2

model_name = "124M"

# download the pretrained model weights (only needed once)
if not os.path.isdir(os.path.join("models", model_name)):
    gpt2.download_gpt2(model_name=model_name)

# provide a text file for fine-tuning
file_name = "shakespeare.txt"

# start a fine-tuning TensorFlow session
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, file_name, model_name=model_name, steps=1000)

# generate text from the fine-tuned model
gpt2.generate(sess)
```

I fine-tuned a GPT-2 model on a corpus of Barack Obama tweets I put together. A sample of the generated output:

> We have a clear goal: Ending the use of force in Afghanistan as quickly
> as possible. That means giving Congress more time to figure out how to
> make that happen. And doing so is the single most effective way forward.
>
> The Afghan people deserve better. They and I are foot soldiers for them.
> We're going to use all our might to get that goal accomplished.
> But America is not going to give ourselves up for expedience's sake.

Wow! Thanks, Obama for the big policy change!

Lately, I've also been playing with Style Transfer GANs by exploring abstract artwork from contemporary artists:

(this is based off of pieces from Marc Chagall)

(this is based off pieces from Jerret Lee)