Cute math trick and some tips for proofs

July 14, 2019   



I had the privilege of attending ICML 2019 with some colleagues last month, and I’ve started working through some of the papers that stood out to me. First on the docket: Combating Label Noise in Deep Learning Using Abstention. Key idea: when classifying among k classes, add a (k+1)-th category, define that extra class as “choosing to abstain”, and then train your model to abstain explicitly rather than deriving abstention after the fact from classification scores.
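To make that concrete, here's a minimal sketch of the setup as I understand it (my own toy example, not the paper's code; the layer sizes and class count are placeholders): the only structural change to a k-way classifier is one extra output logit, whose index is treated as "abstain".

import torch.nn as nn

NUM_CLASSES = 10                      # k "real" classes (placeholder value)

# Any classifier works; the only structural change is one extra output logit.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, NUM_CLASSES + 1),  # k + 1 logits: k classes plus "abstain" at the last index
)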

The trick

I was hitting some issues with the very first formula in the paper (I’m new to research if that wasn’t obvious), and I wanted to see how the authors coded it up.

The formula

$$\mathcal{L}(x_j) = (1 - p_{k+1})\left(-\sum_{i=1}^{k} t_i \log \frac{p_i}{1 - p_{k+1}}\right) + \alpha \log \frac{1}{1 - p_{k+1}}$$

Gets translated to

loss = (1. - p_out_abstain)*h_c - \
       self.alpha_var*torch.log(1. - p_out_abstain)

This seems okay (the rightmost term shows up as $-\alpha \log(1 - p_{k+1})$ via $\log\frac{1}{x} = -\log x$, but whatever), but what is h_c?

h_c = F.cross_entropy(input_batch[:,0:-1],target_batch,reduce=False)

Huh? What happened to $\log \frac{p_i}{1 - p_{k+1}}$? It took me a few tries to convince myself, but these are actually equivalent.
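Here's the sort of numerical check that convinced me, as a sketch with made-up tensor sizes (only the slicing and the F.cross_entropy call mirror the repo; I'm using the modern reduction='none' in place of reduce=False): compute $-\log\frac{p_t}{1 - p_{k+1}}$ from the full $(k+1)$-way softmax and compare it with cross-entropy over the first $k$ logits.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
k, batch_size = 10, 4                       # placeholder sizes
logits = torch.randn(batch_size, k + 1)     # k class logits plus one abstain logit
targets = torch.randint(0, k, (batch_size,))

# Route 1: the formula as written -- renormalize the first k probabilities
# by (1 - p_{k+1}) and take the negative log of the target's probability.
p = F.softmax(logits, dim=1)
p_abstain = p[:, -1]
renormalized = p[:, :-1] / (1. - p_abstain).unsqueeze(1)
route_1 = -torch.log(renormalized[torch.arange(batch_size), targets])

# Route 2: the repo's shortcut -- cross-entropy over the first k logits only.
route_2 = F.cross_entropy(logits[:, :-1], targets, reduction='none')

print(torch.allclose(route_1, route_2))     # True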

Even after I was convinced, I had to chew on the proof for a bit. I revisited the problem after a long weekend off, and it’s pretty slick.

$$
\begin{aligned}
\frac{p_i}{1 - p_{k+1}} &= \frac{e^{v_i}}{\sum_{j=1}^{k+1} e^{v_j}} \div (1 - p_{k+1}) && \text{given output } v\text{, softmax function} \\
&= \frac{e^{v_i}}{\sum_{j=1}^{k+1} e^{v_j}} \cdot \frac{1}{1 - p_{k+1}} \\
&= \frac{e^{v_i}}{\sum_{j=1}^{k+1} e^{v_j}} \cdot \frac{1}{\sum_{j=1}^{k} p_j} && (1) \\
&= \frac{e^{v_i}}{\sum_{j=1}^{k+1} e^{v_j}} \cdot \frac{1}{\sum_{j=1}^{k} \frac{e^{v_j}}{\sum_{l=1}^{k+1} e^{v_l}}} \\
&= \frac{e^{v_i}}{\sum_{j=1}^{k+1} e^{v_j}} \cdot \frac{\sum_{j=1}^{k+1} e^{v_j}}{\sum_{j=1}^{k} e^{v_j}} \\
&= \frac{e^{v_i}}{\sum_{j=1}^{k} e^{v_j}} \\
&= p^{(k)}_i && \text{by softmax definition, over the first } k \text{ outputs only}
\end{aligned}
$$

This means that

$$\texttt{h\_c} = -\sum_{i=1}^{k} t_i \log \frac{p_i}{1 - p_{k+1}} = -\sum_{i=1}^{k} t_i \log p^{(k)}_i = \texttt{cross\_entropy}(v_{1:k}, t) \qquad \text{for outputs, targets } v, t$$

Pretty cool stuff, but definitely deserved a comment!
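Since it deserved a comment, here's roughly how I'd annotate that computation; this is my own sketch with placeholder names (logits, targets, alpha), not the repo's code.

import torch
import torch.nn.functional as F

def abstention_loss(logits, targets, alpha):
    """Sketch of the abstention loss for (k+1)-way logits, where the last index means "abstain"."""
    # p_{k+1}: the probability mass assigned to the abstain class.
    p_abstain = F.softmax(logits, dim=1)[:, -1]

    # Cross-entropy over the first k logits only. Because softmax over v_1..v_k
    # equals p_i / (1 - p_{k+1}), this is exactly -sum_i t_i log(p_i / (1 - p_{k+1})).
    h_c = F.cross_entropy(logits[:, :-1], targets, reduction='none')

    # (1 - p_{k+1}) * h_c + alpha * log(1 / (1 - p_{k+1})); the second term is
    # written as -alpha * log(1 - p_{k+1}) via log(1/x) = -log(x).
    return ((1. - p_abstain) * h_c - alpha * torch.log(1. - p_abstain)).mean()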

The tips

As mentioned above, it’s been a few years since I tried a proof (and to be honest, I don’t think I’ve ever successfully done this much work with series - apologies to my Calc II teacher). Here are some tips for future me the next time I try this; maybe someone else will find them useful, too.

  • Demonstrate, then prove. When I originally saw h_c, I assumed it was a typo or a mistake in the calculation. It wasn’t until I’d shown for myself that the trick works in practice that I could approach proving why it worked.
  • A little understanding goes a long way. I spent many hours on this little proof. If I had to pick a “eureka” moment, it’d definitely be understanding what $1 - p_{k+1}$ represents (see (1) in the proof). It wasn’t until I asked myself out loud, “Well, what is this?” and answered, “The remaining probability mass, i.e. the sum of the first k probabilities,” that I saw a path through the proof.
  • Don’t stress subscripts at first; just focus on what’s a scalar and what’s a vector. This might be a little too specific, but I found labeling the dimension of each term immensely helpful. My first work-through of this proof eschewed subscripts entirely. Instead, I just noted whether each value was a scalar or a vector. I haven’t used many vector-valued functions in proofs, and this explicit labeling approach helped me a lot more than pining for intuition. (Note: this technique was especially helpful, I think, because $p_i$ is overloaded in the paper - see my introduction of $p^{(k)}_i$ in the proof.)
If you're reading this, you might like the Recurse Center! Ask me about it if you want to hear more :)