The maximum expected F-score decoder gives you different classification quality than simply adjusting the threshold parameter in your classifier. Or rather, you don’t know what the optimal threshold on test data should be without knowledge of the true labels. Lowering your acceptance probability does buy you higher recall, but since you sacrifice precision along the way, you generally don’t know how low you have to go to maximize F-score. You could calibrate that threshold on held-out data for which you have labels, but the calibrated value won’t carry over to unseen data whose class distribution differs substantially from your held-out data. The decoder, on the other hand, considers all relevant ways of labeling the unseen data and optimizes the expected F-score (under a fixed model, in my naive implementation) on those data. I’ll post some of the slides that go with the talk for my ACL paper (which I never gave), which show results on standard UCI datasets.
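To make the idea concrete, here is a minimal sketch (my own toy code, not the implementation behind the paper, and the function names are mine): if you treat the true labels as independent Bernoulli draws with the model’s posterior probabilities, the expected F1 of any candidate labeling can be computed exactly with a small dynamic program over the Poisson-binomial counts, and the decoder just searches over the top-k labelings by posterior and keeps the best one.

```python
def count_dist(ps):
    """Poisson-binomial pmf: dist[k] = P(k successes) for independent
    Bernoulli variables with probabilities ps, built by a simple DP."""
    dist = [1.0]
    for p in ps:
        new = [0.0] * (len(dist) + 1)
        for k, v in enumerate(dist):
            new[k] += v * (1.0 - p)
            new[k + 1] += v * p
        dist = new
    return dist

def expected_f1(probs, selected):
    """Exact E[F1] of a fixed labeling `selected` (True = predicted
    positive), assuming independent Bernoulli true labels with
    posteriors `probs`."""
    sel = [p for p, s in zip(probs, selected) if s]
    uns = [p for p, s in zip(probs, selected) if not s]
    n_pred = len(sel)  # predicted positives (deterministic)
    ef = 0.0
    for a, pa in enumerate(count_dist(sel)):      # true positives hit
        for b, pb in enumerate(count_dist(uns)):  # true positives missed
            total = n_pred + a + b  # |predicted| + |true positives|
            if total > 0:
                ef += pa * pb * 2.0 * a / total
            else:
                ef += pa * pb  # empty/empty: F1 = 1 by convention
    return ef

def mef_decode(probs):
    """Search over 'top-k by posterior' labelings and return the one
    maximizing exact expected F1."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    best_ef, best_sel = -1.0, None
    for k in range(len(probs) + 1):
        sel = [False] * len(probs)
        for i in order[:k]:
            sel[i] = True
        ef = expected_f1(probs, sel)
        if ef > best_ef:
            best_ef, best_sel = ef, sel
    return best_sel, best_ef
```

With posteriors like `[0.9, 0.8, 0.1]`, the decoder accepts the first two items: expected F1 for that labeling beats both the stricter and looser cutoffs, which is exactly the point — no threshold was chosen in advance.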

Finally, you’re absolutely right that “logistic regression” is a red herring (my words). The general optimization techniques work (at a minimum) for any probabilistic classification model that can be trained using gradient-based maximum likelihood. This, too, is much clearer in my 2007 paper, which drops any reference to “logistic regression” other than as a concrete example.
