Log Odds Ratio Informative Dirichlet Prior
The “log odds ratio informative dirichlet prior” of method of Monroe, et al. (2008) is proposed to find words that are statistically overrepresented in a particular category of documents compared to another.
This method modifies the commonly used log-odds ratio in two ways: it uses teh z-scores of the log-odds-ratio, which controls for the amount of variance in a word’s frequencey , and it uses counts from a background corpus to provide a prior count for words, essentially shrinking the counts toward to the prior frequency in a large backgroud corpus. These features enable differences even in a very frequent words to be detected; previous linguistic methods used to discover word associateion (mutual information (Church and Hnks, 1990), log likelihood ratio (Dunning, 1993), t-test (Manning and Schutze, 1999), and chi-squre (Yang and Pederson, 1997)) have all bad problems with frequent words. Because function words like pronouns and auxiliary verbs are both extremely frequent and have been shown to be important cues to social and narrative meaning, this is a major limitation of these methods, and one of the reasons the Monrose, et al. (2008) method was chosen.