Tuesday, July 3, 2007

Corrections to ACL Anthology URLs

Thanks to an alert reader, I found out that several of the paper links in my previous postings on ACL and EMNLP-CoNLL papers were incorrect. The problem was that some of the BibTeX entries on the ACL DVD distributed in Prague have wrong ACL Anthology links, and I derived those postings semi-automatically from the BibTeX entries. I've edited the most recent posting to use the correct links.

Monday, July 2, 2007

Reposting interesting ACL and CoNLL-EMNLP papers

I've now added a short comment to each paper. This list is created semi-automatically from BibDesk with a custom HTML export template and some minor post-editing. The red titles are my special picks.

  1. Frustratingly Easy Domain Adaptation
    H. Daume III
    Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics  256--263  (2007)
    http://www.aclweb.org/anthology/P/P07/P07-1033
    Assumes both source and target labeled data. Instance features are replicated as "feature from source" and "feature from target" (a small sketch follows this list). Results are surprisingly good for such a simple method. Why? It is easy to create a counterexample in which this does not work, so it would be important to characterize precisely when it works.
  2. A Bayesian Model for Discovering Typological Implications
    H. Daume III and L. Campbell
    Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics  65--72  (2007)
    http://www.aclweb.org/anthology/P/P07/P07-1009
    Induces relationships between typological features of languages from very sparse descriptive data. Finds relationships discussed in the comparative literature as well as some others that deserve investigation.
  3. Sparse Information Extraction: Unsupervised Language Models to the Rescue
    D. Downey, S. Schoenmackers, and O. Etzioni
    Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics  696--703  (2007)
    http://www.aclweb.org/anthology/P/P07/P07-1088
    The main problem with previous work on unsupervised extraction, which relies on finding many instances of a putative entity or relationship, is low recall. To address this, the paper builds HMM language models from the contexts of common extractions and uses them to measure the plausibility of rare candidate extractions. Simple idea with good results.
  4. A Comparative Study of Parameter Estimation Methods for Statistical Natural Language Processing
    J. Gao, G. Andrew, M. Johnson, and K. Toutanova
    Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics  824--831  (2007)
    http://www.aclweb.org/anthology/P/P07/P07-1104
    Bottom line: L_1 regularization of logistic regression does not hurt generalization, and it makes the models much smaller (a quick illustration follows this list). Nice to have a careful study that documents the benefits and limitations of L_1 regularization in a range of common text classification tasks.
  5. Unsupervised Coreference Resolution in a Nonparametric Bayesian Model
    A. Haghighi and D. Klein
    Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics  848--855  (2007)
    http://www.aclweb.org/anthology/P/P07/P07-1107
    A very nice result beautifully presented. The "magic" of Dirichlet processes yields an unsupervised generative model of coreference that competes with supervised methods and can naturally incorporate a discourse model.
  6. K-best Spanning Tree Parsing
    K. Hall
    Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics  392--399  (2007)
    http://www.aclweb.org/anthology/P/P07/P07-1050
    Digging into old directed spanning tree literature continues to bear fruit for dependency parsing, this time a k-best algorithm that can be used for reranking with global features. I have my reservations about reranking, but this is a good addition to the dependency parsing toolbox.
  7. Forest Rescoring: Faster Decoding with Integrated Language Models
    L. Huang and D. Chiang
    Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics  144--151  (2007)
    http://www.aclweb.org/anthology/P/P07/P07-1019
    I like this approach much better than reranking: evaluate global features as soon as possible and add their score to the local feature score in a dynamic programming parser or decoder, to efficiently produce an approximate set of k-best partial hypotheses. No such method for spanning tree dependency parsers, though... Liang gave a very clear talk.
  8. Exploiting Wikipedia as External Knowledge for Named Entity Recognition
    J. Kazama and K. Torisawa
    Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)  698--707  (2007)
    http://www.aclweb.org/anthology/D/D07/D07-1073
    The basic idea is simple and effective. For each entity described in Wikipedia, find a defining sentence, extract from it heuristically a noun that is likely to be the entity's category, and add that as a "label" feature to the other features in a CRF extractor (a toy version of the heuristic follows this list). The details are a bit complicated, but the accuracy improvements make it very worthwhile.
  9. Structured Prediction Models via the Matrix-Tree Theorem
    T. Koo, A. Globerson, X. Carreras, and M. Collins
    Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)  141--150  (2007)
    http://www.aclweb.org/anthology/D/D07/D07-1015
    Several groups discovered concurrently that Tutte's matrix-tree theorem would yield an efficient computation of the normalization for log-linear models of non-projective dependencies (a small sketch of that computation follows this list). There were three papers on different aspects of this in Prague: one at IWPT, which I didn't see, and two at EMNLP. I selected this one because it shows how to cast several learning methods (log-linear and max-margin) into a common framework with very good results. The talk by Terry Koo was clear and convincing.
  10. Mildly Context-Sensitive Dependency Languages
    M. Kuhlmann and M. M\"ohl
    Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics  160--167  (2007)
    http://www.aclweb.org/anthology/P/P07/P07-1021
    The complexity of parsing dependency grammars is related to formal measures of their degree of nonprojectivity, following an approach first introduced for mildly context-sensitive grammars. Dependency grammar is a formal island no longer.
  11. The Infinite PCFG Using Hierarchical Dirichlet Processes
    P. Liang, S. Petrov, M. Jordan, and D. Klein
    Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)  688--697  (2007)
    http://www.aclweb.org/anthology/D/D07/D07-1072
    Another idea that was in the air got two papers in Prague: hierarchical Dirichlet processes for unsupervised PCFG induction. I liked this paper better, as Percy Liang gave a beautifully clear exposition of a method that was pretty opaque in most previous presentations of related work. It must have helped that Percy and Dan Klein had given a tutorial on Bayesian nonparametric models a few days before.
  12. Characterizing the Errors of Data-Driven Dependency Parsing Models
    R. McDonald and J. Nivre
    Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)  122--131  (2007)
    http://www.aclweb.org/anthology/D/D07/D07-1013
    It was rather intriguing at last year's CoNLL evaluation of multilingual dependency parsing that the two top parsers (our MSTParser and Nivre's MaltParser) had overall scores that were statistically indistinguishable, even though they are very different in design. This paper explains the results: MaltParser's greedy deterministic method can use more context and works best on shorter sentences, but greed hurts it on longer sentences; MSTParser uses just local features, so it suffers on shorter sentences, but optimal search makes it do better on longer ones. How can we combine these benefits? I know, I know, parser combination in the Sagae and Lavie mold can do it, but I'd prefer something more integrated.
  13. Structured Models for Fine-to-Coarse Sentiment Analysis
    R. McDonald, K. Hannan, T. Neylon, M. Wells, and J. Reynar
    Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics  432--439  (2007)
    http://www.aclweb.org/anthology/P/P07/P07-1055
    Combines document-level and sentence-level sentiment classification into a simple, easy to train structured model. Outperforms previous methods significantly at the sentence level, and does competitively at the document level.
  14. Learning Structured Models for Phone Recognition
    S. Petrov, A. Pauls, and D. Klein
    Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)  897--905  (2007)
    http://www.aclweb.org/anthology/D/D07/D07-1094
    Learn the structure of phone models from the data, rather than postulating a fixed structure in advance. Exploit it to represent context dependency concisely. Great paper, excellently presented. I tried to convince some speech colleagues that this could be done over ten years ago, but they were skeptical. It was probably too early, and this paper does it with way better methods than I had then.
  15. Guided Learning for Bidirectional Sequence Classification
    L. Shen, G. Satta, and A. Joshi
    Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics  760--767  (2007)
    http://www.aclweb.org/anthology/P/P07/P07-1096
    Learn a linear sequence model for a problem where exhaustive search is not possible by starting from high-confidence labels and learning which actions to apply to extend the high-confidence regions to a full labeling of the sequence. Best Penn Treebank POS tagging results ever, and the method applies easily to other tagging and parsing problems. Who needs reranking now?
  16. Semi-Supervised Structured Output Learning Based on a Hybrid Generative and Discriminative Approach
    J. Suzuki, A. Fujino, and H. Isozaki
    Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)  791--800  (2007)
    http://www.aclweb.org/anthology/D/D07/D07-1083
    I don't understand this paper fully yet -- the notation and presentation are pretty dense -- but the idea of jointly learning a CRF from labeled data and HMMs over the same state space from unlabeled data is an intriguing approach to semi-supervised CRF training. One of my post-conference homework items is to figure out how this does (or does not) relate to ASO. There are lots of other possible connections, such as Pal and McCallum's multiconditional models.
  17. Randomised Language Modelling for Statistical Machine Translation
    D. Talbot and M. Osborne
    Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics  512--519  (2007)
    http://www.aclweb.org/anthology/P/P07/P07-1065
    I can't quite evaluate this, since I've not been working on large language models recently, but there's something deliciously perverse about using randomized hashing to throw away the parts of a big n-gram model that don't really matter in practice (a bare-bones sketch of the underlying Bloom-filter idea follows this list).
  18. Online Learning of Relaxed CCG Grammars for Parsing to Logical Form
    L. Zettlemoyer and M. Collins
    Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)  678--687  (2007)
    http://www.aclweb.org/anthology/D/D07/D07-1071
    Given a training set of sentence-meaning pairs, a CCG-based lexicon induction process discovers potential word meanings and category assignments, and a ranking of alternatives, so that the given training set is correctly analyzed and interpreted. I love this connection between online learning, categorial grammars, and logical semantics, and I think there's a rich vein to explore here.
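
A few code sketches to make some of these ideas concrete; all of them are toy illustrations in Python, with names and data of my own invention, not the papers' code. First, the feature augmentation in Daume's "frustratingly easy" method (item 1): every instance keeps a shared copy of its features plus a copy specific to its domain.

import numpy as np

def augment(x, domain):
    """Map a feature vector x to [shared copy; source-only copy; target-only copy]."""
    zeros = np.zeros_like(x)
    if domain == "source":
        return np.concatenate([x, x, zeros])
    if domain == "target":
        return np.concatenate([x, zeros, x])
    raise ValueError(domain)

# Toy usage with two-dimensional instances.
print(augment(np.array([1.0, 0.0]), "source"))  # [1. 0. 1. 0. 0. 0.]
print(augment(np.array([0.0, 1.0]), "target"))  # [0. 1. 0. 0. 0. 1.]

Any standard classifier trained on the augmented vectors can then learn shared weights plus domain-specific corrections, which is all the method asks for.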
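
Second, the sparsity effect of L_1 regularization studied in item 4. This quick comparison uses scikit-learn on synthetic data rather than the estimation methods evaluated in the paper, so it only demonstrates the qualitative point: similar training fit, far fewer nonzero weights.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=200,
                           n_informative=10, random_state=0)
l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=2000).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

print("nonzero weights  L2:", int(np.sum(np.abs(l2.coef_) > 1e-6)),
      " L1:", int(np.sum(np.abs(l1.coef_) > 1e-6)))
print("training accuracy  L2:", l2.score(X, y), " L1:", l1.score(X, y))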
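
Third, a toy version of the Wikipedia category heuristic of item 8: pull a crude category noun out of a defining sentence, to be handed to the tagger as one more feature. The paper's heuristics are more careful than this regular expression, which is only meant to convey the idea.

import re

def category_label(defining_sentence):
    """Return a crude category noun (head of the 'is a ...' phrase), or None."""
    m = re.search(r"\b(?:is|was)\s+(?:an|a|the)\s+([\w ]+?)"
                  r"(?=\s+(?:of|in|for|that|who|which)\b|[.,;])",
                  defining_sentence, flags=re.IGNORECASE)
    return m.group(1).split()[-1].lower() if m else None

print(category_label("Prague is the capital of the Czech Republic."))  # capital
print(category_label("Fernando Pereira is a computer scientist."))     # scientist
# In the paper's setup the label becomes an extra feature (something like
# "WIKI=scientist") on the matching token span in a CRF named-entity tagger.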
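
Fourth, the determinant computation at the heart of item 9. The directed matrix-tree theorem turns the sum over all non-projective dependency structures into the determinant of a minor of a weighted Laplacian. The sketch below is the multi-root variant on random scores; the paper also derives a single-root version and the gradients needed for training.

import numpy as np

def log_partition(scores):
    """scores[h][m] = s(h -> m), with index 0 the artificial root.
    Returns the log of the sum, over all spanning arborescences rooted
    at 0, of exp(total edge score)."""
    w = np.exp(scores)
    np.fill_diagonal(w, 0.0)          # no self-loops
    w[:, 0] = 0.0                     # nothing points at the root
    L = np.diag(w.sum(axis=0)) - w    # weighted in-degree Laplacian
    _, logdet = np.linalg.slogdet(L[1:, 1:])  # delete root row and column
    return logdet

# Toy 3-word sentence (indices 1..3) plus the root at index 0.
scores = np.random.default_rng(0).normal(size=(4, 4))
print("log Z =", log_partition(scores))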
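
Finally, a bare-bones Bloom filter over n-grams, the data structure behind the randomised language models of item 17. The paper's scheme is considerably more refined (it stores quantized counts and analyses the one-sided error), but even this toy shows the trade: no false negatives, false positives at a small tunable rate, and a fraction of the space of an exact set.

import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, key):
        for i in range(self.n_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add("in the beginning")        # store a trigram
print("in the beginning" in bf)   # True: never a false negative
print("the spanning tree" in bf)  # almost surely False; false positives
                                  # occur with small, tunable probability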

Sunday, July 1, 2007

Interesting papers at ACL and EMNLP-CoNLL

I just got back from ACL and EMNLP-CoNLL in Prague. There were many interesting papers, more than I could attend because of session conflicts. Here are some that I found especially worthwhile. Those highlighted in red really stood out. I'll add comments on some of these papers later, but I don't have time now.

Monday, June 18, 2007

Data Catalysis

Mark Liberman recommended Patrick Pantel's Data Catalysis (via Language Log).

Saturday, June 16, 2007

Frustratingly Hard Domain Adaptation for Parsing

@inproceedings{Dredze07Frustratingly,
author = {Mark Dredze and John Blitzer and Partha Pratim Talukdar and Kuzman Ganchev and Joao Graca and Fernando Pereira},
title = {Frustratingly Hard Domain Adaptation for Parsing},
booktitle = "Conference on Natural Language Learning",
address = "Prague, Czech Republic",
year = "2007"
}

We will be presenting our struggles with domain adaptation for parsing at CoNLL the week after next in Prague.

Biographies, Bollywood, Boom-boxes, and Blenders: Domain Adaptation for Sentiment Classification

@inproceedings{Blitzer07Biographies,
author = {John Blitzer and Mark Dredze and Fernando Pereira},
title = {Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification},
booktitle = "Association for Computational Linguistics",
address = "Prague, Czech Republic",
year = "2007"
}

We will be presenting this at ACL the week after next.

Occam's Hammer

John Langford blogged about Occam's Hammer, which I've started reading. I agree with John that this is an interesting new way of proving tight generalization bounds, which is on my mind because of some papers we submitted for publication recently.