Putting Back the Stops: Integrating Syntax with Neural Topic Models

Mayank Nagda, Sophie Fellenz

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 6424-6432. https://doi.org/10.24963/ijcai.2024/710

Syntax and semantics are two key concepts for language understanding. Topic models typically represent the semantics of a text corpus, with syntactic information removed during preprocessing. Without this preprocessing, the generated topics become uninterpretable because syntactic words dominate them. To learn interpretable topics while keeping valuable syntactic information, we propose a novel framework that simultaneously learns both syntactic and semantic topics from the corpus without requiring any preprocessing. A context network leverages textual dependencies to distinguish between syntactic and semantic words, while a composite VAE topic model learns two sets of topics. We demonstrate on seven datasets that our proposed method effectively captures both syntactic and semantic representations of a corpus while outperforming state-of-the-art neural topic models and statistical topic models in terms of topic quality.
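
The abstract describes a composite architecture: a context network that separates syntactic from semantic words, feeding a VAE topic model with two topic sets. The following is a minimal, speculative sketch of such a composite design, based only on the abstract; it is not the authors' implementation, and all names (ContextNet, CompositeVAETopicModel, n_sem_topics, n_syn_topics) are hypothetical.

```python
# Hedged sketch of a two-branch (syntactic + semantic) VAE topic model with a
# per-word context gate, inferred only from the abstract's description.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextNet(nn.Module):
    """Predicts, per document, how syntactic each vocabulary word is (hypothetical)."""
    def __init__(self, vocab_size, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU(),
                                 nn.Linear(hidden, vocab_size))

    def forward(self, bow):
        # gate[i, v] in (0, 1): weight of the syntactic branch for word v in doc i
        return torch.sigmoid(self.net(bow))

class CompositeVAETopicModel(nn.Module):
    """VAE with two topic-word matrices: one semantic, one syntactic (hypothetical)."""
    def __init__(self, vocab_size, n_sem_topics=50, n_syn_topics=10, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        k = n_sem_topics + n_syn_topics
        self.mu, self.logvar = nn.Linear(hidden, k), nn.Linear(hidden, k)
        self.sem_topics = nn.Linear(n_sem_topics, vocab_size, bias=False)
        self.syn_topics = nn.Linear(n_syn_topics, vocab_size, bias=False)
        self.n_sem = n_sem_topics
        self.context = ContextNet(vocab_size)

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        theta_sem = F.softmax(z[:, :self.n_sem], dim=-1)   # semantic topic proportions
        theta_syn = F.softmax(z[:, self.n_sem:], dim=-1)   # syntactic topic proportions
        gate = self.context(bow)                            # per-word syntactic weight
        recon = (1 - gate) * F.softmax(self.sem_topics(theta_sem), dim=-1) \
                + gate * F.softmax(self.syn_topics(theta_syn), dim=-1)
        nll = -(bow * torch.log(recon + 1e-10)).sum(-1).mean()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return nll + kl  # standard ELBO-style training objective
```

In this sketch the gate mixes the two word distributions so that function words are explained by syntactic topics and content words by semantic topics; the paper's actual context network and training objective may differ.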
Keywords:
Natural Language Processing: NLP: Interpretability and analysis of models for NLP
Machine Learning: ML: Applications
Machine Learning: ML: Generative models
Natural Language Processing: NLP: Information extraction