An Evaluation of Topic Models for the Estimation of Unobserved Variables in Structured and Unstructured Documents
  • Onyinye Nweke
  • Collins Udanor
  • George Okereke
Onyinye Nweke
University of Nigeria Nsukka

Corresponding Author: [email protected]



Researchers collecting data often face three challenges: where to find researchable data, what tools and methodologies to use to scrape websites for such data, and how to perform the required analytics and extract insightful knowledge. This study examines whether, and to what extent, user tweets could influence the direction of research, especially in the field of machine learning and artificial intelligence. In this paper, we use the Latent Dirichlet Allocation (LDA) topic modelling technique to discover the popularity of machine learning research topics in an unstructured dataset of 35,860 tweets from 20 artificial intelligence and machine learning related handles, using 7,241 articles from 42 years of Neural Information Processing Systems (NIPS) conference papers, a structured corpus, as a control. Latent Semantic Indexing (LSI) and the Hierarchical Dirichlet Process (HDP) serve as baselines against which the performance of LDA is compared. Two encoding methods, bag of words and term frequency-inverse document frequency (tf-idf), are used to vectorize the corpora and are compared. Results suggest that the structured dataset yielded better classification, though the unstructured dataset is quite informative. However, a t-test showed that the difference between the results of the two datasets was not significant. The LDA model consistently outperformed both LSI and HDP across topics. A comparison of the Gensim and Mallet topic modelling toolkits showed that Mallet gave better topic modelling results than Gensim.
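
To make the two corpus encodings named above concrete, the following minimal sketch computes bag-of-words counts and tf-idf weights in plain Python. It is an illustration only, not the study's pipeline: the toy corpus, the function names, and the unsmoothed weighting idf = log(N / df) are all assumptions, and libraries such as Gensim may apply different weighting and normalization.

```python
import math
from collections import Counter

# Toy corpus (hypothetical; not taken from the tweets or NIPS data).
docs = [
    "deep learning models learn representations",
    "topic models learn latent topics",
    "deep topic models",
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def bag_of_words(tokenized_docs):
    """Return one Counter of raw term counts per document."""
    return [Counter(tokens) for tokens in tokenized_docs]


def tf_idf(bow_docs):
    """Weight each raw count by idf = log(N / df), with no smoothing."""
    n_docs = len(bow_docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for bow in bow_docs:
        df.update(bow.keys())
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in bow.items()}
        for bow in bow_docs
    ]


tokenized = [tokenize(d) for d in docs]
bow = bag_of_words(tokenized)
weights = tf_idf(bow)
```

In this sketch a term that occurs in every document (here "models") receives a tf-idf weight of zero, whereas bag of words still counts it. That difference is one reason the two encodings can lead a topic model to different topics.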