An Evaluation of Topic Models for the Estimation of Unobserved Variables in Structured and Unstructured Documents

posted on 2022-07-05, 22:05 authored by Onyinye Nweke, Collins Udanor, George Okereke

For effective data collection, researchers often face three questions: where, what, and how. Where can researchable data be found, with what tools and methodologies can websites be scraped for such data, and how can the required analytics be performed to extract insightful knowledge? This study examines whether, and to what extent, user tweets could influence the direction of research, especially in the field of machine learning and artificial intelligence. In this paper, we use the Latent Dirichlet Allocation (LDA) topic modelling technique to discover the popularity of machine learning research topics in an unstructured dataset of 35,860 tweets from 20 artificial intelligence and machine learning related handles. As a control, we use a structured dataset of 7,241 articles from 42 years of Neural Information Processing Systems (NIPS) conference papers. Latent Semantic Indexing (LSI) and the Hierarchical Dirichlet Process (HDP) serve as baselines against which the performance of LDA is compared. Two encoding methods, bag of words and term frequency-inverse document frequency (tf-idf), are used to represent the corpora and are also compared. Results suggest that the structured dataset yields better classification, although the unstructured dataset is still quite informative. However, a t-test showed that the difference between the results of the two datasets was not statistically significant. The LDA model consistently outperformed both LSI and HDP across topics. A comparison of the Gensim and Mallet frameworks in Python showed that Mallet yielded better topic modelling results than Gensim.
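The two encodings named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' pipeline: the toy corpus, the function names, and the particular tf-idf formula (tf as count divided by document length, idf as the natural log of N over document frequency) are all assumptions chosen for illustration. Library implementations such as Gensim may use different weighting and normalization.

```python
import math
from collections import Counter

# Hypothetical toy corpus for illustration only; not data from the study.
docs = [
    "neural networks learn representations",
    "topic models learn topics from documents",
    "neural topic models",
]


def bag_of_words(doc):
    """Bag of words: raw term counts for a single document."""
    return Counter(doc.lower().split())


def tf_idf(documents):
    """tf-idf weights, assuming tf = count / doc length and idf = ln(N / df)."""
    bows = [bag_of_words(d) for d in documents]
    n_docs = len(bows)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for bow in bows for term in bow)
    weighted = []
    for bow in bows:
        total = sum(bow.values())
        weighted.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in bow.items()
        })
    return weighted


bows = [bag_of_words(d) for d in docs]
weights = tf_idf(docs)
```

In the bag-of-words encoding every term in the first toy document has the same count, whereas tf-idf gives "representations" (which appears in one document) a higher weight than "learn" (which appears in two). That down-weighting of widely shared terms is the practical difference between the two encodings.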


Submitting Author's Institution

University of Nigeria Nsukka

Submitting Author's Country

  • Nigeria
