Abstract:
Latent Dirichlet allocation (LDA) is widely used to uncover the topics hidden in large document collections (corpora), but existing implementations suffer from long computation times. To overcome this limitation, we propose a parallel LDA topic modeling method based on the MapReduce distributed programming model. We evaluate the method experimentally on the Hadoop parallel computing platform. The results show that, when processing large volumes of text, the proposed method achieves near-linear speedup and improves the effectiveness of topic model construction.
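
The abstract does not say which inference algorithm is parallelized, so the following is only a minimal sketch of how one MapReduce-style LDA iteration could be organized. It assumes collapsed Gibbs sampling with count merging: each map task sweeps its own shard of documents against a private copy of the global topic-word counts, and the reduce step sums the resulting count changes. Plain Python functions stand in for Hadoop mappers and reducers, and all names (`map_gibbs_sweep`, `reduce_counts`, `run_iterations`) and hyperparameter values are hypothetical.

```python
import random
from collections import Counter


def count_topic_words(docs, assignments):
    """Count (topic, word) pairs for one shard of documents."""
    counts = Counter()
    for doc, z_doc in zip(docs, assignments):
        for word, topic in zip(doc, z_doc):
            counts[(topic, word)] += 1
    return counts


def map_gibbs_sweep(docs, assignments, global_counts, num_topics,
                    vocab_size, alpha, beta, rng):
    """Map step (assumed): one collapsed Gibbs sweep over a shard.

    Works on a private copy of the global counts, updates the shard's
    topic assignments in place, and returns the change in counts.
    """
    local = Counter(global_counts)
    topic_totals = [0] * num_topics
    for (topic, _word), c in local.items():
        topic_totals[topic] += c
    before = count_topic_words(docs, assignments)
    for doc, z_doc in zip(docs, assignments):
        doc_topics = Counter(z_doc)
        for i, word in enumerate(doc):
            old = z_doc[i]
            # Remove the current token from all counts before resampling.
            local[(old, word)] -= 1
            topic_totals[old] -= 1
            doc_topics[old] -= 1
            weights = [
                (doc_topics[k] + alpha) * (local[(k, word)] + beta)
                / (topic_totals[k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            new = rng.choices(range(num_topics), weights=weights)[0]
            z_doc[i] = new
            local[(new, word)] += 1
            topic_totals[new] += 1
            doc_topics[new] += 1
    delta = count_topic_words(docs, assignments)
    delta.subtract(before)
    return delta


def reduce_counts(global_counts, deltas):
    """Reduce step (assumed): add every shard's count change to the global counts."""
    merged = Counter(global_counts)
    for delta in deltas:
        merged.update(delta)  # Counter.update adds counts, including negative ones
    return merged


def run_iterations(shards, num_topics, vocab_size, iterations, seed=0):
    """Drive a few map/reduce rounds sequentially (a cluster would run maps in parallel)."""
    rng = random.Random(seed)
    assignments = [[[rng.randrange(num_topics) for _ in doc] for doc in shard]
                   for shard in shards]
    global_counts = Counter()
    for shard, z in zip(shards, assignments):
        global_counts.update(count_topic_words(shard, z))
    for _ in range(iterations):
        deltas = [map_gibbs_sweep(shard, z, global_counts, num_topics,
                                  vocab_size, 0.1, 0.01, rng)
                  for shard, z in zip(shards, assignments)]
        global_counts = reduce_counts(global_counts, deltas)
    return global_counts, assignments
```

Because each map task reads only the shared counts and its own shard, the shards can be processed independently within an iteration. In this scheme that independence is what would make near-linear speedup plausible. The cost is that each shard samples against counts that are slightly stale until the next reduce.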