
Latent Dirichlet Allocation, LDA

Fundamental Concept

  • A dimensionality reduction method well suited to modeling documents.
  • Takes the words in each document as input data and can cover a variety of topics.
  • Widely used in natural language processing (NLP).
  • Builds latent topics from the words in the documents, showing what each document is composed of.
  • Can express the fact that a single document contains multiple topics (see the sketch below).
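
A minimal numerical sketch of this "document as a mixture of topics" view, assuming a toy vocabulary of three words and two topics (the topic proportions and word probabilities are made-up numbers, not values learned from data):

import numpy as np

# Hypothetical per-topic word distributions (rows: topics, columns: words).
word_given_topic = np.array([
    [0.5, 0.4, 0.1],   # topic 0
    [0.1, 0.2, 0.7],   # topic 1
])

# Hypothetical topic proportions for one document: 70% topic 0, 30% topic 1.
topic_given_doc = np.array([0.7, 0.3])

# The document's word distribution is the topic-weighted mixture of the rows.
word_given_doc = topic_given_doc @ word_given_topic
print(word_given_doc)  # [0.38 0.34 0.28]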

Algorithm

  1. Randomly assign a topic to each word in each document.
  2. From the current word-topic assignments, compute the probability of each topic within each document.
  3. From the current word-topic assignments, compute the probability of each word within each topic.
  4. Reassign a topic to each word with probability proportional to the product of the probabilities from steps 2 and 3.
  5. Repeat steps 2 through 4 until the assignments converge (a toy sketch of these steps follows below).
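
The following is a minimal sketch of these steps on a toy corpus. The corpus, the number of topics, the smoothing constants, and the fixed iteration count are all made up for illustration; a real sampler (e.g. collapsed Gibbs sampling) would also exclude the word being resampled from the counts.

import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids (vocabulary of size 4).
docs = [[0, 0, 1, 2], [2, 3, 3, 1], [0, 1, 1, 2]]
n_topics, n_words = 2, 4
alpha, beta = 0.1, 0.1          # smoothing constants standing in for the Dirichlet priors

# Step 1: assign a random topic to every word occurrence.
topics = [[rng.integers(n_topics) for _ in doc] for doc in docs]

for _ in range(50):             # steps 2-4, repeated until (approximate) convergence
    # Count the current assignments.
    doc_topic = np.full((len(docs), n_topics), alpha)
    topic_word = np.full((n_topics, n_words), beta)
    for d, doc in enumerate(docs):
        for w, z in zip(doc, topics[d]):
            doc_topic[d, z] += 1
            topic_word[z, w] += 1

    # Step 2: topic probabilities per document; step 3: word probabilities per topic.
    p_topic_doc = doc_topic / doc_topic.sum(axis=1, keepdims=True)
    p_word_topic = topic_word / topic_word.sum(axis=1, keepdims=True)

    # Step 4: reassign each word with probability proportional to the product of 2 and 3.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            p = p_topic_doc[d] * p_word_topic[:, w]
            topics[d][i] = rng.choice(n_topics, p=p / p.sum())

print(topics)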

Sample Code

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load the 20 Newsgroups corpus, stripping headers, footers, and quoted replies.
data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

# Keep only the 1,000 most frequent terms and drop English stop words.
max_features = 1000
tf_vectorizer = CountVectorizer(max_features=max_features, stop_words='english')

# Term-frequency (bag-of-words) matrix: documents x vocabulary.
tf = tf_vectorizer.fit_transform(data.data)

# Fit LDA with 20 topics (matching the 20 newsgroups).
n_topics = 20
model = LatentDirichletAllocation(n_components=n_topics)
model.fit(tf)

# components_: topic-word weights; transform(tf): per-document topic distributions.
print(model.components_)
print(model.transform(tf))
[[5.00000003e-02 6.45710674e+01 5.00000001e-02 ... 6.38112685e+01
  1.16031380e+01 4.41074673e+01]
 [1.04157211e+02 2.64297969e+01 4.61960134e+01 ... 5.00000008e-02
  8.90142127e+00 2.33574093e+01]
 [5.00000012e-02 1.33680902e+02 5.00000002e-02 ... 7.67160443e+00
  5.36926525e+01 6.77485803e-01]
 ...
 [5.00000003e-02 2.52195300e+01 5.00000002e-02 ... 3.02218006e+02
  5.85861481e+00 4.29119104e+01]
 [5.00000006e-02 1.57706482e+02 5.00000009e-02 ... 1.06781782e+00
  8.91727565e+01 5.00000030e-02]
 [5.00000002e-02 5.00000004e-02 5.00000009e-02 ... 3.04400547e+01
  5.00000002e-02 5.00000002e-02]]
[[2.08333337e-03 2.08333333e-03 2.08333339e-03 ... 2.08333340e-03
  4.45888702e-01 2.08333336e-03]
 [2.50000004e-03 2.50000000e-03 2.50000003e-03 ... 2.50000010e-03
  2.50000009e-03 1.75613637e-01]
 [6.02409652e-04 6.02409639e-04 6.02409642e-04 ... 4.36103622e-01
  6.02409646e-04 2.42973293e-01]
 ...
 [4.54545463e-03 4.54545457e-03 4.54545457e-03 ... 4.54545465e-03
  4.54545459e-03 9.13636362e-01]
 [2.94117655e-03 2.94117647e-03 2.94117648e-03 ... 2.94117654e-03
  2.94117655e-03 2.94117653e-03]
 [3.57142863e-03 3.57142874e-03 3.57142862e-03 ... 2.00037149e-01
  5.76870462e-01 3.57142871e-03]]
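
The per-topic word lists discussed in the next section can be printed from the fitted model roughly as follows. The 15-word cutoff is an arbitrary choice, and on scikit-learn versions before 1.0 get_feature_names_out() is get_feature_names().

# Show the highest-weight words for each topic.
feature_names = tf_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(model.components_):
    top_words = [feature_names[i] for i in topic.argsort()[::-1][:15]]
    print(f'Topic {topic_idx}: ' + ' '.join(top_words))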

Removing Stop Words

  • Topic 4: 00 10 25 15 12 11 16 20 14 13 18 30 50 17 55 40 21 …
  • Topic 6: people said know did don didn just time went like say think told
  • As the examples above show, topics whose top words are hard to interpret in this way can be improved by excluding such stop words before training, as in the sketch below.
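
One way to do this is sketched below, assuming we simply extend scikit-learn's built-in English stop-word list with the uninformative tokens seen above and restrict tokens to alphabetic words; the extra tokens and the token pattern are illustrative choices, not part of the original setup.

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Add the uninformative tokens observed in Topics 4 and 6 to the stop-word list.
extra_stop_words = {'00', '10', '25', 'don', 'didn', 'just', 'said', 'like'}
stop_words = list(ENGLISH_STOP_WORDS | extra_stop_words)

# A token pattern of two or more letters also drops the purely numeric "words".
tf_vectorizer = CountVectorizer(max_features=max_features,
                                stop_words=stop_words,
                                token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')
tf = tf_vectorizer.fit_transform(data.data)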

