LDA
Latent Dirichlet Allocation (LDA)
Fundamental Concept
- A dimensionality reduction method well suited to modeling documents.
- Takes the words in each document as input data and discovers the various topics the documents cover.
- Widely used in natural language processing (NLP).
- Builds latent topics from the words in the documents, showing what each document is composed of.
- Expresses a single document as a mixture of several topics (a toy illustration follows below).
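To make the "mixture of topics" idea concrete, here is a toy sketch; the topic names, word lists, and mixing proportions are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["game", "score", "team", "election", "vote", "policy"]
# two hypothetical topics, each a probability distribution over the vocabulary
topics = {
    "sports":   np.array([0.4, 0.3, 0.3, 0.0, 0.0, 0.0]),
    "politics": np.array([0.0, 0.0, 0.1, 0.3, 0.3, 0.3]),
}
# a document that is 70% sports and 30% politics
doc_mixture = {"sports": 0.7, "politics": 0.3}
word_dist = sum(weight * topics[name] for name, weight in doc_mixture.items())
# draw 10 words from the document's mixed word distribution
print(rng.choice(vocab, size=10, p=word_dist))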
Algorithm
1. Randomly assign a topic to each word in every document.
2. From the current assignments, compute the topic distribution of each document.
3. From the current assignments, compute the word distribution of each topic.
4. Reassign a topic to each word with probability proportional to the product of the values from steps 2 and 3.
5. Repeat steps 2 through 4 until the assignments converge (a minimal sketch of these steps follows below).
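Below is a minimal Gibbs-sampling-style sketch of steps 1-5 using only NumPy. The function lda_gibbs, the toy word-id corpus, and the smoothing hyperparameters alpha and beta are assumptions for illustration; scikit-learn's LatentDirichletAllocation, used in the sample code below, relies on a variational method instead.
import numpy as np

def lda_gibbs(docs, n_topics, n_vocab, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))   # per-document topic counts
    topic_word = np.zeros((n_topics, n_vocab))    # per-topic word counts
    # step 1: assign a random topic to every word occurrence
    assign = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            doc_topic[d, assign[d][i]] += 1
            topic_word[assign[d][i], w] += 1
    for _ in range(n_iter):                       # step 5: repeat
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # temporarily remove the current assignment
                doc_topic[d, z] -= 1
                topic_word[z, w] -= 1
                # steps 2-4: probability proportional to
                # (topic probability in the document) * (word probability in the topic)
                p = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) \
                    / (topic_word.sum(axis=1) + beta * n_vocab)
                z = rng.choice(n_topics, p=p / p.sum())
                assign[d][i] = z
                doc_topic[d, z] += 1
                topic_word[z, w] += 1
    return doc_topic, topic_word

# toy corpus of word ids (vocabulary size 6), purely for illustration
toy_docs = [[0, 1, 2, 1], [3, 4, 3, 5], [0, 2, 4, 5]]
doc_topic, topic_word = lda_gibbs(toy_docs, n_topics=2, n_vocab=6)
print(doc_topic)   # documents x topics
print(topic_word)  # topics x vocabulary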
Sample Code
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# load the 20 Newsgroups corpus, stripping headers, footers, and quoted replies
data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

# build a document-word count matrix, keeping the 1,000 most frequent terms
# and excluding English stop words
max_features = 1000
tf_vectorizer = CountVectorizer(max_features=max_features, stop_words='english')
tf = tf_vectorizer.fit_transform(data.data)

# fit LDA with 20 topics
n_topics = 20
model = LatentDirichletAllocation(n_components=n_topics)
model.fit(tf)

print(model.components_)    # topic-word matrix
print(model.transform(tf))  # document-topic matrix
[[5.00000003e-02 6.45710674e+01 5.00000001e-02 ... 6.38112685e+01
1.16031380e+01 4.41074673e+01]
[1.04157211e+02 2.64297969e+01 4.61960134e+01 ... 5.00000008e-02
8.90142127e+00 2.33574093e+01]
[5.00000012e-02 1.33680902e+02 5.00000002e-02 ... 7.67160443e+00
5.36926525e+01 6.77485803e-01]
...
[5.00000003e-02 2.52195300e+01 5.00000002e-02 ... 3.02218006e+02
5.85861481e+00 4.29119104e+01]
[5.00000006e-02 1.57706482e+02 5.00000009e-02 ... 1.06781782e+00
8.91727565e+01 5.00000030e-02]
[5.00000002e-02 5.00000004e-02 5.00000009e-02 ... 3.04400547e+01
5.00000002e-02 5.00000002e-02]]
[[2.08333337e-03 2.08333333e-03 2.08333339e-03 ... 2.08333340e-03
4.45888702e-01 2.08333336e-03]
[2.50000004e-03 2.50000000e-03 2.50000003e-03 ... 2.50000010e-03
2.50000009e-03 1.75613637e-01]
[6.02409652e-04 6.02409639e-04 6.02409642e-04 ... 4.36103622e-01
6.02409646e-04 2.42973293e-01]
...
[4.54545463e-03 4.54545457e-03 4.54545457e-03 ... 4.54545465e-03
4.54545459e-03 9.13636362e-01]
[2.94117655e-03 2.94117647e-03 2.94117648e-03 ... 2.94117654e-03
2.94117655e-03 2.94117653e-03]
[3.57142863e-03 3.57142874e-03 3.57142862e-03 ... 2.00037149e-01
5.76870462e-01 3.57142871e-03]]
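The first block printed above is model.components_, the topic-word matrix with one row of unnormalized word weights per topic; the second is model.transform(tf), the document-topic matrix in which each row is a document's topic distribution and sums to 1. A small check of this, assuming the variables from the sample code are still in scope:
doc_topic = model.transform(tf)
print(model.components_.shape)  # (20, 1000): topics x vocabulary terms
print(doc_topic.shape)          # (number of documents, 20)
print(doc_topic[0].sum())       # each row is a topic distribution, so roughly 1.0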
Removing stop words
- Topic 4: 00 10 25 15 12 11 16 20 14 13 18 30 50 17 55 40 21 …
- Topic 6: people said know did don didn just time went like say think told
- As the two topics above show, topic distributions that are hard to interpret intuitively can be improved by excluding stop words when training (a sketch for listing the top words of each topic follows below).
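Topic word lists like the ones above can be produced from the fitted model along these lines; this sketch assumes the model and tf_vectorizer from the sample code, and get_feature_names_out() is only available in recent scikit-learn versions (older ones use get_feature_names()).
import numpy as np

feature_names = tf_vectorizer.get_feature_names_out()
n_top_words = 15
for topic_idx, topic in enumerate(model.components_):
    # indices of the highest-weighted words for this topic
    top = np.argsort(topic)[::-1][:n_top_words]
    print(f"Topic {topic_idx}:", " ".join(feature_names[i] for i in top))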
References
- 秋庭伸也 (Akiba Shinya), Sugiyama Asei, and Terada Manabu, 머신러닝 도감: 그림으로 공부하는 머신러닝 알고리즘 17, translated by 이중민, 2019.