Commit e770dad

Added SensTopic to docs
1 parent bcadc08 commit e770dad

4 files changed

Lines changed: 7942 additions & 0 deletions

docs/SensTopic.md

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
# SensTopic (BETA)

SensTopic is a variant of Semantic Signal Separation that only discovers positive signals, while allowing components to be unbounded.
This is achieved with an algorithm called Semi-nonnegative Matrix Factorization (SNMF).
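
To make the idea concrete, here is a minimal NumPy sketch of semi-nonnegative matrix factorization using the multiplicative updates of Ding, Li and Jordan. It is not turftopic's implementation and it omits any sparsity penalty; the function name `semi_nmf`, its parameters, and the random stand-in data are assumptions made for illustration only.

```python
import numpy as np


def semi_nmf(X, n_components, n_iter=200, eps=1e-9, seed=0):
    """Approximate X (n_samples x n_features) as W @ H with W >= 0 and H unconstrained."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.1, 1.0, size=(X.shape[0], n_components))
    for _ in range(n_iter):
        # H has no sign constraint, so it is the least-squares solution given W.
        H = np.linalg.pinv(W) @ X
        A = X @ H.T  # shape (n_samples, n_components)
        B = H @ H.T  # shape (n_components, n_components)
        # Split both matrices into their positive and negative parts.
        A_pos, A_neg = (np.abs(A) + A) / 2, (np.abs(A) - A) / 2
        B_pos, B_neg = (np.abs(B) + B) / 2, (np.abs(B) - B) / 2
        # Multiplicative update: every factor is non-negative, so W stays >= 0.
        # eps guards against division by zero.
        W *= np.sqrt((A_pos + W @ B_neg + eps) / (A_neg + W @ B_pos + eps))
    H = np.linalg.pinv(W) @ X
    return W, H


rng = np.random.default_rng(42)
X = rng.normal(size=(100, 16))  # stand-in for document embeddings
W, H = semi_nmf(X, n_components=3)
print(W.min() >= 0, (H < 0).any())
```

In this sketch `W` plays the role of the non-negative document-topic weights, while `H` holds the unbounded components.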

> :warning: This model is still in an experimental phase. More documentation and a paper are on their way. :warning:

SensTopic uses an efficient implementation of the SNMF algorithm, written in plain NumPy, with an alternative implementation in JAX.
If you want to enable hardware acceleration and JIT compilation, make sure to install JAX before running the model.

```bash
pip install jax
```
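
If you are unsure whether JAX is present in the current environment, a quick check like the one below can help. It only tests whether the `jax` package is importable; how SensTopic itself picks its backend is not described here, so treat this as a convenience check rather than a description of the library's behaviour.

```python
import importlib.util

# True if the `jax` package can be found in this environment.
jax_available = importlib.util.find_spec("jax") is not None
print("JAX available:", jax_available)
```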

Here's an example of running SensTopic on the 20 Newsgroups dataset:

```python
from sklearn.datasets import fetch_20newsgroups
from turftopic import SensTopic

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
).data

model = SensTopic(25)
model.fit(corpus)

model.print_topics()
```


| Topic ID | Highest Ranking |
| - | - |
| | ... |
| 8 | gospels, mormon, catholics, protestant, mormons, synagogues, seminary, catholic, liturgy, churches |
| 9 | encryption, encrypt, encrypting, crypt, cryptosystem, cryptography, cryptosystems, decryption, encrypted, spying |
| 10 | palestinians, israelis, palestinian, israeli, gaza, israel, gazans, palestine, zionist, aviv |
| 11 | nasa, spacecraft, spaceflight, satellites, interplanetary, astronomy, astronauts, astronomical, orbiting, astronomers |
| 12 | imagewriter, colormaps, bitmap, bitmaps, pkzip, imagemagick, colormap, formats, adobe, ghostscript |
| | ... |

## Sparsity

SensTopic has a sparsity hyperparameter that roughly dictates how many topics will be assigned to a single document: the higher it is, the more heavily having many topics per document gets penalized.
This means that the model is a matrix factorization model, but can also function as a soft clustering model, depending on this parameter.
Unlike clustering models, however, it may assign multiple topics to documents that have them, and won't force every document to contain only one topic.

Higher values will make your model more like a clustering model, while lower values will make it more like a decomposition model:

??? info "Click to see code"
    ```python
    # pip install datasets plotly
    import numpy as np
    import pandas as pd
    import plotly.express as px
    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer

    from turftopic import SensTopic

    ds = load_dataset("gopalkalpande/bbc-news-summary", split="train")
    corpus = list(ds["Summaries"])

    # Encode the corpus once and reuse the embeddings for every model
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(corpus, show_progress_bar=True)

    sparsities = np.array([0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 10.0])
    models = []
    doc_topic_ms = []
    for sparsity in sparsities:
        model = SensTopic(
            n_components=3, random_state=42, sparsity=sparsity, encoder=encoder
        )
        doc_topic = model.fit_transform(corpus, embeddings=embeddings)
        # Normalize each document's topic weights so that they sum to one
        doc_topic = (doc_topic.T / doc_topic.sum(axis=1)).T
        models.append(model)
        doc_topic_ms.append(doc_topic)

    # Axes and colors are labelled with the topic names of the first model
    a_name, b_name, c_name = models[0].topic_names
    records = []
    for i, doc_topic in enumerate(doc_topic_ms):
        for dt in doc_topic:
            a, b, c, *_ = dt
            records.append(
                {
                    "sparsity": sparsities[i],
                    a_name: a,
                    b_name: b,
                    c_name: c,
                    "topic": models[0].topic_names[np.argmax(dt)],
                }
            )
    df = pd.DataFrame.from_records(records)
    fig = px.scatter_ternary(
        df, a=a_name, b=b_name, c=c_name, animation_frame="sparsity", color="topic"
    )
    fig.show()
    ```

<figure>
<iframe src="../images/ternary_sparsity.html", title="Ternary plot of topics in documents.", style="height:800px;width:1050px;padding:0px;border:none;"></iframe>
<figcaption> Ternary plot of topic distribution in a 3-topic SensTopic model varying with sparsity. </figcaption>
</figure>

You can see that as the sparsity increases, topics get clustered much more clearly, and documents move towards the edges and corners of the plot, where only one or two topics carry most of the weight.
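
One way to put a number on this effect is to count, for each document, how many topics carry a non-negligible share of its weight. The sketch below does this on two small hand-made matrices rather than on real model output; the helper name `topics_per_document` and the 0.1 threshold are hypothetical choices, not part of turftopic.

```python
import numpy as np


def topics_per_document(doc_topic, threshold=0.1):
    """Count topics whose share of a document's total weight is at least `threshold`."""
    doc_topic = np.asarray(doc_topic, dtype=float)
    row_sums = doc_topic.sum(axis=1, keepdims=True)
    # Row-normalize; rows that sum to zero keep a share of zero everywhere.
    shares = np.divide(
        doc_topic, row_sums, out=np.zeros_like(doc_topic), where=row_sums > 0
    )
    return (shares >= threshold).sum(axis=1)


# Hand-made examples: a "decomposition-like" and a "clustering-like" matrix
dense = np.array([[0.4, 0.35, 0.25], [0.5, 0.3, 0.2]])
sparse = np.array([[0.95, 0.05, 0.0], [0.0, 0.02, 0.98]])

print(topics_per_document(dense))   # [3 3]
print(topics_per_document(sparse))  # [1 1]
```

Applied to the output of `fit_transform()` at different sparsity values, the average of these counts should drop as sparsity grows.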

To see how many topics there are in your documents, you can use the `plot_topic_decay()` method, which shows you how topic weights get assigned to documents.

```python
model.plot_topic_decay()
```

<figure>
<iframe src="../images/topic_decay.html", title="Topic Decay in SensTopic model", style="height:520px;width:1050px;padding:0px;border:none;"></iframe>
<figcaption> Topic Decay in a SensTopic Model with sparsity=1. </figcaption>
</figure>

## Automatic number of topics

SensTopic can learn the number of topics in a given dataset.
To determine this quantity, we use a version of the Bayesian Information Criterion modified for NMF.
This does not work equally well for all corpora, but it can be a powerful tool when the number of topics is not known a priori.
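
The modified criterion itself is not spelled out here, so the following is only a generic illustration of information-criterion-based rank selection. It uses the textbook Gaussian BIC with a truncated SVD as a stand-in factorization, on synthetic data with three latent components; the helper `gaussian_bic` and its parameter count are assumptions, not SensTopic's method.

```python
import numpy as np


def gaussian_bic(X, k):
    """Textbook Gaussian BIC of a rank-k truncated SVD approximation of X."""
    n, d = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_hat = (U[:, :k] * s[:k]) @ Vt[:k]
    rss = np.sum((X - X_hat) ** 2)
    n_obs = n * d
    n_params = k * (n + d)  # assumption: count every entry of both factors
    return n_obs * np.log(rss / n_obs) + n_params * np.log(n_obs)


rng = np.random.default_rng(0)
# Synthetic data: three latent components plus a little noise
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 20))
X += 0.1 * rng.normal(size=X.shape)

candidates = range(1, 7)
scores = [gaussian_bic(X, k) for k in candidates]
# The candidate with the lowest BIC is selected
best_k = candidates[int(np.argmin(scores))]
print(best_k)
```

The fit term rewards lower reconstruction error, while the penalty term grows with every added component, so the minimum sits where extra components stop paying for themselves.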

In this example the model finds 6 topics in the BBC News dataset:

```python
# pip install datasets
from datasets import load_dataset

from turftopic import SensTopic

ds = load_dataset("gopalkalpande/bbc-news-summary", split="train")
corpus = list(ds["Summaries"])

# Passing "auto" lets the model choose the number of topics
model = SensTopic("auto")
model.fit(corpus)
model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| 0 | liverpool, mourinho, chelsea, premiership, arsenal, striker, madrid, midfield, uefa, manchester |
| 1 | oscar, bafta, oscars, cast, cinema, hollywood, actor, screenplay, actors, films |
| 2 | mobile, mobiles, broadband, devices, digital, internet, computers, microsoft, phones, telecoms |
| 3 | tory, blair, minister, ministers, parliamentary, mps, parliament, politicians, constituency, ukip |
| 4 | tennis, competing, federer, wimbledon, iaaf, olympic, tournament, athlete, rugby, olympics |
| 5 | gdp, stock, economy, earnings, investments, investment, invest, exports, finance, economies |


## API Reference

::: turftopic.models.senstopic.SensTopic

docs/images/ternary_sparsity.html

Lines changed: 3892 additions & 0 deletions
Large diffs are not rendered by default.

docs/images/topic_decay.html

Lines changed: 3888 additions & 0 deletions
Large diffs are not rendered by default.

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ nav:
 - Topic Models:
   - Model Overview: model_overview.md
   - Semantic Signal Separation (S³): s3.md
+  - SensTopic (BETA): SensTopic.md
   - KeyNMF: KeyNMF.md
   - GMM: GMM.md
   - Clustering Models (BERTopic & Top2Vec): clustering.md
