# SensTopic (BETA)

SensTopic is a variant of Semantic Signal Separation that only discovers positive signals, while allowing components to be unbounded.
This is achieved with an algorithm called Semi-nonnegative Matrix Factorization (SNMF).

> :warning: This model is still in an experimental phase. More documentation and a paper are on their way. :warning:
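The general idea of semi-NMF is to factorize a matrix as X ≈ F · Gᵀ, where G is constrained to be nonnegative while F may take any sign — which is what permits positive signals with unbounded components. Below is a rough, minimal NumPy sketch of the classic multiplicative-update algorithm of Ding, Li & Jordan (2010); it is an illustration of the general technique only, not SensTopic's actual (far more efficient) implementation:

```python
import numpy as np

def seminmf(X, k, n_iter=200, seed=0):
    """Toy semi-NMF: X ≈ F @ G.T with G >= 0 and F unbounded.

    Multiplicative updates after Ding, Li & Jordan (2010)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    G = rng.random((m, k))
    pos = lambda A: (np.abs(A) + A) / 2  # elementwise positive part
    neg = lambda A: (np.abs(A) - A) / 2  # elementwise negative part
    for _ in range(n_iter):
        # F has a closed-form least-squares solution given G
        F = X @ G @ np.linalg.pinv(G.T @ G)
        XtF, FtF = X.T @ F, F.T @ F
        # the ratio only contains nonnegative terms, so G stays nonnegative
        G *= np.sqrt(
            (pos(XtF) + G @ neg(FtF)) / (neg(XtF) + G @ pos(FtF) + 1e-12)
        )
    return F, G

# X may contain negative values; only G is constrained to be nonnegative
X = np.random.default_rng(1).normal(size=(50, 20))
F, G = seminmf(X, k=3)
```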
SensTopic uses a very efficient implementation of the SNMF algorithm, written both in raw NumPy and in JAX.
If you want to enable hardware acceleration and JIT compilation, make sure to install JAX before running the model.

```bash
pip install jax
```
Here's an example of running SensTopic on the 20 Newsgroups dataset:

```python
from sklearn.datasets import fetch_20newsgroups
from turftopic import SensTopic

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
).data

model = SensTopic(25)
model.fit(corpus)

model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| | ... |
| 8 | gospels, mormon, catholics, protestant, mormons, synagogues, seminary, catholic, liturgy, churches |
| 9 | encryption, encrypt, encrypting, crypt, cryptosystem, cryptography, cryptosystems, decryption, encrypted, spying |
| 10 | palestinians, israelis, palestinian, israeli, gaza, israel, gazans, palestine, zionist, aviv |
| 11 | nasa, spacecraft, spaceflight, satellites, interplanetary, astronomy, astronauts, astronomical, orbiting, astronomers |
| 12 | imagewriter, colormaps, bitmap, bitmaps, pkzip, imagemagick, colormap, formats, adobe, ghostscript |
| | ... |
## Sparsity

SensTopic has a sparsity hyperparameter that roughly dictates how many topics will be assigned to a single document: having many topics per document gets penalized.
This means that the model is not only a matrix factorization model, but can also function as a soft clustering model, depending on this parameter.
Unlike clustering models, however, it may assign multiple topics to documents that contain them, and won't force every document to have exactly one topic.

Higher values will make your model behave more like a clustering model, while lower values will make it behave more like a decomposition model:
??? info "Click to see code"
    ```python
    import pandas as pd
    import numpy as np
    import plotly.express as px
    from sentence_transformers import SentenceTransformer
    from datasets import load_dataset

    from turftopic import SensTopic

    ds = load_dataset("gopalkalpande/bbc-news-summary", split="train")
    corpus = list(ds["Summaries"])

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(corpus, show_progress_bar=True)

    models = []
    doc_topic_ms = []
    sparsities = np.array([0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 10.0])
    for sparsity in sparsities:
        model = SensTopic(
            n_components=3, random_state=42, sparsity=sparsity, encoder=encoder
        )
        doc_topic = model.fit_transform(corpus, embeddings=embeddings)
        # normalize rows so each document's topic weights sum to one
        doc_topic = (doc_topic.T / doc_topic.sum(axis=1)).T
        models.append(model)
        doc_topic_ms.append(doc_topic)
    a_name, b_name, c_name = models[0].topic_names
    records = []
    for i, doc_topic in enumerate(doc_topic_ms):
        for dt in doc_topic:
            a, b, c, *_ = dt
            records.append(
                {
                    "sparsity": sparsities[i],
                    a_name: a,
                    b_name: b,
                    c_name: c,
                    "topic": models[0].topic_names[np.argmax(dt)],
                }
            )
    df = pd.DataFrame.from_records(records)
    fig = px.scatter_ternary(
        df, a=a_name, b=b_name, c=c_name, animation_frame="sparsity", color="topic"
    )
    fig.show()
    ```

<figure>
  <iframe src="../images/ternary_sparsity.html" title="Ternary plot of topics in documents." style="height:800px;width:1050px;padding:0px;border:none;"></iframe>
  <figcaption> Ternary plot of topic distribution in a 3-topic SensTopic model varying with sparsity. </figcaption>
</figure>

You can see that as the sparsity increases, topics get clustered much more clearly, and more weight gets allocated to the edges and corners of the ternary plot.
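The intuition behind this clustering effect can be shown with a toy example. Assuming the penalty behaves like a generic L1-style sparsity penalty (SensTopic's exact objective is not spelled out here), penalizing topic weights acts like soft-thresholding: weak topics get stripped from a document and the remaining mass re-concentrates on the strong ones:

```python
import numpy as np

def soft_threshold(weights, lam):
    """Subtract the penalty from every weight, clip at zero, renormalize."""
    out = np.maximum(weights - lam, 0.0)
    return out / out.sum()

doc = np.array([0.5, 0.3, 0.15, 0.05])  # one document's topic weights
for lam in (0.0, 0.1, 0.25):
    # larger penalties zero out more of the weak topics
    print(lam, soft_threshold(doc, lam).round(3))
```

With no penalty the distribution is untouched; as the penalty grows, the weakest topics drop to exactly zero and the document drifts toward a single dominant topic, mirroring the movement toward the corners in the ternary plot above.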

To see how many topics are present in your documents, you can use the `plot_topic_decay()` method, which shows you how topic weight gets distributed across the topics of your documents.
```python
model.plot_topic_decay()
```
<figure>
  <iframe src="../images/topic_decay.html" title="Topic Decay in SensTopic model" style="height:520px;width:1050px;padding:0px;border:none;"></iframe>
  <figcaption> Topic Decay in a SensTopic Model with sparsity=1. </figcaption>
</figure>
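If we assume the decay plot visualizes how quickly ranked topic weights fall off within documents (an assumption about the plot's content, sketched here on synthetic data rather than a fitted model), the underlying quantity can be computed like this:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy doc-topic matrix: 200 documents over 10 topics, with sparse-ish rows
doc_topic = rng.dirichlet(np.full(10, 0.2), size=200)

# rank topics within each document (strongest first), then average over docs
decay = np.sort(doc_topic, axis=1)[:, ::-1].mean(axis=0)
print(decay.round(3))
```

A curve that drops off steeply indicates that most documents are dominated by only a few topics.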
## Automatic number of topics

SensTopic can learn the number of topics in a given dataset.
In order to determine this quantity, we use a version of the Bayesian Information Criterion (BIC) modified for NMF.
This does not work equally well for all corpora, but it can be a powerful tool when the number of topics is not known a priori.
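The general idea of BIC-based model selection for a factorization model can be sketched as follows. This is illustrative only: it uses plain scikit-learn NMF rather than SensTopic's SNMF, and the penalty form is an assumption, not the exact modified criterion:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# toy nonnegative data with planted rank-4 structure plus noise
X = np.abs(rng.normal(size=(120, 4))) @ np.abs(rng.normal(size=(4, 30)))
X += 0.1 * np.abs(rng.normal(size=X.shape))

def bic_score(X, k):
    n, m = X.shape
    nmf = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
    W = nmf.fit_transform(X)
    rss = np.sum((X - W @ nmf.components_) ** 2)
    # trade off goodness of fit against the k * (n + m) free parameters
    return n * m * np.log(rss / (n * m)) + k * (n + m) * np.log(n * m)

# fit a model per candidate k and keep the one with the lowest score
scores = {k: bic_score(X, k) for k in range(2, 8)}
best_k = min(scores, key=scores.get)
print(best_k)
```

The fit term rewards explaining more variance, while the penalty grows with the number of components, so the criterion stops favoring larger models once extra topics only fit noise.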

In this example the model finds 6 topics in the BBC News dataset:

```python
# pip install datasets
from datasets import load_dataset

from turftopic import SensTopic

ds = load_dataset("gopalkalpande/bbc-news-summary", split="train")
corpus = list(ds["Summaries"])

model = SensTopic("auto")
model.fit(corpus)
model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| 0 | liverpool, mourinho, chelsea, premiership, arsenal, striker, madrid, midfield, uefa, manchester |
| 1 | oscar, bafta, oscars, cast, cinema, hollywood, actor, screenplay, actors, films |
| 2 | mobile, mobiles, broadband, devices, digital, internet, computers, microsoft, phones, telecoms |
| 3 | tory, blair, minister, ministers, parliamentary, mps, parliament, politicians, constituency, ukip |
| 4 | tennis, competing, federer, wimbledon, iaaf, olympic, tournament, athlete, rugby, olympics |
| 5 | gdp, stock, economy, earnings, investments, investment, invest, exports, finance, economies |
## API Reference

::: turftopic.models.senstopic.SensTopic