Commits
54 commits
219278d
Initial commit
maltevogl Sep 24, 2021
53173d6
init semantic layer tools package
maltevogl Sep 24, 2021
258d839
wip continue word scoring linkage
maltevogl Sep 27, 2021
1f05a8e
wip add tests and docs with tox
maltevogl Sep 28, 2021
bd14d0d
wip set readthedocs theme, add testing, fix linkage
maltevogl Sep 29, 2021
1cd58bc
add clustering and util fct
maltevogl Sep 30, 2021
d478b6a
wip update utils to run pipeline
maltevogl Sep 30, 2021
f1c4c41
wip updt origin, data testing complete, no tests written for clusteri…
maltevogl Oct 1, 2021
1ba08e4
fix multiprocessing
maltevogl Oct 4, 2021
7c983f8
add routine for cocitations
maltevogl Dec 9, 2021
baea9a1
add giant component writing
maltevogl Dec 10, 2021
532e56e
upd orig
maltevogl Dec 15, 2021
c885446
add leiden time clusters and streamgraph visuals, WIP
maltevogl Dec 15, 2021
b481ef9
finish streamgraph, add routine for reports, WIP
maltevogl Dec 16, 2021
8a1208f
upd origin, wip on reports multiprocessing
maltevogl Dec 17, 2021
3f6769f
finish mp reporting, add pipeline
maltevogl Dec 20, 2021
c4b60df
finish pipeline, minor improv of reportings
maltevogl Dec 21, 2021
e4e81d9
add embedding utility fct
maltevogl Dec 21, 2021
e1459bf
rm not necess imports
maltevogl Dec 22, 2021
2343430
fix csv export of embeddings
maltevogl Jan 3, 2022
9ed19a8
add util for clustering
maltevogl Jan 3, 2022
f8b297a
improve docs
maltevogl Jan 4, 2022
4849add
minor fixes
maltevogl Jan 5, 2022
c97976f
add doc
maltevogl Jan 6, 2022
62d500a
add readthedocs yaml
maltevogl Jan 6, 2022
5c129da
add req
maltevogl Jan 6, 2022
4b54746
add readme and license files from mainpage to docs
maltevogl Jan 6, 2022
0eb2e21
clean doc building
maltevogl Jan 7, 2022
2ed0677
bump version
maltevogl Jan 7, 2022
ffbadc1
fix link in docs
maltevogl Jan 24, 2022
1e497e9
linting and add cleaning to docs, extend docs for pipelines
maltevogl Feb 28, 2022
caee432
add req corpus for readthedocs
maltevogl Feb 28, 2022
3786649
bump version
maltevogl Feb 28, 2022
1702f8f
add docs req egg install
maltevogl Feb 28, 2022
92605e5
wip fix small data size vs cpu count
maltevogl Mar 2, 2022
016d02e
upd version
maltevogl Mar 2, 2022
2c1eea5
wip fix nodeid -> self.publicationIDcolumn, check for module import e…
maltevogl Mar 2, 2022
a874d14
fix: text column contains text in list form
maltevogl Mar 2, 2022
fb9b79e
fix: author and aff are joined by semicolon
maltevogl Mar 2, 2022
32d095b
add option to cluster full graphs
maltevogl Mar 3, 2022
788f07f
include generateTree add authors, tox dep include embeddml
maltevogl Mar 3, 2022
0d4371c
wip updt org
maltevogl Mar 7, 2022
69ea04d
add citationet working data generation
maltevogl Mar 15, 2022
eaa2d0f
wip add cleaning routine for title strings to make data json friendly
maltevogl Mar 17, 2022
b165148
add cleaning of "
maltevogl Mar 18, 2022
0c0419f
add first author name to filename
maltevogl Mar 21, 2022
1df9645
wip add time,filename output
maltevogl Mar 21, 2022
8ddb6ac
catch exception of no author or empty df
maltevogl Mar 22, 2022
ab08202
chg return type
maltevogl Mar 22, 2022
15d9666
add option for citationlimit
maltevogl Mar 22, 2022
77c76a4
return more informative feedback vals
maltevogl Mar 22, 2022
f940b72
chg output format of duration
maltevogl Mar 22, 2022
8c63199
wip fix doc for visual
maltevogl Mar 24, 2022
644dd31
Merge branch 'main' into dh-code-review
jdamerow May 27, 2022
174 changes: 174 additions & 0 deletions src/semanticlayertools/clustering/leiden.py
@@ -0,0 +1,174 @@
import os
import time
import re
from tqdm import tqdm

import igraph as ig
import leidenalg as la


class TimeCluster():
"""Cluster time-sliced data with the Leiden algorithm.

Calculates temporal clusters of e.g. time-sliced cocitation or citation
data using the Leiden algorithm. Two nodes in different year slices are
assumed to be identical if their node name is the same,
e.g. the bibcode or DOI.

Input files are assumed to include the year in the filename, to end in
`_GC.net` to mark them as giant components, and to be in Pajek
format.

The resolution parameter can be seen as a limiting density above
which neighbouring nodes are considered a cluster. The interslice coupling
describes the influence of the yearly order on the clustering process. See
the documentation of the Leiden algorithm for more details.

:param inpath: Path for input network data
:type inpath: str
:param outpath: Path for writing output data
:type outpath: str
:param resolution: Main parameter for the clustering quality function (Constant Potts Model)
:type resolution: float
:param intersliceCoupling: Coupling parameter between two year slices, also influences cluster detection
:type intersliceCoupling: float
:param timerange: The time range for considering input data (default=(1945, 2005))
:type timerange: tuple
:raises OSError: If the output file already exists at class instantiation

.. seealso::
Traag, V.A., Waltman, L., van Eck, N.J. (2019).
From Louvain to Leiden: guaranteeing well-connected communities.
Scientific Reports, 9, 5233. 10.1038/s41598-019-41695-z
"""

def __init__(
self, inpath: str, outpath: str,
resolution: float = 0.003,
intersliceCoupling: float = 0.4,
timerange: tuple = (1945, 2005),
tuple[int, int] is better

useGC: bool = True,
This parameter is not documented, I'm uncertain what it does.

Reading further down, it probably means "use Giant Component". This makes sense, but before reading on I suspected useGC had something to do with garbage collection.

Member
this is already fixed upstream

):
starttime = time.time()
self.inpath = inpath
self.outpath = outpath
self.res_param = resolution
self.interslice_param = intersliceCoupling
self.timerange = timerange

self.outfile = os.path.join(
outpath,
f'timeclusters_{timerange[0]}-{timerange[1]}_res_{resolution}_intersl_{intersliceCoupling}.csv'
)
if os.path.isfile(self.outfile):
raise OSError(f'Output file at {self.outfile} exists. Aborting.')
I think a ValueError is more customary here; an OSError is usually a wrapper for errors detected by OS APIs.
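For example, the same check with the suggested exception type:

if os.path.isfile(self.outfile):
    raise ValueError(f'Output file at {self.outfile} exists. Aborting.')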


if useGC is True:
is True kind of threw me off (why not just if useGC?), but apparently this checks that useGC is actually True, not merely a truthy value. Seems like a nice old-school way of making sure something is a boolean. However, you're using type hints, so you can probably assume it's a boolean (if someone passes a non-boolean on purpose, well, that's on them).

Member
yes, it's something that comes as feedback from some linters...

edgefiles = [x for x in os.listdir(inpath) if x.endswith('_GC.net')]
elif useGC is False:
edgefiles = [x for x in os.listdir(inpath) if x.endswith('.ncol')]
If you check for True and False, it means there's a third option. I think you should raise an exception in that case, or just use else instead of elif.
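A minimal sketch of such an else branch (the message wording is just a suggestion):

else:
    raise ValueError(f'useGC must be a boolean, got {useGC!r}')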


self.graphDict = {}

for idx in tqdm(range(len(edgefiles)), leave=False):
try:
year = re.findall(r'\d{4}', edgefiles[idx])[0]
except Exception:
This is the second time I see this code: going over files, making sure there is at least one four-digit number in the name, and returning the first one. In this case, I usually like creating a small utility function that finds the files you need and yields their names. That way, you won't have to repeat the filename check twice, or raise the proper exception twice.
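A minimal sketch of such a helper (hypothetical name, reusing the os and re imports already present in this file):

def _yearlyEdgeFiles(inpath, suffix, timerange):
    """Yield (year, filename) for edge files whose name contains a year within timerange."""
    for filename in sorted(os.listdir(inpath)):
        if not filename.endswith(suffix):
            continue
        match = re.search(r'\d{4}', filename)
        if match is None:
            raise ValueError(f'No four-digit year found in filename: {filename}')
        if timerange[0] <= int(match.group()) <= timerange[1]:
            yield match.group(), filename

The constructor could then just loop over _yearlyEdgeFiles(inpath, '_GC.net' if useGC else '.ncol', timerange) and pick the matching igraph reader.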

raise
if timerange[0] <= int(year) <= timerange[1]:
if useGC is True:
graph = ig.Graph.Read_Pajek(os.path.join(inpath, edgefiles[idx]))
elif useGC is False:
graph = ig.Graph.Read_Ncol(os.path.join(inpath, edgefiles[idx]))
self.graphDict[year] = graph

self.optimiser = la.Optimiser()

print(
"Graphs between "
f"{min(list(self.graphDict.keys()))} and "
f"{max(list(self.graphDict.keys()))} "
f"loaded in {time.time() - starttime} seconds."
)

def optimize(self, clusterSizeCompare: int = 1000):
"""Optimize clusters accross time slices.

This runs the actual clustering and can be very time- and memory-consuming
for large networks. Depending on the obtained cluster results, this method
may have to be run iteratively with a varying resolution parameter. Output
is written to file, with the filename containing the chosen parameters.

The output CSV contains information on which node in which year belongs
to which cluster. As a first measure of returned clustering, the method
prints the number of clusters found above a threshold defined by
`clusterSizeCompare`. This does not influence the output clustering.

:param clusterSizeCompare: Threshold for `interesting` clusters
:type clusterSizeCompare: int
:returns: Tuple of output file path and list of found clusters in tuple format (node, year, cluster)
:rtype: tuple

.. seealso::
Documentation of time-layer creation routine:
`Leiden documentation <https://leidenalg.readthedocs.io/en/latest/multiplex.html#temporal-community-detection>`_
"""
starttime = time.time()

layers, interslice_layer, _ = la.time_slices_to_layers(
list(self.graphDict.values()),
interslice_weight=self.interslice_param,
vertex_id_attr='name'
)
print('\tSet layers.')

partitions = [
Some comments would be very useful to the reader who (like me) doesn't know what CPMVertexPartition or optimize_partition_multiplex do. Just a short description, not the documentation of these functions.

Member
ok will add them
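For instance, something along these lines above the list comprehension (my wording, just a sketch):

# One CPM (Constant Potts Model) partition per time slice; resolution_parameter
# sets how dense a neighbourhood has to be to count as a cluster.
# The interslice partition below (resolution 0) only encodes the coupling between
# year slices; optimise_partition_multiplex then optimises all of them jointly.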

la.CPMVertexPartition(
H,
node_sizes='node_size',
weights='weight',
resolution_parameter=self.res_param
) for H in layers
]
print('\tSet partitions.')

interslice_partition = la.CPMVertexPartition(
interslice_layer,
resolution_parameter=0,
node_sizes='node_size',
weights='weight'
)
print('\tSet interslice partitions.')

self.optimiser.optimise_partition_multiplex(
partitions + [interslice_partition]
)

subgraphs = interslice_partition.subgraphs()

commun = []
for idx, part in enumerate(subgraphs):
nodevals = [
(
x['name'],
list(self.graphDict.keys()).pop(x['slice']),
idx
) for x in part.vs
]
commun.extend(nodevals)

with open(self.outfile, 'w') as outfile:
outfile.write('node,year,cluster\n')
for elem in commun:
outfile.write(
f"{elem[0]},{elem[1]},{elem[2]}\n"
)
largeclu = [
(x, len(x.vs)) for x in subgraphs if len(x.vs) > clusterSizeCompare
]
print(
f'Finished in {time.time() - starttime} seconds. '
f"Found {len(subgraphs)} clusters, with {len(largeclu)} larger than {clusterSizeCompare} nodes."
)

return self.outfile, commun
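A minimal usage sketch pieced together from the docstrings above (directory paths are placeholders):

from semanticlayertools.clustering.leiden import TimeCluster

timeclusters = TimeCluster(
    inpath='./networks/',
    outpath='./clusters/',
    resolution=0.003,
    intersliceCoupling=0.4,
    timerange=(1945, 2005),
)
outfile, clusters = timeclusters.optimize(clusterSizeCompare=1000)
# clusters is a list of (node, year, cluster) tuples; the same information is
# written to the CSV at outfile with the header node,year,cluster.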