Description
TL;DR: check out this no-code webgraph exploration app and the pyccwebgraph package it uses.
Disclaimer: I was motivated to make these because much of my research has depended on this functionality. I am hoping these tools make it easier for others to conduct similar research, or simply reproduce our findings if they wish to. That being said, I do not know what the appropriate way to share these things with the open source community is, or if there is a preference to incorporate them in some way into the Common Crawl utilities in a more official manner. This issue is just to call attention to them so those who are interested can figure out the best place to put them, if indeed they wish for them to be put anywhere! A good deal of the code for both the package and the demo was written by LLMs.
Why aren't network scientists using cc-webgraph (more)?
cc-webgraphs are an incredible resource for researchers, but they are underexplored. After attempting to encourage other researchers to take advantage of the webgraph framework and cc-webgraph, I chalk this up to the following problems:
- System requirements for storing and processing large graph datasets are uncertain and the task seems daunting.
- Interfacing with the WebGraph library (relying on Bash / Java knowledge) implies a level of systems knowledge that is rare in the world of data and network science researchers.
- The ability to easily interact with webgraph data (outside of Common Crawl) is gatekept by commercial SEO data providers. Backlink / outlink APIs from Ahrefs and Semrush are (prohibitively) expensive for researchers, whose work does not fit the typical use cases of SEO toolsets anyway.
Interactive demo
Indicative of the first problem, any processing over webgraphs I had to do for my past research was done on a compute server with a lot of resources available. After trying out the interactive jshell demo, I realized the first point is quickly becoming moot. By generating mapping / offset files, the graphs don't need to be kept in memory, and so RAM requirements are now negligible. Of course, I'm sure that the webgraph framework has had this feature for some time, and I have only just now discovered it. Segue into the second problem.
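To illustrate why an offsets file removes the need to hold the graph in RAM, here is a toy sketch in Python. It is not the WebGraph on-disk format (which is compressed); the file layout and the names `write_graph` and `successors` are hypothetical and only meant to show the idea of seeking straight to one node's record.

```python
import os
import struct
import tempfile

def write_graph(adjacency, graph_path, offsets_path):
    """Write successor lists to graph_path and byte offsets to offsets_path.

    Toy layout (an assumption, not WebGraph's format): for each node, a
    4-byte count followed by that many 4-byte successor ids.
    """
    with open(graph_path, "wb") as g, open(offsets_path, "wb") as o:
        for succ in adjacency:
            o.write(struct.pack("<Q", g.tell()))  # where this node's record starts
            g.write(struct.pack("<I", len(succ)))
            g.write(struct.pack(f"<{len(succ)}I", *succ))

def successors(node, graph_path, offsets_path):
    """Read one node's successors by seeking; nothing else is loaded."""
    with open(offsets_path, "rb") as o:
        o.seek(node * 8)  # offsets are fixed width, so the lookup is a single seek
        (offset,) = struct.unpack("<Q", o.read(8))
    with open(graph_path, "rb") as g:
        g.seek(offset)
        (count,) = struct.unpack("<I", g.read(4))
        return list(struct.unpack(f"<{count}I", g.read(4 * count)))

# Node i links to the nodes listed at index i.
toy_adjacency = [[1, 2], [2], [], [0, 1, 2]]
tmp_dir = tempfile.mkdtemp()
graph_file = os.path.join(tmp_dir, "toy.graph")
offsets_file = os.path.join(tmp_dir, "toy.offsets")
write_graph(toy_adjacency, graph_file, offsets_file)
print(successors(3, graph_file, offsets_file))
```

Memory use here depends on the size of the one record being read, not on the size of the graph, which is the property that makes laptop-scale exploration plausible.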
To help those without systems knowledge explore the cc-webgraph, I decided to make a quick no-code webgraph exploration app. I only realized that this was feasible after trying out jshell and seeing that I could quickly query and extract subgraphs of interest, a task that used to take a lot more time and effort. Currently, the demo is limited to a maximum of 1 instance, but since it's running on Google Cloud Run, it has the potential to scale. I have kept it at 1 to limit the running costs - I have some GCP credits to run it for some time but I don't really know what to expect in terms of usage in the long run. Also, although the app can be quite slow when trying to add many domains to the visualization, this is mostly an issue with the rendering engine (dash-cytoscape), and not with cc-webgraph. Note, you can download the graph that you construct in the visualizer in some basic formats.
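The kind of subgraph query described above can be sketched as a bounded breadth-first search over outlinks. This is a minimal illustration, not the app's code: `extract_subgraph`, the `successors` callback, and the toy domain names are all hypothetical.

```python
from collections import deque

def extract_subgraph(seeds, successors, max_hops=1):
    """Return the edges reachable from seeds within max_hops outlink steps.

    successors is any callable mapping a node to an iterable of its
    outlinks (here a dict lookup; in practice it could wrap a graph query).
    """
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    edges = []
    while queue:
        node = queue.popleft()
        if depth[node] >= max_hops:
            continue  # do not expand past the hop limit
        for target in successors(node):
            edges.append((node, target))
            if target not in depth:
                depth[target] = depth[node] + 1
                queue.append(target)
    return edges

# Toy link data standing in for real webgraph lookups.
toy_links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["d.example"],
}
edges = extract_subgraph(["a.example"], lambda n: toy_links.get(n, []), max_hops=2)
for source, target in edges:
    print(f"{source} -> {target}")
```

The resulting edge list is the sort of thing that could then be handed to a network analysis library or written out as a simple edge-list file.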
pyccwebgraph package
I figured that, beyond the narrow scope of the demo, there are a bunch of other network science packages that could be useful for analyzing webgraphs. So I (retroactively) decided to extract the setup logic and a lightweight wrapper around cc-webgraph methods into a standalone package.
The pyccwebgraph package uses Py4J to bridge Python and Java, rather than alternatives like Jython (which is used by py-web-graph). Some problems with Jython include its reliance on Python 2 and its inability to use CPython libraries (NumPy, pandas, NetworkX, etc.). So py-web-graph won't work for the kind of network analysis we are interested in here.
Other than Py4J, I also considered using JPype. That would start the JVM inside the Python process rather than using separate processes as Py4J does. I was reluctant to use it though, as I believe it would mean that crashing the JVM would crash Python. Also, using separate processes may be simpler for instantiating multiple instances of the JVM for analyzing multiple webgraphs at the same time (for doing some dynamic network analysis, for example). Of course, there are all sorts of alternatives to the current setup of pyccwebgraph. I am aware that there is an effort to rewrite the webgraph library in Rust, so in the future, relying on the JVM in any way may be a poor choice.
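The crash-isolation argument for separate processes can be illustrated without a JVM. The sketch below uses a child Python process as a stand-in for the Java side, and simulates a crash with an abrupt exit. Both are assumptions made for illustration; `run_child` is a hypothetical helper, and Py4J itself talks to the JVM over a local socket rather than through exit codes.

```python
import subprocess
import sys

def run_child(code):
    """Run a snippet in a separate interpreter and return its exit code."""
    completed = subprocess.run([sys.executable, "-c", code])
    return completed.returncode

# The child terminates abruptly, skipping normal interpreter shutdown.
crashed = run_child("import os; os._exit(1)")
# A second, independent child is unaffected by the first one's failure.
healthy = run_child("print('child ok')")

# The parent is still running and can inspect or restart the children.
print("parent alive; exit codes:", crashed, healthy)
```

The same separation is what would allow several independent workers, one per webgraph, to run side by side without one failure taking down the others.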
What to do with all this?
I hope these tools will be useful to others. In going ahead and publishing the demo and package, the goal was just to demonstrate the usefulness. However, I do not wish to assume ownership over something that rightfully belongs elsewhere. So, with that in mind, if the demo or package are deemed useful, where is the best place to put them?
edit: added link to pyccwebgraph