Path extraction using regular expressions on file/http response content. by thiezn · Pull Request #17 · assetnote/commonspeak2

thiezn · 2023-02-12T13:18:03Z

Hi,

This pull request adds the ability to run a regular expression against GitHub file content or httparchive HTTP response content. The result is a wordlist of unique paths to the files/URLs whose content matches the regular expression.
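As a rough illustration of the idea (not the code in this pull request, which runs the matching inside BigQuery), the filtering and de-duplication can be sketched in Python. The function name, the sample rows and the alphabetical sort are all assumptions made for the example.

```python
import re
from typing import Iterable, List, Tuple


def extract_matching_paths(records: Iterable[Tuple[str, str]], pattern: str) -> List[str]:
    """Return the unique paths whose content matches `pattern`, sorted alphabetically."""
    regex = re.compile(pattern)
    # A set drops duplicates, e.g. the same path appearing in several rows.
    paths = {path for path, content in records if regex.search(content)}
    return sorted(paths)


# Hypothetical (path, content) rows standing in for dataset records.
sample_records = [
    ("/js/app.js", "eval(userInput);"),
    ("/js/util.js", "console.log('ok');"),
    ("/js/app.js", "var x = eval(y);"),
    ("/legacy/run.js", "eval(code)"),
]

print(extract_matching_paths(sample_records, r"eval\("))
# ['/js/app.js', '/legacy/run.js']
```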

Note that this is a large amount of data to process, so it can get expensive quickly. To cut costs I've added a --sample-rate parameter that queries only a subset of the GitHub or httparchive datasets.

I was not able to test the code fully as I don't have read-only permission on the gs://commonspeak-udf/URI.min.js file. The queries work OK when I remove the path-parsing temporary function from the httparchive SQL query. It's also the first time I've touched Go code, so forgive me if I've done anything stupid here.

To be honest, the cost/benefit of these queries probably doesn't add up, so feel free to close this pull request. Since I spent some time writing the code, I thought I'd at least open a pull request in case it's interesting for you guys.

Example queries on github

An example use case is to extract all paths whose content contains so-called PHP superglobals ($_GET, $_POST, etc). These files take input from a web browser, so they are more likely to be susceptible to vulnerabilities. To extract these it makes more sense to use the GitHub source, as it contains the raw PHP files. HTTP responses only contain the processed output of .php files.

Example run with a sample rate of 0.01%. This processed around 213.48 GB on BigQuery:

./commonspeak2 --credentials credentials.json --project <GOOGLE_PROJECT> --verbose body-wordlist -r '\$_GET|\$_POST|\$_PUT|\$_FILES|\$_REQUEST|\$_SESSION|\$_COOKIE|\$_SERVER' --sample-rate 0.01 --output github_php_superglobals_top_1m_2023_02_12.txt --limit 1000000 --sources github
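The backslashes in the pattern matter: an unescaped `$` is an end-of-text anchor rather than a literal dollar sign. A small check with Python's `re` module shows the difference. BigQuery uses RE2 rather than Python's engine, but the two should agree for a pattern this simple. The sample file content is made up for the illustration.

```python
import re

# The pattern passed to -r in the command above.
SUPERGLOBALS = r"\$_GET|\$_POST|\$_PUT|\$_FILES|\$_REQUEST|\$_SESSION|\$_COOKIE|\$_SERVER"

# Made-up sample file content.
php_source = "<?php $id = $_GET['id']; ?>"

escaped = re.search(SUPERGLOBALS, php_source)
# Without the backslash, `$` anchors at the end of the text, so `$_GET` can never match.
unescaped = re.search(r"$_GET", php_source)

print(escaped.group(0))  # $_GET
print(unescaped)         # None
```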

An example run without any sampling processes 2.64 TB. At the moment BigQuery charges $5 per TB, so that adds up to roughly $13:

./commonspeak2 --credentials credentials.json --project <GOOGLE_PROJECT> --verbose body-wordlist -r '\$_GET|\$_POST|\$_PUT|\$_FILES|\$_REQUEST|\$_SESSION|\$_COOKIE|\$_SERVER' --output github_php_superglobals_top_1m_2023_02_12.txt --limit 1000000 --sources github
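For budgeting a run before launching it, the arithmetic is simply data processed times the per-TB price. A minimal sketch, assuming the $5 per TB figure quoted here still applies (the helper name is hypothetical):

```python
def on_demand_cost_usd(terabytes_processed: float, usd_per_tb: float = 5.0) -> float:
    """Estimated query cost: data processed multiplied by the per-TB price."""
    return round(terabytes_processed * usd_per_tb, 2)


# The unsampled github run above processed 2.64 TB.
print(on_demand_cost_usd(2.64))  # 13.2
```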

Another example would be to do something similar for JavaScript files using a regex like 'eval\(.*\)|\.setInterval\(|dangerouslySetInnerHTML|bypassSecurityTrustAs'. Or you could use a regex to match the top-level domains of known bug bounty targets.

Example queries on httparchive

Similar to the GitHub example, we can use the httparchive dataset in the same way. For example, you could create a wordlist of paths whose responses contain Java Spring Boot error messages. Spring Boot generates so-called Whitelabel Error Pages when no explicit error page has been defined.

Example run with a sample rate of 0.01%. This processed around 3.08 GB on BigQuery:

./commonspeak2 --credentials credentials.json --project <GOOGLE_PROJECT> --verbose body-wordlist -r '<div>There was an unexpected error \(type=.*\)\.' --sample-rate 0.01 --output httparchive_springboot_error_pages_top_1m_2023_02_12.txt --limit 1000000 --sources httparchive
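To see what this pattern selects, it can be tried against a fragment resembling a Whitelabel Error Page and against a custom error page. The escaped parentheses match literally and `.*` absorbs the error type and status. Both response bodies below are illustrative, not taken from the httparchive dataset.

```python
import re

# The pattern passed to -r in the command above.
WHITELABEL = r"<div>There was an unexpected error \(type=.*\)\."

# Illustrative response bodies.
whitelabel_body = (
    "<h1>Whitelabel Error Page</h1>"
    "<div>There was an unexpected error (type=Not Found, status=404).</div>"
)
custom_body = "<h1>Oops</h1><p>Page not found.</p>"

print(bool(re.search(WHITELABEL, whitelabel_body)))  # True
print(bool(re.search(WHITELABEL, custom_body)))      # False
```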

An example run without any sampling processes 31.97 TB. At the moment BigQuery charges $5 per TB, so that adds up to about $160:

./commonspeak2 --credentials credentials.json --project <GOOGLE_PROJECT> --verbose body-wordlist -r '<div>There was an unexpected error \(type=.*\)\.' --output httparchive_springboot_error_pages_top_1m_2023_02_12.txt --limit 1000000 --sources httparchive

Kind regards,
Thiezn

thiezn · 2023-03-06T15:14:28Z

@infosec-au I ran the PHP query on the GitHub dataset in BigQuery myself today, which gave me about 1.5 million results. If you are interested, I can share the wordlist itself so it can be included on assetnote.

m-mortimer added 2 commits February 12, 2023 14:07

"added body-wordlist option"

48180a6

"fix glide.yml"

edcdc21

thiezn wants to merge 2 commits into assetnote:master from thiezn:master
