
Path extraction using regular expressions on file/HTTP response content. #17

Open
thiezn wants to merge 2 commits into assetnote:master from thiezn:master

thiezn commented Feb 12, 2023

Hi,

This pull request adds the ability to run a regular expression against GitHub file contents or httparchive HTTP response bodies. The result is a wordlist of the unique paths of the files/URLs whose content matches the regular expression.

Note that this is a large amount of data to process, so it can get expensive quickly. To cut costs I've added a --sample-rate parameter that queries only a subset of the GitHub or httparchive datasets.

I was not able to test the code fully because I don't have read permission on the gs://commonspeak-udf/URI.min.js file. The queries work when I remove the temporary path-parsing function from the httparchive SQL query. It's also the first time I've touched Go code, so forgive me if I've done anything stupid here.

To be honest, the cost/benefit of these queries probably doesn't add up, so feel free to close this pull request. Since I spent some time writing the code, I thought I'd at least open a pull request in case it's interesting for you guys.

Example queries on github

An example use case is to extract the paths of all files that use so-called PHP superglobals ($_GET, $_POST, etc.). These files take input from a web browser, so they are more likely to be susceptible to vulnerabilities. For this it makes more sense to use the GitHub source, as it contains the raw PHP files; HTTP responses only contain the output of the processed .php files.

Example run with a sample rate of 0.01%. This processed around 213.48 GB on BigQuery:

./commonspeak2 --credentials credentials.json --project <GOOGLE_PROJECT> --verbose body-wordlist -r '\$_GET|\$_POST|\$_PUT|\$_FILES|\$_REQUEST|\$_SESSION|\$_COOKIE|\$_SERVER' --sample-rate 0.01 --output github_php_superglobals_top_1m_2023_02_12.txt --limit 1000000 --sources github

An example run without sampling processes 2.64 TB. At the current price of $5 per TB, that comes to roughly $13.

./commonspeak2 --credentials credentials.json --project <GOOGLE_PROJECT> --verbose body-wordlist -r '\$_GET|\$_POST|\$_PUT|\$_FILES|\$_REQUEST|\$_SESSION|\$_COOKIE|\$_SERVER' --output github_php_superglobals_top_1m_2023_02_12.txt --limit 1000000 --sources github
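As a rough local illustration of what the commands above ask BigQuery to do, here is a minimal Python sketch. It is not the code from this pull request: the sample rows, the `matching_paths` helper and the frequency-based ordering are assumptions made for the example, and Python's `re` module stands in for BigQuery's RE2 engine (the two agree on this simple alternation).

```python
import re

# Same alternation as in the commands above; each backslash escapes a literal "$".
SUPERGLOBALS = re.compile(
    r"\$_GET|\$_POST|\$_PUT|\$_FILES|\$_REQUEST|\$_SESSION|\$_COOKIE|\$_SERVER"
)


def matching_paths(files, pattern):
    """Return the unique paths whose content matches, most frequent first."""
    counts = {}
    for path, content in files:
        if pattern.search(content):
            counts[path] = counts.get(path, 0) + 1
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts, key=lambda p: (-counts[p], p))


# Hypothetical (path, content) rows standing in for the GitHub dataset.
SAMPLE_FILES = [
    ("admin/login.php", "<?php $user = $_POST['user']; ?>"),
    ("index.php", "<?php echo 'hello'; ?>"),
    ("admin/login.php", "<?php session_start(); $id = $_SESSION['id']; ?>"),
    ("search.php", "<?php $q = $_GET['q']; ?>"),
]

if __name__ == "__main__":
    for path in matching_paths(SAMPLE_FILES, SUPERGLOBALS):
        print(path)
```

In this sketch `index.php` is dropped because it never touches a superglobal, and `admin/login.php` appears once even though two copies matched.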

Another example would be to do something similar for JavaScript files, using a regex like 'eval\(.*\)|\.setInterval\(|dangerouslySetInnerHTML|bypassSecurityTrustAs'. Or you could use a regex that matches the top-level domains of known bug bounty targets.

Example queries on httparchive

Similar to the GitHub example, we can use the httparchive dataset in the same way. For example, you could create a wordlist of paths whose responses contain Java Spring Boot error messages. Spring Boot generates so-called Whitelabel Error Pages when no explicit error page has been defined.
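The idea can be sketched locally in Python as follows. This is only an illustration under stated assumptions, not the query from this pull request: the sample responses and the `matching_url_paths` helper are hypothetical, and `urllib.parse.urlsplit` stands in for the URI.min.js path-parsing function that the httparchive query uses.

```python
import re
from urllib.parse import urlsplit

# Regex from this description, matching the Whitelabel Error Page body.
WHITELABEL = re.compile(r"<div>There was an unexpected error \(type=.*\)\.")


def matching_url_paths(responses, pattern):
    """Return the sorted unique URL paths whose response body matches."""
    paths = set()
    for url, body in responses:
        if pattern.search(body):
            path = urlsplit(url).path  # drops scheme, host, query and fragment
            if path:
                paths.add(path)
    return sorted(paths)


# Hypothetical (url, body) rows standing in for the httparchive dataset.
SAMPLE_RESPONSES = [
    (
        "https://example.com/api/v1/users?id=1",
        "<h1>Whitelabel Error Page</h1>"
        "<div>There was an unexpected error (type=Not Found, status=404).</div>",
    ),
    ("https://example.org/", "<html>ok</html>"),
    (
        "https://example.net/actuator/health",
        "<div>There was an unexpected error "
        "(type=Internal Server Error, status=500).</div>",
    ),
]

if __name__ == "__main__":
    for path in matching_url_paths(SAMPLE_RESPONSES, WHITELABEL):
        print(path)
```

Only the path component is kept, so the query string `?id=1` does not end up in the wordlist.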

Example run with a sample rate of 0.01%. This processed around 3.08 GB on BigQuery:

./commonspeak2 --credentials credentials.json --project <GOOGLE_PROJECT> --verbose body-wordlist -r '<div>There was an unexpected error \(type=.*\)\.' --sample-rate 0.01 --output httparchive_springboot_error_pages_top_1m_2023_02_12.txt --limit 1000000 --sources httparchive

An example run without sampling processes 31.97 TB. At the current price of $5 per TB, that comes to about $160.

./commonspeak2 --credentials credentials.json --project <GOOGLE_PROJECT> --verbose body-wordlist -r '<div>There was an unexpected error \(type=.*\)\.' --output httparchive_springboot_error_pages_top_1m_2023_02_12.txt --limit 1000000 --sources httparchive
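For reference, the cost figures quoted in this description follow from multiplying the data processed by the per-TB price. The short Python sketch below reproduces that arithmetic; the `query_cost_usd` helper is hypothetical, and converting the sampled runs with 1 TB = 1000 GB is an assumption rather than BigQuery's exact billing rule.

```python
PRICE_PER_TB_USD = 5.0  # on-demand price quoted in this description


def query_cost_usd(terabytes, price_per_tb=PRICE_PER_TB_USD):
    """Estimate the cost in USD of a query that processes `terabytes` of data."""
    return round(terabytes * price_per_tb, 2)


# Data volumes reported in this description.
FULL_RUNS_TB = {"github": 2.64, "httparchive": 31.97}
SAMPLED_RUNS_GB = {"github": 213.48, "httparchive": 3.08}

if __name__ == "__main__":
    for source, full_tb in FULL_RUNS_TB.items():
        # Assumption: 1 TB = 1000 GB for the sampled runs.
        sampled_tb = SAMPLED_RUNS_GB[source] / 1000
        print(
            f"{source}: full ${query_cost_usd(full_tb):.2f}, "
            f"sampled ${query_cost_usd(sampled_tb):.2f}"
        )
```

With the quoted volumes, the full runs work out to about $13.20 for GitHub and $159.85 for httparchive.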

Kind regards,
Thiezn

thiezn (Author) commented Mar 6, 2023

@infosec-au I ran the PHP query on the GitHub dataset myself today, which gave me about 1.5 million results. If you are interested, I can share the wordlist so it can be included on Assetnote.
