Scraping (JS/Source code)

Source Code Recon

Modern web applications rely on JavaScript files to provide dynamic content, and these files contain various functions & event handlers. Almost every website ships JS files, and they are a great resource for finding internal subdomains used by the organization.

Tools: πŸ› 

Gospider is a fast web spidering tool capable of crawling an entire website in a short amount of time. Gospider will visit and scrape each and every URL mentioned in the JS files and source code, and since source code & JS files make up a website, they may contain links to other subdomains too.
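To get an intuition of what gospider automates for us, here's a minimal manual sketch: pull a single JS file and grep it for subdomain-looking strings (the domain and JS path below are purely illustrative):

# Download one JS file and extract anything that looks like a subdomain of the target
curl -s "https://www.example.com/static/app.js" | grep -Eo "[a-zA-Z0-9_.-]+\.example\.com" | sort -u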

Installation:

go get -u github.com/jaeles-project/gospider
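Note: on newer Go versions (1.18+), go get can no longer be used to install binaries. If the command above fails, the go install form should work instead:

go install github.com/jaeles-project/gospider@latest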

This is a long process, so brace yourself !!! πŸ’ͺ

Running:

This process is divided into 3⃣ steps:

1) Web probing subdomains

  • Since we are crawling websites, gospider expects us to provide URLs, i.e. in the form http:// or https://

  • So first, we need to web probe all the subdomains we have gathered till now. For this purpose, we will use httpx.

  • So, let's first web probe the subdomains:

cat subdomains.txt | httpx -random-agent -retries 2 -no-color -o probed_tmp_scrap.txt

  • Now that we have web-probed URLs, we can send them to gospider for crawling.

gospider -S probed_tmp_scrap.txt --js -t 50 -d 3 --sitemap --robots -w -r > gospider.txt

Caution: This generates huge traffic on your target

Flags:

  • S - Input file

  • js - Find links in JavaScript files

  • t - Number of threads (Run sites in parallel) (default 1)

  • d - depth (a depth of 3 means links are also scraped from second-level JS files)

  • sitemap - Try to crawl sitemap.xml

  • robots - Try to crawl robots.txt

  • w - Include subdomains crawled from 3rd-party sources

  • r - Also include the 3rd-party source URLs themselves (they still get crawled and requested)
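If the traffic warning above is a concern, gospider also exposes throttling options (check gospider -h for the exact flags on your version). A gentler run might look something like this; the numbers are just a starting point:

# Fewer threads, lower depth, and a 1-second delay between requests to reduce load on the target
gospider -S probed_tmp_scrap.txt --js -t 10 -c 5 -d 2 -k 1 --sitemap --robots -w -r > gospider.txt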

2) Cleaning the output

The path portion of a URL shouldn't be more than 2048 characters. Since gospider can output some extremely long URLs, we first remove any lines longer than that:

sed -i '/^.\{2048\}./d' gospider.txt
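The sed expression simply deletes any line longer than 2048 characters. An equivalent, arguably more readable filter writes to a new file instead of editing in place (gospider_clean.txt is my own name; feed it to the next step if you use this variant):

# Keep only lines that are 2048 characters or shorter
awk 'length($0) <= 2048' gospider.txt > gospider_clean.txt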

The point to note here is that so far we have collected URLs from JS files & source code, but we are only concerned with subdomains. Hence, we just need to extract the subdomains from the gospider output.

This can be done using tomnomnom's unfurl tool. It takes a list of URLs as input and extracts the domain/subdomain part from them. You can install unfurl using this command: go get -u github.com/tomnomnom/unfurl (or go install github.com/tomnomnom/unfurl@latest on newer Go versions).

cat gospider.txt | grep -Eo 'https?://[^ ]+' | sed 's/]$//' | unfurl -u domains | grep "\.example\.com$" | sort -u > scrap_subs.txt

Breakdown of the command:

  • grep - Extract the links that start with http/https

  • sed - Remove the trailing " ] " at the end of the line

  • unfurl - Extract the domain/subdomain from the URLs

  • grep - Only select subdomains of our target

  • sort - Avoid duplicates and write the result to scrap_subs.txt
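If you want to see what unfurl actually does before running the full pipeline, a quick one-liner makes it clear (the URL is made up):

echo "https://sub.dev.example.com/login?next=/home" | unfurl domains
# prints: sub.dev.example.com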

3) Resolving our target subdomains

  • Now that we have all the subdomains of our target, it's time to DNS resolve and check for valid subdomains.

(Hoping you have seen the previous techniques and know how to run puredns.)

puredns resolve scrap_subs.txt -w scrap_subs_resolved.txt -r resolvers.txt 
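In case you don't already have a resolvers.txt from the previous techniques, one commonly used public list can be fetched like this (assuming the trickest/resolvers repo layout hasn't changed):

# Grab a maintained list of public DNS resolvers for puredns
wget -q https://raw.githubusercontent.com/trickest/resolvers/main/resolvers.txt -O resolvers.txt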

I love this technique, as it also finds hidden Amazon S3 buckets used by the organization. If such buckets are open and expose sensitive data, then it's a WIN-WIN situation for us. Also, the output of this step can be sent to the SecretFinder tool, which can find hidden secrets, exposed API tokens, etc.
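As a rough sketch of both ideas, here's one way to pull S3 bucket URLs out of the gospider output and feed a single JS file to SecretFinder (the grep pattern and the JS URL are assumptions to adapt; SecretFinder refers to the m4ll0k/SecretFinder Python script):

# Extract anything that looks like an S3 bucket URL from the crawl output
grep -Eo 'https?://[a-z0-9.-]+\.s3[a-z0-9.-]*\.amazonaws\.com[^ "]*' gospider.txt | sort -u

# Run SecretFinder against one JS file found during crawling (URL is illustrative)
python3 SecretFinder.py -i "https://www.example.com/static/app.js" -o cli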
