Scraping (JS/Source code)

Source Code Recon

JavaScript files are used by modern web applications to provide dynamic content and contain various functions & events. Almost every website ships JS files, and they are a great resource for finding the internal subdomains used by the organization.
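For example, a quick way to see this in practice is to fetch a JS file and grep it for hostnames (a rough sketch; the URL and the example.com scope are just placeholders):
curl -s https://app.example.com/static/main.js | grep -Eo '[a-zA-Z0-9_.-]+\.example\.com' | sort -u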

Tools: 🛠️

1) Gospider

  • Author: Jaeles
  • Language: Go
Gospider is a fast web spidering tool capable of crawling a whole website in a short amount of time. It visits and scrapes every URL referenced in the site's source code and JS files, and since source code & JS files make up the website, they may contain links to other subdomains too.

Installation:

go get -u github.com/jaeles-project/gospider
This is a long process, so brace yourself!!! 💪
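Note: on newer Go toolchains (1.17+), go get no longer installs binaries, so you would likely use go install instead:
go install github.com/jaeles-project/gospider@latest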

Running:

This process is divided into 3️⃣ steps:

1) Web probing subdomains

  • Since we are crawling websites, gospider expects us to provide URLs, i.e. in the form http:// or https://
  • So first, we need to web probe all the subdomains we have gathered till now. For this purpose, we will use httpx.
  • Let's web probe the subdomains:
cat subdomains.txt | httpx -random-agent -retries 2 -no-color -o probed_tmp_scrap.txt
  • Now that we have web-probed URLs, we can send them to gospider for crawling.
gospider -S probed_tmp_scrap.txt --js -t 50 -d 3 --sitemap --robots -w -r > gospider.txt
Caution: This generates huge traffic on your target
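If you want to go a little easier on the target, the same command can simply be dialed down, e.g. fewer threads and a shallower depth (the values below are just an illustration):
gospider -S probed_tmp_scrap.txt --js -t 10 -d 2 --sitemap --robots -w -r > gospider.txt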

Flags:

  • S - Input file containing the list of sites to crawl
  • js - Find links in JavaScript files
  • t - Number of threads (run sites in parallel) (default 1)
  • d - Crawl depth (depth 3 means links are also scraped from second-level JS files)
  • sitemap - Try to crawl sitemap.xml
  • robots - Try to crawl robots.txt
  • w - Include subdomains crawled from third-party sources
  • r - Also include (and crawl) URLs found from those other sources

2) Cleaning the output

The path portion of a URL shouldn't have more than 2,048 characters, so we first remove any overly long junk lines from the gospider output:
sed -i '/^.\{2048\}./d' gospider.txt
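Note that sed -i edits gospider.txt in place. If you would rather keep the original file untouched, an equivalent filter (a sketch; the output file name is just an example) would be:
awk 'length($0) <= 2048' gospider.txt > gospider_trimmed.txt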
The point to note here is that so far we have URLs gathered from JS files & source code, but we are only concerned with subdomains. Hence, we just need to extract the subdomains from the gospider output.
This can be done using Tomnomnom's unfurl tool. It takes a list of URLs as input and extracts the domain/subdomain part from them. You can install unfurl using this command: go get -u github.com/tomnomnom/unfurl
cat gospider.txt | grep -Eo 'https?://[^ ]+' | sed 's/]$//' | unfurl -u domains | grep ".example.com" | sort -u > scrap_subs.txt
Break down of the command:
  • grep - Extract the links that start with http/https
  • sed - Remove the trailing " ] " at the end of lines
  • unfurl - Extract the domain/subdomain from the URLs
  • grep - Only select subdomains of our target
  • sort - Avoid duplicates
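As a quick illustration of what unfurl does here (the URL is just a made-up example):
echo 'https://api.dev.example.com/v1/users?id=1' | unfurl -u domains
# prints: api.dev.example.com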

3) Resolving our target subdomains

  • Now that we have all the subdomains of our target, it's time to DNS resolve them and check which ones are valid.
(hoping you have seen the previous techniques and know how to run puredns)
puredns resolve scrap_subs.txt -w scrap_subs_resolved.txt -r resolvers.txt
I love this technique, as it also finds hidden Amazon S3 buckets used by the organization. If such buckets are open and expose sensitive data, then it's a win-win situation for us. Also, the output of this can be sent to the SecretFinder tool, which can find hidden secrets, exposed API tokens, etc.
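As a rough sketch of those two follow-ups (the bucket pattern and the JS URL below are placeholders/assumptions, and SecretFinder is assumed to be set up as per its repo):
# Pull any S3 bucket URLs that made it into the gospider output
grep -Eo 'https?://[a-zA-Z0-9.-]+\.s3\.amazonaws\.com[^ ]*' gospider.txt | sort -u
# Collect JS file URLs and scan one with SecretFinder for exposed keys/tokens
grep -Eo 'https?://[^ ]+\.js' gospider.txt | sort -u > js_files.txt
python3 SecretFinder.py -i https://app.example.com/static/main.js -o cli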