Crawl a Site for 404s, or Other, Non-200 HTTP Responses
Why would you want to find 404s on a site?
A 404 HTTP response is the code a web server sends back when the thing you asked for is missing. Usually that means a particular web page that doesn't exist.
On your own site:
- check for typos in links
- check that external sites you link to are still up
- check for internal pages misbehaving
- check for missing site assets, e.g. images
On other sites:
- find candidates for broken link building SEO
How do you find 404s on a site?
This method should work on Mac and Linux terminal shells and uses wget.
Open a terminal and type the following:
$ wget --spider -r -p http://www.site-to-target.com 2>&1 | grep -B 2 ' 404 '
If anything on the site returns a 404, the output for each hit will look like:
--2016-03-22 07:59:01-- http://www.site-to-target.com/missing-page
Reusing existing connection to www.site-to-target.com:80.
HTTP request sent, awaiting response... 404 Not Found
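The `-B 2` in the grep above is what pulls the failing URL into view: it prints the two lines of context before each match, and wget logs the request URL two lines above the response line. A quick demonstration against some saved crawl output (the file path and log entries here are just illustrative):

```shell
# Hypothetical sample in the same format wget prints while crawling:
cat > /tmp/crawl-output.txt <<'EOF'
--2016-03-22 07:59:00--  http://www.site-to-target.com/ok-page
Reusing existing connection to www.site-to-target.com:80.
HTTP request sent, awaiting response... 200 OK
--2016-03-22 07:59:01--  http://www.site-to-target.com/missing-page
Reusing existing connection to www.site-to-target.com:80.
HTTP request sent, awaiting response... 404 Not Found
EOF

# Print each 404 line plus the two lines before it,
# which include the URL that failed:
grep -B 2 ' 404 ' /tmp/crawl-output.txt
```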
Instead of grepping for the 404s, you can use wget's output flag (`-o`) to write the log to a file that you can examine for whatever HTTP response you see fit:
$ wget --spider -r -o ~/site-responses.log -p http://www.site-to-target.com
# >>> saves the crawl responses to a file called site-responses.log in your home folder
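For example, to pull every non-200 response out of a saved log, you can keep only wget's "awaiting response" lines and drop the 200s. A minimal sketch, run here against a hand-written sample in the same format wget writes (the file path and entries are hypothetical):

```shell
# Hypothetical sample in the format wget writes to its log file:
cat > /tmp/site-responses.log <<'EOF'
--2016-03-22 07:59:01--  http://www.site-to-target.com/good-page
HTTP request sent, awaiting response... 200 OK
--2016-03-22 07:59:02--  http://www.site-to-target.com/missing-page
HTTP request sent, awaiting response... 404 Not Found
--2016-03-22 07:59:03--  http://www.site-to-target.com/moved-page
HTTP request sent, awaiting response... 301 Moved Permanently
EOF

# Keep only the response lines, then drop the 200s:
grep 'awaiting response' /tmp/site-responses.log | grep -v ' 200 '
```

This surfaces any redirect, server error, or missing page in one pass, not just the 404s.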
I found this little tip in a great article called A Technical Guide to SEO. Hat-tip to Mattias Geniar.