This is a case in which validating one specific feature uncovered a different, far more impactful behavior on the website, one that may need further attention and management.
I was testing the following website: https://www.nimber.pt/
My initial goal was to test the website’s sitemap. Here is the path that I was considering:
https://www.nimber.pt/sitemap.xml
Initial check
On the first try, I landed on a page displaying 404 Not Found content. At that point I assumed that either the sitemap was set under a different path or the website had no sitemap at all. To check whether any sitemap was set, I used the SEOmator tool.
The tool reported that a sitemap page does exist under the path I had tried, and that the path returns status 200.
Test with Google Rich Results Test
I ran the same path through the Rich Results Test as well.
https://search.google.com/test/rich-results/result?id=U_UTJijkQ4XgiRlh3fVbpg
The results look like this:
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Date: Fri, 31 Jan 2025 18:30:18 GMT
Content-Type: text/html
Last-Modified: Fri, 22 Nov 2024 17:48:16 GMT
Transfer-Encoding: chunked
Connection: keep-alive
ETag: W/"6740c3e0-272"
Strict-Transport-Security: max-age=31536000
Content-Encoding: gzip
These response headers confirmed that the actual page status is 200. In other words, even though the page content says that the page was not found, a search engine still considers it to be an existing page.
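The same status check can be scripted. Below is a minimal Python sketch using only the standard library; the function name `fetch_status` and the User-Agent string are illustrative assumptions, not part of any tool mentioned in this article. Note that `urllib` follows redirects, so the value returned is the status of the final response.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for `url`."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "status-check-sketch"},  # hypothetical UA string
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses; the code is still the server's answer.
        return error.code
```

Calling `fetch_status("https://www.nimber.pt/sitemap.xml")` is how the sitemap path from this article could be checked; what it returns depends on how the site behaves at the time of the request.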
Check the page status in the Network results
As the next step, I checked the network results for the sitemap.xml page. They confirmed that its status is 200, although the content on the page says that the page was not found (or that it doesn’t exist).
This was already a clear sign that the potential issue was the so-called Soft 404.
What is a Soft 404?
404 is an HTTP response code that means ‘Not Found’. When the server is unable to find the resource requested by the client, it returns HTTP response code 404.
A soft 404 occurs when the server answers a request for a missing or broken resource with a successful response instead of a 404: it either serves a customized “not found” page with status 200 or redirects the user to another location. The underlying cause may be a genuine 404, or the requested page may exist but not work correctly, so rather than showing a partly working page, the site sends the user to a custom page without providing the resource asked for. In Google’s own words: “This is like a giraffe wearing a name tag that says ‘dog’. Just because the name tag says it’s a dog doesn’t mean it’s a dog. Similarly, just because a page says 404, doesn’t mean it’s returning a 404 status code.”
Basically, a soft 404 is a page that looks like a 404 but returns HTTP status code 200. The website shows a “Page Not Found”-style page, but search engines such as Google receive a 200 and may treat it as an actual live page.
HTTP status code 200 means the web server successfully processed the request and transmitted content to your browser. A soft 404 therefore happens when a non-existent page (for example, one that has been deleted or removed) displays a ‘page not found’ message to anyone trying to access it but fails to return an HTTP 404 status code. The response itself loads successfully, but the content that was actually requested is not in it.
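The distinction above can be expressed as a small heuristic. This is only a sketch: the function name `classify_response`, the labels it returns, and the phrases in `NOT_FOUND_HINT` are assumptions made for illustration, and a phrase match can misfire on a legitimate page that merely mentions "404".

```python
import re

# Hypothetical phrases that suggest an error page; tune them to the site under test.
NOT_FOUND_HINT = re.compile(r"\b404\b|not found", re.IGNORECASE)


def classify_response(status: int, body: str) -> str:
    """Label a response using its status code and visible content."""
    says_missing = bool(NOT_FOUND_HINT.search(body))
    if status == 404:
        return "real 404"   # status code and content agree
    if status == 200 and says_missing:
        return "soft 404"   # content says missing, status says success
    if 200 <= status < 300:
        return "ok"
    return "other"
```

For example, a 200 response whose body contains "404 Not Found" is labeled "soft 404", while the same body delivered with status 404 is labeled "real 404".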
Let’s check if there are clear indicators that we have encountered a Soft 404
A clear indicator of soft 404 behavior is what happens when we request an obviously non-existent page, that is, a dummy URL that would normally return a 404 Not Found response. On an affected website, every such request returns content displaying 404 Not Found, while the page’s status is 200 OK.
Here is an example of a random URL I’ve used for testing purposes: https://www.nimber.pt/randomurlfortestingpurposes
The results are just as expected. The content on the page says 404 Not Found, while the actual response status was 200.
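The dummy-URL test can also be automated. The following is a minimal sketch, assuming the standard library only; `probe_soft_404`, the `/soft-404-probe-` path prefix, and the `NOT_FOUND_HINT` phrases are hypothetical names chosen for this example. It requests a random path that almost certainly does not exist and reports whether the site answers with status 200 and "not found" content.

```python
import re
import urllib.error
import urllib.request
import uuid
from urllib.parse import urljoin

# Hypothetical phrases that suggest an error page.
NOT_FOUND_HINT = re.compile(r"\b404\b|not found", re.IGNORECASE)


def fetch(url: str, timeout: float = 10.0):
    """Return (status, body text) for a URL, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        return error.code, error.read().decode("utf-8", errors="replace")


def probe_soft_404(base_url: str, timeout: float = 10.0) -> bool:
    """True if a random, non-existent path returns 200 with 'not found' content."""
    random_path = f"/soft-404-probe-{uuid.uuid4().hex}"  # made-up path
    status, body = fetch(urljoin(base_url, random_path), timeout)
    return status == 200 and bool(NOT_FOUND_HINT.search(body))
```

A call such as `probe_soft_404("https://www.nimber.pt/")` would repeat the manual test above with a fresh random URL each time. A `True` result is a strong hint rather than proof, since the phrase match is only a heuristic.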
What issues may Soft 404 cause
Besides the risk of lower rankings in search engine results pages (SERPs), soft 404 errors on your website can lead to various other issues.
- For instance, when Googlebot receives a 200 response whose content says the page is missing, it may conclude that your site is sending misleading signals about which pages exist. This misinterpretation could result in penalties from Google.
- A significant concern is the negative impact on user experience. Since soft 404 URLs remain visible in search results, users may inadvertently be directed to pages that do not exist.
- If a user clicks on a link that leads to a soft 404 error, they may conclude that the page is unavailable and subsequently leave your site. This behavior can adversely affect your site’s bounce rate and reduce the time users spend on your platform.
- Moreover, there are potential repercussions for the overall performance of the website. Although error pages are smaller than content-rich pages, serving them still uses bandwidth. As long as search engines keep sending users and crawlers to non-existent pages, the site keeps spending resources on requests that deliver nothing, which can hinder website speed and overall performance.
Alternative approaches – Sloppy design
Still, some experts, such as David from StackOverflow, say that this behavior is not necessarily a critical error. It can be viewed as a design choice: not ideal, but not a failure on the server’s part. It is sloppy design, but it does not break the functionality. So we may conclude that nothing is broken. The system is working as designed, even though the design might not be ideal.
My suggestion
If I could suggest a fix, the right approach would be:
- Return a 404 status code instead of 200 when a page is not found.
- Ensure that the website serves correct error codes along with the appropriate error page content.
This ensures that crawlers, users, and SEO tools can understand the correct status of the page (e.g., missing content), and it will help avoid misleading search engine crawlers or tools into thinking the page is still active.
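To illustrate the principle, here is a minimal sketch of a server that sends the error page together with a real 404 status. It uses Python's built-in `http.server` purely for demonstration; the `PAGES` table, the `resolve` helper, and the page contents are hypothetical, and the website's actual stack (nginx, according to the response headers above) would be configured in its own way.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit

# Hypothetical route table standing in for the site's real pages.
PAGES = {
    "/": ("text/html; charset=utf-8", b"<h1>Home</h1>"),
    "/sitemap.xml": (
        "application/xml",
        b'<?xml version="1.0" encoding="UTF-8"?>'
        b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>',
    ),
}
NOT_FOUND_PAGE = ("text/html; charset=utf-8", b"<h1>404 Not Found</h1>")


def resolve(path: str) -> tuple[int, str, bytes]:
    """Return (status, content type, body); unknown paths get a real 404."""
    if path in PAGES:
        content_type, body = PAGES[path]
        return 200, content_type, body
    content_type, body = NOT_FOUND_PAGE
    return 404, content_type, body


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Ignore any query string when looking up the page.
        status, content_type, body = resolve(urlsplit(self.path).path)
        self.send_response(status)  # 404 for unknown paths, never 200
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the demo server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

The key design choice is that the status code and the page content are decided in one place (`resolve`), so the visible "404 Not Found" message can never be sent with a 200 status.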