Case-insensitive URL permutations all served as 200 instead of a canonical redirect
-
My setup
- WordPress 6.1.1 (current)
- Redirection 5.3.6 (current)
- Apache Web Server
- Default .htaccess in WordPress root directory
- WordPress > Tools > Redirection > Options > Default URL settings > Case insensitive matches: tried both ON and OFF.
Regardless of whether it is ON or OFF, the following happens:
- /test-page/ is the real slug
- /Test-page/ gets served as 200 with the same content
- /TEst-page/ gets served as 200 with the same content
- /TESt-page/ gets served as 200 with the same content
Maybe WordPress core now handles this itself and bypasses your plugin? Or does the web server do something?
I appreciate that the correct URL in the wrong capitalization does not lead to a 404 error and hence user frustration, regardless of whether WP core or the Redirection plugin ensures this. But instead of a 200 it should be a redirect (HTTP 3xx class).
If the slug has capitalization on purpose, e.g. /Test-Page/ then requesting /TEST-Page/ should redirect to /Test-Page/ to guarantee canonical links.
Thanks for any help.
-
WordPress sets the canonical URL for you. This isn’t something you need to worry about, and trying to ‘fix’ it with redirects will likely cause issues.
I know that WordPress itself creates the canonical URL for posts/pages in all lowercase. But if someone does not copy/paste the URL, but:
1) types it in somewhere manually or copy/pastes and in the course of action autocorrect kicks in (can sometimes happen) and the result is a mixed case URL e.g. /Test-Page/ then…
2) or I myself in printed or outdoor advertising material use a different capitalization — for better readability e.g. /TEST-page/ or stylistic reasons e.g. a font which only has uppercase, e.g. /TEST-PAGE/ — then…
→ … I want a redirect to happen to the canonical URL (which is in all lowercase) instead of /Test-Page/ or /TEST-PAGE/ being served as normal (200).
3) A minor subsequent concern: If the different link versions become popular on their own on the Internet, and my website serves all versions as valid, then to web spiders all those pages exist. That could maybe be counted as duplicate content, which could possibly reduce page rank. I rather doubt that, and believe spiders are intelligent enough to realize this is just a glitch, but maybe that carelessness still costs a few decimal points in the ranking score.
Reason 2 is under my personal control (or that of the people I delegate to). But reason 1 is not under my control. And I’d like to not lose this traffic if possible.
You are not going to lose traffic. The canonical URL defines the correct URL and they are all treated the same – it is not duplicate content.
This seems very much like a theoretical problem rather than an actual one. It’s best to let WordPress do what it currently does.
If you are advertising a site then you should use the correct URL – people understand URLs and don’t need them to be changed. If you are worried then you can use a short URL that redirects to the correct one.
If you still want particular URLs redirected then you can add specific redirects for those.
You could try and create a redirect that matches any URL with upper case characters and redirect them to the same as lowercase, but it’s not something I would recommend on a global level as you will likely cause other problems.
This page may help: https://redirection.me/support/redirect-regular-expressions/
Let me rephrase my question very concretely:
Why is a request for /TEST-PAGE/ served as /TEST-PAGE/ (HTTP 200) rather than redirected to /test-page/ (HTTP 301)?
I understood your original message and answered it here https://www.ads-software.com/support/topic/case-insensitive-url-permutations-all-served-as-200-instead-canonical-redirect/#post-16372669
Thank you for taking the time to answer each of my concerns!
- Nevertheless, your answer “The canonical URL defines the correct URL and they are all treated the same – it is not duplicate content.” only calmed a concern without really addressing it, that is, without giving the reason WHY or any background info.
- To me, “same content can be reached via different URLs” seems like the definition of “duplicate content”. Please explain why that is NOT supposed to be duplicate content. Is there a way behind the scenes which tells search engines this is only a URL alias and not duplicate content?
- I looked at the HTTP response headers for the different URL variants: /test-page/ ; /Test-Page/ ; /TEST-page/ ; etc.
- The response headers are indeed almost identical. Only “Date:” and “Content-Length:” differ. “Date:” differs because I requested them at different times (one after another), and because of the different “Date:” stamp the responses with “Content-Encoding: gzip” compress slightly differently, hence “Content-Length:” differs by a few bytes between the responses.
What seems to identify the URL variants with different capitalization as being the same (= not duplicate content) is possibly the “Link:” header, which looks like this:
Link: <https://mydomain.com/wp-json/>; rel="https://api.w.org/", <https://mydomain.com/wp-json/wp/v2/pages/2>; rel="alternate"; type="application/json", <https://mydomain.com/?p=2>; rel=shortlink
They all have the numeric page ID 2 (https://mydomain.com/?p=2).
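To compare the variants without reading the header by eye, the rel=shortlink entry can be pulled out of the “Link:” value with a few lines of Python. This is only a rough sketch: the function names are made up for illustration, and the parsing assumes simple parameters like the ones above (no commas or semicolons inside quoted values).

```python
import re

def parse_link_header(value):
    """Return a list of (url, params) pairs from an HTTP Link header value."""
    links = []
    # Each entry looks like: <url>; key=value; key="value"
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)', value):
        url, raw_params = match.group(1), match.group(2)
        params = {}
        for part in raw_params.split(";"):
            if "=" in part:
                key, _, val = part.strip().partition("=")
                params[key.strip().lower()] = val.strip().strip('"')
        links.append((url, params))
    return links

def shortlink(value):
    """Return the URL whose rel is exactly 'shortlink', or None if absent."""
    for url, params in parse_link_header(value):
        if params.get("rel") == "shortlink":
            return url
    return None

# The header value quoted above, split over several lines for readability.
header = (
    '<https://mydomain.com/wp-json/>; rel="https://api.w.org/", '
    '<https://mydomain.com/wp-json/wp/v2/pages/2>; rel="alternate"; type="application/json", '
    '<https://mydomain.com/?p=2>; rel=shortlink'
)
print(shortlink(header))
```

If every capitalization variant yields the same shortlink, they all resolve to the same page ID. That alone does not say how search engines treat the variants.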
mdn web docs on the Link HTTP header :
- It’s still experimental.
- To my understanding it mentions nothing which says that the Link header can be used as a mechanism to identify the same content under different alias URLs.
Could you please elaborate?
rel=shortlink in the HTTP “Link:” header is what the beautiful URL in its different capitalizations has in common.
Is this the standard mechanism of websites to signal to web spiders “this is the same unique identifier behind the beautiful URL(s)” ?
Is there a way behind the scenes which tells search engines this is only a URL alias and not duplicate content
Yes, because WordPress sets the canonical URL. It’s the same reason why adding query parameters to a URL doesn’t affect anything. This is something it does by default and you don’t need a plugin to do it for you.
Thanks for stating what WordPress Core does and that no plugin is needed to achieve canonical URLs!
Your linked article explains canonical URLs well. I realized that my understanding of “duplicate content” and “canonical URLs” was correct but incomplete.
What Are Canonical Tags?
Canonical tags are bits of HTML that tell Google and other search engines which page is the canonical version.
https://www.semrush.com/blog/canonical-url-guide/#what-are-canonical-tags
I had known the term “Canonical URL” for years but always interpreted it only as “that’s the beautiful URL instead of the numerical one, or the new URL to which an outdated URL is redirected”. That is true but incomplete. I did not know there is an actual HTML element that deals with this. WordPress indeed inserts a bunch of link elements with rel attributes into the HTML head, including “rel=canonical”; see the sample:
<!DOCTYPE html>
<html lang="en-US"><head>
<meta name="generator" content="WordPress 6.1.1">
...
<link rel="canonical" >
<link rel="shortlink" >
<link rel="alternate" type="application/json+oembed" >
<link rel="alternate" type="text/xml+oembed" >
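For anyone who wants to check this without opening DevTools, here is a minimal Python sketch that reads the rel=canonical link out of a page's HTML using only the standard library. The class and function names are hypothetical, and the href values in the sample are invented placeholders (example.com), not taken from my site.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, so split before testing.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")

def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

# Placeholder markup; the URLs are assumptions for illustration only.
sample = """<!DOCTYPE html>
<html lang="en-US"><head>
<link rel="canonical" href="https://example.com/test-page/">
<link rel="shortlink" href="https://example.com/?p=2">
</head><body></body></html>"""
print(find_canonical(sample))
```

Feeding it the HTML fetched from /test-page/ and from /TEST-PAGE/ should show whether both declare the same canonical URL.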
The article later mentions:
You can also set canonical URLs:
- In your HTTP header. Add the “rel=canonical” HTTP header response.
- In your sitemap. The sitemap suggests to Google that all listed pages are canonical, but Google will understand which ones are duplicates.
- By using 301 redirects. Google understands the redirected page as a canonical.
I had wrongly believed that 301 redirects are necessary to avoid duplicate content links circulating on the web. Now it is clear to me that they are only one of several possible methods.
From a UX perspective I’d say the redirect is the best method, because it is noticeable and observable by humans on the surface. If you see that /TEST-PAGE/ in your browser’s address bar gets redirected to e.g. /Test-Page/ then you as a user realize: Aha, the URL I got redirected to is obviously the canonical URL, which I should save in my bookmarks / share via social media / send by email.
The other methods are all “below the surface” and need extra tools / DevTools. No wonder they are less known to subject matter laymen like me.
No concerns anymore regarding SEO and people not reaching content.
Last question: As I pointed out, the 301 redirect is the strongest signal for a canonical URL because it is noticeable to users. If I still insisted on this behavior for the best possible UX, how/where in WordPress could I make a request for a non-canonical URL perform a 301 redirect to the canonical URL?
I don’t especially agree with that, and a redirect is mostly invisible to users. They generally do not notice, understand, or care that a URL is redirected, and may not even be using a browser, app, or device where it is visible.
If you still want to persist then the best thing is to redirect specific URLs that are causing problems. You could use a regular expression (https://redirection.me/support/redirect-regular-expressions/) to match everything with uppercase characters, but it is possible you may accidentally break things.
Thanks for your reply!
So if I want an enforced redirect to the canonical URL (which is always all lowercase, right?) then the easiest solution is simply a catch-all RegEx
(.*[A-Z]+.*)
meaning “requested URL part after the domain part contains at least one uppercase letter” and redirect it to the substituted RegEx
\L$1
meaning “the matched string transformed to lowercase” (see demo).
- Every request gets through the plugin → Happens anyhow. Just one more RegEx rule to evaluate per request. No significant performance impact I suppose.
- Only those with at least one uppercase letter get transformed. This is a bit more performance intensive. → And that will happen only very rarely for those few malformed URLs (coming mostly from human error).
If I do not want this globally I could limit the matching RegEx to only a certain section/prefix e.g.
/blog/(.*[A-Z]+.*)
or certain keywords.
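To sanity-check the two patterns outside WordPress, here is a small Python sketch. It is not how the Redirection plugin works internally: it assumes the pattern must match the whole path and that matching is case-sensitive, and the helper name lowercase_target is made up. Python's re module has no \L replacement escape, so the captured group is lowercased in code instead.

```python
import re

# Catch-all pattern from above: the path contains at least one uppercase letter.
UPPERCASE_ANYWHERE = re.compile(r"(.*[A-Z]+.*)")
# Variant limited to one section of the site.
UPPERCASE_IN_BLOG = re.compile(r"/blog/(.*[A-Z]+.*)")

def lowercase_target(path, pattern=UPPERCASE_ANYWHERE):
    """Return the lowercase redirect target, or None if no redirect is needed."""
    match = pattern.fullmatch(path)  # assumption: the whole path must match
    if match is None:
        return None
    # Lowercase only the captured group; keep any fixed prefix untouched.
    start, end = match.span(1)
    return path[:start] + match.group(1).lower() + path[end:]

for path in ["/TEST-PAGE/", "/Test-page/", "/test-page/"]:
    print(path, "->", lowercase_target(path))

for path in ["/blog/My-Post/", "/about/Team/"]:
    print(path, "->", lowercase_target(path, UPPERCASE_IN_BLOG))
```

Paths that are already lowercase return None, so the redirect target never matches the rule again and no redirect loop can occur under these assumptions.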
SOME MORE BACKGROUND INFO ON MY SETUP:
My media library is served from a different subdomain where I perform no rewriting. My main domain directory only contains the WordPress source code directory. The enforced RegEx redirection with the Redirection plugin only happens on that main domain.
SOME MORE LEARNING:
In the default WordPress .htaccess config these two lines:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
ensure that the web server serves any real file or directory as-is, whatever its capitalization, and passes only the remaining requests on to WordPress’s index.php and its routing. So the enforced redirection should only happen for post/page URLs.
So in theory any real file, which may have uppercase letters in its name, should be served correctly.
Nevertheless, some plugins may keep something only in the database and generate a fake file purely through WordPress’s beautiful-URL routing. That fake file may have uppercase letters in its name, and then my redirection would prevent it from being served.
I also noted that requests by hackers & script kiddies for nonexistent files end up in the 404 log in the vast majority of cases. But in a few cases the request for a nonexistent file happened to contain an uppercase letter, and those got redirected by my RegEx. This is correct behavior: the web server found no real file, the request went to WordPress routing, and the Redirection plugin matched the uppercase letter and redirected.
So as a minimal error prevention (for possible legit requests to fake files generated from the DB, e.g. potentially dynamic CSS files) I extended my RegEx to only check for uppercase letters if the request does NOT start with
/wp-
which applies to 99.99% of all WordPress source files AFAIK. This is done with a so-called negative lookahead. So the updated search pattern is
^/(?!wp-)(.*[A-Z]+.*)
and the replacement pattern is still
/[lower]$1[/lower]
Works as it should.
But I can only repeat John’s recommendation to only set up an enforced redirection to lowercase if you know what you are doing. Many possible errors may result from it.
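The effect of the negative lookahead can be checked with a short Python sketch. It only imitates the rule and is not the plugin's code: the function name redirect_target is hypothetical, matching is assumed to be case-sensitive, and the [lower]...[/lower] replacement is imitated with str.lower().

```python
import re

# Search pattern from above: skip anything starting with /wp-, then require
# at least one uppercase letter somewhere in the rest of the path.
SKIP_WP_PREFIX = re.compile(r"^/(?!wp-)(.*[A-Z]+.*)")

def redirect_target(path):
    """Return the lowercase redirect target, or None if the rule does not apply."""
    match = SKIP_WP_PREFIX.match(path)
    if match is None:
        return None
    # The leading slash sits outside the capture group, so add it back,
    # mirroring the replacement pattern /[lower]$1[/lower].
    return "/" + match.group(1).lower()

for path in ["/TEST-page/", "/test-page/", "/wp-content/Dynamic.css", "/WP-Admin/"]:
    print(path, "->", redirect_target(path))
```

Note that the lookahead itself is case-sensitive in this sketch: /wp-content/Dynamic.css is left alone, but /WP-Admin/ does not start with lowercase /wp- and would therefore still be redirected to /wp-admin/.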