• I’m using WordPress and not sure how to stop Googlebot from crawling non-existent URLs.
    My permalink structure is: “/%category%/%postname%.html”

    The URL that users see is: domain.com/category/postname.html
    But the URL that Googlebot crawls, and which returns a 404, is: domain.com/postname.html

    Does anyone know how to disallow Googlebot from crawling those URLs?
    I saw in Webmaster Tools that “Linked from” is unavailable; not sure what to do to find the cause of the problem?

Viewing 6 replies - 1 through 6 (of 6 total)
  • Create a robots.txt file to block bots.

    Webmaster Tools -> Site configuration -> Crawler access -> Generate robots.txt

    After blocking the URLs, remove them through the Remove URL tab if they’ve already been indexed.
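    As a minimal illustration of the format such a generated file takes (the path here is hypothetical, not from this thread), a robots.txt that blocks one URL for all crawlers looks like:

    ```text
    # Applies to every crawler that honors robots.txt
    User-agent: *
    # Block a single (illustrative) path from being crawled
    Disallow: /old-page.html
    ```

    The file must live at the site root (domain.com/robots.txt) for crawlers to find it.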

    Thread Starter onlinemarketingclub

    (@onlinemarketingclub)

    Hello Joseph,

    Thanks for your info.

    The problem is that there are very many URLs. They may have been generated by plugins such as a link checker or a sitemap; the 404s in this case were probably caused by uninstalling those plugins after Google had already crawled the URLs.

    I have about 3,000 pages crawled with 404 errors. Those pages don’t exist and I didn’t create them; not sure how to fix the problem in this case?

    Thanks everyone.

    The URLs of those 3,000 pages must follow some pattern, right? robots.txt supports the * wildcard, so you don’t have to type in every URL. For example, the following blocks all URLs ending in .html in the root directory only:

    Disallow: /*.html
    Allow: /*/*.html

    The removal tool, on the other hand, only accepts specific URLs, but I think (not 100% sure on this) Google will automatically remove them after some time once they’re blocked.
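    To see why that Disallow/Allow pair blocks domain.com/postname.html but not domain.com/category/postname.html, here is a rough Python sketch of Google-style rule matching (wildcard patterns, longest rule wins, Allow beats Disallow on a tie). This is an illustration of the matching logic only, not Googlebot’s actual implementation, and the rule lists default to the two directives from this thread:

    ```python
    import re

    def robots_pattern_matches(pattern, path):
        """Check whether a robots.txt path pattern (with '*' wildcards
        and an optional '$' end anchor) matches a URL path."""
        anchored = pattern.endswith("$")
        if anchored:
            pattern = pattern[:-1]
        # Translate the robots pattern into a regex: '*' -> '.*',
        # everything else matched literally.
        regex = "^" + "".join(".*" if ch == "*" else re.escape(ch)
                              for ch in pattern)
        if anchored:
            regex += "$"
        return re.match(regex, path) is not None

    def is_allowed(path, disallow=("/*.html",), allow=("/*/*.html",)):
        """Longest matching rule wins; on a tie, Allow wins.
        With no matching rule, crawling is allowed by default."""
        best_len, best_allowed = -1, True
        rules = [(r, False) for r in disallow] + [(r, True) for r in allow]
        for rule, allowed in rules:
            if robots_pattern_matches(rule, path):
                if len(rule) > best_len or (len(rule) == best_len and allowed):
                    best_len, best_allowed = len(rule), allowed
        return best_allowed
    ```

    With these rules, `is_allowed("/postname.html")` is False (only the Disallow matches) while `is_allowed("/category/postname.html")` is True, because the longer Allow rule overrides the Disallow.
    
    
    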

    Thread Starter onlinemarketingclub

    (@onlinemarketingclub)

    Hello Joseph,

    Thank you ever so much for your kind help.

    This morning I found more domain.com/postname.html URLs returning 404.

    Now let me try those directives to block them:

    Disallow: /*.html
    Allow: /*/*.html

    Yes, those 3,000 pages with 404s may have been caused by one of the link-related plugins, like Google XML Sitemaps or a link checker, since my permalink structure (or WP config) may not be compatible with them. Whatever the cause, I now have only two new plugins that I actually need: W3 Total Cache and Disable RSS.

    Today I found many URLs in Google Webmaster Tools marked “restricted by robots.txt”
    (after I added Disallow: /feed). Not sure what to do next?

    Could I have your MSN or Skype?

    Thanks a million

    Thread Starter onlinemarketingclub

    (@onlinemarketingclub)

    Not sure if W3 Total Cache might have caused the domain.com/postname.html URLs via its cache?

    If the URLs restricted by robots.txt are the ones you want to block, then it’s working as it should. You don’t have to do anything else, unless you don’t want to wait for Google to update its database automatically and would rather request URL removal manually.

    Not sure if W3 Total Cache might have caused the domain.com/postname.html URLs via its cache?

    I have no idea what’s causing Google to crawl those URLs.

    I think it’d be best for you to continue asking questions on this forum, so others can join in, rather than contacting me privately, because I’m not a professional and my knowledge in these areas is limited. I’m just another WP user with some interest in programming.

  • The topic ‘I'm using WordPress; not sure how to disable Google to crawl non-existent URLs ?’ is closed to new replies.