• How does a search spider crawl a WordPress blog? Could I disallow everything in my WordPress directory and still have the spider find the different articles, since they are called from the database, or do I need part of the WordPress directory to be accessible to a site spider?

  • A spider won’t really know whether a page is served from a database or from a static file. If you disallow access to the directory where your blog lives, the spiders will not index it.

    A spider responds to the robots.txt file in your site’s root directory (generally the good ones will; some will ignore it). Also, the spider isn’t going to know whether a page is retrieved from a database; just think of it as a user that stores the content it sees.
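    For example, a minimal robots.txt sketch (the /wordpress/ directory name here is only an assumption based on the question) that keeps well-behaved spiders out of the whole blog would be:

    # assumed layout: blog installed in /wordpress/
    User-agent: *
    Disallow: /wordpress/

    With that file in the site root, compliant spiders skip every URL under /wordpress/, database-driven or not.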

    Thread Starter jdcfsu

    (@jdcfsu)

    Ok, but then what should I disallow in the robots.txt file? Surely I don’t want wp-admin and parts of wp-content indexed by search engines. So what should be allowed and what should not?

    jdcfsu, spidering is done via links, so content inside a directory is only as accessible as the links you provide. Just because a directory exists doesn’t mean it will be spidered. Spiders are also subject to the same rules a user would be; an area requiring authentication (wp-admin, for instance) is not any more accessible to a spider.

    Restricting spiders is very simple though; this is the structure you will need to follow:

    User-agent: Googlebot <-- this can also be * to cover all spiders
    Allow: /
    Disallow: /some-dir/
    Disallow: /some-other-dir/

    A good tutorial on using robots.txt is here:

    https://www.searchengineworld.com/robots/robots_tutorial.htm
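    As a hedged example tailored to a typical WordPress install (exact paths depend on where WordPress lives; these are the default root-level ones):

    User-agent: *
    Allow: /
    # standard WordPress directories, assumed at the site root
    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: /wp-content/plugins/

    If WordPress sits in a subdirectory, prefix each path accordingly (e.g. /wordpress/wp-admin/).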

    Thread Starter jdcfsu

    (@jdcfsu)

    Right, I know how to disallow directories, though I am unclear on what I should disallow. My site directory is laid out as follows:
    /index.php
    /wordpress/

    My question is: if I disallow the wordpress directory, will the spider still see the content pages, given that they are generated via PHP when someone visits the main page?

    Short answer to your question: No.

    The spider doesn’t give a rat’s dump about how the pages are generated; it just follows the links. If you exclude the directory, the links won’t be followed.
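    To make that concrete: Disallow rules are simple prefix matches against the URL, so one rule covers static-looking and PHP-generated addresses alike. A sketch, again assuming the /wordpress/ directory from above:

    User-agent: *
    # prefix match: blocks /wordpress/index.php, /wordpress/?p=123, etc.
    Disallow: /wordpress/

    Any link pointing into that space simply won’t be followed by a compliant spider.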

    Thread Starter jdcfsu

    (@jdcfsu)

    Ok, then that brings me to part two of my question: what should be allowed so that the spiders can see it? Is it the cache folder inside the wp-content directory? I’m trying to figure out where this stuff is kept.

    Well, you don’t care where it is “kept”. What you care about is where the URLs point. The spider is essentially like any user in front of a web browser; it goes where the links point.

    If you want your blog’s content to be spidered, simply leave it out of the robots.txt file and, most likely, it’ll get crawled.

    Your admin stuff won’t get crawled unless the spider magically has your username and password.

    Ok, say your blog and wordpress installation are in the same directory: /blog

    So the robot needs to access /blog, but it does not need to access the wp-admin area, /blog/wp-admin/, and so on and so forth.

    The only thing that matters is the actual URIs used to access the content. So just disallow whatever specific directories you wish to exclude.

    As was previously stated, a spider can try to index it all, but it won’t get into the admin area unless it knows your username and password.
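    Putting that layout into robots.txt, a minimal sketch (assuming the /blog example above; wp-includes is added here as a common extra) would be:

    User-agent: *
    # keep spiders out of the admin and core-code areas only
    Disallow: /blog/wp-admin/
    Disallow: /blog/wp-includes/

    Everything else under /blog/ stays crawlable, which is normally what you want for the posts themselves.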

  • The topic ‘Search Spiders Robots.txt’ is closed to new replies.