• Resolved miketemby

    (@miketemby)


    Hello,

    I have recently set up the Crawler on one of my sites to improve cache freshness, but I have three issues.

    1:
    I have set up the crawler to run using a cPanel cron job and turned off WP-Cron. The cPanel cron job hits wp-cron.php at 00:10:00 and 00:40:00, so it crawls twice, 30 minutes apart, between 12am and 1am.
    I have another cPanel cron task set up that runs a custom script which just fires \LiteSpeed\Purge::purge_all();. This happens at 00:01:00.
    The purpose of this is to purge the whole cache, then crawl the site to re-create it.
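
    For reference, the purge script itself is tiny. A minimal sketch (the file name and path are illustrative, not the exact files on my server):

    <?php
    // purge-all.php - run by a cPanel cron task at 00:01 local time.
    // Bootstraps WordPress, then calls the LiteSpeed Cache purge API.
    require( './wp-load.php' );
    \LiteSpeed\Purge::purge_all();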

    The problem I am having is that the first crawler pass after a Purge All always just crawls one page and stops with an end_reset response. This happens whether I run it manually or via the cron job. The second crawl works fine and crawls all 14 pages.
    Why does a purge all prevent the crawler from crawling ALL pages?

    2:
    I have tried using the built-in Scheduled Purge, but at first it appeared not to work, i.e. the cache was never purged. However, while looking through the crawler logs this morning, I noted the following in the log: Cache_control TTL is limited to 2567 due to scheduled purge rule…. It has now occurred to me that the Scheduled Purge doesn't actually purge the cache, but simply sets the TTL so that the next time a page is crawled after that time, its cache has expired and is refreshed. Is this correct? If so, the documentation should really be updated to reflect this, as it is not in any way clear.

    3:
    Scheduled Purge appears to use unlocalised server time (UTC) rather than the local timezone. This causes problems when combining it with cPanel cron tasks, which do use local time. It is not reasonable to expect users to update the Scheduled Purge time twice per year when Daylight Saving begins and ends. It should use local time.

  • Thread Starter miketemby

    (@miketemby)

    The above should say stopped_reset, not end_reset.

    Plugin Support qtwrk

    (@qtwrk)

    Hi,

    It is designed so that when a Purge All happens, it stops the crawler, because the pages already crawled are no longer cached and need to be re-cached from the start.

    When you set a Scheduled Purge, you need to purge the existing cache first; the next cache generation will then be marked to expire at the "scheduled" time.

    In your case, I would suggest setting the cache TTL to 23.5 hours instead of using Scheduled Purge; it should be easier.

    Ideally, the crawler starts at 00:10 and, assuming it can finish all pages before 00:40, the pages will then stay cached for the next 23.5 hours.

    On the second day, before 00:10, the cache has already expired and the crawler is ready to re-crawl the pages.
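
    In other words (illustrative arithmetic only, not plugin code; the 23.5-hour value would go into the plugin's public cache TTL setting):

    <?php
    // 23.5 hours expressed in seconds, for use as a cache TTL.
    echo 23.5 * 3600;   // 84600
    // Day 1: the crawler caches a page at ~00:10; it expires 84600 s later, at ~23:40.
    // Day 2: by the 00:10 crawl the cache has already expired, so the
    // crawler re-crawls and re-caches every page.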

    The timestamp is date( 'Y-m-d H:i:s', time() + LITESPEED_TIME_OFFSET ), with define( 'LITESPEED_TIME_OFFSET', get_option( 'gmt_offset' ) * 60 * 60 ); so it is retrieved from the WordPress timezone option.

    best regards,

    Thread Starter miketemby

    (@miketemby)

    The timestamp is date( 'Y-m-d H:i:s', time() + LITESPEED_TIME_OFFSET ), with define( 'LITESPEED_TIME_OFFSET', get_option( 'gmt_offset' ) * 60 * 60 ); so it is retrieved from the WordPress timezone option.

    Clearly this is not the case. I have the Melbourne time zone set in WP.
    The local time is currently 8:56am. Scheduled Purge shows the current “server time” as 9:56pm, i.e. 11 hours different, which is equal to my timezone offset.
    For me to set the scheduled purge to occur at 12:01am, I have to put 1:01pm in the field…
    Last night, the Scheduled Purge field contained the value 2:23pm and this is what the Crawler Log shows:
    03/25/22 00:40:39.009 [103.42.111.114:33856 1 zLD] [Ctrl] X Cache_control TTL is limited to 2541 due to scheduled purge rule
    So at 12:40am it wanted to put a 42-minute TTL on the cached page so that it expired at 1:23am… not at 2:23pm, which is when it should expire if it applied the logic you mentioned above.
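
    Working the numbers (plain arithmetic, just to be explicit):

    <?php
    // 2541 seconds added to the crawl time from the log line above.
    echo date( 'H:i:s', strtotime( '00:40:39' ) + 2541 );   // 01:23:00
    // i.e. the cache was set to expire at 1:23am, not at the 2:23pm that
    // was entered in the Scheduled Purge field.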

    In fact, now that I'm looking at it, it seems like it's adding my timezone offset (+11) to the time that I put in the field, because 2:23pm AEDT = 3:23am UTC, but it has set the TTL to expire at 1:23am… so it's wrong in two ways…

    It is designed so that when a Purge All happens, it stops the crawler, because the pages already crawled are no longer cached and need to be re-cached from the start.

    Wait… what? So the Crawler, which exists for the sole purpose of crawling the site so that it IS cached, decides that if there is no cache, it won't cache, because there's no cache? Can you please explain this in more detail? Why would you not want the crawler to crawl and cache pages that ARE NOT cached? What am I missing here? That's its entire point, is it not?

    Plugin Support qtwrk

    (@qtwrk)

    If you create a PHP page with this code:

    <?php
    require( './wp-load.php' );
    // WordPress timezone offset in hours (e.g. 11 for Melbourne during DST)
    echo get_option( 'gmt_offset' );
    echo '<br>';
    // current server time (UTC on this setup)
    echo date( 'Y-m-d H:i:s' );
    echo '<br>';
    // server time shifted by the WordPress offset, i.e. local site time
    echo date( 'Y-m-d H:i:s', time() + get_option( 'gmt_offset' ) * 60 * 60 );

    Access it in a browser; what does it show?

    For the crawler with purge: let's say you have 100 pages. The crawler starts from page 1, then page 2, page 3, page 4, etc. Let's imagine that at page 50, you or something triggered a Purge All.

    Before the purge, pages 1–50 are already cached. When the Purge All happens at page 50, pages 1–50 are no longer cached, so the crawler stops, resets, and waits to start from page 1 again.
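
    Roughly this logic, as pseudo-code (an illustration of the behaviour described above, not the actual plugin source):

    <?php
    // Illustration only. A Purge All during a run invalidates everything
    // crawled so far, so the crawler stops and resets its position.
    foreach ( $urls as $position => $url ) {
        if ( purge_all_happened_since_run_started() ) {  // hypothetical check
            $saved_position = 0;                         // next run starts from page 1
            $ended_reason   = 'stopped_reset';
            break;                                       // stop the current run
        }
        crawl_and_cache( $url );                         // hypothetical helper
    }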

    Thread Starter miketemby

    (@miketemby)

    Adding to the above: following last night's crawler run at 12:40am, during which pages were given a TTL of 42 minutes, I now find that my pages are obviously not cached because they expired 7 hours ago. But when I manually run the crawler, it again just hits one page and returns the same message as if a Purge All had been run.

    Last interval: 44s ago
    
    Ended reason: stopped_reset
    
    Last crawled: 1 item(s)

    I then need to run it again for it to crawl the 14 mapped pages.

    This makes no sense; please explain the logic behind this. It seems quite simple to me: a crawler should find pages that are not cached and crawl them to rebuild the cache. But LSCache seems to do the opposite. It finds uncached pages and decides it should stop because they are not cached…

    Thread Starter miketemby

    (@miketemby)

    For the crawler with purge: let's say you have 100 pages. The crawler starts from page 1, then page 2, page 3, page 4, etc. Let's imagine that at page 50, you or something triggered a Purge All.

    Before the purge, pages 1–50 are already cached. When the Purge All happens at page 50, pages 1–50 are no longer cached, so the crawler stops, resets, and waits to start from page 1 again.

    NO, that's not what's happening. Purge All is run before the crawler starts. When the crawler starts AFTER that, it doesn't work; it just crawls 1 page, stops, and outputs:

    Last interval: 44s ago
    
    Ended reason: stopped_reset
    
    Last crawled: 1 item(s)

    As above, the same thing happens if TTL has expired.

    Thread Starter miketemby

    (@miketemby)

    The PHP file outputs the following:

    11
    2022-03-24 22:33:34
    2022-03-25 09:33:34

    See it for yourself here:
    https://gippsafe.com.au/time.php

    Plugin Support qtwrk

    (@qtwrk)

    Okay, I just found the bug: the scheduled time display didn't add the offset.

    I will ask our devs to add it.

    Meanwhile, please provide the report number. You can get it in Toolbox -> Report -> click "Send to LiteSpeed".

    Thread Starter miketemby

    (@miketemby)

    please provide the report number

    DHYKFNUM

    Can you also please respond to my comments about the crawler not running after a purge or TTL expiry?

    Plugin Support qtwrk

    (@qtwrk)

    I need the report to check the crawler settings before responding to that.

    Try setting the crawler interval to 61, then try running the crawler after a purge or TTL expiration; you may need to click "Manual Run" twice.

    Thread Starter miketemby

    (@miketemby)

    You may need to click "Manual Run" twice.

    Why twice?
    If it is expired, it should crawl the pages; that's the point I am trying to get you to explain. It doesn't make sense that it stops the first time and then requires a second run to work.

        crawler = true
        crawler-usleep = 500
        crawler-run_duration = 400
        crawler-run_interval = 600
        crawler-crawl_interval = 1200
        crawler-threads = 3
        crawler-timeout = 30
        crawler-load_limit = 1
        crawler-sitemap = https://gippsafe.com.au/sitemap_index.xml
        crawler-drop_domain = true
        crawler-map_timeout = 120
        crawler-roles = array (
        )
        crawler-cookies = array (
        )

    Plugin Support qtwrk

    (@qtwrk)

    Sometimes the first attempt won't work due to residual data from the last run; clicking it again should trigger it.

    The crawler itself does NOT care about the cache status; the only thing it cares about during a crawl is a purge call.

    The crawler will loop through the URL list regardless of whether the cache is expired or still exists. For expired pages, it crawls them and marks them blue (cache was a miss but is now cached); for pages with existing caches, it crawls them and marks them green (cache hit); and if a page returns an x-litespeed-cache-control: no-cache header, it marks it as blacklisted and bypasses it next time.

    Thread Starter miketemby

    (@miketemby)

    Sometimes the first attempt won't work due to residual data from the last run; clicking it again should trigger it.

    This is a pretty vague response. What residual data specifically? Are you saying there is a bug that causes this that needs to be addressed?

    • If the cache is purged (before it runs), it should crawl it.
    • If the cache is expired, it should crawl it.

    The fact that it doesn’t consistently do that suggests there is a bug. Again, the sole purpose of a crawler is to crawl and therefore cache pages.

    if a page returns an x-litespeed-cache-control: no-cache header, it marks it as blacklisted and bypasses it next time.

    This is highly problematic. I'll give you a specific example: the Divi theme dynamically sets a no-cache header if the Divi CSS cache has been cleaned out, because the first page load is used to calculate critical CSS, which is then loaded on the second page load. So Divi sets a no-cache header on page load one but does not do so on subsequent page loads.
    The crawler's behaviour of blocklisting pages as soon as it hits a page with a no-cache header, and never crawling them again, means it will never cache those pages.
    At a bare minimum, this function should be configurable. I do not want it to blocklist pages when it hits a no-cache header.
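
    For what it's worth, this is the kind of quick check I use to see what that header returns on a cold load (an illustrative snippet along the lines of the time.php example above, not anything official):

    <?php
    require( './wp-load.php' );
    // Request the front page and print the header the crawler reacts to.
    // On a cold Divi load this can be "no-cache"; on the next load it usually is not.
    $response = wp_remote_get( home_url( '/' ), array( 'timeout' => 30 ) );
    echo wp_remote_retrieve_header( $response, 'x-litespeed-cache-control' );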

    Plugin Support qtwrk

    (@qtwrk)

    Hi,

    Yes, the crawler has some minor bugs that need to be fixed, but currently we don't have enough manpower to cover them. Since the main function works, it's a low-priority task.

    If the cache is purged (before it runs), it should crawl it.
    If the cache is expired, it should crawl it.

    The crawler does this. As long as no purge call happens in the middle of crawling, it will continue to loop through all the pages in the sitemap, regardless of whether they were cached or expired/purged before the crawler started.

    The Divi example is a good point; I will forward it to our devs as a feature suggestion.

    Best regards,

    Thread Starter miketemby

    (@miketemby)

    The Divi example is a good point; I will forward it to our devs as a feature suggestion.

    That’s great – thank you.

    The crawler does this. As long as no purge call happens in the middle of crawling, it will continue to loop through all the pages in the sitemap, regardless of whether they were cached or expired/purged before the crawler started.

    As I've described above a couple of times, this is not my experience. Each time I purge the cache (well before running the crawler) and then run the crawler, it hits one page and stops with the error mentioned. In no scenario have I run the purge during the crawl; I am describing the behaviour when the purge runs before the crawl.

    
    Last interval: 44s ago
    
    Ended reason: stopped_reset
    
    Last crawled: 1 item(s)
    
  • The topic ‘Crawler and Scheduled Purge’ is closed to new replies.