spiders crawling multiple calendar URLs
Our server team has identified the following issue, and we wondered if you had any additional ideas to help block this kind of crawling. At present, crawl spiders are causing the site to become unresponsive at times due to the volume of requests:
“The calendar plugin currently in use, All-in-One Event Calendar, has been causing issues due to spiders crawling the site. The plugin exposes valid paths formatted like the following:
/events-2/action~oneday/exact_date~1443502800/cat_ids~123,287,124/tag_ids~286,138,365/request_format~html/
Every link on the calendar represents a permutation of that path, with changes to the action, exact_date, cat_ids, tag_ids, and request_format parameters. For example, the action parameter alone can switch between six views: oneday, week, month, agenda, posterboard, and stream. That means that if the spider lands on the above path, it will now also crawl:
/events-2/action~week/exact_date~1443502800/cat_ids~123,287,124/tag_ids~286,138,365/request_format~html/
/events-2/action~month/exact_date~1443502800/cat_ids~123,287,124/tag_ids~286,138,365/request_format~html/
/events-2/action~agenda/exact_date~1443502800/cat_ids~123,287,124/tag_ids~286,138,365/request_format~html/
/events-2/action~posterboard/exact_date~1443502800/cat_ids~123,287,124/tag_ids~286,138,365/request_format~html/
/events-2/action~stream/exact_date~1443502800/cat_ids~123,287,124/tag_ids~286,138,365/request_format~html/

The action permutation, though, is the least of our concerns. The exact_date attribute has one permutation for every single day, so if a spider lands on the month view, it automatically picks up roughly 30 more links to crawl. You then have to multiply these permutations by every combination of cat_ids and tag_ids. This quickly escalates until spiders have an essentially unlimited number of possible paths to crawl, which breaks caching completely.
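To make the scale concrete, here is a minimal sketch of the arithmetic (the counts are illustrative assumptions, not numbers from this site):

# Illustrative sketch of the crawl-space explosion described above.
# All counts are hypothetical; a real site has its own numbers.
actions = 6       # oneday, week, month, agenda, posterboard, stream
days = 365        # one exact_date permutation per day of the year
categories = 10   # assumed number of calendar categories
tags = 15         # assumed number of calendar tags

# Every non-empty subset of categories and tags yields a distinct URL.
cat_combos = 2 ** categories - 1
tag_combos = 2 ** tags - 1

total = actions * days * cat_combos * tag_combos
print(f"{total:,} distinct crawlable calendar URLs")  # roughly 73 billion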
All that said, there is one standard way of preventing spiders from crawling these paths. Adding a rule to your robots.txt file can tell spiders to ignore all paths under /events-2/. Unfortunately, bad bots will not respect robots.txt, and will index these paths anyway.
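For reference, the robots.txt rule described above is only two lines, assuming the calendar lives under /events-2/ as in the paths quoted:

# Ask compliant crawlers to skip every calendar permutation.
User-agent: *
Disallow: /events-2/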
Worse yet, bad bots tend to flood servers with traffic, and there is no reliable way to identify them automatically. The last couple of times this happened, we were able to identify specific attributes in the requests and block those requests outright. Unfortunately, a bot only has to alter that particular attribute and we have to re-identify it all over again.
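As an illustration of that kind of attribute-based block, an Apache .htaccess (mod_rewrite) rule keyed to a hypothetical User-Agent string might look like this; “BadBot” is a stand-in for whatever attribute actually shows up in the logs:

# Hypothetical sketch: refuse calendar requests whose User-Agent
# matches a pattern identified in the abusive traffic.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "BadBot" [NC]
RewriteRule ^events-2/ - [F,L]

As noted, the bot only has to change that attribute and the rule stops matching.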
We are working to block these attacks properly going forward. The main hurdle is that the requests all go to valid paths; we can’t block them completely or your calendar plugin will simply stop working. Blocking based on the number of requests is likewise dangerous, because legitimate users might flip through many of these paths to get to the view they want.”
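For completeness, the request-volume approach warned about in the quote would look something like the following sketch, using Apache’s mod_evasive module (assuming it is installed; the thresholds are illustrative). The caveat above applies: thresholds low enough to stop bots can also block legitimate visitors paging through calendar views.

# Hypothetical rate-limit sketch with mod_evasive: block a client
# that requests the same page more than 20 times per second.
<IfModule mod_evasive20.c>
    DOSPageCount      20
    DOSPageInterval   1
    DOSBlockingPeriod 60
</IfModule>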
https://www.ads-software.com/plugins/all-in-one-event-calendar/