Site scraper
-
Hi,
I made a topic here https://www.ads-software.com/support/topic/how-to-block-site-scraper/
I am still having an issue with a site scraper. I don’t know how to find the IP address of the scraper. This is the website, footballhighlightsvideo[dot]com, which is scraping my content.
I have installed CIDRAM, but that website still posts the articles I post.
Thanks
-
Hi @suprim,
I’ve released a new version just today, which means new signatures, new things blocked, etc. That possibly may help, but of course, nothing can be guaranteed without us being able to identify the specific scrapers in the first place.
You may be able to identify the specific scrapers responsible for scraping your pages by setting up some traps for them, but this would involve a reasonable amount of work outside of CIDRAM itself, plus some active monitoring and so on.
Could I ask, are they copying the page exactly, word for word, or are they copying just sections of the page, mixing it all up and so on?
If they’re copying it exactly, word for word, then one possible way of figuring out who’s doing the scraping is to set up an intentionally unprotected page somewhere with an arbitrary or faked article, and to sneak a sentence or two into the article, worded slightly differently for different request origins (i.e., permutations based on request origin). When they eventually find and scrape this arbitrary or faked article, their copy can be checked for the specific permutation of the sentences snuck in earlier, and that permutation can then be traced back to a specific address, based on however the permutations were generated in the first place.
I’ve done things like that before, and it’s not too difficult, but it’s definitely a bit more work than simply setting up a plugin and letting it do its thing, so not generally the kind of thing that everyone wants to be doing.
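The permutation trap described above can be sketched roughly as follows. This is only a minimal illustration in Python, not part of CIDRAM: the `TrapPage` class, the `SLOTS` sentences, and the in-memory lookup tables are all hypothetical, and a real trap would need to persist its assignments somewhere and serve the text from an actual page.

```python
import itertools

# Hypothetical sentence "slots": each slot has several interchangeable
# wordings, and no wording is a substring of another.
SLOTS = [
    ["The match kicked off at noon.", "Kick-off was at noon.", "The game began at midday."],
    ["The crowd was loud.", "Fans were noisy throughout.", "Supporters made plenty of noise."],
    ["It ended in a draw.", "Neither side won.", "The points were shared."],
]


class TrapPage:
    """Serves a uniquely worded copy of the bait text to each request origin."""

    def __init__(self, slots):
        self.slots = slots
        # Every possible choice of one wording per slot (3 * 3 * 3 = 27 here).
        self.combos = list(itertools.product(*(range(len(s)) for s in slots)))
        self.by_ip = {}     # ip -> combo assigned to it
        self.by_combo = {}  # combo -> ip it was assigned to

    def render(self, ip):
        """Return the bait text for this IP, assigning a new combo if needed."""
        if ip not in self.by_ip:
            if len(self.by_ip) >= len(self.combos):
                raise RuntimeError("out of unique permutations; add slots or wordings")
            combo = self.combos[len(self.by_ip)]
            self.by_ip[ip] = combo
            self.by_combo[combo] = ip
        combo = self.by_ip[ip]
        return " ".join(slot[i] for slot, i in zip(self.slots, combo))

    def identify(self, scraped_text):
        """Recover the IP whose copy appears in scraped_text, or None."""
        combo = []
        for slot in self.slots:
            hits = [i for i, wording in enumerate(slot) if wording in scraped_text]
            if len(hits) != 1:
                return None  # wording missing or ambiguous
            combo.append(hits[0])
        return self.by_combo.get(tuple(combo))


page = TrapPage(SLOTS)
copy_a = page.render("203.0.113.7")   # documentation-range example addresses
copy_b = page.render("198.51.100.4")
print(page.identify("<p>" + copy_b + "</p>"))
```

This only works against verbatim copying: if the scraper rewrites or shuffles the text, `identify` returns `None`, which is why it matters whether the copies are word for word.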
Let me know if you want or need any help in taking this further, and we could explore some different ideas to try to figure out what would work best for your situation. In short, though, it sounds like some work will be needed in order to properly identify the scrapers.
- This reply was modified 5 years ago by Maikuolan.
Hi, the new update seems to have fixed the issue, and the website is no longer scraping the content. But I am having another issue with CPU usage. Does CIDRAM affect CPU usage?
Thank you
I’m glad to hear that problem has been fixed.
And, apologies again for the delayed reply; I never received any notification about your reply from WordPress, and just saw it by chance tonight when checking the support page.
> But I am having another issue with CPU usage. Does CIDRAM affect CPU usage?
I’ve never experienced adverse CPU effects due to CIDRAM before, and I’ve tried to keep its memory requirements, footprint and so on to a minimum throughout its development, so generally, it shouldn’t, but of course, YMMV.
Best bet would be to go through all your plugins (including CIDRAM), themes, customisations and so on, disabling them manually, one at a time, and testing your website directly each time to confirm whether or not it plays nicely with your CPU while said plugin/theme/etc is disabled. Continue onward until the problem no longer persists, at which point you’ll have a stronger idea about which plugins/themes/etc are the likely culprits for any CPU usage spikes.
If you manage to get through the whole lot, disabling everything, and yet the CPU usage spikes persist, it could be a sign of something else, maybe something wrong with the server, something else on the account, something compromised, etc (at which point, it would be wise to take the problem up with your hosting provider, so that they can investigate it more deeply).
If, after going through everything, CIDRAM ends up on the list of likely culprits, let me know and I’ll take a closer look at it from my end.
Hi Maikuolan,
Yeah, the CPU usage was not from CIDRAM, but there is another issue. I noticed a sudden traffic drop on my website after enabling CIDRAM. At first I thought it was normal, but then I received an email from a user saying he is blocked from the site, and I realised the traffic drop was because my viewers are getting blocked from the site.
Kind Regards,
I’m glad to hear that CIDRAM wasn’t the culprit for the CPU usage.
> but then I received an email from a user saying he is blocked from the site, and I realised the traffic drop was because my viewers are getting blocked from the site.
In these cases, if it’s possible at all to ask the user what they see listed as the “Why Blocked” reason when they’re blocked, we can use that information to instruct CIDRAM not to block requests from their particular IP or network, e.g., by using the ignore.dat file, by writing bypasses, by auxiliary rules, etc (whichever means is preferable and most convenient). Or, if it’s something that shouldn’t be blocked for anyone’s CIDRAM installations anywhere, I can delist their IP or network from the CIDRAM signature files, or do whatever else might be best as a means to unblock them from CIDRAM generally.
Not super easy to resolve the problem without knowing that information though, unfortunately.
If contacting them to ask isn’t possible, or isn’t convenient, it might be possible to determine the information we need by analysing CIDRAM’s logs (sometimes certain patterns emerge that can help us to get that information indirectly).
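As a rough illustration of that kind of log analysis, here is a minimal Python sketch that tallies block events by field. It assumes log entries are blank-line-separated blocks of `Label: value` lines using the same labels as CIDRAM's access-denied page (`IP Address`, `Signatures Reference`, `Why Blocked`); the actual log format depends on how logging is configured, so treat that layout as an assumption. The sample entries are invented and use documentation-range addresses.

```python
from collections import Counter

# Invented sample data in the ASSUMED log layout (not real CIDRAM output).
SAMPLE_LOG = """\
IP Address: 203.0.113.7
Signatures Reference: 203.0.113.0/24
Why Blocked: Spam risk

IP Address: 203.0.113.99
Signatures Reference: 203.0.113.0/24
Why Blocked: Spam risk

IP Address: 198.51.100.4
Signatures Reference: 198.51.100.0/24
Why Blocked: Cloud service
"""


def parse_entries(log_text):
    """Split the log into entries and each entry into a {label: value} dict."""
    entries = []
    for chunk in log_text.strip().split("\n\n"):
        fields = {}
        for line in chunk.splitlines():
            # Split on the first ": " only, so values such as URLs stay intact.
            label, sep, value = line.partition(": ")
            if sep:
                fields[label.strip()] = value.strip()
        if fields:
            entries.append(fields)
    return entries


def tally(entries, field):
    """Count how often each value of the given field occurs."""
    return Counter(entry.get(field, "(missing)") for entry in entries)


entries = parse_entries(SAMPLE_LOG)
for reference, count in tally(entries, "Signatures Reference").most_common():
    print(count, reference)
```

A signature reference that accounts for a sudden, large share of the blocks after a traffic drop is a reasonable first candidate to look at, though that alone doesn't prove the blocked visitors were legitimate.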
Of course, the biggest issue at the moment with modifying ignore.dat, writing custom bypasses, or performing any other customisations for that matter is that WordPress will reinstall the entire installation when updating from the WordPress plugins dashboard or from WordPress itself. I’m hoping to be able to somehow resolve that reasonably soon, but until then, updating CIDRAM from CIDRAM’s own front-end isn’t subject to that particular problem, thus providing a way to update while retaining all customisations.
Anyway, for the moment, and until I release the next version, that’s likely the best way to resolve the problem for that particular user, I think.
Kind regards,
Caleb M / Maikuolan.
- This reply was modified 4 years, 11 months ago by Maikuolan. Reason: Fixed typo
Hi Caleb,
This is the email I got.
I'm in Hong Kong and I can't access the site... Due to this error...
Your access to this page was denied because your IP address belongs to a network considered high-risk for spam.
ID: 1575380420-440889-4598152661
Script Version: CIDRAM v2.2.1
Date/Time: Tue, 03 Dec 2019 13:40:20 +0000
IP Address: 58.152.135.x
Signatures Count: 1
Signatures Reference: 58.152.0.0/15
Why Blocked: Spam risk ("PCCW Global", L1698:F3, [HK])!
User Agent: Mozilla/5.0 (Linux; Android 9; HMA-L29) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Mobile Safari/537.36
Reconstructed URI: https://highlightsfootball.net/
Generated by CIDRAM v2.2.1
I hope this helps.
Kind Regards,
- This reply was modified 4 years, 11 months ago by suprim.
Cheers.
I’ve revisited those particular signatures just now and I’m in the process of preparing an update. I’ll continue to list a small handful of PCCW IPs in the next version release, but the particular IP address in question won’t be blocked anymore (along with a majority of that network’s addresses).
Kind regards,
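For anyone wanting to check a report like the one quoted above by hand, the relationship between the reported IP address and the signature reference can be confirmed with Python's standard `ipaddress` module. This is just a sketch: the helper names are hypothetical, and it assumes the masked last octet (`x`) means the address can be treated as its surrounding /24.

```python
import ipaddress


def masked_ip_to_network(masked_ip):
    """Turn a report IP with a masked last octet ("a.b.c.x") into its /24."""
    octets = masked_ip.split(".")
    if len(octets) != 4 or octets[3] != "x":
        raise ValueError("expected an address of the form a.b.c.x")
    return ipaddress.ip_network(".".join(octets[:3]) + ".0/24")


def is_covered(masked_ip, signature_cidr):
    """True if the visitor's /24 lies entirely inside the signature's range."""
    visitor = masked_ip_to_network(masked_ip)
    signature = ipaddress.ip_network(signature_cidr)
    return visitor.subnet_of(signature)


signature = ipaddress.ip_network("58.152.0.0/15")
print(is_covered("58.152.135.x", "58.152.0.0/15"))
print(signature[0], signature[-1], signature.num_addresses)
```

A /15 spans 131072 addresses (58.152.0.0 through 58.153.255.255), which is why a single signature entry can end up blocking a large number of ordinary visitors on the same network.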
Hi Maikuolan,
Thank you.
The website seems to have started scraping again; they must be doing it from a new IP.