Hairy Automattic crawler
Now, what’s up with this sucker? Suddenly there is a bot coming from WordPress of all places, scanning this and that…
myhost.tld 72.233.119.245 72.233.119.245 – – [08/Jan/2012:16:27:40 +0100] “GET /wp-admin/js/post.js HTTP/1.0” 401 1809 b 2299 mics “-” “Automattic Analytics Crawler/0.1; https://wordpress.com/crawler/” “US” “United States” “Plano”
Does it look for robots.txt, let alone care what it says? No.
Don’t. Do. That.
You might want to bring this up on the WordPress.COM forums instead.
Same thing about 15 minutes ago, generating 404s all over the place, blocked with mod_security.
Note the bad spelling, though. The originating IP address, 72.233.119.245, resolves to a server provided by Softlayer, which is also where wordpress.com resolves, so who knows?
First, I apologize for the bad behavior.
I worked on this crawler at Automattic. Yesterday we paused it and made an update to obey robots.txt blocks. We should have done that before it started, but I forgot.
The change is now made and the crawler will be better-behaved in the future. Thanks for reminding us. Let me know if you have any questions (I’m subscribed to this thread by email now). If you want to write anything privately, the contact form at https://wordpress.com/crawler goes directly to me.
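For anyone wondering what obeying robots.txt looks like on the crawler side, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Automattic's actual code. The robots.txt rules below are a made-up example; only the user-agent string and the requested path come from the log line at the top of this thread.

```python
from urllib.robotparser import RobotFileParser

# User-agent string as it appears in the access log quoted in this thread.
USER_AGENT = "Automattic Analytics Crawler/0.1"

# Hypothetical robots.txt a site owner might publish (an assumption,
# not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
"""


def allowed(url, robots_txt=ROBOTS_TXT, user_agent=USER_AGENT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("https://myhost.tld/wp-admin/js/post.js"))
print(allowed("https://myhost.tld/index.php"))
```

With these example rules, the request for `/wp-admin/js/post.js` is refused and the request for `/index.php` is permitted. A real crawler would fetch each site's own robots.txt before requesting anything else.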
Really interesting, Evan, Chief Dog Walker. Thanks very much for your direct reply! Apart from “The Automattic Analytics Crawler is a utility we use to discover how people use WordPress,” may I ask what data you collect, how you use it, and whether the data is made available to anyone outside of WordPress? Sounds funny, but did you fix the 404 errors on your ‘crawling’?
Right now we’re recording whether or not a site uses WordPress and which version it’s using if we can figure it out. There are lots of companies that gather and publish statistics about WordPress usage, but there’s not much clarity into how they come up with their numbers. The 404’s aren’t really a bug per se. One of the ways we can check for version numbers is by comparing static files from WordPress core to files on a given site — but if they don’t exist, they generate 404’s.
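To make the idea of comparing static files concrete, here is a rough sketch of how such a check could work in principle: hash a file fetched from a site and look the hash up in a table of hashes of the same file from known releases. This is an illustration only, not the crawler's real implementation. The function names, the release labels, and the file contents are all placeholders invented for the example.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def guess_version(fetched: bytes, known: Dict[str, str]) -> Optional[str]:
    """Return the release label whose known hash matches the fetched file.

    Returns None when the file matches no known release, for example
    because it was modified or does not come from WordPress at all.
    """
    return known.get(fingerprint(fetched))


# Placeholder contents standing in for one static file as shipped in two
# releases. Real data would come from the official release archives.
RELEASE_A = b"/* post.js as shipped in release A */"
RELEASE_B = b"/* post.js as shipped in release B */"

KNOWN_HASHES = {
    fingerprint(RELEASE_A): "release A",
    fingerprint(RELEASE_B): "release B",
}

print(guess_version(RELEASE_B, KNOWN_HASHES))
print(guess_version(b"something else entirely", KNOWN_HASHES))
```

A request for a file that does not exist on the site is what produces the 404s described above; in this sketch that case would simply never reach `guess_version`.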
The goal is to better understand the market of WordPress usage so we can make the best products possible for them. Some that we’re already making are VaultPress, VideoPress, and Akismet. This is just one step to help us learn as much as possible.
By the way, I noticed you commented on a “misspelling” earlier. I assume you mean the second T in Automattic. That’s actually not a misspelling, it’s the name of the company that runs WordPress.com and some other WordPress-related products.
Need a dog walker?
Sorry to bump this thread, but I’m now seeing attempted logins for “username” several times per day from this IP.
Is this consistent with expected crawler behavior? If so, what is the reason? We use fail2ban and if the crawler keeps this up, it’ll get blocked outright.
philhagen, can you post whatever data you have about the requests you’re seeing? It’s probably something else, for two reasons. First, our crawler doesn’t know anything about logging in or even how to submit a form. Second, our crawler hasn’t run in several months.
I’m happy to help debug if you can post some more info.
Thanks for the quick reply, Evan.
I am using custom mods to the “Simple Login Log” plugin[1] to push log messages via syslog for aggregation and active defense (fail2ban among others). I’ve received 47 login attempts to one particular WP instance since Sept 10, 2012, at a rate of around 4-5 per day on most days.
The login attempt is to the “username” user, and attempts have come from the following IPs:
– 72.233.119.245 (22x)
– 74.200.247.240 (25x)
There is no “username” user on the site, so the logins have obviously failed.
The most recent syslog entries for these attempts are below (all times EDT).
Oct 7 14:26:47 serverhostname httpd.itk: WordPress login: [email protected] () from 74.200.247.240 -> Failed
Oct 7 14:40:47 serverhostname httpd.itk: WordPress login: [email protected] () from 72.233.119.245 -> Failed
Oct 7 17:44:29 serverhostname httpd.itk: WordPress login: [email protected] () from 74.200.247.240 -> Failed
Oct 7 17:57:06 serverhostname httpd.itk: WordPress login: [email protected] () from 74.200.247.240 -> Failed
Oct 7 18:23:00 serverhostname httpd.itk: WordPress login: [email protected] () from 74.200.247.240 -> Failed
Oct 8 08:01:00 serverhostname httpd.itk: WordPress login: [email protected] () from 72.233.119.245 -> Failed
Oct 8 08:48:03 serverhostname httpd.itk: WordPress login: [email protected] () from 72.233.119.245 -> Failed
Oct 8 09:02:19 serverhostname httpd.itk: WordPress login: [email protected] () from 74.200.247.240 -> Failed
Oct 8 09:43:37 serverhostname httpd.itk: WordPress login: [email protected] () from 72.233.119.245 -> Failed
Oct 8 15:51:16 serverhostname httpd.itk: WordPress login: [email protected] () from 72.233.119.245 -> Failed
Oct 9 08:38:18 serverhostname httpd.itk: WordPress login: [email protected] () from 72.233.119.245 -> Failed
Oct 9 08:54:55 serverhostname httpd.itk: WordPress login: [email protected] () from 72.233.119.245 -> Failed
Oct 9 09:07:59 serverhostname httpd.itk: WordPress login: [email protected] () from 72.233.119.245 -> Failed
Oct 9 11:52:30 serverhostname httpd.itk: WordPress login: [email protected] () from 72.233.119.245 -> Failed
I don’t log the actual password value attempted, for security purposes.
Upon further review, these entries appear to be coincident with POSTs those IPs make to /xmlrpc.php[2][3]
This customer does have Jetpack installed, and my cursory review did not find any problems, though it reflects that it’s “Connected to WordPress.com” but still asking to “Link accounts with WordPress.com”. I have a feeling this means the user has not fully configured Jetpack, but nothing jumped out at me in that regard…
Let me know what other information may help nail this down – I really appreciate your assistance.
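In case it helps anyone reproduce the per-IP tallies above, here is a small sketch of how entries in this format could be counted. It assumes only the trailing `from <IP> -> <Result>` part of each line, exactly as it appears in the excerpt; the function name and the regular expression are my own, not part of Simple Login Log or fail2ban.

```python
import re
from collections import Counter

# Matches the tail of each entry as shown in the excerpt:
# "... from <IPv4 address> -> <Result>"
ENTRY_RE = re.compile(r"from (\d{1,3}(?:\.\d{1,3}){3}) -> (\w+)")


def count_attempts(lines):
    """Return a Counter mapping (ip, result) to the number of matching lines."""
    counts = Counter()
    for line in lines:
        match = ENTRY_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Three entries copied from the excerpt in this post.
SAMPLE = [
    "Oct 7 14:26:47 serverhostname httpd.itk: WordPress login: [email protected] () from 74.200.247.240 -> Failed",
    "Oct 7 14:40:47 serverhostname httpd.itk: WordPress login: [email protected] () from 72.233.119.245 -> Failed",
    "Oct 7 17:44:29 serverhostname httpd.itk: WordPress login: [email protected] () from 74.200.247.240 -> Failed",
]

for (ip, result), n in sorted(count_attempts(SAMPLE).items()):
    print(f"{ip} {result} {n}x")
```

Lines that do not match the pattern are skipped, so the same function can be pointed at a mixed syslog file.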
[1] https://www.ads-software.com/extend/plugins/simple-login-log/
[2] Oct 9 11:52:30 serverhostname httpd: sitename.com 72.233.119.245 – – [09/Oct/2012:11:52:29 -0400] “POST /xmlrpc.php?for=jetpack&token=<redacted>&timestamp=1349797949&nonce=<redacted>&body-hash=<redacted>&signature=<redacted> HTTP/1.0” 200 421 “-” “WordPress/3.5-alpha-21535; https://sitename.com”
[3] Oct 9 11:52:30 serverhostname httpd: sitename.com 72.233.119.245 – – [09/Oct/2012:11:52:30 -0400] “POST /xmlrpc.php HTTP/1.0” 200 422 “-” “The Incutio XML-RPC PHP Library”
Those are almost certainly our Jetpack Servers communicating with your site.
We don’t have access to (and do not want access to) your account’s password, so Jetpack uses its own OAuth-based authentication layer to interact with your site.
When you combine how most XML-RPC endpoints work with how WordPress handles authentication, though, it turns out that, even though Jetpack uses its own OAuth-based authentication, we still have to send a username and password with the request. The Jetpack plugin tells WordPress to completely ignore that username and password, but we still have to send them to get things in WordPress working correctly.
So we send a username of “username” and a password of “password”. That’s what your logs are showing.
We could send something else to make it more obvious what’s happening, such as a username of “jetpack”. These aren’t real login attempts, though; they just use some of the same WordPress hooks as login attempts do, which is why the Simple Login Log plugin thinks they are login attempts.
It’s not clear to me if “Failed” means the OAuth-based authentication Jetpack uses actually failed or if Simple Login Log just doesn’t understand what to make of these requests.
Interesting…
We have a lot of other Jetpack-enabled WP instances on the same server, and only one is throwing up this error. Therefore, I’m quite confident the problem is in the setup somewhere. I will track this down more and if it’s anything other than the client’s human error, I’ll report back here.
Definitely understand (and appreciate!) the use of OAuth versus U/P. And that XMLRPC OAuth attempts are treated the same as U/P login attempts is also good to know – helps to flag potential abuse…
Using “jetpack” instead of “username” would be helpful, FWIW, though not really critical in this case.
Thanks!