Forum Replies Created

Viewing 1 replies (of 1 total)
  • I’m also seeing 403 error codes on “good” links, and I think Link Checker should have some automation to detect known problem conditions. From what I’ve seen, sites using Cloudflare return a 403 error with a web page that includes an embedded captcha. That page auto-redirects when a human uses a browser to access the site, but automated requests just get the error. A prime example of this is pixabay.com (I’ve seen it on multiple other sites as well, so it’s by no means unique to one site). When I go to the site with my browser, it works just fine. When I poll the site with automation, I get:

    
    quark ~ $ curl -v https://pixabay.com/
    *   Trying 104.18.21.183:443...
    * Connected to pixabay.com (104.18.21.183) port 443 (#0)
    * ALPN, offering h2
    * ALPN, offering http/1.1
    * successfully set certificate verify locations:
    *   CAfile: /etc/ssl/certs/ca-certificates.crt
      CApath: none
    * TLSv1.3 (OUT), TLS handshake, Client hello (1):
    * TLSv1.3 (IN), TLS handshake, Server hello (2):
    * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
    * TLSv1.3 (IN), TLS handshake, Certificate (11):
    * TLSv1.3 (IN), TLS handshake, CERT verify (15):
    * TLSv1.3 (IN), TLS handshake, Finished (20):
    * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
    * TLSv1.3 (OUT), TLS handshake, Finished (20):
    * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
    * ALPN, server accepted to use h2
    * Server certificate:
    *  subject: C=US; ST=CA; L=San Francisco; O=Cloudflare, Inc.; CN=pixabay.com
    *  start date: Jun 12 00:00:00 2020 GMT
    *  expire date: Jun 12 12:00:00 2021 GMT
    *  subjectAltName: host "pixabay.com" matched cert's "pixabay.com"
    *  issuer: C=US; O=Cloudflare, Inc.; CN=Cloudflare Inc ECC CA-3
    *  SSL certificate verify ok.
    * Using HTTP2, server supports multi-use
    * Connection state changed (HTTP/2 confirmed)
    * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
    * Using Stream ID: 1 (easy handle 0x55fdba09b8b0)
    > GET / HTTP/2
    > Host: pixabay.com
    > user-agent: curl/7.70.0
    > accept: */*
    > 
    * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
    * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
    * old SSL session ID is stale, removing
    * Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
    < HTTP/2 403 
    < date: Tue, 23 Jun 2020 16:01:22 GMT
    < content-type: text/html; charset=UTF-8
    < cf-chl-bypass: 1
    < set-cookie: __cfduid=de3e3ee21c59642403dc7bbc79f8356291592928082; expires=Thu, 23-Jul-20 16:01:22 GMT; path=/; domain=.pixabay.com; HttpOnly; SameSite=Lax; Secure
    < cache-control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
    < expires: Thu, 01 Jan 1970 00:00:01 GMT
    < x-frame-options: SAMEORIGIN
    < cf-request-id: 038382b12a00000d12f03b7200000001
    < expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    < server: cloudflare
    < cf-ray: 5a7f6d61ddb10d12-ATL
    < alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400
    < 
    * Connection #0 to host pixabay.com left intact
    <!DOCTYPE html>
    <!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
    <!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
    <!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
    <!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
    <head>
    <title>Attention Required! | Cloudflare</title>
    <meta name="captcha-bypass" id="captcha-bypass" />
    <meta charset="UTF-8" />
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
    <meta name="robots" content="noindex, nofollow" />
    <meta name="viewport" content="width=device-width,initial-scale=1" />
    <link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection" />
    <!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
    
    [snip]
                
                <p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>
              </div>
    
              <div class="cf-column">
                <h2 data-translate="resolve_captcha_headline">What can I do to prevent this in the future?</h2>
                
    [snip]
    
    </body>
    </html>
    

    Notably, it’s a captcha to detect humans. I think going to all the 3rd party sites that use Cloudflare and asking each one to whitelist the blog sites linking to their webservers is unrealistic. Clearly, the site is alive and well, which is what the plugin is intended to check. The fact that the plugin bases its decision on the HTTP error code doesn’t mean that it is behaving correctly and giving the right answer in this case. It’s admittedly behaving as designed, but the design needs to evolve to actually solve the business case the plugin attempts to solve. I need a tool to check for links to broken sites, not a tool to check HTTP error codes. That mechanism has served the plugin well up to now, but it is increasingly no longer sufficient. I’d love to have the plugin extended to also check the body text on error code pages so that it doesn’t (necessarily) flag sites whose error page includes text like “captcha-bypass”. This should be pretty easy. Ideally, a more comprehensive solution using something like an automated Chrome driver (or perhaps a Selenium crawl) could be used. But as it is, the plugin is reporting a huge number of false positives, which are no less false positives just because the plugin is seeing HTTP 403 replies.
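
    To make the body-text idea concrete, here is a rough sketch in Python (not the plugin's own code, and not how the plugin currently works). It assumes the markers visible in the transcript above (the "server: cloudflare" and "cf-chl-bypass" headers, the "captcha-bypass" meta tag, and the "Attention Required! | Cloudflare" title) are good enough signals; Cloudflare could change any of them. The function names and the "unverified" verdict are made up for illustration.

```python
from typing import Mapping

# Markers copied from the curl transcript in this post. Assumption: they are
# stable enough to key on; Cloudflare may change or remove them at any time.
CHALLENGE_HEADER = "cf-chl-bypass"
BODY_MARKERS = ("captcha-bypass", "Attention Required! | Cloudflare")


def looks_like_cloudflare_challenge(
    status: int, headers: Mapping[str, str], body: str
) -> bool:
    """Return True when a 403 looks like a Cloudflare captcha page."""
    if status != 403:
        return False
    # HTTP header names are case-insensitive, so normalise before comparing.
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get("server", "").strip().lower() != "cloudflare":
        return False
    if CHALLENGE_HEADER in lowered:
        return True
    return any(marker in body for marker in BODY_MARKERS)


def classify_link(status: int, headers: Mapping[str, str], body: str) -> str:
    """Hypothetical three-way verdict: 'ok', 'broken', or 'unverified'."""
    if looks_like_cloudflare_challenge(status, headers, body):
        # The site answered, but only a browser could confirm the page exists.
        return "unverified"
    if status >= 400:
        return "broken"
    return "ok"
```

    With the response shown above, classify_link(403, {"server": "cloudflare", "cf-chl-bypass": "1"}, body) would come back as "unverified" instead of "broken", while a 403 from any other server would still be flagged. Reporting a separate "unverified" state rather than "ok" seems safer, since a captcha page does not prove the specific URL still exists.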

    • This reply was modified 4 years, 8 months ago by wamcvey.