• Resolved shareasale-wp

    (@shareasale-wp)


    The extractAllUrls() function runs a preg_match_all() call whose character class should also exclude closing parentheses and semicolons, not just quotes, spaces, hash signs (anchors) and question marks (query strings).

    Line 1225 of UrlRequest.php:

    preg_match_all(
        '/' . str_replace('/', '\/', $baseUrl) . '[^"\'#\? ]+/i', // find this
        $this->_response['body'], // in this
        $matches // save matches into this array
    )

    Otherwise HTML like this will be crawled:

    style="background-image: url(https://www.example.com/wp-content/uploads/2018/08/image.jpg);"

    … and return https://www.example.com/wp-content/uploads/2018/08/image.jpg); including the trailing closing parenthesis and semicolon. The malformed URL then causes 404 errors in the static HTML output. Fortunately, it’s a simple fix: add ) and ; to the excluded characters in the regex pattern:

    '/' . str_replace('/', '\/', $baseUrl) . '[^"\'#\?); ]+/i'
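
To illustrate the difference, here is a minimal standalone sketch that runs both patterns against the inline style from above. The $baseUrl and $body values are hypothetical stand-ins for the plugin's own $baseUrl and $this->_response['body']; the rest of UrlRequest.php is not reproduced here.

```php
<?php
// Hypothetical stand-ins for the plugin's $baseUrl and $this->_response['body'].
$baseUrl = 'https://www.example.com';
$body = '<div style="background-image: url(https://www.example.com/wp-content/uploads/2018/08/image.jpg);"></div>';

// Pattern prefix built the same way as in the post.
$prefix = '/' . str_replace('/', '\/', $baseUrl);

// Original character class: stops only at quotes, #, ? and space.
preg_match_all($prefix . '[^"\'#\? ]+/i', $body, $before);

// Proposed character class: also stops at ) and ;
preg_match_all($prefix . '[^"\'#\?); ]+/i', $body, $after);

echo $before[0][0], PHP_EOL; // URL with the trailing ");" still attached
echo $after[0][0], PHP_EOL;  // URL ending at ".jpg"
```

A side note, not part of the proposed fix: str_replace() only escapes the slashes, so the dots in the base URL still act as regex wildcards. PHP's preg_quote($baseUrl, '/') would escape every special character, if a stricter match is wanted.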

    Thanks!

Viewing 1 replies (of 1 total)
  • The topic ‘Regex bug in UrlRequest.php’ is closed to new replies.