Regex bug in UrlRequest.php
In the function extractAllUrls(), the preg_match_all call should also exclude closing parentheses and semicolons, not just hash signs (anchors) and question marks (query strings).
Line 1225 of UrlRequest.php:
preg_match_all(
    '/' . str_replace('/', '\/', $baseUrl) . '[^"\'#\? ]+/i', // find this
    $this->_response['body'], // in this
    $matches // save matches into this array
)
Otherwise HTML like this will be crawled:
style="background-image: url(https://www.example.com/wp-content/uploads/2018/08/image.jpg);"
… and the match returned is
https://www.example.com/wp-content/uploads/2018/08/image.jpg);
including the closing parenthesis and semicolon. This causes 404 errors in the static HTML output. Fortunately it’s a simple fix: add ) and ; to the excluded characters in the regex pattern:
'/' . str_replace('/', '\/', $baseUrl) . '[^"\'#\?); ]+/i'
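For anyone who wants to check this outside the plugin, here is a minimal standalone sketch that compares the two patterns. The $baseUrl value and the $body markup are stand-ins I made up from the example above; they are not taken from UrlRequest.php, and the real code reads the body from $this->_response['body'].

```php
<?php
// Hypothetical stand-ins for $baseUrl and $this->_response['body'].
$baseUrl = 'https://www.example.com';
$body = '<div style="background-image: url(https://www.example.com/wp-content/uploads/2018/08/image.jpg);"></div>';

// Same prefix construction as in the plugin: escape the delimiter slashes.
$prefix = '/' . str_replace('/', '\/', $baseUrl);

// Current pattern: stops only at quotes, #, ? and spaces.
preg_match_all($prefix . '[^"\'#\? ]+/i', $body, $before);

// Proposed pattern: also stops at ) and ;
preg_match_all($prefix . '[^"\'#\?); ]+/i', $body, $after);

echo $before[0][0], "\n"; // https://www.example.com/wp-content/uploads/2018/08/image.jpg);
echo $after[0][0], "\n";  // https://www.example.com/wp-content/uploads/2018/08/image.jpg
```

With the extra two characters in the excluded set, the match ends before the closing parenthesis, so the crawled URL no longer carries the trailing ");".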
Thanks!
Viewing 1 replies (of 1 total)
- The topic ‘Regex bug in UrlRequest.php’ is closed to new replies.