Overhaul of file scanning and detection
-
I just spent a bunch of time sort of “overhauling” wp-media-cleaner’s scanning and detection.
I wrote a plugin recently for indexing PDFs and for the site I wrote it for I only wanted to index PDFs that were deemed as being attached. So I wrote a bunch of detection methods for finding files within content, shortcode, widgets, etc. I found wp-media-cleaner often missed some of this content so I took the time to apply what I had done with my plugin to yours. I also found and fixed a couple other issues along the way.
Besides finding content that it was reported as unused, the other big issue I wanted to address was most of your matches were using partial file names. e.g. If I file was “uploads/path/image.png”, in a few places you were only searching “image.png.” That would obviously match that filename if it were in any path on the server, or worse, even a file linked so an external host. I came up with the following regex for my plugin and implemented that here:
$regex = addcslashes('=[\'"](?:(?:http(?:s)?\\:)?//'.preg_quote($parsedURL['host']).')?(?:'.preg_quote($parsedURL['path']).'/)'.$regex_match_file.'(?:\\?[^\'"]*)*[\'"]', '/');
Using that regex, the code will match all of the following:
href=”https://www.host.com/uploads/path/image.png”
href=”https://www.host.com/uploads/path/image.png?somevar=1″
href=’https://www.host.com/uploads/path/image.png’
href=”//www.host.com/uploads/path/image.png”
href=”/uploads/path/image.png”
src=”/uploads/path/image.png”But it would not match…
href=”https://www.someotherhost.com/uploads/path/image.png”
href=”https://www.host.com/uploads/other_path/path/image.png”The regex is adapted for matching against metal fields which wont be wrapped in quotes:
$regex = addcslashes('(?:(?:(?:http(?:s)?\\:)?//'.preg_quote($parsedURL['host']).')?(?:'.preg_quote($parsedURL['path']).'/)|^)'.$regex_match_file, '/');
Which will match…
https://www.host.com/uploads/path/image.png
https://www.host.com/uploads/path/image.png
/uploads/path/image.png
path/image.pngetc.
Here is a file compare. https://www.diffnow.com/?report=58be2
And here is the file: https://www.dropbox.com/s/gkfdud7jye2zhiz/wp-media-cleaner.php?dl=0
function wpmc_wp_ajax_wpmc_scan_do
102: I can’t stand line like this without braces ?? …but I really the only reason I put them in was so I can add some debugging lines while I worked.function wpmc_get_galleries_images
172: 12 hours seemed rather long for the transientfunction wpmc_wp_ajax_wpmc_scan
183: This is my new transient for the post_content with shortcode extracted193-203: This set of code eliminates some double scanning the process was doing. I didn’t see a reason so scan files that were also going to be scanned from the media side as well. So I first store the IDs and Paths in an array (one of which gets passed back for the json) and the then they are used to eliminated from the $files array media that will be scanned anyway.
205-218: I also felt like there is no need to scan all of the alternate image sizes as long as the source file existed in the database.
function wpmc_delete
318: Just an extra check to make sure we aren’t deleting a local the file the actually has a media db entry. Earlier when I was testing this happened. Could have been caused by something else but I figured worth leaving in.function wpmc_check_is_ignore
440: From my understanding this is the way to properly escape-like in wordpress. Also, you didn’t have deleted = 0 in the query before. So it would ignore a new file that had the same path as an old one that was trashed.function wpmc_check_in_gallery
475-482: added wpmc_clean_uploaded_filename here so we can pass the full path and get better matchingfunction wpmc_check_db_has_featured
487: wpmc_check_db_has_featured — There is no need for wpmc_check_db_has_featured. I realized this after modifying. I lef the code there for your review, but wpmc_check_db_has_meta probably isn’t any less expensive and covers the same thing.function wpmc_check_db_has_meta
503: Along with passing the full path into this function I am regexing the match, making sure it’s not matching against it’s own meta_data by also passing in the attachment_id (when available) and also added meta_value = attachment_id to catch some other forms of using featured images and some other plugins.function wpmc_check_db_has_content
519: So this is where the motherload of changes is… So the first thing I do is query post content via the more stricter regex match. If the file is found then great. Move on. If not, then we pull all posts that contain all of the applicable shortcode, loop through them processing the do_shortcode, and then store the results in a transient. I do the same thing with all active widgets and store it along with the post content. Once the transient is stored, we then loop through it checking for matches with the regex.function wpmc_find_attachment_id_by_file
598: Added function. Name describes it all. Used to get an attachment ID from a file path.function wpmc_check_file
611: This is a fix. You were doing strip slashes for $filepath but the $path was being used in other places. Most importantly storing it in your issues table. This was breaking detection for files with single quotes in their name.638: First we check the ignore table. If it’s there just quit.
640: Get the attachment id (used later) and then determine by the extension and with another bit a regex if this is an alternate image size. If the source file does not have a media id or the source file does not exist then mark it as “no media.” Otherwise I feel there is no reason to bother checking each of these individually. When the source file itself is scanned we will check the alternate image sizes then. NOTE: This is kind of irrelevant now because I coded this before I made the modifications to wpmc_wp_ajax_wpmc_scan where I eliminate the image sizes from the scanning arrays. So in theory, this should never happen anyway now.
655: This is another fix. I added wpmc_check_db_has_meta here and moved wpmc_check_db_has_content (in order of most expensive to least). You had wpmc_check_db_has_meta after this and I couldn’t quite figure out why as by doing so a file would never be marked as “NO_CONTENT” since it wouldn’t get by this and to the insert statement without being marked as “NO_MEDIA”.
666: To check alternate image sizes of files I use the glob function. I think maybe this should may be employed into check_media in addition to or instead of using wpmc_get_image_sizes. I didn’t find a reason to for now as all of my images checked out. But something to consider.
function wpmc_check_media
740 (& 776): Check to see if the file exists to check for orphaned media. If it doesn’t exist I add a new issue “ORPHAN_MEDIA”.747 & 766: Add wpmc_check_db_has_meta to this and also move wpmc_check_db_has_content last for speed.
function echo_issue
821: Added new issuefunction wpmc_screen
813: Another fix: weren’t htmlspecialchar escaping the file itself. Only the path before it. Broke files with single quotes in them.
- The topic ‘Overhaul of file scanning and detection’ is closed to new replies.