I propose adding the following function to the HTML_Import class defined in html-importer.php:
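    function strip_insignificant_html_whitespace($string) {
        // Patterns for opening and closing <pre> tags, with or without attributes.
        $pre_start = "<pre(?:>|\\s[^>]*>)";
        $pre_end = "</pre(?:>|\\s[^>]*>)";
        // Split on the <pre> tags, keeping the tags themselves as separate parts.
        $old_parts = preg_split(";($pre_start|$pre_end);i", $string, 0, PREG_SPLIT_DELIM_CAPTURE);
        $new_parts = array();
        $strip = true; // false while inside a <pre> block
        foreach ($old_parts as $part) {
            if (preg_match(";$pre_start;i", $part)) {
                // Opening <pre> tag: tidy whitespace inside the tag itself.
                $tmp = preg_replace(";\s+;", " ", $part);
                $new_parts[] = preg_replace("; +>;", ">", $tmp);
                $strip = false;
                continue;
            }
            if (preg_match(";$pre_end;i", $part)) {
                // Closing </pre> tag: tidy it and resume stripping.
                $tmp = preg_replace(";\s+;", " ", $part);
                $new_parts[] = preg_replace("; +>;", ">", $tmp);
                $strip = true;
                continue;
            }
            if ($strip) {
                // Outside <pre>: collapse each run of whitespace to one space.
                $new_parts[] = preg_replace(";\s+;", " ", $part);
            } else {
                // Inside <pre>: whitespace is significant, keep it untouched.
                $new_parts[] = $part;
            }
        }
        return implode("", $new_parts);
    }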
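To illustrate the intended behavior, here is a small stand-alone sketch. It uses a condensed copy of the function above, written as a free function only so that the snippet runs on its own; the sample input is made up for the example.

```php
<?php
// Condensed stand-alone copy of the proposed method (illustration only).
function strip_insignificant_html_whitespace($string) {
    $pre_start = "<pre(?:>|\\s[^>]*>)";
    $pre_end   = "</pre(?:>|\\s[^>]*>)";
    $parts = preg_split(";($pre_start|$pre_end);i", $string, 0, PREG_SPLIT_DELIM_CAPTURE);
    $out   = array();
    $strip = true; // false while inside a <pre> block
    foreach ($parts as $part) {
        $is_start = (bool) preg_match(";$pre_start;i", $part);
        $is_end   = (bool) preg_match(";$pre_end;i", $part);
        if ($is_start || $is_end) {
            // A <pre> or </pre> tag: tidy whitespace inside the tag itself.
            $tmp   = preg_replace(';\s+;', ' ', $part);
            $out[] = preg_replace('; +>;', '>', $tmp);
            $strip = $is_end; // resume stripping only after </pre>
        } elseif ($strip) {
            // Outside <pre>: collapse each whitespace run to a single space.
            $out[] = preg_replace(';\s+;', ' ', $part);
        } else {
            // Inside <pre>: keep the text exactly as it is.
            $out[] = $part;
        }
    }
    return implode('', $out);
}

// Made-up sample input with newlines both outside and inside a <pre> block.
$html = "<p>one\n   two</p>\n<pre class=\"x\" >\n  keep\n    this\n</pre>\n<p>three</p>";
$clean = strip_insignificant_html_whitespace($html);
echo $clean, "\n";
```

Newlines and repeated spaces outside the pre block become single spaces, while the lines inside the pre block keep their line breaks and indentation.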
In clean_html, replace
    $string = str_replace( '\n', ' ', $string );
with
    $string = $this->strip_insignificant_html_whitespace($string);

In get_post, in the !empty($my_post['post_content']) branch, replace
    $my_post['post_content'] = ereg_replace("[\n\r]", " ", $my_post['post_content']);
with
    $my_post['post_content'] = $this->strip_insignificant_html_whitespace($my_post['post_content']);

It would also be nice to strip the contents of CDATA sections and <script>...</script> blocks cleanly. I find examples like
<div id="googleAds">
<!-- b e g i n g o o g l e a d s -->
<script type="text/javascript">
//<![CDATA[
<!--
google_ad_client = "...";
google_ad_slot = "...";
google_ad_width = ...;
google_ad_height = ...;
//-->
//]]>
</script>
<script type="text/javascript" src="/data/../pagead2.googlesyndication.com/pagead/show_ads.js">
</script> <!-- e n d g o o g l e a d s -->
</div>
that are not stripped cleanly when the plugin applies PHP's strip_tags() function.
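
One possible approach, offered only as a rough sketch, is to remove script elements and CDATA sections with regular expressions before strip_tags() runs. The helper name strip_script_and_cdata_blocks is hypothetical and not part of the plugin, and regex-based removal is an assumption that will not cope with every malformed page.

```php
<?php
// Hypothetical helper, not part of the plugin: removes <script> elements and
// any remaining CDATA sections so strip_tags() never sees their contents.
function strip_script_and_cdata_blocks($string) {
    $patterns = array(
        // Whole <script ...>...</script> elements, including the code inside.
        // "s" lets "." match newlines; "i" ignores the case of the tag name.
        ';<script\b[^>]*>.*?</script\s*>;is',
        // CDATA sections that appear outside of script elements.
        ';<!\[CDATA\[.*?\]\]>;s',
    );
    foreach ($patterns as $pattern) {
        $result = preg_replace($pattern, '', $string);
        if ($result !== null) { // preg_replace returns null on error
            $string = $result;
        }
    }
    return $string;
}

// Shortened, made-up variant of the ad snippet quoted above.
$html = "<div id=\"googleAds\">\n"
      . "<script type=\"text/javascript\">\n"
      . "//<![CDATA[\n"
      . "google_ad_client = \"...\";\n"
      . "//]]>\n"
      . "</script>\n"
      . "<p>Visible text</p>\n"
      . "</div>";

$text = trim(strip_tags(strip_script_and_cdata_blocks($html)));
echo $text, "\n";
```

Where exactly such a call would go in the plugin is an open question; running it just before the existing strip_tags() call seems the most natural place.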