How to properly loop through these external URLs to get them into the sitemap
-
I have filtered URLs that I would like to push into a sitemap. I am using one of the sitemap plugins (BWP-GS), which has hooks to further modify its output.
My code so far:

// add to theme's functions.php
// Note: home_url() prepends the site URL itself, so only the path is passed.
add_filter( 'bwp_gxs_external_pages', 'bwp_gxs_external_pages' );

function bwp_gxs_external_pages( $pages ) {
	return array(
		array( 'location' => home_url( '/used-cars/location/new-york/model/bmw' ), 'lastmod' => '27/03/2017', 'frequency' => 'auto', 'priority' => '1.0' ),
		array( 'location' => home_url( '/used-cars/location/los-angeles/model/aston-martin' ), 'lastmod' => '27/03/2017', 'frequency' => 'auto', 'priority' => '0.8' ),
		array( 'location' => home_url( '/used-cars/model/mercedes-benz' ), 'lastmod' => '27/03/2017', 'frequency' => 'auto', 'priority' => '0.8' ),
	);
}
As you can see in my code, I have these kinds of URLs:

www.example.com/used-cars/location/new-york/model/bmw
www.example.com/used-cars/model/mercedes-benz

My issue is that there are thousands of these URLs, and I need to push them all into this sitemap.

So my question is: isn't there a way to loop over them, rather than inserting them into the code one by one like this?
array( 'location' => home_url( '/used-cars/model/aston-martin' ), 'lastmod' => '27/03/2017', 'frequency' => 'auto', 'priority' => '0.8' )
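(For illustration, a loop-based version of that filter might look like the sketch below. The $paths array and the function name are hypothetical stand-ins for wherever the thousands of filtered URLs actually come from; note that it appends to the incoming $pages array rather than replacing it.)

add_filter( 'bwp_gxs_external_pages', 'bwp_gxs_external_pages_loop' );

function bwp_gxs_external_pages_loop( $pages ) {
	// Hypothetical source of the URL paths; replace with however the
	// filtered URLs are actually generated or stored.
	$paths = array(
		'/used-cars/location/new-york/model/bmw',
		'/used-cars/location/los-angeles/model/aston-martin',
		'/used-cars/model/mercedes-benz',
	);

	foreach ( $paths as $path ) {
		$pages[] = array(
			'location'  => home_url( $path ),
			'lastmod'   => '27/03/2017',
			'frequency' => 'auto',
			'priority'  => '0.8',
		);
	}

	return $pages;
}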
-
Alright, not a problem; take as much time as you need. On my side, I'll keep checking the theme's code to see if I can spot anything. I can also copy and paste the theme's code from any section where you think the issue might be, to see if we can find anything.
“Uh, Houston, we have a problem”
Using the ‘wpseo_sitemap_index’ the way we are does not comply with the referenced sitemap schema. The content on the page this filter is for is supposed to link to other *-sitemap.xml files. This is an improper location to provide URLs to site content. I can still provide something to get terms if you want, but I don’t see the point until we can figure out how to generate URLs that actually fit the schema.
I don’t think there is any point in presenting an improper sitemap structure, it will not validate and I’m quite sure Google Search Console would reject it if you were to submit there.
I did see a filter in the Yoast code to append data much like we are doing, but to a non-index sitemap.xml file that really is intended to hold site URLs. However, I don't see where a lot of added content would be broken into multiple pages; the added content is merely appended to the first page of that sitemap type. Still, a single sitemap file can contain up to 50k URLs as long as the file size is under 10 MB uncompressed. If you stick with the Yoast default of 1000 entries, you should be OK as long as all permutations of your two vehicle taxonomies do not exceed 49k. If there are 220 terms in each, that would give us 48.4k permutations (220 × 220 = 48,400). Are we OK given these constraints?
Are any of the vehicle location terms valid as a single term URL, like example.com/used-cars/location/new-york/ to get all listed used cars in NYC? If so, our current code does not generate such links. Intentional or oversight?
In any case, the Yoast sitemap can be configured to generate single taxonomy links. I don't know if it'll pick up the used-car component, but it's possible to filter the links generated, so that can be fixed.
What I'm thinking is hooking into one of these taxonomy sitemaps that Yoast generates. It doesn't really matter which, or even if it's a taxonomy sitemap. From that hook, add all the permutations of URLs, as long as there are fewer than 49k of them. Then the index links are already taken care of.
Another thing I noticed: it's kind of a pain to check the generated sitemaps for coding changes because they are all cached. Do you know how to flush the cache so the recent changes show up? All I could figure out is to disable sitemaps, save, re-enable, save, then reload the sitemap page. That gets old fast!
Will Yoast generate either model or location taxonomy when configured to do so? It doesn’t matter yet if the links are correct, as long as the terms are showing up in the links. The reason I ask is Yoast is using get_terms() to populate these links. I’m curious if get_terms() works in this context or if it too fails.
Ahhhh! I get it. That's what I was asking about in my previous question, about the function being used to add a sitemap.xml rather than URLs. But then again, I thought URLs were going to show up in the root (sitemapindex) anyway, and then as the next step I was going to ask you how we could push the URLs into a *-sitemap.xml file.

Yeah, you are right, Google will reject them, and the actual reason for all this is to get these URLs indexed by search engines.

Hmm! So Yoast doesn't break the content into multiple sitemap files. That's going to be problematic, because we have about 82.9k permutations if I'm not mistaken, with vehicle_model having 648 terms and vehicle_location having 128 terms (648 × 128 = 82,944). This is a feature that the BWP-GS plugin got really right. The vehicle location single term URL was left out intentionally, but now that I think about it, it would actually make sense to have it in there, so yeah, let's include it in the function.

Yoast does generally generate single taxonomy links, but the New York location link would look like this: example.com/location/new-york/. Tell me, do you think Google would see these links (example.com/location/new-york and example.com/used-cars/location/new-york) as duplicate content and then penalize my site? Because these two links serve the same content on the site.

That's another question I was going to ask at a later stage. I don't think Yoast has an option or a function to flush the cache; this is another feature that BWP-GS got right.

Geez! In your opinion, do you think Yoast can handle 82.9k links? Given your experience, what would you do to solve this issue? I'm all out of ideas. It doesn't matter what plugin we use at this stage; what's important is whether it can get the job done.
I'm sorry I didn't grasp the significance of outputting site links into the sitemap index schema until recently. I was focused more on getting terms to work than on the details of where the data was going.

BWP is clearly superior for generating complex sitemaps, but with Yoast it's easier to grasp the inner workings because it's relatively simple, which is both its strength and its weakness. If you are going to use Yoast for SEO anyway, I'd probably stick with it, because I hate having extra plugins that essentially do the same thing. If the sitemap functionality is all you are after, not Yoast's other features, I would not use Yoast; I also hate plugins that do far more than what I need. It doesn't sound like either is causing the problem of getting terms.

Are you able to get Yoast to output either of the two taxonomies' terms as single term links? It also gets terms by using get_terms(), so why would it work there and not in ours? (Rhetorical question, I know you don't know why; I'm just curious whether it does work.)
With Google at least, it’s not a problem having multiple links leading to the same content IF the canonical link in the head section is the same for all cases. If the canonical differs or does not exist, multiple links are certainly a problem that would diminish page rank.
If you have around 90k links, that is actually only two sitemap pages, but more than can be added through Yoast's page-1-only action hook. Truth be told, it's not that big a deal to generate such pages complete and add their URLs to the Yoast index page. The trick is intercepting such individual requests so we can generate the output instead of Yoast. There may be a hook we can use. If all else fails, add a separate rewrite rule in .htaccess to go to our own script page instead of Yoast's. Our names just cannot get caught by Yoast's rewrite rule, which shouldn't be too difficult: Yoast matches *-sitemap###.xml, so our form need only be *sitemap-###.xml to be kept separate.
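(If it came to that, the same interception could also be done from PHP with WP's add_rewrite_rule() instead of raw .htaccess. A minimal sketch, assuming a hypothetical dblterm_page query var and a hypothetical handler; permalinks would need to be flushed after adding the rule:)

// Route requests like dblsitemap-3.xml to our own handler. The
// "sitemap-" ordering avoids Yoast's *-sitemap###.xml pattern.
add_action( 'init', function () {
	add_rewrite_rule(
		'^dblsitemap-([0-9]+)\.xml$',
		'index.php?dblterm_page=$matches[1]',
		'top'
	);
} );

add_filter( 'query_vars', function ( $vars ) {
	$vars[] = 'dblterm_page'; // hypothetical query var
	return $vars;
} );

add_action( 'template_redirect', function () {
	$page = (int) get_query_var( 'dblterm_page' );
	if ( $page ) {
		header( 'Content-Type: application/xml; charset=UTF-8' );
		// ... generate and echo sitemap page $page here ...
		exit;
	}
} );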
Of course, none of that matters if you go with BWP. But either way, we still need to be able to get terms somehow. I guess I’ll go back to figuring out how to get the data out of the DB with SQL. What’s done with the data is secondary.
That's exactly what I was trying to avoid in the first place: having extra plugins that do the same thing. I already use Yoast for the site's SEO, so let's stick with Yoast then.

So it's possible to split the links into different sitemap pages without the feature being offered by the plugin? Alright, let's give it a shot and see if it pans out.
Ah! Of course, get terms, our main problem. Please let me know if you've managed with the SQL solution. Can't we also try it with that alt_get_terms solution you provided earlier? Could that work this time around?

Sure, we can split the links into pages if it's our own code doing the splitting :) It's just more work for me. I don't mind doing so on my own time, but it does mean you'll need to wait even longer than you have already for a complete solution.
The current alt_get_terms() doesn't seem to be a viable solution, for unknown reasons. My plan is to rewrite it to get the terms directly from the DB using SQL. Your callback function thus remains the same for now, but the alt_get_terms() declaration gets completely redone.

I've nearly put together the proper SQL query, but ran into a potential snag with discarding unused terms (for the 'hide_empty' argument). Because hierarchical terms can themselves be empty yet have children that are not, this verification has to be done in PHP; I see no way to do it in SQL. This means I need to confirm that a couple of other term-related functions still work properly.
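(For context, the PHP-side verification would look roughly like this. A minimal sketch: the function name is hypothetical, and it leans on get_term_children() and get_term(), which is exactly what the test below is meant to confirm still work:)

// Decide whether a hierarchical term should be hidden: a term is only
// truly "empty" if it has no posts AND none of its descendants do.
function dblterm_is_effectively_empty( $term_id, $taxonomy ) {
	$term = get_term( $term_id, $taxonomy );
	if ( ! is_a( $term, 'WP_Term' ) ) {
		return true;
	}
	if ( $term->count > 0 ) {
		return false;
	}
	// get_term_children() returns ALL descendant IDs, not just direct children.
	$children = get_term_children( $term_id, $taxonomy );
	if ( is_array( $children ) ) {
		foreach ( $children as $child_id ) {
			$child = get_term( $child_id, $taxonomy );
			if ( is_a( $child, 'WP_Term' ) && $child->count > 0 ) {
				return false;
			}
		}
	}
	return true;
}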
As a test, please temporarily add the following code to your add_sitemap_custom_items() function declaration, immediately below these lines:
// Loop through the search terms
$pages = '';
and above the foreach loop:
$top_level_term_id = null; // <<==== SET ID FOR null HERE
$child_term_id     = null; // <<==== AND HERE

$children = get_term_children( $top_level_term_id, 'vehicle_model' );
$child    = get_term( $child_term_id, 'vehicle_model' );
$count    = is_array( $children ) ? count( $children ) : 0;

$pages .= $count
	? "get_term_children() for ID $top_level_term_id returned $count child[ren]"
	: 'get_term_children() FAILED to return any children';
$pages .= "\n";
$pages .= is_a( $child, 'WP_Term' )
	? "get_term() for ID $child_term_id returned $child->name"
	: 'get_term() FAILED to return a term';
$pages .= "\n";
Please determine the term_ids for two 'vehicle_model' taxonomy terms and use the IDs to replace each null in the first two lines. The first term_id should be for a top level term that has no ancestors; the second can be for any child term. The two do not need to be related, and the child can be any number of levels down. Please take care to ensure that each ID is indeed a proper ID for the 'vehicle_model' taxonomy; supplying an improper ID will yield inaccurate results.

This will add a couple of lines to the sitemap index list, between the Yoast links and your added links. This code will always add something, so if the lines are not appearing at all, you are looking at cached data. The only way I've found to flush the buffer is to turn off Yoast sitemaps (then save) and clear the browser cache. Then reactivate the sitemap support and fresh data will be generated.
The output will be completely invalid XML, but it’ll tell me what I need to know about these functions. Ideally the output should be something like this:
get_term_children() for ID 111 returned 4 child[ren]
get_term() for ID 222 returned Audi A6

OTOH, if you get something like the following, I've some more hard thinking to do:
get_term_children() FAILED to return any children
get_term() FAILED to return a term

Once we are finally able to get terms, I still need to code the actual sitemap file generation. I am assuming you would configure Yoast to output the single term links, so my code would only be generating the double term links.
Alright, so the test code did return some URLs like you said it should. However, the URLs it returned are those with posts assigned to them, meaning it only returned URLs when hide_empty was set to true, right?

Then when hide_empty was set to false, we get almost the same problem as with BWP-GS, but this time it states the error. This is what I got:

( ! ) Fatal error: Maximum execution time of 30 seconds exceeded in C:\wamp\www\autocity\wp-includes\class-wp-hook.php on line 284

Call Stack
#   Time     Memory    Function                                    Location
1   0.0040   367680    {main}( )                                   ..\index.php:0
2   0.0060   371304    require( 'C:\wamp\www\autocity\wp-blog-header.php' )  ..\index.php:17
3   5.5573   65620184  wp( )                                       ..\wp-blog-header.php:16
4   5.5573   65620296  WP->main( )                                 ..\functions.php:955
5   5.5710   65717544  WP->query_posts( )                          ..\class-wp.php:735
6   5.5711   65717632  WP_Query->query( )                          ..\class-wp.php:617
7   5.5711   65718496  WP_Query->get_posts( )                      ..\class-wp-query.php:3238
8   5.5722   65726496  do_action_ref_array( )                      ..\class-wp-query.php:1681
9   5.5722   65726640  WP_Hook->do_action( )                       ..\plugin.php:515
10  5.5722   65726720  WP_Hook->apply_filters( )                   ..\class-wp-hook.php:323
11  5.5722   65727496  call_user_func_array( )                     ..\class-wp-hook.php:298
12  5.5722   65727680  WPSEO_Sitemaps->redirect( )                 ..\class-wp-hook.php:0
13  5.5723   65727728  WPSEO_Sitemaps->get_sitemap_from_cache( )   ..\class-sitemaps.php:206
14  5.5809   65783496  WPSEO_Sitemaps->refresh_sitemap_cache( )    ..\class-sitemaps.php:248
15  5.5809   65783528  WPSEO_Sitemaps->build_sitemap( )            ..\class-sitemaps.php:270
16  5.5809   65783528  WPSEO_Sitemaps->build_root_map( )           ..\class-sitemaps.php:292
17  6.6518   66008248  WPSEO_Sitemaps_Renderer->get_index( )       ..\class-sitemaps.php:345
18  6.6529   66009616  apply_filters( )                            ..\class-sitemaps-renderer.php:66
19  6.6529   66009880  WP_Hook->apply_filters( )                   ..\plugin.php:203
20  6.6530   66010744  call_user_func_array( )                     ..\class-wp-hook.php:298
21  6.6530   66010928  add_sitemap_custom_items( )                 ..\class-wp-hook.php:0
22  30.0374  78464384  home_url( )                                 ..\functions.php:190
23  30.0374  78464448  get_home_url( )                             ..\link-template.php:2969
24  30.0374  78464496  get_option( )                               ..\link-template.php:2995
25  30.0376  78464768  apply_filters( )                            ..\option.php:141
26  30.0377  78465152  WP_Hook->apply_filters( )                   ..\plugin.php:203
( ! ) Fatal error: Maximum execution time of 30 seconds exceeded in C:\wamp\www\autocity\wp-includes\wp-db.php on line 668

(The call stack is identical to the one above, plus one final frame:)
27  30.0409  78461448  wpdb->__destruct( )                         ..\wp-db.php:0
I'm not sure if it's possible to configure Yoast to generate single term links. I could be wrong, but it seems like Yoast's entire sitemap feature generates only pages, posts, and URLs for taxonomy terms assigned to posts. Also, I would've liked to have that /used-cars/ component added in the single term URLs. Yoast can only generate URLs like https://www.example.com/model/audi; these are viewed via taxonomy.php, and actually not through the custom post type, which is really not ideal.
Wow, I hadn't expected that! So you are saying the alt_get_terms() function works properly when hide_empty is set to true? That's good news in a way, an improvement over get_terms(), which didn't work either way. Since hide_empty => true is our eventual goal, we ought to be OK for a while (am I remembering that correctly?). When a lot more posts are added, there will eventually be a timeout problem again, but that's a different issue from not getting anything either way. We should be OK for a good long while.
If I’m understanding correctly that alt_get_terms() will meet our needs, we can now move on to generating our own sitemaps on request using the current Yoast hook only to generate index links to our own sitemap files. It also means I don’t need an answer to my latest test code since we do not need to resort to SQL.
If you can confirm I’m understanding all this correctly, I’ll start figuring out the next step. I’d also like to know roughly the maximum number of terms in both the location and model taxonomies that you would expect there to be.
Yeah, you got it! The alt_get_terms() function works. But hold up, are you saying the timeout problem can be solved if we generate/split the links into their own multiple sitemap files? Because remember, our ultimate goal is to get links with hide_empty => false.
Max number of terms are as follows:

vehicle_model: currently 648 terms, max ~700
vehicle_location: currently 128 terms, max ~500

Your next-to-last post ("I'm not sure if it's possible…") sneaked in while I was composing my last reply, so I'll address it first. Yes, Yoast will do single term links, but maybe not in the way you are thinking. Single term would be like your example /model/audi/, a single term in the vehicle_model taxonomy.
If you don't want /model/audi/ but /used-cars/model/audi/ is acceptable, I'm pretty sure the Yoast output can be filtered so that the extra path component will appear in sitemap links. You could then also get /used-cars/location/chicago/ instead of just /location/chicago/.
Splitting the double term links into decent sized sitemap pages is kind of a pain. Having to factor in single term links would just complicate things further. I’d prefer to avoid this complication.
OK then, about your more recent post. Yes, I got the hide_empty bit reversed in my mind. Maybe it was a Freudian slip, because that is what I wanted it to be. Honestly, I don't think showing empty terms is a good idea. If links without assigned posts are in your sitemap, there will be a lot of links for search bots to follow that all have essentially the same content: "Nothing found" (or similar). At least with Google's bots, when many links with the same content are found, your site's page ranks will be significantly diminished. The one workaround is to have pages with the same content share the same canonical link in a meta tag; then the differing links aren't used as a demotion.

In your case, though, each of these links with no related posts would have a different canonical link, yet the content is the same in each case. You do not want this to occur! I urge you to hide empty terms. If you hide empty terms, we'll be OK for a while with the timeout problem; eventually, as more cars are added, the problem will arise again. If you insist on not hiding empty terms, the problem needs to be addressed now.
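(To make the canonical idea concrete, here is one hedged illustration using Yoast's wpseo_canonical filter. The empty-archive condition and the shared target URL are hypothetical choices for the sketch, not something prescribed in this thread:)

// Point every "no results" archive at one shared canonical URL, so
// Google treats them as one page instead of many duplicates.
add_filter( 'wpseo_canonical', function ( $canonical ) {
	if ( is_tax() && ! have_posts() ) {
		// Hypothetical shared target; choose whatever page makes sense.
		return home_url( '/used-cars/' );
	}
	return $canonical;
} );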
The best approach is to optimize the DB so these huge queries run faster. I'm not sure of everything that's involved, but when done properly it can help quite a bit. It may also be necessary to improve the hardware resources. It will take some analysis to determine where the bottleneck is; whatever it is likely needs to be upgraded.
We can probably make the error go away by increasing the allowed execution time. This is not much help because I don’t think search bots are going to wait a long time for results. Even if it’s not an error on your server, Google bot will likely record a crawl error anyway.
We could try optimizing the code, but we'd again have to determine where the bottleneck is. One thing we can do is cache the term arrays, since they will need to be referred to a number of times. This will help with subsequent sitemap requests, but the initial request that populates the cache still needs to happen, and it will definitely take a while.
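(A minimal sketch of that caching idea using WP transients; the key name, the one-day expiry, and the dblterm_query_terms() helper are all hypothetical:)

// Return the (expensive) term array, caching the result for a day so
// only the first request after the cache expires pays the full cost.
function dblterm_get_cached_terms( $taxonomy ) {
	$key   = 'dblterm_terms_' . $taxonomy; // hypothetical cache key
	$terms = get_transient( $key );

	if ( false === $terms ) {
		$terms = dblterm_query_terms( $taxonomy ); // hypothetical slow query
		set_transient( $key, $terms, DAY_IN_SECONDS );
	}

	return $terms;
}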
Thanks for the term data, that’s helpful. I’m going back to my corner now to see if I can figure out how to generate sitemap files on the fly. I’ve a pretty good idea, just not sure yet if it works.
Alright, I may have found a solution. I managed to change those URLs (/model/audi/) into /used-cars/model/audi/ for the entire site, so right now that's how they generate into the sitemap. So we can exclude single term links; that's sorted.

The reason I actually needed hide_empty to be set to false is mainly that I wanted to be certain it's possible to get all 90k links into a sitemap, because I eventually want to export about 30k car posts from the previous site into the one we're currently working on. I can't tell if these 30k car posts are assigned to all the terms, but I'm sure it will get to that point in the near future. The Google bot issue makes perfect sense; I didn't think of that. Somebody did actually mention canonicals, I just didn't take it seriously because I couldn't fully understand what they meant. We will revisit this issue, as I need more clarity on it; for now let's focus on the get_terms() issue, it's giving me a lot of sleepless nights.

Right! I haven't ever optimized the DB for this site. I've installed and removed a lot of plugins, not to mention post revisions; they could've caused a lot of bloat. I'll do a thorough optimization of the DB and see if that changes anything. I was also planning on a VPS hosting plan for this site; could that help?

Yeah, you're right. Increasing the execution time won't help when the links get crawled.

Hmm, how do we cache the term arrays? I think we should also try that. Let's rather fail knowing we tried every possibility.

Alright, let's try everything we can think of. Take your time.
You have sleepless nights but tell me to take my time! You are too kind :)
All those dual term links will certainly not fit in a single sitemap, but that's not what we are doing; we'll take as many pages as necessary. There's tons of room on the index page for additional pages, so it will not be a problem getting all the links to fit. The only possible glitch is me ensuring there is a clean break between pages, with no missing or redundant links. I'll get it right eventually.

That does bring up a good point about caching, though. A huge array of term data could take up a lot of memory! I was planning on using WP's built-in data cache. Since it's easy to turn hide_empty on and off, we can test caching with all terms, but generate sitemaps with used terms. If memory is an issue, I could use transient data as storage; it takes up DB space, but retrieving it will still be faster than querying for terms that meet certain criteria.

Thinking about this more, memory ought to be OK. I forget how little space text takes up and how much memory servers have these days. I doubt 1200 terms will crack 1 MB. Peanuts!
VPS ought to be a good improvement over typical shared hosting. Even though physically it’s still shared resources, I believe hosts do not load up their VPS servers nearly as much as they do with shared. Where does this site reside now? Shared hosting? Or localhost (your PC)? We may need to up the execution time in order to test caching. If it works on either of these, it’ll certainly work on VPS. Once that’s confirmed we can go back to hide_empty and 30s execution.
Lately I’ve been sort of idling to some extent, waiting for a project to start. Well, it just started. This means I’ll have even less time to work on your sitemaps, but I will try to optimize that time so you will have something in a reasonable amount of time. You’ve been very patient and easy going with this, thank you for that.
Hahaha.. Well, good things require time!
Of course. I'd like to see how you split the links with code; I never thought that was possible.

Alright! That's great, so we're good on memory.

I'll go for VPS then. Currently the site is on localhost. I usually work on sites on localhost and only upload them when everything is good; I've had bad experiences editing live sites.

Alright, not a problem! Oh, thanks to you, I would never have been able to figure this out on my own. If you don't mind me asking, what kind of projects do you usually work on?
Yes, localhost development is definitely the way to go. Way better even than hosted test/staging sites. Working on live sites is the absolute worst. I think everyone has had bad experiences doing that.
What I usually do these days is typically minor tweaks to sites I’ve built in the past. There’s occasionally a new site to build. But what I like doing is solving unusual coding puzzles. Your issue piqued my interest, I see it as a puzzle challenge.
As you probably know, I also volunteer my time here in the WP forums to help support the WP community. It’s a good place to find coding puzzles and help people out at the same time. I don’t normally do much coding for anyone here beyond writing up little short examples, but when I smell a good puzzle I’m willing to go beyond my usual limits. I typically just tell people how to do something. I’ve solved their puzzle, coding it is merely documentation. Sometimes it’s easier to code it than explain it though.
What I’m willing to do for people depends a lot on each individual. If they are really trying to handle it themselves, I don’t mind helping out. Some people don’t even try. Not being a coder is no excuse to not even try AFAIC. I’m not inclined to help people much if they just want it done for them without at least making an effort for themselves. They can go hire someone else to help them in that case.
None of this is related to my current project. I have a home rental where the tenants are changing over and there’s quite a bit of maintenance to do when no one’s in the way. I don’t make enough from this rental for me to hire it all out. It mostly has to be sweat equity by me.
Enough about me. I have the site mapping figured out! (I hope.) It works on my site, anyway; we'll have to see how it does on yours. It ended up being in the form of a custom plugin, and I'll explain why in a while. Thus you shouldn't need any of the code you currently have that's directly related to this. To avoid conflicts, you should disable it by commenting out any add_action() and add_filter() lines. Don't delete it just yet.

It's a one-file plugin. You can get the contents from pastebin.com. Create a new, empty file named double-term.php, either directly in the plugins folder or in a new folder in plugins named double-term. Paste the contents from pastebin into this file and save it. Go to the plugins admin screen and activate it. It's called "double-term", natch.

The rest of this is going to be long. (Like it hasn't already been long :) ) Best not to start in on this unless you have the time. It's a general overview of how everything works, so you will want to go through it, just maybe not right this minute.

BTW, the simplest way I've found to flush the Yoast sitemap cache is to make a minor change to the max entries per sitemap number, then save. You also need to reload the sitemaps in your browser or clear its cache to see new results. My plugin does not use the Yoast number for entries per sitemap; the number it uses is hardcoded into the file as a constant definition (line 28). It's currently 1000; you can change it to whatever seems reasonable, up to around 48k. This defined number is not an absolute maximum, more of a general target: the plugin will continue adding links over that number until a full set of locations has been output, which makes it easier to break cleanly between pages. So anticipate that nearly the total count of all locations could be added over and above whatever number you choose.
Since you're interested, the page breaks are figured starting at line 67. The number of models to loop through is calculated from the max items number, taking into account how many locations there are. This determines the offset into the models array for the current page and how many models to loop through; it could be only one model if there are many locations and a low max items count.
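(As a rough illustration of that page-break arithmetic; this is a sketch with hypothetical names, not the plugin's actual code:)

// Each model is paired with every location, so a page targeting roughly
// $max_items links must cover whole multiples of the location list.
function dblterm_models_for_page( array $models, array $locations, $page_number, $max_items = 1000 ) {
	$models_per_page = max( 1, (int) floor( $max_items / count( $locations ) ) );
	$offset          = ( $page_number - 1 ) * $models_per_page;

	return array_slice( $models, $offset, $models_per_page );
}

// e.g. with 128 locations and max 1000 items, each page covers
// floor(1000 / 128) = 7 models, i.e. 7 * 128 = 896 links per page.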
You can see in the subsequent loops that the last modified date for all URLs will be 3.5 days prior to the current date. We talked initially about using a weekly update frequency; if that were truly the case, the modification dates would average out to about 3.5 days ago. You can alter this at line 82 if you like: put in 0.01 or similar to be very recent, or 30 for a month ago. Yoast does not output frequency, so I changed my output to match theirs.
I found a way to register our sitemap type with Yoast, which means Yoast manages the redirects and caching for us; I didn't have to fuss with either of those! What basically happens when one of our sitemaps is requested is that Yoast sees the request and hijacks it from WP so Yoast can handle the request itself. Yoast first looks for a cached version, stored as transients (more about these later) in the DB. If that fails, Yoast fires a particular action hook that our sitemap function has been added to, causing our function to run.

Once our function has generated the requested sitemap, it is set as the sitemap for Yoast to send as the response. Yoast handles the headers and the ?xml tag that specifies version, encoding, stylesheets, etc.; then our sitemap follows. The sitemap filenames for these double term links follow the same structure as the other Yoast sitemaps, using "dblterm" as the Yoast map type. These sitemap files will look like dblterm-sitemap#.xml, where # is an integer page number. The count can go up to any allowable integer, like 88888 if necessary.
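(The registration is probably something along these lines. A hedged sketch: WPSEO_Sitemaps::register_sitemap() and set_sitemap() existed in Yoast versions of that era, and Yoast exposes a sitemap_n query var for the page number, but the callback names here are hypothetical and the exact API may differ by version:)

add_action( 'init', function () {
	global $wpseo_sitemaps;
	if ( is_a( $wpseo_sitemaps, 'WPSEO_Sitemaps' ) ) {
		// Tell Yoast about our "dblterm" map type; Yoast then routes
		// requests for dblterm-sitemap#.xml to our build callback and
		// handles redirects, caching, and headers itself.
		$wpseo_sitemaps->register_sitemap( 'dblterm', 'dblterm_build_sitemap' );
	}
} );

function dblterm_build_sitemap() {
	global $wpseo_sitemaps;
	$n   = (int) get_query_var( 'sitemap_n' ); // requested page number
	$xml = '...';                              // generate the <urlset> for page $n
	$wpseo_sitemaps->set_sitemap( $xml );      // hand the result back to Yoast
}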
The plugin code has not been stress tested. I’m not sure what will happen if you set the max. items number to something silly, like 2, causing the need for thousands of pages. It ought to be OK, but it has not been tested. I suggest something reasonable, anything from 2000 to 48000.
Starting at line 95 is a callback function for the 'wpseo_sitemap_index' filter that you initially discovered, which supplies the index links. This function also gets the terms, in order to get a total count of them all. Much like how the page breaks were figured, the number of sitemap pages needed to handle all the links is determined. Then those pages' filenames are generated and output to the sitemap index page. As with the URLs, the last modified date is assumed to be 3.5 days ago; this can be adjusted on line 115.
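(In sketch form, that index callback does something like the following. The helper functions are hypothetical, but wpseo_sitemap_index is the real Yoast filter, and the <sitemap> entries are what the index schema expects, in contrast to the <url> entries we were wrongly appending before:)

add_filter( 'wpseo_sitemap_index', 'dblterm_index_links' );

function dblterm_index_links( $index ) {
	// dblterm_get_models()/dblterm_get_locations() are hypothetical
	// wrappers around the term-fetching routine described above.
	$total   = count( dblterm_get_models() ) * count( dblterm_get_locations() );
	$per_map = 1000; // the line-28 constant
	$pages   = (int) ceil( $total / $per_map );
	$lastmod = gmdate( 'c', time() - (int) ( 3.5 * DAY_IN_SECONDS ) );

	for ( $n = 1; $n <= $pages; $n++ ) {
		$loc    = home_url( "/dblterm-sitemap{$n}.xml" );
		$index .= "<sitemap><loc>{$loc}</loc><lastmod>{$lastmod}</lastmod></sitemap>\n";
	}

	return $index;
}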
Starting at line 123 is my version of get_terms(). You can see the initial use of WP_Term_Query is still there, but commented out. Despite that working, I went with a direct SQL query; I figured it ought to be a little faster. I also reduced what it returns, since we only need the slugs. The hide_empty argument requires more than just slugs, but that's still less than the full term objects normally returned by default. I don't know if it'll be any faster, but it'll certainly be easier on memory.
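(The underlying query is presumably along these lines; a sketch of the approach, not the plugin's actual SQL:)

// Fetch just slug, ID, parent, and count for one taxonomy: the minimum
// needed to build URLs and evaluate hide_empty for hierarchical terms.
global $wpdb;
$rows = $wpdb->get_results( $wpdb->prepare(
	"SELECT t.term_id, t.slug, tt.parent, tt.count
	 FROM {$wpdb->terms} t
	 INNER JOIN {$wpdb->term_taxonomy} tt ON tt.term_id = t.term_id
	 WHERE tt.taxonomy = %s
	 ORDER BY t.slug ASC",
	'vehicle_model'
) );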
By default, unused terms are hidden (removed, actually). The script that does this can be commented out should you want all terms returned. There is no formal argument; you need to directly alter the code to take it out of play. It's pretty simple to do: the terminating */ is already in place, so you merely need to add an asterisk between the existing // of the comment at line 144, which tells you this same thing. Once you've checked that all terms can be handled, just remove that asterisk to restore the normal hide_empty behavior that search bots require.

The last few lines of the function reassign the array so all indices are consecutive again after unused terms are removed. This is necessary for my for(){} loop to work correctly. Other extraneous data is also removed so that only term slugs remain.
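(That cleanup step is essentially this, in sketch form:)

// Drop everything but the slugs, then renumber the keys 0..n-1 so a
// for ( $i = 0; $i < count( $slugs ); $i++ ) loop can walk the array.
$slugs = array_values( wp_list_pluck( $terms, 'slug' ) );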
Remember that I said earlier in this "dissertation" that Yoast caches sitemap data as transients in the DB? Transients are data stored in the DB short term. An expiration time is set, after which the data is no longer available and its space can be reallocated for other use. If required, the space can also be reclaimed prior to expiration, so you wouldn't want to store anything important as a transient. For caching, however, this is ideal: if the data is there, great; if not, we can always regenerate it.

Relatively speaking, this is long term caching; it may be available all day long. Compare that with short term caching, which is only valid for the current request. In fact, on most installations, where DB space is not in short supply, the data remains in place long after scripts are no longer allowed to access it. WP only flushes out transients during updates. When certain apps make extensive use of transients, this stale data accumulates at a significant rate even though it is completely useless to scripts.

Since Yoast makes frequent use of transients, I've devised a little script (i.e. copied it from WP core) to flush out expired transients daily, to keep the amount of stale data down. This flushing (or garbage collection, as it's sometimes called) routine uses the WP scheduled event feature to run automatically every day. In order to properly initiate this scheduled event, I've tied it into the activation of this plugin.

Remember I promised earlier to explain why I created a plugin for all of this? This is why: when the plugin activates, the daily schedule is started, and if you should deactivate it, the daily schedule is removed.
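(The activation/deactivation wiring presumably looks something like this. A sketch: the dblterm_gc hook name is hypothetical, and the transient-deletion SQL is the part adapted from the cleanup WP core performs during upgrades:)

register_activation_hook( __FILE__, function () {
	if ( ! wp_next_scheduled( 'dblterm_gc' ) ) {
		wp_schedule_event( time(), 'daily', 'dblterm_gc' );
	}
} );

register_deactivation_hook( __FILE__, function () {
	wp_clear_scheduled_hook( 'dblterm_gc' );
} );

add_action( 'dblterm_gc', function () {
	global $wpdb;
	// Delete expired transients along with their timeout rows.
	// '_transient_timeout_' is 19 chars, so SUBSTRING(..., 20) yields the key.
	$wpdb->query( $wpdb->prepare(
		"DELETE a, b FROM {$wpdb->options} a
		 INNER JOIN {$wpdb->options} b
		   ON b.option_name = CONCAT( '_transient_', SUBSTRING( a.option_name, 20 ) )
		 WHERE a.option_name LIKE %s
		   AND a.option_value < %d",
		$wpdb->esc_like( '_transient_timeout_' ) . '%',
		time()
	) );
} );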
Well, that pretty much covers everything. Give this thing a try! I fully expect something to not be right. That’s to be expected with new code that’s only been tested on one installation, by one person – the developer. Just let me know what issues crop up and I’ll try to figure out a solution.
Here is what I would suggest to start testing:

1. As I said, disable any previous code you have related to this. This plugin, along with Yoast, should handle all sitemap related functionality.
2. Set the max items per page in this plugin at line 28. Pick a number that will work out to roughly a couple dozen or so sitemaps when hide_empty is true. It should be at least the total number of terms used in one taxonomy, whichever is greater. IIRC, 2-3k might be good.
3. Go to Yoast's sitemap admin page and change the count there slightly (or by more, it's not used by the double term script), then save. This will flush the sitemap cache.
4. Follow the link above this labeled XML Sitemap (you will actually end up at sitemap_index.xml).
5. Once the index loads, reload it, because you probably had the browser's cached version. Sometimes reloading isn't enough; for good measure, explicitly flush your browser's cache.
6. Follow the links to a few of the dblterm-sitemap#.xml pages. As these are new "files", you probably don't need to reload, but do so anyway for good measure (unless you explicitly flushed your browser cache).
If you encounter timeout errors, open your wp-config.php file and add a line like this:

set_time_limit( 60 ); // default 30

The number is the time in seconds to allow for script completion. Even if you have this problem on localhost, on a VPS webserver it likely will not be a problem. Besides setting this in wp-config.php, you can also set it in php.ini, using different syntax:

max_execution_time = 60

If you use php.ini, you need to restart the server for the change to take effect. Don't be tempted to use a big number to cover any possibility; if anything happens like an infinite loop, you'll have to wait that long before you can use your computer again. 60 seconds is not too bad, 600 seconds would drive you insane!

In the sitemap, the first location slug should always be the same on each page. The same goes for the final location slug, but it'll be a different term. The plugin will want to output all the location terms at least once per page. Depending on the max items count, it may go through the complete list a number of times. There should be one model term for every complete pass through the locations list. If the max items count is relatively small, there may be only one model per page (and a LOT of pages!).
So far so good? Pick a good max items count in anticipation of unused terms not being hidden, or leave it as is to get many more pages. Edit the plugin file to place an asterisk at line 144 between the // so that unused terms are returned as well. Do some more testing. If that doesn't break anything, remove the asterisk to restore the prior conditions.

You can fiddle with the max items count to find a happy medium between items per page and the number of sitemaps listed on the index page. That should do it for now, if everything checks out. If you end up having speed issues even on VPS, I can tell you right now there's little that can be done to speed up the get terms routine; any speed improvement will have to come from optimizing the DB somehow. DB queries will not occur too often, because Yoast will mainly get its sitemap data from cache. Only when it needs to refresh the cache after it becomes stale will it need to re-query the DB for all terms.
Let me know how this all works out. Enjoy!
- The topic ‘How to properly loop through these external URLs to get them into the sitemap’ is closed to new replies.