Major bug with WordPress + New Relic

If you haven’t seen New Relic yet, definitely take a look. It’s awesome.

That being said… if you are running New Relic monitoring on a huge production site, your logs might be spinning out of control without you knowing it, and I’ll explain why.

We use New Relic at eMusic to monitor our WordPress servers and many other application servers – Web Services, Databases etc – and we started noticing this in the logs like whoa: failed to delete New Relic auto-RUM

In New Relic’s FAQ section:

Beginning with release 2.6 of the PHP agent, the automatic Real User Monitoring (auto-RUM) feature is implemented using an output buffer. This has several advantages, but the most important one is that it is now completely accurate for all frameworks, not just Drupal and WordPress. The old mechanism was fraught with problems and highly sensitive to things like extra Drupal modules being installed, or customization of the header format. Using this new scheme, all of the problems go away. However, there is a down-side, but only for specific PHP code. This manifests itself as a PHP notice that PHP failed to delete buffer New Relic auto-RUM in…. If you do not have notices enabled, you may not see this and depending on how your code is written, you may enter an infinite loop in your script, which will eventually time out, and simply render either an empty or a partial page.

To understand the reason for this error and how it can create an infinite loop in code that previously appeared to work, it is worth reading the PHP documentation on the ob_start() PHP function. Of special interest is the last optional parameter, which is a boolean value called erase that defaults to true. If you call ob_start() yourself and pass in a value of false for that argument, you will encounter the exact same warning and for the same reason. If that variable is set to false, it means that the buffer, once created, can not be destroyed with functions like ob_end_clean(), ob_get_clean(), ob_end_flush() etc. The reason is that PHP assumes that if a buffer is created with that flag that it modifies the buffer contents in such a way that the buffer cannot be arbitrarily stopped and deleted, and this is indeed the case with the auto-RUM buffer. Essentially, inside the agent code, we start an output buffer with that flag set to false, in order to prevent anyone from deleting that buffer. It should also be noted that New Relic is not the only extension that does this. The standard zlib extension that ships with PHP does the same thing, for the exact same reasons.

We have had several customers that were affected by this, and in all cases it was due to problematic code. Universally, they all had code similar to the following:

while (ob_get_level()) {
  ob_end_flush ();
}

The intent behind this code is to get rid of all output buffers that may exist prior to this code, ostensibly to create a buffer that the code has full control over. The problem with this is that it will create an infinite loop if you use New Relic, the zlib extension, or any buffer created with the erase parameter set to false. The reason is pretty simple. The call to ob_get_level() will eventually reach a point where it encounters a non-erasable buffer. That means the loop will never ever exit, because ob_get_level() will always return a value. To make matters worse, PHP tries to be helpful and spit out a notice informing you it couldn’t close whatever the top-most non-erasable buffer is. Since you are doing this in a loop, that message will be repeated for as long as the loop repeats itself, which could be infinitely.
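To make the FAQ's point concrete, here is a minimal sketch of what PHP does when asked to remove such a buffer. It is not New Relic's code: the non-removable buffer below is only a stand-in for auto-RUM, and the flag constants assume PHP 5.4 or later, where the third argument to ob_start() is a bitmask rather than the erase boolean. The variable names are made up for illustration.

```php
<?php
// Stand-in for the auto-RUM buffer: started without the "removable" flag.
$before = ob_get_level();
ob_start( null, 0, PHP_OUTPUT_HANDLER_STDFLAGS ^ PHP_OUTPUT_HANDLER_REMOVABLE );

// PHP refuses to remove it: the call returns false (and raises a notice,
// suppressed here with @), and the nesting level stays where it was.
$removed = @ob_end_flush();
$after   = ob_get_level();

// A loop that checks the return value bails out instead of spinning forever.
$attempts = 0;
while ( ob_get_level() > $before ) {
	$attempts++;
	if ( ! @ob_end_flush() ) {
		break; // hit a buffer that cannot be removed
	}
}
```

Checking the return value of ob_end_flush() is what keeps the loop bounded. The original while ( ob_get_level() ) version ignores it, so it keeps retrying the same buffer.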

So basically, you’re fine as long as you don’t try to flush every output buffer in a loop. If you do, the loop runs straight into New Relic’s non-erasable buffer. The same thing can bite you if you are managing several nested output buffers of your own. But the problem might not be your code at all. It could be WordPress.

Line 250 of wp-includes/default-filters.php:

add_action( 'shutdown', 'wp_ob_end_flush_all', 1 );

What does that code do?

/**
 * Flush all output buffers for PHP 5.2.
 *
 * Make sure all output buffers are flushed before our singletons are destroyed.
 *
 * @since 2.2.0
 */
function wp_ob_end_flush_all() {
	$levels = ob_get_level();
	for ($i=0; $i<$levels; $i++)
		ob_end_flush();
}

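So that’s not good. The function calls ob_end_flush() once for every buffer level, including New Relic’s non-erasable one, so PHP logs the notice each time the shutdown hook runs. We found our culprit (if we were having the problem New Relic describes above). How to fix it?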

I put this in wp-content/sunrise.php:

<?php
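// Unhook WordPress's default handler, which flushes every buffer level.
remove_action( 'shutdown', 'wp_ob_end_flush_all', 1 );

// Flush every buffer except the outermost one, which is assumed to be New Relic's.
function flush_no_new_relic() {
	$levels = ob_get_level();
	for ( $i = 0; $i < $levels - 1; $i++ ) {
		ob_end_flush();
	}
}

add_action( 'shutdown', 'flush_no_new_relic', 1, 0 );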

This leaves the outermost output buffer, New Relic’s, in place, so PHP never tries to delete it. An esoteric error, but something to be aware of if you are monitoring WordPress with New Relic.
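The fix above assumes New Relic’s buffer is the outermost one and the only one that cannot be removed. If that might not hold (zlib output compression is also enabled, say), a variant that inspects each buffer before flushing it could be safer. The following is only a sketch under some assumptions: PHP 5.4 or later, where ob_get_status() reports a flags bitmask, and New Relic’s buffer showing up there without the PHP_OUTPUT_HANDLER_REMOVABLE bit, which I have not verified against the agent. The function name flush_removable_buffers is hypothetical.

```php
<?php
// Hypothetical alternative: flush from the top of the stack down, but stop
// at the first buffer PHP reports as not removable.
function flush_removable_buffers() {
	while ( ob_get_level() > 0 ) {
		$status = ob_get_status(); // describes the top-most buffer only
		$flags  = isset( $status['flags'] ) ? $status['flags'] : 0;
		if ( ! ( $flags & PHP_OUTPUT_HANDLER_REMOVABLE ) ) {
			break; // New Relic, zlib, or any other non-removable buffer
		}
		if ( ! ob_end_flush() ) {
			break; // never spin if a flush fails for some other reason
		}
	}
}

// Only hook in when WordPress is loaded, so the file can also run on its own.
if ( function_exists( 'add_action' ) ) {
	remove_action( 'shutdown', 'wp_ob_end_flush_all', 1 );
	add_action( 'shutdown', 'flush_removable_buffers', 1, 0 );
}
```

Because the check happens before the call to ob_end_flush(), this version should not trigger the notice at all. Any removable buffers sitting underneath a non-removable one are left for PHP to flush when the request ends.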