Ep #1: WordPress things you should never do

Here is a somewhat random list of rules containing things I hope you never do in PHP or WordPress.

Do not ignore your error_ and access_ logs

If you don’t have your Apache error_log running 100% of the time you are coding, your site is probably full of bugs. Your error_log should only contain intentional messages you have included to benchmark or log specific activity. It should NEVER contain PHP Errors, Warnings, or Notices. Those should be corrected immediately. Once your site rolls to production and you start getting a lot of traffic, monitor your logs there as well. Production will expose every edge case you could possibly dream up.

You should regularly check your Apache access_log as well for requests to your domain that return HTTP status codes of 400 or higher. If any of your files or scripts return 400, 404, or 500 inside WordPress, you can get into a loop of infinite recursion that your web server will choke on after about 10 trips through that circle of hell.
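To spot those in an Apache combined-format access_log (a sketch using awk, where the 9th field is the status code; adjust the path for your setup):

tail -f /opt/local/apache2/logs/access_log | awk '$9 >= 400'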

To tail your error_log for PHP-only errors, etc (using MacPorts):

tail -f /opt/local/apache2/logs/error_log | grep "PHP " &

Also make sure your wp-config.php contains the following:

// turn on all errors
error_reporting( -1 );
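While you are in there, WordPress’ own debug constants are worth setting too. A minimal sketch for a development box (WP_DEBUG, WP_DEBUG_LOG, and WP_DEBUG_DISPLAY are real core constants):

// force notices / warnings on
define( 'WP_DEBUG', true );
// also write them to wp-content/debug.log
define( 'WP_DEBUG_LOG', true );
// keep them out of your rendered markup
define( 'WP_DEBUG_DISPLAY', false );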

Don’t try to access Array / Object properties that don’t exist

90% of the errors that inexperienced PHP developers generate are based on doing something like this:

// $data might be empty
$data = give_me_what_I_want();

echo $data[0]['items'][7]->fail();

The second you try to access the 0th index of $data, your code will fire off a PHP Notice. DO NOT IGNORE IT. You have 2 choices with arrays to avoid this:

// check that the whole path is cool

if ( isset( $data[0]['items'][7] ) )
     $data[0]['items'][7]->fail();

// be safe all the way down

if ( !empty( $data[0]['items'] ) )
    if ( isset( $data[0]['items'][7] ) )
        if ( is_a( $data[0]['items'][7], 'SomeObj' ) )
            $data[0]['items'][7]->fail();

You don’t have to get ridiculous, but you also don’t have to be careless.

isset() won’t throw any log notices when you’re checking for depth in an array that may not have it.

empty() is a great utility. It returns true for 0, an empty string, an empty array, false, or null, which makes it the best way to check a variable that might be an array but might be nothing. You should always check an array for empty before iterating over it; this prevents an error from being thrown when you start your foreach loop:

// do this

if ( !empty( $might_be_empty_array ) ) foreach ( $might_be_empty_array as $item ) {
    blame_nacin( $item );
}

// not this

foreach ( $im_empty_and_logging_it as $item ) {
     hope_its_not_empty( $item );
}

Don’t use mysql_real_escape_string

If you think you can just call this function out of nowhere to escape some text for use in a SQL statement, your server is probably barfing and you don’t know it:

$whoops = mysql_real_escape_string( 'escape me!' );

// your logs are blowing up while server
// tries to connect to MySQL with default connection params
// because you didn't pass in an open connection to MySQL
// as the 2nd parameter

// use me instead

$cool = $wpdb->escape( 'escape me!' );
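Better still, skip manual escaping where you can and let $wpdb->prepare() handle escaping and quoting in one shot (the table and values here are just for illustration):

global $wpdb;

// %s and %d placeholders are escaped and quoted for you
$sql = $wpdb->prepare(
    "SELECT ID FROM {$wpdb->posts} WHERE post_title = %s AND post_author = %d",
    'escape me!',
    42
);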

Most functions that start with mysql_* belong to PHP’s old procedural MySQL extension. Like the Memcache functions, most of them want the connection, or current resource instance, passed in as an argument; omit it and PHP tries to reuse or open a default connection. Even if you did pass in the resource, you should be using the mysqli functions instead (MySQL Improved extension, which offers both procedural and object-oriented interfaces). WordPress uses mysql_* everywhere, so I guess forget what I just said….

ereg_*

This should be blatantly obvious by now, but if you aren’t a whiz at RegEx and you Google your way to the wrong site, you may end up pasting in some deprecated POSIX regular expressions when you should be using PCRE (Perl-Compatible Regular Expressions).

POSIX = ereg_* functions
PCRE = preg_* functions
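If you do inherit a POSIX pattern, the port is usually mechanical: wrap the pattern in delimiters and switch functions. A quick before / after sketch:

// deprecated POSIX style:
// if ( ereg( '^[0-9]+$', $input ) ) { ... }

// PCRE equivalent; note the added / delimiters
if ( preg_match( '/^[0-9]+$/', $input ) ) {
    // do something with a numeric string
}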

serialize

I am going to dare you to name me a good reason to create serialized arrays in your code. Since you can’t find one, I am going to ask you how great of an idea it is to store serialized arrays in the database. Since you don’t know the answer, I am going to ask you how knowledgeable you are about ASCII, UTF-*, and ISO-*. Since you have no clue why I am asking you that, I want you to heed my warning: DO NOT USE SERIALIZED ARRAYS. If you do use serialized arrays, store NUMBERS and CHARs only. DO NOT STORE STRINGS of any length that contain any amount of whitespace.

Here’s why:

Serialized arrays that contain strings bind those strings to their byte length, making them as non-portable as possible. If you import / export data and have ANY weird characters that were pasted in from Word or worse, your strings may become invisibly altered and won’t match your bound string lengths. This matters because the way we get that data back into a usable format is unserialize, and unserialize fails easily and often when dealing with weird strings.
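A quick illustration of that length binding (the array is made up, the byte counts are not):

$data = serialize( array( 'bio' => 'Björk rocks' ) );
// a:1:{s:3:"bio";s:12:"Björk rocks";} <- "ö" is 2 bytes, hence s:12

// if an import mangles "ö" into "o", the declared length no longer matches:
var_dump( unserialize( 'a:1:{s:3:"bio";s:12:"Bjork rocks";}' ) ); // bool(false)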

In WordPress, maybe_unserialize is a function invoked to do this to strings stored in the wp_*meta tables. maybe_unserialize will fail as well, easily and often, but does so….SILENTLY.

You might ask yourself, when would I ever do this anyways?

Does this look familiar?

$data = array( 'format' => 'long', 'color' => 'red' );
update_post_meta( $post_id, 'stuff', $data );

My example is harmless – but let’s say you are using Post Meta to store some SEO text, you are susceptible to maybe_unserialize failing. maybe_unserialize won’t return mangled text when this happens, it will return quite literally nothing.

Another huge problem is the version of MySQL you might be running. MySQL 5.5 is WAY more forgiving of invisible characters and illegal bytes in UTF-8 strings. If you are running any flavor of MySQL 5.1.* and you import data from a MySQL 5.5 database, you may get bombarded with foreign characters you didn’t have in your 5.5 environment. If you were storing that data in serialized arrays, those characters will cause unserialize to fail.

Be careful using $_SESSION

Using $_SESSION with Memcached can be palatable, but for most people that don’t know what they are doing, PHP sessions can be extremely problematic. In almost every case, PHP will store your session_id() for a user as a Cookie, and it will then throw no-cache headers with every request. Which sucks.

When no-cache is triggered by the session_start() function, this is what the HTTP headers look like:

Cache-Control: no-cache, must-revalidate, max-age=0
Pragma: no-cache

Those headers instruct your browser to re-request the page every time you visit it. Luckily for us, this little nugget tells sessions not to use a cache limiter, which also allows Batcache to work. Without it, Batcache will not work:

To turn off the no-cache headers:

session_cache_limiter( false );
session_start();

“switch_to_blog( )” is an Unfunny Nightmare

It can be extremely challenging to write WordPress code that works across many environments and allows you to use a custom location for WordPress media, WordPress multisite media, and WordPress core itself. I wrote about this extensively here. The code I included in that post works, although a lot of it was excerpted to be shown as a short example here or there. If your site switches from one blog to the next and doesn’t intermingle content – you really don’t have much to worry about after your initial setup. But in almost all cases, if you want to start using switch_to_blog() to co-mingle content from multiple sites inline, get ready to do some debugging!

switch_to_blog() works like so:

// I am on my blog

switch_to_blog( 2 );

// I am on your mom's blog

restore_current_blog();

// I am on my blog

Simple enough. This will switch context all over the place in WP core. wp_posts will become wp_2_posts. get_current_blog_id() will return 2 because the global variable $blog_id will be set to 2, etc.

In my post about environments, I had a lot of filters that I suggested adding in wp-content/sunrise.php that work just great in switching initial context to a specific blog and in setting up overrides for default media upload paths etc. They work fine… unless you ever plan on using switch_to_blog().

Here’s an example:

// Before
add_filter( 'pre_option_home', function ( $str ) use ( $domain ) {
    return 'http://' . $domain;
} );

// After using switch_to_blog and realizing I needed to account
// for any sort of weird context I may find myself in

add_filter( 'pre_option_home', function () {
    global $current_blog;
    $extra = rtrim( $current_blog->path, '/' );
    return 'http://' . MY_ENVIRONMENT_HOST . $extra;
} );

Ok cool, so you pull a path from a global variable and append it if it still has a value after being rtrim‘d? Easy. That would be true if $current_blog and its properties were updated every time switch_to_blog() is called. They are not!

$current_blog is one of the values we set in wp-content/sunrise.php to override the database and set our $current_blog->domain value to our current environment. $current_blog and $current_site exist mainly for use on the initialization of Multisite. Outside of startup, they aren’t really accessed or modified.

Because I want a static way to access dynamic variables, I have added some code to change the context of $current_blog when switch_to_blog() is called in wp-content/sunrise.php:

function emusic_switch_to_blog( $blog_id, $prev_blog_id = 0 ) {
    if ( $blog_id === $prev_blog_id )
        return;

    global $current_blog, $wpdb;
    $current_blog = $wpdb->get_row( $wpdb->prepare( "SELECT * FROM {$wpdb->blogs} WHERE blog_id = %d LIMIT 1", $blog_id ) );
    $current_blog->domain = MY_ENVIRONMENT_HOST;
}

// $the_id is the blog ID resolved earlier in sunrise.php
emusic_switch_to_blog( $the_id );

add_action( 'switch_blog', 'emusic_switch_to_blog', 10, 2 );

Now that I have added that action, my filter pre_option_home will work. I use the same method for pre_option_siteurl. What I was previously doing to retrieve the path of the current blog didn’t work:

add_filter( 'pre_option_siteurl', function () {
    $extra = rtrim( get_blog_details( get_current_blog_id() )->path, '/' );
    return 'http://' . EMUSIC_CURRENT_HOST . $extra;
} );

Why didn’t it work? get_blog_details() eventually does $details->siteurl = get_blog_option( $blog_id, 'siteurl' ); giving us a nice and hearty dose of infinite recursion. So to combat it, I implemented the setting of $current_blog on the switch_blog action so its properties are always the current blog’s properties. Boom.

The next 2 annoyances are media upload urls / paths and admin paths. We use custom media locations for the main blog and network blogs:

add_filter( 'pre_option_upload_path', function () {
    $id = get_current_blog_id();
    if ( 1 < $id )
        return $_SERVER['DOCUMENT_ROOT'] . "/blogs/{$id}/files";

     return $_SERVER['DOCUMENT_ROOT'] . '/' . EMUSIC_UPLOADS;
} );

add_filter( 'pre_option_upload_url_path', function () {
    $id = get_current_blog_id();
    if ( 1 < $id )
        return 'http://' . EMUSIC_CURRENT_HOST . "/blogs/{$id}/files";

     return 'http://' . EMUSIC_CURRENT_HOST  . '/' . EMUSIC_UPLOADS;
} );

We switched blog context using switch_to_blog(), which triggers our action, which then sets the global variable $blog_id to our current blog’s id, which can then be retrieved from within functions by get_current_blog_id(). Yeah, that’s a mouthful. We are also using a constant for our host name, so we don’t have to query for it / output-buffer it / or str_replace() it.

The admin is a little bit trickier because we usually don’t call switch_to_blog() in the code, but the admin bar will list your blogs and has URLs to their Dashboards etc. We can filter those as well:

function get_admin_host( $url, $path, $blog_id = '' ) {
    $path = ltrim( $path, '/' );

    if ( empty( $blog_id ) ) {
        $blog_id = get_current_blog_id();
    }

    $blog_path = rtrim( get_blog_details( $blog_id )->path, '/' );
    return sprintf( 'http://%s%s/wp-admin/%s', EMUSIC_CURRENT_HOST, $blog_path, $path );
}

add_filter( 'admin_url', 'get_admin_host', 10, 3 );

function get_network_admin_host( $url, $path, $blog_id = '' ) {
    $path = ltrim( $path, '/' );

    if ( empty( $blog_id ) ) {
        $blog_id = get_current_blog_id();
    }

    $blog_path = rtrim( get_blog_details( $blog_id )->path, '/' );
    return sprintf( 'http://%s%s/wp-admin/network/%s', EMUSIC_CURRENT_HOST, $blog_path, $path );
}

add_filter( 'network_admin_url', 'get_network_admin_host', 10, 3 );

If you want to see this in action, I recently integrated the eMusic editors’ 17 Dots blog into eMusic proper, check it out: 17 Dots. You can see multiple blogs intertwining on the homepage.

Major bug with WordPress + New Relic

If you haven’t seen New Relic yet, definitely take a look, it’s awesome.

That being said… if you are running New Relic monitoring on a huge production site, your logs might be spinning out of control without you knowing it, and I’ll explain why.

We use New Relic at eMusic to monitor our WordPress servers and many other application servers – Web Services, Databases etc – and we started noticing this in the logs like whoa: failed to delete New Relic auto-RUM

In New Relic’s FAQ section:

Beginning with release 2.6 of the PHP agent, the automatic Real User Monitoring (auto-RUM) feature is implemented using an output buffer. This has several advantages, but the most important one is that it is now completely accurate for all frameworks, not just Drupal and WordPress. The old mechanism was fraught with problems and highly sensitive to things like extra Drupal modules being installed, or customization of the header format. Using this new scheme, all of the problems go away. However, there is a down-side, but only for specific PHP code. This manifests itself as a PHP notice that PHP failed to delete buffer New Relic auto-RUM in…. If you do not have notices enabled, you may not see this and depending on how your code is written, you may enter an infinite loop in your script, which will eventually time out, and simply render either an empty or a partial page.

To understand the reason for this error and how it can create an infinite loop in code that previously appeared to work, it is worth reading the PHP documentation on the ob_start() PHP function. Of special interest is the last optional parameter, which is a boolean value called erase that defaults to true. If you call ob_start() yourself and pass in a value of false for that argument, you will encounter the exact same warning and for the same reason. If that variable is set to false, it means that the buffer, once created, can not be destroyed with functions like ob_end_clean(), ob_get_clean(), ob_end_flush() etc. The reason is that PHP assumes that if a buffer is created with that flag that it modifies the buffer contents in such a way that the buffer cannot be arbitrarily stopped and deleted, and this is indeed the case with the auto-RUM buffer. Essentially, inside the agent code, we start an output buffer with that flag set to false, in order to prevent anyone from deleting that buffer. It should also be noted that New Relic is not the only extension that does this. The standard zlib extension that ships with PHP does the same thing, for the exact same reasons.

We have had several customers that were affected by this, and in all cases it was due to problematic code. Universally, they all had code similar to the following:

while (ob_get_level()) {
  ob_end_flush ();
}

The intent behind this code is to get rid of all output buffers that may exist prior to this code, ostensibly to create a buffer that the code has full control over. The problem with this is that it will create an infinite loop if you use New Relic, the zlib extension, or any buffer created with the erase parameter set to false. The reason is pretty simple. The call to ob_get_level() will eventually reach a point where it encounters a non-erasable buffer. That means the loop will never ever exit, because ob_get_level() will always return a value. To make matters worse, PHP tries to be helpful and spit out a notice informing you it couldn’t close whatever the top-most non-erasable buffer is. Since you are doing this in a loop, that message will be repeated for as long as the loop repeats itself, which could be infinitely.

So basically, you’re cool as long as you don’t try to flush all of the output buffers in a loop, because you will end up hitting New Relic’s non-erasable buffer. It’s also problematic if you are managing several of your own nested output buffers. But the problem might not be your code; the problem could be WordPress itself.

Line 250 of wp-includes/default-filters.php:

add_action( 'shutdown', 'wp_ob_end_flush_all', 1 );

What does that code do?

/**
 * Flush all output buffers for PHP 5.2.
 *
 * Make sure all output buffers are flushed before our singletons are destroyed.
 *
 * @since 2.2.0
 */
function wp_ob_end_flush_all() {
	$levels = ob_get_level();
	for ( $i = 0; $i < $levels; $i++ )
		ob_end_flush();
}

So that’s not good. We found our culprit (if we were having the problem New Relic describes above). How to fix it?

I put this in wp-content/sunrise.php:

<?php
remove_action( 'shutdown', 'wp_ob_end_flush_all', 1 );

function flush_no_new_relic() {
	$levels = ob_get_level();
	for ( $i = 0; $i < $levels - 1; $i++ )
		ob_end_flush();
}

add_action( 'shutdown', 'flush_no_new_relic', 1, 0 );

This preserves New Relic’s final output buffer. An esoteric error, but something to be aware of if you are monitoring WordPress with New Relic.

WordPress + Web Services

Most WordPress setups exist on a hosting service (Shared or VPS) – one server talking to one database, boom, done. As sites grow or big sites start to use WordPress, setups grow to many servers and a handful of databases, a handful of Memcached boxes, and maybe some external Web Services or 3rd-party APIs. If the site is heavy on Editorial content, you may never interact with external services. But if your dynamic pages pull content from many APIs, internal and external, and you don’t want to write your entire site in JavaScript, it’s time to learn some tricks of the trade for powering and scaling many requests at a time over HTTP.

Web Services

So what’s a “Web Service”? At its simplest, a Web Service is a bunch of related URLs that return data, and the Service is usually RESTful – meaning, “Clients initiate requests to servers; servers process requests and return appropriate responses.”

http://www.woo.com/get/151
http://www.woo.com/hi-five?id=12345&friend_id=54321
http://www.woo.com/hi-five/delete?id=676767

Responses are typically in XML or JSON. It sucks to parse XML, so most APIs either return JSON by default or allow you to specify format=json in the request. Most also go as far as to return JSONP if you so desire, so you can make requests to external servers without ever leaving the confines of JavaScript.

Why does JSON rule so much?

<?php

// this produces an (associative) array
$data = json_decode( $response, true );

Parsing XML into a PHP array can look like this:

<?php
require( 'wordpress/wp-load.php' );

ini_set( 'memory_limit', '2048M' );
ini_set( 'max_execution_time', 6000 );

function walk( $arr, $func ) {
    if ( !is_array( $arr ) ) return $arr;

    foreach ( $arr as $key => $value ) {
        if ( is_array( $value ) ) {
            $arr[$key] = walk( $value, $func );
        } else if ( is_string( $value ) ) {
            $arr[$key] = call_user_func( $func, $value );
        }
    }
    return $arr;
}

$z = new XMLReader;
$z->open( 'blob-of.xml' );

while ( $z->read() && $z->name !== 'someNode' );

while ( $z->name === 'someNode' ) {
    $node = new SimpleXMLElement( $z->readOuterXML() );
    $data = (array) $node;
    $data = walk( $data, 'trim' );

    // nightmare of casting nested SimpleXMLElements into Arrays
    // etc etc

    $z->next( 'someNode' );
}
$z->close();

JSON is just way easier and ready-made for PHP.

Services expose APIs (Application Programming Interfaces) that are used to build requests. You should be familiar with available APIs on the web by now: Facebook Connect, Twitter, Google Maps, etc.

WP_Http

WordPress has an API for dealing with HTTP – procedural and object-oriented. It is also an abstraction layer, meaning that it will pick from a variety of HTTP connection methods based on your current setup. I’m going to spend my time introducing you to cURL, but WordPress doesn’t require that you have the cURL extension installed. The order in which it looks for connection methods is cURL, streams (fopen), and fsockopen.

<?php

$url = 'http://www.google.com/a-url/that/returns/json';
$request = new WP_Http; // or use the $wp_http global
$result = $request->request( $url );
$json = $result['body'];

$data = json_decode( $json, true );
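If you’d rather not instantiate the class yourself, the procedural wrappers do the same dance and add error handling (wp_remote_get() and friends are core functions):

$response = wp_remote_get( 'http://www.google.com/a-url/that/returns/json' );

if ( ! is_wp_error( $response ) ) {
    $data = json_decode( wp_remote_retrieve_body( $response ), true );
}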

cURL

cURL (or, “Client for URLs”) is a command line tool for getting or sending files using URL syntax. Its most bare-bones use is to return the response of a URL. In Terminal:

curl www.emusic.com

This will return the contents of whatever “www.emusic.com” returns. To just read the HTTP Response Headers:

curl -I www.emusic.com

You can post from the command line as well:

curl -d "foo=bar&pass=12345&id=blah" http://www.woo.com/post-to-me

You can even make an XML-RPC call from the command line like so:

curl -i -H 'Content-Type: text/xml' --data '<?xml version="1.0"?><methodCall><methodName>demo.sayHello</methodName><params></params></methodCall>' 'http://www.woo.com/xmlrpc.php'

cURL Extension

The cURL extension for PHP is, at its core, a binding to libcurl. libcurl currently supports the http, https, ftp, gopher, telnet, dict, file, and ldap protocols. libcurl also supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploading (this can also be done with PHP’s ftp extension), HTTP form-based upload, proxies, cookies, and user+password authentication.

Because the cURL extension has external dependencies, it is good to use a package manager to install it (MacPorts on OS X, yum on Linux, or PECL on either). You can run the following command using MacPorts to install some PHP 5.4 extensions for HTTP:

sudo port install php54-curl php54-http

cURL Abstraction

We could start from scratch and write our own interface to cURL, or we could find someone who has done all of the dirty work for us. We are still going to write our own methods for talking to a cURL class, but we want to find a source that has already implemented the basics of cURL and cURL Multi.

WordPress has done this with the WP_Http class to an extent (no support or abstraction for cURL Multi). I use a variation on this guy’s code – an Object-Oriented cURL class that supports this form of “multi-threading.” The approach is very clean because it allows you to use the same interface to request one or multiple URLs:

// One URL
$curl = new CURL();
$curl->retry = 2;
$opts = array( CURLOPT_RETURNTRANSFER => true, CURLOPT_FOLLOWLOCATION => true  );
$curl->addSession( 'http://yahoo.com/', $opts );
$result = $curl->exec();
$curl->clear();

// 3 URLs
$curl = new CURL();
$opts = array( CURLOPT_RETURNTRANSFER => true, CURLOPT_FOLLOWLOCATION => true  );
$curl->addSession( 'http://yahoo.com/', $opts );
$curl->addSession( 'http://google.com/', $opts );
$curl->addSession( 'http://ask.com/', $opts );
$result = $curl->exec();
$curl->clear();

cURL Multi

cURL Multi has a procedural API in the cURL PHP extension, and it’s pretty nasty to deal with. There are standard ways it is implemented, which we took care of by using our OO class abstraction of it. The guts look like this:

<?php
$mh = curl_multi_init();

// Add all sessions to multi handle
foreach ( $this->sessions as $i => $url )
    curl_multi_add_handle( $mh, $this->sessions[$i] );

do
    $mrc = curl_multi_exec( $mh, $active );
while ( $mrc == CURLM_CALL_MULTI_PERFORM );

while ( $active && $mrc == CURLM_OK ) {
    if ( curl_multi_select( $mh ) != -1 ) {
	do
	    $mrc = curl_multi_exec( $mh, $active );
	while ( $mrc == CURLM_CALL_MULTI_PERFORM );
    }
}

It’s just a busy loop that eventually ends up returning an array of responses. You don’t get your response until all of the requests have completed, so you are only as fast as your slowest link.
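For completeness, here is a sketch of how the responses typically get collected once that loop exits (assuming CURLOPT_RETURNTRANSFER was set on every handle, as in the sessions above):

// grab each handle's body, then detach it and clean up
$responses = array();
foreach ( $this->sessions as $i => $handle ) {
    $responses[$i] = curl_multi_getcontent( $handle );
    curl_multi_remove_handle( $mh, $handle );
}
curl_multi_close( $mh );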

Batch Requests

Almost every page on eMusic’s website is the result of multiple Web Service requests. Making these requests serially (one at a time) takes way too long, so we obviously opt to use cURL Multi. I wrote a wrapper class for our methods simply called API.

Here are some methods it contains:

<?php
class API {
    // .....
    private function parse_response( $str ) {
	if ( is_string( $str ) && ( '404 Not Found' === $str || 0 === stripos( $str, 'Not Found' ) ) )
	    return $str;

	if ( is_array( $str ) )
	    return $str;

	$str = trim( $str );

	$arr = array();
	/**
	 * Do a cheesy RegEx scan to indicate that it looks like JSON
	 */
	if ( preg_match( self::JSON_REGEX, $str ) ) {
	    $str = preg_replace( '/\t|\s\s+/', ' ', $str );
	    /**
	     * The Bjork problem...
	     * ö, etc can break JSON deserialization,
	     * so we need to clean up the JSON response
	     */
	    if ( ! seems_utf8( $str ) ) {
		$str = emusic_fix_bad_chars( $str );
	    }

	    $arr = json_decode( $str, true );

	    if ( ! empty( $arr ) )
		return $arr;
	}

	/**
	 * Only return documents, not the string version of an empty array
	 */
	if ( null === $arr )
	    return $str;
    }

Take an array of URLs and make the requests in batch – the result is an array of responses:

public function multi_request( $urls ) {
	$this->curl->sessions = array();

	$opts = array( CURLOPT_ENCODING => 'identity' );

	$this->_add_existing_headers( $opts );

	foreach ( $urls as $u )
		$this->curl->addSession( $u, $opts );

	$result = $this->do_request( true );

	return $result;
}

Pass a URL and array of params and make a GET request:

public function get( $url, $params = '', $cache = true, $ttl = 0 ) {
    if ( ! empty( $params ) )
	$url = self::url( $url, $params );

    self::save_url( $url );

    $cached = null;

    if ( $cache )
	$cached = Cache::get( $this->bucketize( $url ), $url );

    if ( $cached ) {
	return $cached;
    } else {
	$this->curl->sessions = array();

	$opts = array( CURLOPT_TIMEOUT => 8 );

	$this->_add_existing_headers( $opts );

	$this->curl->addSession( $url, $opts );
	$result = $this->do_request();

	if ( $result && ! $this->is_http_error() ) {
	    $result = $this->parse_response( $result );
	    $this->add_to_cache( $url, $result, $ttl );
	    return $result;
	}
    }
}

Pass a URL and array of params and make a POST request:

public function post( $url, $args = '' ) {
	$this->curl->sessions = array();

	$opts = array(
		CURLOPT_POST => 1,
		CURLOPT_SSL_VERIFYHOST	=> 0,
		CURLOPT_SSL_VERIFYPEER	=> 0,
	);

	if ( ! empty( $args ) )
		$opts[CURLOPT_POSTFIELDS] = http_build_query( $args );
	else
		$opts[CURLOPT_POSTFIELDS] = '';

	$this->_add_existing_headers( $opts );

	$this->curl->addSession( $url, $opts );
	$response = $this->do_request();

	if ( empty( $response ) && ! $this->is_http_error() ) {
		return null;
	} else {
		return $this->parse_response( $response );
	}
}

Caching

When dealing with Web Services, you often (almost always) want to cache your responses for an arbitrary or specific amount of time. At eMusic, we have a large number of Web Services that often do very expensive things. The purpose of breaking functionality out into services is being able to scale and optimize each service individually outside of the realm of the core application logic. As long as the APIs remain unchanged, you can mess with the backend of your service and not affect my use of it.

Some of our services include: catalog-service, ratings-service, save-for-later-service, user-service, pricing-service, auth-service and the list goes on and on.

To take advantage of caching while making batch requests, I wrote another method that does the following:

  1. Iterate through an array of URLs
  2. If it’s in the cache, remove that item from batch list
  3. After all items have been checked against the cache, make a batch request if any URLs are still in the queue
  4. For each request in the batch that is made, md5 the URL and use it as the key and the response as a value to place in the cache.
  5. If all of your URLs are cached, you make zero requests and just return the cached responses in an array

Here’s an example of how the logic works, your implementation will probably differ:

<?php

.....
public function batch( $urls, $ttl = 0, $usecache = true ) {
    $response = array();
    ob_start();

    if ( empty( $ttl ) )
	$ttl = CACHE::API_CACHE_TTL;

    array_map( 'API::save_url', $urls );

    $batch_urls = array();
    if ( is_array( $urls ) ) {
	if ( $usecache ) {
	    foreach ( $urls as $index => $url ) {
		$in = Cache::get( $this->bucketize( $url ), $url );

		if ( $in ) {
		    $response[$index] = $in;
		} else {
		    $batch_urls[$index] = $url;
		}
	    }
	} else {
	    // cache bypassed: request everything
	    $batch_urls = $urls;
	}
    }
    $calls = $this->multi_request( $batch_urls );

    if ( is_array( $calls ) && count( $calls ) > 0 ) {
	$keys = array_keys( $batch_urls );
	$calls = array_combine( $keys, array_values( $calls ) );

	foreach ( $calls as $index => $c ) {
	    if ( $c ) {
		$response[$index] = $this->parse_response( $c );
		$this->add_to_cache( $batch_urls[$index], $response[$index], $ttl );
	    } else {
		$response[$index] = null;
	    }
	}
    } else if ( ! empty( $calls ) && ! empty( $batch_urls ) ) {

	reset( $batch_urls );
	$index = key( $batch_urls );
	$data = $this->parse_response( $calls );
	$this->add_to_cache( $batch_urls[$index], $data, $ttl );

	$response[$index] = $data;
    }
    ob_end_clean();

    return $response;
}
....

I have abstracted the Cache class because I sometimes use my API class without using WordPress. In my eMusic implementation, I use wp_cache_set and wp_cache_get, but this allows me to implement Cache however I want in other environments.
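For reference, here is a minimal sketch of what that Cache facade can look like inside WordPress (the class and method names are my own; it just delegates to the object cache, using the md5-of-URL key scheme from the batch steps above):

class Cache {
    const API_CACHE_TTL = 300;

    public static function get( $group, $url ) {
        return wp_cache_get( md5( $url ), $group );
    }

    public static function set( $group, $url, $value, $ttl = self::API_CACHE_TTL ) {
        return wp_cache_set( md5( $url ), $value, $group, $ttl );
    }
}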

Request Map

Making batch requests is all well and good, but I still have to loop through the responses to do anything with them. If my batch request was for expanded catalog metadata on 10 separate albums, then a loop works just fine. If I am collecting all of the data I need for an entire page by making all of my service calls in batch, I might want to associate a callback with each individual URL. To accomplish this, I wrote a class called Request Map.

Here is a Request Map instance which makes all of the calls we need from Services to populate data on the homepage of eMusic:

<?php
$map = new RequestMap();
$map->add( get_twitter_url( 'eMusic' ), function ( $data ) {
    global $tweets;
    $tweets = $data;
} );
$map->add( get_charts( array( 'perPage' => 10 ) ), function ( $data ) {
    global $chart;

    if ( !empty( $data['albums'] ) ) {
        $chart = $data['albums'];
    }
} );
$map->add( get_trending_artists( array( 'perPage' => 5, 'return' => true ) ), function ( $data ) {
    global $trending;
    if ( !empty( $data['artists'] ) ) {
        $trending = $data['artists'];
    }
} );
$map->send();

RequestMap::add takes a URL, a callback that receives the response data as a PHP array, and an optional array of extra params for the callback. Every time RequestMap::send is called, a batch request is fired for any URLs in the queue. When finished, the appropriate data is passed to each callback. RequestMap::flush will reset the queue:

<?php
$map = new RequestMap();
$map->add( 'http://woo.com/1', function ( $data ) {....} );
$map->add( 'http://woo.com/2', function ( $data ) {....} );
$map->add( 'http://woo.com/3', function ( $data ) {....} );
$map->send();
$map->flush();
$map->add( 'http://woo.com/4', function ( $data ) {....} );
$map->add( 'http://woo.com/5', function ( $data ) {....} );
$map->add( 'http://woo.com/6', function ( $data ) {....} );
$map->send();

This is necessary if the second set of requests requires data from the first since PHP does not have true threading. Here is the code for my RequestMap class:

<?php
class RequestMap {
/**
 * @var array
 */
private $requests;
/**
 * @var array
 */
private $responses;
/**
 * @var int
 */
private $ttl;
/**
 * @var boolean
 */
private $useCache = true;
/**
 * @var boolean
 */
protected $error = false;

public function __construct() {
    $this->flush();
}

/**
 * @return boolean
 */
public function is_error() {
    return $this->error;
}

public function flush() {
    $this->requests = array();
}

/**
 * @return int
 */
public function getTtl() {
    if ( empty( $this->ttl ) )
        $this->ttl = CACHE::API_CACHE_TTL;

    return $this->ttl;
}

public function setTtl( $ttl ) {
    $this->ttl = $ttl;
}

/**
 * @return boolean
 */
public function getUseCache() {
    return $this->useCache;
}

public function setUseCache( $usecache ) {
    $this->useCache = $usecache;
}

/**
 *
 * @param string $url
 * @param callable $callback
 * @param array $vars
 */
public function add( $url, $callback, $vars = array() ) {
    $params = new stdClass();
    $params->url = $url;
    $params->callback = $callback;
    $params->params = (array) $vars;
    $this->requests[] = $params;
}

/**
 *
 * @return array
 */
private function getRequestUrls() {
    return array_map( function ( $item ) {
	return $item->url;
    }, $this->requests );
}

/**
 *
 * @param stdClass $item
 * @param array $response
 */
private function exec( $item, $response ) {
    $params = array_merge( array( $response ), $item->params );
    call_user_func_array( $item->callback, $params );
}

public function send() {
    if ( ! empty( $this->requests ) ) {
	$this->responses = API()->batch( $this->getRequestUrls(), $this->getTtl(), $this->useCache );

	if ( is_array( $this->responses ) ) {
	    foreach ( $this->responses as $i => $response ) {
		if ( ! empty( $this->requests[$i] ) ) {
	            $this->exec( $this->requests[$i], $response );
		}
	    }
        }
    }

    $this->flush();
}

}

It’s really not that complicated.

Conclusion

I would advise using methods like the ones listed above instead of relying on JavaScript and JSONP to talk to external services. Pages can be served super-fast out of Memcached if set up properly, and your page load times will be faster – perceived or otherwise – with as little initial JavaScript as possible.

Term / Taxonomy is broken in WordPress

I meant to include this in my MySQL post, but making a separate post to address this one particular issue works just fine. In my previous post, I said that the Term tables are a mess. Here is why:

Let’s say I create 2 taxonomies, location and radio_station (FYI: an example of a “taxonomy” is a “tag” or a “category”). Here’s how I register them (simplified) in my theme or plugin:

<?php
register_taxonomy( 'radio_station', 'post', array( 'labels' => array(
    'name'          => 'Radio Stations',
    'singular_name' => 'Radio Station'
) ) );
register_taxonomy( 'location', 'post', array( 'labels' => array(
    'name'          => 'Locations',
    'singular_name' => 'Location'
) ) );

Once this is done, you should have extra links in the Posts fly-out.

Click each link (Tags, Radio Stations, Locations) and add “Brooklyn” as a term. This is a completely valid practice. Now change ONE of them to “Brooklyn’s Finest” and go back and check out the other 2. THEY ALL CHANGED! So that sucks, right?

Why does this happen? Terms are stored in wp_terms and are constrained by uniqueness. So when you added Brooklyn that second time – even though it had a different taxonomy – it pointed at that first term, and more importantly, shackled itself to it forever by associating it with that first term’s term_id in the wp_term_taxonomy table.
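You can watch it happen from code, too (a hypothetical sketch using the taxonomies registered above):

// insert "Brooklyn" into 3 different taxonomies...
$tag      = wp_insert_term( 'Brooklyn', 'post_tag' );
$station  = wp_insert_term( 'Brooklyn', 'radio_station' );
$location = wp_insert_term( 'Brooklyn', 'location' );

// ...and all 3 share ONE row in wp_terms: the term_id values match,
// while each taxonomy gets its own term_taxonomy_id in wp_term_taxonomy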

wp_term_taxonomy is the first-class table of the bunch (wp_terms, wp_term_taxonomy, and wp_term_relationships). Terms are arbitrary. They are associated with taxonomies in the wp_term_taxonomy table. term_taxonomy_id is the PRIMARY KEY, and term_id is the foreign reference. Weirdly, taxonomy is not a foreign reference; there is no wp_taxonomy table! wp_term_relationships joins term_taxonomy_id with Post IDs.

So let’s say you have 1,000,000 terms but only 5 taxonomies. If 90,000 of them are post_tags, the taxonomy field for all of them will be post_tag. The opposite is not true: 5 taxonomies point at only ONE Brooklyn. In this scenario, shared terms make no sense. Changing the name of the term stored at term_id in wp_terms will change the name for each and every taxonomy that term_id is associated with.

Discuss.

Understanding MySQL in WordPress

MySQL

You would be surprised how many people I have interviewed for a job who think their PHP chops are great, but their MySQL, not so much. Many great WordPress developers I know have said the same thing. Not cool. Bad SQL can bring down your site! Forget crazy scaling challenges: basic WP Query usage can bring your site down if you aren’t careful. WordPress is a framework that provides you with a lot out of the box, but you are still going to end up writing a bunch of your own PHP to make it work the way you want. The same should be true, but often isn’t, for SQL queries.

As is usually the case, for basic usage (e.g. your mom’s blog), don’t mess with SQL. If you work on an enterprise-level site and have all kinds of monitoring and benchmarking tools in place, let’s mess with some SQL 🙂

Schema

WordPress at its core has a tiny and non-complex database schema.

  • Posts, Comments, Users
  • Post-, Comment-, User- meta
  • Options
  • Links
  • Terms, Term Taxonomy, Term Relationships

The Posts table (wp_posts) is a one-size-fits-all data store for “content.” Posts, Custom Post Types, whatever. At its core, it stores a title, a blob of text (your blog post), and possibly an excerpt. It also stores timestamps, a reference to the user who wrote it, and the post_type. Everything else that has to do with Posts is relegated to the wp_postmeta table. The Post Meta table (wp_postmeta) is a key / value store that arbitrarily stores anything you want. Like, literally ANYTHING. The meta_value column is of type LONGTEXT, which can store up to 4GB of data. LONGTEXT is a poor choice for key / value pairs – more in a bit.

The Comments table (wp_comments) stores the contents of comments, a reference to the user, and some additional metadata about anonymous users (User Agent, IP, etc). The Comment Meta table (wp_commentmeta) is identical to the Post Meta table, albeit with different column names.

The Users table (wp_users) stores arbitrary information about your site’s registered users: email address, username, the username as a slug, the hashed password, etc. The User Meta table (wp_usermeta) is identical to the Post Meta table, albeit with different column names (noticing a trend?).

(We’ve already covered over 50% of the default schema…)

The Options table (wp_options) is identical to the Post Meta table, albeit with different column names…

The Links table (wp_links) is “probably” not used to its full extent by most casual bloggers. It probably isn’t used by anyone using WordPress as a CMS. The Links product seems like a labor of love from the early days of WordPress (IMHO) and shows Matt‘s fondness for XFN.

The Term tables (wp_terms, wp_term_taxonomy, wp_term_relationships) are, in my opinion, a mess and are easier to understand through the API methods for interacting with Terms. More on this later…

SQL-generation

Like most things in WordPress, SQL “just happens,” and you don’t have to know anything about it beforehand to start working with WP. You could actually go your entire life using WordPress and never once need to know anything about SQL. That’s if your entire life using WordPress is in a lightweight manner – a personal blog that doesn’t need to scale. The second you start “needing to scale” is the second you start “needing to know SQL.” Why? I thought WordPress was written by geniuses who never make mistakes? Well, sorta true, sorta very not true. WordPress works perfectly and scales perfectly for probably 99% of use cases. The 1% of cases where it needs some help are those sites that get millions of page views a day / month and sites that use complex database queries across millions of rows, which is most enterprise-level sites.

There is a lot of code in WordPress that talks to the database for you, but almost all of that code at its core reaches the same end result: a string of SQL that will get passed to some function as a database query.

There are a couple of common ways this code is generated.

WP Query

WP Query is one of WordPress’ most powerful tools. You can pass all kinds of request params to WP Query and receive in return a collection of rows containing your site’s blog posts / data. The WP_Query class will get posts for you. It will do JOINs for you and resolve foreign ids. While it can be optimized to an extent, it can’t read your mind, and it can’t predict the future. It also can’t stop you from making bad decisions or from writing slow queries.

Many parts of WP Query, get_posts(), and the like are filterable. There are all kinds of places to hook in and alter portions of generated SQL.

Here’s an example of filterable SQL in WordPress Core:

<?php
// line 2570 of query.php
// WP_Query->get_posts

$where	  = apply_filters_ref_array( 'posts_where_paged', array( $where, &$this ) );
$groupby  = apply_filters_ref_array( 'posts_groupby',	  array( $groupby, &$this ) );
$join	  = apply_filters_ref_array( 'posts_join_paged',  array( $join, &$this ) );
$orderby  = apply_filters_ref_array( 'posts_orderby',	  array( $orderby, &$this ) );
$distinct = apply_filters_ref_array( 'posts_distinct',	  array( $distinct, &$this ) );
$limits	  = apply_filters_ref_array( 'post_limits',       array( $limits, &$this ) );
$fields	  = apply_filters_ref_array( 'posts_fields',	  array( $fields, &$this ) );

// line 2589
// for cache plugins, but wide open for everyone
$where	  = apply_filters_ref_array( 'posts_where_request',	array( $where, &$this ) );
$groupby  = apply_filters_ref_array( 'posts_groupby_request',	array( $groupby, &$this ) );
$join	  = apply_filters_ref_array( 'posts_join_request',	array( $join, &$this ) );
$orderby  = apply_filters_ref_array( 'posts_orderby_request',	array( $orderby, &$this ) );
$distinct = apply_filters_ref_array( 'posts_distinct_request',	array( $distinct, &$this ) );
$fields	  = apply_filters_ref_array( 'posts_fields_request',	array( $fields, &$this ) );
$limits	  = apply_filters_ref_array( 'post_limits_request',	array( $limits, &$this ) );

// line 2614
// screw it, let's filter the resulting string of SQL
$this->request = apply_filters_ref_array('posts_request', array( $this->request, &$this ) );

Hopefully this shows the following things:

  • You can filter every part of every query which has its origins in WP_Query->get_posts()
  • You can make every query better if you know what you are doing
  • You can make every query worse if you don’t know what you are doing
  • In the end, you can clobber everything that any plugin has done by just arbitrarily altering the SQL string

WP Query is useful if you want to do something like a Taxonomy Query and have it “just work.” Let’s look at a WP Query I wrote at eMusic and then dissect it:

<?php

function get_stories( $post_types = array() ) {
    global $regions_map, $editorial_types;
    $key = 'stories-' . get_region();
    if ( !empty( $post_types ) ) {
        $key = 'stories-' . get_region() . '-' . md5( join( '', $post_types ) );
    }
    $group = 'editorial';
    $response = wp_cache_get( $key, $group );

    if ( empty( $response ) ) {
        $params = array(
            'posts_per_page'=> 10,
            'order'         => 'DESC',
            'post_type'     => $editorial_types,
            'orderby'       => 'date',
            'tax_query'     => array(
                array(
                'taxonomy'  => 'region',
                'field'     => 'term_id',
                'terms'     => array(
                    $regions_map[ 'ALL' ],
                    $regions_map[ strtoupper( get_region() ) ]
                ),
                'operator'  => 'IN'
                )
            ),
            'no_found_rows'             => true,
            'update_post_meta_cache'    => false,
            'update_post_term_cache'    => false
        );

        if ( !empty( $post_types ) ) {
            $params['post_type'] = $post_types;
        }

        $posts = new WP_Query( $params );

        if ( $posts->have_posts() ) {
            wp_cache_set( $key, $posts->posts, $group, 1800 );
            $response = $posts->posts;
        }
    }

    return $response;
}

In case it isn’t immediately obvious, I am requesting the 10 most recent posts written by our Editorial staff in reverse-chronological order filtered by the taxonomy “region” and including a subset of our large family of custom post types.

If my request is in the cache, awesome – serve it from there. Otherwise, make a WP Query request and cache the resulting posts.

Let’s list a number of optimizations I did:

  • Turned off SQL_CALC_FOUND_ROWS
  • Turned off slurping of post meta
  • Turned off slurping of terms
  • Populated a map of term_name => term_id for regionalization
  • Taking advantage of the WP cache abstraction layer

This particular WP Query won’t slurp post meta and terms, but it is a good idea to turn that off everywhere you know you do not want them to be requested.

Let’s look at the SQL queries that make up my optimized request:

// hop to retrieve term_taxonomy_id
// for regions because Tax Query won't take term_taxonomy_id
// as the value for "field"
SELECT term_taxonomy_id FROM wp_term_taxonomy WHERE taxonomy = 'region' AND term_id IN (1308,1306)

// a JOIN on the Term Relationships table to get posts in the
// requested region
SELECT   wp_posts.* FROM wp_posts  INNER JOIN wp_term_relationships ON (wp_posts.ID = wp_term_relationships.object_id) WHERE 1=1  AND ( wp_term_relationships.term_taxonomy_id IN (1307,1309) ) AND wp_posts.post_type IN ('emusic_feature', 'emusic_spotlight', 'book_feature', 'book_spotlight', 'emusic_bookshelf', 'emusic_qa', 'book_qa', 'emusic_jukebox_jury', 'emusic_who', 'emusic_select', 'emusic_about', 'emusic_guide', 'emusic_list', 'emusic_stack', 'emusic_radio_program', 'emusic_review', 'book_review', 'emusic_hub', 'book_hub', 'emusic_list_hub', 'emusic_guide_hub', 'emusic_six_degrees', 'book_six_degrees', 'emusic_icon', 'book_icon') AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 10
2 queries, 565ms.

If I want to affect these queries, I can hook into WP Query with the filters I showed above. The problem I have with doing that is this: I don’t want to write a bunch of functions just to remove a few clauses from one SQL statement:

<?php
// function to make my site really slow
function my_where_filter( $where ) {
    $my_where = preg_replace( '/this is overkill/', '', $where );
    return $my_where;
}

add_filter( 'posts_where_paged', 'my_where_filter' );

// all kinds of other stuff runs...
// other plugins collide with me...
// all I am doing is changing text...

$rows = result_of_slow_query();
remove_filter( 'posts_where_paged', 'my_where_filter' );

Imagine if I was filtering even 2 or 3 more clauses. What a mess…

Now, let’s remove my optimizations and see what happens when WP Query just runs like normal:

<?php

function get_stories( $post_types = array() ) {
	global $regions_map, $editorial_types;

	$params = array(
	    'posts_per_page'=> 10,
	    'order'         => 'DESC',
	    'post_type'     => $editorial_types,
	    'orderby'       => 'date',
	    'tax_query'     => array(
	        array(
	        'taxonomy'  => 'region',
	        'field'     => 'name',
	        'terms'     => array( 'ALL', strtoupper( get_region() ) ),
	        'operator'  => 'IN'
	        )
	    )
	);

	if ( !empty( $post_types ) ) {
	    $params['post_type'] = $post_types;
	}

	$posts = new WP_Query( $params );

	if ( $posts->have_posts() ) {
	    $response = $posts->posts;
	}

	return $response;
}

This version of the WP Query produces:

// hop to get term_id for regions
SELECT wp_term_taxonomy.term_id
FROM wp_term_taxonomy t
INNER JOIN wp_terms USING (term_id)
WHERE taxonomy = 'region'
AND wp_terms.name IN ('all','us')

// term_id is unusable, so go back and grab term_taxonomy_id
SELECT term_taxonomy_id FROM wp_term_taxonomy WHERE taxonomy = 'region' AND term_id IN (1308,1306)

// finally make the request
SELECT SQL_CALC_FOUND_ROWS  wp_posts.* FROM wp_posts  INNER JOIN wp_term_relationships ON (wp_posts.ID = wp_term_relationships.object_id) WHERE 1=1  AND ( wp_term_relationships.term_taxonomy_id IN (1307,1309) ) AND wp_posts.post_type IN ('emusic_feature', 'emusic_spotlight', 'book_feature', 'book_spotlight', 'emusic_bookshelf', 'emusic_qa', 'book_qa', 'emusic_jukebox_jury', 'emusic_who', 'emusic_select', 'emusic_about', 'emusic_guide', 'emusic_list', 'emusic_stack', 'emusic_radio_program', 'emusic_review', 'book_review', 'emusic_hub', 'book_hub', 'emusic_list_hub', 'emusic_guide_hub', 'emusic_six_degrees', 'book_six_degrees', 'emusic_icon', 'book_icon') AND (wp_posts.post_status = 'publish' OR wp_posts.post_status = 'private') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 10

// we asked for 10, but I guess go ahead and do this in case
// we want to indicate pagination or something
SELECT FOUND_ROWS()
4 queries, 658ms.

So we got this request from 4 database queries down to 2 using my optimizations. One of the optimizations that helped was $regions_map: passing term_ids to Tax Query skips the name-to-term_id lookup entirely. If we look closely at the first 2 queries in the non-optimized WP Query, we notice that they can be combined.

// produces term_id, so we have to make another database query
SELECT wp_term_taxonomy.term_id
FROM wp_term_taxonomy t
INNER JOIN wp_terms USING (term_id)
WHERE taxonomy = 'region'
AND wp_terms.name IN ('all','us')

// produces term_taxonomy_id, skip the second call
SELECT wp_term_taxonomy.term_taxonomy_id
FROM wp_term_taxonomy t
INNER JOIN wp_terms USING (term_id)
WHERE taxonomy = 'region'
AND wp_terms.name IN ('all','us')

I have another optimization called $regions_tax_map:

<?php

// term_id - for use with WP Query
$regions_map = array(
    'EU'    => 1303,
    'CA'    => 1304,
    'UK'    => 1305,
    'US'    => 1306,
    'ALL'   => 1308
);

// term_taxonomy_id, for when we want to write our own SQL
$regions_tax_map = array(
    'EU'    => 1304,
    'CA'    => 1305,
    'UK'    => 1306,
    'US'    => 1307,
    'ALL'   => 1309
);

When I snarked above that I thought the Term tables in WordPress are a mess, this is why. For any database call to have any teeth when joining with wp_term_relationships (as it inevitably does when you are using taxonomy to filter by region, etc), you have to have a term’s term_taxonomy_id for the lookup. There are many cases where this is not practical – a site that has 100s of models / makes of car in the term tables, but something like regional site code that only has 5 variations can be mirrored outside of the database and placed into code for quick access.

So let’s get that optimized WP Query down to only ONE call to the database by removing the GROUP BY clause and posts with a status of “private,” and make use of term_taxonomy_id shortcuts:

function get_stories( $post_types = array() ) {
    global $regions_tax_map, $editorial_types, $wpdb;
    $key = 'stories-' . get_region();
    if ( !empty( $post_types ) ) {
        $key = 'stories-' . get_region() . '-' . md5( join( '', $post_types ) );
    }
    $group = 'editorial';
    $response = wp_cache_get( $key, $group );

    if ( empty( $response ) ) {
        $types = empty( $post_types ) ? $editorial_types : $post_types;
        $joined = sprintf( "'%s'", join( "','", $types ) );

        $sql =<<<SQL
         SELECT wp_posts.* FROM wp_posts
         INNER JOIN wp_term_relationships
         ON (wp_posts.ID = wp_term_relationships.object_id)
         WHERE wp_term_relationships.term_taxonomy_id
            IN (
                {$regions_tax_map['ALL']},
                {$regions_tax_map[ strtoupper( get_region() ) ]}
            )
            AND wp_posts.post_type IN ({$joined})
            AND wp_posts.post_status = 'publish'
         ORDER BY wp_posts.post_date DESC
         LIMIT 0, 10
SQL;

         $response = $wpdb->get_results( $sql );

         if ( !empty( $response ) ) {
             wp_cache_set( $key, $response, $group, 1800 );
         }
    }

    return $response;
}
ONE query, 42ms.

As you can see, we are making ONE query that JOINs wp_posts and wp_term_relationships. We skip WP Query completely. We skip all 840 lines of WP_Query->get_posts(). Removing the GROUP BY clause drops the query time from 565ms to 42ms. HUGE.

We debugged the queries. We realized that all of that black voodoo is there to make sure 99% of the human population can get what they want from WP Query, but we have scalability to think about. We want to optimize ALL of our interactions with the database.

This request is a SMALL portion of our sidebar on pages that support it on eMusic’s website. However, I wasn’t originally caching this call. I was relying on the built-in WordPress post cache to take care of it. This ONE module almost brought down our site on one of our busiest traffic days at the end of last year. The panic immediately subsided when we enlisted more MySQL slaves, but I learned an important lesson: SLOW is SLOW. Don’t be a hero and cling to how you think or want WordPress to work, learn how things perform and make them as fast as possible. Also, cache like the wind.

$wpdb Abstraction Layer

A more hands-on approach is to use the Database Abstraction Layer in WordPress, namely the $wpdb global object. $wpdb is great and provides MVC-like methods for UPDATEs and INSERTs. SELECTs are mostly hand-written when using $wpdb, which can be a good thing (as we demonstrated above).

Here are the best methods to use when retrieving one or many rows of data:

<?php

// get one row of data
$sql = $wpdb->prepare( 'SELECT ID, post_title, post_author, post_content FROM wp_posts WHERE id = %d', get_primary_key() );
$row = $wpdb->get_row( $sql, ARRAY_A );

// get a list of post IDs
$sql = $wpdb->prepare( 'SELECT ID FROM wp_posts WHERE post_type = "your_mom_jokes" AND post_author = %d', get_your_mom() );
$ids = $wpdb->get_col( $sql );

// get a bunch of rows
$sql = $wpdb->prepare( 'SELECT ID, post_title, post_author, post_content FROM wp_posts WHERE post_type = "bad_haiku"' );
$rows = $wpdb->get_results( $sql, ARRAY_A );

InnoDB vs MyISAM

WordPress “out-of-the-box” uses the MyISAM storage engine (the default storage engine for MySQL < 5.5). For “most” users, this is just fine and dandy. MyISAM has the advantage of built-in full-text search. InnoDB is the default engine for MySQL 5.5 and later. InnoDB has row-level locking, as opposed to MyISAM’s table-level locking. Table-level locking hinders concurrency in the database when you have an astronomical number of rows in a high-traffic environment. Just one full-table-scan query can cause all sorts of trouble when using MyISAM in production. Every mature RDBMS has built-in fulltext functionality, except for InnoDB (frown).
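If you decide to make the jump, the conversion itself is one statement per table (back up first, and remember that any MyISAM full-text indexes will not carry over on this era of MySQL):

ALTER TABLE wp_posts ENGINE=InnoDB;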

We’re going to look more at it in a minute, but realize that WordPress has some serious negatives when it comes to Search. For a giant site, you are better off looking at solutions like Lucene, Solr, and Elastic Search and providing them as a Web Service. Most sites won’t be serving every piece of content out of their WordPress database only, so this shouldn’t be outside the realm of possibility for your site’s architecture.

LONGTEXT COLUMNS

One particular thing about the WordPress database schema that is dangerous is its use of unindexed (and un-indexable) LONGTEXT columns to store the values of key / value pairs in 4 of its tables. I suppose it makes sense to have maximum flexibility when storing data for a user base in the many millions, but the second you decide to query against the *values* in your key / value pairs, you are dancing with the devil. The “key” is stored as a VARCHAR with a standard BTREE index, and the “value” is stored as a LOB type (LONGTEXT, maximum size 4 GB). While the value for a particular key may be readily found because of the index on the key column, you cannot go the other way – you cannot, for instance, find all the users who use a particular nickname without doing a full table scan on an unindexed column. So try doing that on a table with over 10,000,000 rows and see how it performs.
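The asymmetry is easy to demonstrate (a sketch; 'nickname' is a real usermeta key):

// fast: key -> value rides the BTREE index on meta_key
$nick = get_user_meta( $user_id, 'nickname', true );

// slow: value -> user has no usable index; this scans every LONGTEXT row
global $wpdb;
$ids = $wpdb->get_col( $wpdb->prepare(
    "SELECT user_id FROM {$wpdb->usermeta} WHERE meta_value = %s",
    'The Dude'
) );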

Search

LOB columns cannot be indexed using standard BTREE indexes. Therefore any search against such a column invokes a full table scan, which means every block in the table must be read from disk into memory and searched. Using LIKE with wildcards (% or _) is even worse, because the entire column contents must be searched (possibly as much as 4GB of text) and, as we know, string comparison is CPU-intensive.

Another problem is the way that WordPress searches by default. Let’s say I want to search for “Arcade Fire.” WordPress is going to do wildcard searches for post_title LIKE "%Arcade%", post_title LIKE "%Fire%", post_content LIKE "%Arcade%", and post_content LIKE "%Fire%". I’m going to let you guess how well this performs. For a search like this, we really want to match the exact phrase if we’re using WordPress, or let some super advanced tool like Lucene take over and give your search results better handling.

<?php
add_filter( 'posts_search', 'refine' );

// replace WP's per-word wildcard clauses with one exact-phrase match
function refine( $where ) {
    global $wpdb;
    if ( !empty( $where ) ) {
        // remember: don't use mysql_real_escape_string
        $term = $wpdb->escape( '%' . get_search_query() . '%' );
        $where = " AND ((wp_posts.post_title LIKE '$term') OR (wp_posts.post_content LIKE '$term')) ";
    }
    return $where;
}

You can also speed up instances of Search by forcing the use of an index. Before we served Search natively through the WordPress front end, we were making Editorial Search available via XML-RPC to a Java client. We found that requesting a limited number of posts ordered by date was sped up by forcing index usage:

SELECT SQL_CALC_FOUND_ROWS wp_posts.* FROM wp_posts USE KEY (type_status_date)....

I know many people are against this practice, but I know of others who are for it in certain circumstances: namely, the MySQL performance team at Facebook.

HyperDB

If you are working in a high traffic environment, chances are you are going to need more power than just one database server can handle. You’ve probably already scaled to several web servers; adding slaves to your master database should come next.

HyperDB is awesome and works like a champ right out of the box. You can even enable replication lag detection by uncommenting some lines of code that ship with it and installing mk-heartbeat.
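
The config lives wherever the DB_CONFIG_FILE constant points (ours lives at /wp-config/hyperdb.php, as you’ll see later in this post). A minimal sketch with one master and one read slave – the hosts here are hypothetical, and a real config will also want datasets and timeouts:

<?php
// db-config.php - a HyperDB sketch: one master, one read slave
$wpdb->add_database( array(
    'host'     => 'db-master.example.com', // hypothetical
    'user'     => DB_USER,
    'password' => DB_PASSWORD,
    'name'     => DB_NAME,
    'write'    => 1, // writes go here
    'read'     => 2, // higher number = lower read priority
) );

$wpdb->add_database( array(
    'host'     => 'db-slave-1.example.com', // hypothetical
    'user'     => DB_USER,
    'password' => DB_PASSWORD,
    'name'     => DB_NAME,
    'write'    => 0, // never write to a slave
    'read'     => 1, // prefer the slave for reads
) );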

Conclusion

Debug your code. Find out how fast / slow everything you are doing is. Don’t assume everything in WordPress is fast. It’s not.

eMusic.com is faster on WordPress

We’re optimizing every day (and still have work to do), but we get lots of traffic and perform tons of parallel Web Service calls – the site performs like a beast on WordPress and Amazon EC2.

From http://www.pcmag.com/article2/0,2817,2399365,00.asp, “Top Retail Sites Still Way Too Slow, Study Finds”:

Only eight Web sites finished with an average load time under four seconds: Nike.com (2.269 sec), JCPenney.com (2.65 sec), Amazon.com (2.776 sec), eMusic.com (3.279 sec), eCrater (3.542 sec), WellsFargo.com (3.81 sec) and CDUniverse (3.81 sec).

WordPress + Memcached

One of the most bizarre critiques of WordPress I often hear is “it doesn’t come with caching” – which makes no sense, because Cache is one of the best features of WordPress out of the box. That’s kind of like saying “my iPod sucks because it doesn’t have any songs in it” – when you first buy it. Your iPod can’t predict the future and come pre-loaded with songs you love, and your WordPress environment can’t come fully configured without knowing a minimal number of things. You have to pick a username / password, you have to point at a database, and if you want to cache, you have to pick how you want to cache (you don’t HAVE to cache – but really, you HAVE to cache).

If you just have a blog that gets “light” traffic, and that blog only sits on one server: install W3 Total Cache (or WP Super Cache) and you can skip the rest of this post.

For W3TC, make sure it is not in preview mode and follow its instructions / guidance for how to get started making your site faster. W3TC can be used in a multi-server setup as well, but if you are running a website in a load-balanced environment, you are probably going to be better served by the tools I am about to show you.

Memcached

Memcached is a distributed in-memory caching system that stores key-value pairs in RAM. That will make sense to a lot of people, if it doesn’t to you – check this out:

// hey - connect me to the server where I store stuff!
connection = connect_to_server();

// store this value in that RAM of yours
// "key" holds "value"
connection.set( key, value );

// give me that value I stored with "key"
connection.get( key );

It was originally developed by Brad Fitzpatrick for LiveJournal. The concept is simple: instead of running 18 web servers / each reading from their own limited cache, each server reads / writes from a common distributed set of caches that they all can share. As long as each server is aware of the same set of “Memcached boxes,” what one server stored in the cache can be read by every other server in the cluster of web servers. The advantage to doing this is immediately apparent: 18 servers reading from the same 30GB of collective cache is more efficient than each server reading from its own smaller local cache.

Memcached (pronounced: Mem-cash-dee), or Memcache-daemon, is a process that listens by default on port 11211. Like httpd (H-T-T-P-daemon), it runs in the background, often started automatically at boot. A lot of huge websites use Memcached – at the very least: Facebook, YouTube, and Twitter.

Memcached has external dependencies, so it is best to install it using a package manager like MacPorts (on OS X) or yum (Linux). Package managers give you the ability to type commands like port install memcached and port load memcached to do almost all of the initial setup for Memcached.
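
Once it’s installed, starting a daemon by hand is a one-liner (tune the memory ceiling to your box):

# -d daemonize, -m memory ceiling in MB, -p port, -l address to bind
memcached -d -m 64 -p 11211 -l 127.0.0.1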

Memcached does not have to run on separate servers, either. If your production site uses dedicated Memcached servers, but you want to develop locally against a Memcached instance running on your own machine, that is just fine.

We’ll come back and discuss more tools for interacting with Memcached once we know a little more about it.

Since WordPress is written in PHP and makes a bunch of expensive SQL queries to its companion MySQL database, we need to figure out how to start taking advantage of Memcache in PHP to store the results of expensive queries, functions, and external HTTP requests in cache.

Memcache PHP Extension

Now that you (maybe) have Memcached installed, you need a way for PHP to talk to it, because it can’t out of the box. You have at least 2 choices for installing PHP extensions:

  1. Package manager
  2. PECL

If you have PECL / PEAR installed, you can run commands like pecl install memcache. Using MacPorts, it will be something more like port install php5-memcache. So what does this get you?

Mainly, access to the Memcache class (after you restart Apache) provided by the PHP extension, plus a procedural API for adding / updating / deleting / retrieving items from your Memcache server(s). Some examples:

<?php
/**
 * Object-oriented approach
 *
 */
$memcache = new Memcache;
// connect locally on the default port
$memcache->connect( '127.0.0.1', 11211 );

// key = set this item in the cache
// value = a string in this case, also supports objects and arrays
// false = do NOT use zlib compression to store the value
// 30 = the number of seconds to cache the item
$memcache->set( 'some_key', 'this is the value I am storing', false, 30 );

// replace / update the value of an item
$memcache->replace( 'some_key', 'but this is the value I am updating', false, 90 );

// retrieve an item
$memcache->get( 'some_key' );

/**
 * Procedural approach (functions not classes)
 *
 */
$memcache = memcache_connect( '127.0.0.1', 11211 );
memcache_add( $memcache, 'some_key', 'this is the value I am storing', false, 30 );
memcache_set( $memcache, 'some_key', 'this is the value I am storing', false, 30 );

// replace / update the value of an item
memcache_replace( $memcache, 'some_key', 'but this is the value I am updating', false, 90 );

// retrieve an item
memcache_get( $memcache, 'some_key' );

These functions work just fine, but when using WordPress, we aren’t going to call them directly – these functions will be “abstracted.”

WP Object Cache

WordPress has a cache abstraction layer built in. What this means is that WordPress has an API for interacting with what is called the WP Object Cache – a PHP class: WP_Object_Cache. The WP Object Cache class is “pluggable” – meaning, if *you* define a class called WP_Object_Cache in your codebase and include it at runtime, that’s the class WordPress will use; otherwise it defaults to WP’s own. WordPress also has a procedural API for interacting with the cache object (likely what you’ll use in your code). Here’s a peek at future code you may write:

<?php

//
// Function (or "template tag") to abstract the fetching of data.
// Data comes from cache, or call is made to db then cached.
// wp_cache_set( key, value, cache_group, expires (n seconds) )
//

function get_expensive_result() {
    global $wpdb;

    $key = 'expensive_query';
    $group = 'special-group';
    $response = wp_cache_get( $key, $group );

    if ( empty( $response ) ) {
        // add the result of an expensive query
        // to the cache for 3 hours
        $response = $wpdb->get_results(.....);
        if ( !empty( $response ) )
            wp_cache_set( $key, $response, $group, 60 * 60 * 3 );
    }
    return $response;
}

$data = get_expensive_result();

WP Object Cache is agnostic as to how it is implemented. Meaning, it either a) does whatever the default behavior in WP is, which is to “cache” non-persistently, or b) does whatever your WP Object Cache (your “pluggable” override) class says to do.

Here’s the catch – it won’t do anything until you add this to wp-config.php:

define( 'WP_CACHE', true );

Here’s the other catch – all that constant really does is make WordPress try to load wp-content/advanced-cache.php (wp-content/object-cache.php gets picked up automatically whenever it exists), and we haven’t installed either file yet. Those files are typically made available by a persistent cache plugin (we want our cached items to be available across page loads until they expire – duh), but let’s pause and come back to this subject a little later.

Transients API

The Transients API in WordPress is an abstraction of an abstraction. Memcached should always be viewed as a “transient cache,” meaning that sometimes you will request data, and it won’t be there and will need to be re-cached. The Transients API is a simple procedural API for acting on items. It is actually the exact same thing as wp_cache_*, and even allows you to pass in an expiration; it just doesn’t allow you to specify a group (‘transient’ is the group). It is really an exercise in semantics. One advantage: if you aren’t using a WP Object Cache backend (a persistent cache plugin), your “transients” will get stored in the wp_options table of your site’s WP database instead.

<?php
// Transients procedural API
set_transient( 'woo', 'woo' );
set_transient( 'woo', 'hoo' ); // there is no update_transient() - just set again
get_transient( 'woo' );
delete_transient( 'woo' );

// multisite
get_site_transient( 'multi-woo' );

// behind the scenes, this is happening
wp_cache_set( 'woo', 'woo', 'transient' );
wp_cache_set( 'woo', 'hoo', 'transient' );
wp_cache_get( 'woo', 'transient' );
wp_cache_delete( 'woo', 'transient' );

I typically use Transients when I want to store something for an extremely short period of time and then manually delete the item from the cache. Here’s an example:

<?php
function this_takes_90_secs_to_run_in_background() {
    if ( !get_transient( 'locked' ) ) {
        // this function could get called 200 times
        // if we don't set a busy indicator
        // so it doesn't get called on every page request;
        // give the lock an expiration as a safety net,
        // in case we die before deleting it
        set_transient( 'locked', 1, 120 );

        // do stuff....

        // ok I'm done!
        delete_transient( 'locked' );
    }
}

add_action( 'init', 'this_takes_90_secs_to_run_in_background' );

Could this exact same thing have been accomplished by using wp_cache_* functions? You betcha, because they were (behind the scenes)! Like I said, I use Transients in specific instances – it’s really an issue of semantics.

So our next step is to make WordPress ready to start caching.

Memcached plugin / WP Object Cache backend

Wouldn’t you know it? There’s a plugin that implements the WP Object Cache on top of Memcache, from one of the lead developers of WordPress, Ryan Boren. It’s not really a “plugin” – it’s a drop-in file. Once you download the plugin, move object-cache.php into the wp-content directory, and you have pluggably overridden WP Object Cache. If you are brave and actually look at the code in the lone file, you will notice a bunch of functions (a procedural API) for interacting with the WP Object Cache object (an instance of the Memcache class).

You will also notice that at its core, it is just calling the methods of the Memcache class – it has simply abstracted these steps for you. It works with Multisite and has intelligently cooked up a way to cache by blog_id, group, and key. You get caching by blog_id for free; there is nothing you need to do for this to work.

Caching by group is really “namespacing” your keys. Let’s say you want to call 10 different things in your cache “query” – your choices for providing context are a) using a longer key like “bad-query” or b) just using “query” and adding the group “bad” into the fold:

<?php
wp_cache_set( 'bad-query', 'something' );
wp_cache_set( 'query', 'something', 'bad' );

This is where cache and transients completely overlap. Since you can’t specify a group with transients, “bad-query” would be your only option for the key to avoid name collision. Like I said, it is all semantics.

So we know we can call wp_cache_delete( $key, $group ) to delete an item from the cache, right? So that must be why groups are helpful as well, to remove a bunch of things from the cache at once? Nope. You can’t remove a group of items from the cache out of the box. Luckily, this infuriated me enough that I wrote a plugin that will do this for you. More in a bit, we still need to obtain an advanced-cache.php file.

Batcache

The advanced-cache.php piece of the puzzle is implemented by Batcache. Batcache is a full-page caching mechanism written by Andy Skelton of Automattic. WordPress.com used to host Gizmodo’s live blogs, because Gizmodo would go down on its own architecture during Apple keynotes after ridiculous spikes in traffic. The idea is this: if a page on your site is requested x number of times in an elapsed period of time, cache the entire HTML output of the page and serve it straight from Memcached for a defined amount of time. Batcache is a bonafide plugin, but the advanced-cache.php portion is a drop-in that should be moved to the wp-content directory.

Batcache will serve these fully-cached pages to anonymous users and send their browser a Cache-Control: max-age={MAX_AGE_OF_PAGE}, must-revalidate HTTP header. There are some dangers you should be aware of when serving highly-cached pages from a dynamic website.

Batcache has a class member array called $unique which you should populate with values that would ordinarily make your page display different content for different users: Regionalization, perhaps Browser alias, country of origin of User, etc. If you are getting fancy and adding classes to the <body> tag based on browser, and you don’t add those classes to the $unique array in Batcache, you may end up serving HTML pages with an “ie ie7” class in the class attribute of the <body> tag to most of your users, regardless of what browser they are actually using. Fail.
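
Batcache reads its settings from a global $batcache array before the class boots, so seeding $unique can happen in wp-config.php or at the top of advanced-cache.php. A sketch – the crude IE sniff is mine, standing in for whatever actually varies your markup:

<?php
// anything that changes the rendered page must contribute to the cache key
$batcache = array(
    'max_age' => 300, // serve a cached page for up to 5 minutes
    'unique'  => array(
        'is_ie' => isset( $_SERVER['HTTP_USER_AGENT'] )
            && false !== strpos( $_SERVER['HTTP_USER_AGENT'], 'MSIE' ),
    ),
);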

Batcache serves these pages to “anonymous” users only. How does it know they are anonymous? It looks in the $_COOKIE super-global for some WordPress-specific cookies.

// Never batcache when cookies indicate a cache-exempt visitor.
if ( is_array( $_COOKIE ) && ! empty( $_COOKIE ) ) {
    foreach ( array_keys( $_COOKIE ) as $batcache->cookie ) {
        if ( $batcache->cookie != 'wordpress_test_cookie' &&
            (
                substr( $batcache->cookie, 0, 2 ) == 'wp' ||
                substr( $batcache->cookie, 0, 9 ) == 'wordpress' ||
                substr( $batcache->cookie, 0, 14 ) == 'comment_author'
            )
        ) {
            return;
        }
    }
}

That’s all well and good unless you have implemented a custom authentication system like we have at eMusic.

The advanced-cache.php file that ships with Batcache is really a starting point. I know it may seem daunting or too complicated to dig into someone else’s plugin code, but if you are using Batcache, your site demands another level of skill and coding.

Using Memcached at eMusic

At eMusic, we use 4 dedicated Memcached servers in production – combined equaling ~28GB of RAM. When you have THAT much memory to interact with, some interesting things can happen. Here are a few:

  • Your keys with no expiration will seemingly never expire until the cache starts evicting LRU (Least Recently Used) keys. Lesson learned here… always indicate expiration!
  • If you didn’t divide your keys up into a bunch of smaller groups, trying to flush one portion of the cache will end up flushing tons of data that doesn’t need to be refreshed and might send a blast of traffic to your database cluster or web service tier. Lesson learned here… use MANY cache groups.
  • If you aren’t updating or deleting cache keys in your code at all, you may find that you end up with stale data often. Especially if you work with an editorial team / writers. They’ll regularly come to you with a “hey, my data is not updated on the live site!”
  • Don’t assume WordPress is getting it right when it comes to caching its own data; dig in and find out how it really works. You may (will) find some objectionable things.

Johnny Cache

Remember how I said that you can’t “flush” the cache by cache group? Turns out that’s a big problem for us. Why? If we roll code or add / change a feature, we sometimes want to clear a certain cache to reflect a certain change.

Cache groups are a WordPress concept, not a Memcached concept. WordPress adds the group names to keys by convention, so if you know how to parse and create keys like the Memcached WP Object Cache backend, you can sort through the keys on your Memcached servers and group them together yourself to inspect them in the admin. If you have a list of keys for a group, you can loop through them and call wp_cache_delete( $key, $group ) on each.

Here’s some Memcache extension code to retrieve keys:

<?php
$memcache = new Memcache();
$memcache->connect( $server, 11211 );
$list = array();
// stats for every slab of memory on the server
$allSlabs = $memcache->getExtendedStats( 'slabs' );
foreach ( $allSlabs as $serverName => $slabs ) {
    foreach ( $slabs as $slabId => $slabMeta ) {
        // dump the keys stored in this slab
        $cdump = $memcache->getExtendedStats( 'cachedump', (int) $slabId );
        foreach ( $cdump as $host => $arrVal ) {
            if ( ! is_array( $arrVal ) )
                continue;
            foreach ( $arrVal as $k => $v ) {
                $list[] = $k;
            }
        }
    }
}

The Memcached backend creates keys like so:

// blog_id:group:key
1:catalog:artist-info-246809809

To parse the list of keys and sort them into groups, try this:

<?php
$keymaps = array();
foreach ( $list as $item ) {
    $parts = explode( ':', $item );
    if ( is_numeric( $parts[0] ) ) {
        $blog_id = array_shift( $parts );
        $group = array_shift( $parts );
    } else {
        $group = array_shift( $parts );
        $blog_id = 0;
    }

    if ( count( $parts ) > 1 ) {
        $key = join( ':', $parts );
    } else {
        $key = $parts[0];
    }

    $group_key = $blog_id . $group;
    if ( isset( $keymaps[ $group_key ] ) ) {
        $keymaps[ $group_key ][2][] = $key;
    } else {
        $keymaps[ $group_key ] = array( $blog_id, $group, array( $key ) );
    }
}

ksort( $keymaps );
foreach ( $keymaps as $group => $values ) {
    list( $blog_id, $group, $keys ) = $values;
    foreach ( $keys as $key ) {
        .........
    }
}

Johnny Cache is the front-end I wrote for the WP Admin to do this. It allows you to select one Memcached server at a time. Once selected, the server’s keys (not values) are retrieved, then parsed and ordered by group and blog. The plugin allows you to do the following:

  • Flush keys by cache group
  • Remove single items from the cache
  • View the contents of single items in the cache
  • Flush the cache for a specific user by User ID – helpful if you made a change by hand in the database for a specific user

Johnny Cache is still a work in progress, but flushing cache by group was important enough that I wrote my own code to handle it.

Minify-cation

Working in a load-balanced environment is a lot different than working on one server. This is never clearer than when working with tools that expect you to dynamically create static files on the server and then point a URL directly at them. Almost all of the time, this doesn’t work. Also, if the generation of the static files is expensive, this is a task best performed once, cached, and then shared with every other server.

I wrote a plugin called Minify that magically grabs all of your page’s scripts and styles, hashes the src names to create an identifier, and then combines the styles or scripts into one file and runs them through some minification classes.

I get around having to serve flat files by creating a rewrite which points to a “make” file. The make file either reads the result of this smashing-together-of-sources, or does the smashing and then saves / serves the result. The advantage here is that every request for our JavaScript and CSS comes straight out of Memcached.
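
A sketch of how such a rewrite can be wired up – the regex and query var names are hypothetical, not the plugin’s actual code, and you would also need to whitelist the query vars and serve the result from a template_redirect handler:

<?php
// route minify-{hash}-{incr}.{js|css} requests to a dynamic "make" handler
add_rewrite_rule(
    '^wp-content/cache/minify-([a-f0-9]+)-([0-9]+)\.(js|css)$',
    'index.php?minify_hash=$matches[1]&minify_incr=$matches[2]&minify_ext=$matches[3]',
    'top'
);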

Gotcha alert! To cache-bust a CDN like Akamai, you need to change the file name every time you roll new code to make sure your users aren’t getting served cached files. Query strings suck for that, so I made an increment part of the rewrite ( get_site_option( 'minify:incr' ) ). Our paths end up looking like:

http://www.emusic.com/wp-content/cache/minify-bdda2ca041434058e578f7b84eb7482b-23875598.css

// here's how it is translated
http://{HOST}/wp-content/cache/minify-{HASH}-{INCR}.{EXTENSION}

Works for both JavaScript and CSS. It’s magic.

Sitemaps

Same concept. If I want to serve google-news-sitemap.xml, I make a rewrite that points to google-news-sitemap.php. If it is in the cache, serve it; otherwise build it, store it, then serve it.
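
In code, that make file boils down to something like this (a sketch – build_news_sitemap() is a hypothetical stand-in for your XML generation):

<?php
// cache-first serving of a "static" file
function serve_news_sitemap() {
    $xml = wp_cache_get( 'google-news-sitemap', 'sitemaps' );
    if ( empty( $xml ) ) {
        $xml = build_news_sitemap(); // hypothetical: returns the XML string
        wp_cache_set( 'google-news-sitemap', $xml, 'sitemaps', 60 * 15 );
    }
    header( 'Content-Type: text/xml; charset=UTF-8' );
    echo $xml;
    exit;
}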

Memcached on the command line

One final note on interacting with Memcached servers: you can telnet to a server and issue commands directly using Memcached’s plain-text protocol.

$ telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
get foo
VALUE foo 0 2
hi
END
stats
STAT pid 8861

The full command set is documented in Memcached’s protocol docs. Also take a look at memcache-top.

Conclusion

There is no “definitive” way to use cache, so experiment and think through your caching strategy. A finely-tuned cache will drastically improve your website’s resilience and performance.

WordPress in Dev, QA, and Prod

If you work at a “corporation” and are using WordPress as a CMS, chances are you have multiple environments in which you do your development and testing before ultimately deploying to production. Syncing your environments (code, configs, database) can be a nightmare, unless you take some time to figure out just how similar your environments can be and focus only on how they are different. I am going to share with you the techniques we use at eMusic to seamlessly switch environments without any manual intervention in the database, codebase, cache, or configs (after the initial setup).

Database

You shouldn’t have to touch your database! You should embrace the idea that “production” is your authoritative database and only pull data DOWN, never push it UP. I know many tools exist to publish database rows from one environment’s database to another, but it is way easier to know you can always pull fresh data and “everything just works” rather than pushing data in real-time at runtime and praying for the best.

This will of course require some planning. Say I made a new “page” that uses a rewrite and some query vars to produce a very dynamic page, like an artist’s page for one of their albums. Before we roll this code to prod, I need to create the page in the WordPress admin on production so the URL will work and the page will be “live” when the code rolls.

Periodically, perhaps after each roll to production, a backup can be made of production data (an export) and “pulled down” to every other environment’s database (an import on each). I prefer to use Sequel Pro to manage our many databases and love its easy-to-use export and import GUI tools. Sequel Pro gives you the option to compress your SQL export using Gzip. I have had edge cases where a compressed export exposed bad character encoding in some of my tables (someone pasted from Word without using the paste-from-Word button in TinyMCE…); exporting uncompressed SQL has not. Uncompressed SQL for a database with many millions of rows will produce a huge file, though. If you are confident in the cleanliness of your character encoding, try a compressed export for a much smaller file size (8-10x smaller).
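
If you would rather script the pull-down than use a GUI, it is two commands (the host and database names here are made up; --single-transaction keeps the export consistent for InnoDB tables):

#!/bin/bash
# export from prod, import locally
mysqldump -h db-master.example.com -u backup_user -p --single-transaction emusic_wp | gzip > prod.sql.gz
gunzip < prod.sql.gz | mysql -h 127.0.0.1 -u root -p emusic_wp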

Why is this a good method?

  • Staging can always be replenished with fresh data, erasing any unnecessary dummy / test posts
  • What you do in dev, stays in dev, probably local to YOUR machine, so it can’t break anything
  • Prod remains prod, you can’t blow it up in real time with a bad import or a haywire push of rows / deltas

HEY WAIT A SECOND, ALL OF THE URLS IN THE DATABASE POINT TO PROD!!!!!

Output Buffering

Right, so you will need to somehow filter EVERYTHING that comes out of the database and replace production URLs with your current environment’s URLs. This can be an impossible task until you understand how output buffering works.

An output buffer captures all output that would normally be sent to the browser until you retrieve its contents and choose to echo them. Meaning, I can code my HTML document like I normally would, but the output buffer will swallow all of the output until I give the command for it to display. Sometimes this is called “flushing” the output buffer. Steve Souders even suggests calling the PHP function flush() right after your HTML </head> to send bytes to the user’s browser earlier and faster. Our technique kinda does the opposite of that, but what we get in return is far more valuable: a codebase that adapts to any environment, and one that rewrites URLs on the fly for us.

This code can really go anywhere before you start echo’ing content or writing HTML, but the safest place to put it is in wp-config.php:

<?php
define( 'DEV_HOST', $_SERVER['HTTP_HOST'] );

// single site install
define( 'PROD_HOST', 'www.emusic.com' );
// ... or, if you're using Multisite:
// define( 'PROD_HOST', DOMAIN_CURRENT_SITE );

function your_callback_functions_name( $page ) {
    return str_replace( PROD_HOST, DEV_HOST, $page );
}

// register the callback - use ONE of these, not both:
// pre-PHP 5.3, pass the function name
ob_start( 'your_callback_functions_name' );
// PHP 5.3+, use a closure
ob_start( function ( $page ) {
    return str_replace( PROD_HOST, DEV_HOST, $page );
} );

You can also include an array of URLs to replace with your DEV HOST. This is important if you have a production environment that has a host name for “wp-admin,” maybe a host name for XML-RPC servers, and a host name for your production site.

To set up a local host name for your site in Apache – you need to add an entry to your {PATH_TO_APACHE}/conf/extra/httpd-vhosts.conf file:

##
#
# Example Local Configuration (on a Mac)
#
##

<Directory /Users/staylor/Sites/>
    Options Indexes MultiViews ExecCGI FollowSymLinks
    AllowOverride All
    Order allow,deny
    Allow from all
</Directory>
<VirtualHost *:80>
    DocumentRoot "/Users/staylor/Sites/emusic/trunk"
    ServerName emusic.local
</VirtualHost>

You also need to add an entry in your /etc/hosts file:

127.0.0.1 emusic.local

AND AFTER THAT, you need to flush your local DNS cache to accept your new host:

#!/bin/bash
sudo {PATH_TO_APACHE_BIN}/apachectl restart
dscacheutil -flushcache

DEV_HOST requires a hard-coded host name (e.g. emusic.local) if $_SERVER['HTTP_HOST'] produces an IP address.

Here’s an example of filtering by searching for multiple URLs and replacing them with the current host (the callback is called – and passed the contents of the buffer – when the script and all of its output has reached its end; you can nest output buffers as well, more on this later):

// PHP 5.3
// these aren't real hosts
define( 'TOOLS_HOST', 'tools.emusic.com' );
define( 'XML_RPC_HOST', 'xml-rpc.emusic.com' );
define( 'PROD_HOST', 'www.emusic.com' );

ob_start( function ( $page ) {
    return str_replace( array( TOOLS_HOST, XML_RPC_HOST, PROD_HOST ), DEV_HOST, $page ); 
} );

Ok, I got this to work, but wait a second, we have DIFFERENT host names in staging for XML-RPC servers, etc, how do I tackle that….?

Machine-specific configs

We still use wp-config.php, but with a twist. We use MANDATORY machine configs. I’ll explain how. Inside of wp-config.php (right near the top):

define( 'DB_CONFIG_FILE', '/wp-config/hyperdb.php' );

if ( !is_file( DB_CONFIG_FILE ) ) {
    die( "You need a HyperDB config file in this machine's /wp-config folder" );
}

if ( is_file( '/wp-config/config.php' ) ) {
    require( '/wp-config/config.php' );
} else {
    die( 'You need a config file for your environment called config.php and located in /wp-config, and it needs to be owned by Apache.' );
}

Granted, you don’t have to use HyperDB. But if you get any sort of serious traffic on your production site, you kinda HAVE to use HyperDB. HyperDB is outside the scope of this post; just know it is a magical master-slave manager for a fancy production MySQL cluster.

The setup here is simple enough. Create a folder called wp-config (or whatever you would like) in the root directory of your machine and make sure Apache can read it.

Ok great, what goes in this /wp-config/config.php file…? Well, really anything you want, but probably these things:

// local constant for use in code
define( 'EMUSIC_CURRENT_HOST', $_SERVER['HTTP_HOST'] );

global $super_admins, $memcached_servers;
// define those here if you want, or not...

define( 'SUNRISE', 1 );

// your local database credentials
define( 'DB_NAME', 'my_database_name' );
define( 'DB_USER', 'root' );
define( 'DB_PASSWORD', 'mypassword' );
define( 'DB_HOST', '127.0.0.1' );
define( 'DB_CHARSET', 'utf8' );

// you may have environment specific S3 buckets, or not
// it's a good idea to keep these credentials in a file like this for security
// and ease of editing
define( 'AMAZON_S3_KEY', 'your_bucket_key' );
define( 'AMAZON_S3_SECRET', 'your_bucket_secret' );
define( 'AMAZON_S3_BUCKET', 'your_bucket' );

// multisite values
define( 'WP_ALLOW_MULTISITE', true );
define( 'MULTISITE', true );
define( 'SUBDOMAIN_INSTALL', false );
define( 'DOMAIN_CURRENT_SITE', 'www.emusic.com' );
$base = '/';
define( 'PATH_CURRENT_SITE', '/' );
define( 'SITE_ID_CURRENT_SITE', 1 );
define( 'BLOG_ID_CURRENT_SITE', 1 );

// maybe some hard-coded production values to filter
define( 'PRODUCTION_CURRENT_SITE', 'www.emusic.com' );
define( 'PRODUCTION_CURRENT_TOOLS', 'tools.emusic.com' );
define( 'XMLRPC_DOMAIN', 'xmlrpc.emusic.com' );

// environment-specific endpoints
define( 'XMLRPC_ENDPOINT', EMUSIC_CURRENT_HOST );
define( 'TOOLS_DOMAIN_FQDN', EMUSIC_CURRENT_HOST );

ob_start( function ( $page ) {
    // this might be overkill, but it makes sure you are
    // getting rid of any "wrong" URL that made its way into
    // the database somehow
    return str_replace(
        array(
            XMLRPC_DOMAIN,
            TOOLS_DOMAIN_FQDN,
            DOMAIN_CURRENT_SITE,
            PRODUCTION_CURRENT_TOOLS
        ),
        EMUSIC_CURRENT_HOST,
        $page
    );
} );

// custom WordPress configs:
// we use an svn:external for WordPress code
// in a different directory other than root
//
// we move our assets directory and exclude it from Subversion
// more on that later...
define( 'WP_CONTENT_URL',
    'http://' . DOMAIN_CURRENT_SITE . '/wp-content' );
define( 'WP_CONTENT_DIR',
    $_SERVER['DOCUMENT_ROOT'] . '/wp-content' );
define( 'EMUSIC_UPLOADS', 'assets' );

This is an example config that might look completely different on every environment’s machine(s):

  • Each environment might have specific overrides (a $super_admins array?)
  • You might specify 6 Memcached servers in one environment and zero in another
  • Your database credentials will undoubtedly change across environments

Sunrise

We use Multisite, so we also have to do some work in wp-content/sunrise.php. When using Multisite, WordPress lets you get in early: if you set define( 'SUNRISE', 1 ), you can completely change what site and blog you are on by adding your own monkey business in wp-content/sunrise.php, if you so choose. You can also start adding filters and actions before the meat and potatoes of WordPress starts doing its thing. We are filtering output with the output buffer, but we also have to filter PHP variables in code using WordPress filters:

$domain = EMUSIC_CURRENT_HOST;

// a bunch of code here is omitted that looks at domain
// and does some crazy stuff to switch between "blogs"
// without actually adding Wildcard DNS

// Multisite can require some unconventional
// code and admin wrangling to get things to work
// properly when you have a custom WP location,
// custom media location, AND a custom Multisite media
// location

// all you need to know:
// this code will produce $the_id (representing blog_id) and
// possibly alter $domain

// and then...

function get_environment_host( $url ) {
    global $domain;
    return str_replace(
        array(
            XMLRPC_DOMAIN,
            TOOLS_DOMAIN_FQDN,
            DOMAIN_CURRENT_SITE
        ),
        $domain,
        $url
    );
}

add_filter( 'home_url', 	'get_environment_host' );
add_filter( 'site_url', 	'get_environment_host' );
add_filter( 'network_home_url', 'get_environment_host' );
add_filter( 'network_site_url', 'get_environment_host' );
add_filter( 'network_admin_url','get_environment_host' );
add_filter( 'post_link', 	'get_environment_host' );

add_filter( 'pre_option_home', function ( $str ) use ( $domain ) {
    return 'http://' . $domain;
} );

add_filter( 'pre_option_siteurl', function ( $str ) use ( $domain ) {
    return 'http://' . $domain;
} );

// our custom image location for blogs / sites in our network
if ( $the_id > 1 ) {
    define( 'UPLOADBLOGSDIR', 0 );
    define( 'UPLOADS', 0 );
    define( 'BLOGUPLOADDIR',
        $_SERVER['DOCUMENT_ROOT'] . "/blogs.dir/{$the_id}/files/" );

    add_filter( 'pre_option_upload_path', function () use ( $the_id ) {
        return $_SERVER['DOCUMENT_ROOT'] . "/blogs.dir/{$the_id}/files/";
    } );

    add_filter( 'pre_option_upload_url_path', function () use ( $the_id, $domain ) {
        return 'http://' . $domain . "/blogs.dir/{$the_id}/files/";
    } );

// our custom image location for our main site
} else {
    add_filter( 'pre_option_upload_path', function ( $str ) {
        return $_SERVER['DOCUMENT_ROOT'] . '/' . EMUSIC_UPLOADS;
    } );

    add_filter( 'pre_option_upload_url_path', function ( $str ) {
        return 'http://' . EMUSIC_CURRENT_HOST  . '/' . EMUSIC_UPLOADS;
    } );
}

Ok wow, this is great, but what about images, how do I sync them…?

Images / S3

Trying to keep images sync’d in Subversion is a nightmare. Your production code probably isn’t a Subversion checkout (although WordPress.com is, they run trunk). It is more than likely a Subversion export that has been rsync’d across all of your many load-balanced servers / EC2 instances. If you don’t have a dedicated server for wp-admin, your images might not even end up on the same server – they could be spread across several. So that begs the following questions:

  • How is it humanly possible for servers to share images?
  • What happens if a server receives a request for an image it doesn’t have?
  • What happens when I pull the prod database to my local db but I don’t have any of those image paths in my file system?
  • How do I pull all of those images from production (each server) and somehow check them into Subversion… in real-time(!) ?

I am of the following opinion: you can’t! But this isn’t limited to image assets. How do you serve a sitemap that is supposed to be a static file when you have 18 servers and the file is generated dynamically and saved locally? How do you minify JavaScript and CSS and then save them to flat files that can be served by any server? My answer: you can’t, and shouldn’t!

So let’s start looking at solutions!

W3 Total Cache is the subject of intense debate across the WordPressosphere. Many say it does too much. When you are already using Batcache and Memcached, it sorta becomes overkill. But for me, there was one feature I always thought was invaluable. The CDN portion! So what does this CDN portion do?

The CDN code in W3TC gives you an admin interface to enter CDN credentials (Amazon S3 or CloudFront, Microsoft Azure, etc.), and then, magically, when you upload a media file, it will upload that file to Amazon S3. Not only that, it will rewrite your image / media URLs to point at said remote location. So you keep working and uploading, and it takes care of the rest. It’s magic!

To use this feature, I had to rip all kinds of code OUT of W3TC and make some changes here and there. One of the first things I knew I needed was a config that would work across any environment (much like our database setup works). I needed to accomplish the following things:

  • At no point in any environment do I need to have the images in my local file system
  • Every environment’s media URLs should be seamlessly rewritten to their S3 counterpart
  • When I import the production database into any environment, media assets should appear as if they were always there.

This might sound super-sophisticated, but we’re only doing 2 crucial tasks:

  1. Adding an action that will upload the media to Amazon S3. W3TC already did all of that heavy-lifting. Thanks!
  2. Adding an output buffer to match media URLs against RegEx and replace them. W3TC already did all of that heavy-lifting. Thanks!

I had to make changes for this to work – I altered the way the configs work, made sure the GUI couldn’t override the config when messing around in the admin, and added some constants that are defined in config.php for machine-specific configuration. But ultimately, I took an existing technology and tweaked it to work to our advantage.

Notice that I said W3TC uses an output buffer. Its buffer starts after ours, meaning we use nested output buffers: their callback will run before ours, so the result of their output buffer callback gets passed to ours.
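
If nested buffers sound mysterious, this toy example shows the ordering (nothing here is W3TC code):

<?php
// outer buffer registered first - its callback runs LAST
ob_start( function ( $page ) {
    return str_replace( 'B', 'C', $page );
} );

// inner buffer registered second - its callback runs FIRST
ob_start( function ( $page ) {
    return str_replace( 'A', 'B', $page );
} );

echo 'A';
// at shutdown, the inner callback turns "A" into "B",
// then the outer callback turns "B" into "C" - final output: "C"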

The Result

Our stuff “just works.” To get started developing with WordPress at eMusic, the following is required:

  • A production account on eMusic.com
  • Machine configs present in /wp-config
  • A copy of the production database
  • A Subversion checkout
  • Some entries in /etc/hosts for our many Web Services (from our Java universe)

That’s it. Here’s me talking about this and more at WordCamp San Francisco 2011.

Fix for “The plugin does not have a valid header”

If you’ve ever seen this error when trying to install a plugin in your WordPress Admin Panel, you have probably considered going on a murderous rampage:

“The plugin does not have a valid header.”

This error is probably due to your list of Plugins being cached in whatever WP Object Cache your blog / site is using. None of the sites you will find when searching Google will tell you this. They will tell you that you have an invalid Plugin Header. You might. But probably, your list of files is cached.

There are 2 problems here:

  1. The list of plugins ( $cache_plugins ) is not pluggable
  2. The generic error for all missing plugins is “The plugin does not have a valid header”

function validate_plugin( $plugin ) {
    if ( validate_file( $plugin ) )
        return new WP_Error( 'plugin_invalid',
            __( 'Invalid plugin path.' ) );
    if ( ! file_exists( WP_PLUGIN_DIR . '/' . $plugin ) )
        return new WP_Error( 'plugin_not_found',
            __( 'Plugin file does not exist.' ) );

    $installed_plugins = get_plugins();
    if ( ! isset( $installed_plugins[ $plugin ] ) )
        return new WP_Error( 'no_plugin_header',
            __( 'The plugin does not have a valid header.' ) );
    return 0;
}

TO FIX THIS GARBAGE: in /wp-admin/includes/plugin.php at line 219-ish, comment out the lines shown below, then visit your Plugins list:

// if ( ! $cache_plugins = wp_cache_get('plugins', 'plugins') )
$cache_plugins = array();

// if ( isset($cache_plugins[ $plugin_folder ]) )
// return $cache_plugins[ $plugin_folder ];

When you’re done and everything works, un-comment those lines. If this didn’t fix the problem, then yes, you probably fucked up when creating the Plugin header comment.
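
A less invasive fix than commenting out core: the list is cached under the key 'plugins' in the 'plugins' group (you can see both in the core snippet above), so you can delete just that one cache entry – from a quick mu-plugin, for example – and then reload your Plugins list:

<?php
// flush only the cached plugin file list, then remove this snippet
wp_cache_delete( 'plugins', 'plugins' );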