Understanding MySQL in WordPress

MySQL

You would be surprised how many people I have interviewed for a job that think their PHP chops are great, but their MySQL, not so much. Many great WordPress developers I know have said the same thing. Not cool. Bad SQL can bring down your site! Forget crazy scaling challenges, basic WP Query usage can bring your site down if you aren’t careful. WordPress is a framework that provides you with a lot out of the box, but you are still going to end up writing a bunch of your own PHP to make it work the way you want. The same should be true, but often isn’t, for SQL queries.

As is usually the case, for basic usage (e.g. your mom’s blog), don’t mess with SQL. If you work on an enterprise-level site and have all kinds of monitoring and benchmarking tools in place, let’s mess with some SQL :)

Schema

WordPress at its core has a tiny and non-complex database schema.

  • Posts, Comments, Users
  • Post-, Comment-, User- meta
  • Options
  • Links
  • Terms, Term Taxonomy, Term Relationships

The Posts table (wp_posts) is a one-size-fits-all data store for “content.” Posts, Custom Post Types, whatever. At its core, it stores a title, a blob of text (your blog post), and possibly an excerpt. Also stores time, reference to the user who wrote it, and post_type. Everything else that has to do with Posts is relegated to the wp_postmeta table. The Post Meta table (wp_postmeta) is a key / value store that arbitrarily stores anything you want. Like, literally ANYTHING. The meta_value column is of type LONGTEXT, which can store up to 4GB of data. LONGTEXT is a poor choice for key / value pairs – more in a bit.

The Comments table (wp_comments) stores the contents of comments, a reference to the user, and some additional metadata about anonymous users (User Agent, IP, etc). The Comment Meta table (wp_commentmeta) is identical to the Post Meta table, albeit with different column names.

The Users table (wp_users) stores arbitrary information about your site’s registered users: email address, string used for username, username, username as slug, etc. The User Meta table (wp_usermeta) is identical to the Post Meta table, albeit with different column names (noticing a trend?).

(We’ve already covered over 50% of the default schema…)

The Options table (wp_options) is identical to the Post Meta table, albeit with different column names…

The Links table (wp_links) is “probably” not used to its full extent by most casual bloggers. It probably isn’t used by anyone using WordPress as a CMS. The Links product seems like a labor of love from the early days of WordPress (IMHO) and shows Matt‘s fondness for XFN.

The Term tables (wp_terms, wp_term_taxonomy, wp_term_relationships) are, in my opinion, a mess and are easier to understand through the API methods for interacting with Terms. More on this later…

SQL-generation

Like most things in WordPress, SQL “just happens,” and you don’t have to know anything about it beforehand to start working with WP. You could actually go your entire life using WordPress and never once need to know anything about SQL. That’s if your entire life using WordPress is in a lightweight manner – a personal blog that doesn’t need to scale. The second you start “needing to scale” is the second you start “needing to know SQL.” Why? I thought WordPress was written by geniuses who never make mistakes? Well, sorta true, sorta very not true. WordPress works perfectly and scales perfectly for probably 99% of use cases. The 1% of cases where it needs some help are those sites that get millions of page views a day / month and sites that use complex database queries across millions of rows, which is most enterprise-level sites.

There is a lot of code in WordPress that talks to the database for you, but almost all of that code at its core reaches the same end result: a string of SQL that will get passed to some function as a database query.

There are a couple of common ways this code is generated.

WP Query

WP Query is one of WordPress’ most powerful tools. You can pass all kinds of request params to WP Query and receive in return a collection of rows containing your site’s blog posts / data. The WP_Query class will get posts for you. It will do JOINs for you and resolve foreign ids. While it can be optimized to an extent, it can’t read your mind, and it can’t predict the future. It also can’t stop you from making bad decisions or forcing the writing of slow queries.

Many parts of WP Query, get_posts(), and the like are filterable. There are all kinds of places to hook in and alter portions of generated SQL.

Here’s an example of filterable SQL in WordPress Core:

<?php
// line 2570 of query.php
// WP_Query->get_posts

$where	  = apply_filters_ref_array( 'posts_where_paged', array( $where, &$this ) );
$groupby  = apply_filters_ref_array( 'posts_groupby',	  array( $groupby, &$this ) );
$join	  = apply_filters_ref_array( 'posts_join_paged',  array( $join, &$this ) );
$orderby  = apply_filters_ref_array( 'posts_orderby',	  array( $orderby, &$this ) );
$distinct = apply_filters_ref_array( 'posts_distinct',	  array( $distinct, &$this ) );
$limits	  = apply_filters_ref_array( 'post_limits',       array( $limits, &$this ) );
$fields	  = apply_filters_ref_array( 'posts_fields',	  array( $fields, &$this ) );

// line 2589
// for cache plugins, but wide open for everyone
$where	  = apply_filters_ref_array( 'posts_where_request',	array( $where, &$this ) );
$groupby  = apply_filters_ref_array( 'posts_groupby_request',	array( $groupby, &$this ) );
$join	  = apply_filters_ref_array( 'posts_join_request',	array( $join, &$this ) );
$orderby  = apply_filters_ref_array( 'posts_orderby_request',	array( $orderby, &$this ) );
$distinct = apply_filters_ref_array( 'posts_distinct_request',	array( $distinct, &$this ) );
$fields	  = apply_filters_ref_array( 'posts_fields_request',	array( $fields, &$this ) );
$limits	  = apply_filters_ref_array( 'post_limits_request',	array( $limits, &$this ) );

// line 2614
// screw it, lets filter the resulting string of SQL
$this->request = apply_filters_ref_array('posts_request', array( $this->request, &$this ) );

Hopefully this shows the following things:

  • You can filter every part of every query which has its origins in WP_Query->get_posts()
  • You can make every query better if you know what you are doing
  • You can make every query worse if you don’t know what you are doing
  • In the end, you can clobber everything that any plugin has done by just arbitrarily altering the SQL string
WP Query is useful if you want to do something like a Taxonomy Query and have it “just work.” Let’s look at a WP Query I wrote at eMusic and then dissect it:
<?php

function get_stories( $post_types = array() ) {
    global $regions_map, $editorial_types;
    $key = 'stories-' . get_region();
    if ( !empty( $post_types ) ) {
        $key = 'stories-' . get_region() . '-' . md5( join( '', $post_types ) );
    }
    $group = 'editorial';
    $response = wp_cache_get( $key, $group );

    if ( empty( $response ) ) {
        $params = array(
            'posts_per_page'=> 10,
            'order'         => 'DESC',
            'post_type'     => $editorial_types,
            'orderby'       => 'date',
            'tax_query'     => array(
                array(
                'taxonomy'  => 'region',
                'field'     => 'term_id',
                'terms'     => array(
                    $regions_map[ 'ALL' ],
                    $regions_map[ strtoupper( get_region() ) ]
                ),
                'operator'  => 'IN'
                )
            ),
            'no_found_rows'             => true,
            'update_post_meta_cache'    => false,
            'update_post_term_cache'    => false
        );

        if ( !empty( $post_types ) ) {
            $params['post_type'] = $post_types;
        }

        $posts = new WP_Query( $params );

        if ( $posts->have_posts() ) {
            wp_cache_set( $key, $posts->posts, $group, 1800 );
            $response = $posts->posts;
        }
    }

    return $response;
}

In case it isn’t immediately obvious, I am requesting the 10 most recent posts written by our Editorial staff in reverse-chronological order filtered by the taxonomy “region” and including a subset of our large family of custom post types.

If my request is in the cache, awesome – serve it from there. Otherwise, make a WP Query request and cache the resulting posts.

Let’s list a number of optimizations I did:

  • Turned off SQL_CALC_FOUND_ROWS
  • Turned off slurping of post meta
  • Turned off slurping of terms
  • Populated a map of term_name => term_id for regionalization
  • Taking advantage of the WP cache abstraction layer

This particular WP Query won’t slurp posts and terms, but it is a good idea to turn them off everywhere you know you do not want them to be requested.

Let’s look at the SQL queries that make up my optimized request:

// hop to retrieve term_taxonomy_id
// for regions because Tax Query won't take term_taxonomy_id
// as the value for "field"
SELECT term_taxonomy_id FROM wp_term_taxonomy WHERE taxonomy = 'region' AND term_id IN (1308,1306)

// a JOIN on the Term Relationships table to get posts in the
// requested region
SELECT   wp_posts.* FROM wp_posts  INNER JOIN wp_term_relationships ON (wp_posts.ID = wp_term_relationships.object_id) WHERE 1=1  AND ( wp_term_relationships.term_taxonomy_id IN (1307,1309) ) AND wp_posts.post_type IN ('emusic_feature', 'emusic_spotlight', 'book_feature', 'book_spotlight', 'emusic_bookshelf', 'emusic_qa', 'book_qa', 'emusic_jukebox_jury', 'emusic_who', 'emusic_select', 'emusic_about', 'emusic_guide', 'emusic_list', 'emusic_stack', 'emusic_radio_program', 'emusic_review', 'book_review', 'emusic_hub', 'book_hub', 'emusic_list_hub', 'emusic_guide_hub', 'emusic_six_degrees', 'book_six_degrees', 'emusic_icon', 'book_icon') AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 10
2 queries, 565ms.

If I want to affect these queries, I can hook into WP Query with the filters I showed above. The problem I have with doing that is this: I don’t want to write a bunch functions just to remove a few clauses from one SQL statement:

<?php
// function to make my site really slow
function my_where_filter( $where ) {
    $my_where = preg_replace( '/this is overkill/', '', $where );
    return $my_where;
}

add_filter( 'posts_where_paged', 'my_where_filter' );

// all kinds of other stuff runs...
// other plugins collide with me...
// all I am doing is changing text...

$rows = result_of_slow_query();
remove_filter( 'posts_where_paged', 'my_where_filter' );

Imagine if I was filtering even 2 or 3 more clauses. What a mess…

Now, let’s remove my optimizations and see what happens when WP Query just runs like normal:

<?php

function get_stories( $post_types = array() ) {
	global $regions_map, $editorial_types;

	$params = array(
	    'posts_per_page'=> 10,
	    'order'         => 'DESC',
	    'post_type'     => $editorial_types,
	    'orderby'       => 'date',
	    'tax_query'     => array(
	        array(
	        'taxonomy'  => 'region',
	        'field'     => 'name',
	        'terms'     => array( 'ALL', strtoupper( get_region() ) ),
	        'operator'  => 'IN'
	        )
	    )
	);

	if ( !empty( $post_types ) ) {
	    $params['post_type'] = $post_types;
	}

	$posts = new WP_Query( $params );

	if ( $posts->have_posts() ) {
	    $response = $posts->posts;
	}

	return $response;
}

This version of the WP Query produces:

// hop to get term_id for regions
SELECT wp_term_taxonomy.term_id
FROM wp_term_taxonomy t
INNER JOIN wp_terms USING (term_id)
WHERE taxonomy = 'region'
AND wp_terms.name IN ('all','us')

// term_id is unusable, so go back and grab term_taxonomy_id
SELECT term_taxonomy_id FROM wp_term_taxonomy WHERE taxonomy = 'region' AND term_id IN (1308,1306)

// finally make the the request
SELECT SQL_CALC_FOUND_ROWS  wp_posts.* FROM wp_posts  INNER JOIN wp_term_relationships ON (wp_posts.ID = wp_term_relationships.object_id) WHERE 1=1  AND ( wp_term_relationships.term_taxonomy_id IN (1307,1309) ) AND wp_posts.post_type IN ('emusic_feature', 'emusic_spotlight', 'book_feature', 'book_spotlight', 'emusic_bookshelf', 'emusic_qa', 'book_qa', 'emusic_jukebox_jury', 'emusic_who', 'emusic_select', 'emusic_about', 'emusic_guide', 'emusic_list', 'emusic_stack', 'emusic_radio_program', 'emusic_review', 'book_review', 'emusic_hub', 'book_hub', 'emusic_list_hub', 'emusic_guide_hub', 'emusic_six_degrees', 'book_six_degrees', 'emusic_icon', 'book_icon') AND (wp_posts.post_status = 'publish' OR wp_posts.post_status = 'private') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 10

// we asked for 10, but I guess go ahead and do this in case we
// we want to indicate pagination or something
SELECT FOUND_ROWS()
4 queries, 658ms.

So we got this request from 4 database queries down to 2 using my optimizations. One of my optimizations that helped was $regions_map. $regions_map took my term_ids and turned them into term_taxonomy_ids. If we look closely at the first 2 queries in the non-optimized WP Query, we notice that the first 2 queries can be combined.

// produces term_id, so we have to make another database query
SELECT wp_term_taxonomy.term_id
FROM wp_term_taxonomy t
INNER JOIN wp_terms USING (term_id)
WHERE taxonomy = 'region'
AND wp_terms.name IN ('all','us')

// produces term_taxonomy_id, skip the second call
SELECT wp_term_taxonomy.term_taxonomy_id
FROM wp_term_taxonomy t
INNER JOIN wp_terms USING (term_id)
WHERE taxonomy = 'region'
AND wp_terms.name IN ('all','us')

I have another optimization called $regions_tax_map:

<?php

// term_id - for use with WP Query
$regions_map = array(
    'EU'    => 1303,
    'CA'    => 1304,
    'UK'    => 1305,
    'US'    => 1306,
    'ALL'   => 1308
);

// term_taxonomy_id, for when we want to write our own SQL
$regions_tax_map = array(
    'EU'    => 1304,
    'CA'    => 1305,
    'UK'    => 1306,
    'US'    => 1307,
    'ALL'   => 1309
);

When I snarked above that I thought the Term tables in WordPress are a mess, this is why. For any database call to have any teeth when joining with wp_term_relationships (as it inevitably does when you are using taxonomy to filter by region, etc), you have to have a term’s term_taxonomy_id for the lookup. There are many cases where this is not practical – a site that has 100s of models / makes of car in the term tables, but something like regional site code that only has 5 variations can be mirrored outside of the database and placed into code for quick access.

So let’s get that optimized WP Query down to only ONE call to the database by removing the GROUP BY clause and posts with a status of “private,” and make use of term_taxonomy_id shortcuts:

function get_stories( $post_types = array() ) {
    global $regions_tax_map, $editorial_types, $wpdb;
    $key = 'stories-' . get_region();
    if ( !empty( $post_types ) ) {
        $key = 'stories-' . get_region() . '-' . md5( join( '', $post_types ) );
    }
    $group = 'editorial';
    $response = wp_cache_get( $key, $group );

    if ( empty( $response ) ) {
        $types = empty( $post_types ) ? $editorial_types : $post_types;
        $joined = sprintf( "'%s'", join( "','", $types ) );

        $sql =<<<SQL
         SELECT wp_posts.* FROM wp_posts
         INNER JOIN wp_term_relationships
         ON (wp_posts.ID = wp_term_relationships.object_id)
         WHERE wp_term_relationships.term_taxonomy_id
            IN (
                {$regions_tax_map['ALL']},
                {$regions_tax_map[ strtoupper( get_region() ) ]}
            )
            AND wp_posts.post_type IN ({$joined})
            AND wp_posts.post_status = 'publish'
         ORDER BY wp_posts.post_date DESC
         LIMIT 0, 10
SQL;

         $response = $wpdb->get_results( $sql );

         if ( !empty( $response ) ) {
             wp_cache_set( $key, $response, $group, 1800 );
         }
    }

    return $response;
}
ONE query, 42ms.

As you can see, we are making ONE query that JOINs wp_posts and wp_term_relationships. We skip WP Query completely. We skip all 840 lines of WP_Query->get_posts(). Removing the GROUP BY clause drops the query time from 560ms to 42ms. HUGE.

We debugged the queries. We realized that all of that black voodoo is there to make sure 99% of the human population can get what they want from WP Query, but we have scalability to think about. We want to optimize ALL of our interactions with the database.

This request is a SMALL portion of our sidebar on pages that support it on eMusic’s website. However, I wasn’t originally caching this call. I was relying on the built-in WordPress post cache to take care of it. This ONE module almost brought down our site on one of our busiest traffic days at the end of last year. The panic immediately subsided when we enlisted more MySQL slaves, but I learned an important lesson: SLOW is SLOW. Don’t be a hero and cling to how you think or want WordPress to work, learn how things perform and make them as fast as possible. Also, cache like the wind.

$wpdb Abstraction Layer

A more hands-on approach is to use the Database Abstraction Layer in WordPress, namely: the $wpdb global object. $wpdb is great and provides MVC-like methods for UPDATEs and INSERTs. SELECTs are mostly hand-written when using $wpdb, which can be a good thing (like we demonstrated above).

Here are the best methods to use when retrieving one or many rows of data:

<?php

// get one row of data
$sql = $wpdb->prepare( 'SELECT ID, post_title, post_author, post_content FROM wp_posts WHERE id = %d', get_primary_key() );
$row = $wpdb->get_row( $sql, ARRAY_A );

// get a list of post IDs
$sql = $wpdb->prepare( 'SELECT ID FROM wp_posts WHERE post_type = "your_mom_jokes" AND post_author = %d', get_your_mom() );
$ids = $wpdb->get_col( $sql );

// get a bunch of rows
$sql = $wpdb->prepare( 'SELECT ID, post_title, post_author, post_content FROM wp_posts WHERE post_type = "bad_haiku"' );
$rows = $wpdb->get_results( $sql, ARRAY_A );

InnoDB vs MyISAM

WordPress “out-of-the-box” uses the MyISAM storage engine (the default storage engine for MySQL version < 5.5). For “most” users, this is just fine and dandy. MyISAM has the advantage of built-in full-text search. InnoDB is the default engine for MySQL > or = 5.5. InnoDB has row-level locking, as opposed to MyISAM’s table-level locking. Table-level locking hinders concurrency in the database when you have an astronomical number of rows in a high-traffic environment. Just one full-table scan query can cause all sorts of trouble when using MyISAM in production. Every mature RDBMS has built-in fulltext functionality, except for InnoDB (frown).

We’re going to look more at it in a minute, but realize that WordPress has some serious negatives when it comes to Search. For a giant site, you are better off looking at solutions like Lucene, Solr, and Elastic Search and providing them as a Web Service. Most sites won’t be serving every piece of content out of their WordPress database only, so this shouldn’t be outside the realm of possibility for your site’s architecture.

LONGTEXT COLUMNS

One particular thing about the WordPress database schema that is dangerous is its use of unindexed (and un-indexable) LONGTEXT columns to the store the values of key / value pairs in 4 of its tables. I suppose it makes sense to have maximum flexibility when storing data for a user base in the many millions, but the second you decide to query against the *values* in your key / value pairs, you are dancing with the devil. The “key” is stored as a VARCHAR with a standard BTREE index, and the “value” is stored as a LOB type (LONGTEXT, maximum size 4 GB). While the value for a particular key may be readily found because of the index on the key column, you cannot go the other way – you cannot, for instance, find all the users who use a particular nickname without doing a full table scan on an unindexed column. So try doing that on a table with over 10,000,000 rows and see how it performs.

Search

LOB columns cannot be indexed using standard BTREE indexes. Therefore any search against such a column invokes a full table scan, which means every block in the table must be read from disk into memory and searched. Using LIKE with wildcards (% or _) is even worse, because the entire column contents must be searched (possibly as much as 4GB of text) and, as we know, string comparison is CPU-intensive.

Another problem is the way that WordPress searches by default. Let’s say I want to search for “Arcade Fire.” WordPress is going to do wildcard searches for post_title LIKE "%Arcade%", post_title LIKE "%Fire%", post_content LIKE "%Arcade%", and post_content LIKE "%Fire%". I’m going to let you guess how well this performs. For a search like this, we really want to match the exact phrase if we’re using WordPress, or let some super advanced tool like Lucene take over and give your search results better handling.

<?php
add_filter( 'posts_search', 'refine' );

function refine( $where ) {
    if ( !empty( $where ) ) {
        $term = get_search_query();
        $term = mysql_real_escape_string( '%' . $term . '%' );
        $where = " AND ((wp_posts.post_title LIKE '$term') OR (wp_posts.post_content LIKE '$term')) ";
    }
    return $where;
}

You can also speed up instances of Search by forcing use of an index. Before we served Search natively through the WordPress front end, we were making Editorial Search available via XML-RPC to a Java client. We found that requesting a limited number of posts ordered by data was sped up by forcing index usage:

SELECT SQL_CALC_FOUND_ROWS wp_posts.* FROM wp_posts USE KEY (type_status_date)....

I know many people are against this practice, but I know of others who are for it in certain circumstances: namely, the MySQL performance team at Facebook.

HyperDB

If you are working in a high traffic environment, chances are you are going to need more power than just one database server can handle. You’ve probably already scaled to several web servers, adding slaves to your master database should come next.

HyperDB is awesome and works like a champ right out of the box. You can even enable replication lag detection by uncommenting some lines of code that ship with it and installing mk-heartbeat.

Conclusion

Debug your code. Find out how fast / slow everything you are doing is. Don’t assume everything in WordPress is fast. It’s not.

9 thoughts on “Understanding MySQL in WordPress

  1. This what’s been on my ‘to read’ list if for a while. A really good post, thanks Scott. Just one question about the optimised query… Isn’t the ‘GROUP BY’ necessary to prevent the post being duplicated in the returned results when it belongs to multiple taxonomies (Perhaps not relevant in your ‘region’ example) ?

Leave a Reply