Most WordPress setups exist on a hosting service (Shared or VPS) – one server talking to one database, boom, done. As sites grow, or as big sites adopt WordPress, setups expand to many web servers, a handful of databases, a handful of Memcached boxes, and maybe some external Web Services or third-party APIs. If your site is heavy on editorial content, you may never interact with external services – but if your dynamic pages pull content from many APIs, internal and external, and you don’t want to write your entire site in JavaScript, it’s time to learn some tricks of the trade for powering and scaling many requests at a time over HTTP.
Web Services
So what’s a “Web Service”? At its simplest, a Web Service is a bunch of related URLs that return data, and the Service is usually RESTful – meaning, “Clients initiate requests to servers; servers process requests and return appropriate responses.”
http://www.woo.com/get/151
http://www.woo.com/hi-five?id=12345&friend_id=54321
http://www.woo.com/hi-five/delete?id=676767
Responses are typically in XML or JSON. It sucks to parse XML, so most APIs either return JSON by default or allow you to specify format=json in the request. Most also go as far as to return JSONP if you so desire, so you can make requests to external servers without ever leaving the confines of JavaScript.
Why does JSON rule so much?
<?php
// this produces an (associative) array
$data = json_decode( $response, true );
Parsing XML into a PHP array can look like this:
<?php
require( 'wordpress/wp-load.php' );

ini_set( 'memory_limit', '2048M' );
ini_set( 'max_execution_time', 6000 );

function walk( $arr, $func ) {
    if ( ! is_array( $arr ) )
        return $arr;

    while ( list( $key, $value ) = each( $arr ) ) {
        if ( is_array( $value ) ) {
            $arr[$key] = walk( $value, $func );
        } else if ( is_string( $value ) ) {
            $arr[$key] = call_user_func( $func, $value );
        }
    }
    return $arr;
}

$z = new XMLReader;
$z->open( 'blob-of.xml' );

// fast-forward to the first someNode element
while ( $z->read() && $z->name !== 'someNode' );

while ( $z->name === 'someNode' ) {
    $node = new SimpleXMLElement( $z->readOuterXML() );
    $data = (array) $node;
    $data = walk( $data, 'trim' );
    // nightmare of casting nested SimpleXMLElements into Arrays
    // etc etc
    $z->next( 'someNode' );
}
$z->close();
JSON is just way easier and ready-made for PHP.
Services expose APIs (Application Programming Interfaces) that are used to build requests. You should be familiar with available APIs on the web by now: Facebook Connect, Twitter, Google Maps, etc.
WP_Http
WordPress has an API for dealing with HTTP – procedural and object-oriented. It is also an abstraction layer, meaning that it will pick from a variety of HTTP connection methods based on your current setup. I’m going to spend my time introducing you to cURL, but WordPress doesn’t require that you have the cURL extension installed. The order in which it looks for connection methods is cURL, streams (fopen), and fsockopen.
<?php
$url = 'http://www.google.com/a-url/that/returns/json';
$request = new WP_Http; // or use the $wp_http global
$result = $request->request( $url );
// request() returns a WP_Error on failure, an array on success
if ( ! is_wp_error( $result ) ) {
    $json = $result['body'];
    $data = json_decode( $json, true );
}
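If you prefer the procedural flavor, the stock wrappers do the same job in fewer lines. A minimal sketch, using the same placeholder URL as above:

<?php
// procedural flavor of the same request
$response = wp_remote_get( 'http://www.google.com/a-url/that/returns/json' );
if ( ! is_wp_error( $response ) ) {
    // wp_remote_retrieve_body() safely pulls the body out of the response array
    $data = json_decode( wp_remote_retrieve_body( $response ), true );
}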
cURL
cURL (or, “Client for URLs”) is a command line tool for getting or sending files using URL syntax. Its most bare-bones use is to return the response of a URL. In Terminal:
curl www.emusic.com
This will return the contents of whatever “www.emusic.com” returns. To just read the HTTP Response Headers:
curl -I www.emusic.com
You can post from the command line as well:
curl -d "foo=bar&pass=12345&id=blah" http://www.woo.com/post-to-me
You can even make an XML-RPC call from the command line like so:
curl -i -H 'Content-Type: text/xml' --data '<?xml version="1.0"?><methodCall><methodName>demo.sayHello</methodName><params></params></methodCall>' 'http://www.woo.com/xmlrpc.php'
cURL Extension
The cURL extension for PHP is, at its core, a set of bindings for libcurl. libcurl currently supports the http, https, ftp, gopher, telnet, dict, file, and ldap protocols. libcurl also supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploading (which can also be done with PHP’s ftp extension), HTTP form-based upload, proxies, cookies, and user+password authentication.
Because the cURL extension has external dependencies, it is easiest to install it with a package manager (MacPorts on OS X, yum on Linux – or PECL on either). You can run the following command using MacPorts to install some PHP 5.4 extensions for HTTP:
sudo port install php54-curl php54-http
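Once the extension is installed, a bare-bones GET request looks something like this – a minimal sketch, with a placeholder URL:

<?php
$ch = curl_init( 'http://www.woo.com/get/151' );
// return the response as a string instead of echoing it straight to output
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
$response = curl_exec( $ch );
if ( false === $response ) {
    // curl_error() tells you what went wrong
    error_log( curl_error( $ch ) );
}
curl_close( $ch );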
cURL Abstraction
We could start from scratch and write our own interface to cURL, or we could find someone who has done all of the dirty work for us. We are still going to write our own methods for talking to a cURL class, but we want to find a source that has already implemented the basics of cURL and cURL Multi.
WordPress has done this to an extent with the WP_Http class (it has no support or abstraction for cURL Multi). I use a variation on this guy’s code – an Object-Oriented cURL class that supports this form of “multi-threading.” The approach is very clean because it allows you to use the same interface to request one URL or many:
// One URL
$curl = new CURL();
$curl->retry = 2;
$opts = array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true
);
$curl->addSession( 'http://yahoo.com/', $opts );
$result = $curl->exec();
$curl->clear();

// 3 URLs
$curl = new CURL();
$opts = array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true
);
$curl->addSession( 'http://yahoo.com/', $opts );
$curl->addSession( 'http://google.com/', $opts );
$curl->addSession( 'http://ask.com/', $opts );
$result = $curl->exec();
$curl->clear();
cURL Multi
cURL Multi has a procedural API in the cURL PHP extension, and it’s pretty nasty to deal with. There is a standard way to implement it, which we have taken care of by using our OO class abstraction. The guts look like this:
<?php
$mh = curl_multi_init();

// Add all sessions to multi handle
foreach ( $this->sessions as $i => $url )
    curl_multi_add_handle( $mh, $this->sessions[$i] );

do
    $mrc = curl_multi_exec( $mh, $active );
while ( $mrc == CURLM_CALL_MULTI_PERFORM );

while ( $active && $mrc == CURLM_OK ) {
    if ( curl_multi_select( $mh ) != -1 ) {
        do
            $mrc = curl_multi_exec( $mh, $active );
        while ( $mrc == CURLM_CALL_MULTI_PERFORM );
    }
}
It’s just a busy loop that eventually ends up returning an array of responses. You don’t get your responses until all of the requests have completed, so you are only as fast as your slowest link.
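The snippet above stops at the wait loop; the responses still have to be drained from the individual handles afterward. The usual last step looks roughly like this (a sketch, not the exact code from the class above, and it assumes CURLOPT_RETURNTRANSFER was set on every session):

<?php
// after the wait loop: collect each body, then tear the handles down
$results = array();
foreach ( $this->sessions as $i => $ch ) {
    // curl_multi_getcontent() only returns the body if CURLOPT_RETURNTRANSFER was set
    $results[$i] = curl_multi_getcontent( $ch );
    curl_multi_remove_handle( $mh, $ch );
    curl_close( $ch );
}
curl_multi_close( $mh );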
Batch Requests
Almost every page on eMusic’s website is the result of multiple Web Service requests. Making these requests serially (one at a time) takes way too long, so we obviously opt for cURL Multi. I wrote a wrapper class for our methods, simply called API.
Here are some methods it contains:
<?php
class API {

    .....

    private function parse_response( $str ) {
        if ( is_string( $str ) && ( '404 Not Found' === $str || 0 === stripos( $str, 'Not Found' ) ) )
            return $str;

        if ( is_array( $str ) )
            return $str;

        $str = trim( $str );
        $arr = array();
        /**
         * Do a cheesy RegEx scan to indicate that it looks like JSON
         */
        if ( preg_match( self::JSON_REGEX, $str ) ) {
            $str = preg_replace( '/\t|\s\s+/', ' ', $str );
            /**
             * The Bjork problem...
             * ö, etc can break JSON deserialization,
             * so we need to clean up the JSON response
             */
            if ( ! seems_utf8( $str ) ) {
                $str = emusic_fix_bad_chars( $str );
            }
            $arr = json_decode( $str, true );
            if ( ! empty( $arr ) )
                return $arr;
        }
        /**
         * Only return documents, not string version of empty array
         */
        if ( null === $arr )
            return $str;
    }
Take an array of URLs and make the requests in batch – the result is an array of responses:
    public function multi_request( $urls ) {
        $this->curl->sessions = array();
        $opts = array( CURLOPT_ENCODING => 'identity' );
        $this->_add_existing_headers( $opts );

        foreach ( $urls as $u )
            $this->curl->addSession( $u, $opts );

        $result = $this->do_request( true );
        return $result;
    }
Pass a URL and array of params and make a GET request:
    public function get( $url, $params = '', $cache = true, $ttl = 0 ) {
        if ( ! empty( $params ) )
            $url = self::url( $url, $params );

        self::save_url( $url );

        $cached = null;
        if ( $cache )
            $cached = Cache::get( $this->bucketize( $url ), $url );

        if ( $cached ) {
            return $cached;
        } else {
            $this->curl->sessions = array();
            $opts = array( CURLOPT_TIMEOUT => 8 );
            $this->_add_existing_headers( $opts );
            $this->curl->addSession( $url, $opts );
            $result = $this->do_request();
            if ( $result && ! $this->is_http_error() ) {
                $result = $this->parse_response( $result );
                $this->add_to_cache( $url, $result, $ttl );
                return $result;
            }
        }
    }
Pass a URL and array of params and make a POST request:
    public function post( $url, $args = '' ) {
        $this->curl->sessions = array();
        $opts = array(
            CURLOPT_POST => 1,
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
        );

        if ( ! empty( $args ) )
            $opts[CURLOPT_POSTFIELDS] = http_build_query( $args );
        else
            $opts[CURLOPT_POSTFIELDS] = '';

        $this->_add_existing_headers( $opts );
        $this->curl->addSession( $url, $opts );
        $response = $this->do_request();

        if ( empty( $response ) && ! $this->is_http_error() ) {
            return null;
        } else {
            return $this->parse_response( $response );
        }
    }
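Pulled together, calling code looks something like this. The endpoints are placeholders, and I am assuming the class wires up its cURL object in its constructor – your plumbing may differ:

<?php
$api = new API();

// single GET, cached with the default TTL
$album = $api->get( 'http://services.woo.com/catalog/album/151' );

// POST with params
$result = $api->post( 'http://services.woo.com/ratings/rate', array(
    'album_id' => 151,
    'rating'   => 5,
) );

// batch of GETs, fired in parallel via cURL Multi
$responses = $api->multi_request( array(
    'http://services.woo.com/catalog/album/151',
    'http://services.woo.com/catalog/album/152',
) );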
Caching
When dealing with Web Services, you often (almost always) want to cache your responses for an arbitrary or specific amount of time. At eMusic, we have a large number of Web Services that often do very expensive things. The point of breaking functionality out into services is that each service can be scaled and optimized individually, outside of the realm of the core application logic. As long as the APIs remain unchanged, you can mess with the backend of your service without affecting my use of it.
Some of our services include: catalog-service, ratings-service, save-for-later-service, user-service, pricing-service, auth-service and the list goes on and on.
To take advantage of caching while making batch requests, I wrote another method that does the following:
- Iterate through an array of URLs.
- If a URL is in the cache, remove it from the batch list.
- After all items have been checked against the cache, make a batch request for any URLs still in the queue.
- For each request in the batch that is made, md5 the URL and use it as the key, with the response as the value, when placing it in the cache.
- If all of your URLs are cached, you make zero requests and just return the cached responses in an array.
Here’s an example of how the logic works; your implementation will probably differ:
<?php
.....

public function batch( $urls, $ttl = 0, $usecache = true ) {
    $response = array();
    ob_start();

    if ( empty( $ttl ) )
        $ttl = CACHE::API_CACHE_TTL;

    array_map( 'API::save_url', $urls );

    $batch_urls = array();
    if ( is_array( $urls ) ) {
        if ( $usecache ) {
            foreach ( $urls as $index => $url ) {
                $in = Cache::get( $this->bucketize( $url ), $url );
                if ( $in ) {
                    $response[$index] = $in;
                } else {
                    $batch_urls[$index] = $url;
                }
            }
        }
    }

    $calls = $this->multi_request( $batch_urls );

    if ( is_array( $calls ) && count( $calls ) > 0 ) {
        $keys = array_keys( $batch_urls );
        $calls = array_combine( $keys, array_values( $calls ) );

        foreach ( $calls as $index => $c ) {
            if ( $c ) {
                $response[$index] = $this->parse_response( $c );
                $this->add_to_cache( $batch_urls[$index], $response[$index], $ttl );
            } else {
                $response[$index] = null;
            }
        }
    } else if ( ! empty( $calls ) && ! empty( $batch_urls ) ) {
        reset( $batch_urls );
        $index = key( $batch_urls );
        $data = $this->parse_response( $calls );
        $this->add_to_cache( $batch_urls[$index], $data, $ttl );
        $response[$index] = $data;
    }

    ob_end_clean();
    return $response;
}

....
I have abstracted the Cache class because I sometimes use my API class without using WordPress. In my eMusic implementation, I use wp_cache_set and wp_cache_get, but this allows me to implement Cache however I want in other environments.
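Purely for illustration, a minimal Cache shim for the WordPress environment might look like the following – md5’d URL as the key, the bucketize() result as the group. The production class isn’t shown here, so treat every line of this as an assumption:

<?php
// hypothetical Cache shim backed by the WordPress object cache
class Cache {
    const API_CACHE_TTL = 300; // assumed default TTL, in seconds

    public static function get( $group, $url ) {
        // md5 the URL for a fixed-length, memcached-safe key
        return wp_cache_get( md5( $url ), $group );
    }

    public static function set( $group, $url, $data, $ttl = 0 ) {
        return wp_cache_set( md5( $url ), $data, $group, $ttl ?: self::API_CACHE_TTL );
    }
}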
Request Map
Making batch requests is all well and good, but I still have to loop through the responses to do anything with them. If my batch request was for expanded catalog metadata on 10 separate albums, then a loop works just fine. If I am collecting all of the data I need for an entire page by making all of my service calls in batch, I might want to associate a callback with each individual URL. To accomplish this, I wrote a class called Request Map.
Here is a Request Map instance which makes all of the calls we need from Services to populate data on the homepage of eMusic:
<?php
$map = new RequestMap();

$map->add( get_twitter_url( 'eMusic' ), function ( $data ) {
    global $tweets;
    $tweets = $data;
} );

$map->add( get_charts( array( 'perPage' => 10 ) ), function ( $data ) {
    global $chart;
    if ( ! empty( $data['albums'] ) ) {
        $chart = $data['albums'];
    }
} );

$map->add( get_trending_artists( array( 'perPage' => 5, 'return' => true ) ), function ( $data ) {
    global $trending;
    if ( ! empty( $data['artists'] ) ) {
        $trending = $data['artists'];
    }
} );

$map->send();
RequestMap::add takes two arguments: a URL, and a callback that receives the response data as a PHP array (an optional third argument passes extra params through to the callback). Every time RequestMap::send is called, a batch request is fired for any URLs in the queue; when it finishes, the appropriate data is passed to each callback. RequestMap::flush will reset the queue:
<?php
$map = new RequestMap();
$map->add( 'http://woo.com/1', function ( $data ) {....} );
$map->add( 'http://woo.com/2', function ( $data ) {....} );
$map->add( 'http://woo.com/3', function ( $data ) {....} );
$map->send();

$map->flush();
$map->add( 'http://woo.com/4', function ( $data ) {....} );
$map->add( 'http://woo.com/5', function ( $data ) {....} );
$map->add( 'http://woo.com/6', function ( $data ) {....} );
$map->send();
Sending two separate batches like this is necessary when the second set of requests requires data from the first, since PHP does not have true threading. Here is the code for my RequestMap class:
<?php
class RequestMap {
    /**
     * @var array
     */
    private $requests;

    /**
     * @var array
     */
    private $responses;

    /**
     * @var int
     */
    private $ttl;

    /**
     * @var boolean
     */
    private $useCache = true;

    /**
     * @var boolean
     */
    protected $error = false;

    public function __construct() {
        $this->flush();
    }

    /**
     * @return boolean
     */
    public function is_error() {
        return $this->error;
    }

    public function flush() {
        $this->requests = array();
    }

    /**
     * @return int
     */
    public function getTtl() {
        if ( empty( $this->ttl ) )
            $this->ttl = CACHE::API_CACHE_TTL;

        return $this->ttl;
    }

    public function setTtl( $ttl ) {
        $this->ttl = $ttl;
    }

    /**
     * @return boolean
     */
    public function getUseCache() {
        return $this->useCache;
    }

    public function setUseCache( $usecache ) {
        $this->useCache = $usecache;
    }

    /**
     * @param string $url
     * @param callable $callback
     * @param array $vars
     */
    public function add( $url, $callback, $vars = array() ) {
        $params = new stdClass();
        $params->url = $url;
        $params->callback = $callback;
        $params->params = (array) $vars;
        $this->requests[] = $params;
    }

    /**
     * @return array
     */
    private function getRequestUrls() {
        return array_map( function ( $item ) {
            return $item->url;
        }, $this->requests );
    }

    /**
     * @param stdClass $item
     * @param array $response
     */
    private function exec( $item, $response ) {
        $params = array_merge( array( $response ), $item->params );
        call_user_func_array( $item->callback, $params );
    }

    public function send() {
        if ( ! empty( $this->requests ) ) {
            $this->responses = API()->batch( $this->getRequestUrls(), $this->getTtl(), $this->useCache );

            if ( is_array( $this->responses ) ) {
                foreach ( $this->responses as $i => $response ) {
                    if ( ! empty( $this->requests[$i] ) ) {
                        $this->exec( $this->requests[$i], $response );
                    }
                }
            }
        }
        $this->flush();
    }
}
It’s really not that complicated.
Conclusion
I would advise using methods like those listed above instead of relying on JavaScript and JSONP to talk to external services. Pages can be served super-fast out of Memcached if set up properly, and your page load times will be faster – perceived or otherwise – with as little initial JavaScript as possible.