OS X 10.8 Mountain Lion Preview + MacPorts

I love having access to software early, so I bought a Mac Developer Program membership last year for 100 bones and subsequently never really used it. I got an invite last week to download the Developer Preview of Mountain Lion, the forthcoming OS X upgrade, so I did.

Note: as soon as you upgrade to Mountain Lion, you need to install the preview of Xcode 4.4. After doing that, you have to install Xcode command line tools, which doesn’t happen automatically. If you don’t follow this extra step, you won’t have anything at your disposal in Terminal – make, svn, etc.

Installing pre-release software is dangerous and will break things all over the place. I’m going to tell you what breaks (that I know of so far) and how to fix it (if I know how).

Chrome

I’m not going to pretend like I know why, but Chrome runs like shit so far in Mountain Lion. Might have something to do with Darwin 11 vs 12, might not. However, Safari is NOT broken and runs faster and smoother than I have ever seen it. I have a strong affinity for Chrome, so I may not switch just yet – but man, Safari is fast.

Little Snitch

If you have pirated software and you want to block all attempts at activation pings in a managed way, you probably have a program like Little Snitch helping you since it allows you to block all internet access to selected programs. I use it because I have Adobe Master Collection installed and I, not surprisingly, didn’t pay the thousands of dollars that it costs. Little Snitch doesn’t work because it sniffs the OS version (10.8) and deems itself incompatible.

MacPorts – specific ports

This one’s a doozy. If you run port upgrade outdated, every one of your MacPorts will be up for upgrade because they were previously compiled on Darwin 11, not Darwin 12 which ships with OS X 10.8. Annoying, and time-consuming, but not a deal-breaker… until you get to any ports which require libxml2, which is “most.”

Because some of your ports will bail on error, you need to upgrade ports individually (get the list using port outdated) using port upgrade fontforge or whatever.
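If you want to script that tedium, a rough loop like this works (a sketch – the awk skips the header line that port outdated prints). Because each port upgrade is a separate invocation, the loop keeps going when an individual port fails:

for p in $(port outdated | awk 'NR > 1 { print $1 }'); do
    sudo port upgrade $p
done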

Once you have upgraded every port that doesn’t die on error, you will realize you have a ton of ports that are in limbo – namely, php5 and all php5-* extensions, X11 and all of its libraries, and postgresql90. They all have libxml2 as a dependency, which no worky. The error is caused by locale handling in reinplace, which is not fixed in the distributed version of MacPorts but is fixed in the SVN trunk of the MacPorts project: http://trac.macports.org/browser/trunk/base/src/port1.0/portutil.tcl?rev=89839.

So, if we want to fix our ports, we need to run trunk of MacPorts while running the dev version of Mountain Lion.

mkdir -p /opt/mports
cd /opt/mports
svn checkout https://svn.macports.org/repository/macports/trunk
cd /opt/mports/trunk/base
./configure --enable-readline
make
sudo make install
make distclean

port edit libxml2
# then change "reinplace" to "reinplace -locale C"

At this point, you need to go back to individually upgrading ports, but you’ll find that most of them get upgraded as dependencies of other ports, so it won’t take as long as before.

Ports that are still broken after this for me: ffmpeg – due to failure of libvpx port (VP8 codec), postgresql90, and the PHP PostgreSQL extension (php5-postgresql). I’m sure there are many others.

Mountain Lion is pretty cool and worth upgrading to if you have the option. Once you upgrade, you can’t go back to Lion without having previously cloned your hard drive, etc, so be careful and be willing to get your hands dirty when things don’t work.

I installed Mountain Lion on my laptop, which I have been using less and less for programming lately, but I often need to run local PHP, so I had no choice but to figure all of this out.

WordPress + Web Services

Most WordPress setups exist on a hosting service (Shared or VPS) – one server talking to one database, boom, done. As sites grow or big sites start to use WordPress, setups grow to many servers, a handful of databases, a handful of Memcached boxes, and maybe some external Web Services or 3rd-party APIs. If the site is heavy on Editorial content, you may never interact with external services – but if your dynamic pages pull content from many APIs, internal and external, and you don’t want to write your entire site in JavaScript, it’s time to learn some tricks of the trade for powering and scaling many requests at a time over HTTP.

Web Services

So what’s a “Web Service”? At its simplest, a Web Service is a bunch of related URLs that return data, and the Service is usually RESTful – meaning, “Clients initiate requests to servers; servers process requests and return appropriate responses.”

http://www.woo.com/get/151
http://www.woo.com/hi-five?id=12345&friend_id=54321
http://www.woo.com/hi-five/delete?id=676767

Responses are typically in XML or JSON. It sucks to parse XML, so most APIs either return JSON by default or allow you to specify format=json in the request. Most also go as far as to return JSONP if you so desire, so you can make requests to external servers without ever leaving the confines of JavaScript.
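Serving JSONP from PHP is just as simple – here’s a minimal sketch, assuming $data holds your response array and the client passed a callback parameter:

<?php
// sanitize the callback name so clients can't inject arbitrary script
$callback = preg_replace( '/[^\w.]/', '', $_GET['callback'] );
header( 'Content-Type: application/javascript' );
echo $callback . '(' . json_encode( $data ) . ');';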

Why does JSON rule so much?

<?php

// this produces an (associative) array;
// without the second argument, you get nested stdClass objects instead
$data = json_decode( $response, true );

Parsing XML into a PHP array can look like this:

<?php
require( 'wordpress/wp-load.php' );

ini_set( 'memory_limit', '2048M' );
ini_set( 'max_execution_time', 6000 );

function walk( $arr, $func ) {
    if ( ! is_array( $arr ) ) return $arr;

    foreach ( $arr as $key => $value ) {
        if ( is_array( $value ) ) {
            $arr[$key] = walk( $value, $func );
        } else if ( is_string( $value ) ) {
            $arr[$key] = call_user_func( $func, $value );
        }
    }
    return $arr;
}

$z = new XMLReader;
$z->open( 'blob-of.xml' );

// fast-forward to the first <someNode> element
while ( $z->read() && $z->name !== 'someNode' );

while ( $z->name === 'someNode' ) {
    $node = new SimpleXMLElement( $z->readOuterXML() );
    $data = (array) $node;
    $data = walk( $data, 'trim' );

    // nightmare of casting nested SimpleXMLElements into Arrays
    // etc etc

    $z->next( 'someNode' );
}
$z->close();

JSON is just way easier and ready-made for PHP.

Services expose APIs (Application Programming Interfaces) that are used to build requests. You should be familiar with available APIs on the web by now: Facebook Connect, Twitter, Google Maps, etc.

WP_Http

WordPress has an API for dealing with HTTP – procedural and object-oriented. It is also an abstraction layer, meaning that it will pick from a variety of HTTP connection methods based on your current setup. I’m going to spend my time introducing you to cURL, but WordPress doesn’t require that you have the cURL extension installed. The order in which it looks for connection methods is cURL, streams (fopen), and fsockopen.

<?php

$url = 'http://www.google.com/a-url/that/returns/json';
$request = new WP_Http; // or use the $wp_http global
$result = $request->request( $url );

// request() returns a WP_Error on failure, so check before reading the body
if ( ! is_wp_error( $result ) ) {
    $json = $result['body'];
    $data = json_decode( $json, true );
}
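If you prefer the procedural flavor, wp_remote_get is a thin wrapper around the same machinery:

<?php

// same request, procedural-style; it also returns a WP_Error on failure
$response = wp_remote_get( $url );
if ( ! is_wp_error( $response ) ) {
    $data = json_decode( wp_remote_retrieve_body( $response ), true );
}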

cURL

cURL (or, “Client for URLs”) is a command line tool for getting or sending files using URL syntax. Its most bare-bones use is to return the response of a URL. In Terminal:

curl www.emusic.com

This will return the contents of whatever “www.emusic.com” returns. To just read the HTTP Response Headers:

curl -I www.emusic.com

You can post from the command line as well:

curl -d "foo=bar&pass=12345&id=blah" http://www.woo.com/post-to-me

You can even make an XML-RPC call from the command line like so:

curl -i -H 'Content-Type: text/xml' --data '<?xml version="1.0"?><methodCall><methodName>demo.sayHello</methodName><params></params></methodCall>' 'http://www.woo.com/xmlrpc.php'

cURL Extension

The cURL extension for PHP is, at its core, a binding to libcurl. libcurl currently supports the http, https, ftp, gopher, telnet, dict, file, and ldap protocols. libcurl also supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploading (this can also be done with PHP’s ftp extension), HTTP form-based upload, proxies, cookies, and user+password authentication.

Because the cURL extension has external dependencies, it is good to use a package manager to install it (MacPorts on OS X, yum on Linux – or PECL on either). You can run the following commands using MacPorts to install some PHP 5.4 extensions for HTTP:

sudo port install php54-curl php54-http
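Once the extension is installed, the raw API is handle-based. A minimal GET looks like this (URL hypothetical):

<?php

// init a handle, ask for the body back as a string, fire, clean up
$ch = curl_init( 'http://www.woo.com/get/151' );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
$body = curl_exec( $ch );
curl_close( $ch );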

cURL Abstraction

We could start from scratch and write our own interface to cURL, or we could find someone who has done all of the dirty work for us. We are still going to write our own methods for talking to a cURL class, but we want to find a source that has already implemented the basics of cURL and cURL Multi.

WordPress has done this with the WP_Http class to an extent (no support or abstraction for cURL Multi). I use a variation on this guy’s code – an Object-Oriented cURL class that supports this form of “multi-threading.” The approach is very clean because it allows you to use the same interface to request one or multiple URLs:

<?php
// One URL
$curl = new CURL();
$curl->retry = 2;
$opts = array( CURLOPT_RETURNTRANSFER => true, CURLOPT_FOLLOWLOCATION => true  );
$curl->addSession( 'http://yahoo.com/', $opts );
$result = $curl->exec();
$curl->clear();

// 3 URLs
$curl = new CURL();
$opts = array( CURLOPT_RETURNTRANSFER => true, CURLOPT_FOLLOWLOCATION => true  );
$curl->addSession( 'http://yahoo.com/', $opts );
$curl->addSession( 'http://google.com/', $opts );
$curl->addSession( 'http://ask.com/', $opts );
$result = $curl->exec();
$curl->clear();

cURL Multi

cURL Multi has a procedural API in the cURL PHP extension, and it’s pretty nasty to deal with. There are standard ways to implement it, which our OO class abstraction takes care of for us. The guts look like this:

<?php
$mh = curl_multi_init();

// add all sessions (individual cURL handles) to the multi handle
foreach ( $this->sessions as $i => $session )
    curl_multi_add_handle( $mh, $this->sessions[$i] );

// kick off all of the requests
do {
    $mrc = curl_multi_exec( $mh, $active );
} while ( $mrc == CURLM_CALL_MULTI_PERFORM );

// wait for activity on any handle, then let cURL do more work
while ( $active && $mrc == CURLM_OK ) {
    if ( curl_multi_select( $mh ) != -1 ) {
        do {
            $mrc = curl_multi_exec( $mh, $active );
        } while ( $mrc == CURLM_CALL_MULTI_PERFORM );
    }
}

It’s just a busy loop that eventually ends up returning an array of responses. You don’t get your response until all of the requests have completed, so you are only as fast as your slowest link.
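The piece the excerpt leaves out is collecting those responses once the loop exits – typically something like this (a sketch; the class does the equivalent for you):

<?php
// read each completed response, then detach and clean up the handles
$results = array();
foreach ( $this->sessions as $i => $ch ) {
    $results[$i] = curl_multi_getcontent( $ch );
    curl_multi_remove_handle( $mh, $ch );
}
curl_multi_close( $mh );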

Batch Requests

Almost every page on eMusic’s website is the result of multiple Web Service requests. Making these requests serially (one at a time) takes way too long, so we obviously opt to use cURL Multi. I wrote a wrapper class for our methods simply called API.

Here are some methods it contains:

<?php
class API {
.....
    private function parse_response( $str ) {
        if ( is_string( $str ) && ( '404 Not Found' === $str || 0 === stripos( $str, 'Not Found' ) ) )
            return $str;

        if ( is_array( $str ) )
            return $str;

        $str = trim( $str );

        $arr = array();
        /**
         * Do a cheesy RegEx scan to indicate that it looks like JSON
         */
        if ( preg_match( self::JSON_REGEX, $str ) ) {
            $str = preg_replace( '/\t|\s\s+/', ' ', $str );
            /**
             * The Bjork problem...
             * ö, etc can break JSON deserialization,
             * so we need to clean up the JSON response
             */
            if ( ! seems_utf8( $str ) ) {
                $str = emusic_fix_bad_chars( $str );
            }

            $arr = json_decode( $str, true );

            if ( ! empty( $arr ) )
                return $arr;
        }

        /**
         * Only return documents, not a string version of an empty array
         */
        if ( null === $arr )
            return $str;
    }

Take an array of URLs and make the requests in batch – the result is an array of responses:

public function multi_request( $urls ) {
	$this->curl->sessions = array();

	$opts = array( CURLOPT_ENCODING => 'identity' );

	$this->_add_existing_headers( $opts );

	foreach ( $urls as $u )
		$this->curl->addSession( $u, $opts );

	$result = $this->do_request( true );

	return $result;
}

Pass a URL and array of params and make a GET request:

public function get( $url, $params = '', $cache = true, $ttl = 0 ) {
    if ( ! empty( $params ) )
	$url = self::url( $url, $params );

    self::save_url( $url );

    $cached = null;

    if ( $cache )
	$cached = Cache::get( $this->bucketize( $url ), $url );

    if ( $cached ) {
	return $cached;
    } else {
	$this->curl->sessions = array();

	$opts = array( CURLOPT_TIMEOUT => 8 );

	$this->_add_existing_headers( $opts );

	$this->curl->addSession( $url, $opts );
	$result = $this->do_request();

	if ( $result && ! $this->is_http_error() ) {
	    $result = $this->parse_response( $result );
	    $this->add_to_cache( $url, $result, $ttl );
	    return $result;
	}
    }
}
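Calling it is a one-liner (URL and params hypothetical):

<?php
// params get serialized onto the URL; the response is parsed and cached
$data = API()->get( 'http://api.woo.com/charts', array( 'perPage' => 10 ) );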

Pass a URL and array of params and make a POST request:

public function post( $url, $args = '' ) {
	$this->curl->sessions = array();

	$opts = array(
		CURLOPT_POST => 1,
		// disabling SSL verification is convenient for internal services,
		// but don't do this for requests over the public internet
		CURLOPT_SSL_VERIFYHOST	=> 0,
		CURLOPT_SSL_VERIFYPEER	=> 0,
	);

	if ( ! empty( $args ) )
		$opts[CURLOPT_POSTFIELDS] = http_build_query( $args );
	else
		$opts[CURLOPT_POSTFIELDS] = '';

	$this->_add_existing_headers( $opts );

	$this->curl->addSession( $url, $opts );
	$response = $this->do_request();

	if ( empty( $response ) && ! $this->is_http_error() ) {
		return null;
	} else {
		return $this->parse_response( $response );
	}
}

Caching

When dealing with Web Services, you often (almost always) want to cache your responses for an arbitrary or specific amount of time. At eMusic, we have a large number of Web Services that often do very expensive things. The purpose of breaking functionality out into services is to be able to scale and optimize each service individually, outside the realm of the core application logic. As long as the APIs remain unchanged, you can mess with the backend of your service without affecting my use of it.

Some of our services include: catalog-service, ratings-service, save-for-later-service, user-service, pricing-service, auth-service and the list goes on and on.

To take advantage of caching while making batch requests, I wrote another method that does the following:

  1. Iterate through an array of URLs
  2. If a URL is in the cache, remove it from the batch list
  3. After all items have been checked against the cache, make a batch request for any URLs still in the queue
  4. For each request in the batch that is made, md5 the URL and use it as the key and the response as the value to place in the cache
  5. If all of your URLs are cached, you make zero requests and just return the cached responses in an array

Here’s an example of how the logic works; your implementation will probably differ:

<?php

.....
public function batch( $urls, $ttl = 0, $usecache = true ) {
    $response = array();
    ob_start();

    if ( empty( $ttl ) )
	$ttl = CACHE::API_CACHE_TTL;

    array_map( 'API::save_url', $urls );

    $batch_urls = array();
    if ( is_array( $urls ) ) {
	foreach ( $urls as $index => $url ) {
	    // when caching is off, skip the lookup and request every URL
	    $in = $usecache ? Cache::get( $this->bucketize( $url ), $url ) : false;

	    if ( $in ) {
		$response[$index] = $in;
	    } else {
		$batch_urls[$index] = $url;
	    }
	}
    }
    $calls = $this->multi_request( $batch_urls );

    if ( is_array( $calls ) && count( $calls ) > 0 ) {
	$keys = array_keys( $batch_urls );
	$calls = array_combine( $keys, array_values( $calls ) );

	foreach ( $calls as $index => $c ) {
	    if ( $c ) {
		$response[$index] = $this->parse_response( $c );
		$this->add_to_cache( $batch_urls[$index], $response[$index], $ttl );
	    } else {
		$response[$index] = null;
	    }
	}
    } else if ( ! empty( $calls ) && ! empty( $batch_urls ) ) {

	reset( $batch_urls );
	$index = key( $batch_urls );
	$data = $this->parse_response( $calls );
	$this->add_to_cache( $batch_urls[$index], $data, $ttl );

	$response[$index] = $data;
    }
    ob_end_clean();

    return $response;
}
....

I have abstracted the Cache class because I sometimes use my API class without using WordPress. In my eMusic implementation, I use wp_cache_set and wp_cache_get, but this allows me to implement Cache however I want in other environments.
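In WordPress-land that shim can be tiny. A minimal sketch – the method names mirror how Cache is called above, the md5-of-the-URL keying from step 4 lives here, and the default TTL is an assumption:

<?php
class Cache {
    const API_CACHE_TTL = 300; // assumption: default TTL in seconds

    public static function get( $bucket, $url ) {
        return wp_cache_get( md5( $url ), $bucket );
    }

    public static function set( $bucket, $url, $data, $ttl = self::API_CACHE_TTL ) {
        return wp_cache_set( md5( $url ), $data, $bucket, $ttl );
    }
}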

Request Map

Making batch requests is all well and good, but I still have to loop through the responses to do anything with them. If my batch request was for expanded catalog metadata on 10 separate albums, then a loop works just fine. If I am collecting all of the data I need for an entire page by making all of my service calls in batch, I might want to associate a callback with each individual URL. To accomplish this, I wrote a class called Request Map.

Here is a Request Map instance which makes all of the calls we need from Services to populate data on the homepage of eMusic:

<?php
$map = new RequestMap();
$map->add( get_twitter_url( 'eMusic' ), function ( $data ) {
    global $tweets;
    $tweets = $data;
} );
$map->add( get_charts( array( 'perPage' => 10 ) ), function ( $data ) {
    global $chart;

    if ( !empty( $data['albums'] ) ) {
        $chart = $data['albums'];
    }
} );
$map->add( get_trending_artists( array( 'perPage' => 5, 'return' => true ) ), function ( $data ) {
    global $trending;
    if ( !empty( $data['artists'] ) ) {
        $trending = $data['artists'];
    }
} );
$map->send();

RequestMap::add takes a URL, a callback that receives the response data as a PHP array, and an optional array of extra arguments for the callback. Every time RequestMap::send is called, a batch request is fired for any URLs in the queue. When finished, the appropriate data is passed to each callback. RequestMap::flush will reset the queue:

<?php
$map = new RequestMap();
$map->add( 'http://woo.com/1', function ( $data ) {....} );
$map->add( 'http://woo.com/2', function ( $data ) {....} );
$map->add( 'http://woo.com/3', function ( $data ) {....} );
$map->send();
$map->flush();
$map->add( 'http://woo.com/4', function ( $data ) {....} );
$map->add( 'http://woo.com/5', function ( $data ) {....} );
$map->add( 'http://woo.com/6', function ( $data ) {....} );
$map->send();

This is necessary if the second set of requests requires data from the first, since PHP does not have true threading. Here is the code for my RequestMap class:

<?php
class RequestMap {
/**
 * @var array
 */
private $requests;
/**
 * @var array
 */
private $responses;
/**
 * @var int
 */
private $ttl;
/**
 * @var boolean
 */
private $useCache = true;
/**
 * @var boolean
 */
protected $error = false;

public function __construct() {
    $this->flush();
}

/**
 * @return boolean
 */
public function is_error() {
    return $this->error;
}

public function flush() {
    $this->requests = array();
}

/**
 * @return int
 */
public function getTtl() {
    if ( empty( $this->ttl ) )
        $this->ttl = CACHE::API_CACHE_TTL;

    return $this->ttl;
}

public function setTtl( $ttl ) {
    $this->ttl = $ttl;
}

/**
 * @return boolean
 */
public function getUseCache() {
    return $this->useCache;
}

public function setUseCache( $usecache ) {
    $this->useCache = $usecache;
}

/**
 *
 * @param string $url
 * @param callable $callback
 * @param array $vars
 */
public function add( $url, $callback, $vars = array() ) {
    $params = new stdClass();
    $params->url = $url;
    $params->callback = $callback;
    $params->params = (array) $vars;
    $this->requests[] = $params;
}

/**
 *
 * @return array
 */
private function getRequestUrls() {
    return array_map( function ( $item ) {
	return $item->url;
    }, $this->requests );
}

/**
 *
 * @param stdClass $item
 * @param array $response
 */
private function exec( $item, $response ) {
    $params = array_merge( array( $response ), $item->params );
    call_user_func_array( $item->callback, $params );
}

public function send() {
    if ( ! empty( $this->requests ) ) {
	$this->responses = API()->batch( $this->getRequestUrls(), $this->getTtl(), $this->useCache );

	if ( is_array( $this->responses ) ) {
	    foreach ( $this->responses as $i => $response ) {
		if ( ! empty( $this->requests[$i] ) ) {
	            $this->exec( $this->requests[$i], $response );
		}
	    }
        }
    }

    $this->flush();
}

}
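One thing the examples above don’t show: the optional third argument to add() is an array of extra values that exec() passes to your callback after the response:

<?php
$map = new RequestMap();
$map->add( get_charts( array( 'perPage' => 10 ) ), function ( $data, $genre ) {
    // $data is the parsed response, $genre is 'jazz'
}, array( 'jazz' ) );
$map->send();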

It’s really not that complicated.

Conclusion

I would advise using methods like those listed above instead of relying on JavaScript and JSONP to talk to external services. Pages can be served super-fast out of Memcached if set up properly, and your page load times will be faster – perceived or otherwise – with as little initial JavaScript as possible.

Term / Taxonomy is broken in WordPress

I meant to include this in my MySQL post, but making a separate post to address this one particular issue works just fine. In my previous post, I said that the Term tables are a mess. Here is why:

Let’s say I create 2 taxonomies, location and radio_station (FYI: an example of a “taxonomy” is a “tag” or a “category”). Here’s how I register them (simplified) in my theme or plugin:

<?php
register_taxonomy( 'radio_station', 'post', array( 'labels' => array(
    'name'          => 'Radio Stations',
    'singular_name' => 'Radio Station'
) ) );
register_taxonomy( 'location', 'post', array( 'labels' => array(
    'name'          => 'Locations',
    'singular_name' => 'Location'
) ) );

Once this is done, you should have extra links in the Posts fly-out in the admin menu.

Click each link (Tags, Radio Stations, Locations) and add "Brooklyn" as a term. This is a completely valid practice. Now change ONE of them to "Brooklyn’s Finest" and go back and check out the other 2. THEY ALL CHANGED! So that sucks, right?

Why does this happen? Terms are stored in wp_terms and are constrained by uniqueness. So when you added Brooklyn that second time – even though it had a different taxonomy – it pointed at that first term, and more importantly, shackled itself to it forever by associating it with that first term’s term_id in the wp_term_taxonomy table.
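Concretely, after adding "Brooklyn" to all three taxonomies, the tables look something like this (IDs hypothetical):

wp_terms:           term_id | name     | slug
                    7       | Brooklyn | brooklyn

wp_term_taxonomy:   term_taxonomy_id | term_id | taxonomy
                    10               | 7       | post_tag
                    11               | 7       | radio_station
                    12               | 7       | location

One row in wp_terms, three rows in wp_term_taxonomy all pointing at term_id 7. Rename term 7 and all three taxonomies "change" at once.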

wp_term_taxonomy is the first-class table of the bunch (wp_terms, wp_term_taxonomy, and wp_term_relationships). Terms are arbitrary; they are associated with taxonomies in the wp_term_taxonomy table. term_taxonomy_id is the PRIMARY KEY, and term_id is the foreign reference. Weirdly, taxonomy is not a foreign key pointing at a wp_taxonomy table – there is no wp_taxonomy table! wp_term_relationships joins term_taxonomy_id with Post IDs.

So let’s say you have 1,000,000 terms but only 5 taxonomies. If 90,000 of them are post_tags, the taxonomy field for all of them will be post_tag. The opposite is not true: 5 taxonomies only point at ONE Brooklyn. In this scenario, shared terms make no sense. Changing the name of the term stored at that term_id in wp_terms will change the name for each and every taxonomy that term_id is associated with.

Discuss.