Peter Sobot

The Cost of Waterloo Software Engineering

2014-09-08T00:00:00-04:00

This past June, I graduated from the University of Waterloo’s Software Engineering program. After 5 long and difficult years, I’m extremely proud to say that I’m a Waterloo grad, and very proud of my accomplishments and experiences at the school. Somewhat surprisingly, most of my classmates and I were able to graduate from a top-tier engineering school with zero debt. (I know this might sound like a sales pitch – stick with me here.)

Waterloo is home to the world’s largest cooperative education program, meaning that every engineering student is required to take at least five internships over the course of their degree. Most take six. This lengthens the degree to five years, and forces us into odd schedules where we alternate between four months of work and four months of school. We get no summer breaks.

One of the most important parts of Waterloo’s co-op program is that the school requires each placement be paid. Without meeting certain minimum requirements for compensation, a student can’t claim academic credit for their internship, and without five internships, they can’t graduate. This results in Waterloo co-op students being able to pay their tuition in full (hopefully) each semester. In disciplines like Software Engineering, where demand is at an all-time high and many students are skilled enough to hold their own at Silicon Valley tech giants, many students end up negotiating for higher salaries at their internships.

To help visualize this financial situation and aid younger Software Engineering students in planning their future, I decided to create a little tool: the SE Calculator.

This simple, free, open-source in-browser tool allows you to calculate and visualize how much money you’ll earn or owe at the end of a five-year Waterloo Software Engineering degree. While it’s not rigorous (and should not be used as a financial advisor) it has helped me visualize how much money I’ve earned and spent during my academic career.

By default, the site assumes you’re a student that pays average Software Engineering tuition and average Software Engineering fees, earns one scholarship in your first year, and spends each internship working at software companies in Waterloo. The calculator includes a bunch of preset values, taken from personal experience and that of classmates, to simulate what you might make and spend when working in different regions or industries. (For example, the San Francisco Bay Area preset has a ridiculously high housing cost, but a similarly high salary.)
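The arithmetic behind a calculator like this is simple enough to sketch in a few lines of Python. This is only an illustration, not the SE Calculator’s actual code (which runs in-browser with Angular.js); the Term type, the final_balance function, and every dollar figure below are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass
class Term:
    """One four-month term; every amount is a hypothetical placeholder."""
    name: str
    income: float = 0.0    # co-op salary or scholarships received
    tuition: float = 0.0   # tuition and fees paid (zero on a work term)
    living: float = 0.0    # rent, food, and other living costs


def final_balance(terms):
    """Sum income minus costs across all terms; negative means debt."""
    return sum(t.income - t.tuition - t.living for t in terms)


# Made-up numbers for the first two terms of a degree.
terms = [
    Term("1A (school)", income=2000, tuition=6000, living=4000),
    Term("Co-op 1 (work)", income=12000, living=5000),
]
print(final_balance(terms))
```

Swapping in a different list of terms (say, one with a Bay Area work term) is all a "preset" would need to do in this sketch.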

The site also stores your data in the URL string, because — well, simply — I wanted to store the data somewhere quick and easy. Bookmark the page once you’ve plugged in some values and store multiple datasets in your bookmarks bar.
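Keeping state in the URL needs very little machinery. The sketch below shows the general idea in Python with the standard library’s urllib.parse; the field names and the example.com address are invented for illustration, and the real tool may well encode its data differently.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def state_to_url(base_url, state):
    """Serialize a flat dict of calculator inputs into a query string."""
    return base_url + "?" + urlencode(sorted(state.items()))


def url_to_state(url):
    """Recover the dict from a bookmarked URL (values come back as strings)."""
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


# Hypothetical field names and values.
url = state_to_url("https://example.com/calc", {"tuition": 6000, "salary": 12000})
print(url)
print(url_to_state(url))
```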

If you’re a Software Engineering student (or will soon be one), I hope you find the tool useful to you. If you’re a student in some other Waterloo Engineering discipline, or in Computer Science, hopefully most of the fields still apply to you and you might get some utility out of the tool as well.

If you’re interested in customizing the tool – to add new presets, to adapt it to your own academic situation, or just to fix bugs – please feel free to fork it on GitHub. The tool runs almost entirely in-browser with Angular.js and uses Gulp as a build tool. Happy hacking!

The Holiday Party Hack

2013-12-14T00:00:00-05:00

For this year’s holiday party at The Working Group, I helped build something special to spice up the party – a live, music-synced slideshow of the evening, powered by a nearby photo booth. Take a photo with your friends and loved ones, then see it show up on the big screen seconds later.

The Hardware

To take the photos, we mounted a Canon Rebel T2i with an Eye-Fi card on a tripod in front of a great backdrop. A generous serving of props was provided for people to play with, and the room was well lit.

Also significant – the photo booth had a glass wall on one side, making it easy for partygoers to notice the fun to be had inside, while still allowing for a little bit of separation from the cacophony outside.

Finally, to allow partygoers to trigger their photos themselves without needing someone behind the camera, Brian Gilham and I built a huge, industrial-looking remote with a massive green button. In reality, we just wrapped the camera’s tiny remote in a larger enclosure and physically lined up the remote’s button with the plunger of a larger button.

The Eye-Fi card in the camera synced its photos automatically with a nearby Macbook Pro.

The Software

To get the photos on the screen, a ridiculous number of steps were used. Hazel, running on the Macbook Pro, copied the photos from the Eye-Fi card’s folder into a dedicated folder in Dropbox. A Node.js app running on a Rackspace cloud server connected to the Dropbox API and received real-time updates whenever new photos were placed in the Dropbox folder. This app downloaded the high-res photos from Dropbox, used Imagemagick to crop, scale, and rotate them appropriately, and streamed them down to all connected browsers.

A Macbook Pro connected to the projector ran a client-side JavaScript app and received real-time photo updates via Socket.io. This app also used the Web Audio API to run BeatDetektor, an open source JS beat detection library, on the audio received by the laptop’s microphone. Finally, Scott Schiller’s 2003-era snowstorm.js library provided the wonderfully tacky snow falling in-browser.

This complicated chain of events made it super simple to build the software – by piecing together pre-made components like Dropbox, Hazel, and BeatDetektor, most of the work was already done. Some extra functionality even came for free – for example, by sharing the Dropbox folder with select people at the party, candid photos could be uploaded from people’s phones directly to the projector screen.

The Results

By the end of the night, more than 350 photos – 1.5GB of data – had been processed by the hack and made it to the big screen. At one point, so many photos were taken in quick succession that the server load spiked to 38 and crashed hard – bringing with it forever.fm, my “infinite” radio station. Despite the small technical hiccups, the hack turned out wonderfully and was a huge success.

Huge thanks go out to Chris Mudiappahpillai, Brian Gilham, Derek Watson and Shiera Aryev and many more for making the hack – and the evening – a resounding success.

The Architecture of an Infinite Stream of Music

2013-11-05T00:00:00-05:00

Nearly a year ago, I launched forever.fm – a free online radio station that seamlessly beat matches its songs together into a never-ending stream. At launch, it was hugely popular – with hundreds of thousands of people tuning in. In the months since its initial spike of popularity, I’ve had a chance to revisit the app and rebuild it from the ground up for increased stability and quality.

(Grab the free iOS and Android apps to listen to forever.fm on the go.)

Initially, Forever.fm was a single-process Python app, written with the same framework I had built for my other popular web app, The Wub Machine. While this worked as a proof of concept, there were a number of issues with this model.

Single monolithic apps are very difficult to scale. In my case, Forever.fm’s monolithic Python process had to service web requests and generate the audio to send to its listeners. This task is what’s known as a “soft real-time” task – in which any delays or missed deadlines cause noticeable degradation of experience to the user. As the usage of the app grew, it became difficult to balance the high load generated by different parts of the app in a single process. Sharding was not an option, as Forever is built around a single radio stream – only one of which should exist at the same time. Unlike a typical CRUD app, I couldn’t just deploy the same app to multiple servers and point them at the same database.
Single monolithic apps are very difficult to update. Any modifications to the code base of Forever required a complete restart of the server. (In my initial iteration and blog post, I detailed a method for reloading Python modules without stopping the app – but ran into so many stability issues with this method that I had to abandon it altogether.) As with any v1 app, Forever had a constant stream of updates and fixes. Restarting the app every time a bug fix had to be made – thereby stopping the stream of music – was ridiculous.
Memory usage and CPU profiling were both difficult problems to solve with a one-process app. Although Python offers a number of included profiling tools, none of them are made to be used in a production environment – which is often the environment in which these problems occur. Tracking down which aspect of the app is eating up gigabytes of memory is critical.

To solve all of these problems in one go, I decided to re-architect Forever.fm as a streaming service-oriented architecture with a custom queueing library called pressure.

Usually, service oriented architectures are strongly request/response based, with components briefly talking with each other in short bursts. Forever does make use of this paradigm, but its central data structure is an unbounded stream of MP3 packets. As such, a lot of the app’s architecture is structured around pipelines of data of different formats. To make these pipelines reliable and fast when working with large amounts of streaming data, I constructed my own Redis-based bounded queue protocol that currently has bindings in Python and C. It also creates really nice d3 graphs of the running system.

Forever.fm is broken down into multiple services that act on these pipelines of data:

The brain picks tracks from a traditional relational database, orders them by approximating the Traveling Salesman Problem on a graph of tracks and their similarities, and pushes them into a bounded queue.
The mixer reads tracks from this queue in order, analyzes the tracks and calculates the best-sounding overlaps between each track and the next. This is essentially the “listening” step. These calculations also go into a bounded queue.
The renderer reads calculations from this queue and actually renders the MP3 files into one stream, performing time stretching and volume compression as required. This step pushes MP3 frames, each roughly 26ms long, into another bounded queue.
The mp3_server reads MP3 frames from this queue at a precise rate (38.28125 frames per second, for 44.1kHz audio) and sends them to each listener in turn over HTTP. (It also keeps track of who’s listening to help produce a detailed report of how many people heard each song.)

There are a number of other services that come together to make Forever.fm work, including the excitingly-named web_server, info_server, social_server, manager, tweeter, relay and playcounter. Each of these services consists of less than 1000 lines of code, and some of them are written in vastly different languages. At the moment, they all run on the same machine – but that could easily change without downtime and without dropping the music. Each service has a different pid and memory space, making it easy to see which task is using up resources.
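The frame rate quoted for the mp3_server falls out of the MP3 format itself: an MPEG-1 Layer III frame always holds 1152 samples. The following Python sketch works through that arithmetic and shows one way a sender could pace itself against a fixed schedule; frame_deadlines is a hypothetical helper, not code from Forever.fm.

```python
SAMPLES_PER_FRAME = 1152   # fixed by MPEG-1 Layer III
SAMPLE_RATE = 44100        # Hz

frames_per_second = SAMPLE_RATE / SAMPLES_PER_FRAME
frame_duration = SAMPLES_PER_FRAME / SAMPLE_RATE   # seconds


def frame_deadlines(start, count):
    """Return the time at which each of `count` frames is due.

    Deadlines are computed from the start time rather than by adding
    up sleeps, so rounding errors do not accumulate into drift.
    """
    return [start + i * frame_duration for i in range(count)]


print(frames_per_second)                 # 38.28125
print(round(frame_duration * 1000, 2))   # 26.12 (milliseconds)
```

A sender following this schedule would sleep until each deadline before writing the next frame to its listeners.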

To help achieve an unbroken stream of music and more easily satisfy the soft real-time requirements of the app, pressure queues have two very important properties: bounds and buffers.

Each pressure queue is bounded – meaning that a producer cannot push data into a full queue, and may choose to block or poll when this situation occurs. Forever uses this property to lazily compute data as required, reducing CPU and memory usage significantly. Each data pipeline necessarily has one sink – one node that consumes data but does not produce data – which is used to limit the data processing rate. By adjusting the rate of data consumption at this sink node, the rate (and amount of work required) of the entire processing chain can be controlled extremely simply. Furthermore, in Forever, if no users are listening to a radio stream, the sink can stop consuming data from its queue – implicitly stopping all of the backend processing and reducing the CPU load to zero. By blocking on IO, we let the OS schedule all of our work for us – and I trust the OS’s scheduler to do a much better job than Python’s.
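The back-pressure behaviour described here can be demonstrated without Redis. The sketch below uses Python’s in-process queue.Queue as a stand-in for a pressure queue (it is not the pressure API): the producer blocks whenever the queue is full, so it can only ever run a bounded distance ahead of the consumer at the sink.

```python
import queue
import threading

BOUND = 3
q = queue.Queue(maxsize=BOUND)   # stand-in for a bounded pressure queue
produced = []


def producer():
    for i in range(10):
        q.put(i)                 # blocks while the queue is full
        produced.append(i)
    q.put(None)                  # sentinel: no more data


def sink(limit):
    """Consume `limit` items, then stop pulling from the queue."""
    return [q.get() for _ in range(limit)]


threading.Thread(target=producer, daemon=True).start()
consumed = sink(4)
print(consumed)
```

Once the sink stops calling get, the producer stalls after at most BOUND further items, which is the same effect that lets Forever go idle when nobody is listening.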

In addition, each queue has a buffer of a set size that is kept in reliable out-of-process storage – Redis, in this case. If a process were to crash for any reason, the buffer in the queueing system would allow the next process to continue processing data for some amount of time before exhausting the queue. With current parameters, nearly all of the services in Forever could fail for up to 5 minutes without causing an audio interruption. These buffers allow each component to be independently stopped, started, upgraded or debugged in production without interrupting service. (This does lead to some high-pressure bug hunting sessions where I’ll set a timer before launching GDB.)

Most of the services involved in this pipeline are backend processors of data – not front-facing web servers. However, I’ve applied the same service-oriented philosophy to the frontend of the site, creating separate servers for each general type of data served by the app. In front of all of these web servers sits nginx, being used as a fast, flexible proxy server with the ability to serve static files. HAProxy was considered, but has not yet been implemented – as nginx has all of the features needed, including live configuration reloads.

With this combination of multiple specialized processes and a reliable queuing system, Forever has enjoyed very high availability since the new architecture was deployed. I’ve personally found it indispensable to be able to iterate quickly on a live audio stream – often in production. The ability to make impactful changes on a real-time system in minutes is incredible – and although somewhat reckless at times, can be an amazing productivity boon to a tiny startup.

Partially thanks to this new architecture, I’ve also built free iOS and Android clients for forever.fm. Download them and listen to infinite radio on the go!

Co-Working at The Working Group

2013-10-25T00:00:00-04:00

Early in my academic career at the University of Waterloo, I was fortunate enough to land a co-op placement at The Working Group. Back then, the team was just over a dozen people. We were taking on our first mobile projects, and were starting to outgrow our old office at the Burroughes building – where we still had musical jam sessions with the partners every couple weeks. I learned more and had more fun in that four-month placement than I thought possible.

That was two years ago. In February 2013, I founded a software company that creates music apps that anybody can use. So far, our portfolio of products includes The Wub Machine, an automatic music remixing app, and Forever.fm, an app that creates an infinite DJ mix of the hottest songs on SoundCloud. These two apps have proven popular, and have already reached more than 1,000,000 people across the world. However, their development had also plateaued – the “next steps” in each project required too much time and effort for me to complete in my spare time. Luckily, as a Waterloo co-op student, my classes are interrupted regularly by mandatory four-month work terms. For my sixth and final internship slot, I decided to forgo the tempting internship offers from San Francisco startups – and to instead spend four months bootstrapping my own products.

When I set out on this plan, I was first greeted by incredulity from my classmates who were returning to cushy internships in the Bay Area. One of the first people to offer encouragement was Andrés Aquino, partner at The Working Group. After I dropped back into the office to give a tech talk in early May, Andrés was quick to extend an invitation to return if I needed an environment to work in. For me, working full time to bootstrap my company, this simple invitation solved many problems. Without TWG, who would I bounce ideas off of? Who would I show my work to to ensure that I’m building the right products? Most importantly, who would point out to me when I was making mistakes? Incubators like Y Combinator or Waterloo’s own VeloCity Garage usually provide people who can fill that mentorship role – but I wasn’t yet at a stage to get accepted by either.

So far, only one month into my endeavour, things have been going extremely well. Having a desk to come in to and co-workers to talk with has been surprisingly motivating. The office has a very open culture that’s made me feel like part of the team again, despite only sharing a desk and hanging out in the team’s HipChat room. Each week, I’m held accountable by participating in morning standup meetings. (While I should hope that I don’t need external motivation to accomplish my goals, being present at the office has made it impossible for me to procrastinate.) I also make a point to demo two things every Friday: both the product I’ve worked on and the technology behind it. If I don’t learn something new each day, I’m not satisfied with my progress – and if I don’t pass on what I learn to the team, then I’m not doing my part. This spirit of “learning and teaching” also helps me solidify what I’ve learned and distill it into meaningful information that’s useful to others.

In the three months I’ve got left at TWG, I have a long list of things to accomplish. If productivity stays as high as it has been in the past month, I’ll have plenty to show for it by the time I’m done. My goal is to make sure that the TWG team learns just as much as I do.

Shared State and Customer Confusion

2013-08-21T00:00:00-04:00

tl;dr: Be very careful when declaring class-level variables in Python, and think very carefully about putting mutable state into a class-level variable.

Let’s go back to the good old days of writing web applications in PHP for a paragraph or two. When running PHP under Apache or nginx, every HTTP request resulted in a clean interpreter with completely new state. Developers had to explicitly ask for state to be shared – through the $_SESSION global, by persisting state on disk, or by saving state to some backing data store. This made developing applications amazingly simple. A PHP page was something like a pure function, producing consistent, predictable output based on the state of the underlying data store.

Now, consider this little bit of Python code:

class PatternRemixer(Remixer):
    _samplecache = {}

    def remix(self, song):
        # do some stuff
        for key in song:
            if key not in self._samplecache:
                self._samplecache[key] = self.render_audio()
            self.output(self._samplecache[key])

Any experienced Pythonista should notice the grievous error in this class – on the second line, no less. Here we have a _samplecache variable being initialized to an empty dictionary. While this itself is fine, what’s not fine is the fact that this variable is being used as a mutable cache. That’s because the variable is declared in the class’ scope, so a single dictionary is shared by every instance of the class.

Consider the following bit of code:

def remix_songs(songs):    
    song0       = PatternRemixer().remix(songs[0])

    #    Let's reset the cache here
    PatternRemixer._samplecache = {}

    song1       = PatternRemixer().remix(songs[1])
    song0_again = PatternRemixer().remix(songs[0])
    assert song0 == song0_again

This function will raise an AssertionError – song0 is not equal to song0_again! Even though in the previous snippet we refer to self._samplecache, we’re really accessing PatternRemixer._samplecache instead – the class-level variable shared by all instances. Since this cache only gets cleared after using it once, our song0_again variable actually contains data from songs[1], when it really shouldn’t.
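One straightforward fix is to create the cache in __init__, so that every instance gets its own dictionary. The sketch below is self-contained rather than a patch to the classes above: SharedCache and PerInstanceCache are hypothetical stand-ins that isolate just the caching behaviour.

```python
class SharedCache:
    cache = {}                   # one dict, shared by every instance

    def remember(self, key, value):
        self.cache[key] = value  # mutates the class-level dict


class PerInstanceCache:
    def __init__(self):
        self.cache = {}          # a fresh dict for each instance

    def remember(self, key, value):
        self.cache[key] = value


a, b = SharedCache(), SharedCache()
a.remember("kick", "sample from song 0")
print(b.cache)   # b sees a's data

c, d = PerInstanceCache(), PerInstanceCache()
c.remember("kick", "sample from song 0")
print(d.cache)   # d is unaffected
```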

This can be quite a difficult bug to track down, as the problem only manifests itself when one Python interpreter accepts multiple requests. In a distributed system, each request might go to a different box – and possibly a different Python interpreter running on that box, making it very difficult to figure out where the stale data is coming from. Worse yet – if each interpreter is restarted after a certain number of requests, the stale data will not always show itself.

This results in confused emails from customers at all hours of the night, and is generally a Very Bad Thing™.

Pipes and Filters

2013-08-07T00:00:00-04:00

Pipelines are an extremely useful (and surprisingly underused) architectural pattern in modern software engineering. The concept of using pipes and filters to control the flow of data through software has been around since the 1970s, when the first Unix shells were created. If you’ve ever used the pipe (“|”) character in a terminal emulator, you’ve made use of the pipe-and-filter idiom. Take the following example:

cat /usr/share/dict/words |     # Read in the system's dictionary.
grep purple |                   # Find words containing 'purple'
awk '{print length($1), $1}' |  # Count the letters in each word
sort -n |                       # Sort lines ("${length} ${word}")
tail -n 1 |                     # Take the last line of the input
cut -d " " -f 2 |               # Take the second part of each line
cowsay -f tux                   # Put the resulting word into Tux's mouth

When run with bash, this pipeline returns a charming little ASCII art version of Tux, the Linux penguin, saying the longest word in the dictionary that contains the word “purple”:

 _____________ 
< unimpurpled >
 ------------- 
   \
    \
        .--.
       |o_o |
       |:_/ |
      //   \ \
     (|     | )
    /'\_   _/`\
    \___)=(___/

The Life Cycle of a Pipeline

This little bit of code actually does quite a lot as soon as it’s executed. The moment you hit enter, the following steps occur:

Seven (seven!) processes are immediately spawned by the shell.
The standard input (stdin) and standard output (stdout) file descriptors of each process are connected to pipes, which are small buffers that the kernel maintains on the shell’s behalf. (ulimit -a reports a pipe size of 512 bytes on my machine.)
The source process, cat, starts to read from its file and output to its stdout. This data flows into the first pipe, quickly filling its buffer. As soon as the buffer is full, cat is blocked in its write(2) call. This is where pipelines really shine: the execution of cat is implicitly paused by the pipeline’s inability to handle more data. (For those familiar with the concept of coroutines, each process here is acting as a sort of coroutine.)
The first filtering process, grep, starts with a read(2) call on its stdin pipe. When the process first spawns, the pipe is empty – so the entire process blocks until it can read more data. Again, we see implicit execution control based on the availability of data. As soon as the preceding command in the pipeline fills or flushes its buffer, grep’s read(2) call returns and it can filter the lines that it’s read in from the preceding process. Every line that matches the provided pattern is immediately printed to grep’s stdout, which will be available to the next program in the pipeline.
awk functions exactly like grep does, only on different buffers. When data is available, awk resumes execution and processes the data, incrementally writing its results to its stdout. When data is not available, awk is blocked and unable to run.
sort operates slightly differently than the preceding two processes. As a sorting operation must take place on the entirety of the data, sort maintains a buffer (on disk) of the entire input received thus far. (There’s no point in providing a sorted list of all the data received so far, only to have it invalidated by a later piece of data.) As soon as sort’s stdin closes, sort can print its output to its stdout, as it knows no more data will be read.
tail is somewhat similar to sort in that it cannot produce any data on stdout until the entirety of the data has been received. This invocation doesn’t need to maintain a large internal buffer, as it only cares about the last line of the input.
cut operates as an incremental filter, just like grep and awk do.
cowsay operates just like tail does, waiting to receive the full input before processing it, as it must calculate metrics based on the length of the input data. (Printing properly aligned ASCII art is no easy task!)
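The same pull-driven behaviour can be sketched with Python generators, which play the role of the coroutines mentioned above. This is an analogy rather than a model of what the kernel does: the word list is a tiny made-up sample, and each function only loosely mirrors the command it is named after.

```python
def grep(pattern, lines):
    """Incremental filter: yields matches as soon as they arrive."""
    for line in lines:
        if pattern in line:
            yield line


def awk_length(lines):
    """Incremental filter: pair each word with its length."""
    for line in lines:
        yield (len(line), line)


def sort_numeric(pairs):
    """Must consume all of its input before it can emit anything."""
    yield from sorted(pairs)


def tail(items, n=1):
    """Also waits for end of input, but only remembers the last n items."""
    last = []
    for item in items:
        last = (last + [item])[-n:]
    yield from last


words = ["purple", "apple", "unimpurpled", "repurple", "pear"]  # sample data
pipeline = tail(sort_numeric(awk_length(grep("purple", iter(words)))))
result = [word for _, word in pipeline]
print(result)
```

Nothing runs until the final list comprehension starts pulling from the end of the chain, much as a blocked read(2) holds back a downstream process.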

Using a pipeline for this task should seem like a no-brainer. Every task being done here deals with filtering data. Existing data sets are changed at each step. Every process does its own job, and does it quite well, as the Unix philosophy recommends. Each process could be swapped out for another with very little effort.

If you wanted to visualize the pipeline that bash automatically sets up for you, it would look something like this:

Performance and Complexity

One other advantage of pipelines is their inherently good performance. Let’s use a modified version of the command we ran above to find the memory and CPU usage of each filter component in the pipeline.

/usr/bin/time -l cat /usr/share/dict/words 2> cat.time.txt | 
/usr/bin/time -l grep purple 2> grep.time.txt |
/usr/bin/time -l awk '{print length($1), $1}' 2> awk.time.txt |
/usr/bin/time -l sort -n 2> sort.time.txt |
/usr/bin/time -l tail -n 1 2> tail.time.txt |
/usr/bin/time -l cut -d " " -f 2 2> cut.time.txt |
/usr/bin/time -l cowsay -f tux 2> cowsay.time.txt

(Aside: I’m calling /usr/bin/time here to avoid using my shell’s built-in time command, which doesn’t support the -l flag to print out detailed stats. If you’re on Linux, you’ll want to use the -v flag, which does the same thing. The 2> something.time.txt syntax redirects stderr to a file, while leaving stdout pointing at a pipe.)

After running this command and checking the maximum resident set size, as well as the number of voluntary and involuntary context switches, we can start to see a couple very important things.

The maximum amount of memory used by any one filter was 2,830,336 bytes, by cowsay, due to the fact that it’s implemented in Perl. (Just spawning a Perl interpreter on my machine uses 1,126,400 bytes!) The minimum was 389,120 bytes, used by tail.
Even though our original source file (/usr/share/dict/words) was 2.4 MB in size, most of the filters in the pipeline don’t even use one fifth of that amount of memory! Thanks to the fact that the pipeline only stores what it can process in memory, this solution is very memory-efficient and lightweight. Processing a file of any size would not have changed the memory usage of this solution – the pipeline runs in effectively constant space.
Notice that the first two processes, cat and grep, have a large number of voluntary context switches. This is a fancy way of saying “blocking on IO”. cat must voluntarily context switch into the operating system when reading the original file from disk, then again when writing to its stdout pipe. grep must voluntarily context switch when reading from its stdin pipe and writing to its stdout. The reason that awk, sort, tail and cut don’t have as many context switches is that they deal with less data: grep has already filtered the data for them, resulting in only twelve lines that match the provided pattern. These twelve lines can fit easily within one pipe buffer.
cowsay seems to have an unusually high number of involuntary context switches (which are probably caused by the process’s time quantum expiring). I’m going to attribute that to the fact that it’s written in Perl, and that it takes ~30 milliseconds of CPU time to run, compared to the immeasurably small time that the other programs take to run.

Note that although this example pipeline is amazingly simple, if any of these processes were doing complex computations, they could be automatically parallelized on multiple processors. Aren’t pipelines awesome?

Errors

Yes indeed – pipelines are awesome. They make efficient use of memory and CPU time, have automatic and implicit execution scheduling based on data availability, and they’re super easy to create. Why would you not want to use pipelines whenever possible?

The answer: error handling. If something goes wrong in one of the parts of the pipeline, the entire pipeline fails completely.

Let’s try out this pipeline, with an added command that I’ve written in Python. fail.py echoes its standard input to standard output, but has a 50% chance of crashing before reading a line.

cat /usr/share/dict/words |     # Read in the system's dictionary.
grep purple |                   # Find words containing 'purple'
awk '{print length($1), $1}' |  # Count the letters in each word
sort -n |                       # Sort lines ("${length} ${word}")
python fail.py |                # Play Russian Roulette with our data!
tail -n 1 |                     # Take the last line of the input
cut -d " " -f 2 |               # Take the second part of each line
cowsay -f tux                   # Put the resulting word into Tux's mouth

The source of fail.py:

import sys
import random

while True:
    if random.choice([True, False]):
        sys.exit(1)
    line = sys.stdin.readline()
    if not line:
        break  
    sys.stdout.write(line)
    sys.stdout.flush()

So, what happens in this case? When fail.py fails while reading the input, its stdin and stdout pipes close. This essentially cuts the pipeline in half. Let’s take a look at what each process does as you get further and further away from the failed process.

sort, the process immediately before python, receives a SIGPIPE signal the next time it tries to write to its stdout, because that pipe no longer has a reader. By default this signal terminates the process; sort can choose to handle or ignore the SIGPIPE instead, but then its write(2) call will return -1 anyways. No longer able to write its output anywhere, sort will exit, closing its own stdin pipe, causing the process preceding it to do the same thing. This cascading shutdown proceeds all the way up to the first process in the pipeline. (Of course, a process doesn’t have to shut down when it encounters a write error or receives a SIGPIPE, but these processes don’t have any other behaviour if their output pipes close.)
tail, the process immediately after python, does not receive any signal when the other end of its stdin pipe closes. Once it has drained whatever data was already in the pipe, its next call to read(2) returns 0. This is exactly the end-of-stream event that tail receives anyways when the input stream is done. Hence, tail cannot tell a crash apart from a normal end-of-stream event, and will behave as expected.
cut will also behave as expected when the stream closes.
cowsay will behave as expected when the stream closes, printing out the last word in the sorted list that was received before the python process crashed.
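What each neighbour of a failed process actually observes can be checked directly. The following sketch uses os.pipe from Python’s standard library on a POSIX system; note that Python ignores SIGPIPE by default, so the failed write surfaces as a BrokenPipeError (EPIPE) rather than killing the process.

```python
import os

# Reader's view: once the writer is gone, read() reports end-of-stream.
r, w = os.pipe()
os.write(w, b"repurple\n")
os.close(w)                      # the "crashed" upstream process
first = os.read(r, 1024)         # buffered data is still delivered
second = os.read(r, 1024)        # then an empty read, meaning EOF
os.close(r)

# Writer's view: writing to a pipe with no reader fails with EPIPE.
r, w = os.pipe()
os.close(r)                      # the "crashed" downstream process
try:
    os.write(w, b"unimpurpled\n")
    write_failed = False
except BrokenPipeError:
    write_failed = True
os.close(w)

print(first, second, write_failed)
```

The reader has no way to tell that its input was cut short, which is why the downstream half of the pipeline carries on as if nothing had happened.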

The result?

 __________ 
< repurple >
 ---------- 
   \
    \
        .--.
       |o_o |
       |:_/ |
      //   \ \
     (|     | )
    /'\_   _/`\
    \___)=(___/

Notice that Tux is no longer saying unimpurpled, the word that he was saying before. The word is wrong! The output of our command pipeline is incorrect. Although one of the filters crashed, we still got a response back – and all of the steps in the pipeline after the crashed filter still executed as expected.

What’s worse – if we check the return code of the pipeline, we get:

bash-3.2$ echo $?
0

bash helpfully reports to us that the pipeline executed correctly. This is due to the fact that bash only reports the exit status of the last process in the pipeline. The only way to detect an issue earlier in the pipeline is to check bash’s relatively unknown $PIPESTATUS variable:

bash-3.2$ echo ${PIPESTATUS[*]}
0 0 0 0 1 0 0 0

This array stores the return codes of every process in the previous pipe chain – and only here we can see that one of the filters crashed.
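
The same per-stage bookkeeping can be done outside of bash. Below is a rough sketch in Python 3 using only the standard subprocess module; run_pipeline, first_failure and the three stand-in stages are hypothetical names and commands chosen for illustration, not part of the pipeline above.

```python
import subprocess
import sys


def run_pipeline(commands):
    """Run commands as a pipeline; return (final stdout, exit code of every stage)."""
    procs = []
    upstream = None
    for command in commands:
        proc = subprocess.Popen(command, stdin=upstream, stdout=subprocess.PIPE)
        if upstream is not None:
            # Drop the parent's copy of the read end so that only the
            # child processes hold the pipe open.
            upstream.close()
        upstream = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()
    return output, [proc.wait() for proc in procs]


def first_failure(exit_codes):
    """Index of the first stage with a non-zero exit code, or None."""
    for index, code in enumerate(exit_codes):
        if code != 0:
            return index
    return None


# Stand-in stages: a producer, a filter that fails, and a consumer.
stages = [
    [sys.executable, "-c", "print('unimpurpled')"],
    [sys.executable, "-c", "import sys; sys.stdin.read(); sys.exit(1)"],
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"],
]
output, codes = run_pipeline(stages)
```

Here codes plays the role of $PIPESTATUS: the last stage exits cleanly even though the middle one failed, and first_failure(codes) points at the culprit.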

This is one of the major drawbacks of using traditional UNIX pipes. Detecting an error while the pipeline is still processing data requires some form of out-of-band signalling to detect a failed process and send a message to the other processes. (This is easy to do when you have more than one input pipe to a filter, but becomes difficult if you’re just using UNIX pipes.)

Other Uses for Pipelines

So, that’s great. We can make a memory-efficient pipeline to create ASCII art penguins. I can hear the questions now:

How are pipes useful in the real world?

How could pipes help me in my web app?

Valid questions. Pipes work great when your data can be divided into very small chunks and when processing can be done incrementally. Here are a couple of examples.

Let’s say you have a folder full of .flac files – very high quality music. You want to put these files on your MP3 player, but it doesn’t support .flac. And for some reason, your computer doesn’t have more than 10 megabytes of available RAM. Let’s use a pipeline:

ls *.flac | 
while read song
do 
    flac -d "$song" --stdout | 
    lame -V2 - "$song".mp3
done

This command is a little more complicated than the simple pipeline we used above. First, we’re using a built-in bash construct – the while loop with a read command inside. This reads every line of the input (which we’re piping in from ls) and executes the inner code once per line. Then, the inner loop invokes flac to decode the song, and lame to encode the song to an MP3.

How memory-efficient is this pipeline? After running it on a folder full of 115MB of FLAC files, only 1.3MB of memory was used.

Let’s say you have a web app, and that you’re using your favourite web framework to serve it. If a user submits a form to your app, you need to do some very expensive processing on the back-end. The form data needs to be sanitized, verified by calling an external API, and saved as a PDF file. All of this work can’t happen in the web server itself, as it would be too slow. (Yes – this example is a little bit contrived, but isn’t that far off from some use cases I’ve seen.) Again, let’s use a pipeline:

my_webserver | 
line_sanitizer | 
verifier | 
pdf_renderer

Whenever a user submits a form to my_webserver, it can write a line of JSON to its stdout. Let’s say this line looks like:

{"name": "Raymond Luxury Yacht", "organization": "Flying Circus"}

The next process in the pipeline, line_sanitizer, can then run some logic on each line:

import sys
import json

for line in sys.stdin:
    obj = json.loads(line)

    if "Eric Idle" in obj['name']:
        # Ignore forms submitted by Eric Idle.
        continue

    sys.stdout.write(line)
    sys.stdout.flush()

The next process can verify that the organization exists:

import sys
import json
import requests

for line in sys.stdin:
    obj = json.loads(line)
    org = obj['organization']
    # Send the organization as a query parameter of the GET request.
    resp = requests.get("http://does.it/exist", params={"organization": org})

    if resp.status_code == 404:
        continue

    sys.stdout.write(line)
    sys.stdout.flush()

Finally, the last process can bake any lines that remain into PDF files.

import sys
import json
import magical_pdf_writer_that_doesnt_exist as writer

for line in sys.stdin:
    obj = json.loads(line)
    writer.write_to_file(obj)

And there you have it – an asynchronous, extremely-memory-efficient pipeline that can process huge amounts of data, with a very, very small amount of code.
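
The incremental, line-at-a-time behaviour of these stages can also be tried out inside a single process. The following is only a sketch, assuming Python 3: generators stand in for the separate processes, organization_exists is a hypothetical callback replacing the HTTP check, render collects results instead of writing PDFs, and the sample submissions are made up.

```python
import json


def sanitize(lines):
    """Drop forms submitted by Eric Idle; pass everything else through."""
    for line in lines:
        if "Eric Idle" in json.loads(line)["name"]:
            continue
        yield line


def verify(lines, organization_exists):
    """Keep only lines whose organization passes the (hypothetical) check."""
    for line in lines:
        if organization_exists(json.loads(line)["organization"]):
            yield line


def render(lines):
    """Stand-in for the PDF stage: collect parsed forms instead of writing files."""
    return [json.loads(line) for line in lines]


# Made-up sample input, one JSON document per line.
submissions = [
    '{"name": "Raymond Luxury Yacht", "organization": "Flying Circus"}',
    '{"name": "Eric Idle", "organization": "Flying Circus"}',
    '{"name": "Arthur Nudge", "organization": "No Such Org"}',
]
known_organizations = {"Flying Circus"}
rendered = render(verify(sanitize(submissions), known_organizations.__contains__))
```

Each generator pulls one line at a time from the stage before it, so only a single submission is in flight at any moment, much like the process-based version.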

One question remains in this example, however. How do we handle errors that might occur? If Eric Idle submits a form to our website, and we decide to reject the form, how do we notify him? One very UNIX-y way of doing so would be to create a named pipe that handles all failed requests:

mkfifo errors  # create a named pipe for our errors

my_webserver | 
line_sanitizer 2> errors | 
verifier 2> errors | 
pdf_renderer 2> errors

Any process could read from our own custom “errors” pipe, and each process in the pipeline would output its failed inputs into that pipe. We could attach a reader to that pipe that sends out emails on failure:

mkfifo errors              # create a named pipe for our errors
email_on_error < errors &  # add a reader to this pipe

my_webserver | 
line_sanitizer 2> errors | 
verifier 2> errors | 
pdf_renderer 2> errors
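
What email_on_error does is left open here. One possible shape for it, sketched in Python 3 under loud assumptions: build_rejections, the sender address and the message wording are all hypothetical, the sample JSON carries no email address so no recipient is set, and actually delivering the message is left out. Lines that are not valid form JSON (a stray traceback on stderr, say) are skipped.

```python
import io
import json
from email.message import EmailMessage


def build_rejections(stream, sender="noreply@example.com"):
    """Yield one EmailMessage per rejected form read from the errors pipe."""
    for line in stream:
        try:
            name = json.loads(line)["name"]
        except (ValueError, KeyError, TypeError):
            continue  # not a rejected form; ignore it
        message = EmailMessage()
        message["From"] = sender
        message["Subject"] = "Your form submission was rejected"
        message.set_content("Sorry, %s, we could not process your form." % name)
        yield message


# Made-up input: one rejected form followed by a line of unrelated noise.
sample = io.StringIO(
    '{"name": "Eric Idle", "organization": "Flying Circus"}\n'
    "Traceback noise\n"
)
messages = list(build_rejections(sample))
```

Run as a real reader, the same function would be handed sys.stdin instead of the in-memory sample.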

Then, if our line_sanitizer wanted to reject a line, its behaviour would look like this:

import sys
import json

for line in sys.stdin:
    obj = json.loads(line)

    if "Eric Idle" in obj['name']:
        sys.stderr.write(line)
        sys.stderr.flush()
    else:        
        sys.stdout.write(line)
        sys.stdout.flush()

This pipeline would look a little bit different. (Red lines represent stderr output.)

Distributed Pipelines

UNIX pipes are great, but they do have their drawbacks. Not all software can fit directly into the UNIX pipe paradigm, and UNIX pipes don’t scale well to the kind of throughput seen in modern web traffic. However, there are alternatives.

Modern “work queue” software packages have sprung up in recent years, allowing for rudimentary FIFO queues that work across machines. Packages like beanstalkd and celery allow for the creation of arbitrary work queues between processes. These can easily simulate the behaviour of traditional UNIX pipes, and have the major advantage of being distributed across many machines. However, they’re designed primarily for asynchronous task processing, and their queues typically don’t block processes that try to send messages, which doesn’t allow for the kind of implicit execution control we saw earlier with UNIX pipes. These services act more as messaging systems and work queues than as coroutines.
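
The difference a blocking send makes is easy to see with an in-process bounded queue. This is a minimal sketch using Python 3's standard queue and threading modules, not any of the packages named above; run_bounded is a hypothetical helper, and None is reserved as the end-of-stream marker.

```python
import queue
import threading


def run_bounded(items, maxsize):
    """Push items through a bounded queue; return (received items, largest backlog seen)."""
    pipe = queue.Queue(maxsize=maxsize)
    received = []
    largest_backlog = 0

    def producer():
        for item in items:
            pipe.put(item)   # blocks while the queue already holds maxsize items
        pipe.put(None)       # sentinel: end of stream

    thread = threading.Thread(target=producer)
    thread.start()
    while True:
        largest_backlog = max(largest_backlog, pipe.qsize())
        item = pipe.get()
        if item is None:
            break
        received.append(item)
    thread.join()
    return received, largest_backlog
```

Because put() blocks when the queue is full, the producer can never run more than maxsize items ahead of the consumer, which is the implicit execution control that an unbounded work queue gives up.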

To work around this lack of synchronization and pressure in distributed pipeline systems, I’ve created my own project – a Redis-based, reliable, distributed synchronized pipeline library called pressure. pressure allows you to set up pipes between different processes, but adds the ability to have pipe buffers persist and be used across multiple machines. By using Redis as a stable message broker, all of the inter-process communication is taken care of and is OS and platform agnostic. (Redis also gives a bunch of nice features like reliability and replication.)

pressure’s default implementation is in Python, and it’s still in its infancy. To show its power, let’s try to replicate the pipeline example from the start of this post by using pressure’s included UNIX pipe adapter. (put and get are small C programs that act as a bridge between traditional UNIX pipes and distributed pressure queues kept in Redis.)

# Read in the system's dictionary
cat /usr/share/dict/words | ./put test_1 &

# Find words containing 'purple'
./get test_1 | grep purple | ./put test_2 &

# Count the letters in each word
./get test_2 | awk '{print length($1), $1}' | ./put test_3 &

# Sort lines
./get test_3 | sort -n | ./put test_4 &

# Take the last line of the input
./get test_4 | tail -n 1 | ./put test_5 &

# Take the second part of each line
./get test_5 | cut -d " " -f 2 | ./put test_6 &

# Put the resulting word into Tux's mouth
./get test_6 | cowsay -f tux

The first thing to note – this is an extremely slow operation, as we’re filtering a multi-megabyte file with this method. We end up sending 235,912 messages through Redis, which takes the better part of 4 minutes. (If we move grep to run immediately after cat, and before putting data into Redis, this operation runs more than 1,200 times faster.) However, in the end, we get the correct answer that we’re looking for:
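
For a sense of scale, those figures work out to roughly a thousand messages per second. The arithmetic below is a back-of-the-envelope sketch in Python 3; rounding “the better part of 4 minutes” up to 240 seconds is an assumption, so the results are only approximate.

```python
messages = 235912          # messages sent through Redis in the run above
elapsed_seconds = 4 * 60   # assumption: "the better part of 4 minutes" taken as 240 s
speedup = 1200             # lower bound quoted for running grep before Redis

messages_per_second = messages / elapsed_seconds           # roughly 983
seconds_when_filtered_first = elapsed_seconds / speedup    # about 0.2
```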

 _____________ 
< unimpurpled >
 ------------- 
   \
    \
        .--.
       |o_o |
       |:_/ |
      //   \ \
     (|     | )
    /'\_   _/`\
    \___)=(___/

However, another peculiar property can be observed – by logging in to redis-cli, the Redis command line tool, we can find out that very little memory is being used by our pipeline, despite the large set of input data:

$ redis-cli info | grep memory                                                                  
used_memory:3126928
used_memory_human:2.98M
used_memory_rss:2850816
used_memory_peak:3127664
used_memory_peak_human:2.98M
used_memory_lua:31744

pressure is still in alpha, and not yet ready for wide-scale production deployment, but you should definitely try it out!

Pipelines are hugely useful tools in software that can help cut down on resource usage. Bounded pipelines act as coroutines to only do computation when necessary, and can be crucial in certain applications, like real-time audio processing. pressure provides a way to easily use pipelines reliably on multiple machines. Try using the pipes-and-filters paradigm to solve your own software architecture problems, and see how simple and efficient it can be!

Dangerously Convenient APIs

2013-04-22T00:00:00-04:00

The modern trend of providing an API for everything is wonderful. With minimal effort, any developer with an internet connection can programmatically access a wealth of data and powerful functionality. Without APIs, many hackathons wouldn’t exist, and many new developers would languish in frustration instead of participating in the best part of software development – building fun stuff.

However, all of this convenience comes at a cost. Often, that cost is literal, if an API provider decides to charge for access. This is the entire business model of many companies, and there are now even companies that provide API-monetization-as-a-service. This has created a kind of purely digital marketplace, by literally allowing people to buy access to data and functions. (This is a Good Thing™, as it encourages competition and variety in the API market, and reduces time-to-ship for many developers.)

Many of these monetized APIs are providing access to something inherently proprietary – an enormous dataset, neural network, or advanced algorithm. A problem arises when these APIs provide access to something open. Imagine an API that provides access to datasets that are completely public and free, or an API that performs simple operations on provided data. For an extreme example, imagine an API that implements strlen() – the simple, common task of finding the length of a string.

Imagine a group of young, new programmers building the next cool app. They need to find the lengths of their strings, as you do. As C is hard to learn, and reading documentation would take too much time, they instead outsource their character-counting to api.strlen.com. (The venue of the hackathon has great wi-fi, so the additional network overhead is not a big deal.) They launch the app to much fanfare, pitch it, then win first prize at their Startup Weekend.

Months later, thousands of people are using their hot new app. They’re soaking in the TechCrunch coverage and brainstorming monetization ideas when, suddenly, their app stops working. They trace the error down to one call – to api.strlen.com – and finally see:

>HTTP/1.1 429 Too Many Requests

It’s 11pm on a Saturday night. The folks that run strlen.com are nowhere to be found.

Monday morning rolls around, and our favourite team of programmers has barely slept. Their star app has been down all weekend. Finally, they get an email back from support@strlen.com:

Congratulations on the TechCrunch coverage! Unfortunately, you’ve way exceeded our rate limit (in fact, we had to put a rate limit in place just because of your app) and we need to chat. We’re now charging $0.0001 per character counted with our string length API. Let me know if you’re interested in upgrading your free account and I can get you set up!

-Bjørn, CEO, strlen.com

The team runs the numbers to find that, with the new rates, every additional user of the app would lose them a ridiculous amount of money every day. But, hey – they just had a great chat with an angel – and they might be getting some financing soon. They send Bjørn their details and get set up with a paid account. The app starts working again. Their users are happy, TechCrunch comes by for another interview, and the team’s reason for sleeplessness goes from “anxiety” back to “coding.”

It’s been a month since our team – now incorporated as Blue Blanket, Inc. – signed up for their paid strlen.com account. Software engineers are expensive, and while they’ve considered hiring somebody to write their own version of the strlen.com API, they’re really not sure where to start. Their fancy new analytics dashboard shows increasing numbers, minute over minute, until – all at once – the graphs go dead. The app is down again, and once again, it’s due to strlen.com. The team points their web browsers angrily at api.strlen.com, only to find:

>HTTP/1.1 410 Gone

The homepage of strlen.com has an even bleaker message:

Dear friends,

We at strlen.com are very proud to announce that we’ve been acquired by Standard Library Incorporated. It’s been a wild ride counting characters for you over the past six months, but we’re excited to move on and solve hard new problems with the great people at stdlib. All API endpoints will be disabled, effective immediately.

-Bjørn, VP String at stdlib.com

Our trusty team’s app stays down for the better part of a month while they scrounge up a handful of competent engineers to recreate the missing functionality. Once back online, their app is all but forgotten. TechCrunch runs an article a year later – “What ever happened to Blue Blanket?” – that places the blame on a power struggle between the co-founders.

Obviously, implementing strlen as a paid API is an absurd example, but there are real APIs out there that are not much different. If you depend on an external service for your app’s core functionality, that’s okay. But if you can feasibly replicate the API yourself, then relying on the external service is a source of extremely risky technical debt. Your creditors (in this case, the API providers) could demand immediate repayment at any time by rate limiting or shutting down.

Don’t let your app be crippled by someone else’s acquisition.

Thanks to Zameer Manji for proofreading this post.

Interns are Leading the Way

2013-01-19T00:00:00-05:00

I attend the University of Waterloo, one of Canada’s most widely-known engineering schools. Waterloo is famous for a system they call co-op – a regimen of paid internships of 4-8 months in duration in a real-world work environment. Co-op is mandatory for all engineering students, and upon graduation, results in each student having worked at up to 6 different companies for a total of at least 24 months. Each “work term” can happen during the summer, fall, or winter, and can be within Canada or abroad. (We do often go abroad, primarily to Silicon Valley.) Here’s where my class went for internships this past summer:

Over the past year, a number of Waterloo interns have had the pleasure of interning at Khan Academy, the groundbreaking non-profit dedicated to “accelerate learning for students of all ages.” They’ve made such an impression on Sal Khan, its founder, that he’s gone on to speak extremely highly of Waterloo – even suggesting using its model as a base for furthering education – in an extremely well-written article in Communications of the ACM:

Waterloo has already proven that the division between the intellectual and the useful is artificial; I challenge anyone to argue that Waterloo co-op students are in any way less intellectual or broad thinking than the political science or history majors from other elite universities. If anything, based on my experience with Waterloo students, they tend to have a more expansive worldview and are more mature than typical new college graduates—arguably due to their broad and deep experience base.

While every student gains valuable experience and ends up hugely enriched by their time in co-op, those in technical disciplines arguably have the opportunity to make a more lasting impact. In particular, the nature of the software industry allows co-ops to contribute to a company on an extremely meaningful level. Classmates of mine interning at companies like Facebook, Google, Square and Twitter have made contributions that are on par with – if not exceeding – those of senior full-time employees. It’s hugely exciting, and in my experience, it makes us interns forget that we’re only interns.

The Abysmal State of Higher Education

While I can’t extol the virtues of the co-op system enough, it does set co-op students apart from the general student body in unnerving ways. Co-op students, demanding competitive wages during their internships, often do graduate with little-to-no student debt, while traditional university programs might give no workplace experience and leave a student with loans of tens of thousands of dollars. It’s immensely depressing to realize that outside of the little bubble of co-op, that’s the norm for higher education. Almost everybody that goes to college experiences student debt and difficulty finding employment.

It boggles the mind that for many of today’s students, it’s normal to graduate with zero industry experience before immediately searching for a job. Companies shouldn’t have to guess if a candidate, fresh out of school, can apply their theoretical skills to a workplace that is foreign to them. The question:

should we even consider hiring a fresh college grad?

should never have to be asked. Regrettably, with the current state of higher education, being a “fresh college grad” doesn’t give any confidence to an employer, at least for graduates of most schools. (And yet it leaves each student in considerable amounts of debt. What an amazingly flawed system.)

The class of Software Engineering students that I belong to will graduate in just over a year, after having spent five years at Waterloo and abroad. Most will graduate with no debt and with multiple job offers in hand, obtained from real world experience that complements their theoretical knowledge. Why must this be a unique situation? We should not be the outliers – this is how higher education should be.

Text Knockout with Canvas

2012-12-16T00:00:00-05:00

Recently, I’ve been working on a complete visual overhaul of my own website and blog. Instead of the huge, bold lines of my previous site (resembling my old resume), I decided to start fresh with a much smaller, simpler, and subtler design. After fooling around in Photoshop for a while, I came up with the header:

Rather than being decisive and choosing a single colour to define the site, I decided to let the colour of the background dictate the main colour of the site at any given time. As my old design featured hundreds of random thumbnails on its homepage, I opted to use those exact same thumbnails – just blown up, blurred, and very colourful – as the randomly-chosen backdrop for each page.

To accomplish the text knockout effect in the site’s header, I simply used a transparent PNG, as it’s lighter to bake in the font to a PNG than embed Proxima Nova. However, I also wanted to knock out each h1 and h2 in the body of each blog post. To do so, I turned to HTML5 canvas.

Titles that look like this, punching out to the background!

Canvas has a number of different compositing modes that allow Photoshop-style blending tricks, although somewhat less complex.

To reproduce Photoshop’s knockout effect, I simply turned to the destination-out compositing mode:

ctx.globalCompositeOperation = 'destination-out'
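
In Porter-Duff terms, destination-out keeps what is already on the canvas only where the newly drawn shape is not: the existing alpha is multiplied by one minus the alpha of the new shape, and the new shape’s colour is discarded. The per-pixel rule can be sketched in a few lines of Python (used here purely for illustration; the function name is hypothetical and alpha is assumed to be in the range 0 to 1):

```python
def destination_out(dst_rgba, src_alpha):
    """Composite one pixel with the 'destination-out' rule.

    dst_rgba is the (r, g, b, a) value already on the canvas and src_alpha
    is the opacity of the shape being drawn; the shape's colour is ignored.
    """
    r, g, b, a = dst_rgba
    return (r, g, b, a * (1.0 - src_alpha))


opaque_red = (255, 0, 0, 1.0)
punched = destination_out(opaque_red, 1.0)    # fully opaque text erases the pixel
untouched = destination_out(opaque_red, 0.0)  # where no text is drawn, nothing changes
```

This is why the text is drawn in solid black: only its opacity matters, and wherever a glyph is fully opaque the filled rectangle becomes fully transparent.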

However, it wasn’t quite that simple. To allow smooth fallback for older browsers (or even browsers like Safari that don’t fully support compositing modes), I wanted to dynamically swap each heading tag with a canvas tag after each page load. To replicate the style of each tag, I turned to a very useful function I hadn’t used before: getComputedStyle. To get every single style on any tag, I simply had to run:

style = document.defaultView.getComputedStyle(this.element, "")
//  style.paddingTop, style.paddingBottom, etc...

Thus, on document.ready, all I had to do was:

Find the computed styles of each tag to be replaced
Create a new canvas element with the same outer width and height
Apply the original styles to the new canvas element
Move the padding of the element to inside its ‘width’ (breaking the box model)
Set the compositing mode to destination-out
Draw the text on the canvas
Swap the original element for the new canvas

Et voilà, it works and looks awesome. (…in some browsers.)

As always, code is on GitHub:

// Some really hacky code being used in my next blog redesign.
// by Peter Sobot (psobot.com) on December 16, 2012

;(function ( $, window, document, undefined ) {

    var pluginName = 'punchout',
        defaults = { };

    function Plugin( element, options ) {
        this.element = element;
        this.options = $.extend( {}, defaults, options) ;
        this._defaults = defaults;
        this._name = pluginName;
        this.init();
    }

    Plugin.prototype.init = function () {
        e = $(this.element);
        i = $('.punchout').length;
        style = document.defaultView.getComputedStyle(this.element, "");

        width = parseInt(e.width());
        height = parseInt(e.height());
        p_top = parseInt(style.paddingTop || 0);
        p_bottom = parseInt(style.paddingBottom || 0);
        p_left = parseInt(style.paddingLeft || 0);
        p_right = parseInt(style.paddingRight || 0);

        width += p_left + p_right;
        height += p_top + p_bottom;

        id = "punchout_" + i;
        e.after("");
        canvas = document.getElementById(id);
        ctx = canvas.getContext('2d');

        canvas.style.cssText = style.cssText;
        canvas.style.backgroundColor = 'transparent';
        canvas.style.padding = 0;
        canvas.style.width = width + 'px';
        canvas.style.height = height + 'px';

        alpha = parseFloat(e.css('background').split(' ').slice(3, 4));
        ctx.globalAlpha = 1.0;
        colour = e.css('background').split(' ').slice(0, 4).join(' ');
        ctx.fillStyle = colour;
        ctx.fillRect(0, 0, width, height);

        ctx.font = e.css('font');
        ctx.fillStyle = '#000000';
        ctx.textBaseline = 'top';
        ctx.globalCompositeOperation = 'destination-out';

        text = e.html();

        function overflow(text) {
            return ctx.measureText(text, p_left, p_top).width > (width - p_left - p_right);
        }

        if (overflow(text)) {
            // TODO: This effectively re-implements text wrapping.
            // It does not take into account the correct text metrics.
            // It is a giant hack. Fix it.
            lines = [];
            while (text.length > 0) {
                var i;
                words = text.split(' ');
                for (i = words.length; overflow(text); i--) {
                    text = words.slice(0, i).join(' ');
                }
                lines.push(text);
                text = words.slice(i + 1, words.length).join(' ');
            }
            for (l in lines) {
                ctx.fillText(lines[l], p_left, l == 0 ? p_top : ((l / lines.length) * height));
            }
        } else {
            ctx.fillText(text, p_left, p_top);
        }
        e.remove();
    };

    $.fn[pluginName] = function ( options ) {
        return this.each(function () {
            if (!$.data(this, 'plugin_' + pluginName)) {
                $.data(this, 'plugin_' + pluginName, new Plugin( this, options ));
            }
        });
    }

})( jQuery, window, document );

jQuery(window).ready(function(){
    isMobileAndWorksAndLooksGood = function() {
        return navigator.userAgent.toLowerCase().indexOf('mobile') > -1
            && navigator.userAgent.toLowerCase().indexOf('safari') > -1
            && window.devicePixelRatio == 1
    }
    if (navigator.userAgent.toLowerCase().indexOf('chrome') > -1 || isMobileAndWorksAndLooksGood()) {
        setTimeout(function(){$('h1.title, h2, h3, h4').punchout();}, 500);
    }
});

Emergency Bandwidth Distribution

2012-11-17T00:00:00-05:00

Late last week, I officially launched forever.fm, an infinite, beatmatched radio stream powered by SoundCloud. This morning, I was happy to discover that it had been featured in Hack A Day – one of my favourite hack-centric blogs. However, such exposure resulted in one small issue:

That’s 25% of my little 512MB Linode’s monthly bandwidth allotment being used up in 6 hours. With Linode (as of this writing) charging $0.10/GB for bandwidth (allotted or through overages), that huge server load could get very expensive, very fast. (At that rate, each listener would cost me roughly $0.25 per day of constant listening. Not viable for a free service!)
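
That per-listener figure can be sanity-checked with a few lines of arithmetic. The sketch below (Python 3) assumes decimal gigabytes and a constant-bitrate stream; the helper names are hypothetical, and the bitrate it derives is implied by the numbers above, not measured.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_GB = 10 ** 9        # assumption: decimal gigabytes


def daily_cost(stream_kbps, price_per_gb):
    """Dollars per day to serve one listener a constant-bitrate stream."""
    bytes_per_day = stream_kbps * 1000 / 8 * SECONDS_PER_DAY
    return bytes_per_day / BYTES_PER_GB * price_per_gb


def implied_kbps(cost_per_day, price_per_gb):
    """Stream bitrate (kbit/s) that would produce the given daily cost."""
    bytes_per_day = cost_per_day / price_per_gb * BYTES_PER_GB
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1000


bitrate = implied_kbps(0.25, 0.10)   # about 231 kbit/s, i.e. 2.5 GB per listener per day
```

In other words, $0.25 a day at $0.10/GB corresponds to 2.5 GB per listener per day, which is in the range one would expect from a high-quality MP3 stream playing around the clock.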

So, this afternoon, I was faced with a dilemma. How do I quickly and easily make it cheaper for me to host the site at peak times? A tried and true CDN would be a good solution, but even simple CDNs like Amazon CloudFront would cost more than my existing Linode. (Such systems are generally made for scaling to multiple petabytes, while I’m looking at maybe 1TB tops.)

Instead of going for a large, expensive CDN, I decided to make my own small one. Currently, it contains exactly two nodes: the original forever.fm streaming server, and the hugely overpowered VPS I use to serve the Wub Machine, my other major music hack.

This “CDN” is simple: I’ve added a single Python script to forever.fm that acts as a basic “relay” server. Each relay has a copy of the repo, although it runs python -m forever.relay start rather than python -m forever.server start. Each relay listens to the stream from the “root” url and re-broadcasts it to n users. Then, each time a user requests a new stream from the root, the logic is simple:

if len(self.listeners) > config.relay_limit:
    self.redirect(random.choice(config.relays))
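
Pulled out of the request handler, the same decision can be sketched as a small self-contained class. This is an illustration only, in Python 3: Root, connect and the relay URL are hypothetical stand-ins and do not reflect how forever.fm’s handlers are organised.

```python
import random


class Root(object):
    """Hypothetical stand-in for the root server's listener bookkeeping."""

    def __init__(self, relays, relay_limit, rng=random):
        self.relays = list(relays)
        self.relay_limit = relay_limit
        self.listeners = []
        self.rng = rng

    def connect(self, listener):
        """Return None if served directly, or the relay URL to redirect to."""
        if len(self.listeners) > self.relay_limit and self.relays:
            return self.rng.choice(self.relays)
        self.listeners.append(listener)
        return None


root = Root(["http://relay.example/"], relay_limit=2)
results = [root.connect(n) for n in range(5)]
```

With the strict greater-than comparison, the root serves relay_limit + 1 listeners itself before it starts redirecting newcomers to a randomly chosen relay.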

I’ve got more features to add to the relay system – namely, each relay should be able to send statistics about number of listeners, user agents, and more back to the central server for logging and live status monitoring. Relays could also be smart and stop listening to the “source” stream if nobody is listening to them – preventing additional bandwidth usage. However, with this very simple star pattern, the single stream can be efficiently broadcast to hundreds (if not thousands) of listeners.

As always, the code is available on GitHub.

Introducing forever.fm

2012-11-08T00:00:00-05:00

I’m very proud to announce the launch of my latest project – forever.fm, an automatic, infinite online DJ. Forever.fm is a beatmatched stream of the hottest tracks from SoundCloud, mixed together to sound awesome, and continuing forever. (No advertisements, DJ chatter, or breaks!) Check it out!

WARNING: Past this point, you’ll find only gory technical details of how forever.fm was made.

Overview

Forever is powered by a large number of technologies, some of which I stole from my previous music hack, the Wub Machine:

For this project, I chose to use a 512MB Linode VPS, which has been running spectacularly.
The entire site runs on Python and uses Facebook’s Tornado evented server.
To stream track metadata, waveforms, and other live updates, I’ve used Socket.IO and tornadio2, its Tornado wrapper.
The SoundCloud API provides the songs, metadata and audio streams that you hear.
The Echo Nest Remix API analyzes each song to find the best beats for beatmatching.
The Echo Nest released some cool beatmatching examples back in 2010, and my beatmatching code is heavily based off of their capsule example. (It is heavily patched to fix memory leaks, allow infinite execution, and lighten CPU usage.)
LAME and FFmpeg handle efficient encoding and decoding of the MP3 stream.
I found some great code for approximating the Travelling Salesman Problem by John Montgomery. This is used for ordering tracks – more on that later.
Scott Schiller’s spectacular SoundManager2 JavaScript library and 360º player UI play back the audio stream in-browser and provide very neat visualizations.
The Python Imaging Library is used to colour, stylize, and fade the waveforms of each song.
Charles Leifer’s algorithm for using k-means to find the dominant colours in images is used to colour each track based on its album artwork.

Streaming MP3 in Python

The toughest problem to solve when creating Forever was that of live streaming. The core beatmatching algorithm at its heart (“Capsule”) has existed for a couple of years now. However, making this run infinitely required some different approaches.

Python’s built-in generators provide a great way to implement an iterative beatmatching algorithm, as each generator can carry its own internal state. In this case, that state is the last-played song. Here’s some pseudocode:

def forever(track_queue):
    t1 = track_queue.get()
    while not track_queue.empty():
        t2 = track_queue.get()
        yield make_transition(t1, t2)
        t1 = t2

This code is obviously oversimplified, but it is basically how the core of Forever works. Assuming that forever constantly yields raw audio data (e.g.: WAV, AIFF, PCM), this needs to be encoded to MP3 and streamed out to the user.
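
To watch the generator carry its state between calls, the pseudocode can be run with a stub in place of the real mixing step. This is a sketch for Python 3 (note the lowercase queue module); make_transition here is a hypothetical stand-in that returns a label where the real function would return audio.

```python
import queue


def make_transition(t1, t2):
    """Hypothetical stand-in: the real function would return mixed audio."""
    return "%s -> %s" % (t1, t2)


def forever(track_queue):
    t1 = track_queue.get()
    while not track_queue.empty():
        t2 = track_queue.get()
        yield make_transition(t1, t2)
        t1 = t2


tracks = queue.Queue()
for name in ["a", "b", "c"]:
    tracks.put(name)

mix = forever(tracks)   # nothing has been taken from the queue yet
first = next(mix)       # consumes "a" and "b", then pauses at the yield
```

After the first next() call only one track is left in the queue, and the generator remembers "b" as the last-played song until it is asked for the next transition.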

To tackle this MP3 problem, I created a LAME MP3 encoder interface in Python that allows real-time, buffered and synchronized MP3 encoding.

from Queue import Queue
import subprocess
import threading
import traceback
import logging
import time

log = logging.getLogger(__name__)

"""
    Quick and dirty, frame-aware MP3 encoding bridge using LAME.
    About 75% of the speed of raw LAME. Pass PCM data to the Lame class,
    get back (via callback, queue or file) MP3 frames.
    Supports real-time encoding or blocking for the length of the audio
    stream - useful for an MP3 server, or something else real time,
    for example.
"""

"""
    Some important LAME facts used below:

    Each MP3 frame is identifiable by a header.
    This header has, essentially:

        "Frame Sync"            11 1's (i.e.: 0xFF + 3 bits)
        "Mpeg Audio Version ID" should be 0b11 for MPEG V1, 0b10 for MPEG V2
        "Layer Description"     should be 0b11
        "Protection Bit"        set to 1 by Lame, not protected
        "Bitrate index"         0000 -> free
                                0001 -> 32 kbps
                                0010 -> 40 kbps
                                0011 -> 48 kbps
                                0100 -> 56 kbps
                                0101 -> 64 kbps
                                0110 -> 80 kbps
                                0111 -> 96 kbps
                                1000 -> 112 kbps
                                1001 -> 128 kbps
                                1010 -> 160 kbps
                                1011 -> 192 kbps
                                1100 -> 224 kbps
                                1101 -> 256 kbps
                                1110 -> 320 kbps
                                1111 -> invalid

    Following the header, there are always SAMPLES_PER_FRAME samples of
    audio data. At our constant sampling frequency of 44100, this means
    each frame contains exactly .026122449 seconds of audio.
"""

BITRATE_TABLE = [
    0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, None
]
SAMPLERATE_TABLE = [
    44100, 48000, 32000, None
]
HEADER_SIZE = 4
SAMPLES_PER_FRAME = 1152


def avg(l):
    return sum(l) / len(l)


def frame_length(header):
    bitrate = BITRATE_TABLE[ord(header[2]) >> 4]
    sample_rate = SAMPLERATE_TABLE[(ord(header[2]) & 0b00001100) >> 2]
    padding = (ord(header[2]) & 0b00000010) >> 1
    return int((float(SAMPLES_PER_FRAME) / sample_rate) * ((bitrate / 8) * 1000)) + padding


class Lame(threading.Thread):
    """
        Live MP3 streamer. Currently only works for 16-bit, 44.1kHz stereo input.
    """
    safety_buffer = 30  # seconds
    input_wordlength = 16
    samplerate = 44100
    channels = 2
    preset = "-V3"

    # Time-sensitive options
    real_time = False  # Should we encode in 1:1 real time?
    block = False      # Regardless of real-time, should we block
                       # for as long as the audio we've encoded lasts?

    chunk_size = samplerate * channels * (input_wordlength / 8)
    data = None

    def __init__(self, callback=None, ofile=None, oqueue=None):
        threading.Thread.__init__(self)

        self.lame = None
        self.buffered = 0
        self.oqueue = oqueue
        self.ofile = ofile
        self.callback = callback
        self.finished = False
        self.sent = False
        self.ready = threading.Semaphore()
        self.encode = threading.Semaphore()
        self.setDaemon(True)

        self.__write_queue = Queue()
        self.__write_thread = threading.Thread(target=self.__lame_write)
        self.__write_thread.setDaemon(True)
        self.__write_thread.start()

    @property
    def pcm_datarate(self):
        return self.samplerate * self.channels * (self.input_wordlength / 8)

    def add_pcm(self, data):
        """
            Expects PCM data in the form of a NumPy array.
        """
        if self.lame.returncode is not None:
            return False
        self.encode.acquire()
        samples = len(data)
        self.__write_queue.put(data)
        del data
        put_time = time.time()
        if self.buffered >= self.safety_buffer:
            self.ready.acquire()
        done_time = time.time()
        if self.block and not self.real_time:
            delay = (samples / float(self.samplerate)) \
                - (done_time - put_time) \
                - self.safety_buffer
            time.sleep(delay)
        return True

    def __lame_write(self):
        while not self.finished:
            data = self.__write_queue.get()
            if data is None:
                break
            while len(data):
                chunk = data[:self.chunk_size]
                data = data[self.chunk_size:]
                self.buffered += len(chunk) / self.channels * (self.input_wordlength / 8)
                try:
                    chunk.tofile(self.lame.stdin)
                    del chunk
                except IOError:
                    self.finished = True
                    break
            self.encode.release()

    # TODO: Extend me to work for all samplerates
    def start(self, *args, **kwargs):
        call = ["lame"]
        call.append('-r')
        if self.input_wordlength != 16:
            call.extend(["--bitwidth", str(self.input_wordlength)])
        call.extend(self.preset.split())
        call.extend(["-", "-"])
        self.lame = subprocess.Popen(
            call,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE
        )
        threading.Thread.start(self, *args, **kwargs)

    def ensure_is_alive(self):
        if self.finished:
            return False
        if self.is_alive():
            return True
        try:
            self.start()
            return True
        except Exception:
            return False

    def run(self, *args, **kwargs):
        try:
            last = None
            lag = 0
            while True:
                timing = float(SAMPLES_PER_FRAME) / self.samplerate
                header = self.lame.stdout.read(HEADER_SIZE)
                if len(header) == HEADER_SIZE:
                    frame_len = frame_length(header) - HEADER_SIZE
                    frame = self.lame.stdout.read(frame_len)
                    buf = header + frame
                    if len(frame) == frame_len:
                        self.buffered -= SAMPLES_PER_FRAME
                else:
                    buf = header

                if self.buffered < (self.safety_buffer * self.samplerate):
                    self.ready.release()
                if len(buf):
                    if self.oqueue:
                        self.oqueue.put(buf)
                    if self.ofile:
                        self.ofile.write(buf)
                        self.ofile.flush()
                    if self.callback:
                        self.callback(False)
                    if self.real_time and self.sent:
                        now = time.time()
                        if last:
                            delta = (now - last - timing)
                            lag += delta
                            if lag < timing:
                                time.sleep(max(0, timing - delta))
                        last = now
                    self.sent = True
                else:
                    if self.callback:
                        self.callback(True)
                    break
            self.lame.wait()
        except:
            log.error(traceback.format_exc())
            self.finish()
            raise

    def finish(self):
        """
            Closes input stream to LAME and waits for the last frame(s) to
            finish encoding. Returns LAME's return value code.
        """
        if self.lame:
            self.__write_queue.put(None)
            self.encode.acquire()
            self.lame.stdin.close()
            self.join()
            self.finished = True
            return self.lame.returncode
        return -1


if __name__ == "__main__":
    import wave
    import numpy
    f = wave.open("test.wav")
    a = numpy.frombuffer(f.readframes(f.getnframes()),
                         dtype=numpy.int16).reshape((-1, 2))
    s = time.time()

With this simple interface, Forever hands off chunks of large, raw audio data to LAME for MP3 encoding. Python then reads single MP3 frames from LAME as they become available and puts those frames into a queue. (If the queue is full, this blocks and prevents more raw audio from being generated, essentially throttling the entire process via negative queue pressure.)

Finally, someone has to read from this MP3 queue. After trying custom thread-based solutions for throttling this properly, I instead settled on a much more stable solution – asking the Tornado web server itself to hand out each MP3 frame as it becomes available. Surprisingly, this is extremely stable, and results in a perfectly real-time stream with no lead or lag:

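# Assumes the standard-library queue; multiprocessing queues raise the same Empty.
from Queue import Queue, Empty
from tornado.ioloop import PeriodicCallback

SECONDS_PER_FRAME = 1152.0 / 44100  # As defined by the spec
seconds_to_buffer = 60
mp3_queue = Queue(int(seconds_to_buffer / SECONDS_PER_FRAME))
. . .
def send():
    try:
        frame = mp3_queue.get_nowait()
    except Empty:  # get_nowait() raises on an empty queue; it never returns None
        print "OH NOES, MP3 queue is empty!"
        return
    for listener in listeners:
        listener.send(frame)
. . .
PeriodicCallback(send, SECONDS_PER_FRAME * 1000).start()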

This strategy has not (yet) been load tested, but is nevertheless the system I’m using in production at the moment. For all I know, this could fail completely while serving a large number of listeners. In a local test with 200 concurrent curl instances, it performs admirably and causes Python to use only 5% of my aging MacBook’s CPU.
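The numbers behind that snippet are easy to check by hand. Here is the same arithmetic as a small Python 3 sketch; the helper names are made up for illustration and are not part of Forever.

```python
# MPEG-1 Layer III packs 1152 samples into every frame.
SAMPLES_PER_FRAME = 1152
SAMPLE_RATE = 44100  # Hz, the rate used in the snippet above

seconds_per_frame = SAMPLES_PER_FRAME / SAMPLE_RATE


def frames_to_buffer(seconds):
    """Number of whole MP3 frames needed to hold `seconds` of audio."""
    return int(seconds / seconds_per_frame)


def callback_interval_ms():
    """Interval to hand to a periodic timer, in milliseconds."""
    return seconds_per_frame * 1000


if __name__ == "__main__":
    print(frames_to_buffer(60))               # capacity of a 60-second queue
    print(round(callback_interval_ms(), 2))   # time between frames, in ms
```

So one frame goes out roughly every 26 ms, and a 60-second buffer holds a little under 2,300 frames.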

Interestingly, placing a limit on this final MP3 buffer propagates this queue pressure backwards all the way to the audio generator. If the output MP3 buffer is full, the LAME wrapper will be blocked until it can write to the queue. If LAME’s output is blocked, then an internal semaphore will block in the LAME wrapper’s input function, which will delay the audio generator. (Internally, the LAME wrapper writes all of the PCM to the LAME process at once, to prevent encoding delay and decrease memory usage by only storing lightweight MP3 instead of heavy PCM. This blocking behaviour is artificially implemented to save memory.)
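The general mechanism can be shown with nothing but the standard library. This is a minimal Python 3 sketch of a bounded queue throttling a fast producer, not Forever's actual pipeline; `run_pipeline` and its parameters are invented for the example.

```python
import queue
import threading
import time


def run_pipeline(n_items=20, buffer_size=4, consume_delay=0.005):
    """Push n_items through a bounded queue and report the peak backlog."""
    buf = queue.Queue(maxsize=buffer_size)

    def producer():
        for i in range(n_items):
            buf.put(i)      # blocks whenever the queue is already full
        buf.put(None)       # sentinel: nothing more to send

    t = threading.Thread(target=producer)
    t.start()

    received = []
    peak = 0
    while True:
        peak = max(peak, buf.qsize())
        item = buf.get()
        if item is None:
            break
        received.append(item)
        time.sleep(consume_delay)   # a deliberately slow consumer
    t.join()
    return received, peak


if __name__ == "__main__":
    items, peak = run_pipeline()
    print(len(items), "items delivered, backlog never exceeded", peak)
```

However fast the producer runs, the backlog never grows past `buffer_size`, because `put()` simply waits for the consumer to catch up.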

Memory Leaks and Python C Extensions

In the process of implementing Forever using the Echo Nest’s action and cAction libraries, I ran across an absurdly annoying bug that took me into the depths of Python C extensions. Each time I executed a Crossfade.render or pydirac.timeStretch call, I lost memory. (A lot of memory – often between 25 and 100 MB.) As Forever is built around these methods, I couldn’t have used them as-is – but I couldn’t find a suitable replacement, as they were written specifically for this purpose by Tristan Jehan, a co-founder of the Echo Nest.

So I busted out Apple’s Instruments, which include a memory profiler. Upon initial search, I found no actual memory leaks. (i.e.: There were no calls to malloc without a corresponding free.) However, memory was increasing linearly over time on the smallest possible test case – something was definitely being leaked!

As Python is a garbage-collected language, I then turned to the internal heap in an attempt to find objects without references. After trying guppy and heapy to find the size of Python’s heap, I realized it wouldn’t help. PCM audio in the Echo Nest API is stored within numpy arrays – which are garbage collected, but whose memory is allocated outside of Python. Searching for numpy arrays with heapy, guppy, or the wonderful objgraph ends up being relatively futile, as the large chunks of memory that you’re searching for won’t be in the scope of Python.

This left one possibility – the C extensions being used were leaving a reference to the large Numpy array somewhere. As it turns out, this was the case. (Discovered by manually reading the code, finding every PyObject*, and tracing it to ensure that its reference count was handled properly.) The solution?

. .  . . 
     215    +    Py_DECREF(inSound1);
     216    +    Py_DECREF(inSound2);
215  217         return PyArray_Return(outSound);
. .  . .

As it turns out, a Numpy array allocation function (NA_InputArray) was being used incorrectly. The docs state that the return value of this function should always be DECREF’d, but it wasn’t. It’s that simple, but with such a huge impact.

The Dreaded GIL

After fixing the memory leaks, optimizing the beatmatching algorithms and ensuring that the system runs indefinitely, I ran into another problem – Python’s global interpreter lock. As Wikipedia explains succinctly, a GIL is:

a mutual exclusion lock held by a programming language interpreter thread to avoid sharing code that is not thread-safe with other threads. In languages with a GIL, there is always one GIL for each interpreter process. CPython and CRuby use GILs.

This poses a big problem for Forever. The Tornado server has to deal out MP3 frames in real time to each listener, so any execution delays will cause noticeable audio dropouts. Worse still, the important audio operations needed to beatmatch songs (including calls to the aforementioned C extensions) are very computationally expensive, often holding the GIL for seconds at a time. (Note: there exists a simple way to release the GIL from extension code that I haven’t yet tried, which might mitigate the issue.)

To work around this and ensure total isolation between heavy, blocking audio operations and efficient, real-time MP3 streaming, Forever makes use of Python’s multiprocessing module to split itself into server and worker processes. By opening a queue (well, multiple queues) between the server and the worker, the server can consistently stream MP3 packets in real time, while the worker process can block and hold its own GIL for any amount of time.

In theory.

Unfortunately, the synchronized Queue class provided in the multiprocessing module buffers its data in the sending process, where a background feeder thread writes it to the underlying pipe. This means that even if the Queue is full of audio, the worker’s GIL could be held by another thread, starving that feeder thread and preventing the server process from reading any data from the queue until the GIL is released. This doesn’t help at all.

To work around this restriction yet again, I created a simple BufferedReadQueue class that eagerly fetches all of the data from the child process and simply buffers it in the parent, allowing the parent to read data even when the child’s GIL is blocked, and providing an isolated buffer of audio data to safeguard against dropouts.

import Queue
import multiprocessing
import threading


class BufferedReadQueue(Queue.Queue):
    def __init__(self, lim=None):
        self.raw = multiprocessing.Queue(lim)
        self.__listener = threading.Thread(target=self.listen)
        self.__listener.setDaemon(True)
        self.__listener.start()
        Queue.Queue.__init__(self, lim)

    def listen(self):
        try:
            while True:
                self.put(self.raw.get())
        except:
            pass

    @property
    def buffered(self):
        return self.qsize()

Cycles of Similar Songs

A significant part of what makes Forever sound good is its choice of tracks. One of the hardest problems that a DJ faces is choosing which tracks to play in their set, and Forever is no different. To solve this problem programmatically, I created a module called the Brain and turned to graph theory.

Forever starts by grabbing a list of the top n tracks from SoundCloud, ordered by “hotness.” It then culls this list by removing songs that are too short, too long, or duplicates of other songs in the list. Forever then arranges the remaining tracks in a complete graph, where each song is a vertex, and each edge describes a measure of “distance” between tracks – an inverse of similarity, if you will. For example, if two tracks have similar tempo, tags and genre, their similarity will be high, making the edge weight between them very low.
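As a rough illustration of such an edge weight, here is a Python 3 sketch. The track fields (`tempo`, `tags`, `genre`) and the equal weighting of the three terms are assumptions made for the example; the Brain's real scoring is not shown in this post.

```python
def jaccard_gap(a, b):
    """1 minus the Jaccard similarity of two tag collections (0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance(a, b):
    """Smaller means more similar. Tracks are plain dicts with hypothetical keys."""
    tempo_gap = abs(a["tempo"] - b["tempo"]) / max(a["tempo"], b["tempo"])
    genre_gap = 0.0 if a["genre"] == b["genre"] else 1.0
    return tempo_gap + jaccard_gap(a["tags"], b["tags"]) + genre_gap


def complete_graph(tracks):
    """Edge weights for every unordered pair of track indices."""
    return {
        (i, j): distance(tracks[i], tracks[j])
        for i in range(len(tracks))
        for j in range(i + 1, len(tracks))
    }


if __name__ == "__main__":
    tracks = [
        {"tempo": 128, "tags": ["house", "dance"], "genre": "electronic"},
        {"tempo": 126, "tags": ["house", "deep"], "genre": "electronic"},
        {"tempo": 90, "tags": ["hiphop"], "genre": "rap"},
    ]
    for edge, weight in sorted(complete_graph(tracks).items()):
        print(edge, round(weight, 3))
```

Two house tracks at nearly the same tempo end up with a much lighter edge than either shares with the slower rap track.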

Here’s an example of the edge weights in a simple three-song graph:

To play these songs forever, the problem reduces to finding the lowest-weight cycle that visits every track in the undirected graph… which requires a solution to the Travelling Salesman Problem. As TSP is one of the great NP-hard problems in computer science, I resort to using approximation algorithms. Forever uses a great bit of Python code for a hill-climbing approximation, written by John Montgomery to “solve” this problem. After running the approximation for something like 10,000 iterations, I accept the solution and use that to order the resulting tracks.
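To give a feel for the approach, here is a generic hill-climbing sketch in Python 3. It is not John Montgomery's code: it assumes the distances are available as a square matrix `dist`, and it uses a simple "reverse a random segment" move, which is one common choice among several.

```python
import random


def tour_length(order, dist):
    """Total weight of the cycle that visits `order` and returns to the start."""
    return sum(dist[order[i]][order[(i + 1) % len(order)]]
               for i in range(len(order)))


def hill_climb(dist, iterations=10000, seed=0):
    """Start from a random ordering and keep any segment reversal that shortens the cycle."""
    rng = random.Random(seed)
    order = list(range(len(dist)))
    rng.shuffle(order)
    best = tour_length(order, dist)
    for _ in range(iterations):
        i, j = sorted(rng.sample(range(len(order)), 2))
        candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
        length = tour_length(candidate, dist)
        if length < best:       # only ever move downhill
            order, best = candidate, length
    return order, best


if __name__ == "__main__":
    # Four "tracks" placed on the corners of a unit square.
    points = [(0, 0), (1, 1), (0, 1), (1, 0)]
    dist = [[((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 for (bx, by) in points]
            for (ax, ay) in points]
    print(hill_climb(dist))
```

Hill climbing can get stuck in a local minimum on larger graphs, which is why the result is best described as an approximation that is accepted after a fixed number of iterations.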

After the track order is determined, the worker process receives these tracks and begins to remix them in order when necessary. At this step, the worker fetches the Echo Nest’s analysis (a very lengthy call) to find the metadata required to beatmatch each track. With this data comes a great “summary” that includes many factors that would be useful for determining track order – including energy, danceability and loudness. Hence, this summary data is cached in a SQLite database to allow the Brain to make better decisions.
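A cache like that needs very little code. The following Python 3 sketch uses the standard `sqlite3` module; the table name, column names and function names are hypothetical, since the post does not show Forever's actual schema.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the cache. The schema here is an assumption."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS summary ("
        " track_id TEXT PRIMARY KEY,"
        " energy REAL, danceability REAL, loudness REAL)"
    )
    return db


def remember(db, track_id, energy, danceability, loudness):
    """Store (or overwrite) the summary for one track."""
    db.execute(
        "INSERT OR REPLACE INTO summary VALUES (?, ?, ?, ?)",
        (track_id, energy, danceability, loudness),
    )
    db.commit()


def recall(db, track_id):
    """Return (energy, danceability, loudness), or None if never analysed."""
    return db.execute(
        "SELECT energy, danceability, loudness FROM summary WHERE track_id = ?",
        (track_id,),
    ).fetchone()


if __name__ == "__main__":
    db = open_cache()
    remember(db, "track-1", 0.8, 0.7, -5.0)
    print(recall(db, "track-1"), recall(db, "track-2"))
```

The point of the cache is that the slow analysis call only has to happen once per track; later ordering decisions can read the summary locally.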

Live Brain Transplants (Reloading Modules)

Once Forever was working without dropouts or stalls, I faced another issue. If I wanted to make a code change to any of the core algorithms, I’d usually just restart the server. However, if there are people listening to the radio stream, this would cut them off, as the underlying socket connection to the server would be broken as soon as the Python server process is killed. Hence, to ensure a truly endless stream of audio, I had to find a way to hot-swap portions of code that I might want to update frequently.

To test this, I started with Forever’s Brain module. To wrap the Brain in a container that could easily hot-swap its internal logic, I created another module – aptly-named “skull.” This module performs the infinite loop around the Brain’s logic, calling the Brain as a generator and adding its results directly to a limited queue. In short:

class Skull(threading.Thread):
    def __init__(self, track_queue):
        self.track_queue = track_queue

        import brain
        self.brain = brain
        self.loaded = self.modtime

        threading.Thread.__init__(self)
        self.daemon = True

    @property
    def modtime(self):
        return os.path.getmtime(self.brain.__file__)

    def run(self):
        g = self.brain.add_tracks()
        while True:
            if self.modtime != self.loaded:
                log.info("Hot-swapping brain!")
                self.brain = reload(self.brain)
                self.loaded = self.modtime
                g = self.brain.add_tracks()

            track = g.next()
            log.info("Adding new track to queue.")
            self.track_queue.put(track)

I’ve since extended this concept to a number of other core modules – namely, the beatmatching generator, as I’m currently trying to increase its efficiency every day, and the MP3 decoding/encoding classes, as they’re still a bit too memory-hungry for my liking.

Conclusions

Try out Forever.fm. If you like the kind of music that’s popular on SoundCloud (currently lots of EDM) then you’ll enjoy it. I have big plans for it from here on out, but it was super fun to build, and I learned a ton. As always, feel free to email me or tweet at me if you have any questions.

Rewriting in C++ for Fun, Speed and Masochism

2012-10-10T00:00:00-04:00

A couple months ago, I posted a blog post explaining my use for low-quality smartphone photos. It involved a smart image cropping algorithm written by Michael Macias, using ImageMagick and written in Ruby. I’ve actually used the algorithm quite a bit in preparing new photos for my homepage – although there’s one major problem – it’s amazingly slow. Take a look at the kind of processing it does:

On large JPEGs from my own photo library, like the one above, this Ruby script takes roughly 2 seconds to perform a smart 124px square crop on the most interesting part of the image:

Matched 9 images.
Originals/2012/NYC/IMG_7054.JPG => ./18.jpg in 1801.717ms
Originals/2012/NYC/IMG_7055.JPG => ./19.jpg in 1856.692ms
Originals/2012/NYC/IMG_7052.JPG => ./20.jpg in 1787.717ms
Originals/2012/NYC/IMG_7059.JPG => ./21.jpg in 1727.487ms
Originals/2012/NYC/IMG_7057.JPG => ./22.jpg in 1716.977ms
Originals/2012/NYC/IMG_7056.JPG => ./23.jpg in 1692.648ms
Originals/2012/NYC/IMG_7058.JPG => ./24.jpg in 1887.043ms
Originals/2012/NYC/IMG_7051.JPG => ./25.jpg in 1977.311ms

As I often run this algorithm on entire folders of images at once, I decided to experiment and reimplement the entire program in C++. As a developer who works primarily with Python and Ruby, I’ve always felt a small amount of guilt for incurring the crazy performance overhead of interpreted and heavily dynamic languages. (A friend of mine working in the hardware industry recently got angry over the fact that he was working to make chips faster, while we developers then “throw away” the speed gains by running interpreted languages!)

Trying libjpeg

While the original Ruby script took maybe 2 hours to write, my straight port to C++ took more than 10! Most of this time was spent navigating the API of libjpeg, fumbling with pointers and buffer arithmetic, and hunting down type casting errors that subtly caused inaccurate results. However, check out the speed gains:

Thumbnailing 9 images...
Processing Originals/2012/NYC/IMG_7051.JPG... 63ms.
Processing Originals/2012/NYC/IMG_7052.JPG... 41ms.
Processing Originals/2012/NYC/IMG_7053.JPG... 5ms.
Processing Originals/2012/NYC/IMG_7054.JPG... 33ms.
Processing Originals/2012/NYC/IMG_7055.JPG... 36ms.
Processing Originals/2012/NYC/IMG_7056.JPG... 31ms.
Processing Originals/2012/NYC/IMG_7057.JPG... 35ms.
Processing Originals/2012/NYC/IMG_7058.JPG... 30ms.
Processing Originals/2012/NYC/IMG_7059.JPG... 34ms.

The same sample images that took ~2 seconds to process in Ruby take, on average, 35ms to process in C++. That’s a speed up of more than 50x. This happens to line up roughly with the popular language benchmarks that put Ruby at ~45x slower than gcc-compiled C++, despite the fact that a lot of the work is done by RMagick. (I should point out that these programs are not exactly identical – the libjpeg version makes some small feature concessions in the name of speed. Their output is nearly identical, however.)

What’s also interesting is the cost in developer time and code quantity. The original Ruby script was ~80 lines, give or take comments – while my C++ port is ~350 lines. In this one isolated, little-optimized, amateur test, C++ took 4x the code and 5x the development time to deliver 50x the performance.

Trying Magick++

However, this joyous speed boost was short-lived. I soon discovered that my libjpeg-based solution was quite buggy. Most cameras nowadays don’t rotate the raw image from the sensor before encoding to JPEG, preferring a lossless “orientation” flag in the EXIF data instead, forcing the decoding library to parse this to display the image upright. Unfortunately, libjpeg doesn’t contain any built-in facilities to “right” an image with such a tag, and doing so manually is extremely difficult. In addition, I hadn’t written any custom image scaling code, so I depended on a libjpeg flag to scale down the input image by a power of two before decoding.

Faced with this insurmountable rotation bug, and after spending another 8 hours trying to fix it, I decided to yet again rewrite the solution using Magick++, ImageMagick’s C++ client library. Without further ado, the benchmarks:

Thumbnailing 9 images...
Processing Originals/2012/NYC/IMG_7051.JPG... 340.717ms.
Processing Originals/2012/NYC/IMG_7052.JPG... 287.965ms.
Processing Originals/2012/NYC/IMG_7053.JPG... 94.133ms.
Processing Originals/2012/NYC/IMG_7054.JPG... 279.776ms.
Processing Originals/2012/NYC/IMG_7055.JPG... 286.434ms.
Processing Originals/2012/NYC/IMG_7056.JPG... 281.245ms.
Processing Originals/2012/NYC/IMG_7057.JPG... 289.052ms.
Processing Originals/2012/NYC/IMG_7058.JPG... 280.193ms.
Processing Originals/2012/NYC/IMG_7059.JPG... 283.7ms.

For numerous reasons here (Magick++ overhead, image pre-scaling, orientation correction) the Magick++-using code falls right in the middle in terms of performance. It’s ~8 times slower than using libjpeg directly, but much more correct and still more than 5 times as fast as the equivalent Ruby code.
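The ratios quoted above can be checked against the three benchmark listings. This Python 3 sketch just averages the timings exactly as printed (note that the Ruby listing shows eight timings, not nine) and divides the means.

```python
from statistics import mean

# Per-image timings in milliseconds, copied from the listings above.
ruby_ms = [1801.717, 1856.692, 1787.717, 1727.487,
           1716.977, 1692.648, 1887.043, 1977.311]
libjpeg_ms = [63, 41, 5, 33, 36, 31, 35, 30, 34]
magick_ms = [340.717, 287.965, 94.133, 279.776, 286.434,
             281.245, 289.052, 280.193, 283.7]


def speedup(slow, fast):
    """How many times faster the `fast` run is, comparing mean times."""
    return mean(slow) / mean(fast)


if __name__ == "__main__":
    print("Ruby vs libjpeg:     %.1fx" % speedup(ruby_ms, libjpeg_ms))
    print("Magick++ vs libjpeg: %.1fx" % speedup(magick_ms, libjpeg_ms))
    print("Ruby vs Magick++:    %.1fx" % speedup(ruby_ms, magick_ms))
```

On these figures the three ratios come out at roughly 53x, 8x and 7x, in line with the "more than 50x", "~8 times" and "more than 5 times" quoted in the text.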

This final version of the program took about 2 hours to put together, with most of that time spent searching for auto-orientation code (eventually pulling it out of Magick’s Mogrify command-line tool) and optimizing for speed.

The Value of Abstractions

When given the right amount of abstraction – in this case, a fast C++ library – writing the code to be adequately fast was trivial. Using old-school C-style library integration, on the other hand, ended with me wasting hours making little to no progress. The resulting program was indeed much faster, but questionably worth the time and frustration. (My head still hurts from getting improper call to jpeg library in state xyz errors repeatedly, only to find zero helpful documentation on each error state.) Using a low-level library simply requires more knowledge and more mental state than any commonly used high-level language. (Of course, this sounds obvious.)

This brings up an important point on the state of popular (and slow) languages today. When acceptable speeds are measured in seconds rather than in milliseconds, it makes perfect sense to write slow and inefficient code quickly. Every time a Rubyist runs gem install, they’re abstracting away the low level implementation details in favour of a simple interface that helps them solve their problem faster. Amortizing the cost of running the program over its runtime, rather than its development time, is logical considering the absurdly high price of a professional software developer.

The Ubiquitous Capture Device

2012-06-23T00:00:00-04:00

Every so often, I find myself in a camera store, gawking at beautiful, expensive cameras and lenses. DSLRs have dropped in price, and mirrorless interchangeable lens cameras (such as Micro Four Thirds systems) now fill the gap between cheap point-and-shoot and semi-pro. However, every single time I go to make such a purchase, I stop myself.

It’s not that I don’t want a good camera, it’s that I already have a camera good enough. Most of us have one on us at all times.

It’s my smartphone, and it can capture images like this:

This image isn’t pristine. It’s vibrant, although it could be more so. It’s lacking in detail, a bit noisy, and somewhat compressed. (Compressing it for web didn’t help with the presentation, either.) However, none of that is important. It’s a beautiful reminder of that moment – a great lunch with a great person, on a bright and sunny day, in a crowded restaurant. Having that moment captured is vastly more valuable than the quality of the image.

It’s true that my smartphone doesn’t take beautiful 18MP stills, nor does it have an immaculate 10x optical zoom. The lens isn’t removable, nor is the sensor very large. Low-light performance is horrible. The built-in HDR mode, while better than nothing, often produces horrible artifacts and delays my next shot for seconds. I have no control over aperture, ISO, shutter speed or white balance.

Most of this doesn’t matter though, as my phone is always in my pocket. What I lose in image quality and configurability, I gain in ubiquity. I might forget to grab my camera before I leave the house – and I’ll definitely have to think twice about bringing a huge DSLR along. But my phone? The only places I might not bring it are the shower and the swimming pool. I certainly use my camera enough, now that I’ve found an interesting use for lots of low-quality smartphone photos.

Apple’s iPhones 4 and 4S are now the two most popular cameras on Flickr.

Interestingly, people often embrace the lack of quality in smartphone photos. Photo filters are all the rage. When beautiful photo quality can’t be achieved, people delight in reducing quality even further by adding artistic emotion with a filter. (Ostensibly, most Instagram users don’t think that deeply about the filter they choose.) Instead of creating art via careful manipulation of a fancy camera, people do it with a one-touch filter.

Extending the idea to another medium, smartphones often have adequate microphones. As a musician with a penchant for experiments in audio, I do a lot of recording. I nearly bought an expensive Zoom field recorder recently, only to stop myself again in favour of my phone’s more-than-adequate microphones. To capture musical ideas or field recordings, it’s perfect for the same reason – my phone is always with me, and is good enough.

Most of my recent songs are built around samples taken with my iPhone. “Train In The Sky” takes its snare sample from the closing of a Vancouver SkyTrain’s doors. “Mace and Anvil” takes part of its melody from the same train’s public alert sounds. “Somnambulist” (below) starts with samples (albeit altered ones) of myself walking, and is filled with the enthusiastic cries of the chefs at a Japanese restaurant. All of these samples, while relatively low-quality, were taken with my phone.

As Chase Jarvis puts it, “the best camera is the one that’s with you.” The ability to capture any moment at any time, no matter the quality level, is key. For all I care, my pictures and sounds could come out as a grainy, hazy mess. As long as I can extract meaning and value from them, I’ve captured a moment and strengthened a memory.

Using Eight Cores (incorrectly) with Python

2012-05-13T00:00:00-04:00

One of my web apps, The Wub Machine, is very computationally expensive. Audio decoding, processing, encoding, and streaming, all in Python. Naturally, my first instinct was to turn to the multiprocessing module to spread the CPU-bound work across multiple processes, thus avoiding Python’s global interpreter lock.

In theory, it’s simple enough, but I did run into a few very nasty problems when dealing with multiprocessing in Python:

The multiprocessing module, at least on *nixes, forks the current process and communicates with the child with a pipe. This works wonderfully if the data you’re transferring can be easily pickled, and if the child process doesn’t need to modify any global state in the parent. Unfortunately, certain useful constructs in Python can’t be pickled, including lambdas and nested functions. (Module-level functions are pickled by name only, so they survive the trip only if the other side can import them.)
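The pickling limitation is easy to see in isolation. This small Python 3 sketch uses only the standard `pickle` module; `picklable` and `make_closure` are helper names invented for the example.

```python
import pickle


def picklable(obj):
    """Return True if `obj` survives pickle.dumps(), False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_closure():
    """Build a nested function, which has no importable name."""
    def inner(x):
        return x * 2
    return inner


if __name__ == "__main__":
    print(picklable({"bpm": 128, "tags": ["house"]}))   # plain data
    print(picklable(lambda x: x * 2))                   # anonymous function
    print(picklable(make_closure()))                    # nested function
```

Plain data goes through without complaint, while the lambda and the nested function are both rejected, because pickle stores functions by name and neither of them has a name that can be looked up again.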

In my app, I had a peculiar use case – I would pass callback lambdas into the constructor of a class, then spawn another process on one of that class’s methods like so:
```
    class MyClass(object):
        def __init__(self, my_callback):
            self.my_callback = my_callback

        def start_work(self):
            p = multiprocessing.Process(target=self.do_work)
            p.start()
            p.join()

        def do_work(self):
            # Calculate fibonacci or something, iunno
            self.my_callback("hey look, some data!")
```
This led to a baffling bug – while the callbacks were being run, their side-effects weren’t persistent. I inserted logging in the callback function to verify, and noted that not only was it running, but at the end of its execution, the global state had been set properly. However, from the perspective of the parent process, nothing had changed.

The reason was simple: the callback had been run in the child process and had modified the global state of the child process, not the parent. A simple fix would be to have eliminated these callbacks, but I instead used some of Armin Ronacher’s bad ideas in Python to create an experimental module that allows pseudo-function-calls between processes. Use (or even just read) at your own risk – it’s a hack.

"""
multiprocesscallback.py, by Peter Sobot (psobot.com), May 13, 2012

Handles callback functions in classes that have member functions
that are executed in a different process.

A crazy experiment in Python magic that breaks a lot of rules.
Do not use in production, for any reason. (Although I do.)

If your class takes in a callback, like so:

    class MyClass(object):
        def __init__(self, callback):

you can call MultiprocessCallback.register_all(queue) to auto-create
member functions with the names of the callback variables, which,
when called, will be safely executed in the parent process. E.g.:

    class MyClass(object):
        def __init__(self, my_callback):
            self._pq = multiprocessing.Queue()
            MultiprocessCallback.register_all(self._pq)

        def start_other_process():
            target = MultiprocessCallback.target(self.runs_in_another_process)
            p = multiprocessing.Process(target=target)
            p.start()
            MultiprocessCallback.listen()
            p.join()

        def runs_in_another_process():
            self.my_callback("hey look, some data!")

The data in the callback must be picklable, as it will be sent across
the multiprocess boundary. The parent process can read from the queue
itself and run .execute() on the MultiprocessCallback objects, or it can
use the *blocking* MultiprocessCallback.listen(), which provides a basic
listener.

Known bugs or omissions:
  - Will straight-up just not work in Windows. (The scenario doesn't
    exist - you can't use multiprocessing on member functions on Windows.)
"""

import multiprocessing
import traceback
import inspect
import sys
import time

__author__ = "psobot"


class EndListener(Exception):
    pass


def register_all(queue=None):
    _locals = inspect.getargvalues(sys._getframe(1))[3]
    if not queue:
        queue = multiprocessing.Queue()
    setattr(_locals['self'], "_mpcq", queue)
    for n, v in dict([(_n, _v) for (_n, _v) in _locals.iteritems()
                      if _n != "self"]).iteritems():
        if hasattr(v, "__call__"):
            proc = multiprocessing.current_process().name
            setattr(_locals['self'], n,
                    lambda *args, **kwargs: _safecall(proc, queue, n, v, *args, **kwargs))


def _safecall(proc, queue, n, _c, *args, **kwargs):
    if multiprocessing.current_process().name != proc:
        queue.put(MultiprocessCallback(n, *args, **kwargs))
    else:
        return _c(*args, **kwargs)


def listen(queue=None):
    if not queue:
        _locals = inspect.getargvalues(sys._getframe(1))[3]
        queue = _locals['self'].__dict__["_mpcq"]
    data = queue.get()
    while not isinstance(data, EndListener):
        if isinstance(data, MultiprocessCallback):
            data.execute()
        data = queue.get()


def target(_callable, queue=None):
    def _target(*args, **kwargs):
        _callable(*args, **kwargs)
        end(queue)
    return _target


def end(queue=None):
    if not queue:
        i = 1
        _locals = inspect.getargvalues(sys._getframe(i))[3]
        while not 'self' in _locals or not '_mpcq' in _locals['self'].__dict__:
            i += 1
            _locals = inspect.getargvalues(sys._getframe(i))[3]
        queue = _locals['self'].__dict__["_mpcq"]
    queue.put(EndListener())


class MultiprocessCallback(object):
    def __init__(self, name, *args, **kwargs):
        self.name = name
        self.stackf = traceback.format_stack(sys._getframe(3), 2)
        self.originator = multiprocessing.current_process().name
        self.args = args
        self.kwargs = kwargs

    def execute(self, search=None):
        if not search:
            i = 1
            search = inspect.getargvalues(sys._getframe(i))[3]
            while not self.name in search:
                i += 1
                search = inspect.getargvalues(sys._getframe(i))[3]
                if not self.name in search and 'self' in search:
                    search = search['self'].__dict__
        if self.name in search:
            if hasattr(search[self.name], '__call__'):
                try:
                    r = search[self.name](*self.args, **self.kwargs)
                    if r is not None:
                        print "Warning: return value from callback ignored."
                except Exception, e:
                    e.args = (" ".join(list(e.args) + ["\nOriginally called from %s (most recent call last):\n" % self.originator] + self.stackf), )
                    raise
            else:
                raise ValueError("Function %s not callable." % self.name)
        else:
            raise KeyError("Function %s not provided." % self.name)


if __name__ == "__main__":
    class Test(object):
        def __init__(self, callback = None):
            register_all()

        def run_me(self):
            """
            Run self.separate_process in its own process.
            When the callback is called, it will execute in the parent process.
            """
            p = multiprocessing.Process(target=target(self.separate_process))
            p.start()
            listen()
            p.join()

        def separate_process(self):
            for i in xrange(0, 10):
                # some intense computation
                self.callback(i)
                time.sleep(0.1)

    count = 0

    def my_callback(i):
        """
        Increments a global variable by i in the main process.
        """
        if multiprocessing.current_process().name != "MainProcess":
            raise multiprocessing.ProcessError("The global is being incremented in the wrong process!")
        global count
        count += i
        print "Counter in main process is now: %s" % count

    Test(callback=my_callback).run_me()
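For comparison, the "simple fix" of dropping cross-process callbacks can be sketched without any frame inspection: the child only puts plain data on a queue, and the parent drains the queue and invokes the callback itself. This is a Python 3 sketch, not part of the module above; `worker` and `run` are made-up names, and it assumes a POSIX system because it asks for the `fork` start method explicitly.

```python
import multiprocessing


def worker(queue):
    """Runs in the child: send plain, picklable data instead of calling back."""
    for i in range(3):
        queue.put(i * i)
    queue.put(None)     # sentinel: the child is done


def run(callback):
    """Runs in the parent: start the child, then call `callback` for each item."""
    ctx = multiprocessing.get_context("fork")   # assumption: POSIX only
    queue = ctx.Queue()
    child = ctx.Process(target=worker, args=(queue,))
    child.start()
    while True:
        item = queue.get()
        if item is None:
            break
        callback(item)  # executes in the parent, so side effects stick
    child.join()


if __name__ == "__main__":
    seen = []
    run(seen.append)
    print(seen)
```

Because the callback only ever runs in the parent, whatever state it touches is the parent's own, which is the property the hack above works so hard to recover.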

Logging, the wonderful built-in Python module for meticulously logging everything, is thread-safe. Sadly, it doesn’t seem to be multiprocessing-safe. Each logging handler guards its output with an internal “I/O thread lock,” to ensure that all log messages are written without clobbering each other. This lock is acquired for every write.

After forking another process, the first call to the logger often hangs while waiting for that lock to be released. If the lock was held (i.e.: a write was in progress) at the exact instant the process was forked, then the lock is copied into the new process in its locked state. However, whatever log operation was in progress will then release the original lock in the parent, not the copy, leaving the new process to wait forever on a lock that will never be unlocked.

The solution, in my case, was to replace the logger in use with the one provided by multiprocessing if logging from a new process:
```
def initlog():
    if multiprocessing.current_process().name == "MainProcess":
        _log = logging.getLogger(config.log_name)
    else:
        _log = multiprocessing.get_logger()
    ...
```

To find and fix these bugs took a lot of time, and a good debugging strategy. The most valuable tool turned out, surprisingly, to be GDB. GDB 7 has support for debugging Python runtimes, complete with pseudo-stack traces. Take a look at the following backtrace of a Python process provided by GDB and formatted for clarity:

    [Thread debugging using libthread_db enabled]
    [New Thread 0xb0c2fb70 (LWP 12895)]
    0x006da405 in __kernel_vsyscall ()

    Thread 1 (Thread 0xaf23ab70 (LWP 12894)):
    #0  0x006da405 in __kernel_vsyscall ()
    #1  0x003a27d5 in sem_wait@@GLIBC_2.1 ()
                  from /lib/i386-linux-gnu/libpthread.so.0
    #2  0x080f2139 in PyThread_acquire_lock (...)
                  at ../Python/thread_pthread.h:309
    #3  0x080f2fd8 in lock_PyThread_acquire_lock (...)
                  at ../Modules/threadmodule.c:52
    #4  0x080da7d5 in call_function
                    (f=Frame 0x937b47c,
                      for file /usr/lib/python2.7/threading.py,
                      line 128,
                      in acquire
                      (self=<_RLock(...) at remote 0x9caabec>,
                        blocking=1,
                        me=-1356616848),
                      throwflag=0) at ../Python/ceval.c:4013
    ...
    (goes down 79 frames)

Obviously, this looks much more complicated than a normal Python stack trace, but it’s a huge step up from zero debuggability. If I proceed down a couple more frames, I find:

    #7  0x080dac2a in fast_function
                      (f=Frame 0x9ca278c,
                      for file /usr/lib/python2.7/logging/__init__.py,
                      line 693,
                      in acquire (self=



…which is the first piece of familiar code. Line 693 of logging/__init__.py is
surrounded by a short function, and has a comment that brings the first bit of
understanding:

    def acquire(self):
        """
        Acquire the I/O thread lock.
        """
        if self.lock:
            self.lock.acquire()


Well, there you go. After fixing these race conditions and deadlocks, the Wub
Machine’s success rate immediately jumped from horrible to 95% under load.



All it took was GDB and an understanding of fork() to solve these bugs. My only
advice: be very, very, very careful when using multiprocessing.
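
The underlying failure is easy to reproduce in miniature. The sketch below is an illustration written for Python 3 on a POSIX system, not code from the Wub Machine: it holds a plain `threading.Lock` (standing in for a logging handler's lock held by some other thread) while calling `os.fork()`, then checks whether the child can acquire it. The function name is made up for the example.

```python
import os
import threading


def child_sees_held_lock():
    """Fork while a lock is held; report whether the child inherits it locked."""
    lock = threading.Lock()
    lock.acquire()  # stands in for another thread holding a logging lock
    pid = os.fork()
    if pid == 0:
        # Child: the lock's state was copied, but no thread here will release it.
        got_it = lock.acquire(False)  # non-blocking, so this demo cannot hang
        os._exit(0 if got_it else 1)
    # Parent: collect the child's verdict from its exit status.
    _, status = os.waitpid(pid, 0)
    lock.release()
    return os.WEXITSTATUS(status) == 1


if __name__ == "__main__":
    print("child inherited a locked lock:", child_sees_held_lock())
```

The child gets a copy of the lock in its locked state but not the thread that would have released it, so a blocking `acquire()` there would wait forever.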



A Site For Dinner
2012-05-08T00:00:00-04:00
I like to make small, single-serving sites – frivolous sites with only one
page, and one purpose. They’re intended to be dead-simple to use, fun to play
with, and somewhat silly. I’ve made a couple in the past, both
alone and with others, often thinking of the idea over dinner and then
implementing it in the hours (or days) that follow. Last night, I decided to
make another single-serving site – and to make it open-source, to show
others how simple it is to do.

Enter A Meal for Me. Roughly 200 lines of code for a fun site that now
helps me be more adventurous in the kitchen. (Grab the source on GitHub!)



Development took a couple hours, and was simple enough:


1. Have dinner.
2. Google for “Recipe API.”
3. Get an API key.
4. Lay out a simple page in HAML.
5. Style with SASS.
6. Wire it up to the API, and use some basic jQuery to munge the data.
7. Apply a Google Web Font and subtle background pattern to make
   things look good.
8. Apply API caching in Nginx.
9. Sleep.



To save time (and lines of code), the site is nearly 100% in-browser.  It makes
use of the wonderful Punchfork recipe API to grab data, then simply formats
the resulting recipe cleanly and simply, providing an image of the meal and a
link to instructions.

The site also makes use of HAML, SASS and CoffeeScript, rather
than HTML, CSS and JavaScript. This saved a significant amount of development
time, and allowed me to use third-party style mixins like Bourbon. Images
are sparse – the only .png files are the favicon, Punchfork reference, and the
background, which was graciously taken from SubtlePatterns. Google Web
Fonts also came in handy here, providing a well-suited font after roughly
60 seconds of searching.

To tie it all together are two non-browser components – a Rakefile and an
nginx config. The Rakefile allows me to easily compile the HAML, SASS and
CoffeeScript before deployment, and also fetches the required mixins (Bourbon).
It could very easily be extended to watch the files during development, making
the feedback loop much quicker.

The nginx config, on the other hand, serves two purposes: to cache queries to
the Punchfork API, and to hide my private API key. As their API is rate-limited,
I cache every query for 24 hours to make best use of the data I get. I also
hard-code my API key in the nginx config, to prevent others from reading it from
the client-side code and using it. All of this is quite simple to do with the
proxy_pass and proxy_cache directives.
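
The original config is not reproduced here. As a rough sketch of the idea only, something along these lines would cache upstream responses for 24 hours and append the key on the server side. The upstream host (`api.example.com`), the `/api/` prefix, the `key` parameter name, the cache path and the zone name are all placeholders, not the real Punchfork details:

```nginx
# Goes in the http context. Path, zone name and sizes are assumptions.
proxy_cache_path /var/cache/nginx/recipes keys_zone=recipes:10m max_size=100m;

server {
    listen 80;
    server_name example.com;

    location /api/ {
        # Strip the local prefix and add the private key server-side, so it
        # never appears in client code. The client's own query arguments
        # are appended after the key automatically.
        rewrite ^/api/(.*)$ /$1?key=YOUR_API_KEY break;
        proxy_pass http://api.example.com;

        # Serve repeated queries from the cache for 24 hours.
        proxy_cache recipes;
        proxy_cache_valid 200 24h;
    }
}
```

With a setup like this, the browser only ever talks to `/api/...` on the site's own domain, and the rate-limited upstream sees at most one request per distinct query per day.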

Although I haven’t load-tested or browser-tested the site, I’m done. Its mission
was to provide an evening’s worth of learning and challenge. Now, it can
hopefully help some others learn how to do the same – and if nothing else, it’ll
help me learn how to cook more things.



Note: I’ve definitely made some errors in the code. If you find any, or even
just have any suggestions or comments, please do get in touch.



Startups: Bands for Hackers
2012-04-29T00:00:00-04:00
Growing up as a young musician in suburbia, I fantasized about being in a band:
playing music in front of thousands of people, signing a record contract,
enjoying the successes (and excesses) of stardom and celebrity. As I grew older,
I began to realize how difficult it would be to achieve that goal.

Years later, as I started university and was accepted into VeloCity,
Waterloo’s startup incubator, I noticed a lot of familiar dreams.
Although their domains are vastly different, startups are just bands for hackers.



Like many bands, many startups begin in parents’ garages – serious business
in a non-serious atmosphere. Founders (or band members) spend every waking
moment together, working hard to perfect their craft and their endeavour. While
bands work to perfect their lyrics, melodies and rhythms, startups work to
perfect their pitch, product and business plan. Instead of fans, startups get
users; instead of trying to impress the A&R rep that may show up to a
performance, startups try to impress the investors in the audience on demo day.
For bands, this demo is a rough CD – for startups, it’s their first product.
Both are MVPs.

Then comes recording. Record labels often finance their artists, allowing them to
spend months in studio making a record, toiling day and night behind a recording
console. This is where the product comes from, the hard work that is initially
the band’s raison d’être. (Their private IP, if you will.) Startups are often
financed by VCs, or by themselves, to make their initial product. For months,
they toil in their studio apartments, toiling day and night behind a Linux
console. By now, both band and startup have something nobody else does – their
work. While they both have funding, neither of them has revenue.

Then comes touring. It’s no secret that this is how many musical artists make
most of their revenue. While the original music may be great, one purchase of
their album makes them pitiful amounts of money. Many SaaS startups are in the
exact same position. Selling their software directly would be equivalent to
selling the copyright on a band’s music – a nice one-time sale, but the loss of
all of the private IP. Startups tour just like musicians do, and attempt to
generate revenue. They acquire customers, attempt to gain virality, live off
user (fan) counts and try to bring in whatever cash they can. They could take
the advertising approach, and play a show sponsored by a large brand, or they
could make money from their users directly by charging for tickets.

Then comes the pivot. A startup can quickly realize that their idea isn’t
profitable, or that nobody’s interested. Their team obviously has the skill, so
they choose to make a different product. A band can quickly realize that their
music isn’t liked, or that nobody’s interested. They obviously have the skill as
musicians, so they broaden their artistic horizons and make new music in a
different genre. Neither type of pivot is bad for the organization, although
they will both lose fans or customers. In either case, the funding
party (VC or record label) will definitely be involved in the decision.

Bands and startups are also identical in one other important area – motivation.
There exist bands that are motivated by money, just as there exist startups
whose sole goal is to generate income. These bands still make art, although
it is rarely well accepted (or noticed) by critics, and often written for hire.
(The Backstreet Boys and Justin Bieber are two examples that come to mind.)
Startups with the one goal of profitability may achieve success, but are often
similarly panned, and stumble frequently on the way. (Groupon.)

On the other hand, musicians who set out to make great music are, at the very
least, taken seriously. Music reviewers devote their attention to those who care
about their work, and it’s rare to see a “classic” or universally-adored piece
of music that was written for hire*. Startups that set out to make a great
product are, on the whole, adored for their work, and often enjoy success as a
side-effect. Google was originally a PhD research project, motivated by passion
and interest rather than profit. Dollar signs were not the focus of Zuck’s
attention when he frantically wrote the first incarnation of Facebook, nor as
Jack Dorsey brainstormed how Twitter would start.

The best bands are made of those who care about their music, not their profits.
The best startups are made by those who care about their work, not their profits.

*EDIT: A few good commenters on Hacker News have reminded me that
European classical music (e.g., Mozart and the like) was most certainly
for-hire. My point here was merely to highlight the commodity nature of modern,
hired pop music, and I neglected to think about any time period other than our
own.



A Use for Smartphone Photos
2012-04-21T00:00:00-04:00
As a smartphone user, I take a lot of photos. Since I bought an iPhone 4 nearly
two years ago, I’ve taken just over 6,000 photos with it. 47GB of memories. On
average, 10 photos per day, every day, often of nothing in particular.

These photos aren’t good enough, or meaningful enough to anyone else, to post
on Flickr. 500px would scoff at them. The few people on Facebook that would
recognize the people, places and events in the photos wouldn’t see the point.
They’re tiny fragments of my life, and that’s about it.



Instead of forcing these thousands of photos to stay hidden in my iPhoto
library, I found an outlet for them – my homepage. Crudely modelled after the
stellar TED.com landing page, it’s supplied by a random set of hundreds of
images, all of which I’ve taken, and until now, hand-cropped and hand-selected.

Michael Macias, in a submission to a Codebrawl last November, came up
with a brilliantly simple method of content-aware image cropping. By measuring
the greyscale entropy of a window as it slides over an image, the
highest-interest thumbnail can be determined automatically. I took this solution,
modified it (faster, uses ImageMagick, etc.), and hacked together a quick Ruby
script.
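
To make the idea concrete, here is a minimal Python sketch of entropy-based cropping as described above. It is neither Michael Macias's code nor my Ruby script: it works on a plain list of rows of greyscale values instead of a real image, scores every window exhaustively, and the function names are made up for the example.

```python
import math
from collections import Counter


def entropy(pixels):
    """Shannon entropy (in bits) of a flat list of greyscale values."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def best_crop(image, crop_w, crop_h, step=1):
    """Return (x, y) of the crop_w x crop_h window with the highest entropy.

    `image` is a list of rows, each a list of greyscale values (0-255).
    Ties go to the first window found, scanning left to right, top to bottom.
    """
    height, width = len(image), len(image[0])
    best_score, best_xy = -1.0, (0, 0)
    for y in range(0, height - crop_h + 1, step):
        for x in range(0, width - crop_w + 1, step):
            # Flatten the window's pixels and score how "busy" they are.
            window = [p for row in image[y:y + crop_h] for p in row[x:x + crop_w]]
            score = entropy(window)
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy
```

A flat patch of sky scores close to zero bits, while a window full of varied detail scores high, so the returned corner tends to land on the busiest part of the photo. On large images, a bigger `step` trades accuracy for speed.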

This script automatically chooses 50 random images from a given path (or shell
glob) and crops them to their most “interesting” thumbnails. The thumbnails are
scaled to size, and saved in incrementing order in the destination folder. It’s
highly optimized for my personal workflow, but it does seem to work quite
well. For example, take the following shot of Zameer Manji:



The original photo was poorly exposed, had no clear subject, and was, well,
weird. After automatically cropping it down to a tiny thumbnail, it fits in
nicely on my homepage as an artsy shot of a bike rack in the daylight.

Only one thing left to do: take more photos.



Software, Art, Music and Games
2012-02-17T00:00:00-05:00
I am a software engineering student. The exact definition of that varies among my classmates and professors. Some say that it implies an ability to write software. Others argue that it requires a strong grasp of algorithms and mathematical optimization. Still others say that software engineers need only be able to design large, complex pieces of software, or manage teams of coders, or communicate project specifications, etc.

Few people correlate software engineering with art.

There are those that will argue that “software itself is a form of art,” or that “this code is beautiful.” There are certainly pieces of software, written in different languages, that could be considered their own distinct forms of “poetry.” (And no, I’m not just talking about Lisp poetry.) Elegance, cleverness, and the functionality of the code all contribute to this sense of inherent artistry.

I prefer to write code that is outwardly visible as art. Code that you need to run, not read, to appreciate.



This is the schedule view of a radio station’s website. (CFRU 93.3fm at the University of Guelph, Ontario, to be exact.) I did not design this site – that was done by the wonderful folks at Studio Function. However, I did have the pleasure of implementing the design and creating the website itself during my last work term at The Working Group.

Although there is a fair amount of complexity behind this site, the part that was most enjoyable to implement was this schedule view. It helps that it’s beautiful and eye-catching, but writing code to make this design functional was extremely satisfying. Even more satisfying was seeing someone use the site and enjoy it, and being able to say “Yes, I helped make that.”

This site, via its design and partially through its functionality, is a form of art. I’ve spent my first three work terms (one year in total) working at web development shops on client projects, implementing (and sometimes designing) beautiful software that can be appreciated by almost anybody. I had a great time doing that, and enjoyed nearly every minute.



This is the Wub Machine, my online music remixer. If you know me, or if you read this blog, I’m sure you’ve heard enough about it so far. One thing I haven’t talked about yet is the art behind it.

I initially created the Wub Machine as an experiment in computer-generated music. If I were an arts student (or even a grad student in some software programs), it would have made a great thesis project to explore computer-generated art. While the technology used to power it is stunningly awesome, and the site itself is somewhat complex, that’s not the purpose of it. (Although, I did learn a lot.)

The average user of the site is not a technophile. They couldn’t care less about the software. However, the average user can definitely appreciate the product – a piece of music (ahem, mostly) that is not only listenable, but danceable and entertaining. Many would call it art.

(Hopefully.)



This is a screenshot from Dead Rising 2, an awesome action-adventure game released a couple years ago by Capcom. Yes, those are zombies, and that’s the main character (Chuck Greene) using a modified yard tool to mow them down. It’s a great game, with a great story, great gameplay, visuals, music, and the like. Save for some vocal critics, most people would consider this art.

Visual artists modeled Chuck Greene’s character. Writers crafted the brilliant story. Software engineers put it all together, and made the entirely immersive experience possible. Their work, while technical and complex, is just as much art as the models, textures, sounds and words in the game. It doesn’t just allow users to interact with art; it forms the fundamental experience that is enjoyed and appreciated.

Using software to make immersive, beautiful, artistic experiences that can be appreciated by anybody is awesome.

TL;DR: Software can be art, in many ways. That’s what I like to make.



The middle ground between form and function
2012-01-22T00:00:00-05:00
I’ve noticed a distinct trend in all of my recent work. Not all of it is useful, and not all of it is feature-complete – but it all places a lot of importance on form over function. Let me give an example:



Earlier this month, I put together a quick site called lndr.me, which tracks the usage of laundry machines at VeloCity, my student residence at the University of Waterloo. It’s simple and email-driven. Residents can email washer@lndr.me to say that they’re using a washer, and they’ll get an email back in ~30 minutes to remind them that their clothes are done. Other residents can also check the site and see if the machines are occupied.

It’s an exceedingly simple idea, with very little code required on the backend. (It’s a Rails app with ~300 lines of Ruby.) I’ve even made an API to allow other residents to make apps out of it, or link in hardware sensors with Arduinos and Ethernet shields.
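
To show how little logic this takes, here is a rough Python sketch of the bookkeeping behind an email-driven tracker. It is not the Rails code: the class and method names are hypothetical, the cycle length is just the “~30 minutes” mentioned above, and receiving and sending actual email is left out.

```python
from datetime import datetime, timedelta

CYCLE = timedelta(minutes=30)  # assumed cycle length


class LaundryTracker:
    """Tracks which machines are in use and who needs a reminder."""

    def __init__(self):
        self._started = {}  # machine name -> (resident email, start time)

    def handle_email(self, to_address, sender, now):
        """An email to washer@... marks the washer as started by the sender."""
        machine = to_address.split("@")[0]
        self._started[machine] = (sender, now)

    def is_occupied(self, machine, now):
        """True while a started machine is still within its cycle."""
        entry = self._started.get(machine)
        return entry is not None and now - entry[1] < CYCLE

    def due_reminders(self, now):
        """Remove and return (machine, email) pairs whose cycle has finished."""
        done = [(m, who) for m, (who, t) in self._started.items()
                if now - t >= CYCLE]
        for machine, _ in done:
            del self._started[machine]
        return done
```

A periodic job would call `due_reminders(datetime.now())` and send one email per pair, while the site's status page only needs `is_occupied`.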

However, before I even had the idea fleshed out, or the implementation decided on, I did a mockup. I opened Photoshop, drew some icons, found a simple colour scheme, searched for a viable domain name, and scribbled a UX flow into my Moleskine before ever typing rails new app.

This simple (some call it cute) design was my starting point. I added some things along the way – animation on the waves in the washing machine to show it’s running, or a slightly-shaking dryer icon to show the same – but most of the product was finished before I started writing code. I essentially started from the user’s perspective and then built inwards.

Now, some people will surely think this is obvious. “Of course you wait for designs first before starting implementation, that’s just obvious!” you yell. In the client-and-project-driven world of software contracting, that’s absolutely true. Specs must be finalized, and designs (or at least mockups) finished before the product is built.

A lot of other people, though, are confused by this. “It’s only a side project, who cares how it looks?” you might say. Or “I’m not a designer, I’m a coder.” I’ve heard both of those far too often to dismiss.

Your product’s user experience is just as important as what it does. Most apps do things that are marginally useful – track laundry, wake you up in the morning, play music, or give you directions. Would you use a music player that required a screwdriver to change songs? What about a map that gave directions in a series of JSON-encoded latitude and longitude coordinates, to then be decoded by the user? Of course not.

Products are successful, useful, and a joy to use if they have great user experience. A lot of hackers and coders nowadays don’t realize how important this is.

Let me give another example:



Ninjaquote is a site created by Scott Greenlay, Jinny Kim and myself in 24 hours (21:15, to be exact) during the recent Facebook hackathon at the University of Waterloo. Its goal is simple: it takes two of your Facebook friends, and finds something one of them said in the past, and quizzes you on it. The game is exceedingly simple, and has another dead-simple user experience.


1. Click to authorize the app to view your Facebook account.
2. Receive quote.
3. Click answer.
4. See if you were correct.
5. Goto step 2.



This simple UX, coupled with a good domain name and great mascot, makes the site a pleasure to use. So simple to use, in fact, that it won the hackathon.

This confused me at first. Other entries were far more technically complex – Hachi was an in-browser collaborative code editor built in Node.js and Socket.IO. FriendMozaic did some image processing to make your profile picture a mosaic of friends’ pictures. PrivacyVeil used some crazy OpenCV processing to detect faces behind you while you work, and pop up an Excel spreadsheet to cover your Reddit browsing.

Our winning entry was effectively ~1000 lines of JavaScript, CSS3 and HTML5. Nothing fancy, nothing new – just a working, effective, and addictive user experience. Having the minimum number of features wasn’t a hindrance, as we had design to make the site appealing anyway.

tl;dr: Find the middle ground between form and function. It’s much more valuable than either extreme.



"The Street Preacher" - A Hyper-Local Twitter Bot
2011-11-30T00:00:00-05:00
I walk through Yonge & Dundas Square in Toronto every day.



That intersection, which some call Toronto’s equivalent of Times Square, has a large number of street preachers. Loud, startling, obnoxious people that yell warnings of doom or urge repentance. Silly people.

I decided to use Twitter’s real-time streaming API to make an extremely specific location-based Twitter bot. The purpose? To respond to you if you tweet near the street preachers at Yonge & Dundas, with similar messages. Call it art, or a statement about society, or making fun of those preachers, whatever – I call it a fun technical and social experiment.

Using an excellent ArsTechnica article as a guide, I created a quick Python script that watches the Twitter stream for a given area, and replies to tweets in a very specific location. (±10 meters or so, by my guess.) If you’re one of the lucky few to tweet within those bounds, you’ll get a reply from @yonge_dundas:



A day later, I decided to clean up the script (rewrite it in Ruby, too) and open-source it. Well, here it is, in a quick GitHub gist:






Feel free to fork it, repurpose it, and do whatever! (Just keep my name at the top, if you please.)
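
For anyone curious about the location check itself, the core of a bot like this can be sketched in a few lines of Python. This is an illustration, not the script from the gist: the coordinates for Yonge & Dundas are approximate, the 10-metre radius mirrors the rough guess above, and the function names are made up. Connecting to the streaming API and posting the reply are left out entirely.

```python
import math

EARTH_RADIUS_M = 6371000.0
# Approximate coordinates of Yonge & Dundas; an assumption for illustration.
YONGE_DUNDAS = (43.6561, -79.3802)


def distance_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1 = a
    lat2, lon2 = b
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine formula.
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def should_reply(point, center=YONGE_DUNDAS, radius_m=10.0):
    """True if a tweet's (lat, lon) falls within radius_m of the center.

    Tweets without a location are passed as None and never get a reply.
    """
    return point is not None and distance_m(point, center) <= radius_m
```

The streaming filter only needs to narrow tweets down to a rough area; a precise check like this one then decides who actually gets preached at.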