Content-encoding gzip, plus HTTP range requests, equals bad mojo

This is going to be long. You’ve been warned.

Act 1: Wherein our hero is oblivious to the trouble

Years ago, early in the life of my podcast, someone waved their phone at me and said, “sometimes the podcast playback jumps back to the beginning, and then I cannot skip or scrub forward to resume where I was.” I shrugged. What’s one problem report for Google Podcasts on Android, particularly since this was early days for Google’s Podcasts app?

Curious.

Aside: “scrub” is audio lingo for manually sliding through audio. On apps, there is usually a small grab-here marker at the current play position. If you touch that and slide, the audio will scrub along until the time where you let go. (Versus “skip” buttons which jump forward or backward a set amount of time.)

I have a vague memory of some later time, where some automated analysis of our podcast feed reported that we “don’t support range requests.” I initially ignored this, but made a note. One day—months later—I looked up what an [HTTP] range request is, and verified that our web site does in fact support range requests.

Curiouser.

A week ago, I got another problem report, from an Android user with the Google Podcasts app. First off, it’s no longer early days for that app, so I’m less inclined to just “blame the app” when someone speaks up. Second, the callout was more thorough. This person had tried several of our episodes, (all of which exhibited the problem,) and they had verified that some other podcasts they subscribe to did not exhibit this problem.

“Curiouser and curiouser,” said Alice.

Our podcast is self-hosted. We run a virtual instance of CentOS on BlueHost, with Apache and WordPress, with the Seriously Simple Podcasting (SSP) plugin producing our podcast RSS feed. Seriously, none of that is simple. But it does mean that we have a tremendous amount of control—if we want to look under the hood. (Stop here. Take 4 minutes to watch that if you’ve never seen Mike’s New Car.)

Act 2: Wherein our hero heads into the belly of the beast

I asked the person who waved their problem at me years ago, “hey, uh, do you still see that problem?” (Yes they do.) …and I reached out to James Cridland at podnews.net and he verified that he too sees this behavior with my podcast files. …and he pointed out that he was seeing, (it’s not clear exactly what tool he used, but it doesn’t matter for this story,) content-encoding: gzip for the media file that we were serving.

Wait, wat.

Why am I serving a compressed (i.e., gzip’d) version of an MP3 file? That’s already a file containing compressed data; it probably increases in size when you gzip it. Not to mention the CPU cycles wasted gzip’ing the many-megabyte sized files for each request.
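That hunch is easy to sanity-check without touching the real files. The sketch below gzips a block of random bytes, which I am using as a stand-in for MP3 data. That is an assumption: real MP3 files are not perfectly random, but they are already compressed, so they should behave much the same. The names are mine, purely for illustration.

```python
import gzip
import os


def gzip_ratio(data: bytes) -> float:
    """Return compressed size divided by original size."""
    return len(gzip.compress(data)) / len(data)


# Random bytes stand in for already-compressed audio (an assumption).
already_compressed = os.urandom(1_000_000)
# Repetitive text stands in for the kind of content gzip is meant for.
plain_text = b"All work and no play makes Jack a dull boy.\n" * 20_000

incompressible_ratio = gzip_ratio(already_compressed)
text_ratio = gzip_ratio(plain_text)

print(f"random bytes: {incompressible_ratio:.3f}")  # close to 1.0, no real savings
print(f"plain text:   {text_ratio:.3f}")            # far below 1.0
```

If the first ratio hovers around 1.0, all the server is buying with that CPU time is a file of roughly the same size.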

Next the voice in the back of my head started pointing out that HTTP range requests, where the web client (in this story a podcast player app) can ask for a specific range of bytes from a resource, sure feel like the sort of thing that might be related to pulling down some of a file now, and then more of the file later after you’ve listened to it for half an hour. Maybe if we didn’t support range requests that would mess up skipping and scrubbing? But wait, no, I checked two years ago, (and I just rechecked,) that we support range requests. So what the heck?! Is the problem related to compression, to range requests, the combination, or something else?

Spock mode on. Start checking everything methodically. When you’ve eliminated all other possibilities, whatever remains, must be the case.

What if we don’t actually support range requests on our media files? So I started digging into how Seriously Simple Podcasting (SSP) handles the actual feeding out of files.

Aside: I know enough about Apache and PHP to know that just because Apache supports range requests on files (“here’s 100 bytes from that MP3 you asked for…”) doesn’t mean that a PHP program would necessarily be able to answer a range request. Spoiler: It’s very hard to support a range request programmatically in PHP. So I need to know what exactly—Apache or SSP, which is just a pile of PHP code—actually feeds the media file?
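To make that less abstract, here is a minimal sketch of what answering a single range request involves. It is in Python rather than PHP, and serve_range is a hypothetical helper of mine, not anything taken from SSP or Apache. It only handles the simple bytes=START-END form; multiple ranges and suffix ranges are left out.

```python
import re


def serve_range(data: bytes, range_header: str):
    """Answer one 'bytes=START-END' range against in-memory data.

    Returns (status, headers, body).
    """
    match = re.fullmatch(r"bytes=(\d+)-(\d*)", range_header.strip())
    if not match:
        # Unrecognised Range header: ignore it and send the whole resource.
        return 200, {"Content-Length": str(len(data))}, data

    start = int(match.group(1))
    if match.group(2) and int(match.group(2)) < start:
        # An end before the start is invalid, so the header is ignored.
        return 200, {"Content-Length": str(len(data))}, data
    if start >= len(data):
        # Nothing to send: 416 Range Not Satisfiable.
        return 416, {"Content-Range": f"bytes */{len(data)}"}, b""

    # An open-ended range ('bytes=500-') runs to the last byte.
    last_byte = len(data) - 1
    end = min(int(match.group(2)), last_byte) if match.group(2) else last_byte

    body = data[start:end + 1]  # both ends of the range are inclusive
    headers = {
        "Content-Range": f"bytes {start}-{end}/{len(data)}",
        "Content-Length": str(len(body)),
    }
    return 206, headers, body


status, headers, body = serve_range(bytes(2000), "bytes=500-600")
print(status, headers)
```

Even this toy version has three different status codes to get right, which is a hint as to why I would rather let the web server do it.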

So I posted on the SSP support forum…

I’m trying to troubleshoot a problem reported with the Google Podcasts player on Android. (I’ve one reporting user and I cannot personally reproduce the problem.) In the process, I went down a rabbit hole looking into HTTP range requests.

I’m wondering: If the SSP plugin is serving the MP3 audio files via PHP (which would require the PHP code to implement supporting range requests) or if, after a redirection from the stats-collection URL, it lets my web server (Apache) just send out the static file (in which case Apache itself handles range requests.)

One of the devs responded:

Do you perhaps have a URL describing HTTP range requests and how they relate to serving files behind the PHP redirect, so that I can understand how it could be causing the problem? From the cursory review I’ve done, it would appear we should update the plugin to support range requests, would you agree?

Which startled me, both because range requests are apparently more obscure than I had thought, and because this was a very nice olive branch from a developer right out of the gate. Anyway. I was already really doubting that SSP was causing this problem, so I put on my big-boy detective pants and dug deeper.

Leading me to post:

…I think it’s not actually a problem [with SSP], but I wanted to double-check my analysis with someone familiar with the code.

For range requests, it’s RFC 7233, but before you even bother looking at that: I think the answer is that SSP doesn’t handle the serving of the audio file via PHP, but rather leaves that to the underlying web server. (In my case, that’s Apache, which handles range requests of static assets.)

Straight from my RSS feed, I have (for example) <enclosure url="https://moversmindset.com/podcast-download/4734/062-chris-and-shirley-darlington-rowat-serendipity-family-and-relationships.mp3" length="29493071" type="audio/mpeg"></enclosure> and if I fetch that URL, I get SSP doing a redirection. Here I’m asking Curl to get me a range of bytes:

Craigs-iMac:~ craig$ curl -I --range 500-600 https://moversmindset.com/podcast-download/4734/062-chris-and-shirley-darlington-rowat-serendipity-family-and-relationships.mp3
HTTP/1.1 302 Found
Date: Fri, 11 Oct 2019 14:55:48 GMT
Server: Apache
Pragma: no-cache
Expires: 0
Cache-Control: must-revalidate, post-check=0, pre-check=0
Robots: none
X-Redirect-By: WordPress
Set-Cookie: PHPSESSID=6576b49ab4d78ab7628bb05a727805dd; path=/
Location: https://moversmindset.com/wp-content/uploads/2019/10/MM_62_Chris_and_Shirley.mp3
Content-Type: text/html; charset=UTF-8

Aside: -I with curl says just give me the headers for a response for the requested resource. Not the actual resource. The 302 HTTP status, combined with the Location: header is standard web-speak for a web server saying, “please go get this resource instead.” Critically this is a 302 which is a “temporary” redirect, not a 301 which is a “permanent” redirect. With 302, if you want this resource again or more of it with another range request, you should ask for it again at the original URL. Versus with a 301, where you should not ask again, you should use the new location going forward with any subsequent requests. tl;dr: 302 + Location is what I expected to see.
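The 301-versus-302 distinction boils down to one small decision on the client side. Here is a sketch of it; the function name and the example.com URLs are hypothetical, and real clients also deal with caching and redirect loops, which I am ignoring.

```python
PERMANENT_REDIRECTS = {301, 308}
TEMPORARY_REDIRECTS = {302, 303, 307}


def url_for_next_request(original_url: str, status: int, location: str) -> str:
    """Pick the URL a well-behaved client should start from next time."""
    if status in PERMANENT_REDIRECTS:
        # The resource has moved for good: remember the new location.
        return location
    # Temporary redirect (or anything else): keep asking at the original URL.
    return original_url


original = "https://example.com/podcast-download/1/episode.mp3"
target = "https://example.com/wp-content/uploads/episode.mp3"

print(url_for_next_request(original, 302, target))
print(url_for_next_request(original, 301, target))
```

With a 302, every later request (including later range requests) is supposed to go back through the stats-collection URL first.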

…that curl request gives me a standard redirection. As expected(!) since SSP wants to track statistics. That new 302 location is a direct-link into the WP assets storage. When I curl that, making a range request again, it works perfectly. (Apache is happy to give me the 101 bytes I’m asking for.) Below is both the headers-only (-I in Curl) and a full fetch….

Craigs-iMac:~ craig$ curl -I --range 500-600 https://moversmindset.com/wp-content/uploads/2019/10/MM_62_Chris_and_Shirley.mp3
HTTP/1.1 206 Partial Content
Date: Fri, 11 Oct 2019 14:58:33 GMT
Server: Apache
Last-Modified: Sun, 06 Oct 2019 14:51:01 GMT
Accept-Ranges: bytes
Content-Length: 101
Vary: Accept-Encoding
Content-Range: bytes 500-600/29493071
Content-Type: audio/mpeg

Craigs-iMac:~ craig$ curl --range 500-600 https://moversmindset.com/wp-content/uploads/2019/10/MM_62_Chris_and_Shirley.mp3 > ./foo
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   101  100   101    0     0    389      0 --:--:-- --:--:-- --:--:--   388
Craigs-iMac:~ craig$ ls -alh ./foo
-rw-r--r--  1 craig  staff   101B Oct 11 10:58 ./foo

Aside: First a bunch of headers saying that I would receive [if I actually asked] 101 bytes of content-length, and then an actual request where I end up with 101 bytes in a file on my computer. tl;dr: everything as expected.
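If you want to check the arithmetic in those headers, the Content-Range value carries everything needed. A small sketch, with parse_content_range being a name I made up for illustration:

```python
import re


def parse_content_range(value: str):
    """Split 'bytes FIRST-LAST/TOTAL' into three integers."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    first, last, total = (int(group) for group in match.groups())
    return first, last, total


first, last, total = parse_content_range("bytes 500-600/29493071")
# Both ends are inclusive, so bytes 500 through 600 is 101 bytes, not 100.
expected_length = last - first + 1
print(expected_length, total)
```

That inclusive range is why the server reports a Content-Length of 101, and the number after the slash is the size of the whole MP3.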

So I think the answer is that SSP doesn’t interfere with HTTP range requests. And that means the problem I’m trying to solve can’t be caused by my site not correctly answering range requests.

At this point, I folded my arms with one of those “hurumph” noises. Then I thought of something: Ya’ know, since it’s Apache that is going to feed me that MP3 file, it would totally be able to change its behavior based on what the web client, (aka, the podcast player app, Google Podcasts,) said it would accept as a response.

Aside: The Web is a conversation between web clients and web servers. Every request—and there can be hundreds of requests to show you one page—starts with the client asking for a resource and listing the types of responses it will accept. Think: am I wanting an image resource, an audio file, a blob of HTML, etc. Also, what types of encoding of those resources can the client understand. tl;dr: No more tl;dr’s here. We’re in the belly of the beast now.

So how do I tell curl to manipulate the encodings it should tell the server it would accept? Answer: By adding a header via the -H flag.

So reviewing: Here’s a normal ask for the headers for a specific media file. This isn’t a range request, this is just an ask for the headers for an entire resource:

Craigs-iMac:~ craig$  curl -I https://moversmindset.com/wp-content/uploads/2019/10/MM_62_Chris_and_Shirley.mp3
HTTP/1.1 200 OK
Date: Thu, 17 Oct 2019 13:09:27 GMT
Server: Apache
Last-Modified: Sun, 06 Oct 2019 14:51:01 GMT
ETag: "220031-1c2074f-5943f1052a358"
Accept-Ranges: bytes
Content-Length: 29493071
Vary: Accept-Encoding
Content-Type: audio/mpeg

That’s exactly what I expect: If I actually asked for the resource, I’d get about 30 megabytes of content back.

And what would happen if I tell curl (note the -H argument on this one) to tell the server that I’d be happy with a gzip’d response:

Craigs-iMac:~ craig$ curl -H "Accept-Encoding: gzip" -I https://moversmindset.com/wp-content/uploads/2019/10/MM_62_Chris_and_Shirley.mp3
HTTP/1.1 200 OK
Date: Thu, 17 Oct 2019 13:11:03 GMT
Server: Apache
Last-Modified: Sun, 06 Oct 2019 14:51:01 GMT
ETag: "220031-1c2074f-5943f1052a358-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Type: audio/mpeg

Oh shit. It would send me a gzip encoded version of my MP3 file. And critically, it doesn’t tell me how big that would be (no Content-Length is given) because the server would have to actually compress the file with gzip to find out.

Aside: If you know about Apache’s ability to serve out pre-compressed versions of files (so you have the .mp3 and the .mp3.gz files lying on disk ready to go), then you don’t need to read any of this article. I was tempted to set that up just to have Content-Length and the gzip encoding header in the shot because you would have noticed. ;)

So IF the client . . . say for example, oh, I don’t know, the Google Podcasts app maybe? . . . happens to mention that it could accept a gzip’d response, then boy-howdy a gzip’d response our Apache would send.
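For the curious, this is roughly the question the server has to answer about each request. The sketch below is my own simplified reading of the Accept-Encoding header, not Apache’s actual logic, and accepts_gzip is a hypothetical name.

```python
from typing import Optional


def accepts_gzip(accept_encoding: Optional[str]) -> bool:
    """Return True if the header value says a gzip'd response is acceptable."""
    if not accept_encoding:
        return False
    wildcard_ok = False
    for item in accept_encoding.split(","):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        params = params.strip().lower()
        quality = 1.0  # no q-value means fully acceptable
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # unreadable q-value: treat as not acceptable
        if name == "gzip":
            # An explicit gzip entry wins; q=0 means "do not send this".
            return quality > 0
        if name == "*":
            wildcard_ok = quality > 0
    return wildcard_ok


print(accepts_gzip("gzip, deflate, br"))
print(accepts_gzip("identity"))
```

Note that nothing here depends on what is being requested. A client that says it accepts gzip says so for MP3 files too, and it is up to the server to know better.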

…and that’s a problem why? Because it turns out that compression as a content encoding (any sort, not just gzip) and range requests do not mix in any useful way. The long version is this Stack Overflow thread, Is it possible to send HTTP response using GZIP and byte ranges at the same time? The short answer is, no, because even if I wanted to waste my time compressing it just to give you 100 bytes out of the middle, you could not uncompress those 100 bytes on their own. A gzip stream has to be decompressed starting from its first byte, and a player that wants to jump to minute 30 of the audio has no way to know which compressed bytes that corresponds to.
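You can watch this fail on your own machine. The sketch below compresses some throwaway text (made up for the purpose), then tries to decompress bytes 500 through 600 of the compressed stream by themselves. The helper name is hypothetical.

```python
import gzip
import zlib

# Throwaway, compressible content standing in for a served resource.
original = b"".join(
    b"line %d of a perfectly ordinary text file\n" % i for i in range(20000)
)
compressed = gzip.compress(original)


def try_to_decompress(fragment: bytes):
    """Return the decompressed bytes, or None if the fragment is unusable."""
    try:
        return gzip.decompress(fragment)
    except (OSError, EOFError, zlib.error):
        return None


whole = try_to_decompress(compressed)            # the full stream works
middle = try_to_decompress(compressed[500:601])  # 101 bytes from the middle

print(whole == original)
print(middle)
```

The full stream round-trips fine. The fragment has no gzip header and none of the preceding data the decompressor depends on, so there is nothing sensible to get out of it.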

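Aside: HTTP/1.1 does define a way to compress in flight without changing what a byte range means, namely Transfer-Encoding: gzip, where the range is taken from the original resource first and only then compressed for the trip. As far as I can tell, almost nothing asks for it in practice, which is a pity, because it turns out that being able to have range requests on resources compressed in flight would be very useful.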

Ok smart guy, what happens if you try to make a range request and accept compression?

Craigs-iMac:~ craig$ curl --range 500-600 -H "Accept-Encoding: gzip" -I https://moversmindset.com/wp-content/uploads/2019/10/MM_62_Chris_and_Shirley.mp3
HTTP/1.1 206 Partial Content
Date: Sat, 19 Oct 2019 01:58:16 GMT
Server: Apache
Last-Modified: Sun, 06 Oct 2019 14:51:01 GMT
ETag: "220031-1c2074f-5943f1052a358"
Accept-Ranges: bytes
Content-Length: 101
Content-Range: bytes 500-600/29493071
Content-Type: audio/mpeg

Honestly? That’s not what I expected. I was expecting some sort of actual error from the server.

But nope, that’s a perfectly happy 101 bytes (or it would be if I hadn’t specified that I just wanted the headers) out of the full 30 megabytes-or-so, and it wouldn’t be compressed. This confuses the hell out of me because it’s exactly what you’d want. The app asked for something we can’t do, so we skip the compression part. After all, the app said it would accept compression, not that it demands compression.
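Putting the three curl experiments side by side, the behavior I observed can be summarized as a tiny decision table. To be clear, this is a sketch of what the responses looked like from the outside, not a description of how Apache is implemented, and plan_response is a name I invented.

```python
def plan_response(request_headers: dict) -> dict:
    """Summarize the observed response for a given set of request headers.

    Header names are matched exactly as written here, which real servers
    do not require; this is only a sketch.
    """
    wants_range = "Range" in request_headers
    gzip_ok = "gzip" in request_headers.get("Accept-Encoding", "").lower()

    if wants_range:
        # Observed: the range comes from the plain file, compression is skipped.
        return {"status": 206, "content_encoding": None, "length_known": True}
    if gzip_ok:
        # Observed: compressed on the fly, so no Content-Length up front.
        return {"status": 200, "content_encoding": "gzip", "length_known": False}
    return {"status": 200, "content_encoding": None, "length_known": True}


print(plan_response({"Accept-Encoding": "gzip"}))
print(plan_response({"Accept-Encoding": "gzip", "Range": "bytes=500-600"}))
```

The middle case is the suspicious one: a client that starts with a plain request and gets a gzip’d response without a length has very little to go on if it later wants to seek.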

Oh, who cares. Compressing MP3 files—especially live on the fly each time they are served—is totally the wrong thing to be doing. Let’s just stop that and hope the problem goes away.

Aside: You thought the wizard behind the curtain always figures it out? I’ve got some bad news for you sunshine, Pink isn’t well, he stayed back at the hotel…

Act 3: Wherein our hero vanquishes the problem by typing four characters

Still in Spock mode, let’s describe the actual problem…

There’s crazy-level complexity with compression and range requests
It’s not clear what exactly the Google Podcasts app is requesting. I don’t have any Android devices and I’d have to capture TCP data from the network to even find out
So the app makes some sort of request…
…and the server responds
Making scrub and skip not work.

Oh, well that’s perfectly clear then, isn’t it?

Seriously, screw this. I’m just turning off compression of MIME type “audio/mpeg” files. (MP3 files are an example of MIME type “audio/mpeg”.)

Aside: Yes, I said MIME. The Multipurpose Internet Mail Extensions specification is how we ended up classifying what something is on the web. Major type “audio”, minor type “mpeg”. I know, this stuff is bonkers… it’s just turtles all the way down.

Since Apache does not compress things by default, all I have to do is find where the “DEFLATE”—that’s really what it’s called, gzip is one way of “deflating” files—output filter is assigned to handle files of MIME type “audio/mpeg.” That’s actually easy to do if you are fluent in Apache.

I’m an Apache configuration file wizard. I’ve been using Apache since it was spun off from something else in—I had to look it up—1995. Trivia: It was “a patchy server” cobbled together from some open-source work done—never mind, go read it on Wikipedia. ANYWAY.

I read over the entire Apache configuration, it’s complicated in files that include other files with nesting and logic and lions and tigers and bears… but it’s all perfectly clear and straight-forward to me…

…and yet I can’t see why it would EVER decide to apply the DEFLATE output filter to an audio/mpeg file.

Strike one. Off with the kid gloves.

If I can’t figure out where it’s turned on and remove that, the next best thing to do is to just add a rule that turns it off. But when I went to do that, I found that whoever designed the system architecture had already stumbled upon this mess. (Compressing media files is wrong-headed.) They already had a rule, which confesses the sin of wacky configuration by saying, “yo! for these media files, knock it off with the DEFLATE output filter!”

SetEnvIfNoCase Request_URI \.(?:gif|jpg|png|ico|zip|gz|mp4|flv)$ no-gzip

As soon as I saw this, I was like, “you’re kidding me, right?” That says if the thing being requested ends with a period followed by any of those file extensions, then set an environment flag telling the DEFLATE module not to gzip.

…and “mp3” is not listed.

So I cursed like a sailor, threw my hands up in the air, and added “mp3|” to that string right after “mp4|”, restarted Apache, checked with James, and the problem is fixed.
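If you would rather convince yourself with something other than cursing, the before-and-after patterns are easy to compare. This sketch uses Python’s regex engine rather than the one inside Apache, which is fine for a pattern this simple. It also assumes the leading dot in the rule is escaped, and skips_gzip is a hypothetical helper.

```python
import re

# The rule as I found it, and the rule after adding "mp3|" right after "mp4|".
# re.IGNORECASE mirrors the "NoCase" part of SetEnvIfNoCase.
before = re.compile(r"\.(?:gif|jpg|png|ico|zip|gz|mp4|flv)$", re.IGNORECASE)
after = re.compile(r"\.(?:gif|jpg|png|ico|zip|gz|mp4|mp3|flv)$", re.IGNORECASE)


def skips_gzip(pattern, request_uri: str) -> bool:
    """True if the rule would set no-gzip for this request URI."""
    return pattern.search(request_uri) is not None


uri = "/wp-content/uploads/2019/10/MM_62_Chris_and_Shirley.mp3"
print(skips_gzip(before, uri))  # the old rule lets the MP3 through to DEFLATE
print(skips_gzip(after, uri))   # the new rule exempts it
```

Four characters, one missing file extension.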

omfg where’s my Tylenol?

Craig Constantine

Caution: Blogging. Randomly.