Jean-Marc Valin : The Myth of the $100,000 Listening Test


Ever since we started working on Opus at the IETF, it's been a recurring theme. "You guys don't know how to test codecs", "You can't be serious unless you spend $100,000 testing your codec with several independent labs", or even "designing codecs is easy, it's testing that's hard". OK, subjective testing is indeed important. After all, that's the main thing that differentiates serious signal processing work from idiots using $1000 directional, oxygen-free speaker cable. However, just like speaker cables, more expensive listening tests do not necessarily mean more useful results. In this post I'm going to explain why this kind of thinking is wrong. I will avoid naming anyone here because I want to attack the myth of the $100,000 listening test, not the people who believe in it.

In the Beginning

Back in the 70s and 80s, digital audio equipment was very expensive, complicated to deploy, and difficult to test at all. Not everyone could afford analog-to-digital converters (ADC) or digital-to-analog converters (DAC), so any testing required using expensive, specialized labs. When someone came up with a new piece of equipment or a codec, it could end up being deployed for several decades, so it made sense to give it to one of these labs to test the hell out of it. At the same time, it wasn't too hard to do a good job in testing because algorithms were generally simple and codecs only supported one or two modes of operation. For example, a codec like G.711 only has a single bit-rate and can be implemented in less than 10 lines of code. With something that simple, it's generally not too hard to have 100% code coverage and make sure all corner cases are handled correctly. Considering the investments involved, it just made sense to pay tens or hundreds of thousands of dollars to make sure nothing blows up. This was paid by large telcos and their suppliers, so they could afford it anyway.
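To make the "less than 10 lines" remark concrete, here is a minimal sketch of the mu-law half of G.711 in Python, assuming 16-bit signed linear input. The function and constant names are chosen for illustration, the A-law variant is not shown, and this is not taken from any reference implementation.

```python
BIAS = 0x84    # offset added before finding the segment (exponent)
CLIP = 32635   # largest magnitude that still fits after adding BIAS

def linear_to_ulaw(sample: int) -> int:
    """Compress one 16-bit signed PCM sample to an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent = magnitude.bit_length() - 8            # segment number, 0..7
    mantissa = (magnitude >> (exponent + 3)) & 0x0F  # 4 bits within the segment
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law byte back to a 16-bit signed PCM sample."""
    byte = ~byte & 0xFF
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if byte & 0x80 else magnitude
```

With only one bit-rate and one code path per direction, every branch here can be exercised with a handful of inputs, which is the point about 100% coverage being easy.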

Things remained pretty much the same through the 90s. When G.729 was standardized in 1995, it still only had a single bit-rate, and the computational complexity was still beyond what a PC could do in real-time. A few years later, we finally got codecs like AMR-NB that supported several bit-rates, though the number was still small enough that you could test each of them.

Enter Opus

When we first attempted to create a codec working group (WG) at the IETF, some folks were less than thrilled to have their "codec monopoly" challenged. The first objection we heard was "you're not competent enough to write a codec". After pointing out that we already had three candidate codecs on the table (SILK, CELT, BroadVoice), created by the authors of 3 already-deployed codecs (iSAC, Speex, G.728), the objection quickly switched to testing. After all, how was the IETF going to review this work and make sure it was any good?

The best answer came from an old-time ("gray beard") IETF participant and was along the lines of: "we at the IETF are used to reviewing things that are a lot harder to evaluate, like crypto standards. When it comes to audio, at least all of us have two ears". And it makes sense. Among all the things the IETF does (transport protocols, security, signalling, ...), codecs are among the easiest to test because at least you know the criteria and they're directly measurable. Audio quality is a hell of a lot easier to measure than "is this cipher breakable?", "is this signalling extensible enough?", or "Will this BGP update break the Internet?"

Of course, that was not the end of the testing story. For many months in 2011 we were again faced with never-ending complaints that Opus "had not been tested". There was this implicit assumption that testing the final codec improves the codec. Yeah right! Apparently, the Big-Test-At-The-End is meant to ensure that the codec is good and if it's not then you have to go back to the drawing board. Interestingly, I'm not aware of a single ITU-T codec for which that happened. On the other hand, I am aware of at least one case where the Big-Test-At-The-End revealed something wrong. Let's look at the listening test results from the AMR-WB (a.k.a. G.722.2) codec. AMR-WB has 9 bitrates, ranging from 6.6 kb/s to 23.85 kb/s. The interesting thing with the results is that when looking at the two highest rates (23.05 and 23.85) one notices that the 23.85 kb/s mode actually has lower quality than the lower 23.05 bitrate. That's a sign that something's gone wrong somewhere. I'm not aware of why that was the case or what exactly happened from there, but apparently it didn't bother people enough to actually fix the problem. That's the problem with final tests: they're final.

A Better Approach

What I've learned from Opus is that it's possible to have tests that are far more useful and much cheaper. First, final tests aren't that useful. Although we did conduct some of those, ultimately their main use ends up being for marketing and bragging rights. After all, if you still need these tests to convince yourself that your codec is any good, something's very wrong with your development process. Besides, when you look at a codec like Opus, you have about 1200 possible bitrates, using three different coding modes, four different frame sizes, and either mono or stereo input. That's far more than one can reliably test with traditional subjective listening tests. Even if you could, modern codecs are complex enough that some problems may only occur with very specific audio signals.
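As a rough illustration of the scale involved, the figures quoted above can simply be multiplied out. This is a naive upper bound that assumes every combination of bitrate, mode, frame size, and channel count is valid, which is not necessarily the case; the variable names are illustrative only.

```python
bitrates = 1200        # "about 1200 possible bitrates", per the text
coding_modes = 3
frame_sizes = 4
channel_layouts = 2    # mono or stereo

# Naive upper bound on distinct encoder configurations (assumes all combinations are valid).
upper_bound = bitrates * coding_modes * frame_sizes * channel_layouts
print(upper_bound)  # 28800
```

Even if only a fraction of those combinations are meaningful, that is far more conditions than a listening panel can cover, and that is before considering the audio material itself.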

The single testing approach that gave us the most useful results was also the simplest: just put the code out there so people can use it. That's how we got reports like "it works well overall, but not on this rare piece of post-neo-modern folk metal" or "it worked for all our instruments except my bass". This is not something you can catch with ITU-style testing. It's one of the most fundamental principles of open-source development: "given enough eyeballs, all bugs are shallow". Another approach was simply to throw tons of audio at it and evaluate the quality using PEAQ-style objective measurement tools. While these tools are generally unreliable for precise evaluation of a codec's quality, they're pretty good at flagging files the codec does badly on for further analysis.

We ended up using more than a dozen different approaches to testing, including various flavours of fuzzing. In the end, when it comes to the final testing, nothing beats having the thing out there. After all, as our Skype friends would put it:

"Which codec do you trust more? The codec that's been tested by dozens of listeners in a highly controlled lab, or the codec that's been tested by hundreds of millions of listeners in just about all conditions imaginable?"

It's not like we actually invented anything here either. Software testing has evolved quite a bit since the 80s and we've mainly attempted to follow the best practices rather than use antiquated methods "because that's what we've always done".

March 18, 2013 07:02 PM

Monty : Wait, dude, what?


Oh. Oh my. After a decade of the MPEG LA saying they were coming to destroy the FOSS codec movement, with none other than the late Steve Jobs himself chiming in, today the Licensing Authority announced what we already knew.

They got nothing. There will be no Theora patent pool. There will be no VP8 patent pool. There will be no VPnext patent pool.

We knew that of course, we always did. It's just that I never, in a million years, expected them to put it in writing and walk away. The wording suggests Google paid some money to grease this along, and the agreement wording is interesting [and instructive] but make no mistake: Google won. Full stop.

This is not an unconditional win for FOSS, of course; the LA narrowed the scope of the agreement as much as they could in return for agreeing to stop being a pissy, anti-competitive brat. But this is still huge. We can work with this.

For at least the immediate future, I shall have to think some uncharacteristically nice things about the MPEG LA.*

And now... Discuss!

*Apologies to Rep. Barney Frank

by Monty (monty@xiph.org) at March 08, 2013 04:58 AM

Monty : It's Out! It's Finally Out!


We did it. We finally finished Xiph's second big video: Episode 2: Digital Show & Tell

"The second video from Xiph.Org explores multiple facets of digital audio signals and how they really behave in the real world. Sampling, quantization, dither, band-limiting, and vintage bench equipment all in one video!" Go see it!

by Monty (monty@xiph.org) at February 26, 2013 11:38 AM

Jean-Marc Valin : Defending Opus IPR Status


For those who had been wondering what we thought of the recent France Telecom IPR declaration against Opus, here's our response. It's nice to be working for a company that isn't afraid of speaking publicly about patents.

February 06, 2013 09:11 PM

Chris Pearce : HTML5 video playbackRate and Ogg chaining support landed in Firefox


Paul Adenot has recently landed patches in Firefox to enable the playbackRate attribute on HTML5 <audio> and <video> elements.

This is a cool feature that I've been looking forward to for a while; it means users can speed up playback of videos (and audio) so that you can for example watch presentations sped up and only slow down for the interesting bits. Currently Firefox's <video> controls don't have support for playbackRate, but it is accessible from JavaScript, and hopefully we'll get support added to our built-in controls soon.

Paul has also finally landed support for Ogg chaining. This has been strongly desired for quite some time by community members, and the final patch also had contributions from "oneman" (David Richards), who also was a strong advocate for this feature.

We decided to reduce the scope of our chaining implementation in order to make it easier and quicker to implement. We targeted the features most desired by internet radio providers, and so we only support chaining in Ogg Vorbis and Opus audio files and we disable seeking in chained files.

Thanks Paul and David for working on these features!

by Chris Pearce (noreply@blogger.com) at December 23, 2012 02:28 AM

Jean-Marc Valin : Releasing Opus 1.1-alpha


We just released Opus 1.1-alpha, which includes more than one year of development compared to the 1.0.x branch. There are quality improvements, optimizations, bug fixes, as well as an experimental speech/music detector for mode decisions. That being said, it's still an alpha release, which means it can also do stupid things sometimes. If you come across any of those, please let us know so we can fix it. You can send an email to the mailing list, or join us on IRC in #opus on irc.freenode.net. The main reason for releasing this alpha is to get feedback about what works and what does not.

Quality improvements

Most of the quality improvements come from the unconstrained variable bitrate (VBR). In the 1.0.x encoder VBR always attempts to meet its target bitrate. The new VBR code is free to deviate from its target depending on how difficult the file is to encode. In addition to boosting the rate of transients as 1.0.x does, the new encoder also boosts the rate of tonal signals, which are harder to code for Opus. On the other hand, for signals with a narrow stereo image, Opus can reduce the bitrate. What this means in the end is that some files may significantly deviate from the target. For example, someone encoding his music collection at 64 kb/s (nominal) may find that some files end up using as low as 48 kb/s, while others may use up to about 96 kb/s. However, for a large enough collection, the average should be fairly close to the target.

There are a few more ways in which the alpha improves quality. The dynamic allocation code was improved and made more aggressive, the transient detector was once again rewritten, and so was the tf analysis code. A simple thing that improves quality of some files is the new DC rejection (3-Hz high-pass) filter. DC is not supposed to be present in audio signals, but it sometimes is and harms quality. Lastly, there are many minor improvements for speech quality (both on the SILK side and on the CELT side), including changes to the pitch estimator.
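For readers unfamiliar with DC rejection, the idea can be sketched with a generic first-order DC-blocking filter. This is not the filter used in the Opus encoder; the 48 kHz sample rate, the pole approximation, and the function name are all assumptions made for illustration.

```python
import math

def dc_reject(samples, sample_rate=48000, cutoff_hz=3.0):
    """First-order DC blocker: y[n] = x[n] - x[n-1] + r * y[n-1]."""
    # Pole just inside the unit circle; a common approximation for small cutoffs.
    r = 1.0 - 2.0 * math.pi * cutoff_hz / sample_rate
    prev_in = 0.0
    prev_out = 0.0
    out = []
    for x in samples:
        y = x - prev_in + r * prev_out
        out.append(y)
        prev_in, prev_out = x, y
    return out

# One second of pure DC offset: the output starts at the offset and decays towards zero.
filtered = dc_reject([1.0] * 48000)
```

With a cutoff this low, audible content passes through essentially untouched while a constant offset is removed within a fraction of a second.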

Speech/music detector

Another big feature is automatic detection of speech and music. This is useful for selecting the optimal encoding mode between SILK-only/hybrid and CELT-only. Unlike what some people think, it's not as simple as encoding all music with CELT and all speech with SILK. It also depends on the bitrate (at very low rate, we'll use SILK for music and at high rate, we'll use CELT for speech). Automatic detection isn't easy, but doing so in real-time (with no look-ahead) is even harder. Because of that the detector tends to take 1-2 seconds before reacting to transitions and will sometimes make bad decisions. We'd be interested in knowing about any screw ups of the algorithm.

Bandwidth detection

The new encoder can also detect the bandwidth of the input signal. This is useful to avoid wasting bits encoding frequencies that aren't present in the signal. While easier than speech/music detection, bandwidth detection isn't as easy as it sounds because of aliasing, quantization and dithering. The current algorithm should do a reasonable job, but again we'd be interested in knowing about any failure.

December 22, 2012 05:04 PM

Silvia Pfeiffer : What is “interoperable TTML”?


I’ve just tried to come to terms with the latest state of TTML, the Timed Text Markup Language.

TTML has been specified by the W3C Timed Text Working Group and released as a RECommendation v1.0 in November 2010. Since then, several organisations have tried to adopt it as their caption file format. This includes the SMPTE, the EBU (European Broadcasting Union), and Microsoft.

Both Microsoft and the EBU actually looked at TTML in detail and decided that in order to make it usable for their use cases, a restriction of its functionalities is needed.

EBU-TT

The EBU released EBU-TT, which restricts the set of valid attributes and features. “The EBU-TT format is intended to constrain the features provided by TTML, especially to make EBU-TT more suitable for the use with broadcast video and web video applications.” (see EBU-TT).

In addition, EBU-specific namespaces were introduced to extend TTML with EBU-specific data types, e.g. ebuttdt:frameRateMultiplierType or ebuttdt:smpteTimingType. Similarly, a bunch of metadata elements were introduced, e.g. ebuttm:documentMetadata, ebuttm:documentEbuttVersion, or ebuttm:documentIdentifier.

The use of namespaces as an extensibility mechanism will ensure that EBU-TT files continue to be valid TTML files. However, any vanilla TTML parser will not know what to do with these custom extensions and will drop them on the floor.

Simple Delivery Profile

With the intention to make TTML ready for “internet delivery of Captions originated in the United States”, Microsoft proposed a “Simple Delivery Profile for Closed Captions (US)” (see Simple Profile). The Simple Profile is also a restriction of TTML.

Unfortunately, the Microsoft profile is not the same as the EBU-TT profile: for example, it contains the “set” element, which is not conformant in EBU-TT. Similarly, the supported style features are different, e.g. Simple Profile supports “display-region”, while EBU-TT does not. On the other hand, EBU-TT supports monospace, sans-serif and serif fonts, while the Simple profile does not.

Thus files created for the Simple Delivery Profile will not work on players that expect EBU-TT and the reverse.

Fortunately, the Simple Delivery Profile does not introduce any new namespaces and new features, so at least it is an explicit subpart of TTML and not both a restriction and extension like EBU-TT.

SMPTE-TT

SMPTE also created a version of the TTML standard called SMPTE-TT. SMPTE did not decide on a subset of TTML for their purposes – it was simply adopted as a complete set. “This Standard provides a framework for timed text to be supported for content delivered via broadband means,…” (see SMPTE-TT).

However, SMPTE extended TTML in SMPTE-TT with an ability to store a binary blob with captions in another format. This allows using SMPTE-TT as a transport format for any caption format and is deemed to help with “backwards compatibility”.

Now, instead of specifying a profile, SMPTE decided to define how to convert CEA-608 captions to SMPTE-TT. Even if it’s not called a “profile”, that’s actually what it is. It even has its own namespace: “m608:”.

Conclusion

With all these different versions of TTML, I ask myself what a video player that claims support for TTML will do to get something working. The only chance it has is to implement all the extensions defined in all the different profiles. I pity the player that has to deal with a SMPTE-TT file that has a binary blob in it and is expected to be able to decode this.

Now, what is a caption author supposed to do when creating TTML? They obviously cannot expect all players to be able to play back all TTML versions. Should they create different files depending on what platform they are targeting, i.e. an EBU-TT version, a SMPTE-TT version, a vanilla TTML version, and a Simple Delivery Profile version? Should they be throwing all the features of all the versions into one TTML file and hope that the players will pick out the right things that they require and drop the rest on the floor?

Maybe the best way to progress would be to make a list of the “safe” features: those features that every TTML profile supports. That may be the best way to get an “interoperable TTML” file. Here’s me hoping that this minimal set of features doesn’t just end up being the usual (starttime, endtime, text) triple.
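Computing such a list is just a set intersection over the per-profile feature sets. The sketch below is illustrative only: the differing features ("set", "display-region", font families) are the ones mentioned in this post, while the shared begin/end/text entries and all the identifier names are assumptions, not taken from any of the specifications.

```python
# Hypothetical, heavily abbreviated feature sets; real profiles define many more features.
profiles = {
    "EBU-TT": {"begin", "end", "text", "font-family"},
    "Simple Delivery Profile": {"begin", "end", "text", "set", "display-region"},
}

# A feature is "safe" only if every profile supports it.
safe_features = set.intersection(*profiles.values())
print(sorted(safe_features))  # ['begin', 'end', 'text']
```

Each additional profile added to the dictionary can only shrink the result, which is exactly the worry about ending up with the bare triple.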

UPDATE:

I just found out that UltraViolet have their own profile of SMPTE-TT called CFF-TT (see UltraViolet FAQ and spec). They are making some SMPTE-TT fields optional, but introduce a new @forcedDisplayMode attribute under their own namespace “cff:”.

by silvia at September 19, 2012 11:01 AM

Jean-Marc Valin : Opus is out, it rocks, and it's a standard


We finally made it! Opus is now standardized by the IETF as RFC 6716. See the Mozilla hacks post and the Xiph.Org press release for more details. Of course, feel free to help spread the word around.

We're also releasing both version 1.0.0, which is the same code as the RFC, and version 1.0.1, which is a minor update on that code (mainly with the build system). As usual, you can get those from http://opus-codec.org/

Thanks to everyone who contributed by fixing bugs, reporting issues, implementing Opus support, testing, advocating, ... It was a lot of work, but it was worth it.

September 11, 2012 07:20 PM

Silvia Pfeiffer : Why I became a HTML5 co-editor


A few weeks ago, I had the honor to be appointed as part of the editorial team of the W3C HTML5 specification.

Since Ian Hickson had recently decided to focus solely on editing the WHATWG HTML living standard specification, the W3C started looking for other editors to take the existing HTML5 specification to REC level. REC level is what other standards organizations call a “ratified standard”.

But what does REC level really mean for HTML?

In my probably somewhat subjective view, recommendation level means that a snapshot is taken of the continuously evolving HTML spec, which has a comprehensive feature set, that is implemented in a cross-browser interoperable way, has a complete test set for the features, and has received wide review. The latter implies that other groups in the W3C have had a chance to look at the specification and make sure it satisfies their basic requirements, which include e.g. applicability to all users (accessibility, internationalization), platforms, and devices (mobile, TV).

Basically it means that we stop for a “moment”, take a deep breath, polish the feature set that we’ve been working on thus far, and make sure we all agree on it, before we get back to changing the world with cool new stuff. In a software project we would call it a release branch with feature freeze.

Now, as productive as that may sound for software – it’s not actually that exciting for a specification. Firstly, the most exciting things happen when writing new features. Secondly, development of browsers doesn’t just magically stop to get the release (REC) happening. And lastly, if we’ve done our specification work well, there should be only little work to do. Basically, it’s the thankless work of tidying up that we’re looking at here. :-)

So, why am I doing it? I am not doing this for money – I’m currently part-time contracting to Google’s accessibility team working on video accessibility and this editor work is not covered by my contract. It wasn’t possible to reconcile polishing work on a specification with the goals of my contract, which include pushing new accessibility features forward. Therefore, when invited, I decided to offer my spare time to the W3C.

I’m giving this time under the condition that I’d only be looking at accessibility and video related sections. This is where my interest and expertise lie, and where I’m passionate to get things right. I want to make sure that we create accessibility features that will be implemented and that we polish existing video features. I want to make sure we don’t digress from implementations which continue to get updated and may follow the WHATWG spec or HTML.next or other needs.

I am not yet completely sure what the editorship will entail. Will we look at tests, too? Will we get involved in HTML.next? Thus far we’ve been preparing for our work by setting up adequate version control repositories, building a spec creation process, discussing how to bridge to the WHATWG commits, and analysing the long list of bugs to see how to cope with them. There’s plenty of actual text editing work ahead and the team is shaping up well! I look forward to the new experiences.

by silvia at August 15, 2012 01:25 PM

Jean-Marc Valin : Opus will be mandatory to implement for WebRTC


I just got back from the 84th IETF meeting in Vancouver. The most interesting part (as far as I was concerned anyway) was the rtcweb working group meeting. One of the topics was selecting the mandatory-to-implement (MTI) codecs. For audio, we proposed having both Opus and G.711 as MTI codecs. Much to our surprise, most of the following discussion was over whether G.711 was a good idea. In the end, there was strong consensus (the IETF believes in "rough consensus and running code") in favor of Opus+G.711, so that's what's going to be in rtcweb. Of course, implementers will probably ship with a bunch of other codecs for legacy compatibility purposes.

The video codec discussion was far less successful. Not only is there still no consensus over which codec to use (VP8 vs H.264), but there's also been no significant progress in getting to a consensus. Personally, I can't see how anyone could possibly consider H.264 as a viable option. Not only is it incompatible with open-source, but it's like signing a blank check: nobody knows how much MPEG-LA will decide to charge for it in the coming years, especially for the encoder, which is currently not an issue for HTML5 (which only requires a decoder). The main argument I have heard against VP8 is "we don't know if there are patents". While this is true in some sense, the problem is much worse for H.264: not only are there tons of known patents for which we only know the licensing fees in the short term, but there's still at least as much risk when it comes to unlicensed patents (see the current Motorola v. Microsoft case).

August 05, 2012 07:24 PM

Jean-Marc Valin : Opus approved by the IETF


Three years after we first tried convincing the IETF to standardize an audio codec, Opus has finally been approved by the IETF. The only remaining step until it's officially an RFC is the RFC editor (fixing last minor issues, typos, ...). That should take on the order of 6-8 weeks (variable), at which point we'll have the RFC and the 1.0 release. Thanks to everyone who helped develop, test, support, or advocate Opus.

July 03, 2012 02:59 PM

Ben Schwartz : It’s Google


I’m normally reticent to talk about the future; most of my posts are in the past tense. But now the plane tickets are purchased, apartment booked, and my room is gradually emptying itself of my furniture and belongings. The point of no return is long past.

A few days after Independence Day, I’ll be flying to Mountain View for a week at the Googleplex, and from there to Seattle (or Kirkland), to start work as a software engineer on Google’s WebRTC team, within the larger Chromium development effort. The exact project I’ll be working on initially isn’t yet decided, but a few very exciting ideas have floated by since I was offered the position in March.

Last summer I told a friend that I had no idea where I would be in a year’s time, and when I listed places I might be — Boston, Madrid, San Francisco, Schenectady — Seattle wasn’t even on the list. It still wasn’t in March, when I was offered this position in the Cambridge (MA) office. It was an unfortunate coincidence that the team I’d planned to join was relocated to Seattle shortly after I’d accepted the offer.

My recruiters and managers were helpful and gracious in two key ways. First, they arranged for me to meet with ~5 different leaders in the Cambridge office whose teams I might be able to join instead of moving. Second, they flew me out to Seattle (I’d never been to the city, nor the state, nor any of the states or provinces that it borders) and arranged for meetings with various managers and developers in the Kirkland office, just so I could learn more about the office and the city. I spent the afternoon wandering the city and (with help from a friend of a friend), looking at as many districts as I could squeeze between lunch and sleep.

The visit made all the difference. It made the city real to me … and it seemed like a place that I could live. It also confirmed an impressive pattern: every single Google employee I met, at whichever office, seemed like someone I would be happy to work alongside.

When I returned there were yet more meetings scheduled, but I began to perceive that the move was essentially inevitable. The hiring committee had done their job well, and assigned me to the best fitting position. Everything else was second best at best.

It’s been an up and down experience, with the drudgery of packing and schlepping an unwelcome reminder of the feeling of loss that accompanies leaving history, family, and friends behind. I am learning in the process that, having never really moved, I have no idea how to move.

But there’s also sometimes a sense of joy in it. I am going to be an independent, free adult, in a way that cannot be achieved by even the happiest potted plant.

After signing the same lease on the same student apartment for the seventh time, I worried about getting stuck, in some metaphysical sense, about failure to launch from my too-comfortable cocoon. It was time for a grand adventure.

This is it.

by Ben at June 29, 2012 05:10 PM

David Schleef : GStreamer Streaming Server Library


Introducing the GStreamer Streaming Server Library, or GSS for short.

This post was originally intended to be a release announcement, but I started to wander off to work on other projects before the release was 100% complete.  So perhaps this is a pre-announcement.  Or it’s merely an informational piece with the bonus that the source code repository is in a pretty stable and bug-free state at the moment.  I tagged it with “gss-0.5.0”.

What it is

GSS is a standalone HTTP server implemented as a library.  Its special focus is to serve live video streams to thousands of clients, mainly for use inside an HTML5 video tag.  It’s based on GStreamer, libsoup, and json-glib, and uses Bootstrap and BrowserID in the user interface.

GSS comes with a streaming server application that is essentially a small wrapper around the library.  This application is referred to as the Entropy Wave Streaming Server (ew-stream-server); the code that is now GSS was originally split out of this application.  The app can be found in the tools/ directory in the source tree.

Features

  • Streaming formats: WebM, Ogg, MPEG-TS.  (FLV support is waiting for a flvparse element in GStreamer.)
  • Streams in different formats/sizes/bitrates are bundled into a single “program”.
  • Streaming to Flash via HTTP.
  • Authentication using BrowserID.
  • Automatic conversion from properly formed MPEG-TS to HTTP Live Streaming.
  • Automatic conversion to RTP/RTSP.  (Experimental, works for Ogg/Theora/Vorbis only.)
  • Stream upload via HTTP PUT (3 different varieties), Icecast, raw TCP socket.
  • Stream pull from another HTTP streaming server.
  • Content protection via automatic one-time URLs.
  • (Experimental) Video-on-Demand stream types.
  • Per-stream, per-program, and server metrics.
  • HTTP configuration interface and REST API are used to control the server, allowing standalone operation and easy integration with other web servers.

What’s not there?

  • Other types of authentication, LDAP or other authorization.
  • RTMP support.  (Maybe some day, but there are several good open-source Flash servers out there already.)
  • Support for upload using HTTP PUT with no 100-Continue header.  Several HTTP libraries do this.
  • Decent VOD support, with rate-controlled streaming, burst start, and seeking.

The details

 

by David Schleef at June 14, 2012 06:21 PM

Silvia Pfeiffer : Video Conferencing in HTML5: WebRTC via Web Sockets


A bit over a week ago I gave a presentation at Web Directions Code 2012 in Melbourne. Maxine and John asked me to speak about something related to HTML5 video, so I went for the new shiny: WebRTC – real-time communication in the browser.

Presentation slides

I only had 20 min, so I had to make it tight. I wanted to show off video conferencing without special plugins in Google Chrome in just a few lines of code, as is the promise of WebRTC. To a large extent, I achieved this. But I made some interesting discoveries along the way. Demos are in the slide deck.

UPDATE: Opera 12 has been released with WebRTC support.

Housekeeping: if you want to replicate what I have done, you need to install a Google Chrome Web Browser 19+. Then make sure you go to chrome://flags and activate the MediaStream and PeerConnection experiment(s). Restart your browser and now you can experiment with this feature. Big warning up-front: it’s not production-ready, since there are still changes happening to the spec and there is no compatible implementation by another browser yet.

Here is a brief summary of the steps involved to set up video conferencing in your browser:

  1. Set up a video element each for the local and the remote video stream.
  2. Grab the local camera and stream it to the first video element.
  3. (*) Establish a connection to another person running the same Web page.
  4. Send the local camera stream on that peer connection.
  5. Accept the remote camera stream into the second video element.

Now, the most difficult part of all of this – believe it or not – is the signalling part that is required to build the peer connection (marked with (*)). Initially I wanted to run completely without a server and just enter the remote’s IP address to establish the connection. This is, however, not a functionality that the PeerConnection object provides [might this be something to add to the spec?].

So, you need a server known to both parties that can provide for the handshake to set up the connection. All the examples that I have seen, such as https://apprtc.appspot.com/, use a channel management server on Google’s appengine. I wanted it all working with HTML5 technology, so I decided to use a Web Socket server instead.

I implemented my Web Socket server using node.js (code of websocket server). The video conferencing demo is in the slide deck in an iframe – you can also use the stand-alone html page. Works like a treat.

While it is still using Google’s STUN server to get through NAT, the messaging for setting up the connection is running completely through the Web Socket server. The messages that get exchanged are plain SDP message packets with a session ID. There are OFFER, ANSWER, and OK packets exchanged for each streaming direction. You can see some of it in the below image:

WebRTC demo
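
The relaying role of the signalling server described above can be sketched without any networking code. This is a minimal illustration and not the server used in the demo: createRelay, join, leave and relay are hypothetical names, and a peer is assumed to be any object with a send(string) method, which a Web Socket connection object would typically provide.

```javascript
// Hypothetical in-memory relay: forwards each signalling message (e.g. an
// OFFER, ANSWER or OK packet) to every other peer in the same session.
function createRelay() {
  const sessions = new Map(); // session ID -> Set of peers

  return {
    join(sessionId, peer) {
      if (!sessions.has(sessionId)) {
        sessions.set(sessionId, new Set());
      }
      sessions.get(sessionId).add(peer);
    },

    leave(sessionId, peer) {
      const peers = sessions.get(sessionId);
      if (!peers) return;
      peers.delete(peer);
      if (peers.size === 0) sessions.delete(sessionId);
    },

    // Returns the number of peers the message was delivered to.
    relay(sessionId, sender, message) {
      const peers = sessions.get(sessionId);
      if (!peers) return 0;
      let delivered = 0;
      for (const peer of peers) {
        if (peer !== sender) {
          peer.send(JSON.stringify(message));
          delivered++;
        }
      }
      return delivered;
    }
  };
}
```

The server never inspects the SDP payload in this sketch; it only needs the session ID to decide who receives the message.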

I’m not running a public WebSocket server, so you won’t be able to see this part of the presentation working. But the local loopback video should work.

At the conference, it all went without a hitch (while the wireless played along). I believe you have to host the WebSocket server on the same machine as the Web page, otherwise it won’t work for security reasons.

A whole new world of opportunities lies out there when we get the ability to set up video conferencing on every Web page – scary and exciting at the same time!

by silvia at June 14, 2012 07:43 AM

Silvia Pfeiffer : HTML5 multi-track audio or video


In the last months, we’ve been working hard at the WHATWG and W3C to spec out new HTML markup and a JavaScript interface for dealing with audio or video content that has more than just one audio and video track.

This is particularly relevant when a Web page author wants to add a sign language track to a video or audio resource for deaf people, or an audio description track (i.e. a sound track in which a speaker explains the key things that can be seen on screen) for blind people. It is also relevant when a Web page author wants to publish a video with multiple audio tracks that are each a different language dub for the video. Less common uses include a director’s commentary track or different camera angles for an event.

Just to be clear: this is not a means to introduce video editing functionality into the Web browser. If you want to do edits, you’re better off with an application that will eventually render a new piece of content and includes fancy transitions etc. Similarly, this is not a means to introduce mixing functionality (as in what DJs do when they play with multiple audio recordings). You’re better off with an actual audio mixing or DJ application that will provide you all sorts of amazing effects and filters.

So, multi-track is squarely focused on synchronizing alternative or additional tracks to a single resource with a single timeline to which all tracks are slaved.

Two means of publishing such multi-track media content are possible:

  • In-band multi-track
  • Synchronized resources

1. In-band multi-track

In in-band multi-track, there is a single file that has all the tracks inside it. For this single file, there is now an API in HTML5 that allows addressing and controlling these tracks.

Of the video file formats that Web browsers support, WebM is currently not defined to contain more than one audio or video track. However, since WebM is using the Matroska container format, which supports multi-track, it is possible to extend WebM for multi-track resources. I have seen multitrack Ogg, MP4 and Matroska files in the wild and most media players support their display.

The specification that has gone into HTML5 to support in-band multi-track looks as follows:

interface HTMLMediaElement : HTMLElement {
  [...]
  // tracks
  readonly attribute MultipleTrackList audioTracks;
  readonly attribute ExclusiveTrackList videoTracks;
};

interface TrackList {
  readonly attribute unsigned long length;
  DOMString getID(in unsigned long index);
  DOMString getKind(in unsigned long index);
  DOMString getLabel(in unsigned long index);
  DOMString getLanguage(in unsigned long index);

           attribute Function onchange;
};

interface MultipleTrackList : TrackList {
  boolean isEnabled(in unsigned long index);
  void enable(in unsigned long index);
  void disable(in unsigned long index);
};

interface ExclusiveTrackList : TrackList {
  readonly attribute unsigned long selectedIndex;
  void select(in unsigned long index);
};

You will notice that every audio and video track gets an index to address them. You can enable() and disable() individual audio tracks and you can select() a single video track for display. This means that one or more audio tracks can be active at the same time (e.g. main audio and audio description), but only one video track will be active at a time (e.g. main video or sign language).

Through the getID(), getKind(), getLabel() and getLanguage() functions you can find out more about what actual content is available in the individual tracks so as to activate/deactivate them correctly and display the right information about them.

getKind() identifies the type of content that the track exposes such as “description” (for audio description), “sign” (for sign language), “main” (for the default displayed track), “translation” (for a dubbed audio track), and “alternative” (for an alternative to the default track).

getLabel() provides a human readable string that describes the content of the track aiming to be used in a menu.

getID() provides a short machine-readable string that can be used to construct a media fragment URI for the track. The use case for this will be discussed later.

getLanguage() provides a machine-readable language code to identify which language is spoken or signed in an audio or sign language video track.

Example 1:

The following uses a video file that has a main video track, a main audio track in English and French, and an audio description track in English and French. (It likely also has caption tracks, but we will ignore text tracks for now.) This code sample switches the French audio tracks on and all other audio tracks off.

<video id="v1" poster="video.png" controls>
 <source src="video.ogv" type="video/ogg">
 <source src="video.mp4" type="video/mp4">
</video>

<script type="text/javascript">
video = document.getElementsByTagName("video")[0];

for (i=0; i< video.audioTracks.length; i++) {
  if (video.audioTracks.getLanguage(i) == "fr") {
    video.audioTracks.enable(i);
  } else {
    video.audioTracks.disable(i);
  }
}
</script>

Example 2:

The following uses an audio file that has a main audio track in English, no main video track, but sign language video tracks in ASL (American Sign Language), BSL (British Sign Language), and ASF (Australian Sign Language). This code sample switches the Australian sign language track on and all other video tracks off.

<video id="a1" controls>
 <source src="audio_sign.ogg" type="video/ogg">
 <source src="audio_sign.mp4" type="video/mp4">
</video>

<script type="text/javascript">
video = document.getElementsByTagName("video")[0];

for (i=0; i< video.videoTracks.length; i++) {
  if (video.videoTracks.getLanguage(i) == "asf") {
    video.videoTracks.select(i);
    break;
  }
}
</script>

If you have more tracks in both examples that conflict with your intentions, you may need to further filter your activation / deactivation code using the getKind() function.
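
A minimal sketch of such filtering, written against the draft MultipleTrackList interface quoted above. The helper name enableOnly and its parameters are assumptions made for illustration, not part of the specification.

```javascript
// Hypothetical helper: enables the audio tracks whose language matches and
// whose kind is in the given list, and disables every other audio track.
// Returns the indexes of the tracks that were enabled.
function enableOnly(trackList, language, kinds) {
  var enabled = [];
  for (var i = 0; i < trackList.length; i++) {
    var matches = trackList.getLanguage(i) == language &&
                  kinds.indexOf(trackList.getKind(i)) !== -1;
    if (matches) {
      trackList.enable(i);
      enabled.push(i);
    } else {
      trackList.disable(i);
    }
  }
  return enabled;
}

// In a page, for French main audio plus French audio description:
//   enableOnly(document.getElementById("v1").audioTracks,
//              "fr", ["main", "description"]);
```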

2. Synchronized resources

Sometimes the production process of media creates not a single resource with multiple contained tracks, but multiple resources that all share the same timeline. This is particularly useful for the Web, because it means the user can download only the required resources, typically saving a substantial amount of bandwidth.

For this situation, an attribute called @mediagroup can be added in markup to slave multiple media elements together. This is administered in the JavaScript API through a MediaController object, which provides events and attributes for the combined multi-track object.

The new IDL interfaces for HTMLMediaElement are as follows:

interface HTMLMediaElement : HTMLElement {
  [...]
  // media controller
           attribute DOMString mediaGroup;
           attribute MediaController controller;
};

interface MediaController {
  readonly attribute TimeRanges buffered;
  readonly attribute TimeRanges seekable;
  readonly attribute double duration;
           attribute double currentTime;

  readonly attribute boolean paused;
  readonly attribute TimeRanges played;
  void play();
  void pause();

           attribute double defaultPlaybackRate;
           attribute double playbackRate;

           attribute double volume;
           attribute boolean muted;

           attribute Function onemptied;
           attribute Function onloadedmetadata;
           attribute Function onloadeddata;
           attribute Function oncanplay;
           attribute Function oncanplaythrough;
           attribute Function onplaying;
           attribute Function onwaiting;
           attribute Function ondurationchange;
           attribute Function ontimeupdate;
           attribute Function onplay;
           attribute Function onpause;
           attribute Function onratechange;
           attribute Function onvolumechange;
};

You will notice that the MediaController replicates some of the states and events of the slave media elements. In general the approach is that the attributes represent the summary state from all the elements and the writable attributes when set are handed through to all the slave elements.

Importantly, if the individual media elements have @controls activated, then the displayed controls interact with the MediaController thus allowing synchronized playback and interaction with the combined multi-track object.

Example 3:

The following uses a video file that has a main video track, a main audio track in English. There is another video file with the ASL sign language for the video, and an audio file with the audio description in English. This code sample creates controls on the first file, which then also control the audio description and the sign language video, neither of which have controls. Since the audio description doesn’t have controls, it doesn’t get visually displayed. The sign language video will just sit next to the main video without controls.

<video id="v1" poster="video.png" controls mediagroup="a11y_vid">
 <source src="video.webm" type="video/webm">
 <source src="video.mp4" type="video/mp4">
</video>

<video id="v2" poster="sign.png" mediagroup="a11y_vid">
 <source src="sign.webm" type="video/webm">
 <source src="sign.mp4" type="video/mp4">
</video>

<audio id="a1" mediagroup="a11y_vid">
 <source src="audio.ogg" type="audio/ogg">
 <source src="audio.mp3" type="audio/mp3">
</audio>
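
The group can also be driven from script instead of the displayed controls. The following is a minimal sketch that assumes the draft MediaController interface listed above; toggleGroupPlayback is a hypothetical helper name, and the sketch assumes the controller attribute is empty for an element that is not part of a media group.

```javascript
// Hypothetical helper: toggles playback of every element in a media group
// through the controller shared by the slaved elements.
function toggleGroupPlayback(mediaElement) {
  var controller = mediaElement.controller;
  if (!controller) {
    return false; // element is not slaved to a MediaController
  }
  if (controller.paused) {
    controller.play();
  } else {
    controller.pause();
  }
  return true;
}

// In a page: toggleGroupPlayback(document.getElementById("v1"));
```

Because the play and pause calls go to the controller rather than to one element, the main video, the sign language video and the audio description stay on the same timeline.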

Example 4:

We now accompany a main video with three sign language video tracks in ASL, BSL and ASF. We could just do this in JavaScript and replace the src of a second video element with the links to BSL and ASF as required, but then we need to run our own media controls to list the available tracks. So, instead, we create a video element for each one of the tracks and use CSS to remove the inactive ones from the page layout. The code sample activates the ASF track and deactivates the other sign language tracks.

<style>
  video.inactive { display: none; }
</style>

<video id="v1" poster="video.png" controls mediagroup="a11y_vid">
 <source src="video.webm" type="video/webm">
 <source src="video.mp4" type="video/mp4">
</video>

<video id="v2" poster="sign_asl.png" mediagroup="a11y_vid" class="active">
 <source src="sign_asl.webm" type="video/webm">
 <source src="sign_asl.mp4" type="video/mp4">
</video>

<video id="v3" poster="sign_bsl.png" mediagroup="a11y_vid" class="inactive">
 <source src="sign_bsl.webm" type="video/webm">
 <source src="sign_bsl.mp4" type="video/mp4">
</video>

<video id="v4" poster="sign_asf.png" mediagroup="a11y_vid" class="inactive">
 <source src="sign_asf.webm" type="video/webm">
 <source src="sign_asf.mp4" type="video/mp4">
</video>

<script type="text/javascript">
videos = document.getElementsByTagName("video");

// Start at 1 to skip the main video (v1), which must stay visible.
for (i=1; i< videos.length; i++) {
  if (videos[i].videoTracks.getLanguage(0) == "asf") {
    videos[i].setAttribute("class", "active");
  } else {
    videos[i].setAttribute("class", "inactive");
  }
}
</script>

Example 5:

In this final example we look at what to do when we have an in-band multi-track resource with multiple video tracks that should all be displayed on screen. This is not a simple problem to solve because a video element is only allowed to display a single video track at a time. Therefore for this problem you need to use both approaches: in-band and synchronized resources.

We take an in-band multitrack resource with a main video and audio track and three sign language tracks in ASL, BSL and ASF. The second resource will be made up from the URI of the first resource with a media fragment address of the sign language tracks. (If required, these can be discovered using the getID() function on the first resource.) The markup will look as follows:

<video id="v1" poster="video.png" controls mediagroup="a11y_vid">
 <source src="video.ogv#track=v_main&track=a_main" type="video/ogg">
 <source src="video.mp4#track=v_main&track=a_main" type="video/mp4">
</video>

<video id="v2" poster="sign.png" controls mediagroup="a11y_vid">
 <source src="video.ogv#track=asl&track=bsl&track=asf" type="video/ogg">
 <source src="video.mp4#track=asl&track=bsl&track=asf" type="video/mp4">
</video>

Note that with multiple video elements you can always style them in the way that you want them displayed on screen. E.g. if you want a picture-in-picture display, you scale the second video down and absolutely position it on top of the first one in the appropriate location. You can even grab the second video into a canvas, chroma-key your sign language speaker on a green or blue screen and remove that background through some canvas processing before popping it on top of the video.
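
The canvas processing mentioned above could look roughly like this for a green screen. It is only a sketch: chromaKeyGreen is a hypothetical helper, and the "green dominates red and blue by more than a threshold" test is an illustrative choice rather than a tuned keying algorithm.

```javascript
// Makes green-dominant pixels fully transparent in an RGBA byte array such
// as the data attribute of a canvas ImageData object. Modifies it in place.
function chromaKeyGreen(rgba, threshold) {
  for (var i = 0; i < rgba.length; i += 4) {
    var r = rgba[i], g = rgba[i + 1], b = rgba[i + 2];
    if (g - Math.max(r, b) > threshold) {
      rgba[i + 3] = 0; // alpha channel: 0 means transparent
    }
  }
  return rgba;
}

// In a page, assuming a 2D canvas context ctx of size w x h and the sign
// language video element v2:
//   ctx.drawImage(v2, 0, 0, w, h);
//   var frame = ctx.getImageData(0, 0, w, h);
//   chromaKeyGreen(frame.data, 60);
//   ctx.putImageData(frame, 0, 0);
```

Running this once per displayed frame and positioning the canvas over the main video gives the overlay effect described above.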

The world is all yours!

HOWEVER: There is one big caveat on all these specs – while they have all found entry into the HTML5 specification, it would be expecting a bit much to have browser support already. :-)

by silvia at June 02, 2012 09:36 AM

David Schleef : GStreamer backend for video in Firefox


Good news to hear that the GStreamer backend for video playback in Firefox has landed, due to a flurry of work by Alessandro Decina in the last few months.  Of course, this isn’t part of the standard Firefox build (but maybe some day?), but it’s very useful for putting Firefox on mobile and embedded platforms, since GStreamer has a well-established ecosystem of vendor-provided plugins for hardware decoding.

by David Schleef at April 29, 2012 10:19 PM

David Schleef : OggStreamer: audio capture and streaming device


Recently learned about a cool new open hardware project called OggStreamer.  They’re designing and making a small device that records an analog audio signal and streams it using Ogg/Vorbis.  It’s an open hardware project, so all the schematics and PCB layouts are provided.

by David Schleef at April 06, 2012 12:30 AM

David Schleef : Update on the GStreamer DeckLink Elements


A little more than a year ago, I posted about GStreamer support for SDI and HD-SDI using DeckLink hardware from BlackMagic Design.  In the meantime, the decklinksrc and decklinksink elements have grown up a bit, and work with most devices in the DeckLink and Intensity line of hardware.  A laundry list of features:

  • Multiple device support
  • Multiple input and output support on a single device
  • HDMI, component analog, and composite input and output with Intensity Pro
  • Analog, AES/EBU, and embedded (HDMI/SDI) audio input
  • SDI, HD-SDI, and Optical SDI input and output with DeckLink
  • Works on Linux, OS/X (new), and Windows
  • 8-bit and 10-bit support for SDI/HD-SDI
  • Supports most video modes in the DeckLink SDK
  • Implements GstPropertyProbe interface for proper detection as a source element
  • Lots of bug fixes from previous releases

Kudos to Blake Tregre and Joshua Doe for submitting several of the patches implementing the above list.  There are still a bunch of outstanding bug reports (some with patches) that need to be fixed.  Several of these relate to output, which is currently rather clumsy and broken.

People have asked me about automatically detecting the video mode for input.  Some DeckLink hardware has this capability, but not any of the hardware I have to test with.  However, I’ve had some success with cycling through the video modes at the application level, with a 200 ms timeout between modes, stopping when it finds a mode that generates output.  This works ok, except that it tends to confuse 60i and 30p modes (and 50i with 25p), which can be differentiated with a bit of processing on the images.  At some point I’d like to integrate this functionality into decklinksrc, but wouldn’t be upset if someone else did it first.

by David Schleef at April 03, 2012 04:56 AM

David Schleef : HDTV Color Matrix


Digital video is a time series of pictures, and each picture is comprised of an array of pixels, and each pixel is comprised of three numbers representing how brightly the red, green, and blue LCD dots (or CRT phosphors, if you’re old school) glow.  The representation in memory, however, is not of RGB values, but of YCbCr values, which one calculates by multiplying a 3×3 matrix with the RGB values, and then adding/subtracting some offsets.  This converts the components into a gray value (Y, or luma) and Cb and Cr (chroma blue and chroma red). The reason for doing this is because the human visual system is more sensitive to variations in luma compared to variations in chroma (er, actually luminance and chrominance, see below).  Furthermore, for this reason, typically half or 3/4 of the chroma values are dropped and not stored — the missing ones are interpolated when converting back to RGB for display.

There are various theoretical reasons for choosing a particular matrix, and I’ve recently become interested if these reasons are actually valid.  For historical reasons, early digital video copied analog precedent and used a matrix that is theoretically suboptimal.  This matrix is used in standard definition (SD) video, but was changed to the theoretically correct matrix for high-definition (HD) video.  There are other technical differences between SD and HD video, but this is the most significant for color accuracy.

For some time, I’ve been curious how much of a visual difference there is between the two matrices.  Here are two stills from Big Buck Bunny, the first is the original, correct image, and the second is the same picture converted to YCbCr with the HDTV matrix and then back to RGB with the SDTV matrix.  (To best see the differences, open the images in separate browser tabs and flip between them.)

Big Buck Bunny frame 660, original

Big Buck Bunny frame 660, wrong matrix

If you are like me, you probably have trouble seeing the difference side by side, but flipping between them makes it fairly obvious.  I chose this image because it has relatively saturated green and greenish-yellow, which shows off some of the largest differences.
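
The same mismatch can be reproduced numerically. The sketch below is not the code used to make the images: it assumes the SD and HD matrices are the usual BT.601 and BT.709 ones, works on normalized values between 0 and 1, and leaves out the offsets and range scaling of real video. The function names are made up for illustration.

```javascript
// Luma coefficients for red and blue; green is whatever remains.
const BT601 = { kr: 0.299, kb: 0.114 };   // SD matrix (assumed)
const BT709 = { kr: 0.2126, kb: 0.0722 }; // HD matrix (assumed)

// Gamma-corrected RGB in [0,1] to luma plus two chroma differences.
function rgbToYCbCr([r, g, b], { kr, kb }) {
  const kg = 1 - kr - kb;
  const y = kr * r + kg * g + kb * b;
  return [y, (b - y) / (2 * (1 - kb)), (r - y) / (2 * (1 - kr))];
}

// Inverse of rgbToYCbCr for the given coefficients.
function yCbCrToRgb([y, cb, cr], { kr, kb }) {
  const kg = 1 - kr - kb;
  const r = y + 2 * (1 - kr) * cr;
  const b = y + 2 * (1 - kb) * cb;
  const g = (y - kr * r - kb * b) / kg;
  return [r, g, b];
}

// Saturated green encoded with the HD matrix, decoded with the SD matrix.
const wrongGreen = yCbCrToRgb(rgbToYCbCr([0, 1, 0], BT709), BT601);
console.log(wrongGreen);
```

Greys survive the round trip unchanged because their chroma is zero under both matrices, which is why the error shows up mostly in saturated areas such as the green leaves.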

The RGB values for the pixels that are used in computation are not proportional to the actual amount of power output by a monitor.  This is known as gamma correction, and is a clever byproduct of the fact that the response curve of television phosphors (the amount of light output for a given voltage) is approximately similar to the response curve of the eye (the perceived brightness based on the amount of light).  Thus voltage became synonymous with perceived brightness, televisions had fewer vacuum tubes, and we’re left with that legacy.  But it’s not a bad legacy, because just like dropping chroma values, it makes it easier to compress images.

However, color comes along and messes with that simplicity a bit.  Luminance in color theory is used to describe how the brain interprets the brightness of a particular pixel, which is proportional to the RGB values in linear light space, i.e., the amount of light emanating from a display.  Luma is proportional to the RGB values in gamma-corrected (actually, gamma-compressed) space.  This means that luma doesn’t simply depend on luminance, and contains some variation due to color.  This messes with our idea that matrixing RGB values will separate variations in brightness from variations in color.  How visible is it?  I took the above picture and squashed the luma to one value, leaving chroma values the same (HD matrix):

Big Buck Bunny, frame 660, luma squashed

What you see here is that saturated areas appear brighter than the grey areas.  This is chroma (i.e., the color values we use in calculations) feeding into luminance (i.e., the perception of brightness).

How much does this matter for image and video compression efficiency?  It’s a minor inefficiency of a subtle visual difference.  In other words, not very much.

Earlier I mentioned that the HD matrix was theoretically more correct than the SD matrix.  What about in practice?  Here’s the same luma-squashed image with the SD matrix.  Notice that there’s a lot more leakage from chroma into luminance, especially in the green leaves:

Big Buck Bunny, frame 660, chroma squashed with SD matrix

by David Schleef at March 24, 2012 03:07 AM

Ben Schwartz : Ethics in an unethical world: Ethics Offsets


The recent hubbub regarding the (admirably public) debate within Mozilla about codec support has set me thinking about how to deal with untenable situations. After rightly railing against H.264 on the web for several years, and pushing free codecs with the full thrust of the organization, Mozilla may now be approaching consensus that they cannot win, and that continued refusal to capitulate to the cartel is tantamount to organizational suicide.

So what can you do, when you find yourself compelled to do something that goes against your ethics? To make a choice that you feel is wrong on its own because it benefits you in other ways, a choice you would like to make only when really necessary and never otherwise? Any thinking person will have this problem, to greater and lesser degrees, throughout their lives. We are not martyrs, so we do what we have to do to survive and try to keep in mind our need to escape from the trap.

Organizations cannot simply keep something in mind, but they can adopt structures that remind their members of their values even when those values are compromised. A common structure of this type is the sin tax, a tax designed (in a democracy) by members of a state to help them break or prevent their own bad habits. Sin taxes work by countering the locally perceived benefit of some action that’s harmful in a larger way, by reminding us of less visible but still important negative considerations. Some of their effect is straightforwardly economic, but some is psychological, to help us remember the bigger picture.

Sin taxes are more or less involuntary, but when the government does not impose these reminders, we often choose to remind ourselves. One currently popular implementation of this concept is the Carbon offset, a payment typically made when burning fuel to counter the effect of global warming. Organizations that buy carbon offsets for their fuel consumption do so to send a message, both internally and externally, that they place real value on minimizing carbon emissions. They may send this message both explicitly (by publicizing the purchase) and implicitly (by its effect on internal and external economic incentives).

Carbon offsets may be in fashion this decade, but there are many older forms of this concept. Maybe the most quotidian is the Curse Jar*, traditionally a place in a home or small office where individuals may make a small payment when using discouraged vocabulary. The Curse Jar provides a disincentive to coarse language despite being strictly voluntary, and despite not purchasing any effect on the linguistic environment (although the coffee fund may help for some). The Curse Jar works simply by reminding group members which behaviors are accepted and which are not.

For Mozilla, the difficulty is not emissions, verbal or vaporous, but ethical behavior. How can Mozilla publicly commit to a standard of behavior while violating it? I humbly submit that the answer is to balance its karmic books, by introducing an Ethics Offset**. When Mozilla finds itself cornered, it may take the necessary unfortunate action … and introduce a proportionate positive action as a reminder about its real values.

In the case at hand, a reasonable Ethics Offset might look like an internal “tax” on all uses of patented codecs. For example, for every Boot2Gecko device that is sold, Mozilla could commit to an offset equal to double the amount spent on patent licenses for the device. The offset could be donated to relevant worthy causes, like organizations that oppose software patents or contribute to the development of patent-free multimedia … but the actual recipient matters much less than the commitment. By accumulating and periodically (and publicly) “losing” this money, Mozilla would remind us all about its commitment to freedom in the multimedia realm. A similar scheme may be appropriate for Firefox Mobile if it is also configured for H.264 support.

Without a reminder of this kind, Mozilla risks becoming dangerously complacent and complicit to the cartel-controlled multimedia monopolies. As long as H.264 support appears to serve Mozilla’s other goals, Mozilla’s commitment to multimedia freedom will remain uncomfortable, inconvenient, and tempting to forget. Greater organizations have slid down off their ethical peaks, on paths paved all along with good intentions.

Most companies would not even consider a public and persistent admission of compromise, but Mozilla is not most companies. Neither are the companies that produce free operating systems, and many other components of the free software ecosystem. None of them should be ashamed to admit when they are forced to compromise their values and support enterprises that, on ethical grounds, they despise … but they should make their position clear, by committing to an Ethics Offset until they can escape from the compromise entirely.

*: Why is there no Wikipedia entry for “Curse Jar”!?
**: Let’s not call it an indulgence.

by Ben at March 14, 2012 04:43 AM

Monty : Why 24-bit/192kHz music downloads make no sense


(by Monty and the Xiph.Org community)

Articles last month revealed that musician Neil Young and Apple's Steve Jobs discussed offering digital music downloads of 'uncompromised studio quality'. Much of the press and user commentary was particularly enthusiastic about the prospect of uncompressed 24 bit 192kHz downloads. 24/192 featured prominently in my own conversations with Mr. Young's group several months ago.

Unfortunately, there is no point to distributing music in 24-bit/192kHz format. Its playback fidelity is slightly inferior to 16/44.1 or 16/48, and it takes up 6 times the space.
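
The "6 times" figure follows directly from the raw PCM data rates. A small sketch, with a hypothetical helper name and stereo assumed for both formats:

```javascript
// Raw (uncompressed) PCM data rate in bits per second.
function pcmBitsPerSecond(bitDepth, sampleRateHz, channels) {
  return bitDepth * sampleRateHz * channels;
}

var hiRes = pcmBitsPerSecond(24, 192000, 2);
var ratioTo48 = hiRes / pcmBitsPerSecond(16, 48000, 2);  // exactly 6
var ratioTo441 = hiRes / pcmBitsPerSecond(16, 44100, 2); // a little over 6.5
console.log(ratioTo48, ratioTo441.toFixed(2));
```

The ratio does not depend on the channel count as long as both formats use the same one.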

If you just said 'Whaa?', you may want to read the whole article.

It's fairly long... but hearing, perception and fidelity are complicated topics. Shysters and charlatans exploit that nuance (and misunderstanding) to bilk unsuspecting consumers of their money, all the while convincing them they're paying for 'quality'.

Anyway, happy reading and comments welcome!

by Monty (monty@xiph.org) at March 05, 2012 06:29 PM

Jean-Marc Valin : A Pitch-Energy Quantizer for Codec2


During LCA 2012, I got to meet face-to-face (for only the second time) with David Rowe and discuss Codec2. This led to a hacking session where we figured out how to save about 10 bits on LSP quantization by using vector quantization (VQ). This may not sound like a lot, but for a 2 kb/s codec, 10 bits every 20 ms is 500 b/s, so one quarter of the bit-rate. That new code is now in David's hands and he's been doing a good job of tweaking it to get optimal quality/bitrate. This led me to look at the rest of the bits, which are taken mostly by the pitch frequency (between 50 Hz and 400 Hz) and the excitation energy (between -10 dB and 40 dB). The pitch is currently coded linearly (constant spacing in Hz) with 7 bits, while the energy is coded linearly in dB using 5 bits. That's a total of 12 bits for pitch and energy. Now, how can we improve that?
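
For readers who want to check the arithmetic, here is the conversion from bits per frame to bit-rate, assuming one 20 ms frame as in the text above. The helper name is made up.

```javascript
// Converts a per-frame bit budget into a bit-rate in bits per second.
function bitsPerFrameToRate(bitsPerFrame, frameMs) {
  return bitsPerFrame * 1000 / frameMs;
}

console.log(bitsPerFrameToRate(10, 20)); // LSP saving: 500 b/s
console.log(bitsPerFrameToRate(12, 20)); // pitch + energy: 600 b/s
console.log(bitsPerFrameToRate(10, 20) / 2000); // share of a 2 kb/s codec: 0.25
```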

The first assumption I make here is that David already checked that both pitch and energy are encoded at the "optimal" resolution that balances bitrate and coding artefacts. To reduce the rate, we need a smarter quantizer. Below is the distribution of the pitch and energy for my training database.



So what if we were to use vector quantization to reduce the bit-rate? In theory, we could reduce the rate (for equal error) by having more codevectors in areas where the figure above shows more data. Same error, lower rate, but still a bad idea. It would be bad because it would mean that for some people, whose pitch falls into the range that is less likely, codec2 wouldn't work well. It would also mean that just changing the audio gain could make codec2 do worse. That is clearly not acceptable. We need to not just care about the mean square error (MSE), but also about the outliers. We need to be able to encode any amplitude with increments of 1-2 dB and any pitch with an increment around 0.04-0.08 octave (between half a semitone and a semitone). So it looks like we're stuck and the best we could do is to have uniform VQ, which wouldn't save much compared to scalar quantization.

The key here is to relax our resolution constraint above. In practice, we only need such good resolution when the signal is stationary. For example, when the pitch in unvoiced frames jumps around randomly, it's not really important to encode it accurately. Similarly, energy errors are much more perceivable when the energy is stable than when it's fluctuating. So this is where prediction becomes very useful, because stationary signals are exactly the ones that are easily predicted. By using a simple first-order recursive predictor (prediction = alpha*previous_value), we can reduce the range for which we need good resolution by a factor (1-alpha). For example, if we have a signal that ranges from 0 to 100 and we want a resolution of 1, then using alpha=0.9, the prediction error (current_value-prediction) will have a range of 0 to 10 when the signal is stationary. We still need to have quantizer values outside that range to encode variations, but we don't need a good resolution.
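
Here is a small sketch of that predictor, to make the range reduction concrete. It is a simplification: it predicts from the previous unquantized value, whereas a real codec would have to predict from the previous decoded value, and the function name and the initial state of 0 are assumptions.

```javascript
// First-order recursive prediction: residual = current - alpha * previous.
function predictionResiduals(signal, alpha) {
  var residuals = [];
  var previous = 0; // assumed initial predictor state
  for (var i = 0; i < signal.length; i++) {
    residuals.push(signal[i] - alpha * previous);
    previous = signal[i];
  }
  return residuals;
}

// A stationary signal at the top of a 0..100 range, with alpha = 0.9:
// after the first sample, the residual sits at (1 - alpha) * 100 = 10.
console.log(predictionResiduals([100, 100, 100, 100], 0.9));

// A sudden jump still produces a large residual, which is why the
// quantizer keeps some coarse values outside the 0..10 range.
console.log(predictionResiduals([0, 0, 100], 0.9));
```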

Now that we have reduced the domain for which we need good resolution, we can actually start using vector quantization too. By combining prediction and vector quantization, it's possible to have a good enough quantizer using only 8 bits for both the energy and the pitch, saving 4 bits, so 200 b/s. The figure below illustrates how the quantizer is trained, with the distribution of the prediction residual (actual value minus prediction) in blue, and the distribution of the code vectors in red. The prediction coefficients are 0.8 for pitch and 0.9 for energy.



The first thing we notice from the residual distribution is that it's much less uniform and there are two higher-density areas that stand out. The first is around (0.3,0), which corresponds to the case where the pitch and energy are stationary and is about one fifth of the range for pitch (which has a prediction coefficient of 4/5) and one tenth of the range for energy (which has a prediction coefficient of 9/10). The second higher-density area is a line around residual energy of -2.5, and it corresponds to silence. Now looking at the codebook in red, we can see a very high density of vectors in the area of stationary speech, enough for a resolution of 1-2 dB energy and 1/2 to 1 semitone for pitch. The difference is that this time the high resolution is only needed for a much smaller range. Now, the reason we see such a high density of code vectors around stationary speech and not so much around the "silence line" is the last detail of this quantizer: weighting. The whole codebook training procedure uses weighting based on how important the quantization error is. The weight given to pitch and energy error on stationary voiced speech is much higher than it is for non-stationary speech or silence. This is why this quantizer is able to give good enough quality with 8 bits instead of 12.
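
To illustrate how weighting pulls code vectors towards the important data, here is one iteration of a weighted k-means (Lloyd) update. This is an assumption about the general technique, not the actual Codec2 training code: the post does not say which training algorithm is used, and weightedLloydStep and its per-vector weights are made up for illustration.

```javascript
// One weighted Lloyd iteration: assign every training vector to its nearest
// code vector, then move each code vector to the weighted mean of the
// vectors assigned to it. Returns the new codebook.
function weightedLloydStep(vectors, weights, codebook) {
  var dim = codebook[0].length;
  var sums = codebook.map(function () { return new Array(dim).fill(0); });
  var totals = codebook.map(function () { return 0; });

  vectors.forEach(function (v, n) {
    // Nearest code vector by squared Euclidean distance.
    var best = 0, bestDist = Infinity;
    codebook.forEach(function (c, k) {
      var d = 0;
      for (var j = 0; j < dim; j++) d += (v[j] - c[j]) * (v[j] - c[j]);
      if (d < bestDist) { bestDist = d; best = k; }
    });
    for (var j = 0; j < dim; j++) sums[best][j] += weights[n] * v[j];
    totals[best] += weights[n];
  });

  return codebook.map(function (c, k) {
    if (totals[k] === 0) return c.slice(); // unused code vector stays put
    return sums[k].map(function (s) { return s / totals[k]; });
  });
}
```

With a high weight on stationary voiced frames, the weighted means land close to those frames, so repeated iterations concentrate code vectors where errors matter most.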

March 05, 2012 02:43 PM

Silvia Pfeiffer : A systematic approach to making Web Applications accessible


With the latest developments in HTML5 and the still fairly new ARIA (Accessible Rich Internet Applications) attributes introduced by the W3C WAI (Web Accessibility Initiative), browsers have now implemented many features that allow you to make your JavaScript-heavy Web applications accessible.

When I began working on making a complex web application accessible just over a year ago, I discovered that there was no step-by-step guide to approaching the changes necessary for creating an accessible Web application. As a result, many people believe that it is still hard, if not impossible, to make Web applications accessible. In fact, it can be approached systematically, as this article will describe.

This post is based on a talk that Alice Boxhall and I gave at the recent Linux.conf.au titled “Developing accessible Web apps – how hard can it be?” (slides, video), which in turn was based on a Google Developer Day talk by Rachel Shearer (slides).

These talks, and this article, introduce a process that you can follow to make your Web applications accessible: each step will take you closer to having an application that can be accessed using a keyboard alone, and by users of screenreaders and other accessibility technology (AT).

The recommendations here only roughly conform to the requirements of WCAG (Web Content Accessibility Guidelines), which is the basis of legal accessibility requirements in many jurisdictions. The steps in this article may or may not be sufficient to meet a legal requirement. The article is focused on the practical outcome of ensuring users with disabilities can use your Web application.

Step-by-step Approach

The steps to follow to make your Web apps accessible are as follows:

  1. Use native HTML tags wherever possible
  2. Make interactive elements keyboard accessible
  3. Provide extra markup for AT (accessibility technology)

If you are a total newcomer to accessibility, I highly recommend installing a screenreader and just trying to read/navigate some Web pages. On Windows you can install the free NVDA screenreader, on Mac you can activate the pre-installed VoiceOver screenreader, on Linux you can use Orca, and if you just want a browser plugin for Chrome try installing ChromeVox.

1. Use native HTML tags

As you implement your Web application with interactive controls, try to use as many native HTML tags as possible.

HTML5 provides a rich set of elements which can be used to both add functionality and provide semantic context to your page. HTML4 already included many useful interactive controls, like <a>, <button>, <input> and <select>, and semantic landmark elements like <h1>. HTML5 adds richer <input> controls, and a more sophisticated set of semantic markup elements such as <time>, <progress>, <meter>, <nav>, <header>, <article> and <aside>. (Note: check browser support for the new tags.)

Using as much of the rich HTML5 markup as possible means that you get all of the accessibility features which have been implemented in the browser for those elements, such as keyboard support, short-cut keys and accessibility metadata, for free. For generic tags you have to implement them completely from scratch.

What exactly do you miss out on when you use a generic tag such as <div> over a specific semantic one such as <button>?

  1. Generic tags are not focusable. That means you cannot reach them using the [TAB] key on the keyboard.
  2. You cannot activate them with the space bar or enter key or perform any other keyboard interaction that would be regarded as typical with such a control.
  3. Since the role that the control represents is not specified in code but is only exposed through your custom visual styling, screenreaders cannot express to their users what type of control it is, e.g. button or link.
  4. Neither can screenreaders add the control to the list of controls on the page that are of a certain type, e.g. to navigate to all headers of a certain level on the page.
  5. And finally you need to manually style the element in order for it to look distinctive compared to other elements on the page; using a default control will allow the browser to provide the default style for the platform, which you can still override using CSS if you want.

Example:

Compare these two buttons. The first one is implemented using a <div> tag, the second one using a <button> tag. Try using a screenreader to experience the difference.

<style>
 .custombutton {
  cursor: pointer;
  border: 1px solid #000;
  background-color: #F6F6F6;
  display: inline-block;
  padding: 2px 5px;
}
</style>
<div class="custombutton" onclick="alert('sent!')">
  Send
</div>
<button onclick="alert('sent!')">
Send
</button>

2. Make interactive elements keyboard accessible

Many sophisticated web applications have some interactive controls that just have no appropriate HTML tag equivalent. In this case, you will have had to build an interactive element with JavaScript and <div> and/or <span> tags and lots of custom styling. The good news is, it’s possible to make even these custom controls accessible, and as a side benefit you will also make your application smoother to use for power users.

The first thing you can do to test usability of your control, or your Web app, is to unplug the mouse and try to use only the [TAB] and [ENTER] keys to interact with your application.


Try the following:

  • Can you reach all interactive elements with [TAB]?
  • Can you activate interactive elements with [ENTER] (or [SPACE])?
  • Are the elements in the right tab order?
  • After interaction: is the right element in focus?
  • Is there a keyboard shortcut that activates the element (accesskey)?

No? Let’s fix it.

2.1. Reaching interactive elements

If you have an element on your page that cannot be reached with [TAB], put a @tabindex attribute on it.

Example:

Here we have two <span> tags that work as links (don’t do this – it’s just a simple example). The first one cannot be reached using [TAB] but the second one has a tabindex and is thus part of the tab order of the HTML page.

(Note: since we experiment lots with the tabindex in this article, to avoid confusion, click on some text in this paragraph and then hit the [TAB] key to see where it goes next. The click will set your keyboard focus in the DOM.)


<style>
.customlink {
  text-decoration: underline;
  cursor: pointer;
}
</style>
<span class="customlink" onclick="alert('activated!')">
Click
</span>
<style>
.customlink {
  text-decoration: underline;
  cursor: pointer;
}
</style>
<span class="customlink" onclick="alert('activated!')" tabindex="0">
Click
</span>

You set @tabindex=0 to add an element into the native tab order of the page, which is the DOM order.

2.2. Activating interactive elements

Next, you typically want to be able to use the [ENTER] and [SPACE] keys to activate your custom control. To do so, you will need to implement an onkeydown event handler. Note that the keyCode for [ENTER] is 13 and for [SPACE] is 32.

Example:

Let’s add this functionality to the <span> tag from before. Try tabbing to it and hitting the [ENTER] or [SPACE] key.

<span class="customlink" onclick="alert('activated!')" tabindex="0">
Click
</span>
<span class="customlink" onclick="alert('activated!')" tabindex="0"
      onkeydown="handlekey(event);">
Click
</span>
<script>
function handlekey(event) {
  var target = event.target || event.srcElement;
  if (event.keyCode == 13 || event.keyCode == 32) {
    target.onclick();
  }
}
</script>
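The handler above relies on numeric keyCode values and does not stop [SPACE] from scrolling the page. As a hedged alternative, here is a minimal sketch that uses the event's `key` property instead. The function name `activateOnKey` is hypothetical, and the 'Spacebar' value is included on the assumption that some older browsers report it instead of ' '.

```javascript
// Hypothetical helper: run `activate` when [ENTER] or [SPACE] is pressed.
// Returns true if the key was handled, false otherwise.
function activateOnKey(event, activate) {
  const isActivationKey =
    event.key === 'Enter' || event.key === ' ' || event.key === 'Spacebar';
  if (!isActivationKey) {
    return false;
  }
  event.preventDefault(); // keep [SPACE] from scrolling the page
  activate(event);
  return true;
}

// Possible usage in a page (not executed here):
//   span.addEventListener('keydown', function (event) {
//     activateOnKey(event, function () { span.click(); });
//   });
```

Keeping the key test separate from the action makes it easy to reuse the same function for every custom control on the page.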

Note that some controls need support for keys other than [TAB] or [ENTER] to be usable from the keyboard alone. For example, a custom list box, menu or slider should respond to arrow keys.

2.3. Elements in the right tab order

Have you tried tabbing to all the elements on your page that you care about? If so, check if the order of tab stops seems right. The default order is given by the order in which interactive elements appear in the DOM. For example, if your page’s code has a right column that is coded before the main article, then the links in the right column will receive tab focus first before the links in the main article.

You could change this by re-ordering your DOM, but oftentimes this is not possible. So, instead give the elements that should be the first ones to receive tab focus a positive @tabindex. The tab access will start at the smallest positive @tabindex value. If multiple elements share the same @tabindex value, these controls receive tab focus in DOM order. After that, interactive elements and those with @tabindex=0 will receive tab focus in DOM order.
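To make these ordering rules concrete, here is a small sketch that computes the resulting tab order from a list of elements given in DOM order. The function `tabOrder` and the plain `{ id, tabindex }` objects are assumptions for illustration; the sketch also assumes every listed element is focusable, and it treats a missing tabindex as 0.

```javascript
// Hypothetical helper: compute tab order from elements listed in DOM order.
// Each element is a plain object: { id: string, tabindex?: number }.
function tabOrder(elements) {
  const indexed = elements.map((element, domIndex) => ({
    id: element.id,
    domIndex: domIndex,
    tabindex: element.tabindex === undefined ? 0 : element.tabindex,
  }));
  // Positive values come first, smallest first; ties keep DOM order.
  const positive = indexed
    .filter((entry) => entry.tabindex > 0)
    .sort((a, b) => a.tabindex - b.tabindex || a.domIndex - b.domIndex);
  // Then everything with tabindex 0 (or natively focusable), in DOM order.
  const natural = indexed.filter((entry) => entry.tabindex === 0);
  // Negative values are left out: reachable from script only.
  return positive.concat(natural).map((entry) => entry.id);
}

console.log(tabOrder([
  { id: 'firstname', tabindex: 10 },
  { id: 'address', tabindex: 30 },
  { id: 'lastname', tabindex: 20 },
  { id: 'city', tabindex: 40 },
]));
```

Running it on a page-like list also shows the side effect of positive values: every element without one is pushed behind all the elements that have one.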

Example:

The one thing that always annoys me the most is if the tab order in forms that I am supposed to fill in is illogical. Here is an example where the first and last name are separated by the address because they are in a table. We could fix it by moving to a <div> based layout, but let’s use @tabindex to demonstrate the change.

<table class="customtabs">
  <tr>
    <td>Firstname:
      <input type="text" id="firstname">
    </td>
    <td>Address:
      <input type="text" id="address">
    </td>
  </tr>
  <tr>
    <td>Lastname:
      <input type="text" id="lastname">
    </td>
    <td>City:
      <input type="text" id="city">
    </td>
  </tr>
</table>

<table class="customtabs">
  <tr>
    <td>Firstname:
      <input type="text" id="firstname" tabindex="10">
    </td>
    <td>Address:
      <input type="text" id="address" tabindex="30">
    </td>
  </tr>
  <tr>
    <td>Lastname:
      <input type="text" id="lastname" tabindex="20">
    </td>
    <td>City:
      <input type="text" id="city" tabindex="40">
    </td>
  </tr>
</table>

Be very careful with using non-zero tabindex values. Since they change the tab order on the page, you may get side effects that you might not have intended, such as having to give other elements on the page a non-zero tabindex value to avoid skipping too many other elements as I would need to do here.

2.4. Focus on the right element

Some of the controls that you create may be rather complex and open elements on the page that were previously hidden. This is particularly the case for drop-downs, pop-ups, and menus in general. Oftentimes the hidden element is not defined in the DOM right after the interactive control, such that a [TAB] will not put your keyboard focus on the next element that you are interacting with.

The solution is to manage your keyboard focus from JavaScript using the .focus() method.

Example:

Here is a menu that is declared ahead of the menu button. If you tab onto the button and hit enter, the menu is revealed. But your tab focus is still on the menu button, so your next [TAB] will take you somewhere else. We fix it by setting the focus on the first menu item after opening the menu.


<div id="custommenu" style="display:none;">
  <button id="item1" onclick="displayMenu('none');">Menu item1</button>
  <button id="item2" onclick="displayMenu('none');">Menu item2</button>
</div>
<button onclick="displayMenu('block');">Menu</button>
<script>
function displayMenu(value) {
 document.getElementById("custommenu").style.display=value;
}
</script>

<div id="custommenu" style="display:none;">
  <button id="item1" onclick="displayMenu('none');">Menu item1</button>
  <button id="item2" onclick="displayMenu('none');">Menu item2</button>
</div>
<button onclick="displayMenu('block');">Menu</button>
<script>
function displayMenu(value) {
 document.getElementById("custommenu").style.display=value;
 document.getElementById("item1").focus();
}
</script>

You will notice that there are still some things you can improve on here. For example, after you close the menu again with one of the menu items, the focus does not move back onto the menu button.
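One possible way to handle the return trip is to split opening and closing into two functions, each of which moves the focus explicitly. This is a minimal sketch; the names `openMenu` and `closeMenu` are hypothetical, and the arguments are assumed to be DOM elements (anything with a `style` object and a `focus()` method will do).

```javascript
// Hypothetical helpers for managing focus around a show/hide menu.
function openMenu(menu, firstItem) {
  menu.style.display = 'block';
  firstItem.focus(); // continue keyboard interaction inside the menu
}

function closeMenu(menu, button) {
  menu.style.display = 'none';
  button.focus(); // hand focus back to the control that opened the menu
}

// Possible usage in a page (not executed here):
//   var menu = document.getElementById('custommenu');
//   var button = document.getElementById('menubutton'); // assumed id
//   button.onclick = function () {
//     openMenu(menu, document.getElementById('item1'));
//   };
```

Returning focus to the button means the next [TAB] continues from where the user left off rather than from the top of the page.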

Also, after opening the menu, you may prefer not to move the focus onto the first menu item but rather just onto the menu <div>. You can do so by giving that div a @tabindex and then calling .focus() on it. If you do not want to make the div part of the normal tabbing order, just give it a @tabindex=-1 value. This will allow your div to receive focus from script while keeping it out of the normal tab order (though usually you just want to use @tabindex=0).

Bonus: If you want to help keyboard users even more, you can also put outlines on the element that is currently in focus using CSS’s outline property. If you want to avoid the outlines for mouse users, you can dynamically add a class that removes the outline in mouseover events but leaves it for :focus.

2.5. Provide sensible keyboard shortcuts

At this stage your application is actually keyboard accessible. Congratulations!

However, it’s still not very efficient. Like power users, screenreader users love keyboard shortcuts: imagine being forced to tab through an entire page, or navigate back to a menu tree at the top of the page, to reach each control you are interested in. And, obviously, anything which makes navigating the app via the keyboard more efficient for screenreader users will benefit all power users as well, like the ubiquitous keyboard shortcuts for cut, copy and paste.

HTML4 introduced so-called accesskeys for this. In HTML5 @accesskey is now allowed on all elements.

The @accesskey attribute takes the value of a keyboard key (e.g. @accesskey="x") and is activated through platform- and browser-specific activation keys. For example, on the Mac it’s generally the [Ctrl] key, in IE it’s the [Alt] key, in Firefox on Windows [Shift]-[Alt], and in Opera on Windows [Shift]-[ESC]. You press the activation key and the accesskey together, which either activates or focuses the element with the @accesskey attribute.

Example:


<button id="accessbutton" onclick="alert('sent!')" accesskey="e">
Send
</button>
<script>
  var button = document.getElementById('accessbutton');
  if (button.accessKeyLabel) {
    button.innerHTML += ' (' + button.accessKeyLabel + ')';
  }
</script>

Now, the idea behind this is clever, but the execution is pretty poor. Firstly, the different activation keys between different platforms and browsers make it really hard for people to get used to the accesskeys. Secondly, the key combinations can conflict with browser and screenreader shortcut keys, the first of which will render browser shortcuts unusable and the second will effectively remove the accesskeys.

In the end it is up to the Web application developer whether to use the accesskey attribute or whether to implement explicit shortcut keys for the application through key event handlers on the window object. In either case, make sure to provide a help list for your shortcut keys.
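If you go the route of explicit shortcut keys, one way to keep the key handling and the help list in sync is to drive both from a single table. The sketch below is only an illustration: `createShortcutHandler`, the shape of the shortcut entries and the example shortcut are all assumptions, not part of any standard API.

```javascript
// Hypothetical helper: build a keydown handler and a help list
// from one table of shortcuts.
// Each entry: { key: 'e', ctrl: false, alt: true, description, action }.
function createShortcutHandler(shortcuts) {
  function handle(event) {
    for (const shortcut of shortcuts) {
      const matches =
        event.key.toLowerCase() === shortcut.key &&
        !!event.ctrlKey === !!shortcut.ctrl &&
        !!event.altKey === !!shortcut.alt;
      if (matches) {
        event.preventDefault();
        shortcut.action(event);
        return true;
      }
    }
    return false;
  }

  function helpList() {
    return shortcuts.map((shortcut) =>
      (shortcut.ctrl ? 'Ctrl+' : '') +
      (shortcut.alt ? 'Alt+' : '') +
      shortcut.key.toUpperCase() + ': ' + shortcut.description);
  }

  return { handle: handle, helpList: helpList };
}

// Possible usage in a page (not executed here):
//   var shortcuts = createShortcutHandler([...]);
//   window.addEventListener('keydown', shortcuts.handle, false);
```

You still have to pick combinations that do not collide with browser or screenreader shortcuts; the sketch does nothing to detect such conflicts.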

Also note that a page with a really good hierarchical heading layout and use of ARIA landmarks can help to eliminate the need for accesskeys to jump around the page, since there are typically default navigations available in screen readers to jump directly to headings, hyperlinks, and ARIA landmarks.

3. Provide markup for AT

Having made the application keyboard accessible also has advantages for screenreaders, since they can now reach the controls individually and activate them. So, next we will use a screenreader and close our eyes to find out where we only provide visual cues to understand the necessary interaction.

Here are some of the issues to consider:

  • Role may need to get identified
  • States may need to be kept track of
  • Properties may need to be made explicit
  • Labels may need to be provided for elements

This is where the W3C’s ARIA (Accessible Rich Internet Applications) standard comes in. ARIA attributes provide semantic information to screen readers and other AT that is otherwise conveyed only visually.

Note that using ARIA does not automatically implement the standard widget behavior – you’ll still need to add focus management, keyboard navigation, and change aria attribute values in script.

3.1. ARIA roles

After implementing a custom interactive widget, you need to add a @role attribute to indicate what type of control it is, e.g. that it is playing the role of a standard tag such as a button.

Example:

This menu button is implemented as a <div>, but with a role of “button” it is announced as a button by a screenreader.

<div tabindex="0" role="button">Menu</div>

ARIA roles also describe composite controls that do not have a native HTML equivalent.

Example:

This menu is implemented as a set of <div> tags, with a role of “menu” on the container and a role of “menuitem” on each item.


<div role="menu">
  <div tabindex="0" role="menuitem">Cut</div>
  <div tabindex="0" role="menuitem">Copy</div>
  <div tabindex="0" role="menuitem">Paste</div>
</div>

3.2. ARIA states

Some interactive controls represent different states, e.g. a checkbox can be checked or unchecked, or a menu can be expanded or collapsed.

Example:

The following menu has states on the menu items, which are here not just used to give an aural indication through the screenreader, but also a visual one through CSS.


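<style>
/* Select on role and state, since the items below carry no class. */
[role="menuitemradio"][aria-checked="true"]:before {
   content:  "\2713 ";
}
</style>
<!-- aria-checked is defined for menuitemradio and menuitemcheckbox,
     not for plain menuitem -->
<div role="menu">
  <div tabindex="0" role="menuitemradio" aria-checked="true">Left</div>
  <div tabindex="0" role="menuitemradio" aria-checked="false">Center</div>
  <div tabindex="0" role="menuitemradio" aria-checked="false">Right</div>
</div>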
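The markup only describes the initial state, so script has to keep the attributes up to date as the user makes a choice. Here is a minimal sketch of that bookkeeping; the function name `setChecked` is hypothetical, and the items are assumed to be DOM elements (anything with a `setAttribute` method works).

```javascript
// Hypothetical helper: mark one item as checked and all others as unchecked.
function setChecked(items, selected) {
  items.forEach(function (item) {
    item.setAttribute('aria-checked', item === selected ? 'true' : 'false');
  });
}

// Possible usage in a page (not executed here):
//   var items = Array.prototype.slice.call(
//     document.querySelectorAll('[role="menu"] [aria-checked]'));
//   items.forEach(function (item) {
//     item.addEventListener('click', function () { setChecked(items, item); });
//   });
```

Because the CSS in the example keys off the same attribute, updating the ARIA state also updates the visual check mark, so the two can never drift apart.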

3.3. ARIA properties

Some of the functionality of interactive controls cannot be captured by the role attribute alone. We have ARIA properties to add features that the screenreader needs to announce, such as aria-label, aria-haspopup, aria-activedescendant, or aria-live.

Example:

The following drop-down menu uses aria-haspopup to tell the screenreader that there is a popup hidden behind the menu button together with an ARIA state of aria-expanded to track whether it’s open or closed.


<div class="custombutton" id="button" tabindex="0" role="button"
   aria-expanded="false" aria-haspopup="true">
    <span>Justify</span>
</div>
<div role="menu"  class="menu" id="menu" style="display: none;">
  <div tabindex="0" role="menuitemradio" class="menuitem" aria-checked="true">
    Left
  </div>
  <div tabindex="0" role="menuitemradio" class="menuitem" aria-checked="false">
    Center
  </div>
  <div tabindex="0" role="menuitemradio" class="menuitem" aria-checked="false">
    Right
  </div>
</div>
[CSS and JavaScript for example omitted]
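For readers who want a starting point for the omitted script, here is a rough sketch of the two pieces that matter most for AT: keeping aria-expanded in sync with the menu's visibility, and moving focus between items with the arrow keys. The function names are hypothetical, and the sketch assumes the browser reports the key values 'ArrowUp', 'ArrowDown' and 'Escape'.

```javascript
// Wrap-around index helpers for a list of `length` menu items.
function nextIndex(index, length) {
  return (index + 1) % length;
}
function prevIndex(index, length) {
  return (index - 1 + length) % length;
}

// Show or hide the menu and keep the button's ARIA state in sync.
function setExpanded(button, menu, expanded) {
  button.setAttribute('aria-expanded', expanded ? 'true' : 'false');
  menu.style.display = expanded ? 'block' : 'none';
}

// Handle one key press inside the open menu.
// Returns the index of the item that has focus afterwards.
function handleMenuKey(key, focused, items, close) {
  switch (key) {
    case 'ArrowUp':
      focused = prevIndex(focused, items.length);
      items[focused].focus();
      break;
    case 'ArrowDown':
      focused = nextIndex(focused, items.length);
      items[focused].focus();
      break;
    case 'Escape':
      close();
      break;
    default:
  }
  return focused;
}
```

A complete widget would also handle [ENTER] and [SPACE] on the items, update aria-checked on selection, and return focus to the button when the menu closes.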

3.4. Labelling

The main issue that people know about accessibility seems to be that they have to put alt text onto images. This is only one means to provide labels to screenreaders for page content. Labels are short informative pieces of text that provide a name to a control.

There are actually several ways of providing labels for controls:

  • on img elements use @alt
  • on input elements use the label element
  • use @aria-labelledby if there is another element that contains the label
  • use @title if you also want a label to be used as a tooltip
  • otherwise use @aria-label

I'll provide examples for the first two use cases - the other use cases are simple to deduce.

Example:

The following two images show the rough concept for providing alt text for images: images that provide information should be transcribed, images that are just decorative should receive an empty @alt attribute.

Image by Noah Sussman
<img src="lolcat.jpg"
alt="shocked lolcat titled 'HTML cannot do that!'">
<img src="texture.jpg" alt="">

When marking up decorative images with an empty @alt attribute, the image is actually completely removed from the accessibility tree and does not confuse the blind user. This is a desired effect, so do remember to mark up all your images with @alt attributes, even those that don't contain anything of interest to AT.

Example:

In the example form above in Section 2.3, when tabbing directly on the input elements, the screen reader will only say "edit text" without announcing what meaning that text has. That's not very useful. So let's introduce a label element for the input elements. We'll also add checkboxes with a label.












<label>Doctor title:</label>
  <input type="checkbox" id="doctor"/>
<label>Firstname:</label>
  <input type="text" id="firstname2"/>

<label for="lastname2">Lastname:</label>
  <input type="text" id="lastname2"/>

<label>Address:
  <input type="text" id="address2">
</label>
<label for="city2">City:
  <input type="text" id="city2">
</label>
<label for="remember">Remember me:</label>
  <input type="checkbox" id="remember">

In this example we use several different approaches to show what a difference it makes to use the <label> element to mark up input boxes.

The first two fields just have a <label> element next to an <input> element. When using a screenreader you will not notice a difference between this and not using the <label> element because there is no connection between the <label> and the <input> element.

In the third field we use the @for attribute to create that link. Now the input field isn't just announced as "edit text", but rather as "Lastname edit text", which is much more useful. Also, the screenreader can now skip the labels and go straight to the input element.

In the fourth and fifth fields we actually encapsulate the <input> element inside the <label> element, thus avoiding the need for a @for attribute, though it doesn't hurt to explicitly add it.

Finally we look at the checkbox. By including a referenced <label> element with the checkbox, we change the screenreader's announcement from just "checkbox not checked" to "Remember me checkbox not checked". Also notice that the click target now includes the label, making the checkbox more usable not only for screenreader users, but also for mouse users.
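A quick way to spot fields like the first two is to ask the browser which form controls have no label at all. The following is a minimal sketch; the function name `findUnlabelledInputs` is hypothetical, and it assumes the browser supports the `labels` property on form controls. It only checks that some label exists, not that the label text is meaningful.

```javascript
// Hypothetical helper: list form controls that have no label of any kind.
// `root` is a document or element (anything with querySelectorAll).
function findUnlabelledInputs(root) {
  const controls = Array.prototype.slice.call(
    root.querySelectorAll('input, select, textarea'));
  return controls.filter(function (control) {
    // `labels` covers both <label for="..."> and a wrapping <label>.
    const hasLabelElement = !!control.labels && control.labels.length > 0;
    const hasAriaLabel =
      control.hasAttribute('aria-label') ||
      control.hasAttribute('aria-labelledby');
    const hasTitle = control.hasAttribute('title');
    return !(hasLabelElement || hasAriaLabel || hasTitle);
  });
}

// Possible usage in a page (not executed here):
//   findUnlabelledInputs(document).forEach(function (control) {
//     console.log('No label for', control.id);
//   });
```

In the form above, this would be expected to flag the "doctor" and "firstname2" fields, since their <label> elements are not connected to them.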

4. Conclusions

This article introduced a process that you can follow to make your Web applications accessible. As you do that, you will notice that there are other things that you may need to do in order to give the best experience to a power user on a keyboard, a blind user using a screenreader, or a vision-impaired user using a screen magnifier. But once you've made a start, you will notice that it's not all black magic and a lot can be achieved with just a little markup.

You will find more markup in the WAI ARIA specification and many more resources at Mozilla's ARIA portal. Now go and change the world!

Many thanks to Alice Boxhall and Dominic Mazzoni for their proof-reading and suggested changes that really helped improve the article!

by silvia at February 22, 2012 06:31 AM

Silvia Pfeiffer : My crazy linux.conf.au week


In January I attended the annual Australian Linux and Open Source conference (LCA). But since I was sick all of January and had a lot to catch up on, I never got around to sharing all the talks that I gave during that time.

Drupal Down Under

It started with a talk at Drupal Down Under, which happened the weekend before LCA. I gave a talk titled “HTML5 video specifications” (video, slides).

<iframe allowfullscreen="allowfullscreen" frameborder="0" height="315" src="http://www.youtube.com/embed/-1hWHQBm4cE" width="420"></iframe>

I spoke about the video and audio elements in HTML5, how to provide fallback content, how to encode content, how to control them from JavaScript, and briefly about Drupal video modules, though the next presentation provided much more insight into those. I explained how to make the HTML5 media elements accessible, including accessible controls, captions, audio descriptions, and the new WebVTT file format. I ran out of time to introduce the last section of my slides, which is on WebRTC.

Linux.conf.au

On the first day of LCA I gave a talk both in the Multimedia Miniconf and the Browser Miniconf.

Browser Miniconf

In the Browser Miniconf I talked about “Web Standardisation – how browser vendors collaborate, or not” (slides). Maybe the most interesting part about this was that I tried out a new slide “deck” tool called impress.js. I’m not yet sure if I like it but it worked well for this talk, in which I explained how the HTML5 spec is authored and who has input.

I also sat on a panel of browser developers in the Browser Miniconf (more as a standards developer than as a browser developer, but that’s close enough). We were asked about all kinds of latest developments in HTML5, CSS3, and media standards in the browser.

Multimedia Miniconf

In the Multimedia Miniconf I gave a “HTML5 media accessibility update” (slides). I talked about the accessibility problems of Flash, how native HTML5 video players will be better, about accessible video controls, captions, navigation chapters, audio descriptions, and WebVTT. I also provided a demo of how to synchronize multiple video elements using a polyfill for the multitrack API.

I also provided an update on HTTP adaptive streaming APIs as a lightning talk in the Multimedia Miniconf. I used an extract of the Drupal conference slides for it.

Main conference

Finally, and most importantly, Alice Boxhall and I gave a talk in the main linux.conf.au titled “Developing Accessible Web Apps – how hard can it be?” (video, slides). I spoke about a process that you can follow to make your Web applications accessible. I’m writing a separate blog post to explain this in more detail. In her part, Alice dug below the surface of browsers to explain how the accessibility markup that Web developers provide is transformed into data structures that are handed to accessibility technologies.

<iframe allowfullscreen="allowfullscreen" frameborder="0" height="315" src="http://www.youtube.com/embed/sVZ3tJj8DxI" width="420"></iframe>

by silvia at February 09, 2012 08:48 PM

David Schleef : New Schrödinger Release


I recently added support for 10- and 16-bit encoding and decoding to Schrödinger, so I did a little release. Presenting Schrödinger-1.0.11. Also pushed changes to GStreamer to handle the new features. Although these changes have been in the works for some time, a little prompting from j-b caused me to finish this off, so this will probably appear in VLC soon, too.
This was the last piece needed to create a 10-bit master of Sintel, which I’ve been planning to do for some time.

by David Schleef at January 23, 2012 04:23 PM

Jean-Marc Valin : Back From LCA, Video Available


I just got back from linux.conf.au 2012 in Ballarat. The video for the talk I gave, Opus, the Swiss Army Knife of Audio Codecs, is now available on the Opus presentations page. For the Ogg-impaired, a lower-quality version is also available on YouTube.

For those who are into speech codecs, I also recommend watching David Rowe's presentation: Codec 2 - Open Source Speech Coding at 2400 bit/s and Below. His presentation was selected as one of the four best talks at LCA this year -- well worth watching.

January 22, 2012 06:25 AM

Monty : XiphQT components, MAC OS X and 64 bit iTunes


Camilla forwarded a necessary tip for installing the XiphQT components on a 64 bit Mac OS X so that it works with iTunes. This is a reasonably well known tip, but it wasn't in our FAQ or installation instructions (well it is now as of about ten minutes ago) so I'm passing it along now too...

I upgraded to Lion, and my ogg files stopped being able to play in iTunes (silently). Here's how to make it go:
  1. "show in finder" your iTunes binary (either navigate to the Applications folder, or right/control click on it in the dock, and choose "show in finder")
  2. right/control click on iTunes in the finder, and select "Get Info"
  3. Under General, check the box marked "Open in 32-bit mode"

You should put the above on something linked from: http://www.xiph.org/quicktime/download.html I paraphrased it from roaringapps.com.

If XiphQT can be rebuilt in 64 bit mode and shipped that way to Lion users, that would also be a good solution.

That last comment is actually a bit of an embarrassment for us at the moment; neither the XiphQT builds nor code have been updated since 2009 or so, despite multiple releases, fundamental improvements and new features in the Xiph codecs since. There are actually more recent beta builds of updated Mac OS X and Win32 XiphQT components that never got bumped to the official XiphQT download page, but even these builds are from mid 2009.

We don't have any high-powered Mac OS hackers in the core Xiph group at the moment. I have some relatively insignificant amount of experience coding for Mac OS X and Quicktime, but I've been hoping for a volunteer with more chops. Any takers?

by Monty (monty@xiph.org) at January 21, 2012 08:24 AM

Monty : Ghost Update: Demo 4


Turns out I missed blogging about the latest Ghost update... back in November...

Ghost Demo4 is up on the demo list showing the sinusoidal extractor doing some very early sinusoidal tracking frame to frame, and a very early example of the analysis performing real sinusoidal/non-sinusoidal audio splitting. Pictures and interactive listening, oh my!

It looks like I'll be putting a month or two into transOgg before getting back to Ghost work (and demo 5). The work that went into demo4 raised a number of questions I'm not sure how to approach answering yet, so I'm going to let that percolate for a bit.

by Monty (monty@xiph.org) at January 21, 2012 07:19 AM

Monty : Planet Xiph posts now featured on Xiph.Org front page


I made a quick change to the Xiph.Org front page that a few people have suggested now over the past few years.

The top few blog posts aggregated by Planet Xiph now appear as a five-item teaser list near the top of the Xiph.Org home page. The idea is both to get some more live content on the front page as well as to draw more attention to both the Planet and our developer community.

Comments and feedback welcome!

by Monty (monty@xiph.org) at January 19, 2012 02:33 PM

Jean-Marc Valin : Opus quality update


Those who have been following the Opus git repository in the past few weeks probably haven't noticed much work going on. The reason is pretty simple: most of the work has been going on elsewhere, in an experimental branch (named exp_wip3 for now) of my private repository. The reason it's in an experimental branch is that it's not fully converted to fixed-point and hasn't been tested on any frame size other than 20 ms. Here's an (incomplete) list of changes for now:

  • Really unconstrained VBR (not trying to keep the same average rate)
  • Tonality detection to give highly tonal audio a boost in bit-rate
  • (yet another) rewrite of the transient detection code
  • New dynamic allocation code that boosts the rate of bands that have significant spectral leakage caused by short blocks

Thanks to these changes, the quality has (as far as we can tell) gone up compared to the current master branch. I invite you to judge for yourself by comparing the audio coded with the current master branch with the audio coded with the new exp_wip3 experimental branch. This is 64 kb/s, so fairly low rate for stereo music. The original is here. Let me know what you think.

January 10, 2012 09:39 AM

Chris Pearce : Changes to DOM full-screen API in Firefox 11


We've made some changes to how the HTML full-screen API exits full-screen mode in Firefox 11, which is scheduled to ship in March 2012. Previously Document.mozCancelFullScreen() would fully-exit full-screen and return the browser to "normal" mode. Starting in Firefox 11, Document.mozCancelFullScreen() will restore full-screen state to the element that was previously full-screen. If there is no previous full-screen element in either the document or a parent document (full-screen mode isn't restored to former full-screen elements in child documents), then the browser will "fully-exit full-screen", and return the browser to normal mode.

To see how this is useful, consider the case of a PowerPoint clone or presentation web app that wants to run full-screen. One way to implement such a web app would be to have a full-screen <div> element where the slides are shown. The developer may want to be able to switch full-screen mode seamlessly between the slide deck <div> and (say) a <video>, and then return to having the slide deck <div> as the full-screen element so that the user can carry on with the presentation. Before this change, if the <video> was in a cross-origin subdocument (like a YouTube embedded player in an <iframe>) returning full-screen mode to the slide deck <div> from the <video> was a two-step process; users would have to fully-exit full-screen, and re-request full-screen mode on the slide deck element. Now developers can simply call Document.mozCancelFullScreen() and seamlessly switch back. The browser won't drop out of full-screen mode during the transition.

Note that if users press the escape key they will always fully-exit full-screen, i.e. Firefox won't restore the previous full-screen element to full-screen state on escape key press. So to seamlessly restore full-screen to the previous full-screen element, developers must explicitly call Document.mozCancelFullScreen(); they can't rely on the user pressing the escape key.
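As a rough illustration of the slide deck scenario above, here is a minimal sketch. The function names and the `video` and `doc` arguments are hypothetical; only mozRequestFullScreen() and mozCancelFullScreen() come from the API described in this post, and the sketch assumes the simple case where the deck and the video live in the same document.

```javascript
// Sketch only: the function names are made up for illustration.
// Both calls are the Mozilla-prefixed methods described in this post.

// Call from a user-generated event handler (e.g. a click) while the
// slide deck <div> is already the full-screen element.
function showVideoFullScreen(video) {
  // The video becomes the new full-screen element.
  video.mozRequestFullScreen();
}

// Call when the video should hand full-screen back to the slide deck.
// In Firefox 11 this restores the previous full-screen element instead
// of dropping back to normal browsing mode.
function returnToDeck(doc) {
  doc.mozCancelFullScreen();
}
```

In a page this might be wired up as `showVideoFullScreen(document.getElementById('talk_video'))` in a click handler and `returnToDeck(document)` when the video's "ended" event fires; the element id is made up for the example.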

We've also added webconsole logging upon full-screen request failures to Firefox 11, to make debugging denied full-screen requests easier.

Another change coming in Firefox 11 is we'll no longer deny full-screen requests in web pages which contain windowed plugins. Now we'll exit full-screen when a windowed plugin is focused instead (on Windows and Linux, MacOSX is unaffected).

by Chris Pearce (noreply@blogger.com) at December 19, 2011 08:48 PM

Ben Schwartz : Route9.js


I was really impressed by Michael Bebenita’s Broadway.js, the recent port of an H.264 decoder to pure Javascript using Emscripten, a LLVM-based C-to-JS converter … but of course this is the opposite of what we want! Who needs H.264? We want WebM!

I’ve spent the past few weekends digging into Broadway.js, stripping out the H.264 bits and replacing them with libvpx and libnestegg. Now it’s working, to a degree. You can see it for yourself at the demo page (so far tested only in Firefox 7…).

I’m not going to be able to take this much further … at least not right now. It’s been a fun exercise though. I invite all interested comers to read some more details and then fork the repo.

Take this thing, and make it your own.

Reactions: Hacker News, r/programming, and BadassJS, Twitter.

by Ben at November 30, 2011 05:18 AM

Chris Pearce : Firefox's HTML full-screen API enabled in Nightly builds


A few days ago I enabled the HTML full-screen API in Firefox nightly builds. This enables developers to make an arbitrary HTML element "full-screen", hiding the browser's UI and stretching the element to encompass the entire screen. This will be particularly useful for HTML5 video and games.

If all goes well, this feature will ship in Firefox 10 at the end of January.

The API has changed slightly since I last blogged about it. The current API is Mozilla-specific, but is similar to the W3C's Fullscreen draft specification.

To enter full-screen mode, call the following method on the HTML Element you'd like to enter full-screen:
  • void mozRequestFullScreen() : posts an asynchronous request to make the HTML element the full-screen element. If the request is granted, some time later a bubbling "mozfullscreenchange" event is dispatched to the element which requested full-screen. If the request is denied, a "mozfullscreenerror" event is dispatched to the element's owning document. We only grant requests for full-screen when:
    • mozRequestFullScreen() is called in a user-generated event handler, e.g. a mouse click handler, and
    • the requesting element is in its document, and
    • there are no windowed plugins present in any document/iframe in the current page, and
    • all iframes containing the requesting element (if any) have the mozallowfullscreen attribute.
We added the following method and attributes to HTML Document:
  • void mozCancelFullScreen() : exits the document from full-screen mode. This dispatches a "mozfullscreenchange" event to the document containing the (now former) full-screen element. Note that the "mozfullscreenchange" event which is dispatched when you enter full-screen is targeted at the full-screen element, so if you want to receive the "mozfullscreenchange" on both entering and exiting full-screen in the same listener you should add your listener to the document, rather than the full-screen element.
  • readonly attribute boolean mozFullScreen : true when the document is in full-screen mode.
  • readonly attribute Element mozFullScreenElement : reference to the current full-screen element.
  • readonly attribute boolean mozFullScreenEnabled : returns true if calls to mozRequestFullScreen() would be granted in the current document. This returns false if there are any windowed plugins present in any document/iframe in the current page, or if any iframes containing this document don't have the mozallowfullscreen attribute present, or if the user has disabled the API by preference. If this returns false you may want to not show the user your enter-full-screen button in your page, since you know it won't work!
We also added the :-moz-full-screen css pseudo class, which applies to the full-screen element while in full-screen mode.

We added the mozallowfullscreen attribute to iframe elements. Without this, full-screen requests made by script in the iframe's content (i.e embedded ads, or a YouTube player in an iframe for that matter) will be denied.
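The pieces above can be combined roughly as follows. This is a sketch, not a recommended pattern: `setUpFullScreenButton`, `button` and `target` are hypothetical names, and the listeners are attached to the document so that a single listener sees both entering and exiting full-screen, as suggested above.

```javascript
// Sketch only: setUpFullScreenButton, button and target are hypothetical.
// Returns false (and hides the button) when full-screen requests would
// be denied, true once the listeners are in place.
function setUpFullScreenButton(doc, button, target) {
  if (!doc.mozFullScreenEnabled) {
    // Requests would be denied, so don't offer the button at all.
    button.style.display = "none";
    return false;
  }

  // Requests are only granted inside user-generated event handlers.
  button.addEventListener("click", function () {
    target.mozRequestFullScreen();
  }, false);

  // Listening on the document catches the event on both entry
  // (it bubbles up from the element) and exit.
  doc.addEventListener("mozfullscreenchange", function () {
    button.disabled = doc.mozFullScreen;
  }, false);

  // Denied requests are reported to the element's owning document.
  doc.addEventListener("mozfullscreenerror", function () {
    console.log("Full-screen request was denied");
  }, false);

  return true;
}
```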

While in full-screen mode, the user can press the ESC key (or F11) to exit. Alpha-numeric keyboard input while in full-screen mode causes a warning message to pop up to guard against phishing attacks. The only keys which don't cause the warning message to pop up are: left, right, up, down, space, shift, control, alt, page up, page down, end, home, tab, and meta.

Navigating, changing tab, changing app (ALT+TAB) while in full-screen mode will cause full-screen mode to exit.

Here's a simple example, which will work in current Firefox nightly builds:

<video controls="controls" height="180" id="bruce_video" poster="http://people.mozilla.org/~cpearce/bruce-poster.jpg" preload="metadata" src="http://people.mozilla.org/%7Ecpearce/bruce.webm" width="320"></video>


(Press ESC to exit full-screen)

The code for that button's onclick handler is simply:
document.getElementById('bruce_video').mozRequestFullScreen();

How is Firefox's full-screen API different from Webkit/Chrome/Safari's full-screen API? Firefox's API adds a "width: 100%; height: 100%;" CSS rule to the element which requests full-screen, so that it's stretched to occupy the entire screen. Chrome's API does not do this, but instead it centers the full-screen element in the window and blacks-out the underlying webpage. So the full-screen element won't occupy the entire screen with Chrome's API unless you specify a "width: 100%; height: 100%;" rule yourself. Conversely if you want to vertically and horizontally center something while in full-screen with Firefox's API, you need to make the containing element of your desired centered element full-screen instead, and apply CSS rules to vertically and horizontally center the contained element.
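One way to paper over that difference is sketched below. The helper name is hypothetical, and webkitRequestFullScreen() is my assumption for the name of the WebKit-prefixed counterpart; it is not described in this post, so check it against the browser you target.

```javascript
// Sketch only. requestFullScreenAnyBrowser is a hypothetical helper, and
// webkitRequestFullScreen is an assumed name for WebKit's prefixed method.
// Returns which prefix was used, or null if neither method exists.
function requestFullScreenAnyBrowser(element) {
  if (typeof element.mozRequestFullScreen === "function") {
    // Firefox stretches the element to fill the screen by itself.
    element.mozRequestFullScreen();
    return "moz";
  }
  if (typeof element.webkitRequestFullScreen === "function") {
    // Chrome centers the element instead, so stretch it explicitly.
    // A real page would more likely use a CSS rule that only applies
    // while full-screen, since these inline styles persist after exit.
    element.style.width = "100%";
    element.style.height = "100%";
    element.webkitRequestFullScreen();
    return "webkit";
  }
  return null;
}
```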

For a cross-browser full-screen API example, see html5-demos.appspot.com's full-screen demo.

Edit: 11 Nov 2011, clarified Document.mozCancelFullScreen() and Document.mozFullScreenEnabled, fixed typos.

by Chris Pearce (noreply@blogger.com) at November 10, 2011 11:01 PM

Chris Pearce : Mozilla full-screen API progress update


Update 10 November 2011: the full-screen API has been changed slightly and enabled in Firefox Nightly builds, see http://blog.pearce.org.nz/2011/11/firefoxs-html-full-screen-api-enabled.html for details.

I've been working on implementing Robert O'Callahan's HTML full-screen API proposal in Firefox (bug 545812). Support for the base API has landed, disabled by default, in Firefox nightly builds. To enable the full-screen API, set the pref full-screen-api.enabled to true.

We have implemented a general purpose full-screen API which can make any HTML element the full-screen element (it seems WebKit based browsers' full-screen API allows only making <video> elements full-screen).

This feature makes the following API changes to HTML Element:
  1. void mozRequestFullScreen() : makes an HTML element the full-screen element. Causes browser chrome to hide, and expands the element to encompass the entire screen. Upon success, this dispatches a "mozfullscreenchange" event to the requesting full-screen element, or the element's owner document if the element is not in a document. We only grant requests for full-screen when running in user-generated event handlers, e.g. a mouse click handler.
This feature makes the following API changes to HTML Document:
  1. void mozCancelFullScreen() : exits the document from full-screen mode.
  2. readonly attribute mozFullScreen : true when the document is in full-screen mode.
  3. readonly attribute mozFullScreenElement : reference to the current full-screen element, if it's in the current document.
This feature adds the :-moz-full-screen css pseudo class, which applies to the full-screen element while in full-screen mode.

For a request for full-screen to be granted in content inside an iframe, the containing iframe needs to have the mozallowfullscreen attribute present. This is a boolean attribute, so the attribute only needs to be present, it doesn't matter what value it's set to.
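Because only the presence of the attribute matters, setting it from script can be as small as the sketch below. The helper name is hypothetical; setAttribute() and hasAttribute() are the standard DOM element methods.

```javascript
// Sketch only: allowFullScreenInFrame is a hypothetical helper.
// mozallowfullscreen is a boolean attribute, so an empty value is enough.
function allowFullScreenInFrame(iframe) {
  iframe.setAttribute("mozallowfullscreen", "");
  return iframe.hasAttribute("mozallowfullscreen");
}
```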

Keyboard input is restricted in full-screen mode. When alpha-numeric key input occurs in full-screen mode, full-screen mode immediately exits. This is to help protect against phishing attacks.

We also plan to deny requests for full-screen mode when windowed plugins are present (since we can't easily monitor key events to windowed plugins on non-MacOSX platforms). We will exit full-screen mode when a windowed plugin is added to a document as well. I have a patch for this, but its dependencies haven't landed yet.

Work remaining to be done before this can be enabled:
  1. Adding a warning message when we enter DOM full-screen mode (on desktop Firefox, and on Fennec too).
  2. Making the full-screen API work in multi-process Firefox/Fennec (bug 684620). This requires a way of getting the PBrowserParent from C++ in the chrome process to be implemented, there's not a way to do that yet unfortunately.
  3. Make change/open tab cause full-screen mode to exit (bug 685402).
  4. A security review must be completed, and concerns raised there must be addressed. This could involve changing the API.
We also want a clearer transition effect when entering full-screen, to somehow show the full-screen element "stretching out" to encompass the screen.

You can test out our work-in-progress full-screen implementation, by grabbing the latest Firefox nightly build, setting the pref full-screen-api.enabled to true, and pointing your browser at my not-very-exciting full-screen API demo page.

by Chris Pearce (noreply@blogger.com) at November 09, 2011 10:43 PM

Silvia Pfeiffer : Open Media Developers Track at OVC 2011


The Open Video Conference that took place on 10-12 September was so overwhelming, I’ve still not been able to catch my breath! It was a dense three days for me, even though I only focused on the technology sessions of the conference and utterly missed out on all the policy and content discussions.

Roughly 60 people participated in the Open Media Software (OMS) developers track. This was an amazing group of people capable and willing to shape the future of video technology on the Web:

  • HTML5 video developers from Apple, Google, Opera, and Mozilla (though we missed the NZ folks),
  • codec developers from WebM, Xiph, and MPEG,
  • Web video developers from YouTube, JWPlayer, Kaltura, VideoJS, PopcornJS, etc.,
  • content publishers from Wikipedia, Internet Archive, YouTube, Netflix, etc.,
  • open source tool developers from FFmpeg, gstreamer, flumotion, VideoLAN, PiTiVi, etc,
  • and many more.

To provide a summary of all the discussions would be impossible, so I just want to share the key take-aways that I had from the main sessions.

WebRTC: Realtime Communications and HTML5

Tim Terriberry (Mozilla), Serge Lachapelle (Google) and Ethan Hugg (CISCO) moderated this session together (slides). There are activities both at the W3C and at IETF – the ones at IETF are supposed to focus on protocols, while the W3C ones on HTML5 extensions.

The current proposal of a PeerConnection API has been implemented in WebKit/Chrome as open source. It is expected that Firefox will have an add-on by Q1 next year. It enables video conferencing, including media capture, media encoding, signal processing (echo cancellation etc), secure transmission, and a data stream exchange.

Current discussions are around the signalling protocol and whether SIP needs to be required by the standard. Further, the codec question is under discussion with a question whether to mandate VP8 and Opus, since transcoding gateways are not desirable. Another question is how to measure the quality of the connection and how to report errors so as to allow adaptation.

What always amazes me around RTC is the sheer number of specialised protocols that seem to be required to implement this. WebRTC does not disappoint: in fact, the question was asked whether there could be a lighter alternative than to re-use dozens of years of protocol development – is it over-engineered? Can desktop players connect to a WebRTC session?

We are already in a second or third revision of this part of the HTML5 specification and yet it seems the requirements are still being collected. I’m quietly confident that everything is done to make the lives of the Web developer easier, but it sure looks like a huge task.

The Missing Link: Flash to HTML5

Zohar Babin (Kaltura) and myself moderated this session and I must admit that this session was the biggest eye-opener for me amongst all the sessions. There was a large number of Flash developers present in the room and that was great, because sometimes we just don’t listen enough to lessons learnt in the past.

This session gave me one of those aha-moments: in the form of the Flash appendBytes() API function.

The appendBytes() function allows a Flash developer to take a byteArray out of a connected video resource and do something with it – such as feed it to a video for display. When I heard that Web developers want that functionality for JavaScript and the video element, too, I instinctively rejected the idea wondering why on earth would a Web developer want to touch encoded video bytes – why not leave that to the browser.

But as it turns out, this is actually a really powerful enabler of functionality. For example, you can use it to:

  • display mid-roll video ads as part of the same video element,
  • sequence playlists of videos into the same video element,
  • implement DVR functionality (high-speed seeking),
  • do mash-ups,
  • do video editing,
  • adaptive streaming.

This totally blew my mind and I am now completely supportive of having such a function in HTML5. Together with media fragment URIs you could even leave all the header download management for resources to the Web browser and just request time ranges from a video through an appendBytes() function. This would be easier on the Web developer than having to deal with byte ranges and making sure that appropriate decoding pipelines are set up.

Standards for Video Accessibility

Philip Jagenstedt (Opera) and myself moderated this session. We focused on the HTML5 track element and the WebVTT file format. Many issues were identified that will still require work.

One particular topic was to find a standard means of rendering the UI for caption, subtitle, and description selection. For example, what icons should be used to indicate that subtitles or captions are available. While this is not part of the HTML5 specification, it’s still important to get this right across browsers since otherwise users will get confused with diverging interfaces.

Chaptering was discussed and a particular need to allow URLs to directly point at chapters was expressed. I suggested the use of named Media Fragment URLs.
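To make the idea concrete, here is a small sketch of how such URLs might be built. The helper names and file names are made up, and the `#t=` and `#id=` forms reflect my reading of the Media Fragments URI syntax rather than anything shown at the session.

```javascript
// Sketch only: hypothetical helpers for building media fragment URLs.

// Temporal fragment: a time range in seconds.
function timeFragment(url, startSeconds, endSeconds) {
  return url + "#t=" + startSeconds + "," + endSeconds;
}

// Named fragment: points at a chapter by name.
function namedFragment(url, name) {
  return url + "#id=" + encodeURIComponent(name);
}
```

For example, `timeFragment("talk.webm", 10, 20)` gives `talk.webm#t=10,20` and `namedFragment("talk.webm", "chapter 2")` gives `talk.webm#id=chapter%202`.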

The use of WebVTT for descriptions for the blind was also discussed. A suggestion was made to use the voice tag <v> to allow for “styling” (i.e. selection) of the screen reader voice.

Finally, multitrack audio or video resources were also discussed and the @mediagroup attribute was explained. A question about how to identify the language used in different alternative dubs was asked. This is an issue because @srclang is not on audio or video, only on text, so it’s a missing feature for the multitrack API.

Beyond this session, there was also a breakout session on WebVTT and the track element. As a consequence, a number of bugs were registered in the W3C bug tracker.

WebM: Testing, Metrics and New features

This session was moderated by John Luther and John Koleszar, both of the WebM Project. They started off with a presentation on current work on WebM, which includes quality testing and improvements, and encoder speed improvement. Then they moved on to questions about how to involve the community more.

The community criticised that communication of what is happening around WebM is very scarce. More sharing of information was requested, including a move to using open Google+ hangouts instead of Google internal video conferences. More use of the public bug tracker can also help include the community better.

Another pain point of the community was that code is introduced and removed without much feedback. It was requested to introduce a peer review process. Also it was requested that example code snippets are published when new features are announced so others can replicate the claims.

This all indicates to me that the WebM project is increasingly more open, but that there is still a lot to learn.

Standards for HTTP Adaptive Streaming

This session was moderated by Frank Galligan and Aaron Colwell (Google), and Mark Watson (Netflix).

Mark started off by giving us an introduction to MPEG DASH, the MPEG file format for HTTP adaptive streaming. MPEG has just finalized the format and he was able to show us some examples. DASH is XML-based and thus rather verbose. It is covering all eventualities of what parameters could be switched during transmissions, which makes it very broad. These include trick modes e.g. for fast forwarding, 3D, multi-view and multitrack content.

MPEG have defined profiles – one for live streaming which requires chunking of the files on the server, and one for on-demand which requires keyframe alignment of the files. There are clear specifications for how to do these with MPEG. Such profiles would need to be created for WebM and Ogg Theora, too, to make DASH universally applicable.

Further, the Web case needs a more restrictive adaptation approach, since the video element’s API is already accounting for some of the features that DASH provides for desktop applications. So, a Web-specific profile of DASH would be required.

Then Aaron introduced us to the MediaSource API and in particular the webkitSourceAppend() extension that he has been experimenting with. It is essentially an implementation of the appendBytes() function of Flash, which the Web developers had been asking for just a few sessions earlier. This was likely the biggest announcement of OVC, alas a quiet and technically-focused one.

Aaron explained that he had been trying to find a way to implement HTTP adaptive streaming into WebKit in a way in which it could be standardised. While doing so, he also came across other requirements around such chunked video handling, in particular around dynamic ad insertion, live streaming, DVR functionality (fast forward), constrained video editing, and mashups. While trying to sort out all these requirements, it became clear that it would be very difficult to implement strategies for stream switching, buffering and delivery of video chunks into the browser when so many different and likely contradictory requirements exist. Also, once an approach is implemented and specified for the browser, it becomes very difficult to innovate on it.

Instead, the easiest way to solve it right now and learn about what would be necessary to implement into the browser would be to actually allow Web developers to queue up a chunk of encoded video into a video element for decoding and display. Thus, the webkitSourceAppend() function was born (specification).

The proposed extension to the HTMLMediaElement is as follows:

partial interface HTMLMediaElement {
  // URL passed to src attribute to enable the media source logic.
  readonly attribute [URL] DOMString webkitMediaSourceURL;

  bool webkitSourceAppend(in Uint8Array data);

  // end of stream status codes.
  const unsigned short EOS_NO_ERROR = 0;
  const unsigned short EOS_NETWORK_ERR = 1;
  const unsigned short EOS_DECODE_ERR = 2;

  void webkitSourceEndOfStream(in unsigned short status);

  // states
  const unsigned short SOURCE_CLOSED = 0;
  const unsigned short SOURCE_OPEN = 1;
  const unsigned short SOURCE_ENDED = 2;

  readonly attribute unsigned short webkitSourceState;
};

The code is already checked into WebKit, but commented out behind a command-line compiler flag.

Frank then stepped forward to show how webkitSourceAppend() can be used to implement HTTP adaptive streaming. His example uses WebM – there are no examples with MPEG or Ogg yet.

The chunks that Frank’s demo used were 150 video frames long (6.25s) and 5s long audio. Stream switching only switched video, since audio data is much lower bandwidth and more important to retain at high quality. Switching was done on multiplexed files.

Every chunk requires an XHR range request – this could be optimised if the connections were kept open per adaptation. Seeking works, too, but since decoding requires download of a whole chunk, seeking latency is determined by the time it takes to download and decode that chunk.
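The fetch-and-append loop described above might look roughly like this. It is my own sketch, not Frank's demo code: the helper names and the shape of the `chunks` array are hypothetical, error handling is left out, and only webkitMediaSourceURL, webkitSourceAppend(), webkitSourceEndOfStream() and EOS_NO_ERROR are taken from the proposed interface.

```javascript
// Sketch only: helper names and the chunk list format are hypothetical.

// Point the video at the special URL that enables the media source logic.
function enableMediaSource(video) {
  video.src = video.webkitMediaSourceURL;
}

// Fetch one byte range with an XHR range request (browser only).
function xhrFetchRange(url, start, end, onData) {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", url, true);
  xhr.setRequestHeader("Range", "bytes=" + start + "-" + end);
  xhr.responseType = "arraybuffer";
  xhr.onload = function () {
    onData(new Uint8Array(xhr.response));
  };
  xhr.send();
}

// Fetch and append chunks one after another, then signal end of stream.
// fetchRange is passed in so that it can be swapped out (e.g. for tests).
function appendChunks(video, chunks, fetchRange, onDone) {
  var index = 0;
  function next() {
    if (index === chunks.length) {
      video.webkitSourceEndOfStream(video.EOS_NO_ERROR);
      if (onDone) { onDone(); }
      return;
    }
    var chunk = chunks[index++];
    fetchRange(chunk.url, chunk.start, chunk.end, function (bytes) {
      video.webkitSourceAppend(bytes);
      next();
    });
  }
  next();
}
```

A page would call `enableMediaSource(video)` first and then something like `appendChunks(video, chunkList, xhrFetchRange)`, where each entry in `chunkList` carries a URL and a byte range. Stream switching would amount to choosing a different URL for the next chunk.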

Similar to DASH, when using this approach for live streaming, the server has to produce one file per chunk, since byte range requests are not possible on a continuously growing file.

Frank did not use DASH as the manifest format for his HTTP adaptive streaming demo, but instead used a hacked-up custom XML format. It would be possible to use JSON or any other format, too.
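For illustration, a JSON manifest and the stream choice based on it could be as simple as the sketch below. The manifest layout, the field names and the file names are all invented for this example; they are not the format used in the demo.

```javascript
// Sketch only: this manifest layout is invented for illustration.
var exampleManifestText = JSON.stringify({
  streams: [
    { url: "video_250k.webm", bitsPerSecond: 250000 },
    { url: "video_500k.webm", bitsPerSecond: 500000 },
    { url: "video_1000k.webm", bitsPerSecond: 1000000 }
  ]
});

// Pick the highest-rate stream that fits the measured bandwidth,
// falling back to the lowest-rate stream if none fits.
function pickStream(manifest, measuredBitsPerSecond) {
  var sorted = manifest.streams.slice().sort(function (a, b) {
    return a.bitsPerSecond - b.bitsPerSecond;
  });
  var choice = sorted[0];
  for (var i = 1; i < sorted.length; i++) {
    if (sorted[i].bitsPerSecond <= measuredBitsPerSecond) {
      choice = sorted[i];
    }
  }
  return choice;
}
```

With the manifest above, `pickStream(JSON.parse(exampleManifestText), 600000)` selects the 500 kb/s stream.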

After this session, I was actually completely blown away by the possibilities that such a simple API extension allows. If I wasn’t sold on the idea of an appendBytes() function in the earlier session, this one completely changed my mind. While I still believe we need to standardise a HTTP adaptive streaming file format that all browsers will support for all codecs, and I still believe that a native implementation for support of such a file format is necessary, I also believe that this approach of webkitSourceAppend() is what HTML needs – and maybe it needs it faster than native HTTP adaptive streaming support.

Standards for Browser Video Playback Metrics

This session was moderated by Zachary Ozer and Pablo Schklowsky (JWPlayer). Their motivation for the topic was, in fact, also HTTP adaptive streaming. Once you leave the decisions about when to do stream switching to JavaScript (through a function such as webkitSourceAppend()), you have to expose stream metrics to the JS developer so they can make informed decisions. The other use case is, of course, monitoring of the quality of video delivery for reporting to the provider, who may then decide to change their delivery environment.

The discussion found that we really care about metrics on three different levels:

  • measuring the network performance (bandwidth)
  • measuring the decoding pipeline performance
  • measuring the display quality

In the end, it seemed that work previously done by Steve Lacey on a proposal for video metrics was generally acceptable, except for the playbackJitter metric, which may be too aggregate to mean much.

Device Inputs / A/V in the Browser

I didn’t actually attend this session held by Anant Narayanan (Mozilla), but from what I heard, the discussion focused on how to manage permission of access to video camera, microphone and screen, e.g. when multiple applications (tabs) want access or when the same site wants access in a different session. This may apply to real-time communication with screen sharing, but also to photo sharing, video upload, or canvas access to devices e.g. for time lapse photography.

Open Video Editors

This was another session that I wasn’t able to attend, but I believe the creation of good open source video editing software and similar video creation software is really crucial to giving video a broader user appeal.

Jeff Fortin (PiTiVi) moderated this session and I was fascinated to later see his analysis of the lifecycle of open source video editors. It is shocking to see how many people/projects have tried to create an open source video editor and how many have stopped their project. It is likely that the creation of a video editor is such a complex challenge that it requires a larger and more committed open source project – single people will just run out of steam too quickly. This may be comparable to the creation of a Web browser (see the size of the Mozilla project) or a text processing system (see the size of the OpenOffice project).

Jeff also mentioned the need to create open video editor standards around playlist file formats etc. Possibly the Open Video Alliance could help. In any case, something has to be done in this space – maybe this would be a good topic to focus next year’s OVC on?

Monday’s Breakout Groups

The conference ended officially on Sunday night, but we had a third day of discussions / hackday at the wonderful New York Law School venue. We had collected issues of interest during the two previous days and organised the breakout groups in the morning (Schedule).

In the Content Protection/DRM session, Mark Watson from Netflix explained how their API works and that they believe that all we need in browsers is a secure way to exchange keys and an indicator of which protection scheme is used – the actual protection scheme would not be implemented by the browser, but be provided by the underlying system (media framework/operating system). I think that until somebody actually implements something in a browser fork and shows how this can be done, we won’t have much progress. In my understanding, we may also need to disable part of the video API for encrypted content, because otherwise you can always e.g. grab frames from the video element into canvas and save them from there.
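To show why the canvas route matters here, this is roughly all it takes to copy the current frame out of a video element today. The function name is hypothetical; drawImage() and toDataURL() are the standard canvas calls, and the browser refuses toDataURL() when the video comes from another origin without permission.

```javascript
// Sketch only: grabFrame is a hypothetical helper.
// Copies the currently displayed video frame into a canvas and returns
// it as a PNG data URL.
function grabFrame(video, canvas) {
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  var context = canvas.getContext("2d");
  context.drawImage(video, 0, 0);
  return canvas.toDataURL("image/png");
}
```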

In the Playlists and Gapless Playback session, there was massive brainstorming about what new cool things can be done with the video element in browsers if playback between snippets can be made seamless. Further discussions were about standard playlist file formats (such as XSPF, MRSS or M3U), media fragment URIs in playlists for mashups, and the need to expose track metadata for HTML5 media elements.

What more can I say? It was an amazing three days and the complexity of problems that we’re dealing with is a tribute to how far HTML5 and open video has already come and exciting news for the kind of applications that will be possible (both professional and community) once we’ve solved the problems of today. It will be exciting to see what progress we will have made by next year’s conference.

Thanks go to Google for sponsoring my trip to OVC.

UPDATE: We actually have a mailing list for open media developers who are interested in these and similar topics – do join at http://lists.annodex.net/cgi-bin/mailman/listinfo/foms.

by silvia at October 11, 2011 04:12 AM

Silvia Pfeiffer : WebVTT at W3C


Today we started a community group (CG) at the W3C for “Web Media Text Tracks”: http://www.w3.org/community/texttracks/.

The group has been created to work on many aspects of video text tracks of which captioning and the WebVTT format are key parts.

The main reason behind creating this group is to create a forum at the W3C for working on WebVTT to allow all browsers to support this format and be involved in its development.

We’ve not gone the full way to creating a Working Group, although that was the initial intention. We had objections from W3C members for going down that path, so are using the CG path for now.

This is actually a good thing because CGs are open for anyone to join, while WGs are only open to W3C members. The key difference is that specs coming out of WGs can become RECs (“standards”), while CG’s specs cannot.

If we eventually see a need to move WebVTT to a REC, that move will be straightforward, since there is a clear path for work to transition from a CG to a WG.

by silvia at September 30, 2011 06:59 AM

Silvia Pfeiffer : 3rd W3C Web and TV Workshop, Hollywood


Curious about any new requirements that the TV community may have for HTML5 video, I attended the W3C Web and TV Workshop in Hollywood last week. It’s already the third of its kind and was also the largest to date showing an increasing interest of the TV community to converge with the Web community.

The Workshop Aim

I went into the Workshop not quite knowing what to expect. My previous contact with members of this community was restricted to email exchanges on the W3C Web and TV Interest Group (IG) mailing list. I knew there was some interest in video accessibility (well: particularly captions) and little knowledge of existing HTML5 specifications around text tracks and why the browsers were going with WebVTT. So I had decided to attend the workshop to get a better understanding of the community, its background, needs, and issues, and to hopefully teach some of the ways of HTML5. For that reason I had also submitted a WebVTT presentation/demo.

As it turned out, the workshop had as its key target the facilitation of communication between the TV and the HTML5 community. The aim was to identify features that need to be added to the HTML5 video element to satisfy the needs of the TV community. I obviously came to the right workshop.

The process that is being used by the W3C in the Interest Group is to have TV community members express their needs, then have HTML5 experts express how these needs can be satisfied with existing HTML5 features, then make trial implementations and identify any shortcomings, then move forward to progress these through HTML5 or HTML.next. This workshop clearly focused on the first step: expressing needs.

Oftentimes it was painful for me to watch presenters defending their requirements and trying to impress on the audience how important a certain feature is to them when that feature actually already has a HTML5 specification, just not yet a browser implementation. That there were so few HTML5 video experts present and that they were given very little space to directly reply to the expressed needs and actually explain what is already possible (or specified to be possible) was probably one of the biggest drawbacks of the workshop.

To be fair, detailed technical discussions were not possible in a room with 150 attendees with a panel sitting at the front discussing topics and taking questions. Solving a use case with existing HTML5 markup and identifying the gaps requires smaller break-out groups of a maximum of maybe 20 people and sufficient HTML5 knowledge in the room. Ultimately they require a single person to try to implement it using JavaScript alone, and, failing that, writing browser extensions. Only such code actually proves that a feature is missing.

Now, the video features of HTML5 are still continuing to change almost on a daily basis. Much development is, for example, happening around real-time communication features and around the track element as we speak. So, focusing on further requirements finding around HTML5 video for now is probably a good thing.

The TV Community Approach

Before I move on to some of the topics covered by the workshop, I have to express some concern about the behaviour that I observed with lots of the TV community folks. Many people tried pushing existing solutions from other spaces into the Web unchanged with a claim of not re-inventing the wheel and following paved cowpaths, which are some of the underlying design principles for HTML5. I can understand where such behaviour originates thinking that having solved the same problems elsewhere before, those solutions should apply here, too. But I would like to warn people of this approach.

If we blindly apply solutions that were not developed for HTML5 into HTML we will end up with suboptimal solutions that will hurt us further down the track. The principles of not re-inventing the wheel and following paved cowpaths were introduced for features that were already implemented by browsers or in de-facto standard use by JavaScript libraries. They were not created for new features in HTML. The video element is a completely new feature in HTML thus everything around it is new.

I would therefore like to see some more respect given to HTML5 and the complexities involved in finding the best possible technical solutions for the Web given that the video element does not stand alone in HTML5, but is part of a much larger picture of technical capabilities on the Web where many of the requested features for TV applications may already be solved by existing HTML markup that is not part of the video element.

Also, HTML5 is not just about the HTML markup, but also about CSS and JavaScript and HTTP. There are several layers of technology involved in creating a Web application: not only a separation of work between client and servers, but also between the Operating System, the media framework, the browser, browser plugins, and JavaScript has to be balanced. To get this balance right is a fine art that will take many discussions, many experiments and sometimes several design approaches. We need patience and calm to work through this, not a rushed adoption of existing solutions from other spaces.

New Requirements

Now let’s get to the take-aways I had from the workshop’s sessions:

Session 1 / Content Provider and Consumer Perspective:

The session’s participants postulated that we will see the creation of application stores for TV applications similar to how we have experienced this for mobile phones and tablets. People enjoy collecting apps like they collect badges. Right now, the app store domain is dominated by native apps and not Web apps. The reason is that we haven’t got a standard platform for setting up Web app stores with Web apps that work in all browsers on all operating systems. Thus, developers have to re-deploy their app for many environments.

While essentially an orthogonal need to HTML standardisation, this seems to be one of the key issues that keep Web apps back from making big market inroads and W3C may do well in setting up a new WG to define a standard Web app manifest format and JS APIs.

Session 2+3 / Multi-screen TV in the Home Network:

Several technologies of hybrid TV broadcast and set-top-box Web content delivery were being pointed out, including the European HbbTV and the Japanese Hybridcast, the latter of which gave an in-depth demo.

Web purists would probably say that it would be simpler to just deliver all content over the Web and not have to worry about any further technical challenges encountered by having to synchronize content received via two vastly different delivery mechanisms. I personally believe this development is one of business models: we don’t yet know exactly how to earn money from TV content delivered over the Internet, but we do know how to do so with TV content. So, hybrids allow the continuation of existing income streams while allowing the features to be augmented with those people enjoy from the Internet.

Should requirements that emerge from such a use case for HTML5 video be taken seriously? I think they absolutely should. What I see happening is that a new way of using the Web is starting to emerge. The new way is video-focused rather than text-focused. We receive our Web content by watching video programming online – video channels, not Web pages are the core content that we consume in the living room. Video channels are where we start our browsing experience from. Search may still be our first point of call, but it will be search for video content or a video-centric app rather than search for a Web site.

And it will be a matter of many interconnected devices in the house that contribute to the experience: the 5.1 stereos that are spread all over the house and should receive our video’s sound, the different screens in the different areas of our house between which we move around, and remote controls, laptops or tablets that function as remote controls and preview stations and are used to determine our viewing experience and provide a back-channel to the publishers.

We have barely begun to identify how such interconnected devices within a home fit within the server-client-based view of the Web world, and the new Web Sockets functionality. The Home Networking Task Force of the Web and TV IG is looking at the issues and analysing existing protocols and standards that solve this picture. But I have a gnawing feeling that the best solution will be something new that is more Web-specific and fits better with the technology layers of the Web.

Session 4 / Synchronized Metadata:

The TV environment offers many data services, some of which have been legally prescribed. This session analysed TV needs and how they can be satisfied with current HTML5.

Subtitles and closed captioning support are one of the key requirements that have been legally prescribed to allow for equal access of non-native speakers, and deaf and hard-of-hearing users to TV content. After demonstration of some key features defined in the HTML5 track element and the WebVTT format, it was generally accepted that HTML5 is making big progress in this space, in particular that browsers are in the process of implementing support for the track element. A concern still exists for complete coverage of all the CEA-608/708 features in WebVTT.

Further concern was raised for support of audio descriptions and audio translations, in particular since no browser has as yet committed to implementing HTML5’s media multitrack API with the @mediagroup attribute. In this context I am excited to see first JavaScript polyfills emerge (see captionator.js & mediagroup.js).

Another concern was that many captions are actually delivered as raster images (in particular DVD captions) and how that would work in the Web context. The proposal was to use WebVTT and encode the raster images as data-URIs included in timed cues, then render them by JavaScript as an overlay. This is something to explore further.
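To make the raster-caption proposal more concrete, here is a minimal Python sketch of how such a WebVTT file could be generated. The function names are made up for illustration, and putting a bare data-URI into the cue payload is only one possible convention, not something the workshop settled on. The JavaScript that would turn the payload into an overlay image is not shown.

```python
import base64


def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def image_cue(start: float, end: float, image_bytes: bytes,
              mime_type: str = "image/png") -> str:
    """Build one cue whose payload is the image encoded as a data URI.

    Assumption: the payload is the bare data URI; a page script would
    have to read the cue text and draw it as an overlay.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
    return f"{timing}\ndata:{mime_type};base64,{encoded}"


def build_vtt(cues: list) -> str:
    """Join cues into a WebVTT file: header, then blank-line-separated cues."""
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"
```

Base64 makes each image roughly a third larger than the raw bytes, so a long programme with many raster captions would produce a sizeable text track. That is one of the things that would need exploring.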

Demos were shown using WebVTT to synchronize ads with videos, to display related metadata from a user’s life log with videos, to display thumbnails along a video’s timeline, and to show the rendering of text descriptions through screen readers. General agreement by the panel was that WebVTT offers many opportunities and that this area will continue to need further development and that we will see new capabilities on the Web around metadata that were not previously possible on TV.

Session 5 / Content Format and Codecs: DASH and Codec standards

The introduction of HTTP adaptive streaming into HTML5 was one of the core issues that kept returning in the discussions. This panel focused on MPEG DASH, but also mentioned the need for programmatic implementation of adaptive streaming functionality.

The work around MPEG DASH would require specifications of how to use DASH with WebM and Ogg Theora, as well as a specification of a HTML5 profile for DASH, which would limit the functionality possible in DASH files to the ones needed in a HTML5 video element. One criticism of DASH was its verbosity. Another was its unclear patent position. Panel attendees, who included Qualcomm, Apple and Microsoft, made very clear that their position is pro a royalty-free use of DASH.

The work around a programmatic implementation for adaptive streaming would require at least a JavaScript API to measure the quality of service of a presented video element and a JavaScript API to feed the video element with chunks of (encrypted) video content on the fly. Interestingly enough, there are existing experiments both around Video metrics and MediaSource extensions, so we can expect some progress in this space, even if these are not yet a strong focus of the HTML WG.

I would personally support the creation of a Community Group at the W3C around HTTP adaptive streaming and DASH. I think it would work towards alleviating the perceived patent issues around DASH and allow the right members of the community to participate in preparing a specification for HTML5 without requiring them to become W3C members.

Session 6 / Content Protection and DRM

A core concern of the TV community is around content protection. The requirements in this space seem, however, very confused.

The key assumption here is that Web browsers should support the decoding of DRM-protected content in the HTML5 video element because the video element provides a desirable JavaScript API, accessibility features (the track element), default controls, and the possibility to synchronize multiple media elements. However, at the same time, the video element is part of the core content of a Web page and thus allows direct access to the image content in a canvas etc, so some of its functionality is not desirable.

The picture is further confused by requests for authentication, authorization, encryption, obfuscation, same-origin, secure transmission, secure decryption key delivery, unique content identification and other “content protection” techniques without a clear understanding of what is already possible on the Web and what requirements content publishers actually have for delivering their content on the Web. This is further complicated by the fact that there are many competing solutions for DRM systems in the market with no clear standard that all browsers could support.

A thorough analysis of the technologies and solutions available in this space as well as an analysis of the needs for HTML5 is required before it becomes clear what solution HTML5 browsers may need to support. There seemed to be agreement in the group, though, that browsers would not need to implement DRM solutions, but rather only hand through the functionality of the platform on which they are running (including the media frameworks and operating system functionalities). How this is supposed to work was, however, unclear.

Session 7 / Web & TV: Additional Device & User Requirements

This was a catch-all session for topics that had not been addressed in other sessions. Among the topics addressed in this group were:

  • Parental Guidance: how to deal with ratings in an internationally inconsistent ratings landscape, how to deliver the ratings with the content, and how to enforce the viewing restrictions
  • Emergency Notifications: how to replicate on the Web the emergency notification functionality of TV by providing text overlays to alert users
  • TV channels: how to detect what channels of programming are available to users

Overall, the workshop was a worthwhile experience. It seems there is a lot of work still ahead for making HTML5 video the best it can be on the Web.

by silvia at September 29, 2011 08:16 AM

Jean-Marc Valin : Opus, the Swiss Army Knife of Audio Codecs


I just got the news today that LCA 2012 has accepted my talk proposal: "Opus, the Swiss Army Knife of Audio Codecs". I'll be presenting it in Ballarat, Australia in January. If there's any specific topic you'd like me to include in the talk, please let me know (by email or comment on this post).

September 08, 2011 01:46 AM

Chris Pearce : New media element APIs and better media seeking resolution


French intern Paul Adenot has recently implemented the seekable and played attributes on the HTML5 video and audio elements in Firefox. The seekable attribute enables script to see what regions of the media can be seeked into (particularly handy with live streams), and the played attribute enables script to see what regions of the media have already been played. Paul has also done some work improving the built-in controls on media elements. Thanks for your hard work Paul! These should be available in release builds in November (Firefox 8).

Also in Firefox 8 are my changes to media seeking resolution. Now media seeking should be accurate to the nearest microsecond. It's been reported elsewhere how important accurate seeking for video is. We were previously accurate to the nearest video frame, but we could still be up to one audio packet off (often between 4 and 8 ms out). Now we prune audio samples when seeking so we're down to microsecond resolution.
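As a rough illustration of what pruning audio samples at a seek involves, here is a small Python sketch. It is not the Firefox code; the function name and parameters are made up for the example, and it assumes timestamps are held as integer microseconds.

```python
USECS_PER_SEC = 1_000_000


def samples_to_prune(packet_start_us: int, seek_target_us: int,
                     sample_rate: int, packet_samples: int) -> int:
    """Number of leading samples to drop so playback starts at the target.

    Hypothetical helper: the packet containing the seek target is decoded
    in full, then the samples that lie before the target are discarded.
    """
    if seek_target_us <= packet_start_us:
        return 0
    offset_us = seek_target_us - packet_start_us
    # Integer arithmetic avoids floating point rounding on long media.
    skip = offset_us * sample_rate // USECS_PER_SEC
    # Never drop more than the packet actually contains.
    return min(skip, packet_samples)
```

For example, a seek target 5 ms into a packet of 48 kHz audio means dropping the first 240 samples of that packet instead of starting playback at the packet boundary.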

by Chris Pearce (noreply@blogger.com) at August 24, 2011 11:08 PM

Silvia Pfeiffer : The new FOMS: Open Media Developers at OVC


Since 2007 I have organised the annual Foundations of Open Media Software (FOMS) developers workshop. Last year it was held for the first time in the northern hemisphere, in fact on the two days straight after the Open Video Conference (OVC).

This year I’m really excited to announce that the workshop will be an integral part of the Open Video Conference on 10-12 September 2011.

FOMS 2011 will take place as the Open Media Developers track at OVC and I would like to see as many if not more open media software developers attend as we had in last year’s FOMS.

Why should you go?

Well, firstly of course the people. As in previous years, we will have some of the key developers in open media software attend – not as celebrities, but to work with other key developers on hard problems and to make progress.

Then, secondly we believe we have some awesome sessions in preparation:

How we run it

I’m actually not quite satisfied with just these sessions. I’d like to be more flexible on how we make the three days a success for everyone. And this implies that there will continue to be room to add more sessions, even while at the conference, and create breakout groups to address really hard issues all the way through the conference.

I insist on this flexibility because I have seen in past years that the most productive outcomes are created by two or three people breaking away from the group, going into a corner and hacking up some demos or solutions to hard problems and taking that momentum away after the workshop.

To allow this to happen, we will have a plenary on the first day during which we will identify who is actually present at the workshop, what they are working on, what sessions they are planning on attending, and what other topics they are keen to learn about during the conference that may not yet be addressed by existing sessions.

We’ll repeat this exercise on the Monday after all the rest of the conference is finished and we get a quieter day to just focus on being productive.

But is it worth the effort?

As in the past years, whether the workshop is a success for you depends on you and you alone. You have the power to direct what sessions and breakout groups are being created, and you have the possibility to find others at the workshop that share an interest and drag them away for some productive brainstorming or coding.

I’m going to make sure we have an adequate number of rooms available to actually achieve such an environment. I am very happy to have the support of OVC for this and I am assured we have the best location with plenty of space.

Trip sponsorships

As in previous FOMSes, we have again made sure that travel and conference sponsorship is available to community software developers that would otherwise not be able to attend FOMS. We have several such sponsorships and I encourage you to email the FOMS committee or OVC about it. Mention what you’re working on and what you’re interested to take away from OVC and we can give you free entry, hotel and flight sponsorship.

Oh, and don’t forget to Register for OVC!

by silvia at August 13, 2011 05:12 AM

Ben Schwartz : Evolution


So I wrote this song, sort of. Maybe you’ll like it.

<video controls="controls">
<source src="http://www.archive.org/download/EvolutiontheSong/evolution-small.ogv" type="video/ogg">
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="349" src="http://www.youtube.com/embed/_QF6dB06c2I" width="560"></iframe>
</video>

YouTube version
Sheet Music
Reference files at Archive.org

After about 6 years of covering pop songs in my a cappella groups, I really wanted to sing some original music. In part, I was motivated by the US’s aggressively restrictive copyright regime, which always prevented us from freely sharing recordings of our own performances.

I tried to write a song from scratch for a while, but it wasn’t working out, mostly because I don’t have anything interesting to say. Then I struck upon the idea of using the text of an old out-of-copyright poem (which, because of the US’s effectively perpetual copyright, has to be very old indeed). I started browsing through the poetry section of WikiSource, until I stumbled across this brilliant 1895 poem by Langdon Smith. The choice was clear.

I drew up a thoroughly derivative 4-part a cappella arrangement in MuseScore, and VoiceLab indulged me by adding it to the repertoire. We’ve sung it twice so far, but the first time we didn’t have a good recording, and then this time I had to solve this audio-video alignment problem… but now it’s here.

The recordings and sheet music are all CC0 dedicated to the public domain. I would appreciate attribution as the arranger, but I find threats of legal action to be just as distasteful as plagiarism. I wouldn’t want to do anything to discourage people from adopting and adapting the music as they see fit. Maybe someone will make a recording with a soloist who can really sing!

by Ben at August 05, 2011 05:12 PM

Chris Pearce : Simple rate limited HTTP server for testing HTML5 media/streaming


While working on the Firefox HTML5 video and audio support, I've found it extremely useful to have an HTTP server on which the transfer rate is reliably limited. Existing servers are either too heavyweight, like Apache, or have inconsistent rate limiting, like lighttpd, which I found to have very "bursty" rate limiting.

I ended up taking the educational route, and implementing a simple HTTP server in C++. It supports the following features:

  1. Support for HTTP1.1 Byte Range Requests. This means you can seek into unbuffered data when watching HTML5 video.
  2. Rate limiting, configurable on a per request basis by passing the "rate=x" HTTP query parameter, where x is the transfer rate of the connection in kilobytes per second. The server will send x/10 KB ten times per second to maintain this rate smoothly.
  3. Simulated live streaming, configurable on a request basis by passing the "live" query parameter. When in "live" mode, no Content-Length header is sent, and the server doesn't advertise or perform byte range requests - so you can't seek into unbuffered video/audio, just like in a live stream.
  4. Cross platform; tested on Windows (runs on port 80) and Linux (runs on port 8080). I haven't tested it on MacOS yet.
  5. Simply serves all files in the program's working directory, making it easy to use (and abuse).
  6. Open source! Get the code at https://github.com/cpearce/HttpMediaServer, or download a pre-built win32 binary.
For example, if you wanted to simulate a live stream being served at 100KB/s, your test URL might look something like http://localhost:80/video.ogg?rate=100&live.
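
The rate limiting and query parameter handling described above can be sketched in a few lines of Python. This is not the C++ code from the repository, just an illustration of the "x/10 KB ten times per second" idea. The function names are made up, and treating 1 KB as 1024 bytes is an assumption.

```python
import time
from urllib.parse import parse_qs, urlsplit

TICKS_PER_SECOND = 10


def parse_options(url: str):
    """Return (rate in KB/s or None, live flag) from the query string."""
    # keep_blank_values keeps a bare "live" parameter that has no "=value".
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    rate = int(query["rate"][0]) if "rate" in query else None
    return rate, "live" in query


def paced_chunks(data: bytes, rate_kb_per_s: int, sleep=time.sleep):
    """Yield data in slices of rate/10 KB, pausing 1/10 s after each slice.

    Assumption: 1 KB is 1024 bytes. The sleep function is injectable so
    the pacing can be skipped in tests.
    """
    chunk_size = max(1, rate_kb_per_s * 1024 // TICKS_PER_SECOND)
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]
        sleep(1 / TICKS_PER_SECOND)
```

A plain sleep between slices drifts slightly, because the time spent sending is not subtracted from the pause. A more careful version would schedule each slice against a fixed clock.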

I've been using it for quite a while, and over the weekend I finally cleaned it up and put it up on GitHub. Check it out.

by Chris Pearce (noreply@blogger.com) at August 03, 2011 07:03 AM

Ben Schwartz : An Auto-Aligner for PiTiVi


It’s rare to get exactly one recording of an a cappella concert. Usually someone’s parents have a fancy but outdated camcorder, someone in the front row has a cell phone video with a great angle but terrible quality, and there’s a beautiful audio-only recording, maybe straight from the mixing board. All the recordings are independent, starting and stopping at different times. Some are only one song long, or are broken into many short pieces.

If you want to combine all these inputs into a video that anyone could watch, you’ll first have to line them up correctly in a video editor. This is a painful process of dragging clips around on the timeline with the mouse, trying to figure out if they’re in sync or not. The usual trick to making this achievable is to look at the audio waveform visualization, but even so, the process can be tedious and irritating.

This year, when I got three recordings from the VoiceLab spring concert, I resolved to solve the problem once and for all. I set about writing an automatic clip alignment algorithm as a patch to PiTiVi, a beautiful (if not mature) free software video editor written in Python.

Today, after about two months of nights and weekends, the result is ready for testing in PiTiVi mainline. Jean-François Fortin Tam has a great writeup explaining how it works from a user’s perspective.

I hadn’t looked into it until after the fact, but of course this is not the first auto-alignment function in a video editor. Final Cut Pro appears to have a similar function built in, and there are also plug-ins such as “Plural Eyes” for many editors. However, to the best of my knowledge, this is the first free implementation, and the first available on Linux. Comparing features in PiTiVi vs. the proprietary giants, I think of this as “one down, 20,000 to go”.

I guess this is as good a place as any to talk about the algorithm, which is almost The Simplest Thing that could Possibly Work. Alignment works by analyzing the audio tracks, relying on every video camera to have a microphone of its own. The most direct approach might be to compute the cross-correlation of these audio tracks and look for the peak … but this could require storing multi-gigabyte audio files in memory, and performing impossibly large FFTs. On computers of today, the direct approach is technologically infeasible.

The algorithm I settled on resembles the method a human uses when looking at the waveform view. First, it breaks each input audio stream into 40 ms blocks and computes the mean absolute value of each block. The resulting 25 Hz signal is the “volume envelope”. The code subtracts the mean volume from each track’s envelope, then performs a cross-correlation between tracks and looks for the peak, which identifies the relative shift. To avoid performing N^2 cross-correlations, one clip is selected as the fixed reference, and all others are compared to it. The peak position is quantized to the block duration (creating an error of +/- 20ms), so to improve accuracy a parabolic fit is used to interpolate the true maximum. I don’t know the exact residual error, but I expect it’s typically less than 5 ms, which should be plenty good enough, seeing as sound travels about 1 foot per ms.
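
For readers who prefer code, the steps above can be sketched in plain Python roughly as follows. This is not the code from the PiTiVi commit. The function names are made up, and the cross-correlation is computed directly rather than with an FFT, which is only reasonable because the 25 Hz envelopes are short.

```python
def volume_envelope(samples, sample_rate, block_seconds=0.040):
    """Mean absolute value of each 40 ms block, with the overall mean removed."""
    block = round(sample_rate * block_seconds)
    env = [
        sum(abs(s) for s in samples[i:i + block]) / block
        for i in range(0, len(samples) - block + 1, block)
    ]
    mean = sum(env) / len(env)
    return [v - mean for v in env]


def cross_correlate(ref, other):
    """Direct cross-correlation; returns {lag: score} for every overlap."""
    scores = {}
    for lag in range(-(len(other) - 1), len(ref)):
        scores[lag] = sum(
            ref[i + lag] * other[i]
            for i in range(len(other))
            if 0 <= i + lag < len(ref)
        )
    return scores


def estimate_shift(ref_env, other_env, block_seconds=0.040):
    """Seconds by which `other` starts after `ref` (negative if before)."""
    scores = cross_correlate(ref_env, other_env)
    peak = max(scores, key=scores.get)
    # Parabolic fit through the peak and its two neighbours to get a
    # sub-block estimate of where the true maximum lies.
    left = scores.get(peak - 1)
    right = scores.get(peak + 1)
    frac = 0.0
    if left is not None and right is not None:
        denom = left - 2 * scores[peak] + right
        if denom != 0:
            frac = 0.5 * (left - right) / denom
    return (peak + frac) * block_seconds
```

With one clip chosen as the fixed reference, estimate_shift would be called once per remaining clip, which keeps the number of cross-correlations linear in the number of clips.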

My original intent was to compensate for clock skew as well, because all these recording devices are using independent sample clocks that are running at slightly different rates due to manufacturing variation. There’s even code in the commit for a far more complex algorithm that can measure this clock skew. At the moment, this code is disused, for two reasons: none of our test clips actually showed appreciable skew, and PiTiVi doesn’t actually support changing the speed of clips, especially audio.

If you want to help, just stop by the PiTiVi mailing list or IRC channel. We can use more test clips, a real testing framework, a cancel button, UI improvements, conversion to C for speed, and all sorts of general bug squashing. For this feature, and throughout PiTiVi, there’s always more to be done. I’ve found the developer community to be extremely welcoming of new contributions … come and join us.

by Ben at July 26, 2011 04:12 AM

Cristian Adam : Maintainer for Directshow Filters for Ogg Vorbis, Speex, Theora and FLAC


I probably should have blogged sooner, but here it is: I'm the current maintainer for Directshow Filters for Ogg Vorbis, Speex, Theora and FLAC.

If you want a hardware Ogg player, you should consider buying a Trekstor or Samsung product (most of their MP3 players support the Ogg Vorbis and FLAC formats)!

by Cristian Adam (noreply@blogger.com) at June 21, 2011 06:06 AM

Monty : FTC "Patent Hold-Up" Workshop


Quite a few otherwise interested people may not have heard that the FTC (Federal Trade Commission) is holding a panel and workshop next week concerning how patent trolls are abusing standards body processes. This is our field, and we didn't find out about it until the end of last week.

Regardless, Xiph.Org has assembled an official comment document, and will be represented in person by at least Dr. Tim Terriberry and possibly a few other core members (I won't be there).

If you're interested in software patents, some of the US Government's thinking on the issue, and participating in the process, have a look at the above two links. Also, feel free to distribute our comments far and wide. It's somewhat more gripping than the usual, dry "Percy Q. Business Leader Advises the Federal Government".

by Monty (monty@xiph.org) at June 15, 2011 09:36 AM

Monty : Death By Graphs (a new Ghost update/demo)


I'd mentioned in the previous update that we're (Xiph is) using a chirp estimation algorithm that we published back in 2007, and that the original paper has precious little space to devote to describing in detail how the algorithm actually performed. One of the upshots of not having done extensive characterization tests of our own algorithm was that it has already surprised me a few times this year (in both good and bad ways).

Therefore, Ghost Update 20110604 concerns itself with describing and graphing algorithm behavior in mind-numbing detail.

Death! By! Graphs!

by Monty (monty@xiph.org) at June 05, 2011 03:28 AM

Monty : A Ghost update! For... last month!


Not actually last month, but pretty close at this point.

I never publicly released my previous Ghost update, delivered internally to Red Hat at the beginning of the month, because I hadn't finished some of the diagrams I wanted to do for the last section on chirp coding. Well, the diagrams are done! Here's the latest Ghost demo update, just in time for the next one to almost be due!

by Monty (monty@xiph.org) at April 27, 2011 05:27 PM

Monty : WebM Community Cross-License announced


In case folks have missed it (or worse, read about it on Ars Technica)...

The WebM folks have finally finished up their work on the WebM Community Cross-License project and announced the license launch. This is a FOSS defensive license/pool similar to what a couple other groups are trying out (and similar to the defensive patent license that Xiph is already using for our parts of Opus within the IETF).

The basic idea of the cross-license is:

"Everyone is free to use any known or unknown WebM patents. Unless you sue over patents related to WebM. In that case, we all agree to yank your license."

In short, it's sort of a NATO for FOSS patents; a free license with an agreed-upon mutual defense clause that tries to enforce everyone playing nice. This strategy is not a new idea, but it's interesting that several different FOSS groups, Xiph and WebM included, are finally trying the idea for real in practice.

by Monty (monty@xiph.org) at April 27, 2011 01:09 PM

Chris Pearce : HTML5 Video painting performance statistics in Firefox 5


I've landed video frame paint performance counters for HTML5 video onto mozilla-central. This should ship in Firefox 5, barring any disasters. This work was a combined effort by Chris Double and me. These are Mozilla specific fields which will only be available in Firefox.

The new statistics enable us to measure the performance of the video decoding and frame painting pipeline.

This adds the following fields to the HTMLVideoElement:
  • mozParsedFrames - A count of the number of video frames that have been demuxed/parsed from the media resource. If we were playing perfectly, we'd be able to paint this many frames.
  • mozDecodedFrames - A count of the number of demuxed/parsed video frames that have been decoded into Images. We skip decoding of parsed/demuxed frames if the decode is falling behind the playback position (this can happen if it takes a long time to decode a keyframe for example).
  • mozPresentedFrames - A count of the number of decoded frames that have been presented to the rendering pipeline for painting (set as the current Image on the video element's ImageContainer). We may not present decoded frames if the frame arrives for presentation late.
  • mozPaintedFrames - A count of the number of presented frames which were painted on screen. We may end up not painting presented frames if another frame is presented before the graphics pipeline has time to paint the previously presented frame, or if the video is off screen. 
  • mozFrameDelay - The time (as a floating point number in seconds) which the last painted video frame was rendered late by. This is the time duration between the decoder saying "paint frame X now", and the graphics pipeline physically getting frame X displayed on the screen. The value is accurate on desktop Firefox, but not on mobile. Improvements in the graphics pipeline, and the integration with the graphics pipeline, will show up as a decrease in this number.
Here's a demo of the video paint statistics in Firefox 5. You'll need a recent Firefox trunk nightly build for the demo to work.
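
As a rough illustration of how these counters relate to each other, here is a minimal sketch that works out how many frames were lost at each stage of the pipeline. The field names are the ones listed above; the `VideoPaintStats` and `PaintSummary` types and the `summarizePaintStats` helper are hypothetical names made up for this example and are not part of Firefox.

```typescript
// The Mozilla-specific counters listed above, as they would appear
// on a video element in Firefox.
interface VideoPaintStats {
  mozParsedFrames: number;
  mozDecodedFrames: number;
  mozPresentedFrames: number;
  mozPaintedFrames: number;
  mozFrameDelay: number; // seconds the last painted frame was late by
}

// Hypothetical summary type for this example (not a Firefox API).
interface PaintSummary {
  skippedDecode: number; // parsed but never decoded
  notPresented: number;  // decoded but never presented
  notPainted: number;    // presented but never painted
  paintedRatio: number;  // painted / parsed, 0 if nothing was parsed
}

// Hypothetical helper: each counter is a subset of the previous one,
// so the difference between neighbours is the loss at that stage.
function summarizePaintStats(s: VideoPaintStats): PaintSummary {
  return {
    skippedDecode: s.mozParsedFrames - s.mozDecodedFrames,
    notPresented: s.mozDecodedFrames - s.mozPresentedFrames,
    notPainted: s.mozPresentedFrames - s.mozPaintedFrames,
    paintedRatio:
      s.mozParsedFrames > 0 ? s.mozPaintedFrames / s.mozParsedFrames : 0,
  };
}
```

In a page, one could pass the video element itself (cast to `VideoPaintStats`, since standard DOM typings do not include these fields) and poll the result on a timer. Browsers without the fields would yield `undefined` values, so a real page should check for them first.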

by Chris Pearce (noreply@blogger.com) at March 31, 2011 02:58 AM

David Schleef : GStreamer SDI Capture Plugins


I’m getting ready to push several commits to the gst-plugins-bad source repository that add plugins for capturing SDI and HD-SDI using cards from two different manufacturers: BlackMagic Design’s DeckLink, and the Linear Systems SDI Master capture card.

The Linear Systems cards are probably better known by their reseller, DVEO. Entropy Wave uses both of these cards in the E1000 Live Encoder appliance, and we’ve found that, aside from some motherboard incompatibilities with the DeckLink cards, they both work great in Linux. While we’re primarily interested in live capture at the moment, output has also been implemented.

We slightly prefer the Linear Systems cards – mainly because the drivers are open source, but also because the API allows lower level access to the hardware, including SDI clocking and raw VANC and HANC data. It also allows subframe latency; although this is not implemented in the GStreamer plugin, it will be nice to use in the future.

In comparison, the DeckLink driver and SDK are not open source (which means I can’t fix any bugs), although they conveniently provide open source headers and shim code for interfacing with the SDK. This allows the GStreamer plugin to be completely open source and legally distributable separately from the SDK, but the plugin will only work if the SDK libraries and driver are present. Optical fiber connections are only available in the DeckLink, and the DeckLink cards tend to be less expensive.

It will take a few weeks for these to be available as part of a GStreamer release; however, they are available in the Media SDK now.

(Reposted from my Entropy Wave blog.)

by David Schleef at March 24, 2011 07:35 PM