Planet Xiph

November 18, 2009

Silvia Pfeiffer

HTML5 Video element discussions at TPAC meetings

Last week’s TPAC (2009 W3C Technical Plenary / Advisory Committee) meetings were my second time at a TPAC and I found myself becoming highly involved with the progress on accessibility on the HTML5 video element. There were in particular two meetings of high relevance: the Video Accessibility workshop and Friday’s HTML5 breakout group on the video element.

HTML5 Video Accessibility Workshop

The week started on Sunday with the “HTML5 Video Accessibility workshop” at Stanford University, organised by John Foliot and Dave Singer. They brought together a substantial number of people all representing a variety of interest groups. Everyone got their chance to present their viewpoint – check out the minutes of the meeting for a complete transcript.

The list of people and their discussion topics were as follows:

Accessibility Experts

  • Janina Sajka, chair of WAI Protocols and Formats: represented the vision-impaired community and expressed requirements for a deeply controllable access interface to audio-visual content, preferably in a structured manner similar to DAISY.
  • Sally Cain, RNIB, Member of W3C PF group: expressed a deep need for audio descriptions, which are often overlooked besides captions.
  • Ken Harrenstien, Google: has worked on captioning support for video.google and YouTube and shared his experiences, e.g. http://www.youtube.com/watch?v=QRS8MkLhQmM, and automated translation.
  • Victor Tsaran, Yahoo! Accessibility Manager: joined for a short time out of interest.

Practitioners

  • John Foliot, professor at Stanford Uni: showed a captioning service that he set up at Stanford University to enable lecturers to publish more accessible video – it uses humans for transcription, but automated tools to time-align, and provides a Web interface to the staff.
  • Matt May, Adobe: shared what Adobe learnt about accessibility in Flash – in particular that an instream-only approach to captions was a naive approach and that external captions are much more flexible, extensible, and can fit into current workflows.
  • Frank Olivier, Microsoft: attended to listen and learn.

Technologists

  • Pierre-Antoine Champin from Liris (France), who was not able to attend, sent a video about their research work on media accessibility using automatic and manual annotation.
  • Hironobu Takagi, IBM Labs Tokyo, general chair for W4A: demonstrated a text-based audio description system combined with a high-quality, almost human-sounding speech synthesizer.
  • Dick Bulterman, Researcher at CWI in Amsterdam, co-chair of SYMM (group at W3C doing SMIL): reported on 14 years of experience with multimedia presentations and SMIL (slides) and the need to make temporal and spatial synchronisation explicit to be able to do the complex things.
  • Joakim Söderberg, researcher at Ericsson, co-chair of media annotation group at W3C: reported on W3C media annotations group work and wanted to find out whether there are a11y related attributes missing.
  • Felix Sasaki, University of Applied Sciences in Potsdam, W3C media annotations group member: Teaching metadata.
  • Eric Carlson, Apple, Engineering HTML5 media elements in Webkit: there to watch and make sure the specification is implementable.
  • James Craig, Accessibility Software Quality Engineer, Apple: is interested in universal design and in solving content selection.
  • Myself, as Mozilla’s video accessibility contractor: I discussed what requirements I had collected before going ahead and doing implementations (slides) – also showed my demos.

Standards Experts

  • Dave Singer, Apple, head of multimedia standards: interested in building up a framework that gets better accessibility over time, and presented on how media queries could be used to improve source selection (slides).
  • Michael Cooper, W3C WAI staff: needs to make sure W3C technology has accessibility baked in and existing solutions are re-used.
  • Marisa DeMeglio, DAISY Consortium developer: reported on Digital Talking Book standards, the challenges with video in DAISY 4, and understanding the possibilities with HTML5.
  • Ian Hickson, Google, HTML5 editor: gave an update on the state of accessibility in HTML5.
  • Chris Lilley, W3C Hypertext CG co-chair, CSS and SVG group, etc.: wanted to make sure that whatever solution is created for HTML will be usable in SVG, and was also keenly interested in solving the internationalisation question.
  • Philippe Le Hegaret, W3C representative for HTML, timed text, and video working groups: reported on work in Timed Text working group, including a demo of DFXP use in HTML5.
  • Charles McCathieNeville, Opera, in charge of standards: interested in i18n and use of accessibility methods across different technologies without reinventing wheels.
  • Geoff Freed, NCAM, joined online: keen to solve captions and audio description in HTML5 in a declarative way.
  • Judy Brewer, head of WAI at W3C: interested in how the options for accessible media affect the user, especially when there are multiple options. Keen to create a proper requirements document.
  • Doug Schepers, Team Contact for the SVG and WebApps Working Groups: video handling in HTML5 and SVG should be similar. Keen to re-use technology.

The workshop helped clarify some of the requirements and potential solutions. No concrete specification progress was made, but that was not the intention of the workshop. Rather, it started getting people talking and involved with the newly created HTML5 Accessibility Task Force. In the end, the creation of a requirements document for media accessibility was taken on board as one of the first tasks for the HTML5 Accessibility Task Force, which, I believe, is still looking for an editor for it.

TPAC HTML5 Video Breakout Group

The HTML Working Group at TPAC was run differently to the previous year: instead of having all topics discussed in the full group, breakout groups were organised in two parallel rooms which focused each on particular topics that still need resolution in HTML5. I proposed a breakout group on Video Accessibility and somebody else wanted to get an update on the baseline codec discussion on the HTML5 video element, so we turned it into a single Video breakout group.

I put together a list of the core issues that we had come across from Sunday and turned it into an agenda. Here are some short notes on each topic that was discussed. A full transcript is available in the minutes.

  1. Baseline codecs

    The brief discussion about baseline codecs just stated the impasse that we are currently at with respect to the lack of a baseline codec that would satisfy all the stated requirements. There is work being done behind closed doors to move towards agreeable royalty-free codecs, but there is no progress to report to date.

  2. Cue ranges

    It has become apparent that there is a need to bring back the functionality of cue ranges, in particular for things such as activating slide changes, pause and display ads, and captions for live video. Rather than callbacks, this time a declarative and event-based solution is envisaged. It will possibly get introduced as part of the solution for captions.

  3. Media Element Accessibility

    The discussion here focused on the means in which to provide caption/subtitle support and briefly touched upon audio descriptions. I collected the following requirements:
  • while we prefer textual sources for captions/subtitles, and burnt-in captions or bitmap overlays (the DVD format) are not ideal, they should still be possible to be displayed; since most bitmaps are in gif, png or bmp format, that should not be a major issue
  • we need a declarative syntax for captions/subtitles, both for in-band and external files
  • if there are “conflicting” captions/subtitles available from in-band and external files, the in-band one would probably be displayed by default, but all available tracks should be user selectable
  • we need a default presentation, but also a JavaScript API to allow custom display
  • to deal with cross-site scripting issues for external files, CORS is probably a good solution
  • baseline codecs for captions/subtitles should be DFXP, srt and probably smpteTText – a new DFXP based SMPTE standard for timed text; it was further suggested to regard srt simply as a trivial subpart of DFXP functionality
  • the default presentation has to take into account authoring requirements, user preferences, and allow for interactive override – it seems we need to define new user preferences for video a11y aside from simply the browser language settings
  • the idea of using text and ARIA live regions to display audio descriptions was welcomed as a useful modern means to provide a11y to vision-impaired – it provides choice between screen reader and braille use and also improves searchability and has further advantages of being automatically processable
  • however, there continues to be a need to make human-created audio descriptions available, in particular for high-quality recordings (Shakespeare was used as an example)
  • further, the HTML5 video element has not yet clarified how multiple encodings, e.g. for different devices/bitrates, should be made available – this also ties into making sign language video tracks in different sign languages available
  • content selection could be done using media queries – needs further experimentation

Unfortunately, there was not enough time to address the remaining topics on the agenda: the need for hierarchical navigation through audio/video elements, and the handling of multi-track video. I am sure we will get back to them in due time.

Well, the meetings have certainly widened my understanding of the issues that we are currently dealing with around the audio and video elements – in particular I believe that once we solve how to deal with multiple alternative representations of the original media file, we will also solve many of the accessibility issues. It may, however, happen that we create the accessibility solutions first and thus also solve the issues of alternate representations. I have a lot to think about.

by silvia at November 18, 2009 10:58 AM

November 17, 2009

Ralph Giles

17 Nov 2009

In the Future, we will embed machine images in our Ogg files, which, when booted and given network access to the other multiplexed data will decode, render, and export the results data in a variety of JSON responses.

November 17, 2009 11:12 PM

November 12, 2009

David Schleef

Theora on TI C64x+ DSP and OMAP3

For the last several months, Entropy Wave has been making Theora work on the TI C64x+ DSP as a project for Mozilla Corp.

An Ogg/Theora video of Big Buck Bunny being played back on a Beagle Board via the C64x+ DSP coprocessor

The goal behind porting to the C64x+ is to run on the OMAP3 SoC from TI, which has an ARM Cortex A8 core and also has a C64x+ DSP coprocessor. This SoC (System on Chip) is best known as being the base behind Nokia’s N series of mobiles (including the N900), the Motorola Droid, Palm Pre, and the Beagle Board. The DSP coprocessor is commonly used for audio and video processing, including video encoding and decoding, and TI makes codecs available for MPEG-4 video decoding, AAC decoding, etc. Having Theora decoded on the DSP fits into Mozilla’s Fennec project, making Firefox with video useful on a mobile platform.

One of the engineering reasons behind having a separate processor for media handling is that it separates real-time tasks (media decoding) from non-real-time tasks, such as running web browser software. From the standpoint of software running on the ARM, the video decoder looks and acts just like a hardware video codec. The DSP on the OMAP3 is even more compelling for video decoding because attached to the DSP are several units that accelerate motion vector copying, VLC decoding, and loop deblocking. Unfortunately, these pieces are not publicly documented by TI, so the current Theora port (which is open source) is unable to use them. A future Entropy Wave project will likely add support for these acceleration units which would allow the performance of the Theora decoder to be similar to TI’s MPEG-4 codec, which can do 800×480 playback (possibly more?). As it looks now, the resulting code would necessarily be closed source until such a time when TI wishes to make the specifications public.

As it currently stands, the Theora decoder plays 640×360 24fps at slightly more than 100% speed on average. This isn’t quite good enough to call it “real time”, since some frames take longer than the allotted time to decode, but it’s pretty close and the results are good. Additional speed improvements in libtheora would require internal changes, which would be a project in itself. One clear area for improvement is that the DSP spends a substantial part of its time idle, because the host code is serialized with the DSP processing. Fixing this is likely to put the above case firmly into the “real time” category. Given that 640×360 is larger than the iPhone display resolution and almost as large as the N900 resolution, it’s clearly good enough, even if it is less than TI’s hardware accelerated MPEG-4.
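To make the distinction between "more than 100% speed on average" and "real time" concrete, here is a small sketch of the arithmetic. The decode timings are made-up illustrative numbers, not measurements from the port, and `summarise` is a hypothetical helper.

```javascript
// At 24 fps each frame has a budget of 1000 / 24 ms (about 41.7 ms).
const FPS = 24;
const frameBudgetMs = 1000 / FPS;

// Hypothetical helper: compare average decode speed with per-frame deadlines.
function summarise(decodeTimesMs) {
  const total = decodeTimesMs.reduce((sum, t) => sum + t, 0);
  const average = total / decodeTimesMs.length;
  // Frames that took longer than their budget would be displayed late.
  const lateFrames = decodeTimesMs.filter((t) => t > frameBudgetMs).length;
  return { speedPercent: (frameBudgetMs / average) * 100, lateFrames };
}

// Made-up decode times in milliseconds: fast on average, one slow frame.
const summary = summarise([35, 38, 60, 30, 36]);
console.log(summary.speedPercent.toFixed(1) + "% speed,", summary.lateFrames, "late frame(s)");
```

With these numbers the average speed is above 100%, yet one frame still misses its deadline, which is the situation described above.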

On the Entropy Wave site is a page describing the demo, including where to download images and how to compile source code.

A big thanks to the people that laid the foundations for this work, especially Felipe Contreras.

by admin at November 12, 2009 04:24 AM

November 11, 2009

Silvia Pfeiffer

FOMS and LCA Multimedia Miniconf

If you haven’t proposed a presentation yet, go ahead and register yourself for:

FOMS (Foundations of Open Media Software workshop) at
http://www.foms-workshop.org/foms2010/pmwiki.php/Main/CFP

LCA Multimedia Miniconf at
http://www.annodex.org/events/lca2010_mmm/pmwiki.php/Main/CallForP

It’s already November and there’s only Christmas between now and the conferences!

I’m personally hoping for many discussions about HTML5 <video> and <audio>, including what to do with multitrack files, with cue ranges, and captions. These should also be relevant to other open media frameworks – e.g. how should we all handle multitrack sign language tracks?

But there are heaps of other topics to discuss and anyone doing any work with open media software will find fruitful discussions at FOMS.

by silvia at November 11, 2009 10:31 PM

November 08, 2009

Maik Merten

Cortado nostalgia

Yes, this is Cortado running on Netscape 4.79:



Basically this means Cortado can be made to run even on, uh, bad and slow Java virtual machines. No, the JVM included with Netscape isn't fast enough for smooth video playback even on this 3 GHz machine, but sound isn't crackling either.

November 08, 2009 01:13 PM

November 06, 2009

Ben Schwartz

Old bugs fixed

In the process of testing Cortado on old operating systems, we discovered that using a recent compiler produced bytecode that wouldn’t run on Sun JDK 1.1. Instead, we got IllegalMonitorStateExceptions in an infinite loop.

A little bit of searching made it clear that we weren’t the only ones who’d experienced this problem. There were reports going back to 2001 that Sun had introduced some sort of bug in their compiler in version 1.4. We verified that going back to an old compiler produced code that worked for us, again.

Today Greg Maxwell constructed a minimal test case and printed out the disassembled bytecode produced with old and new compilers. One difference stood out: the new compiler introduced a circular exception handler at the end of a synchronized block. I looked around, and sure enough, this behavior drew complaints when it first appeared over eight years ago.

Rather than attempt to convince the compiler authors that their code has a logical fallacy, or somehow fix ten-year-old versions of closed-source software, we instead decided to add a workaround into ProGuard, a bytecode post-processor that we are already using to shrink Cortado by 30% for faster downloads.

There’s an interesting question here as to what, exactly, the bug is. Is it a code generation bug, in which the compiler produces bytecode that will not run correctly on the Java 1.1 target? Or is it a JVM bug, exposed by newer compilers that make use of previously untested edge cases? This is a case of Software Development Relativity: the number of bugs is conserved, but their precise location depends on your reference frame.

Anyway, I think this is a nice short story about the power of an open development model. We found a bug somewhere in a complex system, and wound up putting a fix in the component whose maintainers, we hope, will be most receptive to it. When one avenue is cut off, open source finds another route.

by Ben at November 06, 2009 05:38 AM

November 03, 2009

Ben Schwartz

Cortado

A project I’ve been playing with recently is Ogg Theora’s Cortado, a free video player designed to be able to run on an extremely wide variety of computers, including old, obsolete systems. How old, you ask?

Really old:

Screenshot of Cortado playing a video in SheepShaver

This is a picture of Cortado running on Mac OS 7.5.5, in the Macintosh Runtime for Java 2.0, playing the video from the FSF’s freedom testimonials campaign. This operating system was released in 1996. The system is emulated in SheepShaver, which makes playback far too slow to be usable. Someone will have to test on real hardware to see what happens.

Nonetheless, I think this is strong evidence regarding how serious we are about backwards compatibility and inclusive software. Serious, or at least, enthusiastic.

by Ben at November 03, 2009 03:57 AM

November 01, 2009

Silvia Pfeiffer

Best economy flight evva!

Over the years, I have flown a lot – mainly between Sydney and Frankfurt or Sydney and San Francisco. Today, for the first time in a long time, I had a flight with Qantas from Sydney to San Francisco. And I must say: it was the most productive and most comfortable economy flight I had in a long time.

This is gonna feel awkward, since it’s not one of my usual technical posts. But I just have to say “Thank you” to Qantas. When I fly to the US, I tend to catch a US airline because they usually turn up as the cheapest. This time, Qantas was the second cheapest, so I decided to spend the extra hundred bucks on getting a modern airline. Yes, get that US airlines: no matter which of you I take, I always feel like I am thrown back into the last century. Legspace is rare, seats are uncomfortable, food is crap, service is poor, oh … and have you ever heard of personal entertainment screens? Yes, I know, your planes are from the last century. But honestly: I had a personal entertainment screen on my Singapore Airlines flight when coming to Australia for the first time in 1998! Couldn’t you at least upgrade the inside of your planes?

Anyway, back to this flight. It all started with the question: would you like to sit in the centre aisle in front of the baby bassinet? Oh, I usually take a window seat to get some peace and quiet – but hey, I’m not going to say “no” to space! And, man did I use it!

I settled in with a good book and a little nap until the first meal and after that felt strengthened and awake enough to start hacking. With my new MacBook Pro, I was bound to get a few hours in before the battery would die on me. Not the 7 hours, that Apple claims, but that’s because I was going to do lots of compiles of Firefox. Anyway – without a seat in front of me, without the personal entertainment screen pulled out, and with the nice thick cushion that Qantas supply on my lap, protecting me from the laptop heat, I almost felt like I was back home in my living room.

On top of that – and unfortunately for Qantas, but fortunately for me – the plane was only two thirds full, so I had the middle seat on my left empty, which I immediately used to extend my table space. I had continuing catering service for the next 4-5 hours of compiling, applying OggK patches to the new Chris Double Firefox codebase, and fixing compile errors (all configuration based – I have yet to get to writing actual code). Ongoing catering service, no need to cook for myself, uninterrupted coding time, good music from the inflight entertainment service – I think I’ll move my office into a Qantas plane! Not been this productive in ages!

Everywhere around me the lights were out, people were watching movies, but I was working and really enjoying it. And then, the battery was empty, half way into the flight. Bummer! But I didn’t give up this easily. Thought it’d be worth asking if there was a way to recharge without occupying a toilet for two hours. And as with everything else, Qantas inflight personnel made an extra effort to please: they found me an empty seat in business class and hooked up the laptop for an hour to recharge. Totally, utterly awesome! I got it back after another nice reading break – cannot start watching movies, since that makes the brain go mash. I got another few hours of compiling in before my body forced me to catch a few hours of sleep.

Now, I’m about an hour away from San Fran and the laptop claims 40min of power left. Funnily, that number seems to go up rather than down, so I’m sure it will last until arrival (uh! It’s now at 1:24min – oh, compilation just finished!). Hopefully I will be able to find out, why some of the Ogg Theora/Vorbis/Kate videos that I created using kateenc and oggz-merge don’t play in the patched Firefox. After all, it would be awesome to be able to show it off in the upcoming HTML5 Video Accessibility workshop!

by silvia at November 01, 2009 04:34 AM

October 29, 2009

Silvia Pfeiffer

New proposal for captions and other timed text for HTML5

The first specification for how to include captions, subtitles, lyrics, and similar time-aligned text with HTML5 media elements has received a lot of feedback – probably because there are several demos available.

The feedback has encouraged me to develop a new specification that addresses the concerns and makes it easier to associate out-of-band time-aligned text (i.e. subtitles stored in separate files to the video/audio file). A simple example of the new specification using srt files is this:

<video src="video.ogv" controls>
  <itextlist category="CC">
    <itext src="caption_en.srt" lang="en"/>
    <itext src="caption_de.srt" lang="de"/>
    <itext src="caption_fr.srt" lang="fr"/>
    <itext src="caption_jp.srt" lang="ja"/>
  </itextlist>
</video>

By default, the charset of the itext file is UTF-8, and the default format is text/srt (incidentally a mime type that still needs to be registered). Also by default the browser is expected to select for display the track that matches the set default language of the browser. This has been proven to work well in the previous experiments.
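As a rough illustration of that default selection, the following sketch picks a track by comparing language tags. The function name `pickDefaultTrack`, the fallback to the primary language subtag, and the plain-object representation of the itext elements are all assumptions for illustration; the proposal itself only says that the track matching the browser's default language should be chosen.

```javascript
// Hypothetical helper: choose which itext track to display by default.
// `tracks` mirrors the src/lang attributes of a few itext elements.
function pickDefaultTrack(tracks, browserLang) {
  const wanted = browserLang.toLowerCase();
  // Prefer an exact match, e.g. "de" against "de".
  const exact = tracks.find((t) => t.lang.toLowerCase() === wanted);
  if (exact) return exact;
  // Assumed fallback: compare only the primary subtag, e.g. "de-AT" -> "de".
  const primary = wanted.split("-")[0];
  const partial = tracks.find((t) => t.lang.toLowerCase().split("-")[0] === primary);
  return partial || null; // null means no track is shown by default
}

const captionTracks = [
  { src: "caption_en.srt", lang: "en" },
  { src: "caption_de.srt", lang: "de" },
  { src: "caption_fr.srt", lang: "fr" },
];
const chosenTrack = pickDefaultTrack(captionTracks, "de-AT");
console.log(chosenTrack.src);
```

In a browser, the second argument could come from `navigator.language`.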

Check out the new itext specification, read on to get an introduction to what has changed, and leave me your feedback if you can!

The itextlist element
You will have noticed that in comparison to the previous specification, this specification contains a grouping element called “itextlist”. This is necessary because we have to distinguish between alternative time-aligned text tracks and ones that can be additional, i.e. displayed at the same time. In the first specification this was done by inspecting each itext element’s category and grouping them together, but that resulted in much repetition and unreadable specifications.

Also, it was not clear which itext elements were to be displayed in the same region and which in different ones. Now, their styling can be controlled uniformly.

The final advantage is that association of callbacks for entering and leaving text segments as extracted from the itext elements can now be controlled from the itextlist element in a uniform manner.

This change also makes it simple for a parser to determine the structure of the menu that is created and included in the controls element of the audio or video element.

Incidentally, a patch for Firefox already exists that makes this part of the browser. It does not yet support this new itext specification, but here is a screenshot that Felipe Corrêa da Silva Sanches created to demonstrate it:

screenshot of subtitle menu included in Firefox

If several itextlist elements are specified, that menu will receive sub-menus – one for each itextlist. An example is the following:

<video src="video.ogv" aria-label="test video" controls>
  <itextlist category="SUB" name="subtitles">
    <itext src="sub_en.srt" lang="en"/>
    <itext src="sub_de.srt" lang="de"/>
    <itext src="sub_fr.srt" lang="fr"/>
    <itext src="sub_jp.srt" lang="ja"/>
  </itextlist>
  <itextlist category="TAD" name="spoken transcript">
    <itext id="tad_en" src="tad_en.srt" lang="en"/>
    <itext id="tad_jp" src="tad_jp.srt" lang="ja"/>
  </itextlist>
</video>

which will result in the following menu structure:

text
- subtitles
-- English
-- German
-- French
-- Japanese
-- none
- spoken transcript
-- English
-- Japanese
-- none

Similarly, a context menu would use the same structure.
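A minimal sketch of how a parser might derive that menu from the itextlist structure is shown below. The `buildMenu` function, the `LANGUAGE_NAMES` table, and the plain-object input are hypothetical; a real browser would walk the DOM and use its own localisation data.

```javascript
// Hypothetical lookup table for display names ("ja" is the ISO 639-1 code for Japanese).
const LANGUAGE_NAMES = { en: "English", de: "German", fr: "French", ja: "Japanese" };

// Build the menu as a list of text lines, one sub-menu per itextlist.
function buildMenu(itextLists) {
  const lines = ["text"];
  for (const list of itextLists) {
    lines.push("- " + list.name);
    for (const lang of list.langs) {
      // Fall back to the raw language code if no display name is known.
      lines.push("-- " + (LANGUAGE_NAMES[lang] || lang));
    }
    // Every sub-menu ends with an entry to switch that list off.
    lines.push("-- none");
  }
  return lines;
}

const menuLines = buildMenu([
  { name: "subtitles", langs: ["en", "de", "fr", "ja"] },
  { name: "spoken transcript", langs: ["en", "ja"] },
]);
console.log(menuLines.join("\n"));
```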

Callbacks on timed text segments
This specification further introduces callbacks on time-aligned text segments: onenter and onleave. At this stage this is an idea I am experimenting with, but I believe it has lots of potential to allow people to do fancy things when subtitles appear or disappear. Some ideas are: to have a specific picture displayed that relates to the text segment, to have text in another area of the display change e.g. because we have moved into a different part of the full text transcript, or to display Google ads that relate to the text in that particular text segment.

I am curious about feedback on this idea. It relates closely to the idea of cue ranges that was previously part of HTML5.

It is possible to achieve this effect simply through adding a timeupdate event listener, but proper callbacks like these are much more efficient.
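For comparison, here is a sketch of the timeupdate-based approach. The cue objects with `start`, `end` and `text` fields and the `makeCueWatcher` helper are assumptions for illustration, not part of the itext proposal. The cue logic is kept as a pure function so it can be tried outside a browser; the DOM wiring appears only as a comment.

```javascript
// Emulate onenter/onleave by checking every cue against the current time.
function makeCueWatcher(cues, onenter, onleave) {
  const active = new Set(); // indices of cues currently being displayed
  // Call the returned function with the playback time on each timeupdate event.
  return function update(currentTime) {
    cues.forEach((cue, i) => {
      const inside = currentTime >= cue.start && currentTime < cue.end;
      if (inside && !active.has(i)) {
        active.add(i);
        onenter(cue);
      } else if (!inside && active.has(i)) {
        active.delete(i);
        onleave(cue);
      }
    });
  };
}

// Browser wiring (not executed here):
// const video = document.querySelector("video");
// video.addEventListener("timeupdate", () => updateCues(video.currentTime));

// Simulated playback times instead of real events.
const cueLog = [];
const updateCues = makeCueWatcher(
  [{ start: 1, end: 3, text: "Hello" }, { start: 3, end: 5, text: "World" }],
  (cue) => cueLog.push("enter " + cue.text),
  (cue) => cueLog.push("leave " + cue.text)
);
[0, 1.5, 2.9, 3.2, 6].forEach((t) => updateCues(t));
console.log(cueLog);
```

Note that this scans every cue on every event and only reacts when an event happens to fire, so short segments can be missed between two events. That is the kind of overhead and imprecision dedicated callbacks would avoid.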

Synchronisation adjustments
Another addition to the itext element is the introduction of two attributes that together allow fixing synchronisation issues in the timing between the video (or audio) and the itext track. The two attributes are “delay” and “stretch”.

“delay” allows specification of a negative or positive float value that represents the number of seconds by which to delay the display of the itext text segments relative to the timing of the video (or audio) element.

“stretch” allows fixing a constant drift in timing between the video (or audio) element and the text segments. It is given in percent, where 100% means no time stretch, 97% means getting the text segments 3% faster than their actual timing, and 108% means 8% slower.

These attributes are relevant since itext files are independent resources to the media resource and can therefore synchronise to a different clock than the media files. It happens frequently with srt files that are being used for differently encoded video files.
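One possible reading of the two attributes as arithmetic is sketched below. It assumes that stretch is applied to the original cue times first and the delay is added afterwards; the order is an assumption, and `adjustCue` is a hypothetical helper name.

```javascript
// Apply "stretch" (in percent) and "delay" (in seconds) to one text segment.
function adjustCue(cue, delaySeconds = 0, stretchPercent = 100) {
  const factor = stretchPercent / 100; // 97 -> 0.97, i.e. text arrives 3% earlier
  return {
    start: cue.start * factor + delaySeconds,
    end: cue.end * factor + delaySeconds,
    text: cue.text,
  };
}

// A segment authored for 100 s to 104 s, with delay="0.5" and stretch="97",
// would be displayed from roughly 97.5 s to roughly 101.4 s.
const adjustedCue = adjustCue({ start: 100, end: 104, text: "Hi" }, 0.5, 97);
console.log(adjustedCue.start.toFixed(2), adjustedCue.end.toFixed(2));
```

Because stretch scales with the timestamp, its effect grows over the length of the file, which is what makes it suitable for correcting a constant drift, while delay shifts every segment by the same amount.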

Further feedback
I am currently experimenting with creating the same kind of JavaScript API for in-line annotation tracks through extending some Firefox patches. It is exciting to see it all come together.

At the same time, I am sure there is still feedback that will further improve the specification and I encourage you to contribute. I have set up a wiki page where you can leave your feedback. Also feel free to drop me an email or leave a comment on this blog post. Thanks!

UPDATE 30th Oct 2009:
There is now also a working implementation that demonstrates the approach with itextlist. Check out http://www.annodex.net/~silvia/itext/elephant_no_skin_v2.html, which will not look much different to the previous version, but does indeed behave very differently.

by silvia at October 29, 2009 01:27 PM

October 28, 2009

Silvia Pfeiffer

Cortado 0.5.0 released

Cortado is a Java applet that provides support for Ogg Theora/Vorbis to Web publishers. It’s particularly useful to publishers that want to use Ogg Theora/Vorbis in browsers that do not yet support the HTML5 video element with Ogg.

Cortado was originally developed by Fluendo SA under an LGPL license and contains a re-implementation of Theora and Vorbis in Java (jheora and jcraft). After a few years of low maintenance, the Wikimedia Foundation took it upon themselves to dust off the code for use in the Wikimedia Commons, where only unencumbered open video formats are acceptable.

As Ralph states in his announcement of the new release: earlier this year, Xiph.org took over maintenance of the Cortado Java applet to help concentrate interest and expertise on this important component of the free media codec infrastructure. Therefore, the official website for Cortado is now part of the Xiph.org website. [If somebody could update the Wikipedia article - that would be awesome!]

So, I am very happy to point to the first Cortado release in three years. Source and sample builds are available from the Xiph.org download site.

Ralph writes further:

The new version is tagged 0.5.0 to indicate both the change in hosting and the significant new support for files from the new libtheora encoder implementation and Kate embedded subtitles.

In particular, 0.5.0 has:

  • Support for files encoded with Theora 1.1
  • Faster YUV to RGB conversion with better results
  • Basic support for embedded Ogg Kate streams
  • Seeking fixed for files with an Ogg Skeleton track
  • Maintained compatibility with the Microsoft VM

This is an awesome example of the power of open source and what a group of people can achieve. Congratulations to everyone at Xiph, Wikipedia, and anyone else who contributed to the release!

by silvia at October 28, 2009 08:04 AM