The first specification for how to include captions, subtitles, lyrics, and similar time-aligned text with HTML5 media elements has received a lot of feedback – probably because there are several demos available.
The feedback has encouraged me to develop a new specification that addresses the concerns raised and makes it easier to associate out-of-band time-aligned text (i.e. subtitles stored in files separate from the video/audio file). A simple example of the new specification using srt files is this:
<video src="video.ogv" controls>
<itextlist category="CC">
<itext src="caption_en.srt" lang="en"/>
<itext src="caption_de.srt" lang="de"/>
<itext src="caption_fr.srt" lang="fr"/>
<itext src="caption_jp.srt" lang="ja"/>
</itextlist>
</video>
By default, the charset of the itext file is UTF-8, and the default format is text/srt (incidentally, a MIME type that still needs to be registered). Also by default, the browser is expected to select for display the track that matches the browser's default language setting. This worked well in the previous experiments.
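As a rough illustration of that default selection, here is a minimal sketch in JavaScript. The function name pickDefaultTrack, the plain-object representation of the itext elements, the fallback from a regional tag such as "de-AT" to "de", and returning null when nothing matches are all assumptions made for this example; the text above only says that the track matching the browser's default language is selected.

```javascript
// Hypothetical helper, not part of the itext specification.
// tracks: array of { src, lang } objects mirroring <itext> attributes.
// browserLang: a language tag such as navigator.language would return.
function pickDefaultTrack(tracks, browserLang) {
  const wanted = browserLang.toLowerCase();

  // Prefer an exact match, e.g. "de" against lang="de".
  const exact = tracks.find(t => t.lang.toLowerCase() === wanted);
  if (exact) return exact;

  // Assumed fallback: compare primary subtags only, so "de-AT" matches "de".
  const primary = wanted.split('-')[0];
  const loose = tracks.find(t => t.lang.toLowerCase().split('-')[0] === primary);

  // Assumed behaviour when nothing matches: no track is displayed.
  return loose || null;
}

const captions = [
  { src: 'caption_en.srt', lang: 'en' },
  { src: 'caption_de.srt', lang: 'de' },
  { src: 'caption_fr.srt', lang: 'fr' },
];

console.log(pickDefaultTrack(captions, 'de-AT'));
```

In a browser, the second argument would typically come from navigator.language.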
Check out the new itext specification, read on to get an introduction to what has changed, and leave me your feedback if you can!
The itextlist element
You will have noticed that in comparison to the previous specification, this specification contains a grouping element called “itextlist”. This is necessary because we have to distinguish between alternative time-aligned text tracks and ones that can be additional, i.e. displayed at the same time. In the first specification this was done by inspecting each itext element’s category and grouping them together, but that resulted in much repetition and unreadable specifications.
Also, it was not clear which itext elements were to be displayed in the same region and which in different ones. Now, their styling can be controlled uniformly.
The final advantage is that callbacks for entering and leaving text segments, as extracted from the itext elements, can now be attached on the itextlist element in a uniform manner.
This change also makes it simple for a parser to determine the structure of the menu that is created and included in the controls element of the audio or video element.
Incidentally, a patch for Firefox already exists that makes this part of the browser. It does not yet support this new itext specification, but here is a screenshot that Felipe Corrêa da Silva Sanches created to demonstrate it:

If several itextlist elements are specified, that menu will receive sub-menus – one for each itextlist. An example is the following:
<video src="video.ogv" aria-label="test video" controls>
<itextlist category="SUB" name="subtitles">
<itext src="sub_en.srt" lang="en"/>
<itext src="sub_de.srt" lang="de"/>
<itext src="sub_fr.srt" lang="fr"/>
<itext src="sub_jp.srt" lang="ja"/>
</itextlist>
<itextlist category="TAD" name="spoken transcript">
<itext id="tad_en" src="tad_en.srt" lang="en"/>
<itext id="tad_jp" src="tad_jp.srt" lang="ja"/>
</itextlist>
</video>
which will result in the following menu structure:
text
- subtitles
-- English
-- German
-- French
-- Japanese
-- none
- spoken transcript
-- English
-- Japanese
-- none
Similarly, a context menu would use the same structure.
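To show how little a parser needs to do to derive that menu, here is a small sketch. The buildMenu function, the LANGUAGE_NAMES lookup table, and the object shape standing in for the parsed itextlist elements are all hypothetical; a real browser would read the DOM and use its own language-name data.

```javascript
// Hypothetical lookup from language code to display name.
// Both "ja" (the ISO 639-1 code) and "jp" are accepted for Japanese.
const LANGUAGE_NAMES = {
  en: 'English',
  de: 'German',
  fr: 'French',
  ja: 'Japanese',
  jp: 'Japanese',
};

// itextlists: array of { name, langs } objects, one per <itextlist>,
// where langs holds the lang attribute of each contained <itext>.
function buildMenu(itextlists) {
  const lines = ['text'];
  for (const list of itextlists) {
    lines.push('- ' + list.name);
    for (const lang of list.langs) {
      // Fall back to the raw code if no display name is known.
      lines.push('-- ' + (LANGUAGE_NAMES[lang] || lang));
    }
    // Every sub-menu ends with an entry that switches the track off.
    lines.push('-- none');
  }
  return lines;
}

const menu = buildMenu([
  { name: 'subtitles', langs: ['en', 'de', 'fr', 'ja'] },
  { name: 'spoken transcript', langs: ['en', 'ja'] },
]);

console.log(menu.join('\n'));
```

Because each itextlist maps to exactly one sub-menu, the function needs no grouping step, which is the simplification the grouping element was meant to bring.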
Callbacks on timed text segments
This specification further introduces callbacks on time-aligned text segments: onenter and onleave. At this stage this is an idea I am experimenting with, but I believe it has a lot of potential to let people do fancy things when subtitles appear or disappear. Some ideas: display a specific picture that relates to the text segment; change text in another area of the display, e.g. because we have moved into a different part of the full text transcript; or display Google ads that relate to the text in that particular segment.
I am curious about feedback on this idea. It relates closely to the idea of cue ranges that was previously part of HTML5.
It is possible to achieve this effect simply through adding a timeupdate event listener, but proper callbacks like these are much more efficient.
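For comparison, this is roughly what the timeupdate-based workaround looks like. It is only a sketch: createSegmentTracker and the { start, end, text } segment shape are names I made up for this example, and the segments are assumed to be already parsed and non-overlapping.

```javascript
// Hypothetical emulation of onenter/onleave on top of periodic time updates.
// segments: array of { start, end, text } with times in seconds.
// handlers: { onenter, onleave }, each called with the affected segment.
function createSegmentTracker(segments, handlers) {
  let active = null; // segment currently displayed, or null

  // Call the returned function with the current playback time.
  return function update(currentTime) {
    // A linear search on every tick: this is the cost the native
    // callbacks would avoid.
    const now = segments.find(
      s => currentTime >= s.start && currentTime < s.end
    ) || null;

    if (now === active) return; // still inside the same segment (or gap)
    if (active && handlers.onleave) handlers.onleave(active);
    if (now && handlers.onenter) handlers.onenter(now);
    active = now;
  };
}

// Browser wiring (assumes a <video> element is available as `video`):
//   const update = createSegmentTracker(segments, { onenter, onleave });
//   video.addEventListener('timeupdate', () => update(video.currentTime));
```

The listener runs on every timeupdate event whether or not a segment boundary was crossed, and it can only notice a boundary as precisely as those events fire.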
Synchronisation adjustments
Another addition to the itext element is the introduction of two attributes that together allow fixing synchronisation issues in the timing between the video (or audio) and the itext track. The two attributes are “delay” and “stretch”.
“delay” allows specification of a negative or positive float value that represents the number of seconds by which to delay the display of the itext text segments relative to the timing of the video (or audio) element.
“stretch” allows fixing a constant drift in timing between the video (or audio) element and the text segments. It is given in percent, where 100% means no time stretch, 97% means the text segments are displayed 3% faster than their actual timing, and 108% means 8% slower.
These attributes are relevant because itext files are resources independent of the media resource and can therefore be synchronised to a different clock than the media files. This happens frequently when srt files are used with differently encoded video files.
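The arithmetic behind the two attributes can be sketched as follows. The adjustSegment function is hypothetical, and applying the stretch before the delay is an assumption on my part; the order matters whenever both attributes are set.

```javascript
// Hypothetical helper: map a segment's timing from the srt file's clock
// onto the media element's clock.
// delay: seconds, may be negative (default 0, i.e. no shift).
// stretch: percent, 100 means no time stretch (default 100).
function adjustSegment(segment, delay = 0, stretch = 100) {
  // Assumed order: scale first to remove the drift, then shift.
  const adjust = t => (t * stretch) / 100 + delay;
  return {
    ...segment,
    start: adjust(segment.start),
    end: adjust(segment.end),
  };
}

// A segment authored for 100s..200s, shown 3% faster and half a second earlier.
console.log(adjustSegment({ start: 100, end: 200, text: 'Hello' }, -0.5, 97));
```

With stretch at 97, a segment authored at 100 seconds is displayed at 97 seconds before any delay is applied, which matches the "3% faster" reading above.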
Further feedback
I am currently experimenting with creating the same kind of JavaScript API for in-line annotation tracks through extending some Firefox patches. It is exciting to see it all come together.
At the same time, I am sure there is still feedback that will further improve the specification and I encourage you to contribute. I have set up a wiki page where you can leave your feedback. Also feel free to drop me an email or leave a comment on this blog post. Thanks!
UPDATE 30th Oct 2009:
There is now also a working implementation that demonstrates the approach with itextlist. Check out http://www.annodex.net/~silvia/itext/elephant_no_skin_v2.html, which will not look much different to the previous version, but does indeed behave very differently.