Planet Xiph

July 01, 2009

Silvia Pfeiffer

Open Video Conference Working Group: HTML5 and <video>

At the recent Open Video Conference, I was asked to chair a working group on HTML5 and the <video> tag. Since the conference had attracted a large number of open media software developers as well as HTML5 <video> tag developers, it was a great group of people that were on the panel with me: Philip Jagenstedt from Opera, Jan Gerber from Xiph, Viktor Gal from Annodex, Michael Dale from Metavid, and Eric Carlson from Apple. This meant we had three browser vendors and their <video> tag developers present as well as two javascript library developers representing some of the largest content sites that are already using Ogg Theora/Vorbis with the <video> tag, plus myself looking into accessiblity for <video>.

The biggest topic around the <video> tag is of course the question of baseline codec: which codec can and should become the required codec for anyone implementing <video> tag support. Fortunately, this discussion was held during the panel just ahead of ours. Thus, our panel was able to focus on the achievements of the HTML5 video tag and implementations of it, as well as the challenges still ahead.

Unfortunately, the panel was cut short at the conference to only 30 min, so we ended up doing mostly demos of HTML5 video working in different browsers and doing cool things such as working with SVG.

The challenges that we identified and that are still ahead to solve are:

  • annotation support: closed captions, subtitles, time-aligned metadata, and their DOM exposure
  • track selection: how to select between alternate audio tracks, alternate annotation tracks, based on e.g. language, or accessibility requirements; what would the content negotiation protocol look like
  • how to support live streaming
  • how to support in-browser a/v capture
  • how to support live video communication (skype-style)
  • how to support video playlists
  • how to support basic video editing functionality
  • what would a decent media server for html5 video look like; what capabilities would it have

Here are the slides we made for the working group.

Download PDF: Open Video Conference: HML5 and video Panel

by silvia at July 01, 2009 07:49 AM

June 29, 2009

Silvia Pfeiffer

Mario is dead and dismantled

Beloved Shuttle Box

Beloved Shuttle Box

Six weeks ago, on a fatal Saturday, both my washing machine and cute little Mario died in one day. The washing machine was quickly repaired, but there was no hope for Mario, as the burnt smell of electronics indicated. It wasn’t going to start up again.

Mario had been the first server to run the code developed at Vquence. It was our development and testing server for more than 8 months until we moved to a server at The Planet – later to Voxel and now ultimately to Amazon.

After it was relieved off Vquence duty, Mario became what it was originally bought to become: a media server. Running Linux and MythTV, it was the beloved center of our living room for the last 2 years. But it seems the heavy duty VCR work as well as running Linux exhausted him.

Well, it is now replaced by an ordinary HP machine – I will miss the cute little shuttle.

If anyone wants the remains, let me know.

by silvia at June 29, 2009 01:25 AM

June 28, 2009

Silvia Pfeiffer

A review of the W3C Timed Text Authoring Format

The W3C has published a third last call for the draft specification of DFXP, the Distribution Format Exchange Profile for the Timed Text Authoring Format – or short: for their new standard format for captions. Comments are due by the 30th June, so rush if you want to give any feedback. Here is what came to my mind as I was reading the 183 pages long document.

Please note: This review looks at DFXP from a Web view, i.e. how compatible is it with existing Web technologies, since my main use case will be on the Web, even if advocates will say that that’s not it’s main purpose, strangely enough, for a standard coming out of the W3C.

The state of affairs with caption formats

When it comes to caption and subtitles, there is no lack of formats. It seems, because it is an easy challenge to define a data format for something as simple as a piece of text and some timing information, every new project that wanted to deal with captions – or more generally timed text – created their own format. I am no exception to the rule. :-)

Thus, the current state of affairs wrt timed text is that there are many different textual file formats to store such data, there are also many different video container formats each with their own data format (or even formats) for embedding timed text into them, and there is a lot of software that will deal with many input, output and encapsulation formats.

The problem with this situation is that the formats are all different in their complexity. The simple “piece of text and timing information” problem can be turned into as complex a problem as you desire. By adding layout information, styling information, animation functionality, metadata about the video and about the content, and possibly hyperlinks, we have ended up in a large mess of incompatible formats.

The aim of W3C Timed Text

The W3C Timed Text working group was chartered in January 2003 to attack this issue. It was supposed to become the super-format of all possible functionalities for timed text formats and therefore a perfect interchange format between applications (see requirements document). Its focus was for use on the Web and with SMIL and to make use of existing W3C technologies where possible

However, the history of captioning is TV and the scope of Timed Text is beyond mere use on the Web, so while W3C Timed Text took a lot of inspiration from other Web standards, it has become a stand-alone standard that does not rely on, e.g. the availability of a CSS engine, and it has no in-built hyperlinking functionality (see what requirements it fulfills).

Dissecting DFXP

So. let’s look into some of what DFXP provides.

Here is an example file taken straight from the draft – check the presentation here:

<tt xml:lang="" xmlns="http://www.w3.org/2006/10/ttaf1">
  <head>
    <metadata xmlns:ttm="http://www.w3.org/2006/10/ttaf1#metadata">
      <ttm:title>Timed Text DFXP Example</ttm:title>
      <ttm:copyright>The Authors (c) 2006</ttm:copyright>
    </metadata>

    <styling xmlns:tts="http://www.w3.org/2006/10/ttaf1#styling">
      <!-- s1 specifies default color, font, and text alignment -->
      <style xml:id="s1"
                 tts:color="white"
                 tts:fontFamily="proportionalSansSerif"
                 tts:fontSize="22px"
                 tts:textAlign="center" />
      <!-- alternative using yellow text but otherwise the same as style s1 -->
      <style xml:id="s2" style="s1" tts:color="yellow"/>
      <!-- a style based on s1 but justified to the right -->
      <style xml:id="s1Right" style="s1" tts:textAlign="end" />
      <!-- a style based on s2 but justified to the left -->
      <style xml:id="s2Left" style="s2" tts:textAlign="start" />
    </styling>

    <layout xmlns:tts="http://www.w3.org/2006/10/ttaf1#styling">
      <region xml:id="subtitleArea"
                   style="s1"
                   tts:extent="560px 62px"
                   tts:padding="5px 3px"
                   tts:backgroundColor="black"
                   tts:displayAlign="after" />
    </layout>
  </head>
  <body region="subtitleArea">
    <div>
      <p xml:id="subtitle1" begin="0.76s" end="3.45s">
        It seems a paradox, does it not,
      </p>
      <p xml:id="subtitle2" begin="5.0s" end="10.0s">
        that the image formed on<br/>
        the Retina should be inverted?
      </p>
      <p xml:id="subtitle3" begin="10.0s" end="16.0s" style="s2">
        It is puzzling, why is it<br/>
        we do not see things upside-down?
      </p>
      <p xml:id="subtitle4" begin="17.2s" end="23.0s">
        You have never heard the Theory,<br/>
        then, that the Brain also is inverted?
      </p>
      <p xml:id="subtitle5" begin="23.0s" end="27.0s" style="s2">
        No indeed! What a beautiful fact!
      </p>
      <p xml:id="subtitle6a" begin="28.0s" end="34.6s" style="s2Left">
        But how is it proved?
      </p>
      <p xml:id="subtitle6b" begin="28.0s" end="34.6s" style="s1Right">
        Thus: what we call
      </p>
      <p xml:id="subtitle7" begin="34.6s" end="45.0s" style="s1Right">
        the vertex of the Brain<br/>
        is really its base
      </p>
      <p xml:id="subtitle8" begin="45.0s" end="52.0s" style="s1Right">
        and what we call its base<br/>
        is really its vertex,
      </p>
      <p xml:id="subtitle9a" begin="53.5s" end="58.7s">
        it is simply a question of nomenclature.
      </p>
      <p xml:id="subtitle9b" begin="53.5s" end="58.7s" style="s2">
        How truly delightful!
      </p>
    </div>
  </body>
</tt>

I’m going to look at each of the different functionalities separately and discuss their strengths and weaknesses.

Content

Let’s begin with the body of the DFXP document and what elements are defined for this area.

Firstly, <body> comes with optional begin, end, and dur attributes. As is the case for all time specifications in DFXP, there are both “end” and “dur” attributes. Why this over-specification? There is not even an explanation which of the two has higher priority when in conflict. This is plainly asking for trouble – why not simplify the spec?

The “region” and “style” attributes refer to a previously defined region and styles that are applied to the body. “id” and “lang” attributes allow to associate a name and a language with the body.

The “timeContainer” attribute enables the author to specify whether the elements in the body are all to be regarded as temporally parallel or in sequence, the default being parallel. This means that all text elements specified inside the body can render over the top of each other – a situation that is solved by giving them specific start and end times.

The containing elements of body are a sequence of <div> tags. The div element functions as a logical container and a temporal structuring element for a sequence of textual content units. div elements like body elements are allowed a “start”, “end” and “dur” attribute and generally everything that the body element also has, except that their children can be more div or p. Again, the children of the div element are all regarded as being temporally parallel.

The p element is basically the inner-most element that contains the actual text, including new-lines (br) and spans to associate further styling, metadata, or animations. The children of the p or span element are also all regarded as being temporally parallel, unless otherwise specified.

The structuring of text into div, p, and span elements seems to make sense and provide sufficient (if not even excessive) flexibility for any required timed text needs.

Layout

Once the text is specified and structured, the next question is where it should be positioned.

The extent attribute of the <tt> root element specifies the width and height of the root container, if not specified by the external authoring context.

Inside the root container, regions are defined through explicit <region> elements. The origin of placement for a region is the top left corner. Regions can define their “origin” offset, their “width” and “height”, the alignment of text within them through the “textAlign” and “displayAlign” styles, and whether text that “overflows” a region should be visible or hidden.

The way in which DFXP defines regions and placement of text within regions is very different to the way in which HTML and CSS work. By default, elements in HTML flow one after another in the same order as they appear in the source. CSS attributes applied to the elements can control their positioning through giving coordinates, or relative placements in relation to other elements. In DFXP elements are placed inside regions that are styled, making it incompatible with HTML.

Styling

The styling attributes available for DFXP are limited, but sufficient for timed text purposes. The way in which style associations to elements are resolved is quite diverse. Styles can be associated with regions, with individual elements, individually and as a group, through layouts and through parent elements. Compared to CSS, it feels complicated and potentially full of contradictions.

Animation

Further to styling, DFXP defines animations, which are discrete changes to some style parameter value that applies over some time interval. This is relevant for example to implement karaoke style colouring of text over time.

Metadata

The <metadata> element serves as a generic container for grouping metadata information. It can be associated virtually with any element – which seems somewhat over-flexible, but provides for interesting meta data information such as meta data for styles or for a <br>.

In addition, metadata is actually limited to a set number of elements: title, desc, copyright, agent, name, and actor. These are strange fields – in particular if you compare them to the flexibility of HTML meta data, which consists of free-form name-value pairs, bringing us domain-specific schemes such as the Dublin Core. This is not easily possible here, but instead one has to define extensions to allow for such flexible meta data.

Other features

DFXP provides other features such as information that describes the related video file, e.g. frameRate, subFrameRate, frameRateMultiplier, pixelAspectRatio, smpteMode, timeBase, and tickRate. Such information will help at the point in time when DFXP is supposed to be multiplexed into a binary media file together with audio and video tracks. These attributes can provide information required for the multiplexing process. I am not sure that justifies their existence though.

Other, minor features are available too. Check out the full specification to get a complete picture.

Examples

Part of the publication of this draft is also a test suite. Several of the defined features are still not represented in the test suite, which to me raises the question if they are really required. It might do wonders to the draft size to remove them.

Summary

DFXP is a standard for timed text that is firmly grounded in past captioning specifications, but written in XML, and borrowing ideas from Web technologies. It is unfortunately not re-using existing Web infrastructure to implement its more complex features: no use of CSS for styling and layout, no use of hyperlinks. Also, the use of namespaces seems excessive and won’t make it easy to author this format, in particular since the defined namespaces do not map into the defined profiles.

DFXP is, however, simple to transcode to something that a Web Browser can deal with through its existing engines, because it has borrowed from other Web standards. It is thus easier to work with on the Web than most other formats. It should be relatively easy to map to HTML, CSS and javascript, as already started in the test suite with the HTML5 video element.

DFXP is witten in such a way that it is possible to put together a new profile with extensions that are more appropriate for specific needs, e.g. that fit better into existing Web infrastructure. Currently, DFXP has three defined profiles: one focused on transformation, one focused on presentation, and one that contains everything.

I think it’s time for a html5 profile of DFXP that at minimum extends DFXP with hyperlinks, making it a real timed text Web format.

by silvia at June 28, 2009 04:11 PM

June 27, 2009

Chris Double

Playing Ogg files with audio and video in sync

My last post in this series had Vorbis audio playing but with Theora video out of sync. This post will go through an approach to keeping the video in sync with the audio.

To get video in sync with the audio we need a timer incrementing from when we start playback. We can't use the system clock for this as it is not necessarily keeping the same time as the audio or video being played. The system clock can drift slightly and over time this audio and video to get out of sync.

The audio library I'm using, libsydneyaudio, has an API call that allows getting the playback position of the sound sample being played by the audio system. This is a value in bytes. Since we know the sample rate and number of channels of the audio stream we can compute a time value from this. Synchronisation becomes a matter of continuously feeding the audio to libsydneybackend, querying the current position, converting it to a time value, and displaying the frame for that time.

The time for a particular frame is returned by the call to th_decode_packetin. The last parameter is a pointer to hold the 'granulepos' of the decoded frame. The Theora spec explains that the granulepos can be used to compute the time that this frame should be displayed up to. That is, when this time is exceeded this frame should no longer be displayed. It also enables computing the location of the keyframe that this frame depends on - I'll cover what this means when I write about how to do seeking.

The libtheora API th_granule_time converts a 'granulepos' to an absolute time in seconds. So decoding a frame gives us 'granulepos'. We store this so we know when to stop displaying the frame. We track the audio position, convert it to a time. If it exceeds this value we decode the next frame and display that. Here's a breakdown of the steps:

  1. Read the headers from the Ogg file. Stop when we hit the first data packet.
  2. Read packets from the audio stream in the Ogg file. For each audio packet:
    1. Decode the audio data and write it to the audio hardware.
    2. Get the current playback position of the audio and convert it to an absolute time value.
    3. Convert the last granulepos read (defaulting to zero if none have been read) to an absolute time value using the libtheora API.
    4. If the audio time is greater than the video time:
      1. Read a packet from the Theora stream.
      2. Decode that packet and display it
      3. Store the granulepos from that decoded frame so we know when to display the next frame.

Notice that the structure of the program is different to the last few articles. We no longer read all packets from the stream, processing them as we get them. Instead we specifically process the audio packets and only handle the video when it's time to display them. Since we are driving our a/v sync off the audio clock we must continously feed the audio data. I think it tends to be a better user experience to have flawless audio with video frame skipping rather than skipping audio but smooth video. Worse is to have both skipping of course.

The example code for this article is in the 'part4_avsync' branch on github.

This example takes a slightly different approach to reading headers. I use ogg_stream_packetpeek to peek ahead in the bitstream for a packet and do the header processing on the peeked packet. If it is a header I then consume the packet. This is done so I don't consume the first data packet when reading the headers. I want the data packets to be consumed in a particular order (audio, followed by video when needed).

// Process all available header packets in the stream. When we hit
// the first data stream we don't decode it, instead we
// return. The caller can then choose to process whatever data
// streams it wants to deal with.
ogg_packet packet;
while (!headersDone &&
(ret = ogg_stream_packetpeek(&stream->mState, &packet)) != 0) {
assert(ret == 1);

// A packet is available. If it is not a header packet we exit.
// If it is a header packet, process it as normal.
headersDone = headersDone || handle_theora_header(stream, &packet);
headersDone = headersDone || handle_vorbis_header(stream, &packet);
if (!headersDone) {
// Consume the packet
ret = ogg_stream_packetout(&stream->mState, &packet);
assert(ret == 1);
}

To read packets for a particular stream I use a 'read_packet' function that operates on a stream passed as a parameter:

bool OggDecoder::read_packet(istream& is, 
ogg_sync_state* state,
OggStream* stream,
ogg_packet* packet) {
int ret = 0;

while ((ret = ogg_stream_packetout(&stream->mState, packet)) != 1) {
ogg_page page;
if (!read_page(is, state, &page))
return false;

int serial = ogg_page_serialno(&page);
assert(mStreams.find(serial) != mStreams.end());
OggStream* pageStream = mStreams[serial];

// Drop data for streams we're not interested in.
if (stream->mActive) {
ret = ogg_stream_pagein(&pageStream->mState, &page);
assert(ret == 0);
}
}
return true;
}

If we need to read a new page (to be able to get more packets) we check the stream for the read page and if it is not for the stream we want we store the packet in the bitstream for that page so it can be retrieved later. I've added an 'active' flag to the streams so we can ignore streams that we aren't intersted in. We don't want to continuously buffer data for alternative audio tracks we aren't playing for example. The streams are marked inactive when the headers are finished reading.

The code that does the checking to see if it's time to display a frame is:


// At this point we've written some audio data to the sound
// system. Now we check to see if it's time to display a video
// frame.
//
// The granule position of a video frame represents the time
// that that frame should be displayed up to. So we get the
// current time, compare it to the last granule position read.
// If the time is greater than that it's time to display a new
// video frame.
//
// The time is obtained from the audio system - this represents
// the time of the audio data that the user is currently
// listening to. In this way the video frame should be synced up
// to the audio the user is hearing.
//
ogg_int64_t position = 0;
int ret = sa_stream_get_position(mAudio, SA_POSITION_WRITE_SOFTWARE, &position);
assert(ret == SA_SUCCESS);
float audio_time =
float(position) /
float(audio->mVorbis.mInfo.rate) /
float(audio->mVorbis.mInfo.channels) /
sizeof(short);

float video_time = th_granule_time(video->mTheora.mCtx, mGranulepos);
if (audio_time > video_time) {
// Decode one frame and display it. If no frame is available we
// don't do anything.
ogg_packet packet;
if (read_packet(is, &state, video, &packet)) {
handle_theora_data(video, &packet);
video_time = th_granule_time(video->mTheora.mCtx, mGranulepos);
}
}

The code for decoding and display the Theora video is similar to the Theora decoding article. The main difference is we store the granulepos in mGranulepos so we know when to stop displaying the frame.

This version of 'plogg' should play Ogg files with a Theora and Vorbis track in sync. You can test it on the transformers trailer for example. It does not play Theora files with no audio track - we can't synchronise to the audio clock if there is no audio. This can be worked around by falling back to delaying for the required framerate as the previous Theora example did.

The a/v sync is not perfect however. If the video is large and decoding keyframes takes a while then we can fall behind in displaying the video and go out of sync. This is because we only play one frame when we check the time. One approach to fixing this is to decode, but not display, all frames up until the audio time rather than just the next time.

The other issue is that the API call we are using to write to the audio hardware is blocking. This is using up valuable time that we could be using to decode a frame. When the write to the sound hardware returns we have very little time to decode a frame before glitches start appearing in the audio due to buffer underruns. Try playing a larger video (like the Ghostbusters HD Trailer and the audio and video will skip (depending on the speed of your hardware). This isn't a pleasant experience. Because of the blocking audio writes we can't skip more than one frame due to the frame decoding time taking too long causing audio skip.

The fixes for these aren't too complex and I'll go through it in my next article. The basic approach is to move to an asynchronous method of writing the audio, skip displaying frames when needed (to reduce the cost of the YUV decoding), skip decoding frames if possible (depending on location of keyframes we can do this), and to check how much audio data we have queued before decoding to always ensure we won't drop audio while decoding.

With these fixes in place I can play the 1080p Ogg version of Big Buck Bunny on a Macbook laptop (running Arch Linux) with no audio interruption and with a/v syncing correctly. There is a fair amount of frame skipping however but it's a lot more watchable than if you try playing it without these modifications in place. And better than watching with the video lagging further and further behind the longer you watch it. Further improvements can be made to reduce the frame skipping by utilising threads to take advantage of extra core's on the PC.

After the followup article on improving the a/v sync I'll look at covering seeking.


Categories: , , ,

by Chris Double (noreply@blogger.com) at June 27, 2009 11:07 PM

June 26, 2009

Chris Double

Decoding Theora files using libtheora

My last post covered read Ogg files using libogg. The resulting program didn't do much but it covered the basic steps needed to get an ogg_packet which we need to decode the data in the stream. The thing step I want to cover is decoding Theora streams using libtheora.

In the previous post I stored a count of the number of packets in the OggStream object. For theora decoding we need a number of different objects to be stored. I encapsulate this in a TheoraDecode structure:

class TheoraDecode { 
...
th_info mInfo;
th_comment mComment;
th_setup_info *mSetup;
th_dec_ctx* mCtx;
...
};

th_info, th_comment and th_setup_info contain data read from the Theora headers. The Theora stream contains three headers packets. These are the info, comment and setup headers. There is one object for holding each of these as we read the headers. The th_dec_ctx object holds information that the decoder requires to keep track of the decoding process.

th_info and th_comment need to be initialized using th_info_init and th_comment_init. Notice that th_setup_info is a pointer. This needs to be free'd when we're finished with it using th_setup_free. The decoder context object also needs to be free'd. Use th_decode_free. A convenient place to do this is in the TheoraDecode constructor and destructor:

class TheoraDecode {
...
TheoraDecode() :
mSetup(0),
mCtx(0)
{
th_info_init(&mInfo);
th_comment_init(&mComment);
}

~TheoraDecode() {
th_setup_free(mSetup);
th_decode_free(mCtx);
}
...

The TheoraDecode object is stored in the OggStream structure. The OggStream stucture also gets a field holding the type of the stream (Theora, Vorbis, Unknown, etc) and a boolean indicating whether the headers have been read:

class OggStream
{
...
int mSerial;
ogg_stream_state mState;
StreamType mType;
bool mHeadersRead;
TheoraDecode mTheora;
...
};

Once we get the ogg_packet from an Ogg stream we need to find out if it is a Theora stream. The approach I'm using to do this is to attempt to extract a Theora header from it. If this succeeds, it's a Theora stream. th_decode_headerin will attempt to decode a header packet. A return value of '0' indicates that we got a Theora data packet (presumably the headers have been read already). This function gets passed the info, comment, and setup objects and it will populate them with data as it reads the headers:

ogg_packet* packet = ...got this previously...;
int ret = th_decode_headerin(&stream->mTheora.mInfo,
&stream->mTheora.mComment,
&stream->mTheora.mSetup,
packet);
if (ret == TH_ENOTFORMAT)
return; // Not a theora header

if (ret > 0) {
// This is a theora header packet
stream->mType = TYPE_THEORA;
return;
}

assert(ret == 0);
// This is not a header packet. It is the first
// video data packet.
stream->mTheora.mCtx =
th_decode_alloc(&stream->mTheora.mInfo,
stream->mTheora.mSetup);
assert(stream->mTheora.mCtx != NULL);
stream->mHeadersRead = true;
...decode data packet...

In this example code we attempt to decode the header. If it fails it bails out, possibly to try decoding the packet using libvorbis or some other means. If it succeeds the stream is marked as type TYPE_THEORA so we can handle it specially later.

If all headers packets are read and we got the first data packet then we call th_decode_alloc to get a decode context to decode the data.

Once the headers are all read, the next step is to decode each Theora data packet. To do this we first call th_decode_packetin. This adds the packet to the decoder. A return value of '0' means we can get a decoded frame as a result of adding the packet. A call to th_decode_ycbcr_out gets the decoded YUV data, stored in a th_ycbcr_buffer object. This is basically an array of the YUV data.

ogg_int64_t granulepos = -1;
int ret = th_decode_packetin(stream->mTheora.mCtx,
packet,
&granulepos);
assert(ret == 0);

th_ycbcr_buffer buffer;
ret = th_decode_ycbcr_out(stream->mTheora.mCtx, buffer);
assert(ret == 0);
...copy yuv data to SDL YUV overlay...
...display overlay...
...sleep for 1 frame...

The 'granulepos' returned by the th_decode_packetin call holds information regarding the presentation time of this frame, and what frame contains the keyframe that is needed for this frame if it is not a keyframe. I'll write more about this in a future post when I cover synchronising the audio and video. For now it's going to be ignored.

Once we have the YUV data I use SDL to create a surface, and a YUV overlay. This allows SDL to do the YUV to RGB conversion for me. I won't copy the code for this since it's not particularly relevant to using the libtheora API - you can see it in the github repository.

Once the YUV data is blit to the screen the final step is to sleep for the period of one frame so the video can playback at approximately the right framerate. The framerate of the video is stored in the th_info object that we got from the headers. It is represented as the fraction of two numbers:

float framerate = 
float(stream->mTheora.mInfo.fps_numerator) /
float(stream->mTheora.mInfo.fps_denominator);
SDL_Delay((1.0/framerate)*1000);

With all that in place, running the program with an Ogg file containing a Theora stream should play the video at the right framerate. Adding Vorbis playback is almost as easy - the main difficulty is synchronising the audio and video. I'll cover these topics in a later post.


Categories: , ,

by Chris Double (noreply@blogger.com) at June 26, 2009 03:08 PM

June 25, 2009

Chris Double

Decoding Vorbis files with libvorbis

Decoding Vorbis streams require a very similar approach to that used when decoding Theora streams. The public interface to the libvorbis library is very similar to that used by libtheora. Unfortunately the libvorbis documentation doesn't contain an API reference that I could find so I'm following the approached used by the example programs.

Assuming we have already obtained an ogg_packet, the general steps to follow to decode and play Vorbis streams are:

  1. Call vorbis_synthesis_headerin to see if the packet is a Vorbis header packet. This is passed a vorbis_info and vorbis_comment object to hold the information read from those header packets. The return value of this function is zero if the packet is a Vorbis header packet. Unfortunately it doesn't return a value to see that it's a Vorbis data packet. To check for this you need to check if the stream is a Vorbis stream (by knowing you've previously read Vorbis headers from it) and the return value is OV_ENOTVORBIS.
  2. Once all the header packets are read create a vorbis_dsp_state and vorbis_block object. Initialize these with vorbis_synthesis_init and vorbis_block_init respectively. These objects hold the state of the Vorbis decoding. vorbis_synthesis_init is passed the vorbis_info object that was filled in during the reading of the header packets.
  3. For each data packet:
    1. Call vorbis_synthesis passing it the vorbis_block created above and the ogg_packet containing the packet data. If this succeeds (by returning zero), call vorbis_synthesis_blockin passing it the vorbis_dsp_state and vorbis_block objects. This call copies the data from the packet into the Vorbis objects ready for decoding.
    2. Call vorbis_synthesis_pcmout to get an pointer to an array of floating point values for the sound samples. This will return the number of samples in the array. The array is indexed by channel number, followed by sample number. Once obtained this sound data can be sent to the sound hardware to play the audio.
    3. Call vorbis_synthesis_read, passing it the vorbis_dsp_state object and the number of sound samples consumed. This allows you to consume less data than vorbis_synthesis_pcmout returned. This is useful if you can't write all the data to the sound hardware without blocking.

In the example code in the github repository I create a VorbisDecode object that holds the objects needed for decoding. This is similar to the TheoraDecode object mentioned in my Theora post:

class VorbisDecode {
...
vorbis_info mInfo;
vorbis_comment mComment;
vorbis_dsp_state mDsp;
vorbis_block mBlock;
...

VorbisDecode()
{
vorbis_info_init(&mInfo);
vorbis_comment_init(&mComment);
}
};

I added a TYPE_VORBIS value to the StreamType enum and the stream is set to this type when a Vorbis header is successfully decoded:

  int ret = vorbis_synthesis_headerin(&stream->mVorbis.mInfo,
&stream->mVorbis.mComment,
packet);
if (stream->mType == TYPE_VORBIS && ret == OV_ENOTVORBIS) {
// First data packet
ret = vorbis_synthesis_init(&stream->mVorbis.mDsp, &stream->mVorbis.mInfo);
assert(ret == 0);
ret = vorbis_block_init(&stream->mVorbis.mDsp, &stream->mVorbis.mBlock);
assert(ret == 0);
stream->mHeadersRead = true;
handle_vorbis_data(stream, packet);
}
else if (ret == 0) {
stream->mType = TYPE_VORBIS;
}
}

The example program uses libsydneyaudio for audio output. This requires sound samples to be written as signed short values. When I get the floating point data from Vorbis I convert this to signed short and send it to libsydneyaudio:

  int ret = 0;

if (vorbis_synthesis(&stream->mVorbis.mBlock, packet) == 0) {
ret = vorbis_synthesis_blockin(&stream->mVorbis.mDsp, &stream->mVorbis.mBlock);
assert(ret == 0);
}

float** pcm = 0;
int samples = 0;
while ((samples = vorbis_synthesis_pcmout(&stream->mVorbis.mDsp, &pcm)) > 0) {
if (!mAudio) {
ret = sa_stream_create_pcm(&mAudio,
NULL,
SA_MODE_WRONLY,
SA_PCM_FORMAT_S16_NE,
stream->mVorbis.mInfo.rate,
stream->mVorbis.mInfo.channels);
assert(ret == SA_SUCCESS);

ret = sa_stream_open(mAudio);
assert(ret == SA_SUCCESS);
}

if (mAudio) {
short buffer[samples * stream->mVorbis.mInfo.channels];
short* p = buffer;
for (int i=0;i < samples; ++i) {
for(int j=0; j < stream->mVorbis.mInfo.channels; ++j) {
int v = static_cast<int<(floorf(0.5 + pcm[j][i]*32767.0));
if (v > 32767) v = 32767;
if (v <-32768) v = -32768;
*p++ = v;
}
}

ret = sa_stream_write(mAudio, buffer, sizeof(buffer));
assert(ret == SA_SUCCESS);
}

ret = vorbis_synthesis_read(&stream->mVorbis.mDsp, samples);
assert(ret == 0);
}
}

A couple of minor changes were also made to the example program:

  • Continue processing pages and packets when the 'end of file' is reached. Otherwise a few packets that are buffered after we've reached the end of the file will be missed.
  • After reading a page don't just try to read one packet, call ogg_stream_packetout until it returns a result saying there are no more packets. This means we process all the packets from the page immediately and prevents a build up of buffered data.

The code for this example is in the 'part3_vorbis' branch of the github repository. This also includes the Theora code but does not do any a/v synchronisation. Files containing Theora streams will show the video data but it will not play smoothly and will not be synchronised with the audio. Fixing that is the topic of the next post in this series.


Categories: , , ,

by Chris Double (noreply@blogger.com) at June 25, 2009 09:33 PM

Reading Ogg files using libogg

Reading data from an Ogg file is relatively simple. The file format is well documented in RFC 3533. I showed how to read the format using JavaScript in a previous post.

For C and C++ programs it's easier to use the xiph.org libraries. There are libraries for decoding specific formats (libvorbis, libtheora) and there is a library for reading data from Ogg files (libogg).

I'm prototyping some approaches to improve the performance of the Firefox Ogg video playback and while I'm at it I'll write some posts on using these libraries to decode/play Ogg files. Hopefully it'll prove useful to others using them and I can get some feedback on usage.

All the code for this is in the plogg git repository on github. The 'master' branch contains the work in progress player that I'll describe in a series of posts, and there are branches specific to the examples in each post.

The libogg documentation describes the API that I'll be using in this post. All that this example will do is read an Ogg file, read each stream in the file and count the number of packets for that stream. It prints the number of packets. It doesn't decode the data or do anything really useful. That'll come later.

You can think of an Ogg file as containing logical streams of data. Each stream has a serial number that is unique within the file to identify it. A file containing Vorbis and Theora data will have two streams. A Vorbis stream and a Theora stream.

Each stream is split up into packets. The packets contain the raw data for the stream. The process of decoding a stream involves getting a packet from it, decoding that data, doing something with it, and repeating.

That describes the logical format. The physical format of the Ogg file is split into pages of data. Each physical page contains some part of the data for one stream.

The process of reading and decoding an Ogg file is to read pages from the file, associating them with the streams they belong to. At some point we then go through the pages held in the stream and obtain the packets from it. This is the process the code in this example follows.

The first thing we need to do when reading an Ogg file is find the first page of data. We use a ogg_sync_state structure to keep track of search for the page data. This needs to be initialized with ogg_sync_init and later cleaned up with ogg_sync_clear:

ifstream file("foo.ogg", ios::in | ios::binary);
ogg_sync_state state;
int ret = ogg_sync_init(&state);
assert(ret==0);
...look for page...

Note that the libogg functions return an error code which should be checked, A result of '0' generally indicates success. We want to obtain a complete page of Ogg data. This is held in an ogg_page structure. The process of obtaining this structure is to do the following steps:

  1. Call ogg_sync_pageout. This will take any data current stored in the ogg_sync_state object and store it in the ogg_page. It will return a result indicating when the entire pages data has been read and the ogg_page can be used. It needs to be called first to initialize buffers. It gets called repeatedly as we read data from the file.
  2. Call ogg_sync_buffer to obtain an area of memory we can reading data from the file into. We pass the size of the buffer. This buffer is reused on each call and will be resized if needed if a larger buffer size is asked for later.
  3. Read data from the file into the buffer obtained above.
  4. Call ogg_sync_wrote to tell libogg how much data we copied into the buffer.
  5. Resume from the first step, calling ogg_sync_buffer. This will copy the data from the buffer into the page, and return '1' if a full page of data is available.

Here's the code to do this:

ogg_page page;
while(ogg_sync_pageout(&state, &page) != 1) {
char* buffer = ogg_sync_buffer(oy, 4096);
assert(buffer);

file.read(buffer, 4096);
int bytes = stream.gcount();
if (bytes == 0) {
// End of file
break;
}

int ret = ogg_sync_wrote(&state, bytes);
assert(ret == 0);
}

We need to keep track of the logical streams within the file. These are identified by serial number and this number is obtained from the page. I create a C++ map to associate the serial number with an OggStream object which holds information I want associated with the stream. In later examples this will hold data needed for the Theora and Vorbis decoding process.

class OggStream
{
...
int mSerial;
ogg_stream_state mState;
int mPacketCount;
...
};

typedef map StreamMap;

Each stream has an ogg_stream_state object that is used to keep track of the data read that belongs to the stream. We're storing this in the OggStream object that we associated with the stream serial number. Once we've read a page as described above we need to tell libogg to add this page of data to the stream.

StreamMap streams;
ogg_page page = ...obtained previously...;
int serial = ogg_page_serialno(&page);
OggStream* stream = 0;
if (ogg_page_bos(&page) {
stream = new OggStream(serial);
int ret = ogg_stream_init(&stream->mState, serial);
assert(ret == 0);
streams[serial] = stream;
}
else
stream = streams[serial];

int ret = ogg_stream_pagein(&stream->mState, &page);
assert(ret == 0);

This code uses ogg_page_serialnoto get the serial number of the page we just read. If it is the beginning of the stream (ogg_page_bos) then we create a new OggStream object, initialize the stream's state with ogg_stream_init, and store it in out streams map. If it's not the beginning of the stream we just get our existing entry in the map. The final call to ogg_stream_pagein inserts the page of data into the streams state object. Once this is done we can start looking for completed packets of data and decode them.

To decode the data from a stream we need to retrieve a packet from it. The steps for doing this are:

  1. Call ogg_stream_packetout. This will return a value indicating if a packet of data is available in the stream. If it is not then we need to read another page (following the same steps previously) and add it to the stream, calling ogg_stream_packetout again until it tells us a packet is available. The packet's data is stored in an ogg_packet object.
  2. Do something with the packet data. This usually involves calling libvorbis or libtheora routines to decode the data. In this example we're just counting the packets.
  3. Repeat until all packets in all streams are consumed.
while (..read a page...) {
...put page in stream...
ogg_packet packet;
int ret = ogg_stream_packetout(&stream->mState, &packet);
if (ret == 0) {
// Need more data to be able to complete the packet
continue;
}
else if (ret == -1) {
// We are out of sync and there is a gap in the data.
// We lost a page somewhere.
break;
}

// A packet is available, this is what we pass to the vorbis or
// theora libraries to decode.
stream->mPacketCount++;
}

That's all there is to reading an Ogg file. There are more libogg functions to get data out of the stream, identify end of stream, and various other useful functions but this covers the basics. Try out the example program in the github repository for more information.

Note that the libogg functions don't require reading from a file. You can use these routines with any data you've obtained. From a socket, from memory, etc.

In the next post about reading Ogg files I'll go through using libtheora to decode the video data and display it.


Categories: , ,

by Chris Double (noreply@blogger.com) at June 25, 2009 12:49 AM

June 21, 2009

Silvia Pfeiffer

The history of Ogg on the Web

In the year 2000, while working at CSIRO as a research scientist, I had the idea that video (and audio) should be hyperlinked content on the Web just like any Web page. Conrad Parker and I developed the vision of a “Continuous Media Web” and called the technology that was necessary to develop “Annodex” for “annotated and indexed media”.

Not many people now know that this was really the beginning of Ogg on the Web. Until then, Ogg Vorbis and the emerging Ogg Theora were only targeted at desktop applications in competition to MP3 and MPEG-2.

Within a few years, we developed the specifications for a markup language for video called CMML that would provide the annotations, anchor points, and hyperlinks for video to make it possible to search and index video, hyperlink into video section, and hyperlink out of video sections.

We further developed the specification of temporal URIs to actually address to temporal offsets or segments in video.

And finally, we developed extensions to the Xiph Ogg framework to allow it to carry CMML, and more generally multi-track codecs. The resulting files were originally called “Annodex files”, but through increasing collaboration with Xiph, the specifications were simplified and included natively into Ogg and are now known as “Ogg Skeleton”.

Apart from specifications, we also developed lots of software to make the vision actually come true. Conrad, in particular, developed many libraries that helped develop software on top of the raw Xiph codecs, which include liboggz and libfishsound. Libraries were developed to deal with CMML and with embedding CMML into Ogg. Apache modules were developed to deal with segmenting sections from Ogg files and deliver them as a reply to a temporal URI request. And finally we actually developed a Firefox extension that would allow us to display the Ogg Theora/Vorbis videos inside a Web Browser.

Over time, a lot more sofware was developed, amongst them: php, perl and python bindings for Annodex, DirectShow filters to have Ogg Theora/Vorbis support on Windows, an ActiveX control for Windows, an authoring tool for CMML on Windows, Ogg format validation software, mobile phone support for Ogg Theora/Vorbis, and a video wiki for CMML and Ogg Theora called cmmlwiki. Several students and Annodex team members at CSIRO helped develop these, including Andre Pang (who now works for Pixar), Zen Kavanagh (who now works for Microsoft), and Colin Ward (who now works for Symbian). Most of the software was released as open source software by CSIRO and is available now either in the Annodex repository or the Xiph repositories.

Annodex technology became increasingly part of Xiph technology as team members also became increasingly part of the Xiph community, such as by now it’s rather difficult to separate out the Annodex people from the Xiph people.

Over time, other projects picked up on the Annodex technology. The first were in fact ethnographic researchers, who wanted their audio-visual ethnographic recordings usable in deeply. Also, other multimedia scientists experimented with Annodex. The first actual content site to publish a large collection of Ogg Theora video with annotations was OpenRoadTrip by Scott Shawcroft and Brandon Hines in 2006. Soon after, Michael Dale and Aphid from Metavid started really using the Annodex set of technologies and contributing to harden the technology. Michael was also a big advocate for helping Wikimedia and Archive.org move to using Ogg Theora.

By 2006, the team at CSIRO decided that it was necessary to develop a simple, cross-platform Ogg decoding and playback library that would allow easy development of applications that need deep control of Ogg audio and video content. Shane Stephens was the key developer of that. By the time that Chris Double from Firefox picked up liboggplay to include Ogg support into Firefox natively, CSIRO had stopped working on Annodex, Shane had left the project to work for Google on Wave, and we eventually found Viktor Gal as the new maintainer for liboggplay. We also found Cristian Adam as the new maintainer for the DirectShow filters (oggcodecs).

Now that the basic Ogg Theora/Vorbis support for the HTML5 <video> element is starting to be available in all major browsers (well, as soon as an ActiveX control is implemented for IE), we can finally move on to develop the bigger vision. This is why I am an invited expert on the W3C media fragments working group and why I am working with Mozilla on sorting out accessibility for <video>. Accessibility is an inherent part of making video searchable. So, if we can find a way to extend the annotations with hyperlinks, we will also be able to build Webs of videos and completely new experiences on the Web. Think about mashing up simply by creating a list of URLs. Think about tweeting video segments. Think about threaded video email discussions (Shane should totally include that into Google Wave!). And think about all the awesome applications that come to your mind that I haven’t even thought about yet!

I spent this week at the Open Video Conference in New York and was amazed about the 800 and more people that understand the value of open video and the need for open video technologies to allow free innovation and sharing. I can feel that the ball has got rolling – the vision developed almost 10 years ago is starting to take shape. Sometimes, in very very rare moments, you can feel that history has just been made. The Open Video Conference was exactly one such point in time. Things have changed. Forever. For the better. I am stunned.

by silvia at June 21, 2009 02:11 PM

June 14, 2009

Silvia Pfeiffer

Cool HTML5 video demos

I’ve always thought that the most compelling reason to go with HTML5 Ogg video over Flash are the cool things it enables you to do with video within the webpage.

I’ve previously collected the following videos and demos:

First there was a demo of a potential javascript interface to playing Ogg video inside the Web browser, which was developed by CSIRO. The library in use later became the library that Mozilla used in Firefox 3.5:

Then there were Michael Dale’s demos of Metavidwiki with its direct search, access and reuse of video segments, even a little web-based video editor:

Then there was Chris Double’s video SVG demo with cool moving, resizing and reshaping of video:

and Chris kept them coming:

Then Chris Blizzard also made a cool demo for showing synchronised video and graph updates as well as a motion detector:

And now we have Firefox Director Mike Belitzer show off the latest and coolest to TechCrunch, the dynamic content injection bit of which you can try out yourself here:

It just keeps getting better!

UPDATE: Here are some more I’ve come across:

by silvia at June 14, 2009 02:15 PM

YouTube Ogg Theora+Vorbis & H.263/H.264 comparison

On Jun 13th 2009 Chris DiBona of Google claimed on the WhatWG mailing list:

“If [youtube] were to switch to theora and maintain even a semblance of the current youtube quality it would take up most available bandwidth across the Internet.”

Everyone who has ever encoded a Ogg Theora/Vorbis file and in parallel encoded one with another codec will have to immediately protest. It is sad that even the best people fall for FUD spread by the un-enlightened or the ones who have their own agenda.

Fortunately, Gregory Maxwell from Wikipedia came to the rescue and did an actual “YouTube / Ogg/Theora comparison”. It’s a good read and a comparison on one video. He has put his instructions there, so anyone can repeat it for themselves. You will have to start with a pretty good quality video though to see such differences.

by silvia at June 14, 2009 12:36 PM

June 12, 2009

Sebastian Pipping

Inviting you to project “PackageMap”

Quick (re-)introduction:  My task for Gentoo/Google Summer of Code 2009 is to give Gentoo a Debian popcon equivalent, a tool to collect statistics on “what package is installed how often”.  To achieve this goal I’m extending Smolt (a tool currently doing similar things with hardware information) by fine-tunable software stats gathering.

The plan we have for Smolt is to make it cross-distro, not just fit Gentoo or Fedora.  One point where the consequences and benefits of such an approach can be seen clearly is with

counting packages from different distros into the same buckets.

What do I mean by that?  Debian’s Git counts for Gentoo’s Git counts for Fedora’s, you know the list.  With packages counted from accross distros we can suddenly answer questions that we currently cannot answer, among them

  • What globally popular packages are missing in distro X? Let’s say we don’t have a package for product P. Do other distros have one? They do, maybe we need one, too?  They don’t, maybe P is not that important then?
  • How many Linux users are approximately using program X in total? Not just on Ubuntu or Arch - all across Linux, BSD, Solaris!
  • Does distro X have 10 times the packages of Y or is it just different splitting?

To count into the same bucket we use global identifiers for the “products” that fall out of a package.  Gentoo package “dev-util/git” can produce product “cpe://a:git:git”, Debian’s “git-core” can, too. That string before is a CPE name, a concept close to package naming in Java.  This “intermediate language” allows us to relate package names from distro X with those of distro Y and answer various questions from that data.

To do such mapping we need code (or a “service”) that does the mapping for us and base of collected data that the service can operate on.  Both of these is project “PackageMap”.

I have started populating the database with packages (currently 312 in number) made from information extracted from the Gentoo tree and the National Vulnerability Database.  Latter holds many CPEs. Let me state clearly that packagemap is not about Gentoo in particular.  Sure, the initial data has lots of Gentoo in it but the whole point of the project is to get information and people from different distros together.

To see what these 312 packages maps look like at the moment you best do a few clicks through the database folder yourself:
http://git.goodpoint.de/?p=packagemap.git;a=tree;f=database

Also, there are Relax NG schema and DTD for validation, more documentation than I usually write and a few scripts:
http://git.goodpoint.de/?p=packagemap.git;a=tree

By now I hope you have gained interest in what this can become.
Your active participation is highly appreciated.
A few minutes from everyone can make a huge difference here.
If you want write access to the repo - mail me: sebastian@pipping.org.

Please have a look at the Git repository and ask questions.

Thanks for reading up to this point.

PS: I’m aware “hartwork.org” might not make a good longterm location for DTDs, XML namespaces and such for a cross-distro project.  Any ideas where to put them best?

by sping at June 12, 2009 07:42 AM

June 09, 2009

Silvia Pfeiffer

Sites with Ogg in HTML5 video tag

Yesterday, somebody mentioned that the HTML5 video tag with Ogg Theora/Vorbis can be played back in Safari if you have XiphQT installed (btw: the 0.1.9 release of XiphQT is upcoming). So, today I thought I should give it a quick test. It indeed works straight through the QuickTime framework, so the player looks like a QuickTime player. So, by now, Firefox 3.5, Chrome, Safari with XiphQT, and experimental builds of Opera support Ogg Theora/Vorbis inside the HTML5 video tag. Now we just need somebody to write some ActiveX controls for the Xiph DirectShow Filters and it might even work in IE.

While doing my testing, I needed to go to some sites that actually use Ogg Theora/Vorbis in HTML5 video tags. Here is a list that I came up with in no particular order:

I’m sure there’s a lot more out there – feel free to post links in the comments.

by silvia at June 09, 2009 10:44 PM