What are MPEG and AVC?

Posted by Stephen Cook

In late 2015, I had been working at Amazon Video for a year and a half. I’m a huge fan of online video (why should we live in a world where putting data on a disc and physically sending it by mail is easier than sending bits through a wire?), so when the question “what does MPEG stand for?” came up during a pub quiz, my entire team expectantly turned to me.

Sadly, before writing this blog post, I had an embarrassingly poor practical understanding of how digital video actually worked, to the extent that I had absolutely no idea what MPEG stood for. In response to this embarrassment, I thought it would be fun to learn how digital video worked in depth. AVC and MP4 are some of the most common video file types, so MPEG felt like a sane choice of things to pick apart.

Let it not be said that I don’t take pub quizzes seriously.

In this blog post, I will look at what the AVC file format actually is, and how it’s structured. I was originally planning on going into enough depth to build an entire AVC movie from scratch — but in researching for, and writing this post, I’ve been repeatedly blown away by the breadth of the file format, and just how much stuff is actually in it.

So instead I’m going to make this a post of four parts:

  • What are MPEG and AVC? (this post)
  • How does AVC encoding work?
  • What is MP4 and BIFS?
  • What is MPEG-DASH?

I’ll update links to the other posts as I publish them.

So this post will focus on the actual file format here (MPEG-4 Part 12 and Part 15), and leave the video encoding (MPEG-4 Part 10) to another blog post, at which point we’ll be able to make an AVC file from scratch.

What is AVC? #

AVC (aka. H.264) is a video coding format that’s defined in MPEG-4 Part 10. It also has its storage format defined in MPEG-4 Part 15 (which extends MPEG-4 Part 12). So let’s come back to this question once we settle what MPEG-4 is, and why it’s split into parts.

What is MPEG-4? #

MPEG stands for “Moving Picture Experts Group”. It’s literally a group of people and organisations working together to form standards for audio and video.

MPEG’s standards all address a particular problem and break the problem down into several parts. And these parts aren’t fixed for a standard; MPEG-4, for example, was introduced in 1998, but at the time of writing this post, MPEG-4 Part 31 (concerned with compression specifically for browsers) is still under development.

MPEG-4 is the standard for “Coding of audio-visual objects.” This is subtly but importantly different to MPEG-2’s goal of “Generic coding of moving pictures and associated audio information.” MPEG-4 is concerned with remaining a lot more generic than MPEG-2, both in ways that allow more intelligent compression and in ways that allow more features to be added to the format.

What is MPEG-4 Part 12? #

MPEG-4 Part 12 is a digital container format that wraps the actual video encodings (e.g. MPEG-4 Part 10, but this could be nearly any encoding). It includes information about what encodings were used, how to play the video chunks, what the framerate and length of the video are, etc.

It is an OOP-esque format, made up of “boxes”. The specification defines the boxes themselves in a very OOP style: the Decoding Time to Sample Box, for example, is described as a class that extends FullBox (which adds a version and flags), which in turn extends the basic Box (a size and a type).

As to what all these boxes are, and what their fields mean, I’ll get to that later.

But we can see from these OOP-esque class definitions that a Decoding Time to Sample Box will look like this in bits:

(size) (boxtype=‘stts’) [if size was 1] (largesize) [/if size was 1] (version=0) (flags=0) (entry count) [entry count times] (sample count) (sample time delta) [/entry count times]
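To make that layout a bit more concrete, here is a minimal Python sketch that reads those fields back out of raw bytes. The function name `parse_stts` and the shape of the returned dictionary are my own choices for illustration (nothing in the specification names them), and it assumes the whole box is already in memory.

```python
import struct

def parse_stts(buf: bytes) -> dict:
    """Parse one Decoding Time to Sample Box from the start of buf."""
    # Box header: a 32-bit size, then the 4-character box type.
    size, boxtype = struct.unpack_from(">I4s", buf, 0)
    pos = 8
    if size == 1:
        # A size of 1 means the real size follows as a 64-bit "largesize".
        (size,) = struct.unpack_from(">Q", buf, pos)
        pos += 8
    if boxtype != b"stts":
        raise ValueError(f"expected stts, got {boxtype!r}")

    # FullBox header: 1 byte of version, then 3 bytes of flags.
    version = buf[pos]
    flags = int.from_bytes(buf[pos + 1:pos + 4], "big")
    pos += 4

    # The table itself: entry_count pairs of (sample count, sample time delta).
    (entry_count,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    entries = []
    for _ in range(entry_count):
        sample_count, sample_delta = struct.unpack_from(">II", buf, pos)
        entries.append((sample_count, sample_delta))
        pos += 8
    return {"size": size, "version": version, "flags": flags, "entries": entries}

# The bytes of the stts box from the file examined later in this post.
STTS_BYTES = bytes.fromhex("00000018 73747473 00000000 00000001 0000000B 00000080")
stts = parse_stts(STTS_BYTES)
print(stts)
```

Every other box follows the same pattern: read the size and type, then (for a FullBox) the version and flags, then whatever fields that particular box defines.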

What does AVC (MPEG-4 Part 15) add to MPEG-4 Part 12? #

MPEG-4 Part 15 adds a few things to MPEG-4 Part 12 — mostly it adds extra blocks. The things we’ll see in this post are:

  • The avc1 brand (I explain brands a bit more here)
  • The AVC Configuration Box (avcC) to specify the configuration of the encoded data
  • The AVC Sample Entry (avc1), which I explain more here

Looking at an actual AVC file #

Using a hex editor, we can inspect an actual AVC file. I made a tiny 64×64, two frame GIF of just a single block colour — that I converted into an AVC video, and this is what it looks like from a high level:

  • ftyp — tells us what kind of file this is, what version of MPEG-4 we're dealing with
  • free
  • mdat — the actual encoded video data
  • moov — metadata about the video data
    • mvhd
    • trak — one particular track of the video data
      • tkhd
      • edts — information about where this particular track fits in to the whole file's timeline
        • elst
      • mdia
        • mdhd
        • hdlr
        • minf
          • vmhd
          • dinf — information on where the encoded video data is all stored
            • dref
              • url
          • stbl: information about how the encoded data is broken into samples
            • stsd
              • avc1
                • avcC
            • stts
            • stss
            • ctts
            • stsc
            • stsz
            • stco
    • udta — high level metadata about the file itself
      • meta
        • hdlr
        • ilst
          • data
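A tree like this can be recovered mechanically, because every box starts with its own size and type. What follows is a rough Python sketch of such a walker. The names `walk_boxes`, `CONTAINERS` and `box` are mine, and the container set only lists the plain container boxes seen in this file. Boxes such as meta, stsd and avc1 also hold children, but they have extra fields in front of them, so this sketch simply treats them as leaves.

```python
import struct

# Boxes whose payload is nothing but other boxes (only those seen in this file).
CONTAINERS = {b"moov", b"trak", b"edts", b"mdia", b"minf", b"dinf", b"stbl", b"udta"}

def walk_boxes(buf, start=0, end=None, depth=0):
    """Yield (depth, type, offset, size) for each box found in buf[start:end]."""
    if end is None:
        end = len(buf)
    pos = start
    while pos + 8 <= end:
        size, boxtype = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:
            # The real size follows the type as a 64-bit largesize.
            (size,) = struct.unpack_from(">Q", buf, pos + 8)
            header = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the enclosing data.
            size = end - pos
        if size < header or pos + size > end:
            raise ValueError(f"bad box size {size} at offset {pos}")
        yield depth, boxtype.decode("latin-1"), pos, size
        if boxtype in CONTAINERS:
            yield from walk_boxes(buf, pos + header, pos + size, depth + 1)
        pos += size

def box(boxtype, payload=b""):
    """Helper for the demo: wrap a payload in an 8-byte box header."""
    return struct.pack(">I4s", 8 + len(payload), boxtype) + payload

# A tiny made-up file, just to exercise the walker.
demo = box(b"free") + box(b"moov", box(b"trak", box(b"edts", box(b"elst", bytes(20)))))
for depth, name, offset, size in walk_boxes(demo):
    print("  " * depth + f"{name} (offset {offset}, {size} bytes)")
```

Pointing `walk_boxes` at the bytes of a real file (read in binary mode) should print an outline much like the one above, minus the children of the boxes it treats as leaves.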

To get into much more detail though, we’re going to need to get into the nitty-gritty details of it all. So what follows are the raw bytes in hex of the MPEG file.

I’ve annotated the data as best I could, by doing the following:

  • Indented to represent boxes containing other boxes
  • Written bytes intended to be ascii as ascii, rather than raw hex
  • Written bytes intended to be a null-terminated ascii string underlined in blue
  • Left raw hex italicised in red
  • Put the box type in bold
  • Highlighted my annotations

If you’re this far into my post, I can assume that you’re interested in MPEG — and so am I, which is why I think this is really interesting. But objectively speaking… this is probably a little long, and dry. MPEG-4 has a large overhead, which means even this tiny video ends up as a fairly large file. You can click here to skip to the end of the AVC file if you’re not interested in getting into this much detail. Otherwise, grab yourself a cup of tea, and let’s have a look:

  • 00 00 00 20 ftyp The size (in bytes) and name of the box is included at the start of every box

    ftyp is the File Type Box. Every MPEG-4 Part 12 file must have exactly one of these (unless they're using a very old MPEG-4 specification, in which case you just assume an ftyp box with major brand mp41 and minor version 0).

    isom The first 4 bytes form a 4-char string, stating the major brand. This corresponds to the main MPEG-4 specification that should be used for decoding this particular MPEG-4 file. For example, isom here means (according to ftyp.com) ISO 14496-12:2003, aka. MPEG-4 Part 12 (the 2003 revision)

    00 00 02 00 The next 4 bytes form the minor version number of the major brand specification used. This field is meant to be used for debugging only — to get a more accurate understanding of which specification was used, if something goes wrong

    isom iso2 avc1 mp41 The rest of the box is made up of other 4-char strings of supported brand names
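As a quick check of the ftyp layout just described, here is a small Python sketch that pulls the brands out of those 32 bytes. The function name `parse_ftyp` and the dictionary keys are my own, and it assumes a plain 32-bit box size (no largesize).

```python
import struct

def parse_ftyp(buf: bytes) -> dict:
    """Parse a File Type Box from the start of buf."""
    size, boxtype = struct.unpack_from(">I4s", buf, 0)
    if boxtype != b"ftyp":
        raise ValueError(f"expected ftyp, got {boxtype!r}")
    # Major brand (4 characters) and minor version (32-bit integer).
    major_brand, minor_version = struct.unpack_from(">4sI", buf, 8)
    # Everything after the first 16 bytes is a list of 4-character brands.
    compatible = [buf[i:i + 4].decode("ascii") for i in range(16, size, 4)]
    return {
        "major_brand": major_brand.decode("ascii"),
        "minor_version": minor_version,
        "compatible_brands": compatible,
    }

# The ftyp box from this file, rebuilt from the annotated bytes above.
FTYP_BYTES = (
    bytes.fromhex("00000020") + b"ftyp"
    + b"isom" + bytes.fromhex("00000200")
    + b"isomiso2avc1mp41"
)
ftyp = parse_ftyp(FTYP_BYTES)
print(ftyp)
```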

  • 00 00 00 08 free

    The free or skip block is literally empty space. It's space that's just allocated in the file to allow overwrites without changing the position of any of the other blocks. This can be useful for editing, since some MPEG-4 Part 12 blocks have pointers pointing to other blocks, so all of these pointers would need to be updated if the file size changed.

    Although, as this block is only 8 bytes long (i.e. the exact length of the size, and the block type) this block seems entirely pointless to me (but the file was made through automated software, so that's entirely possible).

  • 00 00 03 65 mdat

    mdat is the Media Data Box. The entire content of the box (after the name) is the actual media data of the MPEG-4 file. Skipping ahead in the file (to stsd, the Sample Description Box) we know that the media data is encoded using avc1 (i.e. MPEG-4 Part 10).

    As I mentioned previously, I don't want to go too far into the details of AVC's actual encoding here, so we'll leave this block as is for now.

    00 00 02 A0 06 05 FF FF 9C DC 45 E9 BD E6 D9 48 B7 96 2C D8 20 D9 23 EE EF

    x264 - core 148 - H.264/MPEG-4 AVC codec - Copyleft 2003-2015 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=3 lookahead_threads=1 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00

    80 00 00 00 2A 65 88 84 00 27 FF FE F5 B1 7C 0A 6A E9 EA 8A 0C E8 32 2E E1 97 24 D5 91 25 E4 E9 10 EB D3 4C 03 21 3B C5 80 10 10 0D 71 12 A3 00 00 00 0B 41 9A 24 6C 42 3F FD E1 00 B0 80 00 00 00 08 41 9E 42 78 8D FF 05 5D 00 00 00 08 01 9E 61 74 45 FF 06 C4 00 00 00 08 01 9E 63 6A 45 FF 06 C5 00 00 00 0F 41 9A 68 49 A8 41 68 99 4C 08 FF FC 84 02 97 00 00 00 09 41 9E 86 45 11 2C 6F 05 5D 00 00 00 08 01 9E A5 74 45 FF 06 C5 00 00 00 08 01 9E A7 6A 45 FF 06 C4 00 00 00 10 41 9A AA 49 A8 41 6C 99 4C 14 4C 5F FA 58 04 E4 00 00 00 08 01 9E C9 6A 45 FF 06 C5

  • 00 00 03 96 moov

    moov, the Movie Box. Every MPEG-4 Part 12 file must have exactly 1, and in it is stored all of the metadata related to the presentation.

    • 00 00 00 6C mvhd

      mvhd, the Movie Header Box. Every moov box must have exactly 1, and in it is stored some basic metadata about the presentation.

      00 mvhd extends FullBox, meaning that we have a version, and flags. The first byte contains the mvhd version, which (for mvhd, in MPEG-4 Part 12 2012) must be either 0 or 1. This tells us the byte-size of some of the following fields (e.g. creation date is 64-bit in version 1, but 32-bit in version 0)

      00 00 00 The next 3 bytes are for flags. This is necessary to fully extend the FullBox, but mvhd doesn't actually use these 3 bytes at all. You'll see this a lot

      56 84 03 0F Since we're using version 0, the next 4 bytes make the creation time. The specification counts this in seconds since 1904-01-01, but the value here only gives a sensible date when read as Unix epoch time (Wed Dec 30 2015 16:15:11 GMT+0000)

      56 84 0F 8A The next 4 bytes make the modification time, read the same way (Wed Dec 30 2015 17:08:26 GMT+0000)

      00 00 03 E8 The next 4 bytes make the timescale. This is the number of "time units" that you want to fit in a second. In this instance, it's 1000

      00 00 00 6E The next 4 bytes make the duration (how many "time units" long the presentation is). In this instance, it's 110, or 0.11 seconds

      00 01 00 00 Next is a 16.16 fixed point number for the preferred rate. E.g. 1 is playing at normal speed, 2 is playing at 2x speed, -1 is rewinding

      01 00 Next is an 8.8 fixed point number for the preferred volume. E.g. 1 is 100% volume, 0.5 is 50% volume

      00 00 00 00 00 00 00 00 00 00 The next 10 bytes are reserved as 0. I believe these are 10 bytes that are left as legacy from the QuickTime format that MPEG-4 Part 12 derives from

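      The next 9 sets of 4 bytes form a 3×3 transformation matrix for the presentation, represented in homogeneous coordinates. The first two columns are 16.16 fixed point numbers, while the third column is 2.30 fixed point (which is why the 1 in the bottom-right corner is written as 40 00 00 00)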

      00 01 00 00   00 00 00 00   00 00 00 00
      00 00 00 00   00 01 00 00   00 00 00 00
      00 00 00 00   00 00 00 00   40 00 00 00

      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Next is again some space used by the QuickTime format, but not used here, so all set to 0

      00 00 00 02 Finally, we have the next track ID. This is the ID that the next trak added to this file would be (i.e. 2 here, since there's only 1 trak with ID 1 in this file, so the next would be 2)
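Putting the mvhd fields together, here is a hedged Python sketch that decodes the payload above, including the fixed point conversions. The function name `parse_mvhd_v0` and its dictionary keys are my own, it only handles version 0, and it reads the timestamps as Unix epoch seconds (an assumption that suits this file, but not the specification's 1904-based epoch).

```python
import struct
from datetime import datetime, timezone

def parse_mvhd_v0(payload: bytes) -> dict:
    """Decode a version 0 mvhd payload (everything after the 8-byte box header)."""
    if payload[0] != 0:
        raise ValueError("this sketch only handles version 0 (32-bit time fields)")
    # Bytes 1-3 are the (unused) flags; the fixed-size fields start at byte 4.
    creation, modification, timescale, duration, rate, volume = struct.unpack_from(
        ">IIIIih", payload, 4
    )
    # Skip 10 reserved bytes, the 36-byte matrix and 24 pre-defined bytes.
    (next_track_id,) = struct.unpack_from(">I", payload, 96)
    return {
        # Assumption: timestamps are Unix epoch seconds, as read in this post.
        "created": datetime.fromtimestamp(creation, tz=timezone.utc),
        "modified": datetime.fromtimestamp(modification, tz=timezone.utc),
        "timescale": timescale,
        "duration_seconds": duration / timescale,
        "rate": rate / 65536,    # 16.16 fixed point
        "volume": volume / 256,  # 8.8 fixed point
        "next_track_id": next_track_id,
    }

# The identity matrix exactly as it appears in the file.
MATRIX = bytes.fromhex(
    "00010000 00000000 00000000 "
    "00000000 00010000 00000000 "
    "00000000 00000000 40000000"
)
MVHD_PAYLOAD = (
    bytes.fromhex("00 000000 5684030F 56840F8A 000003E8 0000006E 00010000 0100")
    + bytes(10)   # reserved
    + MATRIX
    + bytes(24)   # pre-defined
    + bytes.fromhex("00000002")
)
mvhd = parse_mvhd_v0(MVHD_PAYLOAD)
print(mvhd)
```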

    • 00 00 02 C0 trak

      The Track Box. Every moov box must have at least 1 trak box, containing either media data, or packetization information for streaming protocols.

      • 00 00 00 5C tkhd

        Exactly 1 Track Header Box is contained in each trak box.

        00 Version. This being version 0 again means that the timing fields are 32-bit rather than 64-bit

        00 00 03 Flags. Only 3 bits are used for flags for tkhd, and those are track_in_preview, track_in_movie, and track_enabled. So since our flag is 3 (binary 011) we see that the only unset flag is track_in_preview

        56 84 03 0F Creation time

        56 84 0F 8A Modification time

        00 00 00 01 This is the ID of this particular trak box

        00 00 00 00 These 4 bytes are reserved as 0

        00 00 00 6E Duration of the track. 110 "time units" equates to 0.11 seconds (because of mvhd's timescale)

        00 00 00 00 00 00 00 00 Reserved as 0

        00 00 Layer of this track. If there were other tracks in this MPEG, then the lowest number "layer" would be "on top" of all the others

        00 00 Alternate group. Since this is 0, there is no alternate group. But if it was non-zero, then we would know that it is an alternate track to any other track that has the same alternate group. This is for multi-bitrate tracks, etc

        00 00 Volume, as an 8.8 fixed number. Since this is 0, we know that this is not an audio track

        00 00 Reserved as zero, again

        Next is another transformation matrix for the track, in the same format as the one in mvhd (16.16 fixed numbers, except for the third column, which is 2.30).

        00 01 00 00   00 00 00 00   00 00 00 00
        00 00 00 00   00 01 00 00   00 00 00 00
        00 00 00 00   00 00 00 00   40 00 00 00

        00 00 00 64 Width

        00 00 00 64 Height
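The tkhd flags described above are a plain bit field, so decoding them takes only a few lines. This Python sketch uses the three flag values from MPEG-4 Part 12; the function name `decode_tkhd_flags` is my own.

```python
# Flag values for the Track Header Box.
TRACK_ENABLED = 0x000001
TRACK_IN_MOVIE = 0x000002
TRACK_IN_PREVIEW = 0x000004

def decode_tkhd_flags(flag_bytes: bytes) -> dict:
    """Turn the 3 flag bytes of a tkhd box into named booleans."""
    flags = int.from_bytes(flag_bytes, "big")
    return {
        "track_enabled": bool(flags & TRACK_ENABLED),
        "track_in_movie": bool(flags & TRACK_IN_MOVIE),
        "track_in_preview": bool(flags & TRACK_IN_PREVIEW),
    }

# The flag bytes from this file's tkhd box.
tkhd_flags = decode_tkhd_flags(bytes.fromhex("000003"))
print(tkhd_flags)
```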

      • 00 00 00 24 edts

        The Edits Box. This contains information about how the track fits into the timeline of the entire video, in other words it maps from the timeline of this trak, to the timeline of the entire moov. If omitted, it's just assumed that there's a 1-to-1 mapping of time.

        • 00 00 00 1C elst

          The Edit List Box. This contains the explicit timeline information of the track.

          00 Version. Similarly to other boxes, this version determines if the timing information is 64 or 32-bit

          00 00 00 Flags (unused)

          00 00 00 01 Entry count — that is, how many edits there are. The following 3 fields are repeated entry_count times, but our entry_count here is just 1, so there's no real repeating

          00 00 00 6E The segment duration specifies the duration of this edit segment (i.e. 110 "time units" here, or 0.11 seconds)

          00 00 00 00 The media time specifies when this edit starts relative to this trak. If this were -1, it would be an "empty edit" — meaning it does nothing for the time, rather than playing from the trak

          00 01 00 00 Finally, the media rate is a 16.16 fixed number (a 16-bit integer part followed by a 16-bit fraction) specifying the rate that the media of this edit should be played at

          In other words, the entire trak (since segment duration is the length of the track) should be played from its start (since media time is 0), at regular speed (since the media rate is 1). We know that this should happen straight away, since there were no other entries before this.

      • 00 00 02 38 mdia

        The Media Box contains all the boxes related to media (e.g. the video or audio components) of the track.

        • 00 00 00 20 mdhd

          The Media Header Box contains much the same information as the previous mvhd, but the values here pertain to this track's media, rather than the movie as a whole.

          00 Version

          00 00 00 Flags (unused)

          56 84 03 0F Creation time

          56 84 0F 8A Modification time

          00 00 32 00 Timescale (remember, for this particular track now, not the entire movie), so 12800 units in a second (0x3200)

          00 00 05 80 Duration, 1408 units of time (so 1408 / 12800 = 0.11 seconds, which matches the duration of our track)

          55 C4 The next 2 bytes are technically three 5-bit numbers, preceded by a padding of one bit as 0. These three numbers form an ISO-639-2/T language code, with each letter stored as its ASCII value minus 0x60. In this case, we get 21, 14, and 4 (i.e. u, n, and d). und is the code for "undetermined" language

          00 00 Finally, the next 2 bytes are predefined as 0
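The packed language code in mdhd is fiddly enough to be worth a few lines of Python. This is only a sketch, with a function name (`decode_language`) of my own choosing; it relies on each 5-bit value being the letter's ASCII code minus 0x60.

```python
def decode_language(raw: bytes) -> str:
    """Unpack the 2-byte mdhd language field into a 3-letter code."""
    value = int.from_bytes(raw, "big")
    # The top bit is padding; below it sit three 5-bit values.
    codes = [(value >> shift) & 0x1F for shift in (10, 5, 0)]
    # Each value is the ASCII code of a lowercase letter, minus 0x60.
    return "".join(chr(code + 0x60) for code in codes)

# The language bytes from this file's mdhd box.
language = decode_language(bytes.fromhex("55C4"))
print(language)
```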

        • 00 00 00 2D hdlr

          The Handler Reference box declares the nature of the media of the track, and thus how to handle it.

          00 Version

          00 00 00 Flags (unused)

          00 00 00 00 Reserved as 0

          vide The next 4 bytes are 4 ASCII characters, stating the type of the handler. This is a video track handler

          00 00 00 00 00 00 00 00 00 00 00 00 The next 12 bytes are reserved as 0

          VideoHandler Finally, the name of the handler is given as a null-terminated string

        • 00 00 01 E3 minf

          The Media Information Box contains all boxes that give characteristic information on the media in the track.

          • 00 00 00 14 vmhd

            The Video Media Header Box contains information on how to show the encoded video data

            00 The first byte is the version

            00 00 01 The second 3 bytes are flags. The only flag here is the "no lean ahead" flag, used to determine if the media is using an old QuickTime v1.0 format. This should always be 1 for modern movies

            00 00 Next is the graphics mode. The value of 0 is "copy" (i.e., copy new image information on top of what's already there). Indeed, 0 is the only possible value, since MPEG-4 Part 12 only defines this as a base value, allowing other formats to extend it, and MPEG-4 Part 15 does not

            00 00   00 00   00 00 Finally, there is a set of 3 colour values (RGB). These are all 0 here, meaning no colour modification takes place

          • 00 00 00 24 dinf

            The Data Information Box contains objects that declare the location of the media information in a track.

            • 00 00 00 1C dref

              The Data Reference Box contains reference boxes that point to external locations of media data.

              00 Version

              00 00 00 Flags (unused)

              00 00 00 01 Entry count — the number of reference boxes contained in this dref

              • 00 00 00 0C url

                00 Version

                00 00 01 The flags having value 1 means that the media content is contained in this file, not at an external URL. If it were 0, then a null-terminated string would follow, with a URL to the file that contains the media data

            • 00 00 01 A3 stbl

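              The Sample Table Box is a direct child of minf (a sibling of dinf, not nested inside it, as the box sizes confirm). It contains everything needed to locate the media samples of this track, both in time and in the file, along with their type (e.g. I-frame or not).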

              • 00 00 00 97 stsd

                The Sample Description Box contains information about the encoding of the media specified by this track. Different coding methods may be used, even within a video track, so multiple sample descriptions might be necessary.

                00 Version

                00 00 00 Flags (unused)

                00 00 00 01 Entry count

                • 00 00 00 87 avc1

                  avc1 is an MPEG-4 Part 15 specific instance of a Visual Sample Entry. This is a Sample Description Box entry concerning visual media (specifically AVC encoded visual media).

                  00 00 00 00 00 00 Reserved as 0

                  00 01 Data reference index

                  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Reserved as 0

                  00 64 Width

                  00 64 Height

                  00 48 00 00 Horizontal resolution (72 dpi, as a 16.16 fixed number)

                  00 48 00 00 Vertical resolution (also 72 dpi)

                  00 00 00 00 Reserved as 0

                  00 01 Frame count

                  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Compressor name (clearly not used here)

                  00 18 Depth — a value of 0x0018 means that the images are in colour, with no alpha channel. This is the only value allowed in MPEG-4 Part 12, and MPEG-4 Part 15

                  FF FF Pre-defined as -1 (presumably for some backwards compatibility reasons)

                  • 00 00 00 31 avcC

                    The AVC Configuration Box holds the AVC Decoder Configuration Record, specified in MPEG-4 Part 15. This contains the configuration that the AVC decoder needs in order to decode the samples. As these configuration values won't be useful without a better understanding of the AVC encoding, I'll leave this for another blog post.

                    01 64 00 0C FF E1 00 18 67 64 00 0C AC D9 47 3F 9E 78 40 00 00 03 00 40 00 00 32 03 C5 0A 65 80 01 00 06 68 EB E3 CB 22 C0

              • 00 00 00 18 stts

                The Decoding Time to Sample Box is a table that gives the advised time to decode each sample. This can differ from the time at which the sample should be put on screen (the "composition time").

                00 Version

                00 00 00 Flags (unused)

                00 00 00 01 Entry count (only 1, since the decoding times of all 11 samples are merged into 1 entry here)

                00 00 00 0B The number of samples this entry is concerning

                00 00 00 80 The time distance between each of the samples' decode times

                So in this movie file, there are 11 samples, which should be decoded 128 time units apart. Since the mdhd timescale (00 00 32 00) is 12800 units per second, that's 0.01 seconds between samples.
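To see what that single run-length entry expands to, here is a small Python sketch that turns stts entries into one decode time per sample. The function name `decode_times` is my own, and the timescale of 12800 is the value of the mdhd bytes 00 00 32 00 seen earlier.

```python
def decode_times(stts_entries, timescale):
    """Expand (sample_count, sample_delta) runs into decode times in seconds.

    Returns the list of per-sample decode times, plus the total duration.
    """
    times = []
    ticks = 0
    for sample_count, sample_delta in stts_entries:
        for _ in range(sample_count):
            times.append(ticks / timescale)
            ticks += sample_delta
    return times, ticks / timescale

# One entry: 11 samples, each 128 time units after the previous one.
dts_seconds, total_seconds = decode_times([(11, 128)], 12800)
print(dts_seconds)
print(total_seconds)
```

The total comes out at 1408 time units, which is the same duration that mdhd reports for the track.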

              • 00 00 00 14 stss

                The Sync Sample Box defines the sample ids that are sync samples. Sync samples are the samples that are key-frames (i.e. I-frames), which I'll go more into in another blog post.

                00 Version

                00 00 00 Flags (unused)

                00 00 00 01 Entry count

                00 00 00 01 The id of the sync sample. So our only I-frame is the first one

              • 00 00 00 68 ctts

                The Composition Time to Sample Box contains the offsets from the Decoding Time to Sample box (stts)'s decoding time values, to the composition times

                00 Version 0 of this box has positive numbers only for the composition offsets. In version 1, negative values can be provided too (which makes more sense given B-frames, but I won't go into that here)

                00 00 00 Flags (unused)

                00 00 00 0B Entry count here is 11, so we loop the next 2 fields 11 times

                00 00 00 01 The number of samples that the following offset is relevant for (so 1 for the first value)

                00 00 01 00 The offset from the sample's decoding time to the time that we should compose (actually show) it (so 256 time units after, for the first value)

                00 00 00 01 The number of samples that the following offset is relevant for (so 1)

                00 00 02 80 The offset from the sample's decoding time (so 640 time units after)

                etc

                00 00 00 01

                00 00 01 00

                00 00 00 01

                00 00 00 00

                00 00 00 01

                00 00 00 80

                00 00 00 01

                00 00 02 80

                00 00 00 01

                00 00 01 00

                00 00 00 01

                00 00 00 00

                00 00 00 01

                00 00 00 80

                00 00 00 01

                00 00 01 80

                00 00 00 01

                00 00 00 80
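Combining the stts deltas with these ctts offsets gives the time at which each sample should actually be shown. The Python sketch below does that for the values in this file; the helper names (`expand`, `composition_times`) are mine, and it assumes both tables cover the same 11 samples.

```python
# (sample_count, value) pairs copied from the stts and ctts boxes above.
STTS_ENTRIES = [(11, 128)]
CTTS_ENTRIES = [
    (1, 256), (1, 640), (1, 256), (1, 0), (1, 128), (1, 640),
    (1, 256), (1, 0), (1, 128), (1, 384), (1, 128),
]

def expand(entries):
    """Turn run-length (count, value) pairs into one value per sample."""
    return [value for count, value in entries for _ in range(count)]

def composition_times(stts_entries, ctts_entries):
    """Return the composition time of each sample, in time units."""
    times = []
    dts = 0
    for delta, offset in zip(expand(stts_entries), expand(ctts_entries)):
        times.append(dts + offset)  # composition time = decode time + offset
        dts += delta
    return times

cts = composition_times(STTS_ENTRIES, CTTS_ENTRIES)
# Sample numbers (1-based) in the order they would be shown on screen.
display_order = [i + 1 for i in sorted(range(len(cts)), key=cts.__getitem__)]
print(cts)
print(display_order)
```

Sorted, the composition times land exactly 128 time units apart, so the samples are decoded in one order and shown in another, which is the reordering that B-frames need.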

              • 00 00 00 1C stsc

                The Sample to Chunk Box describes how the samples are grouped into chunks: which chunks each entry applies to, how many samples are in each of those chunks, and which sample description they use. A chunk is just a logical grouping of samples.

                00 Version

                00 00 00 Flags (unused)

                00 00 00 01 Entry count

                00 00 00 01 The index of the chunk that the following samples-count applies to

                00 00 00 0B The number of samples in the chunk (and all following chunks, until the following stsc entry's index)

                00 00 00 01 The index of the sample's description in the earlier Sample Description Box (stsd)

              • 00 00 00 40 stsz

                The Sample Size box describes the size in bytes of all the samples in the file.

                00 Version

                00 00 00 Flags (unused)

                00 00 00 00 Sample size. If this were non-zero, then all samples would be this size. Since it's zero, the sample count is followed by that many individual sample sizes, one for each sample

                00 00 00 0B The sample count, so 11 samples in this case

                00 00 02 D2 The size of the first sample, 722 bytes

                00 00 00 0F The size of the second sample, 15 bytes. As we would expect, the following samples are much smaller than the first one, as they're not sync samples (and they don't carry the x264 metadata that sits at the start of the first sample)

                etc

                00 00 00 0C

                00 00 00 0C

                00 00 00 0C

                00 00 00 13

                00 00 00 0D

                00 00 00 0C

                00 00 00 0C

                00 00 00 14

                00 00 00 0C

              • 00 00 00 14 stco

                The Chunk Offset Box describes the file-offset of each chunk in terms of the file itself (not in terms of a box). The idea of this is so we can still determine the start of a chunk that exists in media data without box structure

                00 Version

                00 00 00 Flags (unused)

                00 00 00 01 Entry count

                00 00 00 30 The chunk offset — so 48 bytes in (which is right at the start of our mdat box, as we would expect)
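With stsc, stsz and stco in hand, we can work out where every sample sits in the file. The following Python sketch does so for this file's values. The function name `sample_offsets` is my own, and it makes the simplifying assumption that every chunk holds the same number of samples (true here, where a single stsc entry covers the only chunk).

```python
def sample_offsets(chunk_offsets, samples_per_chunk, sample_sizes):
    """Return a (file offset, size) pair for every sample.

    Simplification: assumes a single stsc entry, so every chunk
    holds samples_per_chunk samples.
    """
    locations = []
    sizes = iter(sample_sizes)
    for chunk_offset in chunk_offsets:
        pos = chunk_offset
        for _ in range(samples_per_chunk):
            size = next(sizes, None)
            if size is None:
                return locations  # ran out of samples
            locations.append((pos, size))
            pos += size  # samples in a chunk are stored back to back
    return locations

# Sample sizes from stsz, chunk offset from stco, samples per chunk from stsc.
SAMPLE_SIZES = [722, 15, 12, 12, 12, 19, 13, 12, 12, 20, 12]
locations = sample_offsets([48], 11, SAMPLE_SIZES)
for number, (offset, size) in enumerate(locations, start=1):
    print(f"sample {number}: offset {offset}, {size} bytes")
```

The last sample ends at byte 909, which is exactly where the mdat box ends (32 bytes of ftyp, 8 of free, then 0x365 = 869 bytes of mdat), so the three tables agree with the box sizes.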

    • 00 00 00 62 udta

      The User Data Box contains any user information of the parent box (data that might be displayed to the user)

      • 00 00 00 5A meta

        The Meta Box, surprisingly, contains metadata

        00 Version

        00 00 00 Flags (unused)

        • 00 00 00 21 hdlr

          The Handler Reference Box we've seen before in the mdia box. It can also be used to declare the structure of a meta box.

          00 Version

          00 00 00 Flags (unused)

          00 00 00 00 Reserved as zero

          mdir The handler type, mdir means that this is a handler for just metadata

          appl In strict MPEG-4 Part 12, these 4 bytes are part of the next "reserved" set of bytes. But they can also be used to label the QuickTime manufacturer

          00 00 00 00 00 00 00 00 Reserved as zero

          00 A null-terminated string to describe the metadata, that's actually an empty-string (so is just the null character)

        • 00 00 00 2D ilst

          The Apple Item List Box is not strictly MPEG-4 Part 12, but the preceding hdlr (with its mdir handler type) tells us to expect it here

          00 00 00 25 This looks like an offset to the end of the box, but it is really the size (37 bytes) of a nested item box, whose 4-byte type comes next

          A9 The first byte of the nested box's type. 0xA9 is the © character, which prefixes many of Apple's metadata item names

          too Together with the previous byte, this spells ©too: the item that holds the encoder metadata (rather than, for example, the album or composer metadata)

          • 00 00 00 1D data

            The Apple Item Data Box contains the payload of the ilst

            00 Version

            00 00 01 Flags. The flags for this field are text, cpil (compilation), image, and tmpo (where no flags just means that it represents generic data). So the flag value here means that this data is text data, describing the encoder

            00 00 00 00 Reserved as zero

            Lavf56.40.101 The following bytes continue until the end of the box, containing text describing the encoder. In this case, it's saying that version 56.40.101 of libavformat (Lavf) was used to write this file

Well done if you managed to make it through all of that. Here's a puppy video (that is encoded with AVC) as a prize.

So even though we still don’t know what the encoded mdat data actually is, we know a lot about the actual video.

We know where all of the encoded data is, we know how it’s split up into chunks and samples, how long those samples last (and thus how long the movie is). We know which samples are sync samples (so we could, say, pick the correct sample to start from if we wanted to seek to a particular part of the movie). We even know when we’re advised to start decoding each sample.

We know all of the metadata associated with the file, when it was created, last edited, and what sort of media it is. We know the height and width of the video, and its resolution.

Because of MPEG-4’s use of boxes, features can be included and omitted by just adding and removing boxes. There are lots of boxes that we’ve not touched upon at all, for example the Progressive Download Information Box (pdin) that can help suggest how long to delay playback, given the download rate of the video (so as to avoid buffering). Or the Movie Fragment Random Access Box (mfra) that contains heuristic information to point towards sync samples (quicker than working it out definitively).

Conclusion #

There’s a lot of dead space in AVC’s file format, and there’s a lot of overhead (adding a fair space cost, even for tiny videos). There are lots of ways that MPEG-4 Part 12 could be optimised for a particular application, saving on space and computation of decoding and encoding.

However, the file format’s real beauty comes from the fact that it can do so many things pretty well. It’s the jack of all trades of digital video container formats.

A lot of its dead space comes from backwards compatibility, its flags allow for high extensibility, and its version bytes make the format very future-proof (e.g. consider the Composition Time to Sample Box that was able to still support B-frames by upping a version number). The cost of explicitly naming the type of box, and its length, is balanced by the extensibility this provides — allowing new blocks to be introduced at any time, and for additional formats to extend MPEG-4 Part 12 (which lots of things do, e.g. AVC and MP4).

In writing this, I think I have a better appreciation for how much a video format needs to do (rather than just literally representing picture and audio data), a stronger understanding of MPEG in general, and a decent understanding of MPEG-4 Part 12 and 15.

I hope reading this was as helpful as writing it was. And for my final disclaimer: I don’t claim to be an expert on MPEG. Please do take everything with a pinch of salt, and get in contact if you spot anything that looks wrong!
