1. 16 Jul, 2019 17 commits
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Fix probing of unknown-length headers · 9c6d14ab
      Andreas Rheinhardt authored
      matroska_probe did not support the case of an unknown-length EBML header
      at all; given that libavformat's Matroska muxer used to produce such
      files in the streaming case, support for them has been added.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      9c6d14ab
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Accept more unknown-length elements · 3c70b941
      Andreas Rheinhardt authored
      The current Matroska specifications mandate that only two elements may
      use an unknown-length length: Segments and clusters. But this was not
      always so: For the greater part of Matroska's existence, all master
      elements were allowed to make use of the unknown-length feature.
      
      And there were muxers creating such files: For several years
      libavformat's Matroska muxer used unknown-length for all master
      elements when the output wasn't seekable. This only stopped in March
      2010 with 2529bb30. And even afterwards it was possible (albeit
      unlikely) for libavformat to create unknown-length master elements
      that are in violation of today's specifications, namely if the master
      element was so big that the seek backwards to update the size could
      no longer be performed inside the AVIOContext's write buffer. This
      has only been fixed in October 2016 (with the patches that introduced
      support for writing CRC-32 elements).
      
      Libavformat's Matroska demuxer meanwhile has never really supported
      unknown-length elements besides segments and clusters. Support for the
      latter was hardcoded. This commit changes this: Now all master elements
      for which a syntax to parse them is available are supported. This
      includes the files produced by old versions of libavformat's muxer.
      
      More precisely, master elements that have unknown length and are about
      to be parsed (not skipped) are supported; only a warning is emitted for
      them. For normal files, this means that level 1 elements after the
      clusters that are encountered after the clusters have been parsed (i.e.
      not because they are referenced by the seekhead at the beginning of the
      file) are still unsupported (they would be skipped at this point if
      their length were known).
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      3c70b941
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Improve invalid length error handling · 04b62bd7
      Andreas Rheinhardt authored
      1. Up until now, the error message for EBML numbers whose length exceeds
      the limits imposed upon them because of the element's type did not
      distinguish between known-length and unknown-length elements. As a
      consequence, the numerical value of the define constant
      EBML_UNKNOWN_LENGTH was emitted as part of the error message which is
      of course not appropriate. This commit changes this by adding error
      messages designed for unknown-length elements.
      
      2. We impose some (arbitrary) sanity checks on the lengths of certain
      element types; these checks were conducted before the checks depending
      on whether the element exceeds its containing master element. Now the
      order has been reversed, because a failure at the (formerly) latter
      check implies that the file is truly erroneous and not only fails our
      arbitrary length limit. Moreover, this increases the informativeness of
      the error messages.
      
      3. Furthermore, the error message in general has been changed by replacing
      the type of the element (something internal to this demuxer and
      therefore suitable as debug output at best, not as an error message
      intended for ordinary users) with the element ID. The element's position
      has been added, too.
      
      4. Finally, the length limit for EBML_NONE elements has been changed so
      that all unknown-length elements of EBML_NONE-type trigger an error.
      This is done because unknown-length elements can't be skipped and need
      to be parsed, but there is no syntax to parse available for EBML_NONE
      elements. This is done in preparation for a further patch which allows
      more unknown-length elements than just clusters and segments.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      04b62bd7
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Don't skip too much when unseekable · 51203051
      Andreas Rheinhardt authored
      The Matroska (and WebM) file format achieves forward-compability by
      insisting that demuxers ignore and skip elements they don't know about.
      Unfortunately, this complicates the detection of errors as errors
      resulting from loosing sync can't be reliably distinguished from
      unknown elements that are part of a future version of the standard.
      
      Up until now, the strategy to deal with this situation was to skip all
      unknown elements that are not obviously erroneous; if an error happened,
      it was tried to seek to the last known good position to resync from (and
      resync to level 1 elements). This is working fine if the input is
      seekable, but if it is not, then the skipped data can usually not be
      rechecked lateron. This is particularly acute if unknown-length clusters
      are in use, as the check for whether a child element exceeds the
      containing master element is ineffective in this situation.
      
      To remedy this, a new heuristic has been introduced: If an unknown
      element is encountered in non-seekable mode, an error is presumed to
      have happened based upon a combination of the length of the row of the
      already encountered unknown elements and of how far away skipping this
      element would take us.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      51203051
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Typos, nits and cosmetics · 60f75c99
      Andreas Rheinhardt authored
      Cosmetics include reordering EbmlType so that EBML_SINT is adjacent to
      the other numbers (and matches the order in the switch in ebml_parse)
      and also reordering the switch for assignment of default values so that
      it matches the order in EbmlType.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      60f75c99
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Reuse positions · 7087fc95
      Andreas Rheinhardt authored
      Up until now, avio_tell was used multiple times in ebml_parse and its
      subroutines, although the result of these calls can usually be simply
      derived from the result of earlier calls to avio_tell. This has been
      changed. Unnecessary calls to avio_tell in ebml_parse are avoided now.
      
      Furthermore, there has been a slight change in the output of some error
      messages relating to elements exceeding their containing master element:
      The reported position of the element now points to the first byte of the
      element ID and no longer to the first byte of the element's payload.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      7087fc95
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Redo EOF handling · 3ed2755b
      Andreas Rheinhardt authored
      This commit closes the last hole in the system of checks for a
      known-length file ending too early: Now an error message is emitted
      in case the file ends directly after an EBML element.
      
      Furthermore, this commit adds a check and a corresponding warning
      whether there is data beyond the Matroska segment (only reasonable for
      known-length segments). If everything looks alright, then parsing is
      stopped as soon as EOF is reached (in contrast, the earlier code would
      always call matroska_resync at the end).
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      3ed2755b
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Combine arrays · 38255cdc
      Andreas Rheinhardt authored
      By including SimpleBlocks and BlockGroups twice in the same EbmlSyntax
      array (with different semantics), one can reduce the duplication of the
      other values.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      38255cdc
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Don't reset cluster position · a9f05151
      Andreas Rheinhardt authored
      The new code does not rely on whether the cluster's position is set or
      not to infer whether a cluster needs to be closed or not (instead, this
      is done in ebml_parse), so there is no need to reset the cluster's
      position at all any more. It will be automatically set to the correct
      value when a cluster is entered.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      a9f05151
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Make cluster parsing level compatible · 865c5370
      Andreas Rheinhardt authored
      Before this commit, the parsing of clusters mixed EBML levels by
      allowing elements from different levels in a EbmlSyntax (namely
      matroska_cluster_parsing). This has been changed. And the level
      is now explicitly used to determine how to parse.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      865c5370
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Redo level handling · b31c9b72
      Andreas Rheinhardt authored
      This commit changes how levels are handled: If the level used for
      ebml_parse ends directly after an element that has been consumed, then
      ebml_parse ends the level itself (and any known-length levels that end
      there as well) and informs the caller via the return value; if the
      current level is of unknown-length, then the level is ended as soon as
      an element that is not valid on the current level, but on a higher
      level is encountered (or if EOF has been encountered).
      
      This is designed for situations where one wants to parse master elements
      incrementally, i.e. not in one go via ebml_parse_nest.
      
      The (incremental) parsing of clusters still mixes levels by using a
      syntax list that contains elements from different levels and the level
      is still ended manually via a call to ebml_level_end if the last cluster
      was an unknown-length cluster (known-length clusters are already ended
      when their last element is read), but only if the next element is a
      cluster, too. A  different level 1 element following an unknown-length
      cluster will currently simply be presumed to be part of the earlier
      cluster. Fixing this will be done in a future patch. The modifications
      to matroska_parse_cluster contained in this patch are only intended not
      to cause regressions.
      
      Nevertheless, the fact that known-length levels are automatically ended
      in ebml_parse when their last element has been read already fixes a bogus
      error message introduced in 9326117b that was emitted when a known-length
      cluster is followed by another level 1 element other than a cluster in
      which case the cluster's level was not ended (which only happened when
      a new cluster has been encountered) so that the length check (introduced
      in 9326117b) failed for the level 1 element as it is of course not
      contained in the previous cluster. Most Matroska files were affected by
      this.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      b31c9b72
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Link to parents in syntax tables · c1abd95a
      Andreas Rheinhardt authored
      By linking to the syntax of the parent (i.e. the containing master
      element) one can check whether an element is actually part of a higher
      level in the EBML hierarchy. Knowing this is important for
      unknown-length levels, because they end when an element that doesn't
      belong to this, but to a higher hierarchy level is encountered.
      
      Sometimes there are different syntaxes dealing with the same elements.
      In this case it is important to use a parent that contains all the
      elements at the parent level; whether this is the syntax actually used
      to enter the child's level is irrelevant. This affects the list of level
      1 elements (which has been used as parent for matroska_cluster, too) and
      it affects recursive elements (currently only the SimpleTag), where the
      non-recursive parent has to be choosen.
      
      This is in preparation for a patch that redoes level handling.
      
      Finally, the segment id has been added to ebml_syntax. This will enable
      handling of unknown-length EBML headers.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      c1abd95a
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Introduce a "last known good" position · a3db9f62
      Andreas Rheinhardt authored
      Currently, resyncing during reading packets works as follows:
      The current position is recorded, then a call to matroska_parse_cluster
      is made and if said call fails, the demuxer tries to resync from the
      earlier position. If the call doesn't fail, but also doesn't deliver a
      packet, then this is looped.
      
      There are two problems with this approach:
      1. The Matroska file format aims to be forward-compatible; to achieve
      this, a demuxer should simply ignore and skip elements it doesn't
      know about. But it is not possible to reliably distinguish unknown
      elements from junk. If matroska_parse_cluster encounters an unknown
      element, it can therefore not simply error out; instead it returns zero
      and the loop is iterated which includes an update of the position that
      is intended to be used in case of errors, i.e. the element that is
      skipped is not searched for level 1 element ids to resync to at all if
      later calls to matroska_parse_cluster return an error.
      Notice that in case that sync has been lost there can be a chain of
      several unknown/possibly junk elements before an error is detected.
      
      2. Even if a call to matroska_parse_cluster delivers a packet, this does
      not mean that everything is fine. E.g. it might be that some of the
      block's data is missing and that the data that was presumed to be from
      the block just read actually contains the beginning of the next element.
      This will only be apparent at the next call of matroska_read_packet,
      which uses the (false) end of the earlier block as resync position so
      that in the (not unlikely) case that the call to matroska_parse_cluster
      fails, the data believed to be part of the earlier block is not searched
      for a level 1 element to resync to.
      
      To counter this, a "last known good" position is introduced. When an
      element id that is known to be allowed at this position in the hierarchy
      (according to the syntax currently in use for parsing) is read and some
      further checks (regarding the length of the element and its containing
      master element) are passed, then the beginning of the current element is
      treated as a "good" position and recorded as such in the
      MatroskaDemuxContext. Because of 2., only the start of the element is
      treated as a "good" position, not the whole element. If an error occurs
      later during parsing of clusters, the resync process starts at the last
      known good position.
      
      Given that when the header is damaged the subsequent resync never skips over
      data and is therefore unaffected by both issues, the "last known good"
      concept is not used there.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      a3db9f62
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Refactor some functions · 559e3422
      Andreas Rheinhardt authored
      Since the changes to the parsing of SimpleBlocks, both ebml_parse_id and
      ebml_parse_elem are only called from one place, so that it is possible
      to inline these two function calls. This is done, but not completely:
      ebml_parse_id still exists in a modified form. This is done in
      preparation for a further patch regarding the handling of
      unknown-length elements.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      559e3422
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Use proper levels after discontínuity · 8a286e74
      Andreas Rheinhardt authored
      The earlier code set the level to zero upon seeking and after a
      discontinuity although in both cases parsing (re)starts at a level 1
      element.
      
      Also set the segment's length to unkown if an error occured in order not
      to drop any valid data that happens to be beyond the designated end of
      the segment.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      8a286e74
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Add function to reset status · 310f326b
      Andreas Rheinhardt authored
      This function will be useful later to reset the status (e.g. current
      level and the already parsed id).
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      310f326b
    • Andreas Rheinhardt's avatar
      avformat/matroskadec: Don't abort resyncing upon seek failure · 27f40b1d
      Andreas Rheinhardt authored
      When an error happens, the Matroska demuxer tries to resync to level 1
      elements from an earlier position onwards. If the seek to said earlier
      position fails, the demuxer currently treats this as an unrecoverable
      error. And that behaviour is suboptimal as said failure is nothing
      unrecoverable or unexpected (when the input isn't seekable).
      It is preferable to simply resync from the earliest position available
      (i.e. the start of the AVIOContext's buffer) onwards if the seek failed.
      
      Here are some scenarios that might be treated as unrecoverable errors
      by the current code if the input isn't seekable. They all have in
      common that the current position is so far away from the desired
      position that the seek can't be fulfilled from the AVIOContext's buffer:
      
      1. Blocks (both SimpleBlocks as well as a Block in a BlockGroup) for
      which reading them as binary EBML elements succeeds, but whose parsing
      triggers an error (e.g. an invalid TrackNumber). In this case the
      earlier position from which resyncing begins is at the start of the block
      (or even earlier).
      2. BlockGroups, whose parsing fails in one of the latter elements. Just
      as in 1., the start of the BlockGroup (the target of the seek) might be
      so far away from the current position that it is no longer in the
      buffer.
      3. At the beginning of parsing a cluster, the cluster is parsed until a
      SimpleBlock or a BlockGroup is encountered. So if the input is damaged
      between the beginning of the cluster and the first occurrence of a
      SimpleBlock/BlockGroup and if said damage makes the demuxer read/skip so
      much data that the beginning of the cluster is no longer in the buffer,
      demuxing will currently fail completely.
      Signed-off-by: 's avatarAndreas Rheinhardt <andreas.rheinhardt@gmail.com>
      27f40b1d
  2. 15 Jul, 2019 4 commits
  3. 14 Jul, 2019 15 commits
  4. 13 Jul, 2019 4 commits