Shayan's Software & Technology

My adventure as a Software Engineer continues..,

Mozilla WebVTT 0.6: GitHub Issues, Bug Fixes, Refactoring, and Enhancing the WebVTT Parser

Overview

The pressure is on to get this WebVTT parser on the road and it has been a fun 2 weeks trying to get there! I have encountered a lot of Issues a long the way (GitHub Issues). For those of you unfamiliar with GitHub, it is simply a cloud storage facility for your code base with version control. It also supports online collaboration in a ‘social network’ like setting. Our team takes full advantage of this and this post will detail all of the fun stuff that has happened. I have recently had the honour of being a part of the Mozilla Corporation’s mozilla/webvtt repository. It is there where we discuss, probe, change and enhance the official WebVTT code base.

Here’s a Link to the Mozilla/WebVTT Repository:

https://github.com/mozilla/webvtt/tree/dev

Backstory

So I am tasked with finding, fixing, enhancing the WebVTT code base. Where do I start? I had moved my development environment from Windows 7(not hating, still love you), to MacOSX Lion and promptly I ran a ‘make check’ command on our WebVTT Parser Code. What I found shocked me: none of my unit tests were running! They were not failing, they were not passing they did not even get a chance to try.

This made my primary focus to determine why this was occurring. Our code base had gone through some changes leading up to my discovery and I quickly found out that our unit tests needed to be upgraded with the current stylings and functionality of our new WebVTT Code base. Here come the Issues:

UTF-16

The unit tests Rick and I wrote for escape character checking were using #define directives that have been deprecated in the current version of the code base. The directives referred to UTF-16 CodePoints. The problem is that I remembered that we had decided to use UTF-8 encoding and this proved that what we had now was incorrect. This means that I had to upgrade the existing unit tests to comply with what should be tested in the code base right now.

The solution to this problem after much debate was to create some extended functionality in our string library to allow us to compare the UTF-16 values and convert the UTF-8 values in the codebase to UTF-16. Caitlin(caitp), a member of our team went ahead and implemented this functionality. It was up to me to use it properly. Before I could use it, I had to enumerate the UTF-16 CodePoints I needed to test in the upper level cue_testfixture. I changed this file to support this global enumeration:

enum{ rlm = 0x200F, lrm = 0x200E, nbsp = 0x00A0 };

The enum will definitely grow over time and as soon as I start upgrading the other unit test files it will be quite large. The actual unit tests now use the new string library and now look like this:

/*
 * Verifies that multiple escape characters on multiple lines are parsed.
 * From http://dev.w3.org/html5/webvtt/#webvtt-cue-text (11/27/2012)
 *  Cue text text consists of one or more cue text components optionally separated by a single line terminator which can be:
 *    1. CR (U+000D)
 *    2. LF (U+000A)
 *    3. CRLF pair
 */
TEST_F(PayloadEscapeCharacter, MultilineMultipleEscapeCharacter)
{
  loadVtt( "payload/escape-character/multiline-multiple-escape-character.vtt", 1 );

  const TextNode *node = getHeadOfCue( 0 )->child( 0 )->toTextNode();

  /* verify that it is a TextNode */
  ASSERT_EQ(Node::Text, node->kind());

  /* create a UTF8 representation of NBSP and compare with textnode */
  ASSERT_EQ( nbsp, node->content().utf16At( 0 ) );
}

It should be noted that this is subject to heavy amounts of change in the near future. The reason for this is due to the fact that ASSERT_EQ only evaluates pointers and not the value of the content. An alternative approach of using ASSERT_STREQ has been proposed. This alternative is also not perfect and discussion will occur on what the best approach is for properly dealing with these unit tests.

The discussion can be found here:

https://github.com/mozilla/webvtt/issues/101

The Infinite Loop – file_bytes assumes slen > len

There was an issue that specifically outlined the need for us to check our function parameters before specifically continuing on with the functions main code. The code we are concerned with is the file_bytes() function, a simple strnstr copy located in parser.c. In this case, not checking the function parameter ‘slen’ the code implicitly makes the assumption on its value that it is greater than another parameter ‘len’. That is not a reliable assumption for every case. The need for this was simple: To ensure that the parameters passed to our functions are reliable   to a certain degree. I decided to get on this issue early on and proposed the correct solution. Here it is:

/**
 * basic strnstr-ish routine
 */
WEBVTT_INTERN int
find_bytes( const webvtt_byte *buffer, webvtt_uint len, const webvtt_byte *sbytes, webvtt_uint slen )
{
  // check params for integrity
  if( !buffer || len < 1 || !sbytes || slen < 1 ) {
    return 0;
  }

  webvtt_uint slen2 = slen - 1;
  while( len-- >= slen && *buffer ){
    if( *buffer == *sbytes && memcmp( buffer + 1, sbytes + 1, slen2 ) == 0 ) {
      return 1;
     }
    buffer++;
  }
  return 0;
}

This code solves the improper assumption issue but it also introduces an Infinite Loop in the code making this fix not compilable in our current WebVTT codebase. Which means that we will have to solve the infinite loop before we can apply this patch. The Infinite Loop exposes a vulnerability in other parts of the code base. This code helped uncover this issue. We are currently working on this issue and subsequent posts will update further on our progress.

Refactoring

I also worked on refactoring some of the code base because it contained awkward pointer representation. A function in the codepage took a **variable (a pointer to a pointer) and in the calling function the parameter was sent as &(*var). This is correct, but it is just so much harder to read than just ‘var’. ‘var’ is the address of the variable var and so is &(*var) because the pointer is de-referenced. I went ahead and made these changes in cuetext.c.

To check out all of the other issues I am currently involved with check out our Issues page:

https://github.com/mozilla/webvtt/issues?state=open

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: