Shayan's Software & Technology

My adventure as a Software Engineer continues...

A Note on WebVTT Parser Implementation According to Specification and Preparing for a 0.4 Release


I am currently finishing up my unit tests for the WebVTT parser using Google’s GTest C++ framework. I noticed some key differences between the parser we wrote and the JavaScript parser we used as a base when writing our test cases for the 0.1 release. Now that we are implementing actual unit tests for our new parser, some of the older tests we wrote are proving to be redundant or useless. But are they? Let’s find out!

Here’s a link to Google GTest, an open source C++ unit testing framework!

Old Test Cases

For our 0.1 release we were tasked with creating some .test files. These were WebVTT files aimed at thoroughly testing the JavaScript parser we were analyzing as a base. Here is a link to this parser:

JavaScript Parser

Here are some of the test cases I wrote to ensure that improper escapes on special characters are caught by the parser.


00:11.000 --> 00:13.000
Test Ampersand escape: &am;


00:11.000 --> 00:13.000
Test Left to Right Character escape: &rm;


00:11.000 --> 00:13.000
Test Space Character escape: &nbp;

These test cases throw different combinations of incorrectly escaped characters at the parser to ensure garbage isn’t passed through as output to the user. This garbage output, however, is acceptable under the WebVTT specification, which leads to a conflict between implementation and design.

JavaScript Parser

The JS parser does not allow incorrect escapes to be passed through as output to the user. This means that anyone writing a WebVTT file must escape special characters correctly if they want their intended results to appear in the output. In contrast, the WebVTT specification does not enforce this escaping rule. It instead says that if there is an incomplete or incorrect escape in the cue text, the parser should print that garbage text out to the user. This is similar to how markup in HTML works.

Here is a quote directly from the WebVTT specification:

First, examine the value of buffer:

If buffer is the string “&amp”, then append a U+0026 AMPERSAND character (&) to result.

If buffer is the string “&lt”, then append a U+003C LESS-THAN SIGN character (<) to result.

If buffer is the string “&gt”, then append a U+003E GREATER-THAN SIGN character (>) to result.

If buffer is the string “&lrm”, then append a U+200E LEFT-TO-RIGHT MARK character to result.

If buffer is the string “&rlm”, then append a U+200F RIGHT-TO-LEFT MARK character to result.

If buffer is the string “&nbsp”, then append a U+00A0 NO-BREAK SPACE character to result.

Otherwise, append buffer followed by a U+003B SEMICOLON character (;) to result.

Then, in any case, set tokenizer state to the WebVTT data state, and jump to the step labeled next.

WebVTT Specification 
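To make the quoted step concrete, here is a minimal C++ sketch of it. This is not code from our parser: resolveEscape and runEscapeExamples are hypothetical names I made up for illustration, and I am assuming the result is built as UTF-8 text. It only covers the step that runs when the tokenizer reaches the semicolon, not the rest of the tokenizer.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not taken from any real parser's source.
// `buffer` holds the characters collected so far, starting with '&'
// and excluding the terminating ';'. The return value is the UTF-8
// text that would be appended to the result.
std::string resolveEscape(const std::string& buffer) {
    if (buffer == "&amp")  return "&";             // U+0026 AMPERSAND
    if (buffer == "&lt")   return "<";             // U+003C LESS-THAN SIGN
    if (buffer == "&gt")   return ">";             // U+003E GREATER-THAN SIGN
    if (buffer == "&lrm")  return "\xE2\x80\x8E";  // U+200E LEFT-TO-RIGHT MARK
    if (buffer == "&rlm")  return "\xE2\x80\x8F";  // U+200F RIGHT-TO-LEFT MARK
    if (buffer == "&nbsp") return "\xC2\xA0";      // U+00A0 NO-BREAK SPACE
    // "Otherwise" branch: the unrecognized text is kept, semicolon included.
    return buffer + ";";
}

// The three bad escapes from my old test files, run through the spec's rule.
void runEscapeExamples() {
    assert(resolveEscape("&amp") == "&");
    assert(resolveEscape("&am")  == "&am;");
    assert(resolveEscape("&rm")  == "&rm;");
    assert(resolveEscape("&nbp") == "&nbp;");
}
```

Under this rule none of my three test cues is an error. Each bad escape simply comes back out exactly as it was typed.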

Who is Right here? Are my Test Cases Useless and Extremely Redundant?!!

The specification is saying that if an escape is written incorrectly, the parser should append those garbage characters to the result and let them go out to the output.

Working with other people’s code requires patience and thought. The obvious answer here is that the JavaScript parser is wrong! Why? Because it does not follow the WebVTT specification. But I like to think outside the box! The debate here comes down to whether we actually want to be strict in enforcing these rules or whether we want to be lenient. The JavaScript parser is clearly strict, and it upholds the escape rules more tightly than the specification itself requires. I cannot say that the decision here is incorrect. I personally find this strict adherence to escaping rules to be an example of good coding practice. It forces the user to enter data into the captions properly, which makes quality output much more likely.

Further, under the current parser and the specification it is entirely possible for the user to ignore the escaping rules by entering a bare ampersand “&” in the cue text, and it will still be parsed correctly in UTF-16. This makes the escaping rules feel obsolete!

The programmers who coded the JS parser must have adhered strictly to the escaping rules for this very reason. If we are loose on these rules, then the rules hold less significance in ensuring we get reliable output. But it leads to this open-ended question: would you rather see garbage as output, or would you rather see an error? I would personally lean towards an error because I like my output to be as clean as it can possibly get. For the users of course! That’s you guys!
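The two answers to that question can be put side by side in a small C++ sketch. Everything here is hypothetical and only for illustration: EscapePolicy, EscapeOutcome, handleEscape and runPolicyExamples are names I invented, and neither our parser nor the JS parser necessarily exposes a switch like this. Lenient follows the specification's "otherwise" rule, while Strict reports an error the way I understand the JS parser to behave.

```cpp
#include <cassert>
#include <string>

// Hypothetical policy switch; not part of any real parser's API.
enum class EscapePolicy { Lenient, Strict };

struct EscapeOutcome {
    bool ok;           // false means the escape was rejected
    std::string text;  // UTF-8 text to append to the output when ok is true
};

// `buffer` starts with '&' and excludes the terminating ';'.
EscapeOutcome handleEscape(const std::string& buffer, EscapePolicy policy) {
    struct Known { const char* name; const char* utf8; };
    static const Known kKnown[] = {
        {"&amp", "&"},             {"&lt", "<"},
        {"&gt", ">"},              {"&lrm", "\xE2\x80\x8E"},
        {"&rlm", "\xE2\x80\x8F"},  {"&nbsp", "\xC2\xA0"},
    };
    for (const Known& k : kKnown) {
        if (buffer == k.name) return {true, k.utf8};
    }
    if (policy == EscapePolicy::Strict) {
        return {false, ""};        // strict: report an error, emit nothing
    }
    return {true, buffer + ";"};   // lenient: pass the text through as typed
}

void runPolicyExamples() {
    // Same bad escape, two different answers.
    assert(handleEscape("&am", EscapePolicy::Lenient).text == "&am;");
    assert(!handleEscape("&am", EscapePolicy::Strict).ok);
    // A correct escape is accepted under both policies.
    assert(handleEscape("&amp", EscapePolicy::Strict).text == "&");
}
```

With a switch like this my old test files would still earn their keep: the same inputs would expect pass-through text in lenient mode and an error in strict mode.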

Will I be throwing away my tests?

I can’t possibly bring myself to do this, seeing the potential they have in case the WebVTT specification is amended. That is a real possibility, since WebVTT is not a standard yet and is still in development. I will add my tests to the unit test files for use in the future!

