SVG From the Ground Up — Parsing Fundamentals
Scalable Vector Graphics, is a XML based format. So, the first thing we want to do is create an XML ‘parser’. I put that in quotes because we don’t really need to create a full fledged conformant, validating, XML parser. This is the first design principle I’m going to be following. Here I want to create just enough to make things work, with an eye towards future proofing and extensibility, but not go so far as to make it absolutely bullet proof. So, I’ll be writing just enough code to scan some typical .svg files, while leaving room to swap out our quick and dirty parser for something that is more substantial in the future.
If you want to follow along the code I am constructing, you can find it in this GitHub repository: svg2b2d
To start scanning text, begins with how text is represented in the first place. This is a very core fundamental design decision. Will we be reading from files on the local machine? Will we be reading from a stream of bytes coming from the network? Will we be reading from a chunk of memory handed to us through some API within the program? I’ve decided on this last choice. These days, it’s very common to be able to read a file into memory, and operate on it from there. Similarly with networks, speeds are fast enough that you can read the entirety of the content into memory before processing. SVG is not a format where you can easily progressively render, like with raster based formats. You really do need the whole document before you can render it.
So, we’re going to be reading from memory, assuming something else has already taken care of getting the image into memory. I am writing in C++, so I’m ok with struct, and class, but I don’t necessarily want to use the iostream facilities, nor get too far down the track with templates and the like. The latest C++ (20) has a std::span object, which is very useful, and does exactly what I want, but I want the code to be a bit more portable than C++ 20, so I’m not going to use that facility, instead I’m going to somewhat replicate it.
How to represent a chunk of memory. There are two choices. You can either use a starting pointer and length, or you can use a starting and ending pointer. After much deliberation, I chose to do the latter, and use two pointers.
Throughout the code, I will use the ‘struct’ instead of ‘class’ because I’m ok with the data structure defaulting to everything being publicly accessible. There’s not a lot of sub-classing that’s going to occur here either, so I’m not as concerned about data hiding and encapsulation. This also makes the code easier to understand, without a lot of extraneous decorations.
There we have it. You have a chunk of memory, now what? Well, the most common things you do when scanning code are advance the pointer, and check the character you’re currently scanning. So, lets add these conveniences, as well as a couple of constructors, then we can do a sample.
struct ByteSpan { const unsigned char * fStart{}; const unsigned char * fEnd{}; ByteSpan():fStart(nullptr), fEnd(nullptr){} ByteSpan(const char *cstr):fStart((const unsigned char *)cstr), fEnd((const unsigned char *)cstr+strlen(cstr)){} explicit ByteSpan(const void* data, size_t sz) :fStart((const unsigned char*)data) , fEnd((const unsigned char*)fStart + sz) {} // Return false when start and end are the same explicit operator bool() const { return (fEnd - fStart) > 0; }; // get current value from fStart, like a 'peek' operation unsigned char& operator*() { static unsigned char zero = 0; if (fStart < fEnd) return *(unsigned char*)fStart; return zero; } const uint8_t& operator*() const { static unsigned char zero = 0; if (fStart < fEnd) return *(unsigned char*)fStart; return zero; } ByteSpan& operator++() { return operator+=(1); } // prefix notation ++y ByteSpan& operator++(int i) { return operator+=(1); } // postfix notation y++ };
With all that boiler plate code added to the structure, you can now do the following operations
ByteSpan b("Here is some text"); while (b) { printf("%c",*b); b++; }
And that little loop will essentially print a copy of the string you used to create the ByteSpan ‘b’. At this point, it might hardly seem worth the effort. I mean, you could just as simply use a starting pointer and ending pointer, without the intervening ByteSpan structure. Well, yes, and a lot of code out there in the world does exactly that, and it’s just fine. But, we have some future design goals which will make this little encapsulation of the two pointers very convenient. One of the design goals is worth introducing now, and that is the concept of zero, or minimal allocation. We want the scanner to be light weight, fast, minimal impact on the memory of the system. We want it to be able to parse data that is megabytes in size without having any problems. To this end, the scanner itself does no allocations, and does not alter the original memory its operating on, even though the ByteSpan would allow you to.
Alright. With this little tool in hand, what else can we do? Well, soon enough we’re going to need to compare characters, and make decisions. Is that a ‘<‘ opening to an XmlElement? Is this whitespace? Does this string end with “/>”. We have need for something that can represent a set of characters. Here is charset.
struct charset { std::bitset<256> bits; explicit charset(const char achar) { addChar(achar); } charset(const char* chars) { addChars(chars); } charset& addChar(const char achar) { bits.set(achar); return *this; } charset& addChars(const char* chars) { size_t len = strlen(chars); for (size_t i = 0; i < len; i++) bits.set(chars[i]); return *this; } charset& operator+=(const char achar) { return addChar(achar); } charset& operator+=(const char* chars) { return addChars(chars); } charset operator+(const char achar) const { charset result(*this); return result.addChar(achar); } charset operator+(const char* chars) const { charset result(*this); return result.addChars(chars); } // This one makes it look like an array bool operator [](const size_t idx) const { return bits[idx]; } // This way makes it look like a function bool operator ()(const size_t idx) const { return bits[idx]; } bool contains(const uint8_t idx) const { return bits[idx]; } };
All of that, so that we can write the following
charset wspChars(" \t\r\n\v\f"); ByteSpan b(" Now is the time for all humans to come to the aid of animals "); while (b) { // skip whitespace while (wspChars.contains(*b)) b++; // Create a span that will represent a word // start it being empty ByteSpan aWord = b; aWord.fEnd = aWord.fStart; // Advance while there are still characters and they are not a whitespace character while (b && !wspChars.contains(*b)) b++; // Now we're sitting at the end of the whole span, or at the beginning of the next // whitespace character. In either case, it's the end of our word aWord.fEnd = b.fStart; // Now we can do something with the word that we found printWord(aWord); // And continue around the loop until we've exhausted the byte span }
And that’s how we start. If you want to get ahead, you can look at the code in the repository, in particular bspan.h and bspanutil.h. With these two classes alone, we will build up the XML scanning capability, and ultimately the SVG building capability on top of that. So, these are very code, and important to get right, because they will maintain the promise of “no allocations” and “be super fast”.
One question that came up in my mind was “why not just use regex and be done with it?”. Well, yes, C/C++ have regular expression capabilities, either built-in, or as some side library. There were a couple of reasons I chose not to go that route. One is about speed, the other is about allocations. It’s super easy to just store your text into a std::string object, then use regex on that. But, when you do, you’ll find that std::string objects are allocated all over the place, and you don’t have tight control of your memory consumption, which breaks one of the design tenets I’m after. The other is just the size of such code. A good regex library can easily be as big, if not bigger, than the entirety of the SVG parser we’re trying to build. I am somewhat concerned with code size, so I’d rather not have the extra bloat. Besides, all that, trying to construct regex patterns that I or anyone can maintain in the future, can be quite challenging. We’ll essentially be building bits and pieces of what would typically go into regex libraries, but we’ll only be building as much as we need, so it will stay small and tight.
And there you have it. We have begun our journey with these two first small steps, the ByteSpan, and the charset.
Next time, we’ll see how easy it is to ‘parse’ some xml, as we introduce the XmlElement and XmlElementIterator.