SVG From the Ground Up — It’s XML, How hard could it be?
Let’s take a look at the SVG (XML) code that generates that image.
<svg height="200" width="680" xmlns="http://www.w3.org/2000/svg"> <circle cx="70" cy="70" r="50" /> <circle cx="200" cy="70" r="50" fill="#79C99E" /> <circle cx="330" cy="70" r="50" fill="#79C99E" stroke-width="10" stroke="#508484" /> <circle cx="460" cy="70" r="50" fill="#79C99E" stroke-width="10" /> <circle cx="590" cy="70" r="50" fill="none" stroke-width="10" stroke="#508484" /> </svg>
By the end of this post, we should be able to scan through the components of that, and generate the tokens necessary to begin rendering it as SVG. So, where to start?
Last time around (SVG From the Ground Up – Parsing Fundamentals), I introduced the ByteSpan and charset data structures, as a way to say “these are the only tools you’ll need…”. Well, at least they are certainly the core building blocks. Now we’re going to actually use those components to begin the process of breaking down the XML. XML can be a daunting sprawling beast. Its origins are in an even older document technology known as SGML. The first specification for the language can be found here: Extensible Markup Language (XML) 1.0 (Fifth Edition). When I joined the team at Microsoft in 1998 to work on this under Jean Paoli, one of the original authors, there were probably 30 people across dev, test, and pm. Of course we had people working on the standards body, and I was working on XSLT, and a couple on the parser, someone on DTD schema. It was quite a production. At that time, we had to deal with myriad encodings (utf-8 did not rule the world yet), conformance and compliance test suites, and that XSLT beast (CSS did not rule the world yet). It was a daunting endeavor, and at some point we tried to color everything with XML, much to the chagrin of most other people. But, some things did come out of that era, and SVG is one of them.
Today, our task is not to implement a fully compliant validating parser. That again would take a team of a few, and a ton of testing. What we’re after is something more modest. Something a hobby hacker could throw together in a weekend, but have a fair chance at it being able to consume most of the SVG you’re ever really interested in. To that end, there’s a much smaller, simpler XML spec out there. MicroXML. This describes a subset of XML that leaves out all the really hard parts. While that spec is far more readable, we’ll go even one step simpler. With our parser here, we won’t even be supporting utf-8. That might seem like a tragic simplification, but the reality is, not even that’s needed for most of what we’ll be doing with SVG. So, here’s the list of what we will be doing.
Decoding elements
Decoding attributes
Decoding element content (supporting text nodes)
Skipping Doctype
Skipping Comments
Skipping Processing Instructions
Not expanding character entities (although user can)
As you will soon see “skipping” doesn’t mean you have access to the data, it just means our SVG parser won’t do anything with it. This is a nice extensibility point. We start simple, and you can add as much complexity as you want over time, without changing the fundamental structure of what we’re about to build.
Now for some types and enums. I won’t put the entirety of the code in here, so if you want to follow along, you can look at the xmlscan.h file. We’ll start with the XML element types.
enum XML_ELEMENT_TYPE { XML_ELEMENT_TYPE_INVALID = 0 , XML_ELEMENT_TYPE_XMLDECL // An XML declaration, like <?xml version="1.0" encoding="UTF-8"?> , XML_ELEMENT_TYPE_CONTENT // Content, like <foo>bar</foo>, the 'bar' is content , XML_ELEMENT_TYPE_SELF_CLOSING // A self-closing tag, like <foo/> , XML_ELEMENT_TYPE_START_TAG // A start tag, like <foo> , XML_ELEMENT_TYPE_END_TAG // An end tag, like </foo> , XML_ELEMENT_TYPE_COMMENT // A comment, like <!-- foo --> , XML_ELEMENT_TYPE_PROCESSING_INSTRUCTION // A processing instruction, like <?foo bar?> , XML_ELEMENT_TYPE_CDATA // A CDATA section, like <![CDATA[ foo ]]> , XML_ELEMENT_TYPE_DOCTYPE // A DOCTYPE section, like <!DOCTYPE foo> };
This is where we indicate what kinds of pieces of the XML file we will recognize. If something is not in this list, it will either be reported as invalid, or it will simply cause the scanner to stop processing. From the little bit of XML that opened this article, we see “START_TAG”, “SELF_CLOSING”, “END_TAG”. And that’s it!! Simple right?
OK. Next up are a couple of data structures which are the guts of the XML itself. First is the XmlName. Although we’re not building a super conformant parser, there are some simple realities we need to be able to handle to make our future life easier. XML namespaces are one of those things. In XML, you can have a name with a ‘:’ in it, which puts the name into a namespace. Without too much detail, just know that “circle”, could have been “svg:circle”, or something, and possibly mean the same thing. We need a data structure that will capture this.
struct XmlName { ByteSpan fNamespace{}; ByteSpan fName{}; XmlName() = default; XmlName(const ByteSpan& inChunk) { reset(inChunk); } XmlName(const XmlName &other):fNamespace(other.fNamespace), fName(other.fName){} XmlName& operator =(const XmlName& rhs) { fNamespace = rhs.fNamespace; fName = rhs.fName; return *this; } XmlName & operator=(const ByteSpan &inChunk) { reset(inChunk); return *this; } // Implement for std::map, and ordering in general bool operator < (const XmlName& rhs) const { size_t maxnsbytes = std::min(fNamespace.size(), rhs.fNamespace.size()); size_t maxnamebytes = std::min(fName.size(), rhs.fName.size()); return (memcmp(fNamespace.begin(), rhs.fNamespace.begin(), maxnsbytes)<=0) && (memcmp(fName.begin(), rhs.fName.begin(), maxnamebytes) < 0); } // Allows setting the name after it's been created XmlName& reset(const ByteSpan& inChunk) { fName = inChunk; fNamespace = chunk_token(fName, charset(':')); if (chunk_size(fName)<1) { fName = fNamespace; fNamespace = {}; } return *this; } ByteSpan name() const { return fName; } ByteSpan ns() const { return fNamespace; } };
Given a ByteSpan, our universal data representation, split it out into the ‘namespace’ and ‘name’ parts, if they exist. Then we can get the name part by calling ‘name()’, and if there was a namespace part, we can get that from ‘ns()’. Why ‘ns’ instead of ‘namespace’? Because ‘namespace’ is a keyword in C/C++, and we don’t want any confusion or compiler errors.
One thing to note here is the implementation of the ‘operator <‘. Why is that there? Because if you want to use this as a keyfield in an associative container, such as std::map, you need some comparison operator, and by implementing ‘<‘, you get a quick and dirty comparison operator. This is a future enhancement we’ll use later.
Next up is the representation of an XML node itself, where we have XmlElement.
// Representation of an xml element // The xml iterator will generate these struct XmlElement { private: int fElementKind{ XML_ELEMENT_TYPE_INVALID }; ByteSpan fData{}; XmlName fXmlName{}; std::string fName{}; std::map<std::string, ByteSpan> fAttributes{}; public: XmlElement() {} XmlElement(int kind, const ByteSpan& data, bool autoScanAttr = false) :fElementKind(kind) , fData(data) { reset(kind, data, autoScanAttr); } void reset(int kind, const ByteSpan& data, bool autoScanAttr = false) { clear(); fElementKind = kind; fData = data; if ((fElementKind == XML_ELEMENT_TYPE_START_TAG) || (fElementKind == XML_ELEMENT_TYPE_SELF_CLOSING) || (fElementKind == XML_ELEMENT_TYPE_END_TAG)) { scanTagName(); if (autoScanAttr) { if (fElementKind != XML_ELEMENT_TYPE_END_TAG) scanAttributes(); } } } // Clear this element to a default state void clear() { fElementKind = XML_ELEMENT_TYPE_INVALID; fData = {}; fName.clear(); fAttributes.clear(); } // determines whether the element is currently empty bool empty() const { return fElementKind == XML_ELEMENT_TYPE_INVALID; } explicit operator bool() const { return !empty(); } // Returning information about the element const std::map<std::string, ByteSpan>& attributes() const { return fAttributes; } const std::string& name() const { return fName; } void setName(const std::string& name) { fName = name; } int kind() const { return fElementKind; } void kind(int kind) { fElementKind = kind; } const ByteSpan& data() const { return fData; } // Convenience for what kind of tag it is bool isStart() const { return (fElementKind == XML_ELEMENT_TYPE_START_TAG); } bool isSelfClosing() const { return fElementKind == XML_ELEMENT_TYPE_SELF_CLOSING; } bool isEnd() const { return fElementKind == XML_ELEMENT_TYPE_END_TAG; } bool isComment() const { return fElementKind == XML_ELEMENT_TYPE_COMMENT; } bool isProcessingInstruction() const { return fElementKind == XML_ELEMENT_TYPE_PROCESSING_INSTRUCTION; } bool isContent() const { return fElementKind == XML_ELEMENT_TYPE_CONTENT; } bool isCData() const { return fElementKind == XML_ELEMENT_TYPE_CDATA; } bool isDoctype() const { return fElementKind == XML_ELEMENT_TYPE_DOCTYPE; } void addAttribute(std::string& name, const ByteSpan& valueChunk) { fAttributes[name] = valueChunk; } ByteSpan getAttribute(const std::string &name) const { auto it = fAttributes.find(name); if (it != fAttributes.end()) return it->second; else return ByteSpan{}; } private: // // Parse an XML element // We should be sitting on the first character of the element tag after the '<' // There are several things that need to happen here // 1) Scan the element name // 2) Scan the attributes, creating key/value pairs // 3) Figure out if this is a self closing element // // We do NOT scan the content of the element here, that happens // outside this routine. We only deal with what comes up the the closing '>' // void setTagName(const ByteSpan& inChunk) { fXmlName.reset(inChunk); fName = toString(fXmlName.name()); } void scanTagName() { ByteSpan s = fData; bool start = false; bool end = false; // If the chunk is empty, just return if (!s) return; // Check if the tag is end tag if (*s == '/') { s++; end = true; } else { start = true; } // Get tag name ByteSpan tagName = s; tagName.fEnd = s.fStart; while (s && !wspChars[*s]) s++; tagName.fEnd = s.fStart; setTagName(tagName); fData = s; } public: // // scanAttributes // Scans the fData member looking for attribute key/value pairs // It will add to the member fAttributes these pairs, without further processing. // This should be called after scanTagName(), because we want to be positioned // on the first key/value pair. // int scanAttributes() { int nattr = 0; bool start = false; bool end = false; uint8_t quote{}; ByteSpan s = fData; // Get the attribute key/value pairs for the element while (s && !end) { uint8_t* beginattrValue = nullptr; uint8_t* endattrValue = nullptr; // Skip white space before the attrib name s = chunk_ltrim(s, wspChars); if (!s) break; if (*s == '/') { end = true; break; } // Find end of the attrib name. //static charset equalChars("="); auto attrNameChunk = chunk_token(s, "="); attrNameChunk = chunk_trim(attrNameChunk, wspChars); // trim whitespace on both ends std::string attrName = std::string(attrNameChunk.fStart, attrNameChunk.fEnd); // Skip stuff past '=' until the beginning of the value. while (s && (*s != '\"') && (*s != '\'')) s++; // If we've reached end of span, bail out if (!s) break; // capture the quote character // Store value and find the end of it. quote = *s; s++; // move past the quote character beginattrValue = (uint8_t*)s.fStart; // Mark the beginning of the attribute content // Skip until we find the matching closing quote while (s && *s != quote) s++; if (s) { endattrValue = (uint8_t*)s.fStart; // Mark the ending of the attribute content s++; } // Store only well formed attributes ByteSpan attrValue = { beginattrValue, endattrValue }; addAttribute(attrName, attrValue); nattr++; } return nattr; } };
That’s a bit of a brute, but actually pretty straightforward. We need a data structure that tells us what kind of XML element type we’re dealing with. We need the name, as the content of the element held onto for future processing. We hold onto the content as a ByteSpan, but have provision for making more convenient representations. For example, we turn the name into a std::string. In the futue, we can eliminate even this, and just use the XmlName with its chunks directly.
Besides the element name, we also have the ability to split out the attribute key/value pairs, as seen in ‘scanAttributes()’. Let’s take a deeper look at the constructor.
XmlElement(int kind, const ByteSpan& data, bool autoScanAttr = false) :fElementKind(kind) , fData(data) { reset(kind, data, autoScanAttr); } void reset(int kind, const ByteSpan& data, bool autoScanAttr = false) { clear(); fElementKind = kind; fData = data; if ((fElementKind == XML_ELEMENT_TYPE_START_TAG) || (fElementKind == XML_ELEMENT_TYPE_SELF_CLOSING) || (fElementKind == XML_ELEMENT_TYPE_END_TAG)) { scanTagName(); if (autoScanAttr) { if (fElementKind != XML_ELEMENT_TYPE_END_TAG) scanAttributes(); } } }
The constructor takes a ‘kind’, a ByteSpan, and a flag indicating whether we want to parse out the attributes or not. In ‘reset()’, we see that we hold onto the kind of element, and the ByteSpan. That ByteSpan contains everything between the ‘<‘ of the tag to the closing ‘>’, non-inclusive. The first thing we do is scan the tag name, so we can at least hold onto that, leaving the fData representing the rest. This is relatively low impact so far.
Why not just do this in the constructor itself, why have a “reset()”? As we’ll see later, we actually reuse XmlElement in some situations while parsing, so we want to be able to set, and reset, the same object multiple times. At least that’s one way of doing things.
Another item of note is whether you scan the attributes or not. If you do scan the attributes, you end up with a map of those elements, and a way to get the value of individual attributes.
std::map<std::string, ByteSpan> fAttributes{}; ByteSpan getAttribute(const std::string &name) const { auto it = fAttributes.find(name); if (it != fAttributes.end()) return it->second; else return ByteSpan{}; }
The ‘getAttribute()’ method is a most critical piece when we later start building our SVG model, so it needs to be fast and efficient. Of course, this does not have to be embedded in the core of the XmlElement, you could just as easily construct an attribute list outside of the element, but then you’d have to associate it back to the element anyway, and you end up in the same place. getAttribute() takes a name as a string, and returns the ByteSpan which is the raw, uninterpreted content of that attribute, without the enclosing quote marks. In the future, it would be nice to replace that std::string based name with a XmlName, which will save on some allocations, but we’ll stick with this convenience for now.
The stage is now set. We have our core components and data structures, we’re ready for the main event of actually parsing some content. For that, we have to make some design decisions. The first one we already made in the very beginning. We will be consuming a chunk of memory as represented in a ByteSpan. The next decision is how we want to consume? Do we want to build a Document Object Model (DOM), or some other structure? Do we just want to print out nodes as we see them? Do we want a ‘pull model’ parser, where we are in control of getting each node one by one, or a ‘push model’, where we have a callback function which is called every time a node is seen, but the primary driver is elsewhere?
My choice is to have a pull model parser, where I ask for each node, one by one, and do whatever I’m going to do with it. In terms of programming patterns, this is the ‘iterator’. So, I’m going to create an XML iterator. The fundamental structure of an iterator is this.
Iterator iter(content) while (iter) { doSomethingWithCurrentItem(*iter); iter++; }
So, that’s what we need to construct for our XML. Something that can scan its input, delivering XmlElement as the individual items that we can then do something with. So, here is XmlElementIterator.
struct XmlElementIterator { private: // XML Iterator States enum XML_ITERATOR_STATE { XML_ITERATOR_STATE_CONTENT = 0 , XML_ITERATOR_STATE_START_TAG }; // What state the iterator is in int fState{ XML_ITERATOR_STATE_CONTENT }; svg2b2d::ByteSpan fSource{}; svg2b2d::ByteSpan mark{}; XmlElement fCurrentElement{}; public: XmlElementIterator(const svg2b2d::ByteSpan& inChunk) { fSource = inChunk; mark = inChunk; fState = XML_ITERATOR_STATE_CONTENT; next(); } explicit operator bool() { return !fCurrentElement.empty(); } // These operators make it operate like an iterator const XmlElement& operator*() const { return fCurrentElement; } const XmlElement* operator->() const { return &fCurrentElement; } XmlElementIterator& operator++() { next(); return *this; } XmlElementIterator& operator++(int) { next(); return *this; } // Reset the iterator to a known state with data void reset(const svg2b2d::ByteSpan& inChunk, int st) { fSource = inChunk; mark = inChunk; fState = st; } ByteSpan readTag() { ByteSpan elementChunk = fSource; elementChunk.fEnd = fSource.fStart; while (fSource && *fSource != '>') fSource++; elementChunk.fEnd = fSource.fStart; elementChunk = chunk_rtrim(elementChunk, wspChars); // Get past the '>' if it was there fSource++; return elementChunk; } // readDoctype // Reads the doctype chunk, and returns it as a ByteSpan // fSource is currently sitting at the beginning of !DOCTYPE // Note: ByteSpan readDoctype() { // skip past the !DOCTYPE to the first whitespace character while (fSource && !wspChars[*fSource]) fSource++; // Skip past the whitespace // to get to the beginning of things fSource = chunk_ltrim(fSource, wspChars); // Mark the beginning of the "content" we might return ByteSpan elementChunk = fSource; elementChunk.fEnd = fSource.fStart; // To get to the end, we're looking for '[]' or just '>' auto foundChar = chunk_find_char(fSource, '['); if (foundChar) { fSource = foundChar; foundChar = chunk_find_char(foundChar, ']'); if (foundChar) { fSource = foundChar; fSource++; } elementChunk.fEnd = fSource.fStart; } // skip whitespace? // search for closing '>' foundChar = chunk_find_char(fSource, '>'); if (foundChar) { fSource = foundChar; elementChunk.fEnd = fSource.fStart; fSource++; } return elementChunk; } // Simple routine to scan XML content // the input 's' is a chunk representing the xml to // be scanned. // The input chunk will be altered in the process so it // can be used in a subsequent call to continue scanning where // it left off. bool next() { while (fSource) { switch (fState) { case XML_ITERATOR_STATE_CONTENT: { if (*fSource == '<') { // Change state to beginning of start tag // for next turn through iteration fState = XML_ITERATOR_STATE_START_TAG; if (fSource != mark) { // Encapsulate the content in a chunk svg2b2d::ByteSpan content = { mark.fStart, fSource.fStart }; // collapse whitespace // if the content is all whitespace // don't return anything content = chunk_trim(content, wspChars); if (content) { // Set the state for next iteration fSource++; mark = fSource; fCurrentElement.reset(XML_ELEMENT_TYPE_CONTENT, content); return true; } } fSource++; mark = fSource; } else { fSource++; } } break; case XML_ITERATOR_STATE_START_TAG: { // Create a chunk that encapsulates the element tag // up to, but not including, the '>' character ByteSpan elementChunk = fSource; elementChunk.fEnd = fSource.fStart; int kind = XML_ELEMENT_TYPE_START_TAG; if (chunk_starts_with_cstr(fSource, "?xml")) { kind = XML_ELEMENT_TYPE_XMLDECL; elementChunk = readTag(); } else if (chunk_starts_with_cstr(fSource, "?")) { kind = XML_ELEMENT_TYPE_PROCESSING_INSTRUCTION; elementChunk = readTag(); } else if (chunk_starts_with_cstr(fSource, "!DOCTYPE")) { kind = XML_ELEMENT_TYPE_DOCTYPE; elementChunk = readDoctype(); } else if (chunk_starts_with_cstr(fSource, "!--")) { kind = XML_ELEMENT_TYPE_COMMENT; elementChunk = readTag(); } else if (chunk_starts_with_cstr(fSource, "![CDATA[")) { kind = XML_ELEMENT_TYPE_CDATA; elementChunk = readTag(); } else if (chunk_starts_with_cstr(fSource, "/")) { kind = XML_ELEMENT_TYPE_END_TAG; elementChunk = readTag(); } else { elementChunk = readTag(); if (chunk_ends_with_char(elementChunk, '/')) kind = XML_ELEMENT_TYPE_SELF_CLOSING; } fState = XML_ITERATOR_STATE_CONTENT; mark = fSource; fCurrentElement.reset(kind, elementChunk, true); return true; } break; default: fSource++; break; } } fCurrentElement.clear(); return false; } // end of next() };
That code might have a face only a programmer could love, but it’s relatively simple to break down. The constructor takes a ByteSpan, and holds onto it as fSource. This ByteSpan is ‘consumed’, meaning, once you’ve iterated, you can’t go back. But, since ‘iteration’ is nothing more than moving a pointer in a ByteSpan, you can always take a ‘snapshot’ of where you’re at, and continue, but we won’t go into that right here. That’s going to be useful for tracking down where an error occured.
The crux of the iterator is the ‘next()’ method. This is where we look for the ‘<‘ character that indicates the start of some tag. The iterator runs between two states. You’re either in ‘XML_ITERATOR_STATE_CONTENT’ or ‘XML_ITERATOR_STATE_START_TAG’. Initially we start in the ‘CONTENT’ state, and flip to ‘START_TAG’ as soon as we see the character. Once in ‘START_TAG’, we try to further refine what kind of tag we’re dealing with. In most cases, we just capture the content, and that becomes the current element.
The iteration terminates when the current XmlElement (fCurretElement) is empty, which happems if we run out of input, or there’s some kind of error.
So, next() returns true or false. And our iterator does what it’s supposed to do, which is hold onto the current XmlElement that we have scanned. You can get to the contents of the element by using the dereference operator *, like this: *iter, or the arrow operator. In either case, they simply return the current element
const XmlElement& operator*() const { return fCurrentElement; } const XmlElement* operator->() const { return &fCurrentElement; }
Alright, in practice, it looks like this:
#include "mmap.h" #include "xmlscan.h" #include "xmlutil.h" using namespace filemapper; using namespace svg2b2d; int main(int argc, char** argv) { if (argc < 2) { printf("Usage: pullxml <xml file>\n"); return 1; } // create an mmap for the specified file const char* filename = argv[1]; auto mapped = mmap::createShared(filename); if (mapped == nullptr) return 0; // // Parse the mapped file as XML // printing out the elements along the way ByteSpan s(mapped->data(), mapped->size()); XmlElementIterator iter(s); while (iter) { ndt_debug::printXmlElement(*iter); iter++; } // close the mapped file mapped->close(); return 0; }
That will generate the following output, where the printXmlElement() function can be found in the file xmlutil.h. The individual attributes are indicated with their name followed by ‘:’, such as ‘height:’, followed by the value of the attributed, surrounded by ‘||’ markers. Each tag kind is indicated as well.
START_TAG: [svg] height: ||200|| width: ||680|| xmlns: ||http://www.w3.org/2000/svg|| SELF_CLOSING: [circle] cx: ||70|| cy: ||70|| r: ||50|| SELF_CLOSING: [circle] cx: ||200|| cy: ||70|| fill: ||#79C99E|| r: ||50|| SELF_CLOSING: [circle] cx: ||330|| cy: ||70|| fill: ||#79C99E|| r: ||50|| stroke: ||#508484|| stroke-width: ||10|| SELF_CLOSING: [circle] cx: ||460|| cy: ||70|| fill: ||#79C99E|| r: ||50|| stroke-width: ||10|| SELF_CLOSING: [circle] cx: ||590|| cy: ||70|| fill: ||none|| r: ||50|| stroke: ||#508484|| stroke-width: ||10|| END_TAG: [svg]
At this point, we have our XML “parser”. It can scan/parse enough for us to continue on our journey to parse and display SVG. It’s not the most robust XML parser on the planet, but it’s a good performer, very small and hopefully understandable. Usage could not be easier, and it does not impose a lot of frameworks, or pull in a lot of dependencies. We’re at a good starting point, and if all you wanted was to be able to parse some XML to do something, you could stop here and call it a day.
Next time around, we’re going to look into the SVG side of things, and sink deep into that rabbit hole.