Using SQLite as an On-disk File Format, Part 2
Herein we lay the groundwork for our application by writing a typical serialization module. Written in C++ these 75 lines of code are able to serialize any graph objects (including graphs with loops).
In Part 1 of this series, I gave a very cursory explanation of why I want to utilize SQLite as an on-disk file format. In this part, I would like to take a brief look at how I have traditionally written serialization frameworks and to code a simplified one right here.
I am going to write it in ISO C++ in an object-oriented style in order to balance it with the karma of my Win32 article. Also, in truth this article is only going to give half of the serialization story—writing from our in-memory representation of objects to the store. The second half, reading that stored data to re-create our object graph will be tackled in the next article.
On Classes
Let’s begin by considering the metadata of an object. In languages like C# and Lisp, we are given a very rich view of the type of an object. In C# we have the reflection library to query an objects type and to examine that type (we can also create types on the fly but will reserve doing that for the time being).
C++ offers very little in this realm. We are given access to little poorly defined things called type_infos via the special operator typeid. These objects give only a small (undefined) glimpse of the object’s type and its relationship to other types, and nothing else.
We can easily build a serialization system in C# because it has a rooted object hierarchy and a rich reflection system. We can build a serialization system in C++ only by enhancing it slightly to include the barest of metadata that we need.
Our metadata will contain a class name and a version number for every class that can be serialized.
class ClassInfo
{
public:
ClassInfo(const wchar_t *name, int version)
: version_(version), name_(name) { }
int Version() const { return version_; }
const wchar_t *Name() const { return name_; }
private:
int version_;
const wchar_t *name_;
};
We will enforce this by declaring an ISerializable interface and dictating that any class that wants to be serialized must fully implement the ISerializable interface.
class ISerializable
{
public:
virtual ~ISerializable() { }
virtual ClassInfo GetClassInfo() const = 0;
…
};
We will add to this interface as we require more of the classes to be serialized, but this is sufficient for the time being.
We have added a version field to the class information so that in the future we can change our class definitions but still have the potential to load them stores that contain classes of earlier versions. This will be discussed further when it comes time to deserialize these objects.
The Serialization Scheme
We wish to be able to serialize whole graphs of objects—even graphs in which the same object appears multiple times.
The simplest way to represent such a graph is to output all the data and metadata of a class whenever it is encountered. In the case of a graph where an object is repeated, two entire copies of that object will be written to the store. That’s one problem. An even bigger problem with this scheme occurs when we have graphs that contain loops. Consider for instance a linked list where the tail points back to the head of the list. If we attempted to serialize objects as we encounter them, then we would end up serializing for an eternity (or at least until the hard disk filled up).
In order to avoid this, we need to differentiate between objects and references to objects. In the linked list example, all we really want to output are references to objects. That way, when we get to the tail of the list that needs to point back to the beginning, we need only output a reference to the head, not the head itself.
We will adopt the following policy during serialization in order to make this work: when we are asked to serialize an object, we will first check to see if we have already serialized it. If we have not, then we will write out the object (its data and metadata). If we already have serialized it, then we will write only a reference to the object (its metadata).
Ignore for a minute the difficulties that this will have in a multithreaded environment and just consider the simplicity of this solution. We can now represent arbitrary graphs of objects—even those that contain loops.
A Matter of Style
We of course still need to select an output medium for our objects. Of course the whole point of this article series is to explore the use of SQLite as this medium but we currently only want to consider the simple case of serialization.
For that reason, we will write our data to a simple XML file. Every object that we write will consist of an <Object> tag whose children are the various fields of the object being serialized. References to objects will likewise be specified with <Reference> tags.
The fields of objects (which can themselves be objects) are written in enclosing tags named after the field itself.
As an example, consider a Point object that contains two fields: X and Y. The XML for this object will look something like:
<Object class="Point" version="1" id="1"><X>0</X><Y>0</Y></Object>
Yes it’s verbose but its format is very simple. Now, if we wanted to reference this same object, then we would simply write:
<Reference class="Point" version="1" id="1">
On to the Code!
Now that we have specified how our serializer will work, it is a very simple matter to implement it. We shall create a class called Serializer that will orchestrate this whole thing.
That class is responsible for writing all data in the XML format and for maintaining the list of serialized objects so that loops can be avoided.
Currently, we will only support writing integers and conglomerate objects (those derived from ISerializable). When you are implementing a richer system, you will need to provide more fundamental object support (floating point numbers, strings, arrays, etc.).
class Serializer
{
public:
Serializer();
void Serialize(const wchar_t *name, const ISerializable &obj);
void Serialize(const wchar_t *name, int n);
private:
typedef map<const ISerializable*, int> ObjectIds;
int depth_;
ObjectIds objIds_;
int objCount_;
};
The three fields, depth_, objIds_, and objCount_ are all used for managing objects and their references. Whenever an object is serialized, the Serializer assigns a unique identifier (an int) to that object and records its address. If that object is ever attempted to be serialized again, the Serializer will know to simply output a reference to it.
This scheme works just fine in single threaded environments, but becomes much more difficult to implement if the object graph is concurrently changing along with being serialized. This is one scenario that we must think hard about whether supporting. In this implementation we restrict ourselves to requiring the object graph be frozen while serialized. This is another area that we will have to revisit when considering the implementation of this serialization with SQLite.
Serializing an integer is the easiest operation that the Serializer performs since we will not differentiate between objects and references with integers. Instead, we will simply output their value wherever needed:
void Serializer::Serialize(const wchar_t *name, int n)
{
//
// If the depth is > 0, then we are serializing some fields of a parent
// object. We must therefore label the field.
//
if (depth_ > 0) {
wcout << L"<" << name << L">" << n << L"</" << name << L">";
}
else {
wcout << n;
}
}
We see that every object being serialized is given a name. This name is used only when an object is the field of another object. That condition is monitored using the depth_ variable of the Serializer class. If the depth is greater than 0, then the object being serialized is a field of a larger conglomerate object. Given our schema defined earlier, we tag these fields with their names. If the object is not a field, then we simply output its value.
Also, for the sake of simplicity, all data is being dumped to the console. In reality we would have to pass around some “SerializationContext” that dictated which IO stream should be used for output.
Serializing objects is necessarily more difficult as we have to manage references. However, you will find that even with this complexity, the function is still easy to understand:
void Serializer::Serialize(const wchar_t *name, const ISerializable &obj)
{
ClassInfo info = obj.GetClassInfo();
//
// If the depth is > 0, then we are serializing some fields of a parent
// object. We must therefore label the field.
//
if (depth_ > 0) {
wcout << L"<" << name << L">";
}
//
// Have we already serialized this object?
//
ObjectIds::iterator iObj = objIds_.find(&obj);
if (iObj != objIds_.end()) {
//
// Serialize just a reference.
//
wcout << L"<Reference class=\"" << info.Name() <<
L"\" version=\"" << info.Version() <<
L"\" id=\"" << (*iObj).second << "\" />";
}
else {
//
// Serialize the object itself and add it to the serialized table.
//
int id = objCount_++;
objIds_[&obj] = id;
wcout << L"<Object class=\"" << info.Name() <<
L"\" version=\"" << info.Version() <<
L"\" id=\"" << id << "\">";
depth_++;
obj.Serialize(*this);
depth_--;
wcout << L"</Object>";
}
//
// End the labeled object.
//
if (depth_ > 0) {
wcout << L"</" << name << L">";
}
}
Once again we have the simple field tagging based upon the depth of this object. After that we follow one of two paths: the first if the object has already been serialized and the second if the object has not yet been serialized.
When it comes time for an object to output its data, we call the Serialize member function of that object. In the implementation of that function, the object must call back to the serializer with any data it needs serialized. For example, the Point class from above has this Serialize function:
void Point::Serialize(Serializer &s) const
{
s.Serialize(L"X", X);
s.Serialize(L"Y", Y);
}
In their Serialize functions, objects must simply enumerate through their fields and output whatever is needed to reconstruct the object at a later time.
As a word of warning, the Serialize method of the Serializer is not yet at production quality. In particular, we do no error checking and haven’t thoroughly examined the ramifications of exceptions being thrown in the object Serialize functions (it would be nice to give it a no throw specification). However, it is sufficient to be a starting place for our serialization system.
An Example
Let us consider a linked list of nodes that each contains an integer. The class declaration for the node is:
class Node : public ISerializable
{
public:
Node(int value) : pNext_(0), value_(value) { }
virtual ~Node() { }
void SetNext(Node *pNext) { pNext_ = pNext; }
virtual ClassInfo GetClassInfo() const;
virtual void Serialize(Serializer &s) const;
private:
Node *pNext_;
int value_;
};
And the implementation of the ISerializable interface is:
ClassInfo Node::GetClassInfo() const
{
return ClassInfo(L"Node", 1);
}
void Node::Serialize(Serializer &s) const
{
s.Serialize(L"Value", value_);
s.Serialize(L"Next", *pNext_);
}
Now, let’s consider what happens when we try to serialize a loop of four of these nodes. When we serialize the “first” in this loop, its metadata will be output by Serializer::Serialize and its data will be output by Node::Serialize because it has not been serialized before. This will continue for the second and third nodes: each one driving the next node to be serialized. When it comes time for the fourth node to serialize its next pointer, Serializer::Serialize will realize that the first node has already been serialized and will therefore output a mere reference to that node.
That’s what should happen, now let’s see what does happen by constructing a simple test:
int main()
{
Node n3(300);
Node n2(200);
Node n1(100);
Node n0(0);
n0.SetNext(&n1);
n1.SetNext(&n2);
n2.SetNext(&n3);
n3.SetNext(&n0);
Serializer s;
s.Serialize(L"n0", n0);
}
Here, we setup our graph and let the Serializer do its work. The output of this program is the following XML:
<Object class="Node" version="1" id="0">
<Value>0</Value>
<Next>
<Object class="Node" version="1" id="1">
<Value>100</Value>
<Next>
<Object class="Node" version="1" id="2">
<Value>200</Value>
<Next>
<Object class="Node" version="1" id="3">
<Value>300</Value>
<Next>
<Reference class="Node" version="1" id="0" />
</Next>
</Object>
</Next>
</Object>
</Next>
</Object>
</Next>
</Object>
We see that indeed the Serializer operated as expected and fully recreated our loop.
On to Deserialization
Well I think that is enough for all of us to swallow today, so let’s take a break and move on to deserialization of this XML file next week. I’ll see you all then!

Reader Comments
Post a Comment