Should You Use XML or Protocol Buffers to Store and Exchange Data?
Extensible Markup Language (XML) is a flexible text format used for a wide variety of applications, including data serialization and data exchange. SOAP web services, for example, use XML as their data exchange format.
More recently, protocol buffers were also introduced for data serialization and exchange. Although XML and protocol buffers serve the same purpose, they are very different technologies.
Tags and fields
XML uses tags to describe or contain data. The following XML example has an opening tag, data, and a closing tag:
<hello>Hello</hello>
The opening tag name and closing tag name must match.
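The matching rule can be checked with any XML parser. The following is a minimal sketch that uses Python's standard-library ElementTree module (the choice of parser is an assumption made for illustration, not something the comparison depends on):

```python
import xml.etree.ElementTree as ET

# A well-formed element: the closing tag repeats the opening tag's name.
element = ET.fromstring("<hello>Hello</hello>")
print(element.tag, element.text)  # hello Hello

# A closing tag with a different name is rejected by the parser.
try:
    ET.fromstring("<hello>Hello</goodbye>")
    mismatch_rejected = False
except ET.ParseError:
    mismatch_rejected = True

print(mismatch_rejected)  # True
```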
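Protocol buffers use fields. Each field has a name, a value type, and a unique field number that identifies it in the encoded message. The value type can be a number (integer or floating point), a boolean, a string, raw bytes, an enumeration, or another message type.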
Plain text and binary encoding
XML is a plain-text format. A protocol buffer’s message type, or schema, is written in plain text, but the data or message that is actually exchanged is binary. Protocol buffers are therefore a poor fit if the exchanged or serialized data needs to be human-readable plain text.
Protocol buffers support code generation for several languages, including C++ and Java. The protocol buffer compiler (protoc) converts the plain-text message type definition (.proto file) into language-specific data access classes. A language-specific API can then be used to send and receive binary protocol buffers messages.
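Even without running the compiler, the difference between text and binary can be illustrated by encoding one field by hand. The sketch below assumes a hypothetical message with a single string field numbered 1, and it implements only the two pieces of the wire format it needs (varints and length-delimited fields). It is not a substitute for the generated classes.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode a string field as key, byte length, then UTF-8 payload."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload


# Hypothetical field number 1 carrying the same value as the XML example.
xml_bytes = "<hello>Hello</hello>".encode("utf-8")
proto_bytes = encode_string_field(1, "Hello")

print(xml_bytes)          # readable markup
print(proto_bytes.hex())  # 0a0548656c6c6f
```

The encoded message carries the field number and the value, but not the field name, which exists only in the .proto file.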
Semi-structured and structured data
XML is a semi-structured data format with optional schema conformance. A schema is not a prerequisite for creating an XML document, but one can be used to constrain and validate the document's structure. An XML schema is itself an XML document, which is self-describing.
Protocol buffers need a schema to define the structure of data. The message type in protocol buffers is defined in a .proto file, which constitutes the schema. Protocol buffers are not self-describing. Data exchanged as protocol buffers cannot be interpreted without the associated schema.
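What "not self-describing" means in practice can be seen by decoding a message without its .proto file. The following sketch is a deliberately minimal reader that handles only varint and length-delimited fields; the sample bytes represent a hypothetical two-field message invented for illustration.

```python
def read_varint(data: bytes, pos: int):
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_without_schema(data: bytes):
    """Return (field_number, wire_type, raw_value) tuples; names are unknown."""
    fields = []
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields.append((field_number, wire_type, value))
    return fields


# Hypothetical message: field 1 = 42 (varint), field 2 = "Hello" (length-delimited).
message = bytes([0x08, 0x2A, 0x12, 0x05]) + b"Hello"
decoded = decode_without_schema(message)
print(decoded)  # [(1, 0, 42), (2, 2, b'Hello')]
```

The reader recovers field numbers and raw values, but it cannot tell what the fields are called, or whether the bytes in field 2 are a string, raw bytes, or a nested message. That information lives only in the schema.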
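The schema specifications for XML and protocol buffers have some similarities. XML elements can be made required, optional, or repeated through occurrence constraints (minOccurs and maxOccurs in XML Schema). Fields in a .proto file can be declared optional or repeated, and the proto2 syntax also allows required fields, which proto3 dropped.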
Hierarchical data
Both XML and protocol buffers support hierarchical data. XML nests elements within other elements, and protocol buffers allow a field's value type to be another message type.
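The two nesting styles can be sketched side by side. In the example below, the person and address structure, the field numbers, and the helper name are all hypothetical. The protocol buffers half hand-encodes the nested message as a length-delimited field, under the simplifying assumption that every key and length fits in a single byte.

```python
import xml.etree.ElementTree as ET

# XML: nesting is expressed by placing elements inside other elements.
person = ET.fromstring(
    "<person><name>Ada</name><address><city>London</city></address></person>"
)
city = person.find("address/city").text
print(city)  # London


def length_delimited(field_number: int, payload: bytes) -> bytes:
    """Encode a length-delimited field; assumes field_number < 16 and len < 128."""
    return bytes([(field_number << 3) | 2, len(payload)]) + payload


# Protocol buffers: the inner message is encoded first, then embedded
# as the payload of a field in the outer message.
address_msg = length_delimited(1, b"London")      # Address { city = 1 }
person_msg = (
    length_delimited(1, b"Ada")                   # Person { name = 1 }
    + length_delimited(2, address_msg)            # Person { address = 2 }
)
print(person_msg.hex())
```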
Protocol buffers or XML?
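XML places several requirements on a document:
- It should start with an XML declaration, which is optional in XML 1.0 but mandatory in XML 1.1 (the declaration looks like a processing instruction, although the XML specification does not classify it as one)
- It must have exactly one root element
- It must be well formed, with properly nested, matching start and end tags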
In contrast, a protocol buffer’s textual representation of a message is just a sequence of name-value pairs.
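The root-element and matching-tag requirements listed above are enforced by XML parsers. The sketch below uses Python's standard-library ElementTree; the helper name is_well_formed and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


declaration = '<?xml version="1.0"?>'

single_root = declaration + "<catalog><item>1</item></catalog>"
two_roots = declaration + "<item>1</item><item>2</item>"
unclosed_tag = declaration + "<catalog><item>1</catalog>"

print(is_well_formed(single_root))   # True
print(is_well_formed(two_roots))     # False
print(is_well_formed(unclosed_tag))  # False
```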
Binary-encoded protocol buffers messages are smaller than equivalent XML messages and take less time to parse. They are also easier to access programmatically: the generated data access classes expose typed fields directly, whereas XML requires a parser or binding API.
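The size difference can be made concrete with a small, hypothetical record. The sketch below serializes the same three values as XML (via ElementTree) and as a hand-encoded protocol buffers message. The record layout and field numbers are assumptions, and the hand encoder stands in for the generated classes a real project would use.

```python
import xml.etree.ElementTree as ET


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def varint_field(field_number: int, value: int) -> bytes:
    """Encode an integer field (wire type 0)."""
    return encode_varint(field_number << 3) + encode_varint(value)


def string_field(field_number: int, text: str) -> bytes:
    """Encode a string field (wire type 2, length-delimited)."""
    payload = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


# The record as XML.
person = ET.Element("person")
ET.SubElement(person, "id").text = "150"
ET.SubElement(person, "name").text = "Ada"
ET.SubElement(person, "email").text = "ada@example.com"
xml_bytes = ET.tostring(person)

# The same record as a message with hypothetical fields id = 1, name = 2, email = 3.
proto_bytes = (
    varint_field(1, 150)
    + string_field(2, "Ada")
    + string_field(3, "ada@example.com")
)

print(len(xml_bytes), len(proto_bytes))
```

This compares size only. Parse time depends on the libraries involved and would need to be measured separately.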
Use XML if the data to be stored, presented, or exchanged needs to be modeled as text; if the data needs to be self-describing; or to model another text-based markup document, such as HTML.
Otherwise, use protocol buffers. The messages exchanged are smaller, are faster to parse, and require less network bandwidth.