You’ll find the best reference for BSON at bsonspec.org and in the implementations of BSON in the MongoDB drivers, such as Go, C, and C#. You might want to pull up bsonspec.org/spec.html to reference while reading this article.
SingleStore extended BSON to support top-level value types, as detailed in the engineering blog here.
Read on for a guided tour through the BSON format itself!
BSON starts with a 4-byte little-endian int32 representing the length of the entire document and ends with a null byte.
Thus, the smallest valid BSON according to the original spec is 0500000000. This can be seen using SingleStore Playground.
Note that in this example and others, I am specifying BSON by providing Extended JSON V2 and then converting it to BSON. When used via a MongoDB client driver and through SingleStore Kai, the data starts and remains as BSON from client to storage and no conversion is required.
SELECT HEX('{}':>JSON:>BSON);
-- length 05000000
-- end 00
The largest valid document is left unspecified in the spec, but MongoDB does not accept BSON documents larger than 16 MiB.
Negative lengths aren’t valid. Why does the spec specify a signed int32? Perhaps to aid parsers by allowing them to use ‘-1’ as a sentinel in length-handling.
Because of the length prefix, BSON documents can be concatenated together in a stream, which is the format used by the .bson files generated by mongodump and consumed by mongorestore.
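To make the streaming idea concrete, here is a minimal Python sketch that splits a buffer of concatenated documents using nothing but the length prefix. The function name `split_bson_stream` is my own, and the validation is only the bare minimum implied by the framing rules above (a length of at least 5 and a trailing null byte).

```python
import struct

def split_bson_stream(data: bytes) -> list:
    """Split concatenated BSON documents using each document's length prefix."""
    docs = []
    offset = 0
    while offset < len(data):
        if len(data) - offset < 5:
            raise ValueError("truncated document header")
        # The first 4 bytes are a little-endian int32 covering the whole document.
        (length,) = struct.unpack_from("<i", data, offset)
        if length < 5 or offset + length > len(data):
            raise ValueError(f"invalid document length {length} at offset {offset}")
        doc = data[offset:offset + length]
        if doc[-1] != 0:
            raise ValueError("document is not null-terminated")
        docs.append(doc)
        offset += length
    return docs

empty = bytes.fromhex("0500000000")                 # {}
a_zero = bytes.fromhex("0c0000001061000000000000")  # {"a": 0}
stream_docs = split_bson_stream(empty + a_zero + empty)
print([d.hex() for d in stream_docs])
```

Note that the splitter never looks inside a document; the length prefix alone is enough to find the next boundary.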
Top-level value types using the SingleStore extension end in a type-code byte instead of zero, and use value-specific encoding. These can be as short as a single byte, for example for the ‘null’ type code.
SELECT HEX('null':>JSON:>BSON);
-- type 0A
After that initial 4-byte length, there are zero or more elements. An element is a type code, followed by a null-terminated key, followed by a type-code-specific value encoding. For example,
SELECT HEX('{"a":0}':>JSON:>BSON);
-- length 0C000000
-- type 10
-- key 61 00
-- value 00000000
-- end 00
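The same bytes can be assembled by hand. This is a small Python sketch, not a real encoder: the helper names `encode_int32_element` and `encode_document` are hypothetical, and only the int32 type is handled.

```python
import struct

def encode_int32_element(key: str, value: int) -> bytes:
    """Type code 0x10, null-terminated key, little-endian int32 value."""
    if "\x00" in key:
        raise ValueError("BSON keys cannot contain a null byte")
    return b"\x10" + key.encode("utf-8") + b"\x00" + struct.pack("<i", value)

def encode_document(elements: bytes) -> bytes:
    """The length counts its own 4 bytes, the elements, and the trailing null."""
    return struct.pack("<i", 4 + len(elements) + 1) + elements + b"\x00"

doc = encode_document(encode_int32_element("a", 0))
print(doc.hex())
```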
Note that the null-terminated key means BSON keys cannot contain a null byte — one of the few things valid in JSON but not possible in BSON.
There are 21 BSON type codes detailed below.
A typical IEEE-754 double, stored in little-endian byte order. On a little-endian machine, a pointer into the BSON buffer at the value position can be directly interpreted as a double.
SELECT HEX('2.0':>JSON:>BSON);
-- value 0000000000000040
-- type 01
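As a quick check of the value bytes above, Python's `struct` module can pack a double in little-endian order. The variable names are mine.

```python
import struct

# "<d" is a little-endian IEEE-754 binary64 value.
encoded_double = struct.pack("<d", 2.0)
(decoded_double,) = struct.unpack("<d", encoded_double)
print(encoded_double.hex(), decoded_double)
```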
The BSON string type is an int32 length, followed by a UTF-8 buffer, followed by a null byte. The length counts the buffer plus the null byte but, unlike a document length, not the four length bytes themselves. UTF-8 is a great choice that has stood up over time as the format has become ubiquitous.
SELECT HEX('"abc"':>JSON:>BSON);
-- length 04000000
-- utf-8 616263
-- null 00
-- type 02
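The length rule is easy to get wrong by one, so here is a minimal sketch of encoding and decoding the string value. The function names are my own and error handling is omitted.

```python
import struct

def encode_bson_string(text: str) -> bytes:
    """int32 length (UTF-8 bytes plus the null), then the bytes, then the null."""
    payload = text.encode("utf-8") + b"\x00"
    return struct.pack("<i", len(payload)) + payload

def decode_bson_string(data: bytes) -> str:
    (length,) = struct.unpack_from("<i", data, 0)
    # Drop the trailing null that the length includes.
    return data[4:4 + length - 1].decode("utf-8")

encoded_abc = encode_bson_string("abc")
print(encoded_abc.hex())
```

The length is in bytes, not characters, so a non-ASCII string such as "é" has a length of 3 (two UTF-8 bytes plus the null).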
The value following type code 3 is a BSON document, by the same spec as the top-level document. The document bytes, like all BSON value bytes, are not sensitive to their context, so they can be copied/moved without modification. This property is true of JSON as well and might seem obvious but it’s important to make modifying BSON efficient.
For example, note that the encoded bytes of {"z":null} are the same when they are a subdocument.
SELECT HEX('{"z":null}':>JSON:>BSON);
-- 08000000 0A7A00 00
SELECT HEX('{"a":{"z":null}}':>JSON:>BSON);
-- 10000000036100 08000000 0A7A00 00 00
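The nesting above can be reproduced by plain concatenation, which is the whole point: the inner document is embedded byte for byte. A small sketch, with a hypothetical `wrap_document` helper:

```python
import struct

def wrap_document(elements: bytes) -> bytes:
    """Add the int32 length prefix and the trailing null around raw elements."""
    return struct.pack("<i", 4 + len(elements) + 1) + elements + b"\x00"

inner = wrap_document(b"\x0a" + b"z\x00")          # {"z": null}
outer = wrap_document(b"\x03" + b"a\x00" + inner)  # {"a": {"z": null}}
print(inner.hex(), outer.hex())
```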
Array is where it gets a little weird.
To quote from the spec:
Array - The document for an array is a normal BSON document with integer values for the keys, starting with 0 and continuing sequentially. For example, the array ['red', 'blue'] would be encoded as the document {'0': 'red', '1': 'blue'}. The keys must be in ascending numerical order.
With these constraints, storing the key names is redundant. For long arrays of integers, it can add significant storage overhead. Why was this choice made for BSON? I don’t know. But now the BSON ecosystem is stuck with it.
SELECT HEX('[true,false,false,true]':>JSON:>BSON);
-- length 15000000
-- 0:true 08300001
-- 1:false 08310000
-- 2:false 08320000
-- 3:true 08330001
-- end 00
-- type 04
0x30 is the ASCII code for '0', 0x31 is the ASCII code for '1', etc.
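To see where the overhead comes from, here is a sketch that encodes an array of booleans as a document with decimal-string keys, then counts the bytes spent on keys alone for a 1000-element array. The function name `encode_bool_array` is hypothetical and only booleans are handled.

```python
import struct

def encode_bool_array(values) -> bytes:
    elements = b""
    for index, value in enumerate(values):
        # Each key is the index written as an ASCII decimal string, null-terminated.
        key = str(index).encode("ascii") + b"\x00"
        elements += b"\x08" + key + (b"\x01" if value else b"\x00")
    return struct.pack("<i", 4 + len(elements) + 1) + elements + b"\x00"

encoded_array = encode_bool_array([True, False, False, True])
print(encoded_array.hex())

# Bytes spent on keys (digits plus null) for an array of 1000 elements.
key_overhead = sum(len(str(i)) + 1 for i in range(1000))
print(key_overhead)
```

For 1000 booleans that is 3890 bytes of keys against 1000 bytes of actual values.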
Binary is a way of storing arbitrary bytes in BSON. In JSON this would typically need to be base64-encoded and stored as a string. Like a string, it starts with a length, followed by a subtype byte, followed by the bytes. With binary, the length refers to the length of the buffer only, not including the subtype byte. There are a few “well-known” subtypes that some drivers handle specially. Subtype 0 is the default, the most common, and probably the one you want to use.
SELECT HEX('{"$binary":{"base64": "AAAABBBBCCCC","subType": "0"}}':>JSON:>BSON);
-- length 09000000
-- subtype 00
-- buffer 000000 041041 082082
-- type 05
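Decoding the base64 by hand confirms the nine buffer bytes. This sketch uses Python's standard `base64` module; `encode_bson_binary` is a name I made up for illustration.

```python
import base64
import struct

def encode_bson_binary(payload: bytes, subtype: int = 0) -> bytes:
    """int32 payload length (subtype byte not counted), subtype, then the payload."""
    return struct.pack("<i", len(payload)) + bytes([subtype]) + payload

raw = base64.b64decode("AAAABBBBCCCC")
encoded_binary = encode_bson_binary(raw)
print(raw.hex(), encoded_binary.hex())
```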
This is one of several “deprecated” types. Drivers still support it, but most MongoDB versions reject undefined for many purposes, such as comparisons. It shouldn’t be used.
SELECT HEX('{"$undefined": true}':>JSON:>BSON);
-- type 06
This is MongoDB’s format for identifiers. Rather than use some kind of standard UUID, this is a unique 12-byte format: a 4-byte timestamp, a 5-byte random value, and a 3-byte counter.
ObjectIDs are automatically added to documents as the value of the _id field if that field is not already present. The timestamp is big-endian so ObjectIDs are sortable with byte comparisons.
SELECT HEX('{"$oid":"AAAAAAAABBBBBBBBBBCCCCCC"}':>JSON:>BSON);
-- timestamp AAAAAAAA
-- random BBBBBBBBBB
-- counter CCCCCC
-- type 07
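Here is a small sketch that pulls the three fields out of an ObjectID and shows why a big-endian timestamp makes plain byte comparison sort by time. The function name `split_object_id` is hypothetical.

```python
def split_object_id(oid_hex: str):
    """Return (timestamp, random bytes, counter) from a 24-character hex ObjectID."""
    raw = bytes.fromhex(oid_hex)
    if len(raw) != 12:
        raise ValueError("an ObjectID is exactly 12 bytes")
    timestamp = int.from_bytes(raw[0:4], "big")  # seconds since the Unix epoch
    random_part = raw[4:9]
    counter = int.from_bytes(raw[9:12], "big")
    return timestamp, random_part, counter

oid_parts = split_object_id("AAAAAAAABBBBBBBBBBCCCCCC")

# Timestamp 1 vs. timestamp 256: byte order matches time order because the
# most significant timestamp byte comes first.
older = bytes.fromhex("00000001" + "ff" * 8)
newer = bytes.fromhex("00000100" + "00" * 8)
print(oid_parts, older < newer)
```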
A simple type, with a byte 0 for false and 1 for true.
SELECT HEX('true':>JSON:>BSON);
-- bool 01
-- type 08
One of the major advantages of BSON over JSON is the native ability to unambiguously store dates. This is a 64-bit integer representing the number of milliseconds since the Unix Epoch.
SELECT HEX('{"$date":"1970-01-01T00:00:00.001Z"}':>JSON:>BSON);
-- date 0100000000000000
-- type 09
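The conversion from a point in time to the value bytes is just integer arithmetic. A minimal sketch, assuming a timezone-aware `datetime` and truncation to whole milliseconds (`to_bson_datetime` is my own name):

```python
import struct
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_bson_datetime(moment: datetime) -> bytes:
    """Milliseconds since the Unix epoch as a little-endian signed int64."""
    delta = moment - EPOCH
    # Integer math avoids float rounding; timedelta normalizes negative values.
    millis = (delta.days * 86_400 + delta.seconds) * 1000 + delta.microseconds // 1000
    return struct.pack("<q", millis)

one_ms_after = datetime(1970, 1, 1, 0, 0, 0, 1000, tzinfo=timezone.utc)
one_ms_before = datetime(1969, 12, 31, 23, 59, 59, 999000, tzinfo=timezone.utc)
print(to_bson_datetime(one_ms_after).hex(), to_bson_datetime(one_ms_before).hex())
```

Because the integer is signed, dates before 1970 are simply negative values.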
Note that the default JSON serialization of DateTime performed by MongoDB drivers has a problem. Typically, ISO 8601 dates are lexically sortable. However, this does not hold true when the milliseconds are sometimes omitted and sometimes not. The default serialization omits the milliseconds when they are zero, leading to incorrectly sorting dates if you sort the strings. This is corrected in SingleStore’s BSON:>JSON conversion.
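The sorting problem is easy to reproduce. The two formatting functions below are my own simplified model of the behavior described above, not any driver's actual code: one omits the milliseconds when they are zero, the other always writes three digits.

```python
from datetime import datetime, timezone

def iso_millis_always(moment: datetime) -> str:
    """Always emit three fractional digits."""
    return moment.strftime("%Y-%m-%dT%H:%M:%S.") + f"{moment.microsecond // 1000:03d}Z"

def iso_millis_omitted_when_zero(moment: datetime) -> str:
    """Drop the fractional part entirely when the milliseconds are zero."""
    if moment.microsecond // 1000 == 0:
        return moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    return iso_millis_always(moment)

earlier = datetime(2020, 1, 1, 0, 0, 0, 0, tzinfo=timezone.utc)
later = datetime(2020, 1, 1, 0, 0, 0, 500_000, tzinfo=timezone.utc)

mixed = sorted(iso_millis_omitted_when_zero(m) for m in (earlier, later))
fixed = sorted(iso_millis_always(m) for m in (earlier, later))
print(mixed)
print(fixed)
```

In the mixed form the later timestamp sorts first, because '.' compares lower than 'Z'. With a fixed width the lexical order matches the chronological order.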
The BSON Null type code requires no value bytes.
SELECT HEX('null':>JSON:>BSON);
-- type 0A
Note that this value null is distinct from the “undefined” type code, and distinct from a “missing” key.
This type is made of two null-terminated strings, the pattern and the options. It’s not commonly used (this is for storing, not using, regular expressions).
SELECT HEX('{"$regularExpression":{"pattern":"abc","options":"i"}}':>JSON:>BSON);
-- pattern 61626300
-- options 6900
-- type 0B
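A sketch of the value encoding, with a hypothetical `encode_bson_regex` helper. There is no length prefix here, just two null-terminated strings. The spec also asks for the option characters to be stored in alphabetical order, so the sketch sorts them.

```python
def encode_bson_regex(pattern: str, options: str) -> bytes:
    """Two null-terminated strings: the pattern, then the sorted option characters."""
    for part in (pattern, options):
        if "\x00" in part:
            raise ValueError("null-terminated strings cannot contain a null byte")
    sorted_options = "".join(sorted(options))
    return pattern.encode("utf-8") + b"\x00" + sorted_options.encode("utf-8") + b"\x00"

encoded_regex = encode_bson_regex("abc", "i")
print(encoded_regex.hex())
```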
This deprecated type stores the name of another collection and an ObjectID. It should not be used.
This is an unusual type that I think dates from the old days when MongoDB required _id to be an ObjectID. It might not be possible to even use this type via extended JSON.
This is for storing JavaScript code in the database. I don’t believe this type has much purpose in modern MongoDB. It’s stored like a string but with a different type code.
SELECT HEX('{"$code":"hi"}':>JSON:>BSON);
-- length 03000000
-- code 686900
-- type 0D
Symbol is another type that’s just like string with a different type code.
SELECT HEX('{"$symbol":"hi"}':>JSON:>BSON);
-- length 03000000
-- symbol 686900
-- type 0E
JavaScript with Scope is a combined string and nested document. It’s deprecated and unused.
SELECT HEX('{"$code":"hi","$scope":{"a":1}}':>JSON:>BSON);
-- length 17000000
-- code length 03000000
-- code 686900
-- scope length 0C000000
-- scope 'a' 10610001000000
-- scope end 00
-- type 0F
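For reference, here is a sketch that builds this value following the layout in the spec: an int32 total length, then a string, then a document, with no extra terminator after the scope document. The helper names are my own.

```python
import struct

def wrap_document(elements: bytes) -> bytes:
    return struct.pack("<i", 4 + len(elements) + 1) + elements + b"\x00"

def encode_code_with_scope(code: str, scope_doc: bytes) -> bytes:
    code_bytes = code.encode("utf-8") + b"\x00"
    body = struct.pack("<i", len(code_bytes)) + code_bytes + scope_doc
    # The outer length counts itself plus the string and the scope document.
    return struct.pack("<i", 4 + len(body)) + body

scope = wrap_document(b"\x10" + b"a\x00" + struct.pack("<i", 1))  # {"a": 1}
code_with_scope = encode_code_with_scope("hi", scope)
print(code_with_scope.hex())
```

The total comes to 4 + 4 + 3 + 12 = 23 bytes, which is the 0x17 in the length prefix.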
This type frustrates authors of BSON serializers (or at least, me) because it complicates the nesting. Beyond objects and arrays, now the serializer has to deal with a third nesting type. For a type that is long-deprecated, the ecosystem and libraries still pay a price.
Not much to say about this one!
SELECT HEX('1':>JSON:>BSON);
-- value 01000000
-- type 10
SELECT HEX('{"$timestamp": {"t": 1, "i": 2}}':>JSON:>BSON);
-- incr 02000000
-- time 01000000
-- type 11
Timestamp is a 4-byte increment followed by a 4-byte time. It is not commonly used by MongoDB clients.
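A short sketch of the layout, using a hypothetical `encode_bson_timestamp` helper. Writing the increment first and the time second in little-endian order is the same as writing one unsigned 64-bit integer with the time in the high 32 bits, so comparing those integers orders by time first and increment second.

```python
import struct

def encode_bson_timestamp(time_seconds: int, increment: int) -> bytes:
    """4-byte increment followed by 4-byte time, both little-endian unsigned."""
    return struct.pack("<II", increment, time_seconds)

encoded_ts = encode_bson_timestamp(1, 2)
as_uint64 = struct.pack("<Q", (1 << 32) | 2)
print(encoded_ts.hex(), encoded_ts == as_uint64)
```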
Not much to say about this one either!
SELECT HEX('{"$numberLong":"1"}':>JSON:>BSON);
-- value 0100000000000000
-- type 12
IEEE-754-2008 provides for a 16-byte base-10 decimal type. This helps avoid certain odd behaviors with base-2 floating-point types.
SELECT HEX('{"$numberDecimal":"100.00"}':>JSON:>BSON);
-- value 10270000000000000000000000003C30
-- type 13
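The bytes above can be reproduced with integer arithmetic. This is a sketch for finite values only (no NaN or Infinity), assuming the binary integer decimal layout from IEEE 754-2008: a sign bit, a 14-bit exponent biased by 6176, and a 113-bit coefficient. The function name `encode_decimal128` is hypothetical.

```python
def encode_decimal128(coefficient: int, exponent: int, negative: bool = False) -> bytes:
    """Encode coefficient * 10**exponent as 16 little-endian bytes."""
    if not 0 <= coefficient < 10**34:
        raise ValueError("coefficient must fit in 34 decimal digits")
    if not -6176 <= exponent <= 6111:
        raise ValueError("exponent out of range")
    bits = (int(negative) << 127) | ((exponent + 6176) << 113) | coefficient
    return bits.to_bytes(16, "little")

# 100.00 is the coefficient 10000 with exponent -2, which preserves the trailing zeros.
encoded_decimal = encode_decimal128(10000, -2)
print(encoded_decimal.hex())
```

Note that 100.00 and 100 encode differently: the format keeps the number of decimal places written by the user.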
One oddness of this type is in how it compares to doubles. Not every number can be represented exactly in both base-2 and base-10. 1 can, but 0.1, for example, cannot.
// returns documents with a: true
db.collection.aggregate([
  {
    $addFields: {
      a: { $eq: [NumberDecimal("1"), 1.0] }
    }
  }
])
// returns documents with a: false
db.collection.aggregate([
  {
    $addFields: {
      a: { $eq: [NumberDecimal("0.1"), 0.1] }
    }
  }
])
The reason for this is that in IEEE-754 double, 0.1 is actually stored as roughly the following:
0.1000000000000000055511151231257827021181583404541015625
Try it here. Decimal128, by contrast, can store 0.1 exactly.
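Python's standard `decimal` module shows the same thing without a database: constructing a `Decimal` from a float exposes the exact value the double actually holds. This only illustrates the arithmetic; it says nothing about how MongoDB implements the comparison.

```python
from decimal import Decimal

exact_double = Decimal(0.1)   # the exact value stored in the binary double
exact_decimal = Decimal("0.1")  # the decimal value the user wrote

print(exact_double)
print(exact_double == exact_decimal)  # the two values differ
print(Decimal(1.0) == Decimal("1"))   # 1 is exact in both bases
```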
This is a curious novelty for users, but behind the scenes it gets tricky for implementors. Since MongoDB allows all numbers to be comparable, the comparisons between numbers of different bases need to be exact and correct (as with the above).
This type can be useful for financial applications and other applications where exact decimal arithmetic is required, but it can be slower than the other number types for some applications.
MinKey is a special value that compares lower than every other value. It can occasionally be useful in queries.
MaxKey is the same thing, but the other way around.
Many other formats are more complicated than BSON. BSON’s simplicity is one of its strengths. Despite its rough edges, it has carried the MongoDB ecosystem for over a decade and will likely continue to do so for the next.