In this post, I am going to explain the basics of Protocol Buffers version 3 (proto3). Protocol Buffers is a data serialization format developed by Google. There are many data formats, such as CSV and JSON, but each has its weaknesses. CSV is easy to handle, but data types have to be inferred and parsing gets harder when the data itself includes commas. JSON is used in many places, travels well over the web and is very flexible, but it does not enforce a schema, and JSON objects can become quite large because the keys are repeated in every object.
Advantages of Protocol Buffers
- Data is fully typed
- Data is serialized into a compact binary format, which means smaller payloads and typically less CPU time spent on parsing
- A schema is required to generate code and read the data
- Documentation can be part of the schema
- Supports multi-language communication: data can be shared between programs written in different languages (Java, Python, Go, JavaScript and others)
- Schema can evolve over time in a safe way
- Code is generated automatically for convenience
Disadvantages
- Not all languages are supported
- Since data is serialized into a binary format, you can't open and read the data file with a text editor
Example
This is an example of a protocol buffer schema. We will take a look at each piece.
syntax = "proto3";

message Person {
  int32 age = 1;
  string first_name = 2;
  string last_name = 3;
  bytes profile_img = 4;
  bool verified = 5;
  float height = 6;
  repeated string contacts = 7;
}
Schema
The first line must be syntax = "proto3"; to indicate that the file uses Protocol Buffers version 3. If you want to use version 2, write "proto2" instead. If the line is missing, the compiler assumes proto2.
Each message definition starts with the keyword message, followed by the message name and a pair of braces that enclose its fields.
In the schema, you can have multiple fields. Each field consists of a field type, a field name and a tag. The first word is the field type, such as int32, string, bytes, bool or float in the example. Next comes the field name, which you can choose freely. It is mainly there for readability. The last part is the tag, which matters more than the field name because it is what protocol buffers actually use to identify the field in the serialized data. Let's take a look at each part.
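To make the role of the tag more concrete, here is a minimal Python sketch that encodes two fields of the Person message by hand. It is not the code that protoc generates. The helper names (encode_varint, field_key, encode_person) are my own, and the sketch only covers the varint and length-delimited wire types.

```python
VARINT = 0            # wire type used for int32, bool, enum, ...
LENGTH_DELIMITED = 2  # wire type used for string, bytes, nested messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def field_key(tag: int, wire_type: int) -> bytes:
    """Each field starts with a key that combines the tag and the wire type."""
    return encode_varint((tag << 3) | wire_type)


def encode_person(age: int, first_name: str) -> bytes:
    """Hand-encode 'int32 age = 1' and 'string first_name = 2' (illustration only)."""
    name = first_name.encode("utf-8")
    return (
        field_key(1, VARINT) + encode_varint(age)
        + field_key(2, LENGTH_DELIMITED) + encode_varint(len(name)) + name
    )


encoded = encode_person(30, "Ann")
print(encoded.hex(" "))  # 08 1e 12 03 41 6e 6e
```

The bytes contain the tags 1 and 2 (inside the keys 08 and 12) and the values, but the names age and first_name never appear.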
Field Types
There are multiple built-in types supported in proto3. I will not explain much about each type, as they look very similar to the types in other languages like C/C++ and Java.
Integers
type: int32, int64, uint32, uint64, sint32, sint64
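One difference between these types that is worth a quick illustration is how int32 and sint32 handle negative numbers. The sketch below is hand-written Python, not generated code, and the function names are my own. It assumes the standard varint encoding for int32 and the ZigZag encoding for sint32.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_int32(value: int) -> bytes:
    # int32: a negative value is sign-extended to 64 bits before encoding
    return encode_varint(value & 0xFFFFFFFFFFFFFFFF)


def encode_sint32(value: int) -> bytes:
    # sint32: ZigZag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    zigzag = ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF
    return encode_varint(zigzag)


for n in (1, -1, 300):
    print(n, len(encode_int32(n)), len(encode_sint32(n)))
# 1 1 1
# -1 10 1
# 300 2 2
```

So if a field often holds negative values, sint32 or sint64 is likely the more compact choice.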
Floating Point Numbers
type: float (32 bits), double (64 bits)
Boolean
type: bool
String
A string must always contain UTF-8 encoded or 7-bit ASCII text.
type: string
Bytes
Raw byte array. Interpretation of bytes depends on the code.
type: bytes
Repeated Fields
Protocol buffers support lists or arrays through the "repeated" keyword. A repeated field can hold any number (0 or more) of elements. After the repeated keyword, you specify the element type. See the contacts field in the Person example above.
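As a rough illustration of what "0 or more elements" means in the serialized data, the Python sketch below hand-encodes a repeated string field such as contacts (tag 7). The function names are my own and this is not generated code. It assumes the usual encoding for repeated strings, where every element is written as its own key, length and value.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_repeated_strings(tag: int, values: list) -> bytes:
    """Write one key/length/value record per element (wire type 2)."""
    out = bytearray()
    for value in values:
        data = value.encode("utf-8")
        out += encode_varint((tag << 3) | 2)  # key: tag plus wire type
        out += encode_varint(len(data))       # length of the string in bytes
        out += data
    return bytes(out)


print(encode_repeated_strings(7, []))            # b''
print(encode_repeated_strings(7, ["hi", "yo"]))  # b':\x02hi:\x02yo'
```

An empty list produces no bytes at all, and each extra element simply adds one more record with the same tag.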
Enums
If a field can only take values that are known in advance (e.g., the day of the week), you can use an enum type.
Please note that the first value of an enum is the default value, and it must be assigned the number 0. Here is an example of an enum. Once it is defined, you can use the enum as a field type just like the built-in types.
enum DayOfWeek {
  UNDEFINED = 0;
  MONDAY = 1;
  TUESDAY = 2;
  WEDNESDAY = 3;
  THURSDAY = 4;
  FRIDAY = 5;
  SATURDAY = 6;
  SUNDAY = 7;
}
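To show what the default value means in practice, here is a small Python mirror of the enum. It uses the standard-library enum module rather than the classes protoc would generate, and read_day is a hypothetical helper that I made up for this illustration.

```python
from enum import IntEnum
from typing import Optional


class DayOfWeek(IntEnum):
    """Hand-written mirror of the DayOfWeek enum (not generated code)."""
    UNDEFINED = 0
    MONDAY = 1
    TUESDAY = 2
    WEDNESDAY = 3
    THURSDAY = 4
    FRIDAY = 5
    SATURDAY = 6
    SUNDAY = 7


def read_day(wire_value: Optional[int]) -> DayOfWeek:
    """Hypothetical helper: a field that was never set reads back as value 0."""
    return DayOfWeek(0 if wire_value is None else wire_value)


print(read_day(None).name)  # UNDEFINED
print(read_day(5).name)     # FRIDAY
```

This is why it helps to reserve 0 for a neutral name such as UNDEFINED instead of a real day.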
Tag
In protocol buffers, field names are not important for the serialized data because they are not used in the actual communication. Instead, the tag is used, which makes it a very important element. In the example above, every field name is followed by a number. Those numbers are the tags. The smallest tag you can use is 1 and the largest is 2^29 - 1, or 536,870,911.
Tags numbered from 1 to 15 take 1 byte to encode (that byte also carries the field's wire type). It is recommended to use them for frequently populated fields.
Tags numbered from 16 to 2047 take 2 bytes.
Please note that the numbers from 19000 to 19999 are reserved for the Protocol Buffers implementation, so you cannot use them in your own schemas.
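The byte counts for tags can be checked with a few lines of Python. This is a minimal sketch with helper names of my own (encode_varint, key_size). It assumes the standard encoding, where the tag is shifted left by 3 bits, combined with the wire type and written as a varint.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def key_size(tag: int, wire_type: int = 0) -> int:
    """Number of bytes needed for the key of a field with this tag."""
    return len(encode_varint((tag << 3) | wire_type))


MAX_TAG = 2**29 - 1  # 536870911

for tag in (1, 15, 16, 2047, 2048, MAX_TAG):
    print(tag, key_size(tag))
# 1 1
# 15 1
# 16 2
# 2047 2
# 2048 3
# 536870911 5
```

Only 4 bits of the first byte are left for the tag once the wire type and the continuation bit are taken out, which is why the 1-byte range stops at 15.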
Conclusion
We have taken a look at the very basics of protocol buffers. Please continue reading for more about protocol buffers.