PolarSPARC
Introduction to Google Protocol Buffers
Bhaskar S | 07/04/2020
Overview
Google Protocol Buffers (sometimes referred to as protobuf) is a Data Serialization framework with the following features:
Is platform neutral
Is language neutral with official support for C++, C#, Go, Java, and Python
Is compact, efficient, fast, and flexible
Supports schema evolution
Supports basic scalar types such as bool, bytes, double, fixed32, fixed64, float, int32, int64, sint32, sint64, string, uint32, uint64, etc
Supports complex types such as enum, map, message, etc
Data elements of custom type (the data format) are defined in a .proto file
Installation and Setup
The installation is performed on an Ubuntu 20.04 LTS based Linux desktop.
We need to install the packages for the protobuf compiler called protobuf-compiler and the Python language bindings called python3-protobuf, from the Ubuntu repository.
To install the mentioned packages, execute the following commands:
$ sudo apt-get update
$ sudo apt-get install protobuf-compiler -y
$ sudo apt-get install python3-protobuf -y
For Java language bindings, we will need the JAR file called protobuf-java-3.x.y.jar, where x.y is the current version. At the time of this article, the current version is 3.11.4. We will leverage Maven to manage the Java dependencies.
The following is the Maven pom.xml file we will use:
Once the data format is defined and saved in a .proto file, we run the protobuf compiler (called protoc) on the .proto file to generate the data access classes for the desired language bindings (Java, Python, etc).
In this article, we will demonstrate serialization and deserialization using both the Java and Python language bindings.
Hands-on with Google Protocol Buffers
We will demonstrate the ability to both serialize and deserialize a simple object Customer with scalar types using protobuf.
The following is the schema definition for a Customer object defined in the file Customer.proto located in the directory src/main/proto as shown below:
Google Protocol Buffers version 3, referred to as proto3, is selected using the syntax keyword.
To avoid any namespace collisions, we use the package keyword.
A Customer object is defined using the message keyword. Each field within the Customer object is defined using the following general syntax:
[repeated] <field-type> <field-name> = <field-tag>
where,
The optional keyword repeated marks a field that can occur zero or more times
<field-type> specifies the data type and can be any scalar type or complex type
<field-name> specifies the name of the field
<field-tag> specifies a unique integer representation for the field and should *NOT* be changed once in use. Integers 19000 through 19999 are reserved and cannot be used
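The reason a field tag should not change once in use is that the encoded data identifies each field by its tag, not by its name. The following is a minimal sketch in plain Python (it does not use the protobuf package, and the helper names encode_varint and encode_key are hypothetical) of how a tag and a wire type are combined into the key that precedes each field value:

```python
# Illustrative only: hand-rolled helpers, not part of the protobuf library.

WIRE_VARINT = 0            # int32, int64, uint32, uint64, sint32, sint64, bool, enum
WIRE_LENGTH_DELIMITED = 2  # string, bytes, embedded messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_key(field_tag: int, wire_type: int) -> bytes:
    """Build the key written before every field value."""
    if field_tag < 1 or field_tag > 2**29 - 1:
        raise ValueError("field tag out of range")
    if 19000 <= field_tag <= 19999:
        raise ValueError("field tags 19000 through 19999 are reserved")
    return encode_varint((field_tag << 3) | wire_type)


print(encode_key(1, WIRE_LENGTH_DELIMITED).hex())   # 0a
print(encode_key(15, WIRE_LENGTH_DELIMITED).hex())  # 7a
print(len(encode_key(16, WIRE_LENGTH_DELIMITED)))   # 2
```

Tags 1 through 15 fit in a single key byte, so it is common to give the most frequently used fields the lowest tags.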
Let us now compile the Customer.proto file for Java and Python.
For Java binding, run Maven compile. This will generate a Java file called CustomerOuterClass.java in the directory target/generated-sources/main/java/com/polarsparc/protobuf3.
Let us create a Java program called CustomerTest.java to use the generated class in the directory src/main/java/com/polarsparc/protobuf3 as shown below:
The Java class(es) generated by the protobuf compiler are all immutable. To construct a Customer object, one must first use the corresponding builder class (called Customer.Builder) to set the field values and then call the build() method to get the object.
Executing the above Java program CustomerTest.java produces the following results as shown in Output.1 below:
Output.1
Customer fields: {com.polarsparc.protobuf3.Customer.first_name=Bugs, com.polarsparc.protobuf3.Customer.last_name=Bunny, com.polarsparc.protobuf3.Customer.email_id=bugs.b@carrot.co, com.polarsparc.protobuf3.Customer.phone_no=[100-100-1000, 100-100-1005]}
Customer data size: 59
Customer:
first_name: "Bugs"
last_name: "Bunny"
email_id: "bugs.b@carrot.co"
phone_no: "100-100-1000"
phone_no: "100-100-1005"
Customer deserialized:
first_name: "Bugs"
last_name: "Bunny"
email_id: "bugs.b@carrot.co"
phone_no: "100-100-1000"
phone_no: "100-100-1005"
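The data size of 59 bytes can be checked by hand. The sketch below is plain Python without the protobuf package, and its helper names are hypothetical. It assumes that the four fields use the tags 1 through 4 (an assumption, since the schema listing is not reproduced here; any tags from 1 to 15 give the same size). Each string field is written as one key byte, a length, and the UTF-8 bytes of the value:

```python
# Illustrative only: a hand-rolled encoder for string fields.

def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_string_field(field_tag: int, text: str) -> bytes:
    """Key (wire type 2), payload length, then the UTF-8 payload."""
    payload = text.encode("utf-8")
    key = encode_varint((field_tag << 3) | 2)
    return key + encode_varint(len(payload)) + payload


def encode_customer(first_name, last_name, email_id, phone_no) -> bytes:
    # Field tags 1..4 are assumed for this sketch.
    out = bytearray()
    for tag, text in ((1, first_name), (2, last_name), (3, email_id)):
        if text:  # proto3 does not write empty strings
            out += encode_string_field(tag, text)
    for number in phone_no:  # a repeated field is written once per element
        out += encode_string_field(4, number)
    return bytes(out)


customer_bytes = encode_customer(
    "Bugs", "Bunny", "bugs.b@carrot.co", ["100-100-1000", "100-100-1005"]
)
print(len(customer_bytes))  # 59
```

The total breaks down as 6 + 7 + 18 + 14 + 14 bytes: two bytes of overhead per field plus the length of each string.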
Now, switching gears to the Python binding, compile the Customer.proto file using the following command:
$ protoc --python_out=. ./Customer.proto
The compilation will generate a Python file called Customer_pb2.py in the specified directory, which is the current directory.
Let us create a Python script called CustomerTest.py to use the generated script in the current directory as shown below:
To construct a Customer object, one must first import the generated module (called Customer_pb2) and invoke the empty constructor. One can then set the field values in the object like any regular Python object.
Executing the above Python script CustomerTest.py produces the following results as shown in Output.2 below:
Output.2
Customer fields: [(<google.protobuf.pyext._message.FieldDescriptor object at 0x7fe687eca290>, 'Bugs'), (<google.protobuf.pyext._message.FieldDescriptor object at 0x7fe687eca2b0>, 'Bunny'), (<google.protobuf.pyext._message.FieldDescriptor object at 0x7fe687eca2d0>, 'bugs.b@carrot.co'), (<google.protobuf.pyext._message.FieldDescriptor object at 0x7fe687eca2f0>, ['100-100-1000', '100-100-1005'])]
Customer data size: 59
Customer:
first_name: "Bugs"
last_name: "Bunny"
email_id: "bugs.b@carrot.co"
phone_no: "100-100-1000"
phone_no: "100-100-1005"
Customer deserialized:
first_name: "Bugs"
last_name: "Bunny"
email_id: "bugs.b@carrot.co"
phone_no: "100-100-1000"
phone_no: "100-100-1005"
Now, we will demonstrate the ability to both serialize and deserialize an object Account with complex types using protobuf.
The following is the schema definition for Customer and Account objects defined in the file CustomerAccount.proto located in the directory src/main/proto as shown below:
Notice the use of the option keyword with java_outer_classname to force the name of the outer class generated by the protobuf compiler.
To define a fixed set of named constants, we use the enum keyword. In this example AccountType is defined as an enum with the constants CA_UNKNOWN, CA_SAVINGS, CA_CHECKING, and CA_BROKERAGE. In proto3, the first constant of every enum *MUST* map to zero, because 0 serves as the numeric default value of an enum field.
A field in a message can refer to other message types. In this example, one of the fields in the Account type references the Customer type.
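The role of the zero constant can be sketched in plain Python (without the protobuf package). In proto3, an enum field holding 0 is not written to the encoded data at all, so a reader that finds no value falls back to the zero constant. The field tag 2 for acct_type is an assumption made for this sketch, only two of the constants are modelled, and the numeric value 2 for CA_BROKERAGE is taken from Output.4:

```python
# Illustrative only: limited to tags below 16 and enum values below 128.
from enum import IntEnum


class AccountType(IntEnum):
    CA_UNKNOWN = 0    # the zero constant doubles as the default
    CA_BROKERAGE = 2  # numeric value as printed in Output.4


ACCT_TYPE_TAG = 2  # assumed field tag for acct_type


def encode_acct_type(value: AccountType) -> bytes:
    """Write key and value, or nothing when the value is the default."""
    if value == 0:
        return b""
    key = (ACCT_TYPE_TAG << 3) | 0  # wire type 0 = varint
    return bytes([key, int(value)])


def decode_acct_type(data: bytes) -> AccountType:
    """Read the value back, falling back to the zero constant."""
    if not data:
        return AccountType(0)
    return AccountType(data[1])


print(encode_acct_type(AccountType.CA_UNKNOWN))    # b''
print(encode_acct_type(AccountType.CA_BROKERAGE))  # b'\x10\x02'
print(decode_acct_type(b"").name)                  # CA_UNKNOWN
```

Because an absent field and an explicit zero look the same to the reader, it is a common convention to reserve the zero constant for an "unknown" or "unspecified" meaning, as CA_UNKNOWN does here.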
Let us now compile the CustomerAccount.proto file for Java and Python.
For Java binding, run Maven compile. This will generate a Java file called CustomerAccount.java in the directory target/generated-sources/main/java/com/polarsparc/protobuf3.
Let us create a Java program called CustomerAccountTest.java to use the generated class(es) in the directory src/main/java/com/polarsparc/protobuf3 as shown below:
Executing the above Java program CustomerAccountTest.java produces the following results as shown in Output.3 below:
Output.3
Account fields: {com.polarsparc.protobuf3.Account.acct_no=12345, com.polarsparc.protobuf3.Account.acct_type=CA_BROKERAGE, com.polarsparc.protobuf3.Account.customer=first_name: "Bugs" last_name: "Bunny" email_id: "bugs.b@carrot.co" phone_no: "100-100-1000" phone_no: "100-100-1005" }
Account data size: 70
Account:
acct_no: "12345"
acct_type: CA_BROKERAGE
customer {
  first_name: "Bugs"
  last_name: "Bunny"
  email_id: "bugs.b@carrot.co"
  phone_no: "100-100-1000"
  phone_no: "100-100-1005"
}
Account deserialized:
acct_no: "12345"
acct_type: CA_BROKERAGE
customer {
  first_name: "Bugs"
  last_name: "Bunny"
  email_id: "bugs.b@carrot.co"
  phone_no: "100-100-1000"
  phone_no: "100-100-1005"
}
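The data size of 70 bytes follows from how an embedded message is encoded: as a length-delimited field whose payload is the encoded inner message. The sketch below is plain Python without the protobuf package. Its helper names are hypothetical, and the field tags (1 to 4 in Customer; 1, 2, and 3 for acct_no, acct_type, and customer in Account) are assumptions, as is the numeric value 2 for acct_type, which is taken from the Python output for the same object:

```python
# Illustrative only: rebuilds the Account bytes by hand.

def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def length_delimited(field_tag: int, payload: bytes) -> bytes:
    """Strings, bytes, and embedded messages all use wire type 2."""
    return encode_varint((field_tag << 3) | 2) + encode_varint(len(payload)) + payload


def varint_field(field_tag: int, value: int) -> bytes:
    """Enums and integer types use wire type 0."""
    return encode_varint((field_tag << 3) | 0) + encode_varint(value)


# Inner message first (assumed tags 1..4).
customer = b"".join([
    length_delimited(1, b"Bugs"),
    length_delimited(2, b"Bunny"),
    length_delimited(3, b"bugs.b@carrot.co"),
    length_delimited(4, b"100-100-1000"),
    length_delimited(4, b"100-100-1005"),
])

# Outer message: the Customer bytes become the payload of field 3.
account = b"".join([
    length_delimited(1, b"12345"),  # acct_no is a string in the output above
    varint_field(2, 2),             # acct_type
    length_delimited(3, customer),
])

print(len(customer), len(account))  # 59 70
```

The outer message adds only two bytes (key and length) around the 59 bytes of the inner Customer, plus 7 bytes for acct_no and 2 bytes for acct_type.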
Now, switching gears to the Python binding, compile the CustomerAccount.proto file using the following command:
$ protoc --python_out=. ./CustomerAccount.proto
The compilation will generate a Python file called CustomerAccount_pb2.py in the specified directory, which is the current directory.
Let us create a Python script called CustomerAccountTest.py to use the generated script in the current directory as shown below:
Notice how the fields of the Customer object within the Account object are set in Python.
Executing the above Python script CustomerAccountTest.py produces the following results as shown in Output.4 below:
Output.4
Account fields: [(<google.protobuf.pyext._message.FieldDescriptor object at 0x7f19b6f9de10>, '12345'), (<google.protobuf.pyext._message.FieldDescriptor object at 0x7f19b6f9de30>, 2), (<google.protobuf.pyext._message.FieldDescriptor object at 0x7f19b6f9de50>, first_name: "Bugs" last_name: "Bunny" email_id: "bugs.b@carrot.co" phone_no: "100-100-1000" phone_no: "100-100-1005" )]
Account data size: 70
Account:
acct_no: "12345"
acct_type: CA_BROKERAGE
customer {
  first_name: "Bugs"
  last_name: "Bunny"
  email_id: "bugs.b@carrot.co"
  phone_no: "100-100-1000"
  phone_no: "100-100-1005"
}
Account deserialized:
acct_no: "12345"
acct_type: CA_BROKERAGE
customer {
  first_name: "Bugs"
  last_name: "Bunny"
  email_id: "bugs.b@carrot.co"
  phone_no: "100-100-1000"
  phone_no: "100-100-1005"
}
In the above example, we had all the object definitions in a single CustomerAccount.proto schema definition file.
We could modularize the object definitions and separate them into two .proto files - one for the Customer related object(s) and the other for the Account related object(s).
The following is the schema definition for the Customer2 related object(s) defined in the file Customer2.proto located in the directory src/main/proto as shown below:
Notice the use of the option keyword with java_multiple_files which causes top-level messages, enums, etc to be defined at the package level, rather than inside an outer class file.
And, here is the schema definition for the Account2 related object(s) defined in the file Account2.proto located in the directory src/main/proto as shown below:
In the above Account2.proto file, we import the Customer2.proto file.
Let us now compile both the Customer2.proto and the Account2.proto files for Java and Python.
For Java binding, run Maven compile. This will generate a Java file for each object type defined in the schema file(s) in the directory target/generated-sources/main/java/com/polarsparc/protobuf3.
Let us create a Java program called CustomerAccountTest2.java to use the generated classes in the directory src/main/java/com/polarsparc/protobuf3 as shown below:
Executing the above Java program CustomerAccountTest2.java produces the following results as shown in Output.5 below:
Output.5
Account fields: {com.polarsparc.protobuf3.Account2.acct_no=12345, com.polarsparc.protobuf3.Account2.acct_type=AT_SAVINGS, com.polarsparc.protobuf3.Account2.customer=first_name: "Bugs" last_name: "Bunny" email_id: "bugs.b@looney.us" phone_no { number: "100-100-1000" type: PT_MOBILE } phone_no { number: "100-100-1005" type: PT_WORK } }
Account data size: 78
Account:
acct_no: "12345"
acct_type: AT_SAVINGS
customer {
  first_name: "Bugs"
  last_name: "Bunny"
  email_id: "bugs.b@looney.us"
  phone_no {
    number: "100-100-1000"
    type: PT_MOBILE
  }
  phone_no {
    number: "100-100-1005"
    type: PT_WORK
  }
}
Account deserialized:
acct_no: "12345"
acct_type: AT_SAVINGS
customer {
  first_name: "Bugs"
  last_name: "Bunny"
  email_id: "bugs.b@looney.us"
  phone_no {
    number: "100-100-1000"
    type: PT_MOBILE
  }
  phone_no {
    number: "100-100-1005"
    type: PT_WORK
  }
}
Now, switching gears to the Python binding, compile both the Customer2.proto and Account2.proto files using the following commands:
$ protoc --python_out=. ./Customer2.proto
$ protoc --python_out=. ./Account2.proto
The compilation will generate two Python files called Customer2_pb2.py and Account2_pb2.py in the specified directory, which is the current directory.
Let us create a Python script called CustomerAccountTest2.py to use the generated script in the current directory as shown below:
Notice how the fields of the PhoneNumber2 object within the Customer2 object inside the Account2 object are set in Python.
Executing the above Python script CustomerAccountTest2.py produces the following results as shown in Output.6 below:
Output.6
Account fields: [(<google.protobuf.pyext._message.FieldDescriptor object at 0x7fa590626570>, '12345'), (<google.protobuf.pyext._message.FieldDescriptor object at 0x7fa590626230>, 1), (<google.protobuf.pyext._message.FieldDescriptor object at 0x7fa590626330>, first_name: "Bugs" last_name: "Bunny" email_id: "bugs.bunny@looney.us" phone_no { number: "100-100-1000" type: PT_MOBILE } phone_no { number: "100-100-1005" type: PT_WORK } )]
Account data size: 82
Account:
acct_no: "12345"
acct_type: AT_SAVINGS
customer {
  first_name: "Bugs"
  last_name: "Bunny"
  email_id: "bugs.bunny@looney.us"
  phone_no {
    number: "100-100-1000"
    type: PT_MOBILE
  }
  phone_no {
    number: "100-100-1005"
    type: PT_WORK
  }
}
Account deserialized:
acct_no: "12345"
acct_type: AT_SAVINGS
customer {
  first_name: "Bugs"
  last_name: "Bunny"
  email_id: "bugs.bunny@looney.us"
  phone_no {
    number: "100-100-1000"
    type: PT_MOBILE
  }
  phone_no {
    number: "100-100-1005"
    type: PT_WORK
  }
}
We will now demonstrate the ability to serialize an instance of a Customer object to a file using Python and then deserialize the same Customer instance from the file using Java.
Let us create a Python script called SerializeCustomerTest.py to serialize an instance of the Customer object to a file called /tmp/customer.bin as shown below:
Executing the above Python script SerializeCustomerTest.py produces the following results as shown in Output.7 below:
Output.7
Customer fields: [(<google.protobuf.pyext._message.FieldDescriptor object at 0x7fe8fc223290>, 'Wile E'), (<google.protobuf.pyext._message.FieldDescriptor object at 0x7fe8fc2232b0>, 'Coyote'), (<google.protobuf.pyext._message.FieldDescriptor object at 0x7fe8fc2232d0>, ['200-101-2001', '201-102-2002'])]
Customer data size: 44
Customer:
first_name: "Wile E"
last_name: "Coyote"
phone_no: "200-101-2001"
phone_no: "201-102-2002"
Customer object serialized to /tmp/customer.bin
Notice that we have not set a value for the email_id field in the Customer object.
Let us create a Java program called DeserializeCustomerTest.java in the directory src/test/com/polarsparc/protobuf3 as shown below:
Executing the above Java program DeserializeCustomerTest.java produces the following results as shown in Output.8 below:
Output.8
Customer deserialized:
first_name: "Wile E"
last_name: "Coyote"
phone_no: "200-101-2001"
phone_no: "201-102-2002"
When an object is deserialized, if the encoded message does not contain a particular field, the corresponding field in the parsed object is set to the default value for its type. The default is false for type bool, empty bytes for type bytes, the first defined enum value (which must be zero) for type enum, zero (0) for numeric types (such as int32, int64, etc), and the empty string for type string. This is why the email_id field, which was never set, does not appear in the deserialized output above.
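The default-value behavior can be sketched with a small hand-written decoder in plain Python (without the protobuf package). It assumes the field tags 1 through 4 for first_name, last_name, email_id, and phone_no, and it handles only length-delimited string fields. The input bytes are a hand-built equivalent of the 44-byte message, not a copy of the actual /tmp/customer.bin file:

```python
# Illustrative only: a minimal reader for a message made of string fields.

def decode_varint(data: bytes, pos: int) -> tuple:
    """Return (value, next position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


FIELD_NAMES = {1: "first_name", 2: "last_name", 3: "email_id", 4: "phone_no"}  # assumed tags


def decode_customer(data: bytes) -> dict:
    # Start from the defaults; only fields present in the data overwrite them.
    customer = {"first_name": "", "last_name": "", "email_id": "", "phone_no": []}
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        field_tag, wire_type = key >> 3, key & 0x07
        if wire_type != 2:
            raise ValueError("this sketch only handles length-delimited fields")
        length, pos = decode_varint(data, pos)
        text = data[pos:pos + length].decode("utf-8")
        pos += length
        name = FIELD_NAMES.get(field_tag)
        if name is None:
            continue  # unknown tag: skip it, which is what allows schema evolution
        if name == "phone_no":
            customer[name].append(text)
        else:
            customer[name] = text
    return customer


wire = (
    b"\x0a\x06Wile E"
    b"\x12\x06Coyote"
    b"\x22\x0c200-101-2001"
    b"\x22\x0c201-102-2002"
)
decoded = decode_customer(wire)
print(len(wire))                  # 44
print(repr(decoded["email_id"]))  # ''
print(decoded["phone_no"])        # ['200-101-2001', '201-102-2002']
```

Since no field with tag 3 appears in the bytes, email_id keeps its default of the empty string.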
This concludes the demonstration of Google Protocol Buffers (a.k.a. protobuf) in both Java and Python.