Recently, I had an opportunity to work on a very interesting prototype using Apache Avro and Apache Kafka. For those of you who haven’t worked with it yet, Avro is a data serialization system that allows for rich data structures and promises easy integration with many languages. Avro requires a schema to define the data being serialized; in other words, metadata about the data that is being serialized. If it helps, think of the Avro schema as being akin to an XSD document for XML.
Avro does, in fact, have a C# library and code gen tools for generating POCOs from Avro schema files. Unfortunately, not a whole lot of documentation exists for either. It took quite a bit of trial and error to get my serialization logic nailed down. Hopefully this post will help others get started using Avro a lot more easily than I was able to.
For the purpose of illustration, I’ve set up a fairly simplistic console app that will create an Avro serialized file stream. After creating the solution in Visual Studio, we start off by pulling in the Avro libraries. Fortunately, nuget.org does have NuGet packages for Avro. The Avro package contains the core libraries and the Avro Tools package contains the code gen utility.
Avro Schemas & Code generation
The first step towards getting the serialization to work is to define the schema for the objects that I would like to serialize. In my hypothetical example, I’d like to define a schema for capturing errors as they occur in a web application and serializing those to a Kafka-based system. (We’ll focus on the Avro part for now, and leave the Kafka bits for later.)
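For readers following along without the original listing, a schema of this shape might look roughly like the sketch below. The namespace, the field names `appName` and `details`, and the type name `ErrorDetails` are my own placeholders for illustration, not necessarily the names used in the actual solution.

```json
{
  "type": "record",
  "namespace": "Example.Errors",
  "name": "Error",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "appName", "type": "string" },
    { "name": "details", "type": "ErrorDetails" }
  ]
}
```

Note that `details` refers to `ErrorDetails` by name only, so this file parses on its own only if the parser already knows that type.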
As you can see, I have three fields in my record - an id, the name of the application that generated the error and a complex type called details. The description for my complex type looks like this.
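As a rough sketch, the complex type could be described in its own schema file along these lines. The type name `ErrorDetails` and the fields `message` and `stackTrace` are assumptions made for illustration.

```json
{
  "type": "record",
  "namespace": "Example.Errors",
  "name": "ErrorDetails",
  "fields": [
    { "name": "message", "type": "string" },
    { "name": "stackTrace", "type": "string" }
  ]
}
```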
The next step would be to generate the C# code using these schemas. This, unfortunately, is where we enter completely undocumented territory. Assuming you’ve added the Avro Tools package to your solution, the code gen utility (avrogen.exe) will exist inside the packages\Apache.Avro.Tools.18.104.22.168\lib folder. I tried a number of different ways to get the code generation to work across multiple schema files, but did not have much success.
In the end, I had to copy avrogen.exe, Avro.dll (from the Avro package lib directory) and Newtonsoft.Json.dll into a folder along with the avsc file to get this to work. Additionally, I had to merge the two schema types into a single file. A bit of a cop-out, I’ll admit, and one of these days I plan to get back to figuring out if there is a better way to do this.
This is what my merged schema file looked like:
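One way to merge two record types into a single avsc file is to define the nested record inline, at the point where the outer record first uses it. The sketch below does that with placeholder names (`Example.Errors`, `appName`, `ErrorDetails`, `message`, `stackTrace`); treat it as an illustration of the technique rather than the exact file from the solution.

```json
{
  "type": "record",
  "namespace": "Example.Errors",
  "name": "Error",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "appName", "type": "string" },
    {
      "name": "details",
      "type": {
        "type": "record",
        "name": "ErrorDetails",
        "fields": [
          { "name": "message", "type": "string" },
          { "name": "stackTrace", "type": "string" }
        ]
      }
    }
  ]
}
```

Because the inner record inherits the enclosing namespace, both types end up in `Example.Errors`, and a single file carries everything the parser needs.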
Once I had all this squared away, the actual code generation part came down to a single command:
avrogen.exe -s Error-Merged.avsc .
This generates two .cs files that I then just pulled into my solution.
Avro Serialization to disk
This was another area where there really wasn’t a whole lot of good sample code to explain the use of the library. I ended up looking at usage of the Java library to figure this out.
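The general shape of writing an Avro container file from C# can be sketched as follows. To keep the example self-contained it uses `GenericRecord` instead of the generated POCOs, and the schema, field names, sample values and the `errors.avro` file name are all placeholders I made up. With the avrogen-generated classes, I believe the equivalent would be a `SpecificDatumWriter<Error>` in place of the generic writer.

```csharp
using System;
using System.IO;
using Avro;
using Avro.File;
using Avro.Generic;

class Program
{
    // Hypothetical merged schema; all names are placeholders for illustration.
    const string SchemaJson = @"{
      ""type"": ""record"", ""namespace"": ""Example.Errors"", ""name"": ""Error"",
      ""fields"": [
        { ""name"": ""id"", ""type"": ""string"" },
        { ""name"": ""appName"", ""type"": ""string"" },
        { ""name"": ""details"", ""type"": {
            ""type"": ""record"", ""name"": ""ErrorDetails"",
            ""fields"": [
              { ""name"": ""message"", ""type"": ""string"" },
              { ""name"": ""stackTrace"", ""type"": ""string"" }
            ] } }
      ]
    }";

    static void Main()
    {
        // Parse the schema and pull out the nested record's schema by field name.
        var errorSchema = (RecordSchema)Schema.Parse(SchemaJson);
        var detailsSchema = (RecordSchema)errorSchema["details"].Schema;

        // Build the nested record first, then the outer record that contains it.
        var details = new GenericRecord(detailsSchema);
        details.Add("message", "Object reference not set to an instance of an object.");
        details.Add("stackTrace", "at Example.Web.HomeController.Index()");

        var error = new GenericRecord(errorSchema);
        error.Add("id", Guid.NewGuid().ToString());
        error.Add("appName", "ExampleWebApp");
        error.Add("details", details);

        // The datum writer encodes individual records; the file writer wraps it
        // and produces the Avro container format (header, schema, data blocks).
        var datumWriter = new GenericDatumWriter<GenericRecord>(errorSchema);
        using (var stream = new FileStream("errors.avro", FileMode.Create))
        using (var fileWriter = DataFileWriter<GenericRecord>.OpenWriter(datumWriter, stream))
        {
            fileWriter.Append(error);
            fileWriter.Flush();
        }
    }
}
```

Since `OpenWriter` accepts any `Stream`, the same pattern should carry over to a `MemoryStream` or a network stream when the data needs to go somewhere other than disk.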
Build and run this code to get the serialized data written to disk. While this may not seem like much, consider that once the Avro serialization is taken care of, the data can be streamed not only to the file system but across the wire as well.
Hopefully, this post helps someone get a head start on using Avro on the .NET platform. For anyone who’s interested, the full solution is available here. Please feel free to fork and add more useful bits to the code.
I should point out that I, myself, am very new to Avro and am still learning the nuances that go with the framework. If you have a helpful hint or tip, please do leave a comment.