Kaitai Struct and Scientific Data
tl;dr: kaitai struct is awesome.
File formats can be pretty annoying – especially when you figure them out through weird combinations of reverse-engineering, hand-me-down code and trial-and-error.
What we’ve ended up with in yt is a bunch of data formats where the process of conducting the IO is all mixed up with the description of that IO. This means that any attempt to update things (which I’ve alluded to in these blog posts) requires a fair bit of care to make sure that the process is not disruptive in any way.
In preparation for this, I set out to start writing down some of the file formats in a flat text format by going through my old notes and the code in the various yt io.py
files. Before too long, I decided to instead start utilizing
kaitai struct instead.
I’ve found Kaitai struct to be useful in other projects – for instance, I have been using it as a method to reverse engineer the data files from an old DOS game I used to play with my brother back in the early 90s. The combination of the simple (but reasonably flexible) syntax, the ruby-based visualizer and the WebIDE make it really good for exploratory reverse engineering.
And, it generates code! You can run:
$ ksc -t python somefile.ksy
and it’ll generate a python reader. It can also generate javascript, java, go, etc, and I think rust is in the works.
So the question became: could we use kaitai struct for documenting a data format like, say, the binary format from GADGET-2? After that we can really stretch our legs and try it on more complicated ones, but let’s start with this one.
Gadget-2 Data Format
The (vanilla) Gadget-2 data format is reasonably straightforward, and it’s even described in some detail in the Gadget manual. You have a header, then sets of records. Each of these sets of records has a separating value (sometimes!). According to table 3 in the user guide, the file looks something like:
- Header
- Position
- Velocity
- ID
- Mass (only for variable mass particles)
- Energy (only for gas particles)
- Density (Only for gas particles)
- Smoothing length (Only for gas particles)
- Potential (if enabled in makefile)
- Acceleration (if enabled in makefile)
- Endtime (if enabled in makefile)
- Timestep (if enabled in makefile)
Gadget-2 is probably the most widely-used astrophysics simulation code, but one of its defining characteristics is that it is quite readily hackable to include additional parameters, additional particle attributes, and lots more physics. This usually means that when you receive binary gadget data, you need to know what to expect in the file – the individual file formats are typically not self-describing.
The header is often left unchanged, at least in the data I’ve seen. In yt we have a simple specification system to allow for modifications to both the fields and the header, but the “vanilla” header looks something like this (in python struct notation):
default = (('Npart', 6, 'i'),
('Massarr', 6, 'd'),
('Time', 1, 'd'),
('Redshift', 1, 'd'),
('FlagSfr', 1, 'i'),
('FlagFeedback', 1, 'i'),
('Nall', 6, 'i'),
('FlagCooling', 1, 'i'),
('NumFiles', 1, 'i'),
('BoxSize', 1, 'd'),
('Omega0', 1, 'd'),
('OmegaLambda', 1, 'd'),
('HubbleParam', 1, 'd'),
('FlagAge', 1, 'i'),
('FlagMetals', 1, 'i'),
('NallHW', 6, 'i'),
('unused', 16, 'i'))
Before we get any further, let’s check if we can parse just this with kaitai struct.
Our First ksy File
In its simplest form, Kaitai specifications allow specifying the format of data and the order that different types of data will arrive in. The base data types it has are standard numerical data types, bytes, strings, arrays and so on.
Kaitai has a little bit of metadata we will add at the beginning, so we start with this preamble:
meta:
id: gadget_format1
endian: le
This gives it a name and notes it as little endian.
If we wanted to build a parser for the header, the most obvious thing we could do would be to parse the different items in order. The very first item is an array of 6 integers that describes the number of particles in the simulation, separated by each particle type. This comes in a sequence starting at the beginning of our “stream,” so we can use the seq
top-level type to start parsing:
seq:
- id: recsize_0
type: u4
- id: npart1
type: u4
- id: npart2
type: u4
- id: npart3
type: u4
- id: npart4
type: u4
- id: npart5
type: u4
- id: npart6
type: u4
recsize_0
here is because we know that our header will be preceded by a value indicating how many bytes are in it – this is a common practice, but not universal. And I’ve said that each of the variables (npart1
… npart6
) is of type u4
, which means “unsigned four byte integers.”
This isn’t that efficient, though – each particle type has its own variable. KS provides us with the option to repeat
which can take an expression. Now we can enable grabbing arrays of values, so we can use that, instead. Let’s change this to:
seq:
- id: recsize_0
type: u4
- id: npart
type: u4
repeat: expr
repeat-expr: 6
(You can also do repeat: eos
for end of stream and repeat-until
.) The result now is that npart
is an array – so later on when we loop, we can look it up by using the special variable _index
inside another repeat-expr
. We’ll look at that a bit later.
Data Types in ksy
Parsing the entire header in this way is certainly possible, but it also can be a bit clunky – and can be harder for maintainability. Instead of writing all of our header values out, let’s use a user-defined type.
The top-level key types
is where user-defined types are described; you can do some pretty fancy things like supply parameters to them, but we’ll just use it to hold the data we know.
types:
gadget_header:
seq:
...
Now, in our top-level seq
section, we can just reference the header type:
seq:
-id: header
type: gadget_header
Big Arrays and Lazy-reading
The tricky bit comes in when we get to our arrays of values. Kaitai has several different languages it can generate for – Python, C#, C++, ruby, and others. When reading lots of little items, python can get bogged down. For instance, this is the most obvious way of describing one of the position arrays:
-id: coordinates_part1
type: f4
repeat: expr
repeat-expr: header.npart[0] * 3
The generated python code will read a series of 4-byte objects, one at a time, and append them to a list. The combination of these things results in a lot of overhead from the python language runtime and the type system, so it ends up being rather slow.
One way to get around this is to use what’s called an instance
. This allows parsing to happen only when requested, and it can also exist outside of the rest of the parsing structure.
Important note: There are some fiddly issues with making instance
objects work and how they relate to substreams of IO which I’m going to sidestep here. Let’s just assume that it works, instead!
By default, we can read in bytes for these individual objects; this lets us read a big block of data (which we can deal with later – for instance, inside python!) and only parse into individual floats when we request. This might look something like defining a custom type with both a seq
and an instances
section:
types:
f4_array_type:
params:
- id: count
type: u4
seq:
- id: buffer
size: 4 * count
instances:
values:
pos: 0
size-eos: true
id: entries:
type: f4
repeat: eos
Here we are saying, just read the bytes. But, if anybody asks, this is what the attribute values
should look like – you should parse it starting at 0, go to the end of the stream, and assume it’s all f4
-typed items. (I am not certain that both size-eos
and repeat-eos
are necessary.)
With these components, we should be able to parse our entire gadget binary file, a
Working Implementation
I’ve gotten this to the point that it mostly works, and I’ve been committing to data-exp-lab/astro-data-formats. I ended up splitting the array stuff into its own parameterized type that I store in a separate file, so that we can potentially reuse it in other file formats.
Here’s some of the high-level stuff:
meta:
id: gadget_format1
endian: le
ks-opaque-types: true
imports:
- array_buffer
seq:
- id: gadget_header
type: header
- id: coordinates
type: particle_fields('f4', 3)
- id: velocities
type: particle_fields('f4', 3)
- id: particle_ids
type: particle_fields('u4', 1)
types:
header:
seq:
- id: recsize_0
type: u4
- id: npart
type: u4
repeat: expr
repeat-expr: 6
- id: massarr
type: f8
repeat: expr
repeat-expr: 6
[ ... ]
particle_fields:
params:
- id: field_type
type: str
- id: components
type: u1
seq:
- id: magic1
type: u4
- id: fields
type: field(_index, components, field_type)
repeat: expr
repeat-expr: 6
- id: magic2
type: u4
field:
params:
- id: index
type: u1
- id: components
type: u1
- id: field_type
type: str
seq:
- id: field
size: _root.gadget_header.npart[index] * components * 4
type: array_buffer(field_type)
And then the array_buffer
is an instance-based way of getting the raw bytes. Right now this is reasonably fast, and there’s a possibility I could simplify the structure and flatten it out a bit, but it works exactly as I want it to.
Future
The next thing I’m going to do is work on porting the RAMSES frontend to this format. That will enable some more exploration and validation of the different outputs, and hopefully use that as guidance for any future modifications to the IO systems. One of the stickier wickets I’ve seen is that it’s possible to use memory-mapping, but I think there are some subtle places where data is read into memory and copied – this kind of gets rid of the purpose of the memory mapping. I hope to explore and dig more deeply into this later.
Ultimately, I believe a combination of ksy files for binary formats and a ksy-like dialect for describing the connection between chunks of data and physical space locations will simplify a considerable amount of data analysis.
I’d also like to thank @GreyCat on gitter, who was super helpful in helping me work out some of the bits I wasn’t clear on.