HDFtools
I am an atmospheric scientist who uses numerical models as research tools. My current interests are the processes which go on inside of severe thunderstorms. As I am interested in very fine-scale features occurring within these thunderstorms, I run my models at very high resolution. As a result, these models produce a lot of data - up to terabytes per simulation.
These simulations require significant computer resources and are run on computer architraves with multiple processors (either shared memory or distributed memory). A typical simulation may involve on the order of over 100 processors, each operating on its own subset of the entire model domain (the chunk of atmosphere being modeled).
These models I use were written with MPI (Message Passing Interface) for handling communication between nodes. A node may be one of a collection of computers, each with one processor, all hooked together with a fast switch (a so-called distributed memory architecture), or a single CPU on a multiple-CPU machine (a so-called shared memory architecture). Nodes must exchange data as the model runs. This must be explicitly coded, and increases the complexity of the model; however, the reward is that you can "divide and conquer" very large problems just by throwing a bunch of nodes at a problem.
Numerical models will periodically dump their current state to disk for analysis/post-processing/visualization (these are called history files). For very large simulations, writing history files can take up a significant fraction of the total time it takes to run the model. In fact, minimizing the amount of time my models take to write their history files was the motivation behind HDFtools.
Because of the distributed nature of these MPI models, each node only contains its own piece of the total atmosphere being modeled.
Typically, atmospheric models are decomposed such that each node holds the same amount of data, and its own unique piece of the model domain. This is often accomplished by having each node contain a subset of the horizontal extent of the 3D atmosphere and the entire vertical extent of the atmosphere. You could think of each node as containing a rectangular cube where the cubes all fit neatly together. A simple 9 node decomposition could break the atmosphere into a 3x3 cubes as such:
In this case we are looking at the horizontal decomposition. It is understood that each node contains all vertical points. The single numbers are node numbers (what is returned from the MPI_COMM_RANK command), (starting at 0 as is done with MPI) and the numbers in parenthesis indicate how the decomposition was done in the horizontal (x and y directions), starting at 1 as is the FORTRAN norm (acknowledging that most models are written in FORTRAN).
Writing history files can be done in one of two ways:
Most models opt for option the single-file option which has the advantage that you only have one file per time level to deal with. However, this method typically is slow due to the way the way history file data is "funneled" through one node. My own experience is that with very large runs, the slowdown using this method is a show stopper. This motivated me to develop HDFtools.
The distributed file method is typically much faster than single-file for writing data, and this is especially the case with high-bandwidth shared file systems, or where each node writes to its own local file system. However, the disadvantage of DF is that you now have your physical model domain split up into a collection of separate files, which can make post-processing more complicated.
HDFtools is a collection of code which reads in data written using the distributed-file method as easily (if not easier) as if it were written in the single-file method. In addition, HDFtools allows for compression (lossy and lossless) of history file data which can reduce the amount of data written tremendously. Included with HDFtools are also conversion programs to other scientific data formats including netCDF and vis5d. It is called HDFtools because I use the HDF v4 data format for storing model data, and couldn't think of a better name. HDF is a flexible, well supported, free data format written specifically for scientific applications. It is robust, allowing for multiple variables, metadata, etc. to be written to a single file. The API included with HDFv4 makes reading and writing data very easy.
HDFtoools makes it easy to access any subset of the distributed files with a simple call. Let's say we wanted to read the subdomain indicated by the dashed lines into memory:
All you need to do is indicate the Cartesian locations of the endpoints of the cube that you wish to read in (with respect to the entire domain), and HDFtools will take care of the rest.
The core of HDFtools is a layer which sits on top of HDF which allows you read in data from a model which writes data in the DF method. From a FORTRAN code, a series of call to the HDFtools routines might look as follows:
call first_hdf_index(basename,t0,firsthdf)
call get_hdf_metadata(basename,firsthdf,t0,nx,ny,nz,nodex,nodey)
call read_hdf_mult(u,basename,itime,"Uinterp",X0,Y0,X1,Y1,Z0,Z1,nx,ny,nz,nodex,nodey)
The first call is optional, but useful. It looks for hdf files at time t0 in the directory/basename supplied by the string basename, and stops when it finds one. This call exists because it is not always desirable to have the entire domain's worth of files locally. One of the big advantages to splitting the domain across many files is that typically only a small subset of the entire domain is of interest. By locally storing only the files where interesting things are occurring, you can drastically reduce the amount of space required on your hard drive.
The seond call gets metadata from one of the hdf files. Domain-wide metadata is stored in each HDF file, meaning metadata in any HDF file may be used to stitch requested files together for graphing, visualization, etc.
The third call does the heavy lifting. This call takes the variable name, subcube indices and metadata and returns a floating point buffer containing the requested data. Only the HDF files containing requested data are accessed. You can then read the buffer into, say, a 3D array if you wish, or access the buffer data directly.
HDFtools is written in C and is easily called from both C and FORTRAN. Errors are trapped in a graceful manner, providing human-readible output when errors (such as trying to access non-existent files, variables, etc.) occur. Example code in both C and FORTRAN is included in the distribution.
You might be interested in HDFtools if either of the following applies:
My philosophy when it comes to dealing with model data is something like this: Big model runs require lots of hardware resources. In many cases, you write a proposal to get machine time, and you want to use that time as efficiently as possible. I want my model to spend most of its time crunching numbers, not writing history data. But, I want it to write enough data such that I can do interesting post-processing (visualization, animation, tracer analysis, etc.). Hence, I want to write data in frequent intervals. I'd hate to re-run a simulation with identical initialization just so I can write data files over more frequent intervals - this is wasted machine time.
In addition, for post-processing, full floating-point precision is often not necessary. This is certainly true for visualization (contour plots, isosurfaces, etc.). Hence, we can save on disk space and I/O time by doing lossy compression. The native HDFtools data format takes 4 byte floats and scales them, writing the data as 2 byte integers. Further saving on file size is achieved by turning on gzip compression which is built into HDF. Writing data this way can save one order of magnitude in total history file size for typical atmospheric simulations. For example, in a weather simulation, a variable may exist which represents the amount of cloud at a location. In a mostly cloud-free part of the domain, all of the floating point values will be 0.0. By turning these zero values into integers (0) and applying gzip compression you will obtain tremendous compression. (Try gzipping a file where all of the data repeats if you don't believe me!).
If you are using a numerical model which writes out history data in the SF method and you are not having performance problems, it may not be worth your time to modify your model to use HDFtools. If you are using a model which In order to use HDFtools, you will need to modify your model such that it complies to the format that the HDFtools software understands.
In order to use HDFtools, you must adapt your MPI model to write files in the way that the software expects. This is less painful than it might sound. In short, you must
baseTTTTT_NNNN.hdf
base: any descriptive character string you want
TTTTT: Time with leading zeroes
NNN: Node number (starting with 0) with leading zeroes
Here is a listing of a history file dump at 3600 seconds from a model
run with nine nodes:
cobalt:run% ls -l supercell*3600*hdf -rw-r--r-- 1 orf soh 1107927 Sep 11 13:32 supercell03600_0000.hdf -rw-r--r-- 1 orf soh 1765583 Sep 11 13:32 supercell03600_0001.hdf -rw-r--r-- 1 orf soh 1291205 Sep 11 13:32 supercell03600_0002.hdf -rw-r--r-- 1 orf soh 1534868 Sep 11 13:32 supercell03600_0003.hdf -rw-r--r-- 1 orf soh 2680186 Sep 11 13:32 supercell03600_0004.hdf -rw-r--r-- 1 orf soh 1561301 Sep 11 13:32 supercell03600_0005.hdf -rw-r--r-- 1 orf soh 1034291 Sep 11 13:32 supercell03600_0006.hdf -rw-r--r-- 1 orf soh 1275036 Sep 11 13:32 supercell03600_0007.hdf -rw-r--r-- 1 orf soh 909112 Sep 11 13:32 supercell03600_0008.hdfEach hdf file contains the 3D floating point data which defines the model state (for my supercell run, these include things like wind, temperature, and cloud data), as well as what I will call metadata, information which describes the location of the file relative to the entire model domain.
Coming back to our example decomposition, let's assume the entire model domain spans 300 points in x and y (nx=ny=300) and nz=100. Each node holds a chunk of the domain, and each node holds exactly the same sized chunk.
Using C notation (arrays begin with index 0), node 0 would contain 0-99 in x and 0-99 in y, node 1 would contain 100-199 in x and 0 to 99 in y, etc. In the case of node 1, we save x0 = 100, xf = 199, y0 = 0 and yf = 99 in supercell03600_0001.hdf. We also save myi = 2 and myj = 1, indicating the numbers in parenthesis.
With the above metadata, we can retrieve any subset of the entire model domain.
Using the hdp tool which comes with any hdfv4 distribution, we can peek inside one of the hdf files:
cobalt:run% hdp dumpsds supercell03600_0003.hdf|grep ^Var Variable Name = time Variable Name = dx Variable Name = dy Variable Name = x0 Variable Name = xf Variable Name = y0 Variable Name = yf Variable Name = myi Variable Name = myj Variable Name = numi Variable Name = numj Variable Name = nodex Variable Name = nodey Variable Name = nx Variable Name = ny Variable Name = nz Variable Name = sfcrain Variable Name = maxsws Variable Name = U Variable Name = V Variable Name = W Variable Name = Th Variable Name = P Variable Name = Tke Variable Name = qv Variable Name = qc Variable Name = qr Variable Name = qi Variable Name = qs Variable Name = qgThe variables which are critical to the operation of HDFtools are x0,xf,y0,yf,myi,myj. Variables from U to qg are 3D model prognostic variables which you may select for reading in.
No on both counts. So long as you store the required data (x0,xf,y0,yf,myi,myj) you could simply provide your own I/O code. I really haven't done anything groundbreaking here - I've just found an approach that works best for me and thought it might be of use to to other modelers. I've often heard modelers say "it's too much of a pain to deal with history files that are split up" - and decided to make it easier.
As far as writing goes, if you were writing all of your data simultaneously to a slow networked file system, you might not see a big performance increase. But even with a standard Linux cluster and a 100 Mbit switch where all nodes are writing to a shared NFS drive, I have found this method is faster. You will most likely see the biggest performance increases when you are using high performance hardware on a shared file system or, in the case of a distrubuted memory cluster, having your nodes write to local disk. In the latter case you'll have to collect your HDF files to a single directory when you wish to use HDFtools.
One of the nice things about this approach is that if you're reading a subset of the entire domain, you only access files which contain some of that subset. For those of you familiar with chunking in HDF, you can think of this as superchunking (apologies to the band ;). HDF allows you to organize data in HDF files such that accessing data toward the end of the file is faster (this is chunking). Without chunking, accessing data towards the end of a very big HDF file can take a lot of time due the way the HDF routines seek through HDF files. By splitting the file up, you have explicitly "chunked" your domain into separate files. Of course there is nothing stopping you from turning chunking on in your subdomain HDF files.