
Two New Functions for Converting Datatypes and Changing Byte Order

By Jeff Mather, MathWorks


Two new functions in MATLAB 7.1 (R14SP3) significantly simplify working with numeric datatypes at the byte level. The typecast function converts vector datatypes without changing the underlying bytes, and swapbytes changes endianness. In this article, we discuss the basics of byte manipulation and use these functions in sample applications. In one of the applications, a DPX file format parser, we show many of the steps involved in creating your own platform-independent file reader.

Changing the endianness of data written on one platform to give the right values when read on another is a common programming task, especially when performing file I/O. Similarly, when performing I/O on structured data, you can often improve performance by reading all of the data as a contiguous block of raw bytes regardless of datatype, and then parsing out the appropriate MATLAB datatypes. The typecast and swapbytes functions simplify these operations.

Endianness and Casts

MATLAB supports many numeric datatypes, such as 8-, 16-, 32-, and 64-bit signed and unsigned integers. It also supports 32-bit single- and 64-bit double-precision IEEE floating-point numbers. Each numeric type occupies one to eight bytes (a byte is eight bits). For example, a UINT8 value occupies one byte, while a double-precision floating-point number fills eight bytes. For multibyte datatypes, the same number can be laid out in memory in different ways, depending on the hardware architecture handling it. Intel machines are so-called “little-endian” architectures because they store the least-significant byte first in memory (that is, at the lowest memory address). On the other hand, most RISC processors are “big-endian” architectures that store the bytes of multibyte types in the opposite order.

Consider how the value 1037 (which is 256*4 + 13) looks in MATLAB running on the architectures below when stored in a UINT16 datatype:

Little-endian (Intel and AMD)

13 4
00001101 00000100

Big-endian (RISC)

4 13
00000100 00001101
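
The byte values in the diagram can be reproduced with ordinary arithmetic, and the third output of the computer function reports which layout the current machine uses. The following is a minimal sketch; the variable names are illustrative and do not come from the DPX parser.

```matlab
value = 1037;

% Split the 16-bit value into its two bytes arithmetically.
lowByte  = mod(value, 256);      % least-significant byte: 13
highByte = floor(value / 256);   % most-significant byte: 4

% Ask MATLAB which byte order the current machine uses.
% The third output of COMPUTER is 'L' (little-endian) or 'B' (big-endian).
[platform, maxElements, endian] = computer;

if endian == 'L'
    inMemory = [lowByte highByte];   % stored as 13 4
else
    inMemory = [highByte lowByte];   % stored as 4 13
end
```

The vector inMemory lists the bytes in the order in which they appear at increasing memory addresses on the machine running the code.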

When passed a numeric vector and a string containing the output type, typecast returns another numeric vector of the desired type. The number of elements in the output changes whenever the input and output types have different sizes, because the underlying bytes are regrouped to form the output type. The swapbytes function changes the endianness of any numeric array passed to it. To illustrate the different byte patterns and a mechanism for converting one endianness to the other, we use the typecast function to convert the UINT16 value 1037 to two UINT8 values, and then use the swapbytes function to swap the bytes in the UINT16 value.

a = uint16(1037)
b = typecast(a, 'uint8')
c = swapbytes(a)
d = typecast(c, 'uint8')

When run on an Intel Pentium 4 machine, which is little-endian, the result is

a =
   1037

b =
   13    4

c =
   3332

d =
    4   13

But on a big-endian Sun SPARC machine, the same code produces the following output:

a =
   1037

b =
    4   13

c =
   3332

d =
   13    4

Notice that the values stored in b and d have the same byte representations as shown in the diagram above, though the order of the byte elements that represent 1037 depends on the endianness of the machine. Also note that the result of byte swapping is not machine-dependent.
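
As a quick check on these observations, the sketch below shows that typecast is reversible and that reversing the bytes by hand matches swapbytes on either kind of machine. The variable names are illustrative only.

```matlab
a = uint16(1037);

% Break the value into its bytes in this machine's native order.
bytes = typecast(a, 'uint8');

% Reinterpreting the same bytes as UINT16 recovers the original value.
restored = typecast(bytes, 'uint16');              % 1037

% Reversing the two bytes by hand has the same effect as swapbytes.
reversed = typecast(bytes(end:-1:1), 'uint16');    % 3332

% Swapping twice returns the original value.
twice = swapbytes(swapbytes(a));                   % 1037
```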

Reinterpret Casts in C and C++

Compiled languages that have pointer datatypes, such as C and C++, usually provide mechanisms such as the cast operator for reinterpreting the data that a value points at; otherwise, the compiler would be unable to work with arrays and other aggregates of data. In the following C code fragment, a sequence of 32-bit signed values is treated as a sequence of 16-bit unsigned “words.”

/* C example */
INT32_T data32[] = {-352, -1, 0, 255, 70000};
UINT16_T *data16;

data16 = (UINT16_T *) data32;

When treated as a 16-bit datatype on a little-endian platform, the original five 32-bit values in the data32 array become the ten values {65184, 65535, 65535, 65535, 0, 0, 255, 0, 4464, 1}. The actual bit patterns for the block of memory shared by data16 and data32 are unchanged; only the interpretation of these bits is different, which results in different values. In C++, this is known as a reinterpret cast. In MATLAB, the typecast function reinterprets the values for you.

% MATLAB example
data32 = int32([-352, -1, 0, 255, 70000]);
data16 = typecast(data32, 'uint16')

data16 =
  65184  65535  65535  65535      0      0    255      0   4464      1
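
It may help to contrast this reinterpretation with an ordinary value conversion, which changes the bytes and keeps the element count. The sketch below uses the same data32 values; the other variable names are illustrative.

```matlab
data32 = int32([-352, -1, 0, 255, 70000]);

% Value conversion: each element is converted separately, and values
% outside the UINT16 range saturate at 0 or 65535.
converted = uint16(data32);                  % 0 0 0 255 65535

% Reinterpretation: the 20 underlying bytes are regrouped two at a time.
reinterpreted = typecast(data32, 'uint16');

% Five 4-byte elements become ten 2-byte elements.
countRatio = numel(reinterpreted) / numel(data32);   % 2

% Casting back recovers the original values exactly.
roundTrip = typecast(reinterpreted, 'int32');
```

Because typecast only regroups bytes, the total byte count of the input must be a whole multiple of the output element size.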

Other Uses for Swapbytes and Typecast

A typical use of typecast in conjunction with swapbytes is improving the performance of file I/O, both when an application makes many file accesses and when it parses data streams that are embedded in files or returned from a nonfile source, such as a serial port. The typecast function gives the raw bytes their proper structure, and swapbytes accommodates platform differences.
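
The sketch below illustrates the idea on a made-up six-byte packet. The packet layout (a big-endian UINT16 followed by a big-endian UINT32) and the field names are assumptions chosen for illustration, not a real protocol.

```matlab
% Hypothetical 6-byte packet in big-endian order:
% a UINT16 sample count followed by a UINT32 timestamp.
packet = uint8([4 13 0 0 1 0]);

% Swap only if this machine's byte order differs from the packet's.
[platform, maxElements, endian] = computer;
needsSwap = (endian == 'L');

% Give structure to the raw bytes.
sampleCount = typecast(packet(1:2), 'uint16');
timestamp   = typecast(packet(3:6), 'uint32');

% Accommodate the platform difference.
if needsSwap
    sampleCount = swapbytes(sampleCount);
    timestamp   = swapbytes(timestamp);
end
% sampleCount is 1037 and timestamp is 256 on either kind of machine.
```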

When writing your own image file readers, you may need to swap the pixel bytes of high bit-depth images for the images to display correctly. The following images show 16-bit pixel data from a DICOM file that has the correct byte ordering (Figure 1) and from the same file read with the wrong endianness (Figure 2).

Figure 1. A correctly displayed computed radiography image. The values in this 16-bit image range from 0 to 979. Image courtesy of Mallinckrodt Institute of Radiology.

Figure 2. A 16-bit computed radiography image with the wrong endianness. The pixel values range from 0 to 65,282, and there are unexpected discontinuities of contrast in the image. Image courtesy of Mallinckrodt Institute of Radiology.
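
A few hand-picked values show why the misread image looks discontinuous. The pixel values below are hypothetical; they only fall within the 0 to 979 range quoted for Figure 1.

```matlab
% Hypothetical 16-bit pixel values within the range of Figure 1.
pixels = uint16([0 255 256 767 979]);

% Simulate interpreting the same bytes with the wrong endianness.
misread = swapbytes(pixels);     % 0 65280 1 65282 54019

% Applying swapbytes again restores the correct values.
repaired = swapbytes(misread);
```

The neighboring intensities 255 and 256 end up at 65280 and 1, which is the kind of abrupt contrast jump visible in Figure 2. The value 767 maps to 65282, which is consistent with the maximum quoted in the Figure 2 caption.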

Application Example: Using Swapbytes and Typecast to Read DPX Files

Let’s create a simple file reader that imports data from Digital Moving-Picture Exchange (DPX) files. DPX is an ANSI/SMPTE standard for communicating film and television images and associated metadata. Each DPX file contains one high bit-depth frame which is part of a larger collection of digital cinema images. For images from film, the pixel values represent the density of a scanned negative, and the metadata translates those values to a more readily usable RGB or YCbCr colorspace.

The standard defines the metadata values as data of a given type at specific offsets in the file. The image pixels described by the metadata can contain a variety of bit depths and color models. Because the DPX standard defines the metadata by fixed-length values at known positions, creating a DPX parser is fairly straightforward. You can download the DPX parser described in this article from MATLAB Central.

When manipulating DPX files, the program follows these steps:

  1. Read all the data from a file into a UINT8 buffer.
  2. Verify that the file is a valid DPX file.
  3. Extract pieces of metadata from the buffer.
  4. Use the metadata to interpret the image pixel data.

The function that creates the UINT8 buffer uses one call to fread to pull in all the file’s data. Other sections of the function will use typecast and swapbytes to extract particular data values.

function buffer = getBufferFromFile(filename)

fid = fopen(filename, 'r');
if (fid == -1)
    error('Could not open file %s for reading.', filename);
end

% Read the whole file into a UINT8 buffer and then tease out the parts as needed.
buffer = fread(fid, inf, 'uint8=>uint8');
fclose(fid);

The first four bytes serve the dual purpose of signaling that the file is actually a DPX file and determining whether byte swapping is necessary. We compare an expected “magic number” with the first four bytes of our buffer after converting to UINT32. If our magic number is valid, we use the result of the comparison to set an anonymous function that we will use throughout the rest of the reader.

expectedMagicNumber = uint32(sscanf('53445058', '%x'));
fileMagicNumber = typecast(buffer(1:4), 'uint32');

if isequal(fileMagicNumber, expectedMagicNumber)
    tf = true;
    swapFcn = @(x) x;
elseif isequal(swapbytes(fileMagicNumber), expectedMagicNumber)
    tf = true;
    swapFcn = @(x) swapbytes(x);
else
    tf = false;
    swapFcn = [];
end

If the magic numbers are equal, no swapping is necessary; otherwise we will use swapbytes.
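
To see the check work without a DPX file at hand, the sketch below applies the same comparison to two synthetic four-byte headers, one in each byte order. The headers and the needsSwap variable are made up for illustration.

```matlab
expectedMagicNumber = uint32(hex2dec('53445058'));

% Two synthetic headers holding the same magic number in opposite byte orders.
headers = {uint8('SDPX'), uint8('XPDS')};
needsSwap = false(1, 2);

for k = 1:2
    fileMagicNumber = typecast(headers{k}, 'uint32');
    if isequal(fileMagicNumber, expectedMagicNumber)
        needsSwap(k) = false;     % file order matches this machine
    elseif isequal(swapbytes(fileMagicNumber), expectedMagicNumber)
        needsSwap(k) = true;      % file order is the opposite
    else
        error('Not a DPX magic number.');
    end
end
```

On any machine, exactly one of the two headers requires swapping; which one depends on the endianness of the machine running the code.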

To extract a particular metadata value, we find the bytes in the UINT8 buffer that correspond to the data (their location is given by the “offset” value in the DPX standard) and call typecast to transform them into the appropriate type. Our function encapsulates the offsets, datatypes, number of metadata values, and descriptive names in a few cell arrays. For example:

dataStructure = {...
    1408,   1, 'uint32', 'XOffset'
    1412,   1, 'uint32', 'YOffset'
    1416,   1, 'single', 'XCenter'
    1420,   1, 'single', 'YCenter'
    1424,   1, 'uint32', 'XOriginalSize'
    1428,   1, 'uint32', 'YOriginalSize'
    1432, 100, 'char',   'SourceImageFilename'
    1532,  24, 'char',   'SourceImageCreationTime'
    1556,  32, 'char',   'InputDeviceName'
    1588,  32, 'char',   'InputDeviceSerialNumber'
    1620,   4, 'uint16', 'BorderValidity'
    1628,   2, 'uint32', 'PixelAspectRatio'
    1636,  28, 'uint8',  'Reserved'};

The getData subfunction called from parseDataStructures contains the logic to read the unstructured raw data and then extract from it the data values for this data structure. The input variable “buffer” is the whole UINT8 array (that is, the whole file), and “fieldDetails” contains one row from the data structure.

function data = getData(buffer, fieldDetails)
% Find the offset and extent to this data element in the buffer.
% The data structure offsets are 0-based, and MATLAB is 1-based.
dataStart = fieldDetails{1} + 1;
datatype = fieldDetails{3};
dataEnd = fieldDetails{1} + fieldDetails{2} * sizeof(datatype);

rawData = buffer(dataStart:dataEnd);

% ASCII character data in the file is 8-bit and is not converted.
if (strcmp(datatype, 'char'))
    data = char(rawData)';
else
    swapFcn = getSwapFcn;
    data = swapFcn(typecast(rawData, datatype));
end
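
The same table-driven approach can be tried on a small synthetic buffer. Everything in the sketch below is an assumption made for illustration: the three-field layout, the field names, and the bytesPerElement lookup, which stands in for the parser's sizeof helper. The buffer is built in native byte order, so no swapping is needed.

```matlab
% Hypothetical layout: offset (0-based), count, datatype, name.
fields = {...
    0, 1, 'uint32', 'Width'
    4, 2, 'uint16', 'Border'
    8, 4, 'char',   'Name'};

% Stand-in for a sizeof helper: bytes per element of each type.
bytesPerElement = struct('uint8', 1, 'char', 1, ...
                         'uint16', 2, 'uint32', 4, 'single', 4);

% Build a 12-byte column buffer in this machine's native byte order.
buffer = [typecast(uint32(1920), 'uint8'), ...
          typecast(uint16([7 9]), 'uint8'), ...
          uint8('test')]';

info = struct();
for k = 1:size(fields, 1)
    datatype = fields{k, 3};
    first = fields{k, 1} + 1;   % convert 0-based offset to 1-based index
    last  = fields{k, 1} + fields{k, 2} * bytesPerElement.(datatype);
    rawData = buffer(first:last);

    if strcmp(datatype, 'char')
        value = char(rawData)';              % 8-bit text, no conversion
    else
        value = typecast(rawData, datatype); % regroup bytes into the type
    end
    info.(fields{k, 4}) = value;
end
```

After the loop, info.Width is 1920, info.Border holds 7 and 9, and info.Name is 'test'.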

Each of these data values is stored in a structure and assigned a field name taken from the “dataStructure” cell array. These values are used in the readPixels subfunction to convert the correct bytes in the buffer to image pixels.

% Determine sizes and sample locations.
rows = details.ImageDetails.Rows;
columns = details.ImageDetails.Columns;
bitDepth = details.ImageDetails.ImageElementDetailsParsed(1).BitSize;

startOfPixels = details.FileDetails.ImageDataOffset;
endOfPixels = startOfPixels + ...
    double(rows * columns) * numChannels * double(bitDepth)/8;

% Convert the buffer to an array of output pixels.
switch bitDepth
    case 8
        pixels = buffer((startOfPixels + 1):endOfPixels);
    case 16
        swapFcn = getSwapFcn;
        pixels = typecast(buffer((startOfPixels + 1):endOfPixels), 'uint16');
        pixels = swapFcn(pixels);
    otherwise
        error('Unsupported bit-depth.');
end

% Rearrange the data to follow MATLAB's conventions.
pixels = reshape(pixels, [numChannels, columns, rows]);
pixels = permute(pixels, [3 2 1]); 
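
The final reshape and permute are easier to follow on a tiny example. The sketch below assumes the layout implied by the code above: channel values are interleaved for each pixel, and pixels are stored row by row. The 2-by-2 image size and the data values are made up.

```matlab
numChannels = 3;
columns = 2;
rows = 2;

% Twelve samples in file order: R G B of pixel (1,1), then (1,2), (2,1), (2,2).
pixels = uint8(1:12);

% Channel varies fastest, then column, then row.
pixels = reshape(pixels, [numChannels, columns, rows]);

% Reorder the dimensions to rows-by-columns-by-channels.
img = permute(pixels, [3 2 1]);

topLeft    = squeeze(img(1, 1, :))';   % 1 2 3
bottomLeft = squeeze(img(2, 1, :))';   % 7 8 9
```

Reshaping directly to [rows, columns, numChannels] would not work here, because MATLAB fills arrays column by column while the file stores the channel values of each pixel together.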

To use this function, simply pass it the name of a DPX file:

X = readdpx('flowers-1920x1080-RGB-8.dpx');
figure; image(X)
axis equal
axis tight
set(gca, 'xtick', [], 'ytick', [])

Figure 3. Image read from flowers-1920x1080-RGB-8.dpx. Image courtesy of GraphicsMagick Group.

Two aspects of memory management should be noted when using typecast and swapbytes to extract smaller parts of large data buffers. First, loading very large buffers can lead to out-of-memory errors. A typical 1900-by-1000 RGB DPX file usually requires less than 10 megabytes of memory, but for very large files we could use the memmapfile function to work on a virtual array of UINT8 data without importing the whole buffer at once.
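
A minimal sketch of the memory-mapped approach follows. It writes a small temporary file to stand in for a large DPX file, so the file contents and variable names are illustrative only.

```matlab
% Create a small stand-in file containing the bytes 0 through 255.
filename = tempname;
fid = fopen(filename, 'w');
fwrite(fid, uint8(0:255), 'uint8');
fclose(fid);

% Map the file as a virtual UINT8 array without reading it all into memory.
m = memmapfile(filename, 'Format', 'uint8');

% Only the bytes that are indexed are actually brought into MATLAB.
header = m.Data(1:4);
headerValue = typecast(header, 'uint32');   % value depends on endianness

% Release the mapping before deleting the file.
clear m
delete(filename);
```

The extracted header can then be passed through the same typecast and swapbytes steps as a buffer read with fread.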

Second, avoid modifying the UINT8 buffer after reading it. When an unmodified buffer is passed as an argument to a subfunction, MATLAB passes it by reference and uses no extra memory. If the subfunction modifies the buffer, however, MATLAB makes a temporary copy, which could waste memory.

Application Example: Creating Bytestreams for MD5 Message Digests

Another use for these functions is converting data for use with functions that require a particular type of input, such as Java functions requiring byte arrays (signed 8-bit arrays). One such application is message digesting, which is a routine security task that computes a hash value for an input data stream. Comparing the message digest of data read from one source with the message digest from a trusted source will reveal whether the byte patterns of the two sources match.

For example, let’s run the MD5 message digest algorithm on a typical MATLAB matrix and compute the 128-bit fingerprint. In practice, you might run the digest algorithm on a buffer of data read from a file or returned by a query using the Database Toolbox. In these examples, the data may require conversion to an 8-bit type without changing the underlying bytes. Java provides a number of message digest algorithms as part of its java.security API, which we can call directly from MATLAB:

m = magic(5); % m is 5-by-5 here.
md = java.security.MessageDigest.getInstance('md5');
bytes = typecast(m(:), 'int8'); % bytes is now 200-by-1.
md.update(bytes)
digest = md.digest;
sprintf('%02x', typecast(digest, 'uint8'))

On a big-endian machine, the result is “880c982ecb23432181e503a80abbabae”, while on a little-endian machine it is “054394beff14b8a21216ce553034db14”.
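
If the same fingerprint is needed on every platform, one option is to put the data into a fixed byte order before casting it to bytes. The sketch below is a variation on the example above, not part of the original code; it assumes a Java virtual machine is available and arbitrarily picks little-endian as the common order.

```matlab
m = magic(5);

% Normalize to little-endian byte order before extracting the bytes.
[platform, maxElements, endian] = computer;
if endian == 'B'
    m = swapbytes(m);
end

md = java.security.MessageDigest.getInstance('md5');
bytes = typecast(m(:), 'int8');    % 200-by-1, now in little-endian order
md.update(bytes)
digest = md.digest;
fingerprint = sprintf('%02x', typecast(digest, 'uint8'));
```

With this change, a big-endian machine feeds the digest the same byte sequence as a little-endian one, so both should report the little-endian fingerprint.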

Conclusion

We have seen how to use the new typecast and swapbytes functions to manipulate data at the byte level, which is especially useful in writing external data interfaces and portable applications. There are many other specialized uses for these byte-oriented functions, such as data compression. In run-length encoding, long “runs” of repeated values are compressed to smaller byte patterns. Converting multibyte data to UINT8 values by using typecast is the first step in this process. The bytes can also be rearranged before compression, resulting in compression ratios higher than what is possible without the help of byte-level operations. Data serialization (creating output representations of objects, for example) is another possible use for these two functions.
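
As a rough illustration of the byte-rearrangement idea, the sketch below groups the bytes of a UINT16 vector by byte position and counts how many runs a run-length encoder would see before and after. The data and variable names are made up, and counting runs is only a proxy for the compressed size.

```matlab
data = uint16(1000:1009);

% Interleaved bytes, as they sit in memory (two per element).
bytes = typecast(data, 'uint8');            % 1-by-20

% Group the bytes by position: all first bytes, then all second bytes.
planes = reshape(bytes, 2, []);             % one row per byte position
shuffled = reshape(planes', 1, []);         % 1-by-20

% Count runs of repeated values before and after rearranging.
runsBefore = 1 + sum(diff(double(bytes)) ~= 0);      % 20
runsAfter  = 1 + sum(diff(double(shuffled)) ~= 0);   % 11

% The rearrangement is reversible, so no information is lost.
restored = reshape(reshape(shuffled, [], 2)', 1, []);
recovered = typecast(restored, 'uint16');
```

All ten values share the same high byte, so grouping the bytes by position turns ten isolated copies of that byte into a single run.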

Whether the data comes from an external source (as in the DPX parser), is being prepared for disk output, requires translation from a memory-mapped file to another in-memory representation, or just needs a simple type conversion, typecast and swapbytes can help. When combined with the other functions for converting types (such as the class, uint8, and double functions) and bitwise operations (bitget and bitand, for example), MATLAB provides a powerful platform for manipulating numeric datatypes.

Published 2006