Writing Mixed Data With MATLAB
Today I found myself learning more about MATLAB. It wasn’t a very pleasant experience, especially given my experience with Python. I needed a way to first store data about ballistic missile trajectories and then write it to a CSV file for later use. The data consisted of a mix of string and numeric types. I thought this would be easy enough. Instead, I now know why most of the CSV files I’ve seen output from MATLAB only contain numerical data.
The first part of the battle was trying to find the appropriate data structure for my mixed-type data. I’m used to Python’s lists, which allow you to put nearly any object in them. In MATLAB the most common data structure is an array (or matrix, hence the name MATLAB). It took me a little while to stumble across cell arrays, which can contain mixed data. Unfortunately, the file output functions included in MATLAB do not support cell arrays; they only support arrays of numerical data.
Why these functions do not support mixed data is beyond me. It wasn’t that hard to implement a function that supported cell arrays. Granted, I had a few moments of minor frustration with the MATLAB language.
My first attempt saw me merrily chugging along, converting every cell of my cell array to strings, until I realized that none of the MATLAB functions are designed to work on sequences of data. In Python, many functions take a sequence like this: function(sequence). In MATLAB, you have to specify every item in the sequence like so: function(item1, item2, ...). Thus, my first implementation had a switch-case tree to deal with different numbers of columns in cell arrays. I’m embarrassed to even mention it. Even more embarrassing, I now know how to de-sequence things in MATLAB, i.e. C{:} gives C{1}, C{2}, ...
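For comparison, Python's closest counterpart to C{:} is the * operator, which expands a sequence into separate positional arguments. Here is a minimal sketch; join_fields and the sample row are hypothetical, made up purely for illustration:

```python
def join_fields(*fields):
    """Join any number of positional arguments into one CSV line."""
    return ','.join(str(field) for field in fields)


row = ['alpha', 300, 0.95]  # hypothetical mixed-type row

# Spelling out every item by hand, as a switch-case tree would have to:
explicit = join_fields(row[0], row[1], row[2])

# Expanding the sequence instead, analogous to C{:} in MATLAB:
unpacked = join_fields(*row)
```

Both calls produce the same line, but only the second one keeps working when the number of columns changes.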
I figured, “What the hell, I might as well leave it for now.” But I ran the function and it was damn slow. So I figured I was going to have to change the way I concatenated the columns together and I might as well push that back into the for loops to deal with varying numbers of columns. What you see below is the result.
function cellwrite(filename, cellarray)
    % Write a cell array of mixed string/numeric data to a CSV file.
    [rows, cols] = size(cellarray);
    fid = fopen(filename, 'w');
    for i_row = 1:rows
        file_line = '';
        for i_col = 1:cols
            contents = cellarray{i_row, i_col};
            % Convert numbers to strings; empty cells become empty fields.
            if isnumeric(contents)
                contents = num2str(contents);
            elseif isempty(contents)
                contents = '';
            end
            % Add a comma after every field except the last one in the row.
            if i_col < cols
                file_line = [file_line, contents, ','];
            else
                file_line = [file_line, contents];
            end
        end
        count = fprintf(fid, '%s\n', file_line);
    end
    st = fclose(fid);
The function changes every cell to a string and then adds it to a line accumulator with commas between the cell values. Yes, continually adding strings together is a bad idea. However, in MATLAB, it turns out that using bracket concatenation is about twice as fast as calling strcat repeatedly (i.e. file_line = strcat(file_line, contents, ',')). This line accumulator is then written to a file.
Unfortunately, this new function wasn’t any faster than my original implementation. But it does have the advantage of supporting more than 15 columns and looking much nicer.
On a whim, I decided to see how my function compared to the built-in MATLAB function CSVWRITE. I made an array of random numbers with 1000 rows and 100 columns, made a cell array copy, and fed the copy through my CELLWRITE function. It took an awful 67.6 seconds to return. Apparently, that’s not bad by MATLAB standards: CSVWRITE took 67.7 seconds to write the numerical array to disk. Sixty-seven seconds is an awfully long time. I get bored with waiting after about five seconds.
So I spent some time at home tonight and rewrote the function in Python. Below is the first revision, which is fairly similar to the MATLAB version, except that it uses the string join method to put the strings together. That’s faster and nicer looking.
def cellwrite(filename, cellarray):
    rows, cols = len(cellarray), len(cellarray[0])
    fid = file(filename, 'w')
    for i_row in range(rows):
        cells = []
        for i_col in range(cols):
            contents = cellarray[i_row][i_col]
            if isinstance(contents, (int, float, long)):
                contents = str(contents)
            cells.append(contents)
        file_line = ','.join(cells)
        file_line = ''.join([file_line, '\n'])
        fid.write(file_line)
    fid.close()
That ran through a similar data set in <3.5 seconds. Yes, my work and home computers are different. But they don’t vary by an order of magnitude.
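A timing like this can be reproduced with a small harness. The sketch below is not the script behind the numbers above; make_cells and time_writer are hypothetical helpers, and time.perf_counter assumes Python 3:

```python
import random
import time


def make_cells(rows=1000, cols=100):
    """Build a rows x cols list of lists filled with random floats."""
    return [[random.random() for _ in range(cols)] for _ in range(rows)]


def time_writer(writer, filename, cellarray):
    """Return the wall-clock seconds that writer(filename, cellarray) takes."""
    start = time.perf_counter()
    writer(filename, cellarray)
    return time.perf_counter() - start
```

Passing a writer function, a file name, and make_cells() to time_writer returns the elapsed time in seconds for a 1000 by 100 data set.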
To top them all, the more Pythonic version below ran in <0.4 seconds!
def cellwrite2(filename, cellarray):
    fid = file(filename, 'w')
    lines = [','.join(map(str, row))+'\n' for row in cellarray]
    fid.writelines(lines)
    fid.close()
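Both listings above are Python 2 code (file and long no longer exist in Python 3). As a hedged sketch of how the same idea might look in Python 3, the standard csv module can do the string conversion and the joining. cellwrite3 is a hypothetical name, and this version was not part of the timings above:

```python
import csv


def cellwrite3(filename, cellarray):
    """Write a list of rows (mixed strings and numbers) to a CSV file."""
    # newline='' lets the csv module control the line endings itself.
    with open(filename, 'w', newline='') as fid:
        # csv.writer converts non-string values with str() and quotes
        # any field that contains a comma.
        csv.writer(fid).writerows(cellarray)
```

Unlike the hand-rolled versions, this one quotes fields that contain commas, and by default it ends each line with '\r\n' rather than '\n'.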
So why does MATLAB suck so much? A few profiler runs indicate that the number-to-string conversion function, num2str, is damn slow. Why? I don’t know. Maybe I’ll look into it more. Maybe I’ve done enough already.