Distinguish between ASCII and Binary

Tero on 5 Nov 2020
Edited: DGM on 9 Oct 2025 at 13:46
What would be an elegant way to distinguish an ASCII file from a binary one? Specifically, I'm working with STL files, which can be either, and I need a way to separate the two.
Thanks,
Tero
DGM on 29 Sep 2025 at 4:55
Edited: DGM on 5 Oct 2025 at 9:48
It is true. It's not even that uncommon. I'm pretty sure there are some encoders which violate this rule as a matter of procedure (SolidWorks maybe?). I know Magics starts some headers with ";SOLID", but I don't know if the semicolon is consistently present. Even if it weren't for the default behavior of certain software, I have some files where the user's object description just incidentally starts with the word "solid".
If you can't rely on anyone being prevented from abusing this magic word over the last 38 years, then you can't rely on it having unambiguous meaning.
DGM on 5 Oct 2025 at 8:18
Edited: DGM on 9 Oct 2025 at 13:46
Somebody cleaned up a prior comment to the effect of "ASCII files start with a 'solid' keyword, and binary file headers should not start with the word 'solid' so that file encoding can be identified".
The comment wasn't wrong. I'm fairly sure that was the intent, or at least the prevailing interpretation of the intent has been similar. If nothing more, it was at least sensible.
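For reference, the keyword convention described above boils down to something like the following. This is only a minimal sketch of the naive test, not anything taken from a specification; `startsWithSolid` and the two sample headers are hypothetical names and data made up for illustration.

```matlab
% Hypothetical helper: true if the first five bytes spell 'solid'.
% Case-insensitive on purpose, since keyword case is not reliably enforced.
startsWithSolid = @(bytes) numel(bytes) >= 5 && strcmpi(char(bytes(1:5)), 'solid');

% Two in-memory headers stand in for real files (made-up sample data).
asciiHead  = uint8('solid cube');                                   % typical ASCII opening
binaryHead = uint8(['solid made by some exporter', zeros(1, 53)]);  % 80-byte binary header

asciiLooksAscii  = startsWithSolid(asciiHead);   % keyword present
binaryLooksAscii = startsWithSolid(binaryHead);  % keyword present here too
```

Both headers pass the test, which is exactly why the keyword alone can't settle the encoding.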
I've spent way too much time trying to find primary sources regarding the history of the STL format documentation. I have copies of early drafts of the SLA-1 software manual and the interface specification from 1987. These are available because they're appendices to the patent application filed prior to commercial release. The canonical spec document should be the 1989 interface specification. I have not seen any trace of the 1989 document, and something tells me that the only people who have either (somehow) bought a copy or worked for corporate customers and were under NDA. My impressions are based on the 1987 documents. I have a lot of room to be wrong, but I'm no stranger to embarrassing myself.
From the wikipedia page and other descriptions, we can gather that the interface specification isn't exactly thorough. That's an understatement. In the 1987 documents (both cases) the descriptions of STLA and STLB each consist of a simple template file with one or two paragraphs worth of description. Not only is there a lot left to imagination, there are some curiosities that seem to defy modern descriptions of the format. ... Actually hold on ...
I had to do some splicing to reassemble it into one image, but that's from the interface specification. The software manual is pretty much the same.
Note the lack of assertive specificity. Familiar restrictions such as keyword case, the meaningfulness of the 'solid' keyword, and the importance of spaces between 'facet normal' and the absence of spaces in 'endsolid' aren't actually mentioned. As a consequence, history provides us with encoders which violate these conventions which seem to have little basis in the documentation.
As to the usage of the 'solid' keyword for encoding disambiguation, it's obviously not mentioned here. There may be other contemporary sources (e.g. Marshall Burns?) which popularized this suggestion, but I haven't found an actual root example.
On the other hand, commonly described ambiguities such as the lack of a restriction to triangular meshes aren't really substantiated either. In fact, the text describes "triangles" specifically, and there appears to be no possible means to unambiguously encode anything other than a predetermined number of vertices (i.e. 3) in STLB. Why is anyone imagining that support for other (e.g. tetrahedral) meshes was ever implied? I was under the impression that there was latitude on facet length, but the more I look, the more certain I am that nobody has ever done anything but repeat the unsubstantiated suggestion.
For all that's missing, there are some things in this document that we wouldn't expect if we had read all the available nth-hand summaries of the topic. First, STLA was essentially deprecated before the commercial release. It's not mentioned in this copypasta (I think it's in the software manual), but the earliest prototype versions of the slicer only supported STLA (hence the rationalization that it might be a useful crutch during development). In other words, STLA was a developmental format, and STLB exists to replace it. It was already supported and recommended for production code by 1987, yet here we are using it for apparently no good reason nearly 40 years later. I admit that I felt a little bit vindicated to see that.
As an aside, I find it hard to believe that STLA makes development anything but more difficult. To me, the suggestion that STLA might still be useful for development purposes sounds a lot more like the software contractor rationalizing the amount of time (money) they wasted developing software around a format that they eventually abandoned.
More interestingly, there are two aspects of both STLA and STLB which are described here in a way that I've never seen clearly mentioned anywhere else:
  • The STLA description includes an arbitrary number of 'attribute' entries per facet.
  • The STLB description of the trailing uint16 field is not "facet attributes" but "number of attributes" or "attribute count".
This offers a consistency that actually makes more sense than most modern descriptions. In the rest of the document, "attributes" appear to be region-specific directives for the slicer and CAM processor, and this appears to be a reserved (and likely never used) method for storing them in a per-facet manner in the geometry file. It's unclear how they're to be stored or labelled in STLA (e.g. whether an attribute entry is strictly scalar, or how the decoder knows which attribute you're setting with the given value, or whether these entries are packed fields), but it would make sense that there would be a variable number of these entries per facet. Accordingly, it makes sense that the trailing word of an STLB facet block would be the number of appended attribute entries; it would not directly contain attribute data. This suggests a few hypothetical conflicts:
  • An STLA with attribute data entries will break some modern decoders.
  • An STLB with variable-length facet blocks will break many (if not all) modern decoders.
  • A modern STLB with proprietary color data packed in the attribute count field would break any hypothetical decoder which adheres to the implied functionality in this document.
I think it's justifiable to argue that this was the implied intent, but does it carry any weight? It's clearly stated that the feature wasn't implemented at the time, and it likely never was. Again, this is an older document, but assuming that little additional clarification was provided in the subsequent years, it leaves the picture quite murky.
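If you want to see what a given binary file actually stores in that trailing field, something like the following would do. It's a rough sketch that assumes the conventional modern layout (80-byte header, little-endian uint32 facet count, fixed 50-byte facet blocks), which is an assumption on my part and not something the 1987 text guarantees. The one-facet temporary file exists only to keep the example self-contained.

```matlab
% Build a one-facet binary STL in a temporary file (made-up test data).
fname = [tempname, '.stl'];
fid = fopen(fname, 'w', 'ieee-le');
fwrite(fid, zeros(1, 80, 'uint8'), 'uint8');           % 80-byte header
fwrite(fid, 1, 'uint32');                              % facet count
fwrite(fid, [0 0 1, 0 0 0, 1 0 0, 0 1 0], 'float32');  % normal + 3 vertices
fwrite(fid, 0, 'uint16');                              % trailing uint16 field
fclose(fid);

% Read the trailing uint16 of every facet, ASSUMING fixed 50-byte blocks.
fid = fopen(fname, 'r', 'ieee-le');
fseek(fid, 80, 'bof');                         % skip the header
nFacets = fread(fid, 1, 'uint32');
fseek(fid, 48, 'cof');                         % skip the 12 floats of the first facet
trailing = fread(fid, nFacets, 'uint16', 48);  % read 2 bytes, skip 48, repeat
fclose(fid);
delete(fname);

allZero = ~any(trailing);   % true when no facet stores anything in the field
```

A nonzero value here can't tell you whether it was meant as packed color data or as an attribute count. It only tells you that the file is doing something with the field.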
Now that I've cast (my) doubt on the effective authority of the canonical documents, I should note that there are at least two proprietary conventions for storing color data in an STL, both by abusing the facet attribute count field, and by embedding raw data in the header field. Appropriately for the mess we're in, these methods are not only both contradictory and programmatically indistinguishable; they also have no apparent publicly available documentation. The wikipedia page covers some of the details, but without any citations. Trying to get hold of anyone at the corresponding vendors or their waning user communities suggests that even first-party tech representatives don't know these features exist, let alone how they are to be presented. I have found zero primary documentation for the VisCAM/Magics/Fusion/Inventor extensions beyond the completely unsubstantiated wiki blurb for VisCAM/Magics. I only have a handful of varied example files. It's just an extra layer of uncertainty that seems so appropriate.
This is a long way of making the point, but I guess what I'm trying to say is that the more I look for clarification and solid references, the less I trust that there are any, and the less I trust that the extant references carry any weight. The impression I get is that STL started with a specification which was either poorly-documented and/or not broadly available. Everything we have today is the product of people filling in the blanks based on third-hand summaries of a vague description. The canon is defined less by the documentation than it is by the unknowable totality of everything that's been written in its impression. Edge cases are everywhere. The fact that STLA even still exists is a big chunk of the problem.


Accepted Answer

Bruno Luong on 5 Nov 2020
I don't know if it's elegant, but I just test whether any character code is > 255.
fid = fopen(stlfilename,'rt');
if fid > 0
    try
        % Read the whole file as text, one cell entry per line
        c = textscan(fid,'%s','delimiter','\n');
        fclose(fid);
    catch ME
        message = ME.message;
        h = errordlg(message);
        waitfor(h);
        OK = -2;
        return
    end
else
    OK = -2;
    message = 'Cannot open STL file';
    h = errordlg(message);
    waitfor(h);
    return
end
c = c{1};
c(cellfun(@isempty,c)) = []; % drop empty lines so max() returns a scalar for each entry
if max(cellfun(@max,c)) > 255
    % Binary
    ...
else
    % Ascii
    ...
end
Ameer Hamza on 5 Nov 2020
This test can produce false negatives. For example
fid = fopen('file.bin', 'w');
fwrite(fid, [65 66 67 68], 'uint8')
fclose(fid)
Test
fid = fopen('file.bin','rt');
c = textscan(fid,'%s','delimiter','\n');
fclose(fid);
c = c{1};
c(cellfun(@isempty,c)) = [];
Result
>> max(cellfun(@max,c)) > 255
ans =
logical
0
Bruno Luong on 5 Nov 2020
We are talking about STL files, which can be ASCII or binary, not about arbitrary binary files.


More Answers (2)

DGM on 25 Mar 2024
Edited: DGM on 29 Sep 2025 at 0:30
See also stlGetFormat() from stltools on the FEX, or the improved version included here:
FWIW, I tested this and the accepted answer on a list of 1000 STL files of different encodings, from various sources. In all cases, they produced matching results, but stlGetFormat() was faster.
  • for the accepted answer: 131.298121 seconds.
  • for stlGetFormat(): 0.204969 seconds.
That's a big difference, but nobody is trying to read a thousand STL files as fast as possible, so it doesn't really matter. stlGetFormat() uses a concise three point test, so it doesn't need to read the whole file. It's a balance of simplicity and adequacy, but still robust against a handful of edge cases. For example:
  • A binary file which has a header starting with 'solid'. These are not that uncommon.
  • An ASCII file which contains embedded raw data or multibyte encoded text in the header
  • ASCII files with more than 70B worth of header/footer text
  • ASCII files with invalid "end solid" footer keywords as created by certain defective encoders
It also makes an attempt to discern whether the file is even an STL at all. That is, it won't generally identify random binary files as being binary STL files. It's not perfect, but it's better than feeding random garbage to the decoder.
That said, it relies on some nearly universal assumptions which are technically not guaranteed. It presumes that the file describes a strictly triangular mesh, and it presumes that binary files have fixed-length facet blocks (i.e. that there are no trailing attribute entries beyond the attribute count field). There are plenty of ways a contrived file could break it, but given how poorly STL was defined and how broadly it has been abused, I'm pretty sure it's not possible to make STL tools which accommodate all possible presentations.
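To illustrate what those two assumptions buy you, here is a rough sketch of a size-consistency check. To be clear, this is not the stlGetFormat() implementation, just one test that becomes possible if you assume triangular facets and fixed 50-byte facet blocks after an 80-byte header and a little-endian uint32 facet count. The temporary file and its contents are made up for the demonstration.

```matlab
% Write a small binary STL whose header deliberately starts with 'solid'.
fname = [tempname, '.stl'];
fid = fopen(fname, 'w', 'ieee-le');
fwrite(fid, uint8(['solid but binary', zeros(1, 64)]), 'uint8');  % 80-byte header
fwrite(fid, 2, 'uint32');                                         % facet count
fwrite(fid, zeros(1, 100, 'uint8'), 'uint8');                     % two 50-byte facet blocks
fclose(fid);

% Compare the actual file size with the size implied by the facet count.
info = dir(fname);
fid = fopen(fname, 'r', 'ieee-le');
fseek(fid, 80, 'bof');
nFacets = fread(fid, 1, 'uint32');   % empty if the file is shorter than 84 bytes
fclose(fid);
delete(fname);

% 84 = 80-byte header + 4-byte count; 50 bytes per facet (ASSUMED fixed).
sizeMatches = ~isempty(nFacets) && info.bytes == 84 + 50*nFacets;
```

An ASCII file will almost never satisfy this by accident, so a match is decent evidence of a binary file even when the header starts with 'solid'. A file with trailing attribute entries would fail it, which is the limitation mentioned above.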

Walter Roberson on 29 Sep 2025 at 0:49
Edited: Walter Roberson on 29 Sep 2025 at 0:52
ASCII files will contain characters in the positions:
9 (tab), 10 (newline), 13 (carriage return), and 32 (space) to 126 (tilde). Occasionally they will contain 8 (backspace) and 11 (vertical tab, fairly rarely).
ASCII files may sometimes include 26 (eof). This is increasingly uncommon, but used to be common for CP/M and DOS and early MS Windows
Binary files may contain any of those characters, and also anything else in the 0 to 255 range.
filename = 'file.bin';
[fid, msg] = fopen(filename, 'r', 'n', 'ISO-8859-1');
if fid < 0; error('failed to open file "%s" because "%s"', filename, msg); end
bytes = fread(fid, '*uint8');
fclose(fid);
if all(ismember(bytes, [8:11, 13, 26, 32:126]))
    fprintf('likely ascii file\n');
else
    fprintf('likely binary file\n');
end
