Byte by byte: consuming files

by Tom Campbell

Computer files come in different flavors. Some have a predictable structure, like text files or files of fixed-length records, but most are unpredictable. This month we'll learn how to read files with an unknown composition (notably COM, EXE, and OBJ files), and we'll see how to pick out the text strings hiding inside.

But before we begin, let's step back and look at what files actually are. A file is anything stored on disk. This includes what you normally think of as data files, such as the WK2 files from a spreadsheet, DOC files from a word processor, or DBF files from a database. But it also includes DOS, contained in hidden files on your boot disk; COMMAND.COM, your command line interpreter; and programs like DBASE.EXE, WP.EXE, and XCOPY.EXE.

It's no accident that DOS stands for Disk Operating System. Many people become confused when they discover that the operating system itself is usually nothing more than a file. But because they are files, DOS and executable programs can be read like any other data files.

That's why this month's program, SNOOP, can read through any kind of file looking for messages in ASCII text format. To use SNOOP, just enter SNOOP and a filename at the command line. Any messages the file contains will be written on the screen. Try entering these commands (supplying your system's path to each file):

SNOOP QB.EXE
SNOOP COMMAND.COM
SNOOP MODE.COM

Many have claimed that MS-DOS is arbitrary, illogical, and difficult to learn, and that may be true of some of its aspects. But file handling is one of the exceptions. To appreciate how logically DOS handles files, consider the peculiar foibles of the early Macintosh operating system.

Apple tried to avoid the term file when the Macintosh was introduced. Instead, programs (executable files) were called applications, and the files they created were known as documents.

It was a noble but misguided idea. Applying the term document to a 200-layer CAD drawing or a database containing 10,000 employees didn't make the notion of files more concrete; it only added another confusing level of abstraction.

In short, a file is anything stored on disk, and the steps you take in using a disk file are analogous to the steps you would take with a manila folder. You must

1. Open the file.

2. Use the file (read it or write to it).

3. Close the file.

You deal with manila folders the same way. You can't take anything out of a file before you open it, and you'll run into problems if you don't close the file and put it away when you're finished.
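The three steps can be sketched in a few lines of QuickBASIC. This is only an illustration: the file name NOTES.TXT and the variable Text$ are made up for the example, and the first OPEN will overwrite any existing file of that name.

```basic
' Illustration only: NOTES.TXT and Text$ are hypothetical names.
OPEN "NOTES.TXT" FOR OUTPUT AS #1   ' 1. Open the file (creating it).
PRINT #1, "Hello, file"             ' 2. Use the file (write one line).
CLOSE #1                            ' 3. Close the file.

OPEN "NOTES.TXT" FOR INPUT AS #1    ' 1. Open it again, this time to read.
LINE INPUT #1, Text$                ' 2. Use the file (read the line back).
CLOSE #1                            ' 3. Close the file.
PRINT Text$
```

Every use of the file sits between an OPEN and a CLOSE, just as every use of a folder sits between taking it out and putting it away.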

The next larger unit of the operating system is the subdirectory, roughly equivalent to a filing cabinet. The earliest version of DOS left out subdirectories, and even after a tree-structured subdirectory system was introduced, many programs were unable to make use of it.

In QuickBASIC, a formal syntax chart of the OPEN command looks scary:

OPEN file$ [FOR mode] [ACCESS access] [lock] AS [#]filenumber% [LEN=reclen%]

Indeed, the options are almost overwhelming, but we'll pay attention only to the configuration of this month's OPEN statements. The first courtesy owed a user by a program that uses existing files (as opposed to one that creates files) is to ensure that the requested file exists and to display a suitable error message if it doesn't.

QuickBASIC, like Turbo Pascal, doesn't have a particularly attractive means of doing that. You have to lie in wait with ON ERROR, open the file for sequential input (that is, as if it were a text file you intend only to read), and watch for runtime error 53, which occurs when no file with that name exists. I couldn't find runtime errors (which are what ON ERROR traps) listed in the QuickBASIC documentation, so this information comes to you by way of experimentation. The test has to use sequential input because other modes, such as BINARY and RANDOM, create the file if it doesn't already exist. Once the test succeeds, you must close the file and start your program proper, in this case by immediately reopening the file in binary mode.

The first OPEN in the program, the dummy one whose only purpose in life is to see if the requested file is available, looks like this:

OPEN COMMAND$ FOR INPUT AS #1 ' Make sure the file exists.

This means "Open the file named on the command line for sequential access, and use file descriptor number 1." Note that the word sequential doesn't appear anywhere. This is because of the history of file management. BASIC originally could open only text files, and other modes were tacked onto the syntax later. File handling is one of the features that seem to be completely different on each implementation of BASIC on minis and mainframes and among dialects in those environments.

As mentioned, opening a nonexistent file triggers a branch to the user's error-handling routine at runtime; this month's ON ERROR routine has a hard-coded check for error 53 because that's QuickBASIC's internal error code for "File not found." If the file exists, execution continues. We close the file immediately (because it's been opened in the wrong mode) and reopen it in the next statement:

OPEN COMMAND$ FOR BINARY AS #1
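Put together, the existence check and the reopen might be sketched as follows. The label NoFile and the wording of the message are assumptions made for illustration; they are not taken from the SNOOP listing.

```basic
' Sketch only: NoFile and the message text are illustrative.
ON ERROR GOTO NoFile
OPEN COMMAND$ FOR INPUT AS #1     ' Error 53 here if the file is missing.
CLOSE #1                          ' It exists; this open was only a test.
ON ERROR GOTO 0                   ' Stop trapping errors.

OPEN COMMAND$ FOR BINARY AS #1    ' Reopen for byte-by-byte access.
' ... read the file here ...
CLOSE #1
END

NoFile:
  IF ERR = 53 THEN
    PRINT "File not found: "; COMMAND$
    END
  END IF
  ON ERROR GOTO 0                 ' Any other error: halt and report it.
```

Turning the trap off once the test has passed means that later errors are reported by QuickBASIC itself rather than being mistaken for a missing file.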

Binary access means the file is treated as a row of bytes on the disk, which the program is responsible for managing. In a text file, INPUT # searches for delimiters such as carriage returns instead of reading a certain number of bytes. So if you wanted to look for text strings in a file such as COMMAND.COM or WP.EXE, all kinds of nasty errors could happen because you have no guarantee that a delimiter will appear anywhere in a nontext file.

The best way to deal with a file of bytes is to create a data type that contains only one byte. You could use TYPE, but the easiest alternative here is to create an anonymous data type and immediately allocate space for it, a trick that C has had for years, Pascal still doesn't have, and QuickBASIC has acquired recently.

DIM NextByte AS STRING * 1

This statement creates a variable called NextByte that holds just one byte of data. We retrieve a byte from the input file this way:

GET #1, , NextByte ' Get the next character from the input file.

The empty parameter between #1 and NextByte is the record-number parameter used in RANDOM mode (in BINARY mode it would give a byte position). It's not necessary here, because leaving it out reads the next byte in sequence, but the comma must be retained as a placeholder. In the program the GET statement is placed in a normal WHILE NOT EOF/WEND loop.
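As a rough sketch of how byte-at-a-time reading can pick out text, the fragment below collects runs of printable ASCII characters and prints any run of four or more. It is not the SNOOP listing: the names MinLen, Text$, Code%, and I&, and the threshold of four, are assumptions chosen for the example. It also counts bytes with LOF rather than testing EOF, so that it doesn't depend on exactly when EOF becomes true in binary mode.

```basic
' Sketch only, not the SNOOP listing. MinLen, Text$, Code%, and I&
' are illustrative names. Assumes the file named on the command
' line exists.
CONST MinLen = 4                    ' Shortest run worth printing.
DIM NextByte AS STRING * 1

OPEN COMMAND$ FOR BINARY AS #1
Text$ = ""
FOR I& = 1 TO LOF(1)                ' LOF gives the file length in bytes.
  GET #1, , NextByte                ' Get the next byte in sequence.
  Code% = ASC(NextByte)
  IF Code% >= 32 AND Code% <= 126 THEN
    Text$ = Text$ + NextByte        ' Printable: extend the current run.
  ELSE
    IF LEN(Text$) >= MinLen THEN PRINT Text$
    Text$ = ""                      ' Nonprintable: start a new run.
  END IF
NEXT I&
IF LEN(Text$) >= MinLen THEN PRINT Text$   ' Flush a run ending at EOF.
CLOSE #1
```

The minimum length is there to filter out the short accidental runs of printable bytes that turn up in machine code; raising it gives less noise at the cost of missing short messages.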