Pack Up Your Data

Jim Butterfield

Associate Editor

There's nothing wrong with the way the CBM/PET/VIC writes data to sequential files. But sometimes it can be useful to pack the data in order to save space or aid certain types of processing.

If your program contains a statement like PRINT#1, V, and you execute it when V contains a value of, say, 159, five characters will be placed on the file: Space, 1, 5, 9, and RETURN. Caution: if you don't have 4.0 BASIC, one more character will be put to the file, a Line Feed, and it may give you problems. In this case, your program should say PRINT#1, V;CHR$(13); and be sure to include both semicolons. This applies to the VIC as well as to earlier PET/CBM units.

This is ideal for many purposes. An INPUT# statement executed at a later time will receive the characters just as if you had typed them on the keyboard, and the value of 159 will be input. All neat and orderly. What's more, the file is made up of conventional ASCII characters: it may be manipulated by text editors, sent to a communications line, or handled in a number of conventional ways.
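As a quick check of that round trip, here is a minimal sketch that writes one value and reads it back. The file name NUMS and the variable W are made up for the example, and the program assumes a disk drive connected as device 8 with no file called NUMS already on the disk.

```basic
100 REM NUMS IS A MADE-UP FILE NAME
110 OPEN 1,8,2,"0:NUMS,S,W"
120 V=159
130 REM EXPLICIT RETURN, NO LINE FEED
140 PRINT#1,V;CHR$(13);
150 CLOSE 1
160 OPEN 1,8,2,"0:NUMS,S,R"
170 INPUT#1,W
180 CLOSE 1
190 PRINT W
```

If all goes well, the last line should display 159. Line 140 uses the form with CHR$(13) and both semicolons, which should be safe on any of the BASIC versions mentioned above.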

But occasionally – rarely! – we might find a need to change the rules. We might have a utility program (notably a sort routine) that wants to handle the data in "columns" as if it were on a punched card. In this case, we would want to organize our data more formally. On the opposite side of the coin, we might need to crunch our data – it's very large and the file size is becoming a problem.

Formatted Data

Normally we would write the various fields of a computer record as individual items. To write name, initials, address, and balance, we might write:

PRINT#1, N$
PRINT#1, I$
PRINT#1, A$
PRINT#1, B

and it's written. Corresponding INPUT# statements would bring it back when needed. It's fairly compact and not hard to handle.

If we wanted to go into "fixed column" format, we'd need to make decisions. The name might be fitted into columns 1 to 15; the initials into columns 16 to 18; the address into columns 19 to 40; and the balance into columns 41 to 46. Now that we've made the decisions, we must pack the data that way.

Each field of data must be fitted to the fixed size. If the name were too long, we would need to trim it back with LEFT$(N$, 15); if it were too short, we'd need to extend it by adding spaces to the end. We can do both together by padding first and trimming second: LEFT$(N$ + S$, 15), where S$ is a string of spaces. S$ must hold enough spaces to fill the widest field; defining it once also makes our coding more compact.

Names must align on the left, so that the B of BUTTERFIELD will fall into the same column as the P of PUNTER; in this way, a column sort will place the two names in correct alphabetic order. Numeric values must go the other way: 123 and 45 must be placed so that the 3 and the 5 digits are lined up. This is called "right justification" and is done with the RIGHT$ function: RIGHT$(S$ + STR$(B), 6), where S$ is the string of spaces. One caution on numerics: be careful with fractions; it's usually better to change everything to integer values, such as cents rather than dollars-and-cents.
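A short sketch may make the two kinds of justification easier to see. The sample values and the variable names P$ and Q$ are invented for the illustration, and the brackets are printed only to make the padding visible.

```basic
100 REM BUILD S$ AS 25 SPACES
110 S$="":FOR I=1 TO 25:S$=S$+" ":NEXT I
120 N$="PUNTER":B=45
130 REM LEFT JUSTIFY NAME IN 15 COLUMNS
140 P$=LEFT$(N$+S$,15)
150 REM RIGHT JUSTIFY NUMBER IN 6 COLUMNS
160 Q$=RIGHT$(S$+STR$(B),6)
170 PRINT "[";P$;"]"
180 PRINT "[";Q$;"]"
```

The first line printed should show PUNTER followed by nine spaces inside the brackets; the second should show 45 preceded by four spaces. Note that RIGHT$ keeps only the last six characters, so a value too wide for the field would silently lose its leading characters. It is worth checking the range of B before packing it.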

The whole record then becomes:

S$ = "                      " (22 spaces)
R$ = LEFT$(N$ + S$, 15) + LEFT$(I$ + S$, 3) + LEFT$(A$ + S$, 22) + RIGHT$(S$ + STR$(B), 6)

Note that, in this case, every record will be exactly 46 characters long.

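When we read this record, we must extract the various fields. One INPUT# statement will bring in the whole record, provided the text fields contain no commas or colons, which INPUT# treats as separators. Extracting the fields is then quite easy with the MID$ function: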

N$ = MID$(R$, 1, 15)
I$ = MID$(R$, 16, 3)
A$ = MID$(R$, 19, 22)
B = VAL(MID$(R$, 41, 6))

The strings will be their original values, except that they will be padded out with extra spaces to make up the specified length.
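The packing and unpacking steps can be tried together without touching a file. The following is only a sketch: the sample data is made up, and the names N2$, I2$, A2$, and B2 are chosen here so that the unpacked values can be compared with the originals.

```basic
100 REM SAMPLE DATA IS MADE UP
110 S$="":FOR I=1 TO 22:S$=S$+" ":NEXT I
120 N$="SMITH":I$="JQ":A$="12 MAIN STREET":B=1234
130 R$=LEFT$(N$+S$,15)+LEFT$(I$+S$,3)+LEFT$(A$+S$,22)+RIGHT$(S$+STR$(B),6)
140 REM RECORD LENGTH SHOULD BE 46
150 PRINT LEN(R$)
160 REM UNPACK BY COLUMN POSITION
170 N2$=MID$(R$,1,15)
180 I2$=MID$(R$,16,3)
190 A2$=MID$(R$,19,22)
200 B2=VAL(MID$(R$,41,6))
210 PRINT N2$:PRINT I2$:PRINT A2$:PRINT B2
```

The length printed should be 46, and B2 should come back as 1234. The three strings return with their trailing spaces still attached.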

Packing Them In

In contrast to the previous formatting, binary packing saves space. It makes the information almost indecipherable, however, unless you have the key. Also, as we crunch the information together, we lose the capability to manipulate the data with other programs, since what we are writing is not readable ASCII.

The principle is this: why store a value like 169 in five bytes of storage when the binary value of 169 will fit into one byte? It's a dangerous road. We must be sure to leave enough space for the size of the number we plan to hold. Two bytes, for example, will hold an integer value from zero to 65535.
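To see where the two-byte limit comes from, it may help to split a sample value by hand. In this sketch the value 1000 and the names H and L are chosen purely for illustration.

```basic
100 V=1000
110 H=INT(V/256):REM HIGH BYTE
120 L=V-H*256:REM LOW BYTE
130 PRINT H;L
140 PRINT L+H*256
```

This should print 3 and 232, then 1000 again, since 3 * 256 + 232 = 1000. Each byte can hold at most 255, so the largest two-byte value is 255 * 256 + 255 = 65535.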

When we print binary values to a file, we must abandon all our "normal" formatting rules. For example, a value of 13 stored in binary will be indistinguishable from a RETURN character, so we won't be able to use the INPUT# statement to read it. A word of caution to cassette tape users: two characters cannot be written to tape files, CHR$(10) (Line Feed) and CHR$(0) (Null). This makes cassette tape of limited use in building packed files.

Let's write some packed numbers to a file. We'll assume that the numbers will fit into two bytes, so the values will range from zero to 65535. We'll write ten numbers to a binary file:

100 OPEN 1, 8, 2, "0:DATBIN,U,W"

Note that we designate the file as type USR (User). A USR file behaves the same as a Sequential file; we just want to mark it as being in an unusual format.

110 FOR J = 1 TO 10
120 INPUT V : IF V < 0 OR V > 65535 GOTO 120
130 V% = V/256 : L = V - V% * 256

We have split V into its high byte (V%) and its low byte (L).

140 PRINT#1, CHR$(L); CHR$(V%);

Don't forget the semicolons.

150 NEXT J
160 CLOSE 1

Ten numbers have been written into 20 bytes. Now let's read them.

100 OPEN 1, 8, 2, "DATBIN,U,R"
110 FOR J = 1 TO 10
120 GET#1, A$, B$

We must use GET#; INPUT# can't cope.

130 PRINT ASC(A$ + CHR$(0)) + ASC(B$ + CHR$(0)) * 256

The CHR$ (0) is needed to allow for zeros; they will be received by the GET statement as a null string.

140 NEXT J
150 CLOSE 1

We've just coded numbers very compactly. One hundred numbers would fit into 200 bytes or one disk sector. Similar numbers in conventional sequential files would take up three or four sectors.

Most of the time, you'll want to stay with ordinary data files. They are more orderly and easier.

But you can build special types of files if you wish. Formatting and compacting are perfectly logical manipulations. Use them with care – when you need them.