VBA Project Compression
Compression
As described in the first article [link to StructuredStorage.php], prior to Word 2007, Word Documents (and
Excel Workbooks, amongst others) were held in a containing structure that mimics a mini file system. Despite
creating an entirely new XML‑based format for documents themselves, Microsoft still use the old format for
storing VBA Projects within the new structure. I have written about the
VBA structure [link to VBAStorage.php], and now turn to the compression used,
supposedly to save space.
It is not compulsory for you to follow along, but I will describe what I do in a way that you can if you wish. I will say now, and may say again, that VBA is not really the language of choice for this but I like to use VBA to demonstrate because it is available to all users of Word: you don’t need special software to copy what I do.
As an arbitrary starting point, create a new document, and insert a new module, “Module1”; copy the
VBA code from the first page of this series [link to StructuredStorage.php#TheCode] into the new module and save the document, in Word
2007‑format as a macro‑enabled document, calling it, say, “Arbitrary.docm”. I was
writing an article on the 2007‑format files before I allowed myself to be sidetracked into writing this one,
and do not propose to provide any specific details here, other than those necessary for the matter at hand. If you
rename the document file as “Arbitrary.docm.zip”, open the resultant zip folder, and navigate to the
“word” directory inside it, the VBA project will be, by default, in the file
“vbaProject.bin”; if you extract this file you will have something to work with. When you have extracted
the file, you can close the folder and rename the file back to “Arbitrary.docm”. You don’t
actually have to do this yourself: I have done it for you and you can download a zip folder containing the two files,
by clicking here:
[link to the file on this site at files/ArbitrarySample.zip]
I should, perhaps, say that the above description and download are exactly the same as presented on the
page dedicated to the structure of the VBA Project [link to VBAStorage.php]. Whilst the two processes could, clearly, be combined,
there is no dependency and I have chosen not to combine them for the purposes of the demonstration I offer here.
The sample code that I posted is just some scaffolding on which to build; it shows how to navigate the physical file, but it doesn’t really offer much help when you want to work with the logical contents of the file that are held inside the physical wrapper. The navigation presented ends with the extraction of what is called the dir Stream. The dir stream contains essential information about the Project, and must be read, and interpreted, before any other stream. On this page, I will explain how to decompress the stream, and on the next (NOT WRITTEN YET!), how to understand the decompressed contents.
This page is technical and most of what marketers would call its target audience are expected to be able to understand it; if you are a programmer, you can almost certainly skip this section. I do, however, try to make my writings accessible to all, and I feel an explanation of the numbers used here is in order.
Numbers (in modern Western systems) are written, to base 10, using a positional notation; each individual digit has a particular meaning that depends on its position in the context of surrounding digits. As I’m quite sure you know, the “9” in “95” means 90, that is 9 × 10 (9 times the base), whilst the “9” in “3917” means 900: 9 × 10 × 10 (9 times the base times the base). This, however, is largely meaningless to computers, which work with binary numbers, numbers to the base 2.
Binary numbers are, similarly, written using a positional notation but, in the binary system, there are only two digits: zero and one. The “1” in the binary number “10” means 2, that is 1 × 2 (1 times the base), and the “1” in “1000” means 8: 1 × 2 × 2 × 2 (1 times the base times the base times the base). For clarity, whenever binary numbers are used here they are written with a prefix of “0b”, so the decimal number 25 would be written as 0b11001.
Binary numbers can get quite long, and easy to misread. For example the binary equivalent of 4000 would be 0b111110100000. To make life easier for people (computers have no difficulties), number systems based on higher numbers are better. A system based on 8, called octal is sometimes seen, but a system based on 16, called hexadecimal, is more normally used.
In hexadecimal, there are 16 different digits, and the letters A through F are used for the digits 10 through 15. Hexadecimal is, like the other systems, written positionally, and the “B” in “B0” means 176, that is B × 16 (B, or 11, times the base), whilst the “B” in “1B0F” means 2816, that is B × 16 × 16 (B times the base times the base). Hexadecimal (or, more usually, just hex) numbers used here are written with a prefix of “0x”, so the decimal number 4000, 0b111110100000 in binary, you’ll remember, would be written as 0xFA0. When presenting data in ‘dump’ format, however, the numbers are always hexadecimal and are just presented as numbers; to give each of them an “0x” prefix would make them unreadable.
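If you would like to check these conversions for yourself, VBA can do the work. Hex$ is built in, but there is no built-in equivalent for binary, so the ToBinary function below is a hypothetical helper of my own, written purely for illustration; it is a minimal sketch that assumes the number is not negative.

```vba
Function ToBinary(ByVal Number As Long) As String
    ' Hypothetical helper: builds the binary digits of a non-negative
    ' number by repeatedly taking the remainder on division by 2.
    Dim Digits As String
    If Number = 0 Then Digits = "0"
    Do While Number > 0
        Digits = CStr(Number Mod 2) & Digits
        Number = Number \ 2
    Loop
    ToBinary = "0b" & Digits
End Function

Sub ShowNumberBases()
    Debug.Print ToBinary(25)         ' 0b11001
    Debug.Print ToBinary(4000)       ' 0b111110100000
    Debug.Print "0x" & Hex$(4000)    ' 0xFA0
    ' VBA writes hex literals with an "&H" prefix instead of "0x";
    ' these two are the values of B0 and B00, 176 and 2816.
    Debug.Print &HB0, &HB00
End Sub
```

Run ShowNumberBases with the Immediate window open to see the results.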
Before you can run the code in the document, assuming you downloaded, or otherwise created, it, you need to make one small change to make it read the right file. If you have the document and the file in the same folder, as the download presents it, then you don't need to know, or hard code, exactly where it is. Just changing these two lines:
FilePath = "C:\Path\To\Your\"
FileName = "Document.doc"
.. to this:
FilePath = MacroContainer.Path & Application.PathSeparator
FileName = "vbaProject.bin"
.. will suffice. If, having done that, you run the code and put a breakpoint on the Close FileNo statement, you will be able to look at the contents of the stream after it has been extracted, but it isn’t easy looking at a byte array like that, so here is a hex version of the complete stream.
000000 01 31 B2 80 01 00 04 00 00 00 03 00 30 2A 02 02 .1².........0*..
000010 90 09 00 70 14 06 48 03 00 82 02 00 64 E4 04 04 ...p..H.....dä..
000020 00 07 00 1C 00 50 72 6F 6A 65 63 74 05 51 00 28 .....Project.Q.(
000030 00 00 40 02 14 06 02 14 3D AD 02 0A 07 02 6C 01 ..@.....=....l.
000040 14 08 06 12 09 02 12 80 35 CC 87 51 10 00 0C 02 ........5Ì.Q....
000050 4A 12 3C 02 0A 16 00 01 72 73 74 64 10 6F 6C 65 J.<.....rstd.ole
000060 3E 02 19 73 00 74 00 00 64 00 6F 00 6C 00 65 50 >..s.t..d.o.l.eP
000070 00 0D 00 68 00 25 5E 00 03 2A 00 5C 47 7B 30 30 ...h.%^..*.\G{00
000080 30 32 30 B0 34 33 30 2D 00 08 04 04 43 00 0A 03 020°430-....C...
000090 02 0E 01 12 30 30 34 36 7D 23 00 32 2E 30 23 30 ....0046}#.2.0#0
0000A0 23 43 3A 00 5C 57 69 6E 64 6F 77 73 00 5C 53 79 #C:.\Windows.\Sy
0000B0 73 74 65 6D 33 04 32 5C 03 65 32 2E 74 6C 62 00 stem3.2\.e2.tlb.
0000C0 23 4F 4C 45 20 41 75 74 80 6F 6D 61 74 69 6F 6E #OLE Aut.omation
0000D0 00 60 03 00 02 83 45 4E 6F 72 6D 61 6C 05 83 45 .`....ENormal..E
0000E0 4E 80 43 72 00 6D 00 61 51 80 46 0E 00 20 80 11 N.Cr.m.aQ.F.. ..
0000F0 09 80 01 2A 0C 5C 43 03 12 0A 06 B8 7D 3F 51 04 ...*.\C....¸}?Q.
000100 04 00 83 21 4F 66 66 69 63 11 84 67 4F 00 66 80 ...!Offic..gO.f.
000110 00 69 00 63 15 82 67 9E 80 1F 94 82 21 47 7B 32 .i.c..g.....!G{2
000120 00 44 46 38 44 30 34 43 2D 00 35 42 46 41 2D 31 .DF8D04C-.5BFA-1
000130 30 31 40 42 2D 42 44 45 35 80 67 41 6A 41 80 65 01@B-BDE5.gAjA.e
000140 34 80 05 32 88 67 80 BA 67 00 72 61 6D 20 46 69 4..2.g.ºg.ram Fi
000150 6C 65 00 73 5C 43 6F 6D 6D 6F 6E 01 04 06 4D 69 le.s\Common...Mi
000160 63 72 6F 73 6F 00 66 74 20 53 68 61 72 65 00 64 croso.ft Share.d
000170 5C 4F 46 46 49 43 45 00 31 35 5C 4D 53 4F 2E 44 \OFFICE.15\MSO.D
000180 18 4C 4C 23 87 10 83 4D 20 31 35 20 2E 30 20 4F .LL#...M 15 .0 O
000190 62 81 E3 20 4C C0 69 62 72 61 72 79 80 25 80 00 b.ã LÀibrary.%..
0001A0 22 0F 82 7A 02 00 13 C2 01 15 80 02 19 42 65 54 "..z...Â.....BeT
0001B0 68 69 73 44 6F 00 63 75 6D 65 6E 74 47 00 0A 18 hisDo.cumentG...
0001C0 C0 09 54 C0 66 69 00 73 00 22 44 C0 48 63 00 75 À.TÀfi.s."DÀHc.u
0001D0 40 49 65 00 AA 6E C0 6E 1A CE 0B 32 DA 0B 1C C0 @Ie.ªnÀn.Î.2Ú..À
0001E0 12 A8 00 00 48 42 01 31 42 89 0D 40 A1 16 1E 42 .¨..HB.1B..@¡..B
0001F0 02 01 05 2C C2 21 11 1D 22 15 42 08 2B 42 01 19 ...,Â!..".B.+B..
000200 82 A1 4D 6F 64 80 75 6C 65 31 47 00 0E 00 05 1A .¡Mod.ule1G.....
000210 4D 80 21 64 80 21 81 8D 31 00 1A 0D 09 08 32 10 M.!d.!..1.....2.
000220 08 4F 1D E1 57 00 00 B1 4D 1D A8 D2 21 C2 1B 43 .O.áW..±M.¨Ò!Â.C
000230 1D 10 C2 02 00 ..Â..
You can see bits of recognisable text there but, overall, it isn't readable; this is because it is compressed. From what I have seen, I doubt whether the compression saves as much space as is wasted by the unnecessary inclusion of extra data, but what I think is of little consequence; the data is compressed and must be decompressed before going any further.
A fairly consistent feature of the documentation that Microsoft has released is that it makes everything appear more complicated than it really is; so it is with the compression that is used here. What Microsoft call the Compressed Container contains a signature byte (0x01) followed by a series of “Chunks”. The term, Chunk, is one I despise (should anybody who worked with me in the late nineties happen to be reading, you know what I mean!) and, although I know it can be difficult to come up with distinct terminology, I do think that the use of such terms smacks of desperation.
Each chunk, with the exception of the last one in a Container, which will usually be smaller, contains 4096 bytes of data, and will be compressed if space can be saved by so doing. Each chunk begins with a 16‑bit ‘header’, the low order 12 bits of which have a value three less than the length of the chunk; the reason for this is that the maximum possible size of a chunk (the 2‑byte header followed by 4096 bytes of data) is 4098, three greater than the maximum value that can be held in 12 bits, which is 4095. The high order bit of the header is a flag indicating whether or not the chunk is compressed, and the other three bits have a fixed value of 0b011.
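The header arithmetic is easier to follow with real numbers. What follows is a minimal sketch, separate from any routine presented elsewhere on this page; the three function names are my own, invented for illustration, and the two bytes used, 0x31 and 0xB2, are the ones that follow the signature byte in the dump above.

```vba
Function ChunkLength(ByVal Header As Long) As Long
    ' Low order 12 bits, plus 3, gives the full length of the chunk,
    ' including its own two header bytes.
    ChunkLength = (Header And &HFFF&) + 3
End Function

Function ChunkSignature(ByVal Header As Long) As Long
    ' The three bits below the top bit; should always be 3 (0b011).
    ChunkSignature = (Header And &H7000&) \ &H1000&
End Function

Function ChunkIsCompressed(ByVal Header As Long) As Boolean
    ' The high order bit: set means the chunk is compressed.
    ChunkIsCompressed = ((Header And &H8000&) <> 0)
End Function

Sub ShowChunkHeader()
    Dim Header As Long
    ' The header is stored low order byte first.
    Header = &H31 + 256& * &HB2
    Debug.Print Hex$(Header)                ' B231
    Debug.Print ChunkLength(Header)         ' 564, which is 0x234
    Debug.Print ChunkSignature(Header)      ' 3
    Debug.Print ChunkIsCompressed(Header)   ' True
End Sub
```

The trailing "&" on the hex constants matters: without it, VBA treats &H8000 as a 16‑bit Integer with a negative value.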
The compressed data consists of a series of what are called Token Sequences. Each Token Sequence consists of a byte, to be viewed as eight separate flag bits, followed by eight Tokens, the type of each token being indicated by the corresponding flag bit. If a flag bit is 0, the corresponding token is a single byte, a Literal Token, to be taken ‘as is’; if the flag bit is 1, the token is a two byte code, a Copy Token, which, after being unscrambled, gives the position and length of a sequence earlier in the decompressed chunk, which must be copied.
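To see how a flag byte maps on to its eight tokens, here is a small sketch. The function name, TokenTypes, is hypothetical, and it relies on the flags being read from the low order bit upwards, so that the rightmost bit describes the first token.

```vba
Function TokenTypes(ByVal FlagByte As Byte) As String
    ' Hypothetical helper: returns one letter per token, in the order
    ' the tokens appear in the data: "L" for a Literal Token (one
    ' byte), "C" for a Copy Token (two bytes).
    Dim BitNumber As Long
    Dim BitValue As Long
    BitValue = 1
    For BitNumber = 0 To 7
        If (FlagByte And BitValue) = 0 Then
            TokenTypes = TokenTypes & "L"
        Else
            TokenTypes = TokenTypes & "C"
        End If
        BitValue = BitValue * 2
    Next BitNumber
End Function

Sub ShowTokenTypes()
    Debug.Print TokenTypes(&H80)   ' LLLLLLLC
    Debug.Print TokenTypes(&H2A)   ' LCLCLCLL
End Sub
```

The two values shown, 0x80 and 0x2A, are the first two flag bytes in the dump above.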
The Copy Token is made up of two parts, an offset code and a length code. The offset code is one less than the number of bytes to the left of the current position in the decompressed chunk from where to start copying, and the length code is three less than the number of bytes to copy. There is no good reason for these small increments but they are used and you need to know about them. The number of bits (of the 16 in the copy token) used for the offset (and, thus, those remaining that are used for the length) is calculated as being the smallest integer that is equal to or greater than the logarithm to base 2 of the length, so far, of the decompressed chunk, subject to it never being less than four or greater than 12.
If you are one of those unfortunate people who shiver at the mere mention of numbers, do not fear. None of this is difficult and I am here to explain it, and to give you some code so that you don’t even need to understand it. The logarithm (or log) of a number is just a way of saying how many copies of another number (the base of the log) have to be multiplied together to get the (first) number. For example, the logarithm (to base 2) of 8 is 3, because three copies of 2 are needed to make 8: 8 = 2 × 2 × 2, and the log (to base 2) of 16 is 4, because four copies of 2 are needed to make 16: 16 = 2 × 2 × 2 × 2. So what, you may ask, about 12, or 14? Well, the answer is three and a bit; those numbers are more than 8, and less than 16, so the logs of those numbers are more than 3 and less than 4, and that is all you need to know for the purposes of this decompression.
Logs are most usually presented as logs to base 10, and the unqualified term, “Log”, usually means log to base 10, but mathematicians are more likely to use natural logarithms, logs to the base e. e is a special number in mathematics, equal, if you are interested, to approximately 2.71828, and formally defined as the limit of (1 + 1/n)^n as n tends to infinity, which means that the bigger the value of n, the closer the result is to e. The term, “Ln”, is generally used for natural logarithms. Logs to base 2 are sometimes useful in computing, where everything is in the binary system, and based on 2, but there is no special term for such logs.
VBA is not the best language for working with logs; it does have what it calls a Log function, but it should really be named Ln, as it returns natural logarithms, not logs to base 10. There is no function to return a log to base 2, but there is a way to convert logs from one base to another: you do this by dividing the log of your number to the first base by the log of the second base to the first base. If you want to use VBA to get the log to base 2 of 7, you do this:
LogBaseTwoOfSeven = Log(7) / Log(2)
I must just say, before moving on, that I have been plagued by “Expression too complex” errors when I use this code in anger, and, indeed, with other code involving mathematical calculations, and have used a different mechanism, one that VBA does seem able to cope with, for the code you will see later on this page.
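One way to sidestep Log altogether is to keep doubling a number until it is big enough, counting the doublings as you go. The sketch below does this with the limits of 4 and 12 built in; the function name, OffsetBitCount, is my own, and this is an illustration of the idea, not necessarily the mechanism used elsewhere on this page.

```vba
Function OffsetBitCount(ByVal DecompressedLength As Long) As Long
    ' Sketch: the smallest whole number of bits, between 4 and 12,
    ' whose power of two is at least the length decompressed so far.
    ' This is the rounded-up log to base 2, found without using Log.
    Dim Bits As Long
    Dim Power As Long
    Bits = 4
    Power = 16                      ' 2 multiplied together 4 times
    Do While Power < DecompressedLength And Bits < 12
        Power = Power * 2
        Bits = Bits + 1
    Loop
    OffsetBitCount = Bits
End Function

Sub ShowOffsetBitCounts()
    Debug.Print OffsetBitCount(7)      ' 4 (the minimum applies)
    Debug.Print OffsetBitCount(16)     ' 4 (log is exactly 4)
    Debug.Print OffsetBitCount(17)     ' 5 (log is 4 and a bit)
    Debug.Print OffsetBitCount(4095)   ' 12 (the maximum)
End Sub
```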
Working, manually, through the beginning of the dir stream here, will, I hope, make everything clear. As seen in the picture above, the stream begins with 0x01: the expected signature byte. This is followed immediately by the first compressed chunk, the first two bytes of which, the ‘header’, are 0x31 and 0xB2. I’m sure I explained the ‘little‑endian’ format on the previous page, so you should know that these two bytes represent the 16‑bit value 0xB231, the low order 12 bits of which are 0x231. In this case, with a single chunk, it is relatively easy to verify this from the view above. The stream length is 0x235, one more than the chunk length of 0x234 (0x231 plus 3). The high order four bits of the header are 0xB, or 0b1011: the high order “1” signifies that this chunk is compressed and the low order 0b011 is the fixed value it should be.
After the chunk header comes the first Token Sequence. The first byte of the token sequence is 0x80, which, in binary, is 0b10000000. When reading bits from a byte it is usual to work from low order to high order, that is, right to left, so the eight flags represented by this byte are 0, 0, 0, 0, 0, 0, 0, and 1. This means that the first seven tokens of this sequence are literal tokens, single bytes to be copied (the actual bytes, here, are 0x01, 0x00, 0x04, 0x00, 0x00, 0x00, and 0x03), and the eighth token is a copy token, two bytes (0x00 and 0x30) to be interpreted. The interpretation depends on the length of the decompressed chunk so far, and, so far, it is:
000000 01 00 04 00 00 00 03
Just seven bytes. The logarithm to base 2 of 7 is a bit less than 3. You saw, above, how the log, to base 2, of 8 is exactly 3, and as 7 is a little less than 8, so the log of 7 is a little less than the log of 8. If you remember from my description, all you want is the smallest possible integer (whole number) that is at least as large as the log; given a log of 2 and a bit, that whole number is, I hope you can see, 3. Again, if you remember, the number you want is subject to a constraint of not being less than 4, so the actual number of bits of the Copy Token used for the offset code in this case, is 4.
The two bytes of the Copy Token were 0x00 and 0x30, a 16‑bit value of 0x3000. You now know that 4 bits of this are used for the offset, and the remaining 12 for the length code. Despite what I said earlier about usually reading bits from right to left, it is the leftmost (high order) bits of the copy token that are taken as the offset code, and the rightmost, or low order, ones that make up the length code. The first 4 bits are 0b0011 (a value of 3, representing an offset, one greater, of 4 bytes), and the remaining 12 bits are 0b000000000000 (a value of 0, representing a length 3 bytes greater, of 3). The token tells you to go back four bytes and copy three bytes from there. Doing this gives a decompressed chunk that now looks like this, with the copied bytes highlighted:
000000 01 00 04 00 00 00 03 00 00 00
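The unscrambling of a Copy Token can be expressed in a few lines. This is a sketch under stated assumptions: the routine name, DecodeCopyToken, is hypothetical, and the number of offset bits is passed in by the caller. Simple division and remainder do the job of the bit masks.

```vba
Sub DecodeCopyToken(ByVal Token As Long, ByVal OffsetBits As Long, _
                    ByRef CopyOffset As Long, ByRef CopyLength As Long)
    ' Sketch: splits a 16-bit copy token into an offset and a length,
    ' given the number of high order bits that hold the offset code.
    Dim LengthRange As Long
    Dim Bit As Long
    ' The number of different values the length code can take:
    ' 2 multiplied together once for each bit of the length code.
    LengthRange = 1
    For Bit = 1 To 16 - OffsetBits
        LengthRange = LengthRange * 2
    Next Bit
    CopyOffset = (Token \ LengthRange) + 1     ' offset code is one less
    CopyLength = (Token Mod LengthRange) + 3   ' length code is three less
End Sub

Sub ShowCopyTokens()
    Dim CopyOffset As Long
    Dim CopyLength As Long
    ' The trailing "&" keeps each constant a Long; without it, VBA
    ' would read a value such as &H9002 as a negative Integer.
    DecodeCopyToken &H3000&, 4, CopyOffset, CopyLength
    Debug.Print CopyOffset; CopyLength         ' offset 4, length 3
    DecodeCopyToken &H9002&, 4, CopyOffset, CopyLength
    Debug.Print CopyOffset; CopyLength         ' offset 10, length 5
    DecodeCopyToken &H4806&, 5, CopyOffset, CopyLength
    Debug.Print CopyOffset; CopyLength         ' offset 10, length 9
End Sub
```

The first call decodes the token just described, 0x3000 with 4 offset bits; the other two tokens also come from the start of the compressed dump above.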
Phew! I have tried to explain everything in detail; I hope I have succeeded. From here on it should be plain sailing. Going back to the compressed chunk, the next token sequence begins with a flag byte of 0x2A, equal, in binary, to 0b00101010. First a simple literal token: copy the next byte (0x02) and the decompressed stream becomes:
000000 01 00 04 00 00 00 03 00 00 00 02
Next a copy token, 0x9002. The decompressed chunk is now 11 bytes long but the number of bits for the offset code is still subject to the minimum of 4. The offset code is 9, so the offset is 10 bytes, and the length code is 2, giving a number of bytes to copy of 5. With the copied bytes highlighted as before, here is the result:
000000 01 00 04 00 00 00 03 00 00 00 02 00 04 00 00 00
Another literal token (0x09), is followed by another copy token of 0x7000. The length of the decompressed chunk is now 17 bytes, so the number of bits dedicated to the offset code is 5, because log base 2 of 17 is 4 and a bit. The first five bits of the copy token are 0b01110, a decimal value of 14, indicating an offset of 15; the remaining bits are all zero, indicating a length of 3. Adding the literal, and then copying the appropriate bytes, extends the result, so far, to:
000000 01 00 04 00 00 00 03 00 00 00 02 00 04 00 00 00
000010 09 04 00 00
The next literal token is 0x14, and the copy token that follows is of 0x4806. The length of the decompressed chunk is now 21 bytes, but the number of bits dedicated to the offset code is still 5; the first five bits of the copy token are 0b01001, a decimal value of 9, indicating an offset of 10, and the remaining bits (0b00000000110) have a value of 6, indicating a length of 9. Adding this literal, and then copying the nine bytes, gives:
000000 01 00 04 00 00 00 03 00 00 00 02 00 04 00 00 00
000010 09 04 00 00 14 00 04 00 00 00 09 04 00 00
To finish this token sequence there are two literal tokens, 0x03 and 0x00, to be added to the decompressed chunk. If you really felt inspired, you could continue like this all the way to the end, but I rather suspect you are more interested in the end result than the laborious process, and, so, I have done it for you, and this is that end result:
000000 01 00 04 00 00 00 03 00 00 00 02 00 04 00 00 00 ................
000010 09 04 00 00 14 00 04 00 00 00 09 04 00 00 03 00 ................
000020 02 00 00 00 E4 04 04 00 07 00 00 00 50 72 6F 6A ....ä.......Proj
000030 65 63 74 05 00 00 00 00 00 40 00 00 00 00 00 06 ect......@......
000040 00 00 00 00 00 3D 00 00 00 00 00 07 00 04 00 00 .....=..........
000050 00 00 00 00 00 08 00 04 00 00 00 00 00 00 00 09 ................
000060 00 04 00 00 00 35 CC 87 51 10 00 0C 00 00 00 00 .....5Ì.Q.......
000070 00 3C 00 00 00 00 00 16 00 06 00 00 00 73 74 64 .<...........std
000080 6F 6C 65 3E 00 0C 00 00 00 73 00 74 00 64 00 6F ole>.....s.t.d.o
000090 00 6C 00 65 00 0D 00 68 00 00 00 5E 00 00 00 2A .l.e...h...^...*
0000A0 5C 47 7B 30 30 30 32 30 34 33 30 2D 30 30 30 30 \G{00020430-0000
0000B0 2D 30 30 30 30 2D 43 30 30 30 2D 30 30 30 30 30 -0000-C000-00000
0000C0 30 30 30 30 30 34 36 7D 23 32 2E 30 23 30 23 43 0000046}#2.0#0#C
0000D0 3A 5C 57 69 6E 64 6F 77 73 5C 53 79 73 74 65 6D :\Windows\System
0000E0 33 32 5C 73 74 64 6F 6C 65 32 2E 74 6C 62 23 4F 32\stdole2.tlb#O
0000F0 4C 45 20 41 75 74 6F 6D 61 74 69 6F 6E 00 00 00 LE Automation...
000100 00 00 00 16 00 06 00 00 00 4E 6F 72 6D 61 6C 3E .........Normal>
000110 00 0C 00 00 00 4E 00 6F 00 72 00 6D 00 61 00 6C .....N.o.r.m.a.l
000120 00 0E 00 20 00 00 00 09 00 00 00 2A 5C 43 4E 6F ... .......*\CNo
000130 72 6D 61 6C 09 00 00 00 2A 5C 43 4E 6F 72 6D 61 rmal....*\CNorma
000140 6C B8 7D 3F 51 04 00 16 00 06 00 00 00 4F 66 66 l¸}?Q........Off
000150 69 63 65 3E 00 0C 00 00 00 4F 00 66 00 66 00 69 ice>.....O.f.f.i
000160 00 63 00 65 00 0D 00 9E 00 00 00 94 00 00 00 2A .c.e...........*
000170 5C 47 7B 32 44 46 38 44 30 34 43 2D 35 42 46 41 \G{2DF8D04C-5BFA
000180 2D 31 30 31 42 2D 42 44 45 35 2D 30 30 41 41 30 -101B-BDE5-00AA0
000190 30 34 34 44 45 35 32 7D 23 32 2E 30 23 30 23 43 044DE52}#2.0#0#C
0001A0 3A 5C 50 72 6F 67 72 61 6D 20 46 69 6C 65 73 5C :\Program Files\
0001B0 43 6F 6D 6D 6F 6E 20 46 69 6C 65 73 5C 4D 69 63 Common Files\Mic
0001C0 72 6F 73 6F 66 74 20 53 68 61 72 65 64 5C 4F 46 rosoft Shared\OF
0001D0 46 49 43 45 31 35 5C 4D 53 4F 2E 44 4C 4C 23 4D FICE15\MSO.DLL#M
0001E0 69 63 72 6F 73 6F 66 74 20 4F 66 66 69 63 65 20 icrosoft Office
0001F0 31 35 2E 30 20 4F 62 6A 65 63 74 20 4C 69 62 72 15.0 Object Libr
000200 61 72 79 00 00 00 00 00 00 0F 00 02 00 00 00 02 ary.............
000210 00 13 00 02 00 00 00 15 80 19 00 0C 00 00 00 54 ...............T
000220 68 69 73 44 6F 63 75 6D 65 6E 74 47 00 18 00 00 hisDocumentG....
000230 00 54 00 68 00 69 00 73 00 44 00 6F 00 63 00 75 .T.h.i.s.D.o.c.u
000240 00 6D 00 65 00 6E 00 74 00 1A 00 0C 00 00 00 54 .m.e.n.t.......T
000250 68 69 73 44 6F 63 75 6D 65 6E 74 32 00 18 00 00 hisDocument2....
000260 00 54 00 68 00 69 00 73 00 44 00 6F 00 63 00 75 .T.h.i.s.D.o.c.u
000270 00 6D 00 65 00 6E 00 74 00 1C 00 00 00 00 00 48 .m.e.n.t.......H
000280 00 00 00 00 00 31 00 04 00 00 00 0D 03 00 00 1E .....1..........
000290 00 04 00 00 00 00 00 00 00 2C 00 02 00 00 00 11 .........,......
0002A0 1D 22 00 00 00 00 00 2B 00 00 00 00 00 19 00 07 .".....+........
0002B0 00 00 00 4D 6F 64 75 6C 65 31 47 00 0E 00 00 00 ...Module1G.....
0002C0 4D 00 6F 00 64 00 75 00 6C 00 65 00 31 00 1A 00 M.o.d.u.l.e.1...
0002D0 07 00 00 00 4D 6F 64 75 6C 65 31 32 00 0E 00 00 ....Module12....
0002E0 00 4D 00 6F 00 64 00 75 00 6C 00 65 00 31 00 1C .M.o.d.u.l.e.1..
0002F0 00 00 00 00 00 48 00 00 00 00 00 31 00 04 00 00 .....H.....1....
000300 00 E1 57 00 00 1E 00 04 00 00 00 00 00 00 00 2C .áW............,
000310 00 02 00 00 00 A8 D2 21 00 00 00 00 00 2B 00 00 .....¨Ò!.....+..
000320 00 00 00 10 00 00 00 00 00 .........
You probably still can’t make much sense of this, but it is easier to read than the compressed version. I will explain all the contents in due course: just be patient! You have now seen an explanation, and an example. As I’m quite sure you realise, mad as I may be, I did not decompress that whole stream by hand. VBA may not be the best language for the job, but it can do it, and you have it at your fingertips, so now it’s time to find out how to use VBA for this task.
Here is a routine based on the notes you have just read. There are some comments in it, but they, largely, just repeat what you already know. Place it somewhere in the module - at the end is as good as anywhere.
Sub DecompressContainer(ByRef CompressedContainer() As Byte, _
                        ByRef Compndx As Long, _
                        ByRef DeCompressedData() As Byte)
    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    ' This routine receives a Stream as a byte array, and an index into it, which '
    ' points to the start of a Compressed Container. There is nothing to indicate '
    ' where the container ends, so the only possible assumption, that it runs all '
    ' the way to the end of the Stream, is taken. The routine must also be passed '
    ' an empty byte array, which it will resize and fill with decompressed data. '
    ' It is done this way to avoid the necessity of copying afterwards. '
    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    Dim Decompndx As Long
    Dim DecompLen As Long
    Dim ChunkHeader As Long
    Dim ChunkSignature As Long
    Dim ChunkFlag As Long
    Dim ChunkSize As Long
    Dim ChunkEnd As Long
    Dim BitFlags As Byte
    Dim Token As Long
    Dim BitCount As Long
    Dim BitMask As Long
    Dim CopyLength As Long
    Dim CopyOffset As Long
    Dim ndx As Long
    Dim ndx2 As Long
    Dim PowerOf2(0 To 16) As Long
    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    ' A d****d irritating bit of initialisation. I have been having no end of '
    ' trouble with "Expression too complex" errors, always seeming to be when I '
    ' use exponentiation. To avoid them I pre-calculate the values and index into '
    ' the resulting array. Perchance this is actually an unintended optimisation, '
    ' although it would be better done, once, rather than every time, here. '
    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    PowerOf2(0) = 1
    For ndx = 1 To UBound(PowerOf2)
        PowerOf2(ndx) = PowerOf2(ndx - 1) * 2
    Next
    Do ' Once per chunk
        If (Not DeCompressedData) = True Then
            ReDim DeCompressedData(0 To 4095)
            Decompndx = 0
        Else
            ReDim Preserve DeCompressedData(UBound(DeCompressedData) + 4096)
        End If
        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
        ' The 16-bit chunk header contains the length of the chunk, and four flag '
        ' bits. The high order bit is a flag (0 = uncompressed, 1 = compressed), and '
        ' the next three bits must be 0b011. '
        ' '
        ' VBA really isn't the language for bit twiddling and I am not going to fully '
        ' explain the code; you'll have to trust me when I say that these statements '
        ' grab the desired bits and right align them! '
        ' '
        ' If the Chunk Signature does not have a value of 3 (0b011), the chunk is '
        ' invalid; the possibility of this is not considered in this routine. '
        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
        ChunkHeader = CompressedContainer(Compndx) + _
                      256& * CompressedContainer(Compndx + 1)
        Compndx = Compndx + 2
        ChunkSize = (ChunkHeader And &HFFF)
        ChunkEnd = Compndx + ChunkSize
        ChunkSignature = (ChunkHeader And &H7000) \ &H1000&
        ChunkFlag = (ChunkHeader And &H8000) \ &H8000&
        If ChunkFlag = 0 Then
            ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
            ' This just copies 4096 bytes from input to output. I would normally use the '
            ' RtlMoveMemory API, but prefer not to use it in demonstration code, so this '
            ' is a simple loop that copies a byte at a time. I have never seen a chunk '
            ' that is not compressed, so am not unduly concerned about the inefficiency. '
            ' I am - a little - concerned about what might happen when there are less '
            ' than 4096 bytes but the compression routine decides not to compress; the '
            ' documentation is silent on the issue so, maybe, it can't happen. '
            ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
            For ndx2 = 0 To 4095
                DeCompressedData(Decompndx + ndx2) = CompressedContainer(Compndx + ndx2)
            Next ndx2
            Compndx = Compndx + 4096
            Decompndx = Decompndx + 4096
        Else
            Do
                ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                ' The data in a chunk is a series of what are called Token Sequences. '
                ' Each Token Sequence consists of a byte to be viewed as eight separate '
                ' flag bits, followed by eight elements, the type of each being indicated '
                ' by the individual flag bits. If a flag bit is 0, the corresponding '
                ' element is a single byte to be taken 'as is'; if the flag bit is 1, the '
                ' element is a two byte code, which, after being unscrambled, gives the '
                ' position and length of a sequence earlier in the (decompressed) stream, '
                ' which must be copied. '
                ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                ' If the previous Token Sequence was complete and was the last in the
                ' chunk, there is no flag byte to read, so check before reading it.
                If Compndx > ChunkEnd Then Exit Do
                BitFlags = CompressedContainer(Compndx)
                Compndx = Compndx + 1
                For ndx = 0 To 7
                    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                    ' The final token sequence is not padded, and the chunk could end at '
                    ' any point. Loop control, therefore, is here, rather than at the end.'
                    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                    If Compndx > ChunkEnd Then Exit Do
                    If (BitFlags And PowerOf2(ndx)) = 0 Then
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                        ' A Literal Token: just copy the single-byte literal. '
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                        DeCompressedData(Decompndx) = CompressedContainer(Compndx)
                        Compndx = Compndx + 1
                        Decompndx = Decompndx + 1
                    Else
                        Token = CompressedContainer(Compndx) + _
                                CompressedContainer(Compndx + 1) * 256&
                        Compndx = Compndx + 2
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                        ' A 16-bit Token consists of an offset (to the left, from the current '
                        ' position in the decompressed data), and a length (the number of '
                        ' bytes to copy). The number of bits used for the offset (and, thus, '
                        ' those used for the length) is the smallest integer that is equal to '
                        ' or greater than the logarithm to base 2 of the length, so far, of '
                        ' the current decompressed chunk subject to it never being less than '
                        ' 4 or greater than 12. Rather than use logs, this little loop has '
                        ' the constraints built in and stops at the appropriate point. As '
                        ' each chunk (bar the last) is exactly 4096 bytes long, the length so '
                        ' far of the current decompressed chunk is as shown. '
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                        DecompLen = Decompndx Mod 4096
                        For BitCount = 4 To 11
                            If DecompLen <= PowerOf2(BitCount) Then Exit For
                        Next BitCount
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                        ' Having determined the number of bits dedicated to each component of '
                        ' the token, some bit twiddling is needed to extract the numbers. The '
                        ' offset first, then the length. No further explanation; work it out! '
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                        BitMask = PowerOf2(16) - PowerOf2(16 - BitCount)
                        CopyOffset = (Token And BitMask) \ PowerOf2(16 - BitCount) + 1
                        BitMask = PowerOf2(16 - BitCount) - 1
                        CopyLength = (Token And BitMask) + 3
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                        ' Given the offset and the length, the copy can be done. '
                        ' Note that the source and target may overlap. '
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                        For ndx2 = 0 To CopyLength - 1
                            DeCompressedData(Decompndx + ndx2) _
                                = DeCompressedData(Decompndx - CopyOffset + ndx2)
                        Next ndx2
                        Decompndx = Decompndx + CopyLength
                    End If ' Literal Token or Copy Token
                Next ' Token
            Loop ' For next Token Sequence
        End If ' Was chunk compressed?
        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
        ' If not yet at the end of the Stream, the assumption is that there is '
        ' another chunk: there is no possible information to the contrary. '
        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
        If Compndx > UBound(CompressedContainer) Then Exit Do
    Loop
    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    ' Only after having finished decompressing the final chunk is the final size '
    ' known. Now the output array can be correctly sized. '
    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    ReDim Preserve DeCompressedData(0 To Decompndx - 1)
End Sub
To run the code, there are a couple of minor changes required to the driver module. Firstly, two new variables must be declared, one an index into the stream, the other a container for the decompressed data. Although the declarations can go anywhere, I prefer to follow convention and place them at the start of their procedure, so change:
Dim Stream() As Byte
.. to this:
Dim Stream() As Byte
Dim DeCompressedData() As Byte
Dim Compndx As Long
Declarations in place, you just need two more lines to actually run the new routine. After the Stream = ExtractStream("dir") line, add a line to set the index variable to 1 (the position after the Signature 0x01 byte), and a call to the new routine:
Compndx = 1
Call DecompressContainer(Stream, Compndx, DeCompressedData)
If you do this, you won’t see anything dramatic but you will have decompressed the stream and will need to
read my next page to understand it. The next page (when it has been written!) will build on this code. To make
things as easy as possible for you, I have taken the “Arbitrary.docm” document as available for
downloading at the start of this article, added the extra code detailed here, and saved it as a file called
“Decompress.docm”. I have zipped this up with the same “vbaProject.bin” file as before,
and you can download it from here:
[link to the file on this site at files/DecompressSample.zip]
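If you would rather see the decompressed bytes than take them on trust, a small dump routine will print them to the Immediate window in the same layout as the hex listings on this page, without the text column. This is a sketch only: the names DumpRow and DumpBytes are my own, not part of the downloads, and the code assumes a byte array whose first index is zero.

```vba
Function DumpRow(ByRef Data() As Byte, ByVal Start As Long) As String
    ' Hypothetical helper: formats up to 16 bytes, beginning at Start,
    ' as a six-digit hex offset followed by two-digit hex values.
    Dim ndx As Long
    DumpRow = Right$("00000" & Hex$(Start), 6)
    For ndx = Start To Start + 15
        If ndx > UBound(Data) Then Exit For
        DumpRow = DumpRow & " " & Right$("0" & Hex$(Data(ndx)), 2)
    Next ndx
End Function

Sub DumpBytes(ByRef Data() As Byte)
    ' Prints the whole array to the Immediate window, 16 bytes per row.
    Dim Start As Long
    For Start = LBound(Data) To UBound(Data) Step 16
        Debug.Print DumpRow(Data, Start)
    Next Start
End Sub
```

To use it, add the line DumpBytes DeCompressedData immediately after the call to DecompressContainer and run the code with the Immediate window open.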