I know there is a standard .NET zip library, but it is large and not easy to integrate if you want to include its source code in your own projects. So I decided to write a small Zip library, only a few thousand lines of code, that is easy to import into any project.
Currently I'm using the deflate algorithm shipped with the .NET Framework, since my own implementation is slow compared to the one written by Microsoft. Though nearly every .NET platform has a deflate implementation, I still want to play around. Yes, because I'm a geek.
According to the plan, I will publish the beta version of the Zip library during the Xmas holiday, since I have an exam session in the following two weeks.
This is a beta version of Office File Extracter with a GUI. Currently it is only supported on Windows with .NET 4.0 installed. I will move the main library to Mono and redesign the GUI in GTK#, so that users on Linux and Mac can also use it without resorting to console mode. Yes, the GUI is designed in WPF, and that's why I need to redesign it if I migrate to another platform.
1) Microsoft Office File Support List:
a) PPTX. PowerPoint File.
b) DOCX. Word File.
2) Media File Extraction
Currently it can only extract media files from PPTX. DOCX support is coming. Video file extraction is being tested and will be verified in the next version.
3) Text Extraction
It can extract text and output a txt file for reading and copying.
1. I tried my best to interpret the paragraph settings in OpenXML, yet the behavior is out of my control, i.e., the page rendering in the program will not be the same as in Microsoft Office Word. I will keep working on it.
Beta version is released.
You can download from here or from my Download page.
To my best friend, Henry.
Starting with Microsoft Office 2007, Microsoft changed the default PowerPoint file format from a purely binary file to Open XML and, most importantly, it is a zipped file. With the help of the traditional deflate compression algorithm, you can do whatever you want with PPTX.
That being said, it is not so easy to handle such a detailed file format by myself; it took thousands of software engineers working for several years to finish the file format and rendering process.
However, the difficulty doesn't mean I can't play with this interesting format. Actually, you can do something besides rendering the slides.
As I said before, pptx as well as docx is compressed into a standard zip file, with xml files controlling the contents. Consequently, it is possible to extract text and media from pptx and docx without rendering them or parsing them entirely into memory. In addition, the media files for pptx are in the media folder. You can use any popular Zip library to implement the idea. Here I decided to write my own simple zip library to do the job.
The PKZip format consists of three main parts, namely, file header entries, central directory header entries, and the end of central directory. A detailed explanation can be found at PKWare. I will upload my own library here after I finish all the basic implementation for a zip library.
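A minimal sketch of the idea, written in Python with the standard `zipfile` module rather than with the C# library described here. The function names `list_media` and `slide_text` are made up for this example, and it assumes that slides live under `ppt/slides/` and that text runs are stored in DrawingML `a:t` elements.

```python
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used by the <a:t> text-run elements in slide XML.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"

def list_media(archive):
    """Return the names of entries stored under a 'media' folder (e.g. ppt/media/)."""
    with zipfile.ZipFile(archive) as zf:
        return [n for n in zf.namelist()
                if "/media/" in n and not n.endswith("/")]

def slide_text(archive):
    """Collect the text of every <a:t> run from each slide part, in name order."""
    runs = []
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if name.startswith("ppt/slides/") and name.endswith(".xml"):
                root = ET.fromstring(zf.read(name))
                runs.extend(t.text or "" for t in root.iter("{%s}t" % A_NS))
    return runs
```

Both functions accept a path or a file-like object. Only the parts that are actually read get decompressed, so nothing is rendered and the rest of the package is never touched.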
The pseudocode for my library is something like below:
If the last 2 bytes (the comment length field) are 0 // there is no comment; this is sufficient for extracting media from pptx
    Read the end of central directory from stream position -22 // its length is 22 bytes without a comment
    Go to the position that the end of central directory points to.
    Read the central directory. If the header is wrong, break and use the traditional approach to read the Zip file.
Else // there is a comment
    Read 0xFF bytes from the end of the stream and examine them one by one, from bottom to top, to find the end of central directory.
The traditional way to read entry headers // this way we don't need the central directory headers
    Read the stream from the beginning; after reading an entry's file name and extra field, jump to the next header by adding the compressed size to the position.
    If the crc32, compressed size, and uncompressed size are zero /* I don't know how to deal with this quickly. The problem is that Google Docs generates pptx this way. I guess it makes the stream easy to write. */
        Search for the signature of the data descriptor by examining each byte following the header.
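The first half of the pseudocode (locating the end of central directory) could look roughly like this in Python. This is only a sketch, not the library's code: the function name `find_central_directory` is hypothetical, the whole file is assumed to be in memory as `bytes`, and the fallback scans the full 0xFFFF-byte range a comment may occupy. Zip64 archives are not handled.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"  # 0x06054b50 as stored on disk (little-endian)
EOCD_SIZE = 22                  # fixed part of the record, without the comment

def find_central_directory(data):
    """Return (entry_count, cd_size, cd_offset) from the end of central directory."""
    if len(data) < EOCD_SIZE:
        raise ValueError("too short to be a zip file")
    # Fast path: no archive comment, so the record is exactly the last 22 bytes.
    pos = len(data) - EOCD_SIZE
    if data[pos:pos + 4] != EOCD_SIGNATURE:
        # Slow path: search backwards through the area a comment could cover.
        lowest = max(0, len(data) - EOCD_SIZE - 0xFFFF)
        pos = data.rfind(EOCD_SIGNATURE, lowest, len(data) - EOCD_SIZE + 4)
        if pos < 0:
            raise ValueError("end of central directory not found")
    # Layout: signature, disk number, disk with the directory, entries on this
    # disk, total entries, directory size, directory offset, comment length.
    (_sig, _disk, _cd_disk, _entries_here, entries,
     cd_size, cd_offset, _comment_len) = struct.unpack_from("<4s4H2LH", data, pos)
    return entries, cd_size, cd_offset
```

The returned offset is where the central directory headers start; each of them begins with the signature bytes `PK\x01\x02`.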
DXT1, aka BC1, is one of the compressed texture formats stored in DDS files and used by DirectX. It is optimised for loading into memory and rendering, so the decompression is straightforward.
The first four bytes are the magic number, namely 0x20534444 ("DDS " in ASCII). Then comes the header, whose length is 124 bytes. Consequently, before reading the actual image data, you need to read 128 bytes to verify the data and read the width, height, and flags from the header. More header details can be found at MSDN.
DXT1, as I said, is a simple image compression that uses the R5G6B5 format to store 3-byte colors in 2 bytes (it's said that human eyes are more sensitive to green, so green gets one more bit to be more accurate). In DirectX 11, DXT1 is also called BC1 (Block Compression 1) because the image is split into many small 4x4 blocks.
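A small Python sketch of that 128-byte read, assuming the file has already been loaded into a `bytes` object. The function name `read_dds_header` and the returned dictionary are made up for illustration. The field offsets follow the documented DDS header layout, where the height comes before the width and the FourCC of the pixel format sits at byte 84 of the file.

```python
import struct

DDS_MAGIC = 0x20534444  # the ASCII bytes "DDS " read as a little-endian uint32
DDS_HEADER_SIZE = 124   # value expected in the header's own size field

def read_dds_header(data):
    """Validate the first 128 bytes and return width, height, flags and FourCC."""
    if len(data) < 4 + DDS_HEADER_SIZE:
        raise ValueError("file too short for a DDS header")
    # Magic number, header size, flags, height, width (in that order).
    magic, size, flags, height, width = struct.unpack_from("<5I", data, 0)
    if magic != DDS_MAGIC or size != DDS_HEADER_SIZE:
        raise ValueError("not a DDS file")
    fourcc = data[84:88]  # b"DXT1" for the format discussed here
    return {"width": width, "height": height, "flags": flags, "fourcc": fourcc}
```

The compressed image data starts right after these 128 bytes.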
The original length of a 4x4 block is 3*16 = 48 bytes (in DXT1 alpha is usually excluded). After compression, the length becomes 2*2 + 16*2/8 = 8 bytes. That is two R5G6B5 colors, c0 and c3, and a lookup table that uses two bits per pixel to indicate which color to fill in. The other two colors, c1 and c2, are generated linearly from c0 and c3, channel by channel. If c0 > c3 (compared as 16-bit values), then c1 = 2*c0/3 + c3/3 and c2 = c0/3 + 2*c3/3; otherwise c1 = c0/2 + c3/2 and c2 is transparent black.
The order is row by row, left to right first and then top to bottom, both for the pixels within a block and for the blocks within the image.
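To make the block layout concrete, here is a hedged sketch of a DXT1 decoder in Python (not the C# Bitmap code of this post). The function names are invented for the example, the two stored endpoint colors are called `color0` and `color1` as in the file layout, and the integer rounding is one reasonable choice that other decoders may not match exactly.

```python
import struct

def rgb565_to_rgb888(c):
    """Expand a packed R5G6B5 value to an (r, g, b) tuple of 8-bit channels."""
    r, g, b = (c >> 11) & 0x1F, (c >> 5) & 0x3F, c & 0x1F
    # Replicate the high bits into the low bits so that 31 maps to 255.
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))

def decode_dxt1_block(block):
    """Decode one 8-byte block into 16 (r, g, b, a) tuples, row by row."""
    color0, color1, indices = struct.unpack("<HHI", block)
    p0, p1 = rgb565_to_rgb888(color0), rgb565_to_rgb888(color1)
    if color0 > color1:
        # Four opaque colors: the endpoints plus two interpolated ones.
        p2 = tuple((2 * a + b) // 3 for a, b in zip(p0, p1))
        p3 = tuple((a + 2 * b) // 3 for a, b in zip(p0, p1))
        palette = [p0 + (255,), p1 + (255,), p2 + (255,), p3 + (255,)]
    else:
        # Three colors plus transparent black.
        p2 = tuple((a + b) // 2 for a, b in zip(p0, p1))
        palette = [p0 + (255,), p1 + (255,), p2 + (255,), (0, 0, 0, 0)]
    # Two bits per pixel, lowest bits first, starting at the top-left pixel.
    return [palette[(indices >> (2 * i)) & 3] for i in range(16)]

def decode_dxt1(data, width, height):
    """Decode a DXT1 payload into a list of rows of (r, g, b, a) tuples."""
    blocks_x = (width + 3) // 4
    blocks_y = (height + 3) // 4
    rows = [[None] * width for _ in range(height)]
    for by in range(blocks_y):
        for bx in range(blocks_x):
            offset = (by * blocks_x + bx) * 8
            texels = decode_dxt1_block(data[offset:offset + 8])
            for i, pixel in enumerate(texels):
                x, y = bx * 4 + i % 4, by * 4 + i // 4
                if x < width and y < height:  # skip padding in edge blocks
                    rows[y][x] = pixel
    return rows
```

Here `data` is the payload that follows the 128-byte DDS header, and `width` and `height` come from that header.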
Here is a code example in C# showing how to read it and render it into a Bitmap quickly.
Read DXT1 Image:
DXT1 decompression was finished yesterday with good quality, yet the compression algorithm is still at an early stage. I read several papers about matrix data compression and found vector quantization promising.
I will continue to work on this project and hopefully it will be published at the beginning of next year.