I know a standard .NET zip library exists, but it is large and not easy to integrate if you want to include its source code in your own projects. So I decided to write a small Zip library of only a few thousand lines of code that is easy to import into any project.
Currently I'm using the deflate implementation shipped with the .NET Framework, since my own implementation is slow compared to the one written by Microsoft. Although nearly every .NET platform already has a deflate implementation, I still want to play around with my own. Yes, because I'm a geek.
According to the plan, I will publish the beta version of the Zip library during the Xmas holiday, since I have exams over the next two weeks.
Starting with Microsoft Office 2007, Microsoft changed the default PowerPoint file format from a purely binary file to Open XML and, most importantly, the new format is a zipped file. With the help of the traditional deflate compression algorithm, you can do whatever you want with a PPTX.
That being said, it is not easy to handle such a detailed file format by myself; it takes thousands of software engineers working for several years to finish the file format and the rendering process.
However, the difficulty doesn't mean I can't play with this interesting format. Actually, you can do things other than rendering the slides.
As I said before, pptx, like docx, is a standard zip file with XML files controlling the contents. Consequently, it is possible to extract text and media from pptx and docx files without rendering them or parsing them entirely into memory. In addition, the media files for a pptx are in the ppt/media folder. You can use any popular Zip library to implement the idea. Here I decided to write my own simple zip library to do the job.
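To make the idea concrete, here is a minimal sketch in Python rather than .NET, using the standard zipfile module instead of my own library. The helper names (list_media, read_media, make_fake_pptx) are made up for illustration, and the sketch assumes the media entries live under ppt/media/ as in a typical pptx. The fake archive only mimics the folder layout; it is not a valid presentation.

```python
import io
import zipfile


def list_media(archive, prefix="ppt/media/"):
    """Return the names of the entries stored under the media folder."""
    with zipfile.ZipFile(archive) as zf:
        return [name for name in zf.namelist()
                if name.startswith(prefix) and not name.endswith("/")]


def read_media(archive, prefix="ppt/media/"):
    """Return {entry name: raw bytes} for every media entry."""
    with zipfile.ZipFile(archive) as zf:
        return {name: zf.read(name) for name in zf.namelist()
                if name.startswith(prefix) and not name.endswith("/")}


def make_fake_pptx():
    """Build a tiny in-memory archive that only mimics the pptx folder layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("ppt/slides/slide1.xml", "<p:sld/>")
        zf.writestr("ppt/media/image1.png", b"\x89PNG fake bytes")
    buf.seek(0)
    return buf
```

Only the entries that match the prefix are decompressed, so nothing is rendered and the slide XML is never parsed.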
The PKZip format consists of three main parts, namely the local file header entries, the central directory header entries, and the end of central directory record. A detailed explanation can be found at PKWARE (the APPNOTE.TXT specification). I will upload my own library here after I finish all the basic pieces of a zip library.
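A small sketch of where the three parts sit in a file, again in Python for illustration and not taken from my library. It assumes a simple archive with no comment, no ZIP64 extensions and a single disk; parse_eocd and make_sample_zip are hypothetical helper names.

```python
import io
import struct
import zipfile

LOCAL_HEADER_SIG = b"PK\x03\x04"  # starts every file entry
CENTRAL_DIR_SIG = b"PK\x01\x02"   # starts every central directory entry
EOCD_SIG = b"PK\x05\x06"          # end of central directory record
EOCD_FORMAT = "<4s4H2LH"          # 22 bytes when there is no archive comment


def parse_eocd(record):
    """Unpack a 22-byte end of central directory record into a dict."""
    (sig, _disk, _cd_disk, _entries_on_disk, entries,
     cd_size, cd_offset, comment_len) = struct.unpack(EOCD_FORMAT, record)
    if sig != EOCD_SIG:
        raise ValueError("not an end of central directory record")
    return {"entries": entries, "cd_size": cd_size,
            "cd_offset": cd_offset, "comment_len": comment_len}


def make_sample_zip():
    """Build a two-entry archive in memory with the standard zipfile module."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.txt", "hello")
        zf.writestr("b.txt", "world")
    return buf.getvalue()
```

For such an archive the first local file header is at offset 0, the central directory starts at cd_offset, and the end of central directory record is the last 22 bytes, so parse_eocd(data[-22:]) is enough to find everything else.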
The pseudocode for my library is something like the following:
If the last 2 bytes are 0 // the comment length is zero, so there's no comment. It's sufficient for extracting media from pptx.
    Read the header from stream position -22 // the end of central directory is 22 bytes long without a comment.
    Go to the position that the end of central directory points to.
    Read the central directory. If a header is wrong, break and use the traditional approach to read the Zip file.
Else // there is a comment
    Read up to 0xFFFF + 22 bytes from the end of the stream (bottom to top) and examine the bytes one by one for the end of central directory signature.
The traditional way to read entry headers // this way we don't need the central directory headers.
    Read the stream from the beginning; after reading the file name and extra field of an entry, jump to the next header by adding the compressed size to the position.
    If the crc32, compressed size, and uncompressed size are all zero /* I don't know how to deal with this quickly, but Google Docs generates pptx files this way. I guess it makes the archive easy to write as a stream. */
        Search for the signature of the data descriptor by examining each byte following the header.
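The pseudocode above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the code of my .NET library: the whole archive is held in memory as bytes instead of being read from a stream, there is no ZIP64 and only one disk, names are decoded as UTF-8, and the data descriptor is assumed to carry its optional signature with 32-bit sizes. The function names are hypothetical.

```python
import struct

EOCD_SIG = b"PK\x05\x06"        # end of central directory
CD_SIG = b"PK\x01\x02"          # central directory header
LOCAL_SIG = b"PK\x03\x04"       # local file header
DESCRIPTOR_SIG = b"PK\x07\x08"  # data descriptor (signature assumed present)
EOCD_SIZE = 22                  # size of the record without a comment


def find_eocd(data):
    """Return the offset of the end of central directory record, or -1."""
    fast = len(data) - EOCD_SIZE
    # Fast path: no comment, so the record is exactly the last 22 bytes.
    if fast >= 0 and data[fast:fast + 4] == EOCD_SIG and data[-2:] == b"\x00\x00":
        return fast
    # Slow path: a comment is at most 0xFFFF bytes, so only that tail is searched,
    # bottom to top. A hit counts only if its comment length reaches the end.
    lowest = max(0, fast - 0xFFFF)
    pos = data.rfind(EOCD_SIG, lowest)
    while pos != -1:
        if pos + EOCD_SIZE <= len(data):
            (comment_len,) = struct.unpack_from("<H", data, pos + 20)
            if pos + EOCD_SIZE + comment_len == len(data):
                return pos
        pos = data.rfind(EOCD_SIG, lowest, pos + 3)
    return -1


def read_central_directory(data):
    """Return [(name, local_header_offset, compressed_size)] from the central directory."""
    eocd = find_eocd(data)
    if eocd < 0:
        raise ValueError("end of central directory not found")
    total, _cd_size, cd_offset = struct.unpack_from("<HLL", data, eocd + 10)
    entries, pos = [], cd_offset
    for _ in range(total):
        if data[pos:pos + 4] != CD_SIG:
            raise ValueError("bad central directory header")
        (comp_size,) = struct.unpack_from("<L", data, pos + 20)
        name_len, extra_len, comment_len = struct.unpack_from("<3H", data, pos + 28)
        (local_offset,) = struct.unpack_from("<L", data, pos + 42)
        name = data[pos + 46:pos + 46 + name_len].decode("utf-8", "replace")
        entries.append((name, local_offset, comp_size))
        pos += 46 + name_len + extra_len + comment_len
    return entries


def walk_local_headers(data):
    """Return [(name, data_offset, compressed_size)] by reading local headers in order."""
    entries, pos = [], 0
    while data[pos:pos + 4] == LOCAL_SIG and pos + 30 <= len(data):
        (flags,) = struct.unpack_from("<H", data, pos + 6)
        (comp_size,) = struct.unpack_from("<L", data, pos + 18)
        name_len, extra_len = struct.unpack_from("<2H", data, pos + 26)
        name = data[pos + 30:pos + 30 + name_len].decode("utf-8", "replace")
        start = pos + 30 + name_len + extra_len
        if flags & 0x08:
            # Sizes live in a data descriptor after the data: scan for its signature
            # and accept the first hit whose stored size matches the distance covered.
            end = data.find(DESCRIPTOR_SIG, start)
            while end != -1 and end + 16 <= len(data):
                (comp_size,) = struct.unpack_from("<L", data, end + 8)
                if comp_size == end - start:
                    break
                end = data.find(DESCRIPTOR_SIG, end + 1)
            else:
                raise ValueError("data descriptor not found for " + name)
            pos = end + 16
        else:
            pos = start + comp_size
        entries.append((name, start, comp_size))
    return entries
```

The sketch tests bit 3 of the general purpose flags (0x08) rather than the zeroed size fields, since that bit is what marks an entry whose sizes follow the data. Checking the stored compressed size against the distance scanned guards against a descriptor signature that happens to appear inside the compressed data, at the cost of still scanning byte by byte.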