That being said, it is not easy to manipulate such a detailed file format by myself; it took thousands of software engineers several years of work to finish the file format and the rendering process.
However, that difficulty doesn't mean I cannot play with this interesting format. Actually, you can do quite a lot besides rendering the slides.
As I said before, pptx, like docx, is a standard zip file with XML files controlling the contents. Consequently, it is possible to extract text and media from pptx and docx without rendering them or parsing them entirely into memory. In addition, the media files for pptx are in the media folder (ppt/media/ inside the archive). You can use any popular zip library to implement the idea. Here I decided to write my own simple zip library to do the job.
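Before writing a custom reader, the idea can be tried with an existing zip library. The following is a minimal sketch using Python's standard zipfile module, not the library described in this post. The helper names (list_media, extract_media, slide_text) are made up for illustration, and it assumes slide parts are named ppt/slides/slideN.xml and that visible text lives in DrawingML <a:t> elements.

```python
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; text runs in a slide are stored in <a:t> elements.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def list_media(pptx):
    """Return the archive member names under ppt/media/."""
    with zipfile.ZipFile(pptx) as zf:
        return [n for n in zf.namelist()
                if n.startswith("ppt/media/") and not n.endswith("/")]


def extract_media(pptx, out_dir):
    """Extract only the media members into out_dir and return their names."""
    names = list_media(pptx)
    with zipfile.ZipFile(pptx) as zf:
        zf.extractall(out_dir, members=names)
    return names


def slide_text(pptx):
    """Map each slide part name to the list of its text runs."""
    result = {}
    with zipfile.ZipFile(pptx) as zf:
        for name in zf.namelist():
            # Assumption: slide parts follow the usual slideN.xml naming.
            if name.startswith("ppt/slides/slide") and name.endswith(".xml"):
                with zf.open(name) as part:
                    # iterparse streams the XML instead of building the full tree first.
                    result[name] = [el.text or "" for _, el in ET.iterparse(part)
                                    if el.tag == A_NS + "t"]
    return result
```

Only the requested members are decompressed, so nothing is rendered and the other parts of the archive are never read. For docx the media folder would be word/media/ instead.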
The PKZip format consists of three main parts, namely, local file header entries (each followed by the file data), central directory header entries, and the end of central directory record. A detailed explanation can be found in PKWARE's APPNOTE.TXT specification. I will upload my own library here after I finish all the basic implementation for a zip library.
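To make the three parts concrete, here is a small sketch of their fixed-size portions written as Python struct layouts, with a parser for the end of central directory record. The names (LOCAL_HEADER, CENTRAL_HEADER, END_OF_CENTRAL_DIR, Eocd, parse_eocd) are only for illustration. It assumes a plain archive: no zip64, no multi-disk, and no trailing comment.

```python
import struct
from collections import namedtuple

# Fixed-size parts of the three record types (little-endian, no padding).
# Local file header: 30 bytes, followed by file name, extra field, file data.
LOCAL_HEADER = struct.Struct("<4s5H3L2H")
# Central directory header: 46 bytes, followed by file name, extra field, comment.
CENTRAL_HEADER = struct.Struct("<4s6H3L5H2L")
# End of central directory: 22 bytes, followed by an optional archive comment.
END_OF_CENTRAL_DIR = struct.Struct("<4s4H2LH")

Eocd = namedtuple(
    "Eocd",
    "signature disk cd_disk entries_on_disk entries cd_size cd_offset comment_len",
)


def parse_eocd(tail):
    """Parse the last 22 bytes of an archive that has no trailing comment."""
    record = Eocd._make(END_OF_CENTRAL_DIR.unpack(tail))
    if record.signature != b"PK\x05\x06":
        raise ValueError("not an end of central directory record")
    return record
```

Given the raw bytes of such an archive in a variable data, parse_eocd(data[-22:]) returns the entry count and the offset at which the central directory starts.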
The pseudocode for my library looks something like this:
If the last 2 bytes are 0 // the comment length field is zero, so there’s no comment. It’s sufficient for extracting media from pptx
    Read the header from stream position -22 // the end of central directory record is 22 bytes long without a comment.
    Go to the position that the end of central directory record points to.
    Read the central directory. If a header is wrong, break and use the traditional approach to read the zip file.
Else // there is a comment
    Read up to 0xFFFF + 22 bytes from the end of the stream (bottom to top) and examine the bytes one by one until the end of central directory signature (0x06054b50) is found.
The traditional way to read entry headers // in this way we don’t need the central directory header.
    Read the stream from the beginning; after reading the file name and extra field, jump to the next header by adding the compressed size to the position.
    If the crc32, compressed size, and uncompressed size are zero /* I don’t know how to deal with it quickly. The problem is that Google Docs generates pptx this way: bit 3 of the general purpose flag is set and the real values follow the file data in a data descriptor. I guess it makes the stream easy to write. */
        Search for the signature of the data descriptor (0x08074b50) by examining each byte following the header.
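The pseudocode above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the finished library: all function names are made up, zip64 and multi-disk archives are ignored, names are decoded as UTF-8, and the fallback reads the whole archive into memory for simplicity. The data descriptor search relies on the writer emitting the optional signature, which the specification does not require.

```python
import io
import struct

EOCD_SIG = b"PK\x05\x06"  # end of central directory record (0x06054b50)
CDH_SIG = b"PK\x01\x02"   # central directory header
LFH_SIG = b"PK\x03\x04"   # local file header
DD_SIG = b"PK\x07\x08"    # data descriptor (0x08074b50, optional in the spec)
EOCD_LEN = 22


def find_eocd(stream):
    """Return the 22 fixed bytes of the end of central directory record."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    if size < EOCD_LEN:
        raise ValueError("too short to be a zip archive")
    # Fast path: no archive comment, so the record is the last 22 bytes.
    stream.seek(size - EOCD_LEN)
    tail = stream.read(EOCD_LEN)
    if tail[:4] == EOCD_SIG and tail[20:] == b"\x00\x00":
        return tail
    # Slow path: a comment can be up to 0xFFFF bytes, so scan that window backwards.
    window = min(size, EOCD_LEN + 0xFFFF)
    stream.seek(size - window)
    buf = stream.read(window)
    pos = buf.rfind(EOCD_SIG)
    while pos >= 0:
        if pos + EOCD_LEN <= len(buf):
            (comment_len,) = struct.unpack_from("<H", buf, pos + 20)
            # Accept the candidate only if its comment runs exactly to the end.
            if pos + EOCD_LEN + comment_len == len(buf):
                return buf[pos:pos + EOCD_LEN]
        pos = buf.rfind(EOCD_SIG, 0, pos)
    raise ValueError("end of central directory not found")


def names_from_central_directory(stream):
    """List entry names by following the end of central directory record."""
    total, cd_size, cd_offset = struct.unpack_from("<HLL", find_eocd(stream), 10)
    stream.seek(cd_offset)
    cd = stream.read(cd_size)
    names, pos = [], 0
    for _ in range(total):
        if cd[pos:pos + 4] != CDH_SIG:
            raise ValueError("bad central directory header")
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", cd, pos + 28)
        # Assumption: names are UTF-8 (or plain ASCII), as in typical pptx files.
        names.append(cd[pos + 46:pos + 46 + name_len].decode("utf-8", "replace"))
        pos += 46 + name_len + extra_len + comment_len
    return names


def names_from_local_headers(stream):
    """Traditional approach: walk the local file headers from the start."""
    stream.seek(0)
    data = stream.read()  # simplification: a real reader would buffer instead
    names, pos = [], 0
    while data[pos:pos + 4] == LFH_SIG and pos + 30 <= len(data):
        (flags,) = struct.unpack_from("<H", data, pos + 6)
        (comp_size,) = struct.unpack_from("<L", data, pos + 18)
        name_len, extra_len = struct.unpack_from("<HH", data, pos + 26)
        names.append(data[pos + 30:pos + 30 + name_len].decode("utf-8", "replace"))
        body = pos + 30 + name_len + extra_len
        if flags & 0x08:
            # Sizes in the header are zero; search for a data descriptor whose
            # compressed-size field equals the number of bytes skipped so far.
            dd = data.find(DD_SIG, body)
            while dd >= 0:
                if (dd + 16 <= len(data)
                        and struct.unpack_from("<L", data, dd + 8)[0] == dd - body):
                    break
                dd = data.find(DD_SIG, dd + 1)
            if dd < 0:
                break  # no usable descriptor; stop rather than guess
            pos = dd + 16
        else:
            pos = body + comp_size
    return names


def list_names(stream):
    """Try the central directory first, then fall back to local headers."""
    try:
        return names_from_central_directory(stream)
    except (ValueError, struct.error):
        return names_from_local_headers(stream)
```

Checking the candidate descriptor's compressed size against the distance skipped cuts down on false matches when the signature bytes happen to appear inside the compressed data, though it remains a heuristic.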