PowerPoint file parser

This module takes a PowerPoint file and parses it.
The parsed data is used later to create the Template data.

Unfortunately, we could not find an existing solution that met our needs, so we built our own parser.
The codebase started as a fork of an existing converter (PPTX to Sketch) from a previous project.
For these reasons, and because this is not the core of our product, the code is still of relatively low quality and quite tricky to work with.

First of all, a PowerPoint file (with the .pptx extension) is actually a ZIP archive.
So it can be explored by unzipping it and browsing the internal files.
I won’t go into full detail here, but generally:

  • most of the internal files are XML
  • each major chunk of content is contained within an internal file
  • internal files are grouped into internal directories (e.g. a directory for the layouts, one for the slides)
  • links between different chunks of content are handled in dedicated files with the extension .xml.rels (e.g. links between a slide and its layout, or between a slide and an image); each link is represented in the XML by an id that refers to an entry in the corresponding .rels file
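
To make the points above concrete, here is a minimal sketch of exploring a .pptx archive with the Python standard library. The helper names (`list_xml_parts`, `rels_path_for`, `read_relationships`) are made up for this illustration and are not part of our codebase. It assumes the usual package layout, where the relationships of `ppt/slides/slide1.xml` are stored in `ppt/slides/_rels/slide1.xml.rels`.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by relationship (.rels) files inside the package.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def list_xml_parts(pptx):
    """Return the names of all XML files in the archive (path or file object)."""
    with zipfile.ZipFile(pptx) as archive:
        return [name for name in archive.namelist() if name.endswith(".xml")]


def rels_path_for(part_name):
    """ppt/slides/slide1.xml -> ppt/slides/_rels/slide1.xml.rels"""
    directory, filename = posixpath.split(part_name)
    return posixpath.join(directory, "_rels", filename + ".rels")


def read_relationships(archive, part_name):
    """Map each relationship id of a part to (type, resolved target)."""
    rels_name = rels_path_for(part_name)
    if rels_name not in archive.namelist():
        return {}
    root = ET.fromstring(archive.read(rels_name))
    base_dir = posixpath.dirname(part_name)
    links = {}
    for rel in root.findall(REL_NS + "Relationship"):
        target = rel.get("Target")
        if rel.get("TargetMode") != "External":
            # Internal targets are relative to the directory of the source
            # part; normalise them to archive-style names without a leading "/".
            joined = posixpath.join(base_dir, target)
            target = posixpath.normpath(joined).lstrip("/")
        links[rel.get("Id")] = (rel.get("Type"), target)
    return links
```

With an open `zipfile.ZipFile`, calling `read_relationships(archive, "ppt/slides/slide1.xml")` would return a dictionary keyed by ids such as `rId1`, which are the same ids referenced from the slide XML.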

At a high level, there are three categories of slides in a PowerPoint file:

  • regular slide: what is actually presented to the audience, containing the actual content
  • layout slide: a special kind of slide that contains content placeholders that can be filled to create a slide
  • master slide: a special kind of slide that controls common properties and content shared by a set of layouts (e.g. common placeholders, page number)

These three types are hierarchically connected: a slide inherits from a layout, which inherits from a master.

In addition to these types of slides, there is also the theme: a set of properties controlling the general appearance (e.g. color palette, main fonts).
This theme is linked to the master, so all layouts inheriting from the same master will also inherit its theme.
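
The inheritance chain described above can be sketched as a walk over precomputed links. This is only an illustration: the `links` and `properties` mappings, the property name `title_font`, and both function names are hypothetical, and real PowerPoint inheritance (placeholder matching in particular) is considerably more involved than a simple fallback lookup.

```python
# Relationship type URIs used in .rels files for slides, layouts and masters.
REL_BASE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/"
SLIDE_LAYOUT = REL_BASE + "slideLayout"
SLIDE_MASTER = REL_BASE + "slideMaster"
THEME = REL_BASE + "theme"


def inheritance_chain(slide_part, links):
    """Return [slide, layout, master, theme], following one link per level.

    `links` is an assumed mapping: part name -> list of (type, target).
    """
    chain = [slide_part]
    current = slide_part
    # The order matters: a master also links to its layouts, so we follow
    # exactly one relationship type at each level.
    for rel_type in (SLIDE_LAYOUT, SLIDE_MASTER, THEME):
        target = next(
            (t for kind, t in links.get(current, []) if kind == rel_type), None
        )
        if target is None:
            break
        chain.append(target)
        current = target
    return chain


def resolve_property(chain, properties, key):
    """Return the first value found for `key`, nearest part first."""
    for part in chain:
        value = properties.get(part, {}).get(key)
        if value is not None:
            return value
    return None


example_links = {
    "ppt/slides/slide1.xml": [
        (SLIDE_LAYOUT, "ppt/slideLayouts/slideLayout2.xml"),
    ],
    "ppt/slideLayouts/slideLayout2.xml": [
        (SLIDE_MASTER, "ppt/slideMasters/slideMaster1.xml"),
    ],
    "ppt/slideMasters/slideMaster1.xml": [
        (SLIDE_LAYOUT, "ppt/slideLayouts/slideLayout2.xml"),
        (THEME, "ppt/theme/theme1.xml"),
    ],
}
example_chain = inheritance_chain("ppt/slides/slide1.xml", example_links)
```

Here `example_chain` lists the slide, its layout, the master, and the theme, in that order. A value missing on the slide is then looked up on the layout, then on the master, which is the behaviour the hierarchy implies.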

Since our target data structure is much simpler than a PowerPoint file, one of the core goals of the process is to extract and package only the information we need from the file.
To do this, we follow this high-level series of steps:

  • receive a .pptx file as input
  • unzip the file
  • load all the chunks that we need into memory
  • compute the links between the different chunks (used later in the process)
  • loop over masters
    • loop over layouts
      • loop over slides
        • loop over shapes -> transform each shape into our internal data structure
  • output a ParsedTemplate JSON data structure containing all slides and layouts
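
The nested loops in the steps above could look roughly like the following sketch. It starts from chunks that are already loaded and linked, and every field name here (`name`, `type`, `text`, `layouts`, `slides`, `shapes`) is an assumption made for the example; the actual ParsedTemplate structure is not shown in this document and will differ.

```python
import json


def transform_shape(raw_shape):
    """Keep only the fields our (hypothetical) internal structure needs."""
    return {
        "name": raw_shape["name"],
        "type": raw_shape["type"],
        "text": raw_shape.get("text", ""),
    }


def build_parsed_template(masters):
    """Walk masters -> layouts -> slides -> shapes and collect the result."""
    parsed = {"layouts": [], "slides": []}
    for master in masters:
        for layout in master["layouts"]:
            parsed["layouts"].append(
                {"name": layout["name"], "master": master["name"]}
            )
            for slide in layout["slides"]:
                parsed["slides"].append(
                    {
                        "name": slide["name"],
                        "layout": layout["name"],
                        "shapes": [transform_shape(s) for s in slide["shapes"]],
                    }
                )
    return parsed


# Made-up input standing in for the chunks loaded from the archive.
example_masters = [
    {
        "name": "slideMaster1",
        "layouts": [
            {
                "name": "slideLayout1",
                "slides": [
                    {
                        "name": "slide1",
                        "shapes": [
                            {"name": "Title 1", "type": "text", "text": "Hello"},
                            {"name": "Picture 2", "type": "image", "extra": 1},
                        ],
                    }
                ],
            }
        ],
    }
]

print(json.dumps(build_parsed_template(example_masters), indent=2))
```

Grouping slides under their layout, and layouts under their master, means each shape is transformed with its full inheritance context at hand, which is presumably why the loops are nested in this order.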