Lamdas, zip files and streams

Published on August 20, 2021 • 3 min read

While automating our deployments via AWS CodePipeline, we needed a way to get the zip file, extract it and upload it to an S3 Bucket that serves our static files.

The first version I came up with was using a lambda to unzip the file using streams. Use the S3.getObject method to get a read stream for the zip and pipe the stream to node-unzipper. For every file node-unzipper emits, upload it into the S3 bucket with the S3.upload method (which accepts a read-stream).

index.js

const AWS = require('aws-sdk')
const unzipper = require('unzipper')

const S3 = new AWS.S3()

exports.handler = async () => {
  const source = //source bucket
  const destination = //destination bucket

  let promises = []

  const readStream = S3.getObject({
    Key: zipFileName,
    Bucket: source
  }).createReadStream()

  await readStream
    .pipe(unzipper.Parse({ forceStream: true }))


  for await (const entry of zip) {
    const type = entry.type

    if (type === 'File') {
      promises.push(S3.upload({
        Bucket: destination,
        Key: entry.fileName,
        Body: entry,
      }).promise())
    } else {
      entry.autodrain()
    }
  }

  console.log(`Found ${promises.length} files`)

  await Promise.all(promises)
}

I like this approach. The good thing about streams is that you don't load the entire file into memory before unzipping it. It also means that the default 128M is enough for the lambda to run. So low memory footprint, no hacky usage of tmp/, and easy to read code.

Huge letdown though, the lambda always stopped running after a few files were extracted. I'm not entirely sure if this is a bug within lambda that somehow leads it to think that the entire event-loop is flushed. I'm leaning towards it being a case where it's impossible to extract the zip file without reading it in its entirety. Zip files aren't suited to be unzipped via streams because of the reasons outlined here.

The final version I ended up with uses a buffer to store the entire zip file in memory, then uses yauzl to get a read-stream of every extracted file. I did have to increase the memory of the lambda to 256MB though.

index.js

const yauzl = require('yauzl')
const mime = require('mime-types')
const { promisify } = require('util')

const fromBuffer = promisify(yauzl.fromBuffer)
const S3 = new AWS.S3()

exports.handler = async () => {
  const source = //source bucket
  const destination = //destination bucket

  const { Body: content } = await S3.getObject({
    Key: zipFileName,
    Bucket: source
  }).promise()

  const zip = await fromBuffer(content)

  return new Promise((resolve, reject) => {
    let promises = []

    zip
      .on('entry', async (entry) => {
        // Directory file names end with '/' so we skip those.
        if (!/\/$/.test(entry.fileName)) {
          const openReadStream = promisify(zip.openReadStream.bind(zip))

          const stream = await openReadStream(entry)
          promises.push(uploadFile(entry, stream))
        }
      })
      .on('end', async () => {
        console.log(`Found ${promises.length} files`)

        await Promise.all(promises)
        resolve()
      })
      .on('error', reject)
  })
}

Harris Jose

Software Engineer at Chronicle HQ.

Contact

linkedin.com/in/harrisjose

Twitter

@harrispjose

Github

github.com/harrisjose

Now

Whipping up a text editor and thinking about presentations on the web. Getting my hands dirty with Typescript and Rust.

Not Playing