Author API Profiling & Optimisation

Table of Contents

1 Issue

GraphQL responses take a long time for larger surveys. Loading the current draft of the Low Carbon Survey takes anywhere in the range of 500ms to over 1s for each response to resolve:

network_preopt.png

2 First culprit - AJV

Disabling AJV entirely results in response times resolving in approx. 100ms:

network_noajv.png

2.1 Performance analysis of AJV’s validate call

2.1.1 Benchmarking of execution time as survey size increases

To determine the contribution of AJV itself to the problem, the validate function call was called with a varying size questionnaire based on the Low Carbon Survey draft.

The final section of the survey was duplicated to take the questionnaire to 25 sections, then the benchmark was ran and sections deleted to obtain the following time-by-section-count graph:

Benchmark pre-optimisation.png

As can be seen, AJV itself - even with our custom keywords - is only responsible for approximately 50ms of the response’s total execution time. The execution time of the validate call increases linearly with survey size.

2.1.2 Flame chart

Below is a flame chart showing where CPU time is spent during the validate(questionnaire) call in src/validation/index.js:

flame_chart.png

This shows that the majority of CPU time is spent in the cheerio dependency, followed by the getPreviousAnswersForPage utility function.

2.2 Optimising custom keywords

Disabling custom keywords entirely resulted in AJV validate times in the order of 1 - 2 ms for the 18 section draft LCS, showing that our custom keyword code was significantly slowing down execution.

Searching for getPreviousAnswersForPage in the custom keywords files showed that it was called by the validatePipingInTitle custom keyword. This file also happened to be the source of the calls to cheerio.

2.2.1 Optimising validatePipingInTitle

validatePipingInTitle used the cheerio library to parse HTML in the question title - and as can be seen by the flame graph this was a very resource-intensive process. The HTML was parsed in order to find the included piped answer IDs - if any.

      const $ = cheerio.load(entityData);
      const pipedIdList = [];

      $("p")
        .find("span")
        .each(function(index, element) {
          pipedIdList.push($(element).data());
        });

Replacing this approach with a regular expression maintained the original behaviour - piped answers are still detected to be deleted in titles - but avoided the need to use the cheerio library to parse HTML for every question in the questionnaire.

      //... in file scope
      const pipedAnswerIdRegex = /data-piped="answers" data-id="(.+?)"/gm;

      //... within custom validation function
      const pipedIdList = [];

      let matches;
      do {
        matches = pipedAnswerIdRegex.exec(entityData);
        if(matches && matches.length > 1) {
          pipedIdList.push(matches[1]);
        }
      } while (matches);

Making this change and re-running the benchmark showed that the execution time for the validate call (for the 18-section LCS survey) fell to 8.65ms, a decrease of approximately 80%.

3 Second culprit - slow code in src/validation/index.js

With the above changes made and AJV performance improved, the following is a flame chart of the API server while running under load (repeatedly requesting questionnaire & pages from the UI):

flame_server_pre.png

As seen here, our validation code in src/validation/index.js is responsible for a huge amount of the total CPU time spent on generating responses.

The key culprit seems to be the section involving calling the lodash uniqWith function.

3.1 Optimising the validation code

src/validation/index.js calls AJV via validate(questionnaire) and then proceeeds to generate a list of error messages using createValidationError.

The part which was the performance bottleneck was that which checked for duplicated error messages, in order to remove them:

  const uniqeErrorMessages = uniqWith(formattedErrorMessages, (one, two) => {
    const oneClone = Object.assign({}, one);
    const twoClone = Object.assign({}, two);

    delete oneClone.id;
    delete twoClone.id;

    const isDuplicate = isEqual(oneClone, twoClone);

    return isDuplicate;
  });

The function already iterates over the entire array multiple times, but the crunch point for performance (especially looking at the V8 profiling logs) seems to be the repeated use of Object.assign, causing repeated memory allocation events.

This function was removed and checking for uniqueness implemented in the first iteration over the error messages using a javascript object as a hashmap. Error objects are stringified using the minimal information required by the downstream code (dataPath and message) to produce their hash.

Checking for duplicates in this way can therefore be done in constant time (using hashmap lookup). The entire function was also refactored to iterate only once over the array of errors rather than iterating several times throughout the process.

The new approach is as follows:

  const uniqueErrorMessages = {};
  const formattedErrorMessages = [];

  for (const err of validate.errors) {
    if (err.keyword === "errorMessage") {
      const key = `${err.dataPath} ${err.message}`;

      if (uniqueErrorMessages[key]) {
        continue;
      }
      uniqueErrorMessages[key] = err;

      formattedErrorMessages.push(formatErrorMessage(err, questionnaire));
    }
  }

4 Results

4.1 Effect on response times

With these two changes, the overall response times are down to a level similar to that expected with validation disabled in the server code:

network_postopt3.png

This represents a reduction in total response times of up to 80-90% compared to the pre-optimised codebase.

Author: Thomas Baggaley

Created: 2020-12-11 Fri 14:28