Race conditions in Author API
1 The problem
Since Author stores questionnaires as standalone JSON documents (and MongoDB reads and writes these atomically), concurrent updates to the same questionnaire risk being lost.
Figure 1: The current state of affairs. MongoDB smiles on as writes disappear into the æther.
In Figure 1, two clients concurrently (i.e. unaware of the other’s actions) make two seemingly unrelated changes to the same questionnaire. Client 1 wants to change the questionnaire’s title; Client 2 wants to add a new question page.
Their requests are routed to two separate API containers - but this could also occur with two concurrent requests to the same container, since NodeJS will start executing the second request while the first is blocked waiting on IO. If you spam GraphQL calls, especially on large questionnaires, you can see this issue in action despite very low network latency and a single server instance.
1.1 Local reproduction
Here’s a quick and dirty way to show it in action.
withDuplicatePage.js:26 was modified so that it prints out the current section's page count when a page is duplicated:
return mutate({ variables: { input } })
  .then(get("data.duplicatePage"))
  .then((data) => {
    console.log("Pages in section: ", countPagesInSection(data.section));
    return data;
  })
  .then(tap(redirectToNewPage(ownProps)));
I then hit the “Duplicate” button like a madman. Here’s the result:
Figure 2: Le console after rapid duplication
We expect a monotonically increasing number of pages - but as can be seen, the page count is sometimes unchanged and sometimes even goes down as later responses come through to the client. Only a few of the duplicate requests actually left their mark on the questionnaire itself; the rest were overwritten by other concurrent attempts at duplication.
2 Solutions
The general problem is this: we read data and may later change it - but when we write, we do not know whether the original data has been changed in the meantime. If it has, those foreign changes will be completely lost.
To solve this, we have to rely on some flavour of locking[1].
2.1 Transactions
In traditional relational databases, locking is done
for you by the database engine via transactions - either
explicitly through the use of SELECT ... FOR
UPDATE
2 or through cranking up
the isolation level3 and leaving it to
the database vendors to figure out.
2.1.1 MongoDB
MongoDB does have some support for transactions[4], although the focus is on their use when writes span multiple documents. Starting a transaction requires initiating a client session and providing an async function which serves as the transaction body; exceptions may then be thrown and caught if the transaction fails.
The property of MongoDB's transactions relevant for this use case is that when another write occurs to an object read as part of a transaction, the transaction will be made to fail and a writeConflict exception thrown - see e.g. here for more details. This could then be caught by the API server and the request re-processed as necessary.
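To make the shape of this concrete, here is a minimal sketch using the Node MongoDB driver (the questionnaires collection and id filter are illustrative; error handling is simplified):

const session = client.startSession();
try {
  await session.withTransaction(async () => {
    // Reads and writes passed the session become part of the transaction
    const doc = await questionnaires.findOne({ id }, { session });
    doc.title = "New title";
    await questionnaires.replaceOne({ id }, doc, { session });
  });
} catch (e) {
  // The driver retries errors labelled TransientTransactionError (which
  // covers write conflicts) for a while; persistent failures land here,
  // where the request could be re-processed
} finally {
  await session.endSession();
}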
2.2 Optimistic locking
An alternative approach to using transactions is to
manually implement a form of optimistic
locking5 by modifying the query of
MongoDB’s update
function (which would be
called in e.g. saveQuestionnaire
in Author).
Instead of querying just for the questionnaire ID, we add a new unique ID which changes between saves - either a UUID or a monotonically increasing integer. This stack overflow answer details the technique. If the "losing" container / NodeJS task tries to update a questionnaire which has changed, this new unique ID will have been changed as part of the foreign write, the update query will no longer match any documents and nothing will be written.

The application can then query to see if the write was successful - if not, it knows it needs to retry.
Such an approach is mentioned briefly in the MongoDB docs concerning concurrency control:
Another approach is to specify the expected current value of a field in the query predicate for the write operations. – Atomicity and Transactions — MongoDB Manual
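A minimal sketch of the technique, assuming a hypothetical casId field on each questionnaire document:

const { v4: uuidv4 } = require("uuid");

// Read, remembering the CAS ID the document was read with
const questionnaire = await questionnaires.findOne({ id: questionnaireId });
const previousCasId = questionnaire.casId;

// ...modify `questionnaire` in memory...

// Write back only if no foreign write has happened in the meantime
const result = await questionnaires.replaceOne(
  { id: questionnaireId, casId: previousCasId },
  { ...questionnaire, casId: uuidv4() }
);
if (result.matchedCount === 0) {
  // The CAS ID changed under us - re-read and retry the update
}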
3 Application to Author
Any solution applied to Author will need to take into consideration current patterns within Author API - or at least weigh up the developer time and cost of refactoring existing code (unless beneficial for other features down the line) against getting something which works and requires minimal other changes.
To help decide, first we have a quick recap of Author’s current data flow and then consider the changes which would be required to implement transactions or “manual” optimistic locking.
3.1 Current API data architecture
The following is a rough sketch of how things currently work (generally) in Author API.
Figure 3: Current data architecture for Author API, as regards modifying questionnaires
The flow is as follows (a code sketch follows the list):

- The loadQuestionnaire middleware uses the questionnaireId header passed in with each request to identify the active questionnaire
- loadQuestionnaire fetches the questionnaire from MongoDB and stores it in the ctx accessible to all GraphQL resolvers
- GraphQL queries are parsed by Apollo server and trigger execution of all relevant resolvers. Resolvers pull data out of ctx.questionnaire to find what they need
- Mutations mutate ctx.questionnaire in place and use a helper function, createMutation, which takes care of the business of saving the questionnaire
- createMutation saves the questionnaire to MongoDB by updating the document with ID ctx.questionnaire.id and replacing it with the contents of ctx.questionnaire
- A response is sent to the client by Apollo server based on the resolvers' return values
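For orientation, here is a minimal sketch of that pattern (simplified; the real loadQuestionnaire and createMutation also handle errors, validation and so on):

// Express middleware: fetch the questionnaire once per request
const loadQuestionnaire = async (req, res, next) => {
  const questionnaireId = req.header("questionnaireId");
  req.questionnaire = await getQuestionnaire(questionnaireId);
  next();
};

// Helper wrapping each mutation resolver: run the resolver body (which
// mutates ctx.questionnaire in place), then persist the whole questionnaire
const createMutation = (mutation) => async (root, args, ctx) => {
  const result = await mutation(root, args, ctx);
  await saveQuestionnaire(ctx.questionnaire);
  return result;
};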
3.2 Option 1: Adding transactions
Using MongoDB transactions would require both reading and writing the questionnaire as part of one transaction.
Since at present these two stages are decoupled - the reading happens in the express middleware function loadQuestionnaire while the writing happens in the createMutation function triggered by Apollo server mutations - this represents a change compared to the current approach.
3.2.1 Transactions inside mutation resolvers
Mutations would have to start a transaction, read
the questionnaire, modify the questionnaire and then
commit the transaction. Exceptions thrown by the
transaction could be caught and the mutation
re-attempted, e.g. by simply re-running the entire
function body. There would be no more mutation of
ctx.questionnaire
since the data fetching
would have to happen as part of the mutation’s
lifetime.
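A sketch of what a single resolver might then look like (updateTitle is illustrative; getQuestionnaire and saveQuestionnaire are assumed to accept a session, as described below):

const updateTitle = async (root, { input }, ctx) => {
  const session = client.startSession();
  try {
    let questionnaire;
    await session.withTransaction(async () => {
      // The read now happens inside the mutation, as part of the transaction
      questionnaire = await getQuestionnaire(input.questionnaireId, session);
      questionnaire.title = input.title;
      await saveQuestionnaire(questionnaire, session);
    });
    return questionnaire;
  } finally {
    await session.endSession();
  }
};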
3.2.2 Transactions as part of createMutation wrapper
Transactions could potentially be incorporated into createMutation by having createMutation load the questionnaire, pop it into ctx, call the custom mutation code, save the questionnaire and then end the transaction. Errors could potentially be handled inside createMutation in a similar way too - the entire thing could be repeated if the transaction fails to commit. Custom callbacks passed to createMutation would need to be idempotent[6] given any particular ctx.questionnaire, which is likely (?) already the case.
As part of the spike, I had a quick go at enabling transactions by making createMutation use a new wrapper function wrapping questionnaire access in a transaction:
const withQuestionnaireTransaction = async (questionnaireId, fn) => {
  const session = client.startSession();
  try {
    await session.withTransaction(async () => {
      // Read inside the transaction so the read is associated with it
      const questionnaire = await getQuestionnaire(questionnaireId, session);
      await fn(questionnaire, session);
    }, transactionOptions);
    return true; // signal success to the caller explicitly
  } catch (e) {
    console.log("TRANSACTION FAILED");
    return false;
  } finally {
    await session.endSession();
  }
};
The wrapper provides a fresh copy of the questionnaire and a session variable to pass into calls you wish to be part of the transaction.
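For example, a caller might look like this (the mutation body shown is illustrative):

await withQuestionnaireTransaction(questionnaireId, async (questionnaire, session) => {
  questionnaire.title = "New title";
  await saveQuestionnaire(questionnaire, session);
});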
This quickly ran into the error reported elsewhere[7] - ERROR (MongoError): Unable to get latest version of questionnaire with ID: 80bac2f3-56aa-4c5d-9033-be65ff1b65e3 - which occurs when MongoDB isn't running as a replica set. Our development MongoDB instance runs as a standalone server, ruling this approach out locally unless we switch to a replica set instead. This would take a bit more investigation to see how long it'd take to get working - there are some resources[8] around which might simplify it.
Testing on our staging AWS DocumentDB, however, was a resounding success! With an SSH tunnel allowing access to DocumentDB, I wired up my local dev server to use the DocumentDB cluster as its database. Spamming duplicate like a madman now results in a pleasingly monotonic increase in the section's page count!
I used createMutation to handle the retry logic when a transaction failed, and used the above withQuestionnaireTransaction code to perform the actual transaction passed in via fn - which has the advantage of allowing most of our mutation code to remain unchanged. Note that saveQuestionnaire and getQuestionnaire were modified to accept an optional session parameter for passing to the internal MongoDB calls (to associate those calls with the transaction).
const questionnaireId = ctx.questionnaire.id;
let result, transactionSuccessful, hasBeenUnpublished;
for (
  let retryCount = 0;
  retryCount < MAX_RETRIES && !transactionSuccessful;
  retryCount++
) {
  transactionSuccessful = await withQuestionnaireTransaction(
    questionnaireId,
    async (questionnaire, session) => {
      ctx.questionnaire = questionnaire;
      result = await mutation(root, args, ctx);
      if (ctx.questionnaire.publishStatus === PUBLISHED) {
        ctx.questionnaire.publishStatus = UNPUBLISHED;
        ctx.questionnaire.surveyVersion++;
        hasBeenUnpublished = true;
        await createHistoryEvent(ctx.questionnaire.id, publishStatusEvent(ctx));
      }
      await saveQuestionnaire(ctx.questionnaire, session);
    }
  );
}
if (!transactionSuccessful) {
  throw new Error(
    `Failed to commit transaction after ${MAX_RETRIES} retries.`
  );
}
createMutation therefore attempts to apply the mutation at most MAX_RETRIES times before giving up with an exception.
Note that we have to re-fetch the questionnaire as part of createMutation in order for MongoDB to associate that read with the ongoing transaction. This is inefficient since we already read it in the middleware (via loadQuestionnaire) - this could be remedied by e.g. lazily fetching ctx.questionnaire as part of loadQuestionnaire instead of always fetching it, since mutations using this new approach will not utilise that read (but resolvers still would).
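One way to make that read lazy, as a sketch (the getQuestionnaire accessor on the request is an assumption, and resolvers would need updating to await it):

// Sketch: defer the fetch until something actually needs the questionnaire,
// caching it so repeated reads only trigger one query per request
const loadQuestionnaire = (req, res, next) => {
  const questionnaireId = req.header("questionnaireId");
  let cached;
  req.getQuestionnaire = async () => {
    if (cached === undefined) {
      cached = await getQuestionnaire(questionnaireId);
    }
    return cached;
  };
  next();
};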
The only downside at the moment is that this does not work with a standalone MongoDB instance, as used in the local dev environment. Work would need to be done to enable the local service to run as a replica set.
3.3 Option 2: Using DIY optimistic locking
DIY optimistic locking - implementing compare and set[9] by taking advantage of MongoDB's ability to query and update documents in one step, querying on an ID which changes with every write (hereafter the CAS ID) - is another approach which sidesteps the need for the transactions API, provided we only need to care about one document at a time.
In Author, however, we have to deal with two objects on each save - one from the questionnaires collection and one from the versions collection. The questionnaires collection provides metadata about the questionnaire; each object within is a slimmed-down version of the current state of the questionnaire (which is stored as the latest object in versions with the same ID).
At present, when any change is made to the questionnaire, a new version object is inserted into the versions collection corresponding to the change. Since this collection always gets appended to rather than having its documents modified, we would have to use a "compare and set ID" in a different place - such as on the questionnaires metadata object for the questionnaire.
This poses a problem: without multi-document transactions (which we were hoping to sidestep by going down this route), the CAS ID could be updated on the questionnaires object while a foreign write concurrently inserts a new change into the versions collection. The technique only works if you can atomically update the document AND check the CAS ID remains unchanged, which we cannot do across two documents without multi-document transactions.
We could keep track of which questionnaires are active by using only the versions collection and doing away with questionnaires, ensuring a single-document atomic commit. But this is still probably less preferable than using native MongoDB transactions.
4 Further steps
During this spike, transactions were successfully implemented - albeit when using a remote DocumentDB cluster. Using transactions solved the initial problem of lost writes as demonstrated by the page duplication example, hopefully serving as a generic solution (relevant to e.g. the "duplicate page bug" ticket and the "answer labels disappearing" ticket[10] in the backlog) for race conditions in Author's DB usage.
DIY optimistic locking was considered but would require changes to how we use our collections so that all questionnaire data is saved atomically - not necessary if transactions can be made to work.
Once working on the dev environment, further work would be needed to add relevant tests and ensure all existing mutations continue to work. Some mutations, e.g. lockQuestionnaire, would need to be updated manually to use the new withQuestionnaireTransaction interface as they intentionally bypass createMutation.
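As a sketch, such a migrated mutation might end up looking something like this (the locked flag and resolver shape are illustrative, and retry handling akin to createMutation's would still be needed):

const lockQuestionnaire = async (root, { input }, ctx) => {
  const success = await withQuestionnaireTransaction(
    input.questionnaireId,
    async (questionnaire, session) => {
      questionnaire.locked = true; // illustrative - not the real schema
      await saveQuestionnaire(questionnaire, session);
    }
  );
  if (!success) {
    throw new Error("Failed to lock questionnaire");
  }
  return input.questionnaireId;
};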
Footnotes:
Locking reads in MySQL - https://dev.mysql.com/doc/refman/8.0/en/innodb-locking-reads.html
Transaction isolation levels in PostgreSQL - https://www.postgresql.org/docs/9.5/transaction-iso.html
MongoDB documentation: Transactions - https://docs.mongodb.com/manual/core/transactions/
Optimistic locking involves writing changes "optimistically" - i.e. assuming that they will succeed ("pessimistic" locking would be to take an exclusive lock over the entire row / object in advance) - and then checking to see if any conflicts have occurred after the fact and rolling back. MongoDB's transactions themselves use an optimistic model where writes proceed but are interrupted and reset if another concurrent (foreign) write is detected. See e.g. https://en.wikipedia.org/wiki/Optimistic_concurrency_control
MongoDB needs to be running as a replica set in order to allow transactions - https://stackoverflow.com/questions/62343611/enabling-mongodb-transactions-without-replica-sets-or-with-least-possible-config
Zero config MongoDB replica set runner - http://thecodebarbarian.com/introducing-run-rs-zero-config-mongodb-runner
Compare and set - https://en.wikipedia.org/wiki/Compare-and-swap
Bug ticket - answer labels disappearing: https://collaborate2.ons.gov.uk/jira/browse/EAR-556