a technique for duplicate detection with overlapping imports of key-less records
My financial affairs are simple enough that I can keep track of them by exporting the transaction data from my bank accounts and annotating the exports in a Google sheet. The annotations classify each income or expenditure item with a category, which lets me generate reports that aggregate the data by category for analysis purposes.
One of the tedious aspects of this process is the need to export the transaction data from the bank. I will eventually write a Chrome extension to help automate this, but for now I manually export the transaction data from each account as a CSV file each week and then run an import process over the resulting files.
One challenge is that, so that I don't miss any data, the periodic exports need to overlap. The overlap means that files from consecutive exports may contain duplicate records.
Obviously, from an accounting point of view, it is important to eliminate the duplicates from the import and, from a tedium-minimisation perspective, to do this automatically.
This would be a trivial problem if the source transaction data had a natural or surrogate key for each record but unfortunately it does not. In fact, each source record has the following fields:
- account
- transaction date
- narrative
- credit amount
- debit amount
No combination of these fields is guaranteed to produce a unique key for a record - it is quite possible for two records to specify the same account, transaction date, narrative, credit and debit amounts. Worse, I have observed that the bank occasionally alters the narrative for a given transaction between successive exports, so the narrative cannot be used as part of a record key - to do so would risk creating two imported records, with different keys, for the same transaction.
The solution I chose for this problem is to extend the imported record with two additional fields: repetition and md5 sum. So:
- account
- transaction date
- narrative
- credit amount
- debit amount
- repetition
- md5 sum
Repetition is initially blank. The md5 sum is calculated over the account, transaction date, credit amount, debit amount and repetition fields. If two records in the same import share the same values for those fields, their md5 sums will be identical. In that case, the repetition for the current record is set to 1 and its md5 sum is recalculated. The process is repeated, incrementing repetition on each occasion, until the record obtains an md5 sum that is unique within the import.
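To make the loop concrete, here is a minimal Go sketch of the key-assignment step. The record type, field names and separator below are illustrative assumptions, not the actual go-csv implementation:

package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// record is a simplified view of one imported transaction.
type record struct {
	Account, Date, Narrative, Credit, Debit string
	Repetition                              string // initially blank
	KeyMD5                                  string
}

// assignKeys gives every record within a single import a unique MD5-based key.
// The narrative is deliberately excluded from the key; on a collision the
// repetition field is incremented and the sum recalculated until it is unique.
func assignKeys(records []*record) {
	seen := map[string]bool{}
	for _, r := range records {
		for n := 0; ; n++ {
			if n > 0 {
				r.Repetition = strconv.Itoa(n)
			}
			sum := md5.Sum([]byte(strings.Join(
				[]string{r.Account, r.Date, r.Credit, r.Debit, r.Repetition}, "|")))
			key := hex.EncodeToString(sum[:])
			if !seen[key] {
				seen[key] = true
				r.KeyMD5 = key
				break
			}
		}
	}
}

func main() {
	rs := []*record{
		{Account: "A1", Date: "2015-06-01", Debit: "10.00", Narrative: "COFFEE"},
		{Account: "A1", Date: "2015-06-01", Debit: "10.00", Narrative: "COFFEE"},
	}
	assignKeys(rs)
	for _, r := range rs {
		fmt.Printf("repetition=%q key=%s\n", r.Repetition, r.KeyMD5)
	}
}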
The md5 sum so calculated is then effectively a surrogate key for the record, one that can be assumed to identify the same transaction across all previous (and future) imports under the additional assumptions that imports always contain a full day's worth of data, that the bank never changes the date of a transaction, and that it never withdraws (as opposed to reverses) a transaction it has previously published. Should those assumptions be violated, manual intervention would still be required to delete downstream references to the altered or deleted records.
With a surrogate key for each record now established, I can then merge the newly imported data with previously imported data by matching the new and existing records on the basis of their surrogate keys.
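Continuing the same illustrative sketch (the record type and KeyMD5 field from above are assumptions, not the real schema), the merge step is essentially an upsert keyed on the surrogate key:

// merge folds a newly imported batch into the previously imported set,
// matching on the MD5 surrogate key: a matching key means the record was
// already seen in an earlier, overlapping export, so it is not added again.
func merge(existing, imported []*record) []*record {
	index := make(map[string]int, len(existing))
	for i, r := range existing {
		index[r.KeyMD5] = i
	}
	for _, r := range imported {
		if i, ok := index[r.KeyMD5]; ok {
			existing[i] = r // duplicate from an overlapping export; refresh in place
		} else {
			index[r.KeyMD5] = len(existing)
			existing = append(existing, r) // genuinely new transaction
		}
	}
	return existing
}

Refreshing the matched record in place also gives a natural point at which to detect the narrative changes discussed in the footnote below.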
The key assignment process is both stateless and robust - I don't need access to all the data I previously imported in order to generate unique keys for each record, and it doesn't matter if different imports at different times contain overlapping data - in both cases, the same keys will be generated for the same set of transactions.*
* It is technically true that if the order of transactions with identical dates and amounts is transposed between subsequent imports, then the narratives for two records may be swapped from one import to the next. This potentially becomes problematic if downstream processing depends on automated interpretation of the narrative; in that case, it might be necessary to account for narrative changes during the merge process.
To support this pattern, I have written some tools in golang: one that can generate additional fields to ensure that each record of a CSV input stream contains a unique natural key, and another that can then generate a surrogate key from the MD5 sum of that natural key.
So, for example:
uniquify \
--partial-key BankAccount,Date,CreditAmount,DebitAmount \
--additional-key Repetition |
surrogate-keys \
--natural-key \
BankAccount,Date,CreditAmount,DebitAmount,Repetition \
--surrogate-key KeyMD5
For more information, see: https://godoc.org/github.com/wildducktheories/go-csv/uniquify, https://godoc.org/github.com/wildducktheories/go-csv/surrogate-keys and https://github.com/wildducktheories/go-csv.