Solving Subset Sum Problems using Binary Optimization with Applications in Auditing and Financial Data Analysis

Many applications in automated auditing and the analysis and consistency check of financial documents can be formulated in part as the subset sum problem: Given a set of numbers and a target sum, find the subset of numbers that sums up to the target. The problem is NP-hard and classical solving algorithms are therefore not practical to use in real applications. We tackle the problem as a QUBO (quadratic unconstrained binary optimization) problem and show how gradient descent on Hopfield Networks reliably finds solutions for both artificial and real data. We give an outlook for the application of specialized hardware and quantum algorithms.


I. INTRODUCTION AND PROBLEM STATEMENT
The financial auditing process involves the writing and proofreading of financial reports for the audited company. This process is still largely a manual one: auditors must read documents, compare document content with previous reports, check the content for completeness against financial regulation checklists, and check for both compliance and mathematical errors.
One aspect of mathematical correctness is the correctness of numerical tables, e.g. describing profit and loss for a given year or quarter. All values in these tables must of course correspond to the actual financial situation of the company, which includes the correctness of sums in the tables. For example, for a table depicting the revenue, expenses and income, the values for expenses and income must sum up to the revenue.
During the manual auditing process, one would apply knowledge about financial reports to evaluate which values correspond to which sums for each table and recheck the correctness of the calculations. This is of course highly time and labour intensive, and mistakes are easy to make when auditing a large number of tables.
Automating this process, however, proves difficult. While a machine has no problem evaluating the correctness of calculations for given table entries, evaluating which values must sum up to which other values is a complex task. Human auditors can apply knowledge of financial reports or of the structure of tables in general: tables are formatted so that human readers can easily see how sums are structured, e.g. by sum values being at the bottom of the table, indicated by text in the row headers, or separated from other values by bold lines. Teaching a machine to understand these intuitive rules is almost impossible. A strict rule-based approach would be highly dependent on the formatting of specific tables and would not generalize well.
A different approach to this problem is therefore to ignore the table structure altogether. Treating single columns or the entire table as an (ordered) set of numbers, one can try to find the values which can be described as a sum of a subset of all other numbers. This problem can of course be solved exactly by deterministic algorithms. However, the size of tables and the magnitude of their entries (for financial documents) make many algorithmic approaches impractical.
In this preprint, we evaluate how stochastic algorithms based on gradient descent on so-called Hopfield networks can solve the problem of finding sums in large (both in size and magnitude) tables:
• We first restate the problem of finding sums in tables as the subset sum problem and briefly discuss known deterministic solving algorithms.
• We then derive the general algorithm for gradient descent on Hopfield networks and show how to restate the subset sum problem as a problem solvable by these networks (QUBOs).
• We evaluate our Hopfield algorithm on both artificial and real data and discuss applications and future work.

A. Subset Sum Formulation of the problem
We call a set of rules that describe the behaviour of sums in a document, e.g. rows 1-4 sum up to row 5, rows 5 and 6 sum up to row 8, a sum structure. See Figure 1 for an example.
Many tables found in financial reports show the same sum structure in multiple columns, e.g. when comparing financial statements for several quarters and years. An efficient algorithm for discovering sums in columns could aid consistency checks: apply the algorithm to one column, extract a sum structure, and check whether the other columns comply with it. If another column does not comply with the same sum structure, this indicates an inconsistency in the table.
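The consistency check described above can be sketched in a few lines. The representation of a sum structure as a list of (sum row, addend rows) pairs, the function name `violations`, and the example values are assumptions made for illustration only; they are not taken from the paper.

```python
# A sum structure as a list of rules: (index of the sum row, indices of its addends).
# This representation is a hypothetical choice for illustration.
SumStructure = list[tuple[int, list[int]]]


def violations(column: list[int], structure: SumStructure) -> list[int]:
    """Return the sum rows of `column` whose value does not match their rule."""
    return [
        target
        for target, parts in structure
        if sum(column[i] for i in parts) != column[target]
    ]


# Rows 0-3 sum to row 4; rows 4 and 5 sum to row 6 (amounts in cents).
structure = [(4, [0, 1, 2, 3]), (6, [4, 5])]
q1 = [1200, 350, 75, 25, 1650, 350, 2000]
q2 = [1300, 300, 80, 20, 1700, 400, 2000]  # 1700 + 400 != 2000

print(violations(q1, structure))  # []
print(violations(q2, structure))  # [6]
```

Storing amounts as integer cents keeps the comparison exact, which matters for values with many significant figures.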
Finding sum structures in tables is closely related to a well-known problem in algorithmic combinatorics, the subset sum problem. The subset sum problem is defined by a set of numbers $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{N}$ and a target sum $T \in \mathbb{N}$. We aim to find a subset $Y \subseteq X$ such that the sum of the subset is equal to the target sum:
$$\sum_{y \in Y} y = T.$$
The problem is NP-hard in general; a naive search has to consider up to $2^n$ possible combinations of numbers for the subset.
In the framework of consistency checks and finding sums in tables, we can consider the entire table as a set of numbers and apply the problem to each entry: taking the entry as a target sum, is it possible to find a subset of all other numbers that sums up to the target? Iterating a solving algorithm over each entry in the table yields a sum structure for the table.

B. Classical solving algorithms and algorithms for approximate solutions
There are several known algorithms for solving the subset sum problem.
The naive approach consists of cycling through each of the $2^n$ possible subsets, summing up all elements and comparing the sum to the target sum. This has a total complexity of $O(2^n n)$. The algorithm can be improved by several heuristics (e.g. sorting the numbers and stopping the iteration when the target sum is surpassed by the subset), but the exponential complexity remains.
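As a minimal sketch of the naive approach, the following enumerates index subsets by increasing size and returns the first one that matches the target. The function name `subset_sum_naive` and the example numbers are hypothetical; the empty subset is skipped, so $2^n - 1$ candidates are tested in the worst case.

```python
from itertools import combinations


def subset_sum_naive(numbers, target):
    """Enumerate all non-empty index subsets; return the first hit or None."""
    n = len(numbers)
    for size in range(1, n + 1):
        for idx in combinations(range(n), size):
            # Summing a subset costs O(n), giving O(2^n n) overall.
            if sum(numbers[i] for i in idx) == target:
                return list(idx)
    return None


print(subset_sum_naive([3, 34, 4, 12, 5, 2], 9))  # [2, 4], i.e. 4 + 5
```

Already at a few dozen values this enumeration becomes infeasible, which is the motivation for the alternatives discussed in the text.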
Additionally, there exist dynamic programming algorithms for solving the subset sum problem exactly in pseudo-polynomial time, that is, in $O(n^2 C)$, where $C = B - A$ and $A$, $B$ are the lower and upper bounds of the set of numbers $X$.
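To illustrate the idea behind the pseudo-polynomial approach, here is a sketch that tracks every reachable sum in a dictionary. It is a simplified variant for illustration, not necessarily the exact algorithm referred to above; the function name `subset_sum_dp` is hypothetical. The dictionary can never hold more entries than there are distinct reachable sums, so its size is bounded by the width of the range of possible sums rather than by $2^n$.

```python
def subset_sum_dp(numbers, target):
    """Dynamic programming over reachable sums; returns indices of one solution or None."""
    # reachable[s] holds the indices of one subset summing to s.
    # The empty subset sums to 0.
    reachable = {0: []}
    for i, x in enumerate(numbers):
        # Iterate over a snapshot so that each number is used at most once.
        for s, idx in list(reachable.items()):
            if s + x not in reachable:
                reachable[s + x] = idx + [i]
    return reachable.get(target)


numbers = [3, 34, 4, 12, 5, 2]
solution = subset_sum_dp(numbers, 9)
print(solution, sum(numbers[i] for i in solution))
```

For financial amounts stored in cents with up to 14 significant figures, the range of possible sums is so wide that this table becomes impractically large, which matches the limitation noted in the introduction.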

C. Rule-based algorithms for finding sums in financial tables
The problem of finding sum structures in tables does not have to be broken down to the subset sum problem. By ignoring the inherent structure and logic of the table, the complexity of the binary combination problem is increased. Applying rule-based logic and understanding of the general structure of tables can result in efficient algorithms to solve the problem of finding sum structures in tables.
Such rule-based approaches can apply multiple heuristics to find sums: sums are generally more likely to be structured top-to-bottom, sums are likely to occur in rows adjacent to the corresponding subset, and the last entries in columns are likely to be sums.
However, rule-based approaches require specialization to each type of table and are hard to generalize.

II. SUBSET-SUM AS QUBO
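A quadratic unconstrained binary optimization problem (QUBO) is defined by a function $f : \{0, 1\}^n \to \mathbb{R}$ which is a quadratic polynomial over its binary input variables,
$$f(z) = \sum_{i=1}^{n} \sum_{j=1}^{n} P_{ij} z_i z_j + \sum_{i=1}^{n} p_i z_i.$$
The QUBO problem consists of finding the optimal binary vector $z^* \in \{0, 1\}^n$ such that
$$z^* = \operatorname*{argmin}_{z \in \{0, 1\}^n} f(z).$$
The problem can be rewritten in matrix notation as
$$z^* = \operatorname*{argmin}_{z \in \{0, 1\}^n} z^\top P z + p^\top z$$
with a symmetric and hollow matrix $P \in \mathbb{R}^{n \times n}$ and a vector $p \in \mathbb{R}^n$.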
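To convert the subset sum problem into a QUBO, we recall the problem statement. Given a set $X = \{x_1, x_2, \ldots, x_n\}$ and a target value $T$, determine a subset $Y^* \subseteq X$ such that
$$\sum_{y \in Y^*} y = T.$$
The subset sum problem can therefore be stated as finding $Y^*$ such that
$$Y^* = \operatorname*{argmin}_{Y \subseteq X} \Big( \sum_{y \in Y} y - T \Big)^2.$$
Collecting the numbers contained in set $X$ in a vector $x = (x_1, x_2, \ldots, x_n)^\top \in \mathbb{R}^n$ and introducing a binary indicator vector $z \in \{0, 1\}^n$ with entries
$$z_i = \begin{cases} 1 & \text{if } x_i \in Y, \\ 0 & \text{otherwise,} \end{cases}$$
the subset sum problem can alternatively be written as
$$z^* = \operatorname*{argmin}_{z \in \{0, 1\}^n} \big( x^\top z - T \big)^2.$$
Expanding the equation and dropping the constant $T^2$, which does not affect the minimizer, we write
$$z^* = \operatorname*{argmin}_{z \in \{0, 1\}^n} z^\top x x^\top z - 2 T x^\top z = \operatorname*{argmin}_{z \in \{0, 1\}^n} z^\top Q z + q^\top z,$$
where we introduced the shorthands
$$Q = x x^\top, \qquad q = -2 T x. \tag{6}$$
Closely related to QUBOs are Ising models, where we optimize over $s \in \{-1, 1\}^n$ instead of $z \in \{0, 1\}^n$:
$$s^* = \operatorname*{argmin}_{s \in \{-1, 1\}^n} s^\top P s + p^\top s.$$
Both problem statements are in fact equivalent, with conversion via $z = \tfrac{1}{2}(s + \mathbf{1})$ and $s = 2z - \mathbf{1}$. Converting the QUBO derived from the subset sum problem above, we have
$$z^\top Q z + q^\top z = \tfrac{1}{4} s^\top Q s + \tfrac{1}{2} (Q \mathbf{1} + q)^\top s + \text{const} = s^\top P s + p^\top s + \text{const},$$
where we introduced the shorthands
$$P = \tfrac{1}{4} Q, \qquad p = \tfrac{1}{2} (Q \mathbf{1} + q). \tag{7}$$
All in all, we can thus consider the subset sum problem as a minimization problem over $\{-1, 1\}^n$ by
$$s^* = \operatorname*{argmin}_{s \in \{-1, 1\}^n} s^\top P s + p^\top s \tag{8}$$
with $Q$ and $q$ defined in (6) and $P$ and $p$ defined in (7).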
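The conversion between the binary and the bipolar formulation can be checked numerically on a small instance. The sketch below assumes the convention $s^\top P s + p^\top s$ with $P = \tfrac{1}{4} x x^\top$ and $p = (\tfrac{1}{2} \mathbf{1}^\top x - T)\, x$; the example values, the function names and the explicit constant are illustrative assumptions.

```python
from itertools import product

x = [3, 5, 8, 13]  # example values (hypothetical)
T = 16             # 3 + 13 = 16 and 3 + 5 + 8 = 16
n = len(x)


def residual(z):
    """Squared distance between the selected sum and the target, (x^T z - T)^2."""
    return (sum(xi * zi for xi, zi in zip(x, z)) - T) ** 2


def ising(s):
    """s^T P s + p^T s with P = x x^T / 4 and p = (sum(x) / 2 - T) x."""
    xs = sum(xi * si for xi, si in zip(x, s))
    return xs * xs / 4 + (sum(x) / 2 - T) * xs


# Constant that is dropped when passing from z to s = 2z - 1.
const = (sum(x) / 2 - T) ** 2

for z in product([0, 1], repeat=n):
    s = [2 * zi - 1 for zi in z]
    # Both objectives agree up to the constant for every binary vector.
    assert residual(z) == ising(s) + const

solutions = [z for z in product([0, 1], repeat=n) if residual(z) == 0]
print(solutions)  # [(1, 0, 0, 1), (1, 1, 1, 0)]
```

Because the two objectives differ only by a constant, a minimizer of one is a minimizer of the other, and a subset solves the problem exactly when the residual is zero.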

A. QUBO-Solving with Hopfield Networks
A Hopfield network is a recurrent neural net of $n$ interconnected neurons. The state of the network is described by a bipolar vector $s \in \{-1, 1\}^n$. Each neuron is connected to every other neuron, with connection weights given by a matrix $W \in \mathbb{R}^{n \times n}$. Each neuron $s_i$ is a bipolar threshold unit with threshold $\theta_i$, such that
$$s_i \leftarrow \begin{cases} +1 & \text{if } \sum_{j} W_{ij} s_j - \theta_i \geq 0, \\ -1 & \text{otherwise.} \end{cases} \tag{9}$$
A Hopfield network architecture is therefore fully described by a matrix $W$, a vector $\theta$ and a current state $s$. An update of the network is done via (9), either for all neurons at once or only for a subset of neurons.
We define the energy of the Hopfield network by
$$E(s) = -\tfrac{1}{2} s^\top W s + \theta^\top s.$$
If the weight matrix $W$ is symmetric and hollow (i.e. has a diagonal of all zeros), then the Hopfield energy can never increase when updating one neuron by (9). The updates in (9) amount to $s_i = \operatorname{sign}(-\nabla E(s))_i$, so each update performs gradient descent on $E(s)$. Since there are only $2^n$ possible states of the network, successive updates of single neurons will reach a local or global minimum after finitely many updates.
This behaviour can be leveraged to solve problems stated as QUBOs. Encoding the problem in weight and bias parameters $W$ and $\theta$, such that minimum energy states solve the underlying QUBO problem, the network may find solutions to the QUBO by the described gradient descent updates of single neurons.
To apply Hopfield networks to the subset sum problem, recall the problem statement as a minimization problem over $s \in \{-1, 1\}^n$ in (8). Defining
$$W = -2 \big( P - \operatorname{diag}(P) \big), \qquad \theta = p,$$
where $\operatorname{diag}(P)$ denotes the diagonal part of $P$, we find suitable weights and biases such that the state of a Hopfield network optimized to a global minimum encodes a solution to the subset sum problem. Removing the diagonal keeps $W$ hollow and only shifts the objective by a constant, since $s_i^2 = 1$.
Note that a Hopfield network with a random initialization does not necessarily converge to a global optimum, and local optima are not solutions to the subset sum problem. However, initializing the network multiple times with random states and running until convergence increases the chances of finding a global optimum.
See Algorithm 1 for a description of the full solving algorithm.
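The procedure can be sketched as follows. This is a sequential, pure-Python illustration with random restarts, not the batched GPU implementation used in the experiments. It assumes the energy $E(s) = -\tfrac{1}{2} s^\top W s + \theta^\top s$ with $W = -\tfrac{1}{2}(x x^\top - \operatorname{diag}(x)^2)$ and $\theta = (\tfrac{1}{2} \mathbf{1}^\top x - T)\, x$; the function name, parameters and default values are hypothetical. The weight matrix is never built explicitly: the local field of a neuron is computed from a running value of $x^\top s$.

```python
import random


def hopfield_subset_sum(x, T, restarts=10_000, max_sweeps=1_000, seed=0):
    """Random-restart asynchronous Hopfield descent; returns indices of a subset or None."""
    rng = random.Random(seed)
    n = len(x)
    # For z = (s + 1) / 2 we have x^T z - T == (x^T s + c) / 2.
    c = sum(x) - 2 * T
    for _ in range(restarts):
        s = [rng.choice((-1, 1)) for _ in range(n)]
        xs = sum(xi * si for xi, si in zip(x, s))  # running value of x^T s
        for _ in range(max_sweeps):
            changed = False
            for i in rng.sample(range(n), n):  # random update order
                # Local field of neuron i, scaled by 2: sum_j W_ij s_j - theta_i
                h = -x[i] * (xs - x[i] * s[i] + c)
                new = 1 if h >= 0 else -1
                if new != s[i]:
                    xs += x[i] * (new - s[i])
                    s[i] = new
                    changed = True
            if not changed:
                break  # no neuron flipped in a full sweep: local minimum
        if xs + c == 0:  # global minimum: the selected numbers sum to T
            return [i for i, si in enumerate(s) if si == 1]
    return None


x = [3, 34, 4, 12, 5, 2]
subset = hopfield_subset_sum(x, 9)
print(subset, sum(x[i] for i in subset))
```

Using Python integers keeps every comparison exact, which is relevant for amounts with many significant figures where squared values exceed floating-point precision. A batched version would instead keep many states in one array and update them in parallel.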
Parallel optimization of multiple independent Hopfield networks can be done efficiently on GPUs.
Given a vector of numbers $x \in \mathbb{N}^n$ and target sum $T \in \mathbb{N}$, construct Hopfield network weights $W$ and biases $\theta$ by

A. Data
We conduct experiments with both artificial data and real data.
To create artificial data we uniformly sample $n$ integers between $X_{\min}$ and $X_{\max}$, select $k$ of the sampled integers at random and calculate the target sum $T$ as the sum of the selected integers.
Note that the selection of $n$, $X_{\min}$ and $X_{\max}$ defines the number of possible solutions to the problem and therefore influences the difficulty of finding a solution. Given a set $X$ of $n$ integers between $X_{\min} < 0$ and $X_{\max} > 0$, the sum of any subset of $X$ must be in the interval $[n X_{\min}, n X_{\max}]$, for a total of $\#T = n (X_{\max} - X_{\min})$ possible target values. However, there are $2^n$ possible subsets of $X$. For many combinations of $n$, $X_{\min}$ and $X_{\max}$ we have $2^n \gg n (X_{\max} - X_{\min})$ and therefore some target values must have multiple solutions. Finding solutions for a problem with many distinct solutions is of course easier than finding one correct solution among $2^n$ possible combinations.
We construct artificial data in the configurations described in Table I. For each configuration, we sample $M = 5$ different subset sum problems.
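The sampling procedure and the counting argument above can be sketched as follows. The function names and the example parameters are hypothetical, and the ratio computed here is simply $2^n / (n (X_{\max} - X_{\min}))$ as discussed in the text, under the assumption $X_{\min} < 0 < X_{\max}$.

```python
import random


def make_problem(n, x_min, x_max, k, seed=0):
    """Sample n integers uniformly in [x_min, x_max]; the target is the sum of k of them."""
    rng = random.Random(seed)
    numbers = [rng.randint(x_min, x_max) for _ in range(n)]
    chosen = sorted(rng.sample(range(n), k))
    target = sum(numbers[i] for i in chosen)
    return numbers, target, chosen


def subsets_per_target(n, x_min, x_max):
    """Average number of subsets per possible target value, 2^n / (n (x_max - x_min))."""
    return 2 ** n / (n * (x_max - x_min))


numbers, target, chosen = make_problem(n=16, x_min=-1000, x_max=1000, k=4)
print(len(numbers), target == sum(numbers[i] for i in chosen))  # 16 True

print(subsets_per_target(16, -1000, 1000))                # 2.048
print(subsets_per_target(16, -10**6, 10**6) < 1)          # True
```

A ratio well above 1 means that many targets have several solutions, while a ratio below 1 suggests that a planted solution is likely the only one.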
We evaluate our algorithm on a set of real data problems. We parse a financial report 1 containing multiple sheets with financial reports for the quarters from Q1 2019 to Q4 2020 for a total of 190 independent columns. Each column contains numbers describing amounts up to multiple billion €, exact to one cent, for a total of 14 significant figures. See Table II for details on the dataset.

B. Experiments and Results
We run the Hopfield algorithm on artificial data for up to 1e+8 initializations in batches of 1e+4 on one NVIDIA A100 GPU, for a maximum computation time of around 8 minutes.
Table caption (beginning missing): ... a column length of $n + 1$, which corresponds to $n$ values in the subset sum problem, and values between $X_{\min}$ and $X_{\max}$. Again, $R$ describes the expected number of solutions when sampling a set of numbers and a random target solution defined by $n$, $X_{\min}$ and $X_{\max}$. We see that each $R \ll 1$ and in most columns there is only one unique solution to the subset sum problem.
For almost all configurations, the algorithm reliably finds a correct solution for all samples. Only for the configuration with $n = 256$ and $X_{\max}$ = 1e+6 are not all samples solved within the specified maximum number of runs (2 of 5 found). See Figure 2 (right) for a comparison of computation time against $n$ and $X_{\max}$. We see that the number of values $n$ has a smaller impact on the computation time until a solution is found than the magnitude of the numbers $X_{\max}$.
We run the Hopfield algorithm with the same configuration on the financial data. The algorithm finds a correct solution to the problem within the maximum number of iterations in all cases. Note that, unlike for most of the artificial data, the combination of $n$ and $X_{\max}$ leads to a situation where each problem likely has only one correct solution, which is found by our algorithm. See Figure 2 (left) for a comparison of computation time for different tables. We see that the average computation time clearly increases with the size of the problem, i.e. the number of values in the table.
See Tables III and IV for statistics on the runs. In total, we conclude that optimization of binary vectors with Hopfield networks is a reliable algorithm for solving subset sum problems and can be applied to real-world examples.

IV. CONCLUSION AND OUTLOOK
In this work we investigated how the subset sum problem plays a vital part in the automation of the auditing process, and how it can be restated as a well-known problem architecture which can be solved by the application of Hopfield networks. We found that the proposed algorithm reliably finds correct sum structures for artificial and real data.
Future work will include an investigation of the application of special purpose hardware (FPGAs) for QUBO-solving and implications of this algorithm architecture for quantum computing applications.
In the near future, the algorithm will be ready to deploy on existing smart auditing software to directly benefit auditors in their daily work.