Achieving k-anonymity of a large-scale database in a distributed memory environment

The k-anonymity problem introduced by Samarati and Sweeney in 1998, guarantees that it is impossible to distinguish user data from at least (k − 1) others in the same database. The methods used to achieve k-anonymity result in an information loss as the data in the database is modified, making it less accurate through a process of generalization or micro-aggregation of the stored data. Mauger et al. proposed a O(n²)-time sequential algorithm that gives good results while minimizing the information loss using their designed metrics. However, their solution is very time-consuming and therefore not suitable for large-scale databases. In this paper, we tackle this problem using parallelism. We propose three coarse-grained parallel algorithms to solve the k-anonymity problem. The first is the straightforward algorithm that runs in O(n ²/p) execution time with O(n) communication rounds, where n is the number of lines in the database and p is the number of processors. The second runs in O(n² /p²) execution time with O²(p) communication rounds. The third runs in O(n² /plog2(p)) execution time with O(np/ (log2(p))²) communications rounds. For the latter two algorithms, we introduce the concept of data reorganization to minimize the information loss when data are partitioned. Experimental results show that for a database of size n = 10^6 , p = 2^7 , and k = 10^2 , second, and third parallel algorithms are respectively 1127.59× and 41.13× faster than the sequential algorithm while achieving anonymity with 4.03% and 2.62% information loss.

13 Jan 2024

25 Jan 2024