Assessing the information loss
Most anonymization techniques work by reducing the level of detail in the data or by suppressing information outright, so they typically result in some loss of information. The challenge for the statistician is to strike a balance between two conflicting objectives: reducing the disclosure risk and minimizing this loss.
Various methods are available to assess information loss. For categorical data, these methods include direct comparison of original and anonymized values, comparison of contingency tables, and entropy-based measures. For continuous data, methods include comparisons of mean square error, mean absolute error, and mean variation.
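As a rough illustration of the measures listed above, the following Python sketch compares an original variable with its anonymized version. The function names and toy data are hypothetical, and the definitions are simplified assumptions: mean variation is taken here as the average relative change (skipping zero originals), and the entropy function only reports the Shannon entropy of a variable's distribution, a simple stand-in for the entropy-based measures in the literature.

```python
import math
from collections import Counter


def mean_square_error(orig, masked):
    """Average squared difference between original and masked values."""
    return sum((o - m) ** 2 for o, m in zip(orig, masked)) / len(orig)


def mean_absolute_error(orig, masked):
    """Average absolute difference between original and masked values."""
    return sum(abs(o - m) for o, m in zip(orig, masked)) / len(orig)


def mean_variation(orig, masked):
    """Average relative change; zero originals are skipped (assumption)."""
    terms = [abs(o - m) / abs(o) for o, m in zip(orig, masked) if o != 0]
    return sum(terms) / len(terms)


def share_changed(orig, masked):
    """Direct comparison: fraction of categorical values that differ."""
    return sum(o != m for o, m in zip(orig, masked)) / len(orig)


def contingency_distance(orig_rows, masked_rows):
    """Sum of absolute cell-count differences between two contingency tables.

    Each row is a tuple of category values, so a Counter over the rows
    is the (sparse) contingency table.
    """
    a, b = Counter(orig_rows), Counter(masked_rows)
    return sum(abs(a[cell] - b[cell]) for cell in set(a) | set(b))


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


# Toy continuous variable: ages before and after perturbation.
age_orig = [23, 35, 47, 51]
age_masked = [25, 35, 45, 50]
print(mean_square_error(age_orig, age_masked))
print(mean_absolute_error(age_orig, age_masked))
print(mean_variation(age_orig, age_masked))

# Toy categorical variable: two regions recoded to "other".
region_orig = ["north", "south", "east", "west"]
region_masked = ["north", "south", "other", "other"]
print(share_changed(region_orig, region_masked))
print(entropy(region_orig) - entropy(region_masked))
```

Lower values indicate that the anonymized file stays closer to the original. In practice these figures would be read alongside a disclosure risk measure rather than on their own.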
Information on these techniques is available from the following resources.
|Author(s)||A. F. Karr, C. N. Kohnen, A. Oganian, J. P. Reiter, and A. P. Sanil|
|Description||When releasing data to the public, statistical agencies and survey organizations typically alter data values in order to protect the confidentiality of survey respondents’ identities and attribute values. To select among the wide variety of data alteration methods, agencies require tools for evaluating the utility of proposed data releases. Such utility measures can be combined with disclosure risk measures to gauge risk-utility tradeoffs of competing methods. This article presents utility measures focused on differences in inferences obtained from the altered data and corresponding inferences obtained from the original data. Using both genuine and simulated data, we show how the measures can be used in a decision-theoretic formulation for evaluating disclosure limitation procedures.|
|Author(s)||Josep Domingo-Ferrer, Josep M. Mateo-Sanz and Vicenç Torra|
|Description||We present in this paper the first empirical comparison of SDC methods for microdata which encompasses both continuous and categorical microdata. Based on re-identification experiments, we try to optimize the tradeoff between information loss and disclosure risk. First, relevant SDC methods for continuous and categorical microdata are identified. Then generic information loss measures (not targeted to specific data uses) are defined, both in the continuous and the categorical case. Disclosure risk is assessed using empirical re-identification. Two approaches to empirical re-identification are used: Euclidean record linkage and probabilistic record linkage. The results of this comparison will be used to come up with better SDC for microdata in the recently started EU-funded project CASC.|
|Author(s)||Josep Domingo-Ferrer and Vicenç Torra|
|Author(s)||Shanti Gomatam and Alan F. Karr|
|Author(s)||Josep Domingo-Ferrer and David Rebollo-Monedero|
|Description||Before releasing anonymized microdata (individual data) it is essential to evaluate whether: i) their utility is high enough for their release to make sense; ii) the risk that the anonymized data result in disclosure of respondent identity or respondent attribute values is low enough. Utility and disclosure risk measures are used for the above evaluation, which normally lack a common theoretical framework allowing to trade off utility and risk in a consistent way. We explore in this paper the use of information-theoretic measures based on the notion of mutual information.|
|Author(s)||William E. Winkler|
|Description||A public-use microdata file should be analytically valid. For a very small number of uses, the microdata should yield analytic results that are approximately the same as the original, confidential file that is not distributed. If the microdata file contains a moderate number of variables and is required to meet a single set of analytic needs of, say, university researchers, then many more records are likely to be re-identified via modern record linkage methods than via the re-identification methods typically used in the confidentiality literature. This paper compares several masking methods in terms of their ability to produce analytically valid, confidential microdata.|
|Author(s)||Matthias Schmid and Hans Schneeweiss|
|Description||Microaggregation is a set of procedures that distort empirical data in order to guarantee the factual anonymity of the data. At the same time the information content of data sets should not be reduced too much and should still be useful for scientific research. This paper investigates the effect of microaggregation on the estimation of a linear regression by ordinary least squares. It studies, by way of an extensive simulation experiment, the bias of the slope parameter estimator induced by various microaggregation techniques. Some microaggregation procedures lead to consistent estimates while others imply an asymptotic bias for the estimator.|
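The kind of experiment described in the last entry can be tried on a small scale. The sketch below is not the authors' procedure; it assumes one simple variant (fixed-size univariate microaggregation applied to each variable separately) and hypothetical names and parameters, and it only prints the ordinary least squares slope before and after masking so the two can be compared.

```python
import random


def microaggregate(values, k):
    """Sort values, form groups of k neighbours, replace each by its group mean.

    A tail shorter than k is folded into the last group. The result is
    returned in the original record order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    n = len(order)
    start = 0
    while start < n:
        end = start + k
        if n - end < k:  # not enough records left for another full group
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
        start = end
    return out


def ols_slope(x, y):
    """Slope of the simple linear regression of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Simulated data (assumed model): y = 2x + noise.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(500)]
y = [2 * a + random.gauss(0, 1) for a in x]

# Mask each variable separately with groups of k = 5 records.
x_masked = microaggregate(x, 5)
y_masked = microaggregate(y, 5)

print("slope on original data:", ols_slope(x, y))
print("slope on masked data:  ", ols_slope(x_masked, y_masked))
```

Repeating the run over many simulated samples, group sizes, and grouping rules gives a rough picture of how much a given masking variant shifts the estimate.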