This phase describes the cleaning of data records and their preparation for analysis. It comprises sub-processes that check, clean, and transform the collected data, and may be repeated several times.
The “Process” and “Analyze” phases can be iterative and parallel. Analysis can reveal a broader understanding of the data, which might make it apparent that additional processing is needed. Activities within the “Process” and “Analyze” phases may commence before the “Collect” phase is completed. This enables the compilation of provisional results where timeliness is an important concern for users, and increases the time available for analysis. The key difference between these phases is that “Process” concerns transformations of microdata, whereas “Analyze” concerns the further treatment of statistical aggregates.
This phase comprises eight sub-processes:
- 5.1.Integrate data - This sub-process integrates data from one or more sources. The input data can be from a mixture of external or internal data sources and a variety of collection modes, including extracts of administrative data, resulting in a harmonized data set. Data integration typically includes matching and record linkage routines to link data from different sources, where those data refer to the same unit; and prioritizing, when two or more sources (with potentially different values) contain data for the same variable. Data integration may take place at any point in this phase, before or after any sub-processes. There may also be several instances of data integration in any statistical business process. Following integration, and depending on data protection requirements, data may be anonymized, i.e., stripped of identifiers such as name and address, to help to protect confidentiality.
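The matching and prioritization described above can be sketched in a few lines. This is an illustrative example only, assuming simple deterministic matching on a shared unit identifier and a fixed source-priority rule; the source names, variables, and data are all hypothetical.

```python
# Minimal sketch of deterministic data integration: records from two
# hypothetical sources are matched on a shared unit identifier, and when
# both sources supply a value for the same variable, the higher-priority
# source wins. All names and data are illustrative.

SOURCE_PRIORITY = {"survey": 2, "admin": 1}  # higher number wins

def integrate(records):
    """records: iterable of (source, unit_id, {variable: value}) tuples."""
    merged = {}  # unit_id -> {variable: (priority, value)}
    for source, unit_id, values in records:
        prio = SOURCE_PRIORITY[source]
        unit = merged.setdefault(unit_id, {})
        for var, val in values.items():
            if var not in unit or prio > unit[var][0]:
                unit[var] = (prio, val)
    # strip priorities to produce the harmonized data set
    return {uid: {v: pv[1] for v, pv in unit.items()}
            for uid, unit in merged.items()}

records = [
    ("admin",  "U1", {"turnover": 100, "employees": 5}),
    ("survey", "U1", {"turnover": 120}),   # survey value overrides admin
    ("admin",  "U2", {"employees": 12}),
]
harmonized = integrate(records)
```

In practice record linkage is usually probabilistic and far more elaborate; this sketch only shows where the prioritization rule sits in the flow.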
- 5.2.Classify and code - This sub-process classifies and codes the input data. For example, automatic or clerical coding routines may assign numeric codes to text responses according to a pre-determined classification scheme.
- 5.3.Review, validate and edit - This sub-process applies to collected microdata. It looks at each record to identify and, where necessary, correct potential problems, errors, and discrepancies such as outliers, item non-response, and miscoding. Also referred to as input data validation, it may be run iteratively, validating data against predefined edit rules, usually in a set order. It may apply automatic edits, or raise alerts for manual inspection and correction of the data. Reviewing, validating, and editing can apply to unit records both from surveys and administrative sources, before and after integration. In certain cases, imputation (sub-process 5.4) may be used as a form of editing.
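Validation against predefined edit rules in a set order can be sketched as a list of named predicates applied to each record. The rules and record below are hypothetical.

```python
# Sketch of rule-based input data validation: each edit rule is a named
# predicate applied to a record in a set order; rules that fail are
# collected as flags for automatic editing or manual inspection.
# The rules and the record are illustrative.

EDIT_RULES = [
    ("age_in_range",        lambda r: 0 <= r.get("age", -1) <= 120),
    ("income_non_negative", lambda r: r.get("income") is None or r["income"] >= 0),
]

def validate(record):
    """Return the names of all edit rules the record violates, in rule order."""
    return [name for name, rule in EDIT_RULES if not rule(record)]

flags = validate({"age": 150, "income": -10})
```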
- 5.4.Impute - Where data are missing or unreliable, estimates may be imputed, often using a rule-based approach. Specific steps typically include:
- identification of potential errors and gaps;
- selection of data to include or exclude from imputation routines;
- imputation using one or more pre-defined methods, e.g., “hot-deck” or “cold-deck”;
- writing imputed data back to the data set and flagging them as imputed; and
- the production of metadata on the imputation process.
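The steps above can be sketched for a simple sequential hot-deck: missing values are filled from the most recent valid donor record in the same imputation class, imputed values are flagged, and a count is kept as minimal process metadata. The variables and data are illustrative.

```python
# Sketch of sequential hot-deck imputation: within each imputation class,
# a missing value is filled with the value from the most recent valid
# donor record, the imputed value is flagged, and a count is produced
# as simple process metadata. Data are illustrative.

def hot_deck_impute(records, variable, class_var):
    last_donor = {}   # imputation class -> most recent observed value
    n_imputed = 0
    for rec in records:
        cls = rec[class_var]
        if rec.get(variable) is not None:
            last_donor[cls] = rec[variable]       # record becomes a donor
        elif cls in last_donor:
            rec[variable] = last_donor[cls]       # write imputed value back
            rec[variable + "_imputed"] = True     # flag as imputed
            n_imputed += 1
    return n_imputed                              # process metadata

data = [
    {"region": "N", "income": 300},
    {"region": "N", "income": None},   # imputed from the previous "N" record
    {"region": "S", "income": None},   # no donor yet, left missing
]
count = hot_deck_impute(data, "income", "region")
```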
- 5.5.Derive new variables and statistical units - This sub-process derives values for variables and statistical units that are not explicitly provided in the collection, but are needed to deliver the required outputs. It derives new variables by applying arithmetic formulae to one or more of the variables already present in the dataset. This may need to be iterative, as some derived variables may themselves be based on other derived variables. It is therefore important to ensure that variables are derived in the correct order. New statistical units may be derived by aggregating or splitting data for collection units, or by various other estimation methods. Examples include deriving households where the collection units are persons, or enterprises where the collection units are legal units.
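The requirement to derive variables in the correct order can be met by declaring each derived variable with its dependencies and sorting them topologically before evaluation. The variable names and formulae below are invented for illustration.

```python
# Sketch of ordered variable derivation: each derived variable declares the
# variables it depends on, and a topological sort ensures that a variable is
# computed only after its inputs. Names and formulae are illustrative.

from graphlib import TopologicalSorter

DERIVATIONS = {
    # name: (dependencies, formula)
    "net_income": (("income", "tax"), lambda r: r["income"] - r["tax"]),
    "income":     ((),                lambda r: r["wages"] + r["other"]),
    "tax":        (("income",),       lambda r: round(r["income"] * 0.2)),
}

def derive(record):
    # only dependencies that are themselves derived constrain the order
    graph = {name: set(deps) & DERIVATIONS.keys()
             for name, (deps, _formula) in DERIVATIONS.items()}
    for name in TopologicalSorter(graph).static_order():
        record[name] = DERIVATIONS[name][1](record)
    return record

rec = derive({"wages": 900, "other": 100})
```

Here `income` is computed first, then `tax`, then `net_income`, even though the declaration order says otherwise.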
- 5.6.Calculate weights - This sub-process creates weights for unit data records according to the methodology in sub-process 2.5 (Design statistical processing methodology). These weights can be used to “gross-up” sample survey results to make them representative of the target population, or to adjust for non-response in total enumerations.
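A very simple weighting scheme can illustrate both uses: a base design weight (population count divided by sample count within each stratum) multiplied by a non-response adjustment. Real weighting methodologies (e.g., calibration) are far richer; the strata and counts below are hypothetical.

```python
# Sketch of weight calculation: the design weight is the stratum population
# divided by the stratum sample size ("gross-up" factor), and it is
# multiplied by a simple non-response adjustment. Counts are illustrative.

def calculate_weights(strata):
    """strata: {name: {"population": N, "sampled": n, "responded": r}}"""
    weights = {}
    for name, s in strata.items():
        design_weight = s["population"] / s["sampled"]     # gross-up factor
        nonresponse_adj = s["sampled"] / s["responded"]    # inflate for non-response
        weights[name] = design_weight * nonresponse_adj
    return weights

w = calculate_weights({
    "urban": {"population": 10000, "sampled": 100, "responded": 80},
    "rural": {"population": 4000,  "sampled": 50,  "responded": 50},
})
```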
- 5.7.Calculate aggregates - This sub-process creates aggregate data and population totals from microdata. It includes summing data for records sharing certain characteristics, determining measures of average and dispersion, and applying weights from sub-process 5.6 to sample survey data to derive population totals.
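Deriving population totals from weighted microdata reduces to a weighted group-by sum. This sketch assumes each record already carries its unit weight; the grouping variable and data are illustrative.

```python
# Sketch of aggregation: weighted microdata records are summed within
# groups sharing a characteristic to give population totals, applying the
# unit weights produced in sub-process 5.6. Data are illustrative.

def weighted_totals(records, group_var, value_var):
    totals = {}
    for rec in records:
        g = rec[group_var]
        totals[g] = totals.get(g, 0.0) + rec["weight"] * rec[value_var]
    return totals

totals = weighted_totals(
    [
        {"region": "N", "weight": 125.0, "income": 300},
        {"region": "N", "weight": 125.0, "income": 500},
        {"region": "S", "weight": 80.0,  "income": 400},
    ],
    "region", "income",
)
```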
- 5.8.Finalize data files – This sub-process compiles the results of the other sub-processes in this phase to produce a data file, usually of macrodata, which is used as the input to phase 6 (Analyse). Sometimes this may be an intermediate rather than a final file, particularly for business processes where there are strong time pressures and a requirement to produce both preliminary and final estimates.
|Author(s)||UK Information Commissioner's Office|
|Description||The code explains the issues surrounding the anonymisation of personal data, and the disclosure of data once it has been anonymised. It explains the relevant legal concepts and tests in the UK Data Protection Act 1998 (DPA). The code provides good practice advice that will be relevant to all organisations that need to convert personal data into a form in which individuals are no longer identifiable.|
|Author(s)||IRIS Center at the University of Maryland (College Park)|
|Description||Written in partnership with the IRIS Center at the University of Maryland, this detailed report identifies, evaluates, and compares the functionalities of software packages for the development of CAPI applications suitable for implementing complex household surveys.|
|Author(s)||UK National Statistics|
|Description||This protocol sets out how all those involved in the production of National Statistics will meet their commitment to protect the confidentiality of data within their care whilst also, and where appropriate, maximising the value of those data through data matching. Data matching involves the full or partial integration of two or more datasets on the basis of information held in common. It enables data obtained from separate sources to be used more effectively, thereby enhancing the value of the original sources. Data matching can also reduce the potential burden on data providers by reducing the need for further data collection. However, where data matching involves the integration of records for the same units (persons, households, companies, etc.), it also raises important issues of privacy and subject consent. The holding of identifying information also raises issues about confidentiality and security. The protocol covers the safeguards that must be taken to ensure confidentiality is maintained during data matching and in the treatment of matched datasets and the removal of identifiers. It also specifies who is responsible for any dataset created, who may access it, and its longevity.|
|Author(s)||United Nations Statistical Commission and United Nations Economic Commission for Europe|
|Description||Statistical Data Editing: Impact on Data Quality, is the third in the series of Statistical Data Editing publications produced by the participants of the United Nations Economic Commission for Europe (UN/ECE) Work Sessions on Statistical Data Editing (SDE). While the first two volumes dealt with the topics of what is data editing and how is it performed, the principal focus of this volume is its impact on the quality of the outgoing estimates. The aim of this publication is to assist National Statistical Offices in assessing the impact of the data editing process on data quality, that is, in assessing how well the process is working.|
|Description||This manual is primarily a practical guide to survey planning, design and implementation. It covers many of the issues related to survey taking and many of the basic methods that can be usefully incorporated into the design and implementation of a survey.|
|Date||Originally published in October 2003|
|Author(s)||United States Census Bureau|
|Description||The Census and Survey Processing System (CSPro) is a public domain software package used by hundreds of organizations and tens of thousands of individuals for entering, editing, tabulating, and disseminating census and survey data. CSPro is user-friendly, yet powerful enough to handle the most complex applications. It can be used by a wide range of people, from non-technical staff assistants to senior demographers and programmers.|
|Author(s)||Diego Zardetto, Istat|
|Description||ReGenesees (R Evolved Generalized Software for Sampling Estimates and Errors in Surveys) is a full-fledged R (open source) software developed and disseminated by the Italian Statistics Office. ReGenesees is a tool for design-based and model-assisted analysis of complex sample surveys. The package (and its graphical user interface package ReGenesees.GUI) runs under Windows, Mac OS, Linux, and most Unix-like operating systems.|