Hashing is a technique that transforms data of any size into a fixed-length value, called a hash or a digest, that acts as a fingerprint of the original data. With a cryptographic hash function, any change to the input is expected to produce a different digest, and finding two inputs with the same digest is computationally infeasible. Hashing can be used to validate the integrity of data communicated between production databases and a big data analytics system by comparing the hash values of the data before and after the communication. If the hash values match, the data has almost certainly not been altered; if they differ, the data has been tampered with or corrupted. Hashing is a better security control for this purpose than encrypting, running and comparing the count function, or hosting a digital certificate because:
Encrypting in-scope data sets can protect the confidentiality of the data, but not necessarily the integrity. Unless an integrity check is added, ciphertext can be altered or corrupted in transit without the change being detected, and encryption keys can be compromised or lost. Moreover, encryption adds overhead to the communication process and may affect the performance of the big data analytics system.
Running and comparing the count function within the in-scope data sets can only verify the number of records or elements in the data sets, not their content. The count function cannot detect changes or errors in the data values themselves, such as corrupted or manipulated fields, or a deleted record that is offset by a duplicated one.
Hosting a digital certificate for in-scope data sets can provide authentication and support non-repudiation for the data sources, but not integrity for the data itself. A digital certificate is a document that binds an identity, such as a person, organization, or device, to a public key. A digital certificate does not contain or verify the actual data that is communicated between production databases and a big data analytics system.
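The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the helper name `dataset_digest`, the sample rows, and the choice of SHA-256 are all assumptions made for the example, and it assumes rows arrive in the same order on both sides.

```python
import hashlib


def dataset_digest(rows):
    """Return a SHA-256 hex digest over an ordered sequence of rows.

    Hypothetical helper for illustration: each row is a tuple of fields.
    """
    h = hashlib.sha256()
    for row in rows:
        # Prefix each row with its field count and each field with its
        # byte length, so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(row).to_bytes(4, "big"))
        for field in row:
            data = str(field).encode("utf-8")
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()


# Made-up sample data standing in for a production extract.
source_rows = [(1, "alice", 100.0), (2, "bob", 250.5)]
received_ok = list(source_rows)
received_tampered = [(1, "alice", 100.0), (2, "bob", 950.5)]

# The record count is identical even though a value was changed.
counts_match = len(source_rows) == len(received_tampered)

# The digest comparison distinguishes the intact copy from the altered one.
digests_match_ok = dataset_digest(source_rows) == dataset_digest(received_ok)
digests_match_tampered = (
    dataset_digest(source_rows) == dataset_digest(received_tampered)
)

print("counts match:", counts_match)
print("digests match (intact copy):", digests_match_ok)
print("digests match (altered copy):", digests_match_tampered)
```

Here the count check passes for the altered copy while the digest check fails, which is the gap the count function leaves open. Note that a plain hash only helps if the reference digest reaches the receiver through a channel the attacker cannot also modify; where that cannot be assumed, a keyed hash such as HMAC is the usual alternative.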