Data Anonymization. What is it? How and when is it required?
- Darina Dragouleva

- 2 days ago

Researchers constantly collect, store and analyze data, especially in healthcare and the social sciences. This data often includes highly sensitive personal information that needs to be protected. At the same time, sharing data and collaborating on it are essential for structured research, innovation, and breakthroughs. Data anonymization balances these needs: it aims to preserve the utility of data for analysis while protecting individuals’ privacy rights.
What is data anonymization?
Data anonymization is the process of removing personally identifiable information from a dataset. After this process, individuals should no longer be identifiable from the data, and the removed identifiers should not be recoverable. Anonymization matters to both participants and researchers. It protects participants from unauthorized identification and misuse of their sensitive information. For researchers, it facilitates ethical data sharing, compliance with privacy regulations and safer collaboration across institutions.
Removing names alone is often not enough. Even details as simple as age and gender can identify a participant when combined with other aspects of the record, particularly in a remote region with a small population. Imagine a dataset that includes the record “male, age 32, diagnosed with Huntington’s disease”. There may be only one man in that town with this rare diagnosis, making it easy for someone to deduce his identity. This is why every field that could point to an individual, not just the obvious identifiers, needs to be removed or made less specific.
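The small-population problem can be checked mechanically. The following is a minimal Python sketch, assuming a hypothetical list of records and hypothetical field names (`sex`, `age`, `town`, `diagnosis`); it counts how many records share each combination of values and flags the combinations that occur only once, since those records are the easiest to tie back to one person.

```python
from collections import Counter

# Hypothetical records with names and contact details already removed.
records = [
    {"sex": "male", "age": 32, "town": "A", "diagnosis": "Huntington's disease"},
    {"sex": "male", "age": 32, "town": "B", "diagnosis": "asthma"},
    {"sex": "male", "age": 32, "town": "B", "diagnosis": "asthma"},
    {"sex": "female", "age": 45, "town": "B", "diagnosis": "asthma"},
]

def unique_combinations(rows, fields):
    """Return the field-value combinations that appear in exactly one row."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]

# Combinations listed here describe a single participant each.
risky = unique_combinations(records, ["sex", "age", "town", "diagnosis"])
print(risky)
```

Any combination reported by this check is a candidate for suppression or generalization before the dataset is shared.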
How is data anonymized?

There is no single method of data anonymization. The process often combines multiple techniques, depending on the data’s sensitivity, size and intended use. Common methods include:
Suppression - the removal of any identifying variables such as the participant’s name, address or phone number.
Generalization - making data less specific by using broader categories such as income ranges instead of exact salary.
Masking - hiding part of a value behind placeholder characters, such as “xxxx@gmail.com”.
Aggregation - reporting summary values for groups of participants, such as the average income for a region, instead of individual records.
Data Perturbation - adding small random changes to values to mask the original data points without affecting the overall trends.
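The methods listed above can be sketched in a few lines of Python. This is an illustrative sketch only, using the standard library; the records, field names, range width and noise scale are all assumptions made for the example, and a real project would choose them based on the dataset and its privacy requirements.

```python
import random
import statistics

# Hypothetical participant records; every field name here is illustrative.
participants = [
    {"name": "Ana Diaz", "email": "ana.diaz@example.com", "region": "North", "income": 52000},
    {"name": "Ben Cole", "email": "ben@example.com", "region": "North", "income": 61000},
    {"name": "Cy Wong", "email": "cy.wong@example.com", "region": "South", "income": 87000},
]

def suppress(row, fields=("name",)):
    """Suppression: drop identifying fields entirely."""
    return {k: v for k, v in row.items() if k not in fields}

def generalize_income(income, width=25000):
    """Generalization: replace an exact value with the range it falls in."""
    low = (income // width) * width
    return f"{low}-{low + width - 1}"

def mask_email(email):
    """Masking: hide the local part of an address, keep the domain."""
    local, _, domain = email.partition("@")
    return "x" * len(local) + "@" + domain

def aggregate_income(rows):
    """Aggregation: report one mean per region instead of individual rows."""
    by_region = {}
    for row in rows:
        by_region.setdefault(row["region"], []).append(row["income"])
    return {region: statistics.mean(values) for region, values in by_region.items()}

def perturb(value, scale=1000, rng=random):
    """Perturbation: add small zero-mean random noise to a numeric value."""
    return value + rng.gauss(0, scale)

rng = random.Random(0)  # fixed seed so repeated runs give the same noise
for person in participants:
    row = suppress(person)
    row["email"] = mask_email(row["email"])
    row["income_range"] = generalize_income(person["income"])
    row["income"] = round(perturb(person["income"], rng=rng))
    print(row)
print(aggregate_income(participants))
```

In practice these techniques are combined: a shared dataset might suppress names, mask contact details and generalize income, while published summaries rely on aggregation.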
When is Data Anonymization required?
Anonymization is used when the data must not maintain a connection to identifiable individuals. Common scenarios include:
Legal and Regulatory Requirements - protecting an individual's privacy rights is no longer just an ethical responsibility but a legal requirement for organizations. Anonymizing data is encouraged because it reduces the legal obligations and risks associated with processing personal data.
Examples of these regulations include:
California Consumer Privacy Act - allows companies to use de-identified or aggregated data for research and analytics, provided certain requirements are met. These include adopting reasonable measures to prevent re-identification of the data.
Personal Information Protection and Electronic Documents Act - establishes guidelines for businesses to manage personal information responsibly while allowing them to conduct their commercial activities.
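Health Insurance Portability and Accountability Act - permits health information to be used and shared without patient authorization once it has been de-identified, either by removing a specified list of identifiers (the Safe Harbor method) or through a formal Expert Determination.
General Data Protection Regulation - sets a strict standard for anonymization. Data counts as anonymous, and falls outside the regulation, only if individuals can no longer be identified by any means reasonably likely to be used.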
Research and Analytics - researchers remove identifiers from datasets to examine trends and patterns while avoiding privacy breaches.
External Data Sharing - anonymization is often required in order to share datasets publicly or with third parties such as collaborators, industry partners and journals, unless there is a clear legal basis to share identifiable data.
Challenges in Anonymization
Although data anonymization is a crucial process, it is not always straightforward, and several challenges commonly arise. There is a risk that attackers re-identify anonymized data by linking it with auxiliary data from other sources. Over-anonymizing data, on the other hand, can reduce its usefulness and weaken its value for analysis. Practices can also be inconsistent across departments, so the same kind of data may be protected to different standards.
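To illustrate the re-identification risk, the following Python sketch shows how a released dataset without names could be linked to auxiliary data that does contain names. All records, field names (`postcode`, `birth_year`, `sex`) and the `link` helper are hypothetical and exist only for this example.

```python
# Hypothetical "anonymized" release: names removed, other fields kept.
released = [
    {"postcode": "N6A", "birth_year": 1993, "sex": "M", "diagnosis": "Huntington's disease"},
    {"postcode": "N6A", "birth_year": 1988, "sex": "F", "diagnosis": "asthma"},
]

# Hypothetical auxiliary data an attacker might hold, such as a public directory.
auxiliary = [
    {"name": "Person A", "postcode": "N6A", "birth_year": 1993, "sex": "M"},
    {"name": "Person B", "postcode": "N5X", "birth_year": 1988, "sex": "F"},
]

def link(released_rows, auxiliary_rows, keys):
    """Match released rows to named auxiliary rows on shared fields."""
    index = {}
    for aux in auxiliary_rows:
        index.setdefault(tuple(aux[k] for k in keys), []).append(aux["name"])
    matches = []
    for row in released_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match re-identifies the record
            matches.append((names[0], row["diagnosis"]))
    return matches

reidentified = link(released, auxiliary, ["postcode", "birth_year", "sex"])
print(reidentified)
```

Here one released record matches exactly one named person, so the diagnosis is exposed even though the release contained no names. Generalizing the shared fields (for example, a birth decade instead of a birth year) makes such matches less likely, at the cost of some analytical detail.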
Anonymization Tools
Many tools can be used by researchers to address these challenges and support effective anonymization and de-identification.
Some of these tools include:
John Snow Labs - advanced automated de-identification of sensitive data geared towards clinical, legal, financial and other text-heavy datasets.
EviData - a new service which evaluates whether the dataset meets privacy standards and provides an auditable report with evidence of the dataset’s privacy.
Amnesia - an open-source tool that transforms personal data into anonymized data through various methods such as generalization and suppression.
Informatica Persistent Data Masking - allows researchers to apply masking techniques that permanently replace the original data.
IBM InfoSphere Optim - replaces personal information with realistic and functional masked data that remains useful for testing, analytics and research.
How myLaminin supports anonymization
myLaminin is a Research Administration and Research Data Management platform that supports the full lifecycle of research data, from initial collection through to long-term archiving and publication. It provides a secure environment for data collaboration with role-based access controls that protect PHI and PII. Researchers can restrict access to identifiable information while sharing anonymized datasets with colleagues who do not require sensitive details for their analysis. myLaminin solves the governance and workflow issues often associated with the process through:
Secure Sharing - Only authorized members can access PII data. Datasets can be shared with external partners or published within myLaminin. This allows for controlled access and sharing of insights without exposing individual identities.
Real-time Audit Trails - With myLaminin, all actions taken within the research project by any member are logged, including who accessed the data, how it was altered and when it was anonymized. This audit trail is searchable and sortable, allowing users to easily understand who did what and when. Researchers can verify that personal data has been properly anonymized in line with protocols. Tracking modifications also helps prevent unauthorized use or accidental exposure of sensitive data, because any unusual activity can quickly be identified and addressed.
Regulatory Compliance - Privacy regulations, such as HIPAA, PIPEDA, and GDPR, have been implemented worldwide to regulate the collection, handling and storage of personal data. They require organizations to monitor and document data handling. By maintaining detailed logs and enforcing anonymization, myLaminin helps researchers and institutions to comply with these privacy laws.
Final thoughts
Data anonymization is a key process in maintaining individuals’ privacy while supporting collaboration, innovation and analysis. It is a powerful technique that reduces the likelihood that unauthorized parties can identify individuals and exploit their data. With myLaminin, researchers have a secure, auditable, and compliant environment that makes the responsible handling and eventual anonymization of sensitive research data straightforward.
__________________________________

Darina Dragouleva (article author) is a myLaminin intern studying Health Sciences and Ivey AEO at Western University.




