Publicly accessible personal data under the DPDP Act: AI training and other public-source data uses

In our last paper in the Digital Personal Data Protection Series, we discussed the Significant Data Fiduciary framework and the need for proportionate notification design, particularly for enterprise IT and ITES businesses (link).

In this second paper, we turn to how the Digital Personal Data Protection Act, 2023 treats publicly accessible personal data, and what this means for AI training, social media content, and other public-source data use cases.

The treatment of public-web personal data for AI training is a live issue globally. In the EU and UK, legitimate-interest analysis may provide a possible route, but it is not an automatic permission to scrape and train. It requires a legitimate interest, necessity, balancing, transparency, and accountability. In the US, the position is more flexible, but also more context-specific. There is no single rule that decides whether public-web personal data may be used for AI training. The legal assessment may depend on sectoral privacy rules, state privacy laws, privacy policy commitments, confidentiality obligations, consumer protection standards, and the facts of how the data was collected and used.

The issue is live in India too, and the DPDP Act creates a more specific implementation challenge. The Act does not have a broad private-sector legitimate-interest ground comparable to the EU or UK. It allows processing only on the basis of consent or for the listed "certain legitimate uses". Its public-data exclusion is also narrow.

Section 3(c)(ii) is the key. It provides that the Act does not apply to personal data that is made or caused to be made publicly available by the Data Principal to whom the personal data relates, or by another person who is under an obligation under Indian law to make such personal data publicly available.

The effect of Section 3(c)(ii) is to keep such data outside the Act where the statutory condition is met. However, this is not a general public-internet exemption. The mere fact that personal data is visible, searchable, crawlable, indexed, or available through a third-party dataset does not mean that it is outside the DPDP Act.

The practical test is: Can the organisation reasonably explain why it concluded that the personal data was made public by the Data Principal, caused to be made public by the Data Principal, or made public under a legal obligation?

If yes, the public-data exclusion may apply. If no, the safer implementation position is to treat the data as in-scope personal data, unless another exemption or lawful basis applies.

This does not mean that an organisation must verify, for every item in a large dataset, that the relevant Data Principal made or caused it to be made public. It could, however, mean classifying sources, excluding high-risk or unauthorised sources, and recording why a dataset is being treated as excluded, in-scope, anonymised, consented, licensed or otherwise lawfully processed. A practical approach could be to classify publicly accessible personal data into three broad categories.

First, Data Principal-public data: Personal data that the individual has herself made public or clearly caused to be made public, such as a public post, public blog, public professional profile, public portfolio, or public repository profile.
Secondly, law-mandated public data: Personal data made public because a person or authority is legally required to publish it.
Thirdly, uncertain public-source data: Scraped datasets, cached pages, mirror sites, third-party directories, data broker files, enriched datasets, compiled profile datasets, leaked databases, or datasets where source and authority cannot be reasonably established.
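The three categories and the practical test above could be recorded in a simple source register. The following is only an illustrative Python sketch, not a statement of what the Act requires: the class names, field names and treatment strings are hypothetical, and it assumes that a source falls into the third category whenever no rationale has been recorded for treating it as public under Section 3(c)(ii).

```python
from dataclasses import dataclass
from enum import Enum


class SourceCategory(Enum):
    DATA_PRINCIPAL_PUBLIC = "Data Principal-public data"
    LAW_MANDATED_PUBLIC = "law-mandated public data"
    UNCERTAIN_PUBLIC_SOURCE = "uncertain public-source data"


@dataclass
class SourceRecord:
    """One entry in a hypothetical source register (all fields are assumptions)."""
    name: str
    made_public_by_data_principal: bool = False
    published_under_legal_obligation: bool = False
    rationale: str = ""  # why the organisation reached its conclusion


def classify(record: SourceRecord) -> SourceCategory:
    # Mirrors the practical test: a flag alone is not enough; the
    # organisation must also be able to explain its conclusion.
    explained = bool(record.rationale.strip())
    if record.made_public_by_data_principal and explained:
        return SourceCategory.DATA_PRINCIPAL_PUBLIC
    if record.published_under_legal_obligation and explained:
        return SourceCategory.LAW_MANDATED_PUBLIC
    return SourceCategory.UNCERTAIN_PUBLIC_SOURCE


def treatment(record: SourceRecord) -> str:
    # Uncertain sources default to the safer implementation position.
    if classify(record) is SourceCategory.UNCERTAIN_PUBLIC_SOURCE:
        return "treat as in-scope unless another exemption or lawful basis applies"
    return "Section 3(c)(ii) exclusion may apply; keep the recorded rationale"


blog = SourceRecord(
    name="public blog",
    made_public_by_data_principal=True,
    rationale="Author published the post on her own public blog",
)
broker_file = SourceRecord(name="data broker file")

for source in (blog, broker_file):
    print(source.name, "->", classify(source).value, "->", treatment(source))
```

The design choice worth noting is the default: anything that cannot be explained falls into the uncertain category rather than the excluded one.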

The third category is the most difficult. It is also highly relevant for AI training, search, enrichment, cybersecurity, fraud detection, recruitment intelligence, and analytics.

It is important to remember that the Act defines processing broadly, and training pipelines may involve scraping, collection, ingestion, cleaning, deduplication, labelling, embedding, indexing, storage, training, fine-tuning, retrieval, and logging. A model that is designed not to output personal data may reduce disclosure risk, but it does not by itself answer whether the input data and intermediate processing were lawful.

Anonymisation is a strong route where it removes the ability to identify or reasonably link the data back to an individual. Once data is no longer about an identifiable individual and cannot reasonably be linked back to one, it is not personal data for DPDP purposes. However, anonymisation is not a universal solution and may not be suitable for many contexts. It may reduce model quality, remove context-rich signals, weaken performance for under-represented groups, and limit use cases that require identity, individual-level context, fraud detection, cybersecurity, recruitment matching, personalisation, or safety enforcement.

Masking, hashing, tokenisation, encryption, embeddings, and pseudonymisation may be useful, but they are not necessarily anonymisation. If re-identification, linkage, reconstruction, or extraction is still reasonably possible, the data may still carry personal data risk.
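As a small illustration of why hashing alone may fall short of anonymisation, the Python sketch below hashes a fictitious mobile number with unsalted SHA-256 and then recovers it by enumerating candidates. The number, the known prefix and the function names are assumptions made for the example, and the search space is deliberately kept small so that it runs quickly; a real identifier space is larger but still finite.

```python
import hashlib
from typing import Optional


def pseudonymise(phone: str) -> str:
    """Replace a phone number with its unsalted SHA-256 digest."""
    return hashlib.sha256(phone.encode("utf-8")).hexdigest()


def reidentify(token: str, prefix: str, unknown_digits: int) -> Optional[str]:
    """Try every number with the given prefix until one hashes to the token."""
    for n in range(10 ** unknown_digits):
        candidate = f"{prefix}{n:0{unknown_digits}d}"
        if pseudonymise(candidate) == token:
            return candidate
    return None


# Fictitious number used only for illustration.
token = pseudonymise("9876543210")

# Anyone who can guess the format can enumerate candidates and
# link the "hashed" value back to the original number.
recovered = reidentify(token, prefix="987654", unknown_digits=4)
print(recovered == "9876543210")
```

Because the identifier is short and predictable in format, the digest can be linked back to the original value, so a dataset transformed this way may still carry personal data risk.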

Where Section 3(c)(ii) genuinely applies, the data is outside the DPDP Act. However, organisations should still be able to explain why they are processing the data. Purpose, context and risk remain important governance considerations, especially where public data is aggregated, enriched, profiled, retained at scale or used for AI training. A person may make information public in a social, professional, or expressive context without expecting unrestricted aggregation, enrichment, profiling, or model training in unrelated contexts.

Social media platforms occupy a distinct position. Where a user intentionally makes her own personal data public through a post, that data would fall within Section 3(c)(ii). However, platforms remain Data Fiduciaries for the personal data they process to provide the service, manage and secure accounts, operate privacy settings, process metadata, recommend content, serve advertising, moderate content, and enforce user choices.

A harder issue arises where public user-generated content contains personal data about individuals other than the user who posted it. Such third-party personal data should not automatically be treated as covered by the public-data exclusion, because the relevant Data Principal may not have made or caused the data to be made public. The issue is sharpened by the fact that the DPDP Act does not contain a broad private-sector legitimate-interest ground. Existing controls such as user rules, reporting, moderation, takedown and the IT Act framework can help manage misuse or harmful disclosure, but they do not by themselves resolve the DPDP processing-basis question. This appears to be a genuine implementation gap that merits Government review.

Legacy datasets raise a related implementation issue. The DPDP Act provides a transition for personal data for which the Data Principal had already given consent before the relevant provision comes into force. In such cases, the Data Fiduciary must provide notice as soon as reasonably practicable and may continue processing until consent is withdrawn. This is not a general validation mechanism for all legacy datasets. It applies only where prior consent existed. It does not cure uncertain provenance or automatically cover materially new uses such as AI training, enrichment, profiling, or onward commercial use where these were not within the original purpose.

For AI training, anonymisation, synthetic data and aggregation will be important tools, but they may not suit every use case without affecting quality, context, safety testing, fraud detection or identity-related functions. Public user-generated content containing other people's details is an even more routine scenario. Platform controls can help manage misuse, but they may not fully resolve the processing-basis question. These examples point to the need for practical implementation approaches that preserve the DPDP Act's privacy objectives. Industry and Government will need to navigate this together, with the interests of users at the centre. The paper is attached.

#DPDPAct #dataprotection #AIGovernance #PublicData #PrivacyCompliance


ashish.aggarwal