Background
The Department for Promotion of Industry and Internal Trade (DPIIT), which functions under the Ministry of Commerce and Industry, has published a working paper on the use of copyrighted material as input for AI training. The committee was tasked with assessing whether the current copyright framework adequately addresses the issues raised by generative AI models or whether amendment is needed. The policy question becomes significant once we examine how AI models are actually trained.
Large language models such as OpenAI's ChatGPT, Claude, and Gemini are trained on copyrighted data using Text and Data Mining (TDM) techniques. These are techniques that automatically extract meaningful text, information, patterns, and insights from large volumes of unstructured data.
AI Training and Text Data Mining (TDM)
| Aspect | Description |
|---|---|
| Training Input | Large volumes of copyrighted and non-copyrighted textual data |
| Technique Used | Text Data Mining (TDM) |
| Purpose | Extraction of patterns, information, and insights from unstructured data |
| Examples of Models | ChatGPT, Claude, Gemini |
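As a toy illustration of what TDM means in practice (and not of how commercial LLMs are actually trained), the sketch below extracts simple term-frequency patterns from unstructured text using only the Python standard library. The corpus and function name are invented for the example.

```python
import re
from collections import Counter

def mine_text(documents, top_n=3):
    """Toy text-and-data-mining step: tokenize unstructured text
    and surface the most frequent terms as a crude 'pattern'."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(t for t in tokens if len(t) > 3)  # skip short words
    return counts.most_common(top_n)

corpus = [
    "Copyright protects expression, not ideas.",
    "AI models learn patterns of expression from text.",
]
print(mine_text(corpus))  # 'expression' ranks first (appears twice)
```

Real TDM pipelines operate at vastly larger scale and extract far richer structure, but the core move is the same: patterns are derived from the text without the text itself being redistributed.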
Even data-protection techniques like Technological Protection Measures (TPMs) are ineffective against TDM, because TPMs control actions such as viewing, copying, downloading, or printing. Digital Rights Management (DRM) measures of the kind used by Netflix or Amazon Kindle prevent users from pirating content; an AI model, however, is not pirating or storing the data but learning from it in real time. This technological gap shifts the question from access control to whether such learning itself amounts to copyright infringement.
This raises the question of copyright violation against authors and creators of literary works. These models are trained on expressions, not merely ideas, and copyright protects expression. Generative AI models can produce content that substitutes for books, articles, and the like. To address this concern, DPIIT has proposed a solution: the "Hybrid Model."
The Hybrid Model
In the working paper, DPIIT proposes a "Hybrid Model": statutory blanket licensing coupled with a remuneration right. Creators will not be able to withhold their works from use in training AI systems.
Copyright Royalties Collective for AI Training (CRCAT)
The Copyright Royalties Collective for AI Training (CRCAT), a non-profit entity, will be created by associations of rightsholders and Collective Management Organizations (CMOs), and will be designated by the central government under the Copyright Act, 1957 to collect royalties and distribute them to authors and creators.
Membership of CRCAT
Collective Management Organizations (CMOs) formed by rightsholders, along with copyright societies, will be members of CRCAT. CRCAT will be the governing body responsible for safeguarding the collection, administration, and distribution of royalties. A committee formed by the central government, the "Rate Setting Authority," will determine the royalties.
Royalty Setting and Distribution
Royalties are decided by the Rate Setting Committee, which consists of senior government officers, senior legal experts, financial or economic experts, technical experts, a member from CRCAT, and a representative of AI developers. Distribution is based on a flat-rate model: a fixed percentage of the gross global revenue earned by AI developers from the AI system. Payment is due only after the AI model generates revenue, and it is payable annually.
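The flat-rate mechanism can be sketched numerically. The working paper does not fix a rate or cite revenue figures, so the percentage and dollar amounts below are entirely hypothetical, chosen only to show how the annual pool would be computed.

```python
def annual_royalty_pool(gross_global_revenue, flat_rate):
    """Flat-rate model: the royalty pool is a fixed percentage of the
    AI developer's gross global revenue from the AI system, payable
    annually and only once revenue is actually generated."""
    return gross_global_revenue * flat_rate

# Hypothetical figures: a developer earning $200M under a 2% rate.
pool = annual_royalty_pool(200_000_000, 0.02)
print(pool)  # 4000000.0
```

Note that the pool is a function of the developer's total revenue, not of how much any individual work contributed to the model, which is the root of the attribution problems discussed below.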
Structural Flaws in the Proposed Hybrid Model of Copyright Remuneration
- Mandatory Licensing Without Opt-Out Weakens the Proprietary Nature of Copyright
Copyright is exclusive by nature, but under the Hybrid Model it becomes a statutory entitlement to uncertain remuneration. Rightsholders cannot refuse the use of their works; consent is replaced with compulsion and mandatory inclusion, and compensation is delayed. Together, these weaken the proprietary value of copyrighted material.
- Royalty Distribution Inequality
CRCAT will distribute royalties through CMOs and copyright societies. The critical flaw lies in the valuation process: a single investigative report by a small outlet will be diluted by large media houses with massive archives. The sheer size of those archives will benefit the large houses disproportionately, undermining the stated goal of protecting small creators.
- Structural Flaws in the Revenue Attribution System
A royalty system works best when use of the copyrighted material is traceable. AI developers fund the royalty pool through a flat fee, but how will royalties be allocated to specific copyrighted works when a model is trained on billions of heterogeneous data points? Developers are not obliged to disclose their full training data sources, citing trade secrets. Royalty distribution therefore becomes statistical guesswork rather than rights-based compensation, and royalties ultimately amount to a tax on AI companies for using the works rather than compensation to the authors.
- Absence of Detailed Usage Monitoring
The Hybrid Model explicitly avoids dataset-level transparency in the name of innovation and trade secrets. Rightsholders have no way to verify how extensively their works were used. For example, if two works were used to train a model and work "A" was used ten times more than work "B," there is no way to verify this, and creators cannot challenge underpayment because they cannot verify usage. The result is systematic underpayment of creators.
- Governance and Risk of Institutional Capture
The CRCAT, a nominally centralized non-profit of designated copyright societies and CMOs under government oversight, faces a significant risk of institutional capture. Since CMOs largely represent major publishers and music labels, their likely dominance within CRCAT's internal committees could lead to undue influence. This control would allow them to unilaterally set vital operational parameters, such as royalty allocation and administrative deduction rates.
- Incentive Misalignment
Copyright fundamentally rewards the work itself: its originality, investment, and effort. The Hybrid Model, by contrast, rewards archive size, historical dominance, and registration volume rather than quality. In the long term, the quality of work will degrade, because royalties are detached from actual value creation.
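The attribution and underpayment critiques above can be made concrete with a hypothetical split of a royalty pool. Under an archive-size split, payouts track how much a rightsholder has registered; under a usage-weighted split, they would track how heavily each work was actually used in training (the "A used ten times more than B" scenario). All names and numbers below are invented for illustration.

```python
def distribute(pool, shares):
    """Split a royalty pool pro rata over a dict of relative shares."""
    total = sum(shares.values())
    return {name: pool * share / total for name, share in shares.items()}

pool = 1_000_000  # hypothetical annual royalty pool in dollars

# Archive-size split: the large house's archive dwarfs the small
# outlet's, regardless of which material the model actually relied on.
by_archive = distribute(pool, {"large_media_house": 99, "small_outlet": 1})
print(by_archive)  # {'large_media_house': 990000.0, 'small_outlet': 10000.0}

# Usage-weighted split: work A was actually used 10x more than work B,
# but without dataset-level disclosure this split cannot be computed.
by_usage = distribute(pool, {"work_A": 10, "work_B": 1})
print(by_usage)  # work_A receives roughly 10/11 of the pool
```

The gap between the two dictionaries is precisely what rightsholders cannot audit when developers withhold training-data disclosures.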
Conclusion
The working paper is open for responses, and the Hybrid Model may not be the final mechanism. The model does solve the lawful data-access problem for AI developers by introducing blanket licensing, but it undermines the core principles of copyright, namely exclusivity in the use and reproduction of the work, and moral rights. A redesign that addresses these issues could make the model work: a reasonable opt-out mechanism that narrows the scope for AI bias and preserves the value of copyrighted works, combined with full disclosure of training datasets so that creators and authors are fairly compensated, would also preserve moral rights and creators' dignity. This would let the royalty model operate at its most efficient, rather than disproportionately rewarding the sheer mass of large archives. The flat-rate system also fails to account for the global AI market and legal disputes around the globe, exposing developers to double payment for the same datasets.


