πŸ“š Philosophy & Principles

Your Social Media Data is Priceless for AI Training

Social media platforms abound with high-quality, user-generated content, making them a treasure trove for training large language models (LLMs). Yet platforms such as X, Snapchat, Instagram, TikTok, YouTube, Reddit, and others ban data scraping for AI training via their robots.txt files. This restriction underscores a growing dispute over data ownership for AI training and presents a major hurdle to AI advancement. As social media platforms tighten control over the use of user-generated content as training data, the development of large AI models faces unprecedented challenges that stifle further progress.

Bans on data scraping for AI training via robots.txt

Web2 Giants Exploit User Data Rights

Web2 corporations infringe on user rights by selling user data to AI companies without sharing any of the profits with the users who generated it.

For example, Reddit struck a $60m deal with Google that lets the search giant train AI models on its users' posts.

PublicAI - Web3 Data Ownership Revolution

Core Principle: User-generated, AI-learned, Benefits the user

At the heart of Web 3.0 lies a groundbreaking principle: the data produced by individuals is owned by them. This allows users to exercise control over whether their data can be utilized for training AI models, ensuring they reap benefits from such contributions. This concept is central to PublicAI's philosophy: User-generated, AI-learned, Benefits the user.

Integrating PublicAI with Social Media

AI builders have the option to link their blockchain digital identities with their social media profiles in PublicAI. This linkage enables them to grant PublicAI permission to use their social media data in AI training. In return for their contribution, users receive token incentives from the protocol, promoting a fair and reciprocal relationship between data producers and AI development.
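As a rough illustration, the sketch below shows what such an opt-in record could look like. The DataPermission class, its fields, and the reward figure are illustrative assumptions, not the actual PublicAI protocol or API.

```python
from dataclasses import dataclass

# Hypothetical record linking a wallet address to a social media profile.
# Field names and the reward logic are illustrative assumptions, not the
# actual PublicAI protocol.
@dataclass
class DataPermission:
    wallet_address: str              # contributor's blockchain identity
    platform: str                    # e.g. "x", "reddit", "youtube"
    profile_handle: str              # social media handle being linked
    training_allowed: bool = False   # explicit opt-in for AI training
    tokens_earned: float = 0.0       # incentives accrued for contributions

    def grant(self) -> None:
        """Contributor explicitly opts in to AI training on their data."""
        self.training_allowed = True

    def reward(self, amount: float) -> None:
        """Protocol credits tokens once the contributed data is accepted."""
        if self.training_allowed:
            self.tokens_earned += amount

# Example: a user links an X handle to their wallet and opts in.
permission = DataPermission("0xAbc...123", "x", "@alice")
permission.grant()
permission.reward(5.0)
print(permission.tokens_earned)  # 5.0
```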

Reinforcement Learning from Human Feedback (RLHF)

At the heart of PublicAI’s innovation is the integration of Reinforcement Learning from Human Feedback (RLHF) principles. RLHF represents a cutting-edge approach to AI training, where models are refined based on quality human feedback rather than simple data ingestion. By harnessing a global, decentralized network, PublicAI aggregates diverse and rich feedback, accelerating the learning curve of AI models through nuanced human insights. This method not only enhances model accuracy but also aligns AI behaviors with human values and expectations.
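To make the idea concrete, the sketch below shows the standard pairwise-preference signal at the core of RLHF reward modeling: annotators pick the better of two model responses, and the reward model is penalized whenever the rejected response scores higher. The scores and feedback records are toy values for illustration, not PublicAI's actual pipeline.

```python
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style pairwise loss: low when the chosen response
    already out-scores the rejected one."""
    return -math.log(1 / (1 + math.exp(-(score_chosen - score_rejected))))

# Feedback collected from a pool of human annotators (toy values).
feedback = [
    {"chosen_score": 2.1, "rejected_score": 0.4},  # clear preference
    {"chosen_score": 0.9, "rejected_score": 1.2},  # reward model still wrong here
]

batch_loss = sum(preference_loss(f["chosen_score"], f["rejected_score"])
                 for f in feedback) / len(feedback)
print(f"mean preference loss: {batch_loss:.3f}")
```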

Train-AI-To-Earn with Cryptocurrency

PublicAI is leading the creation of a Web3 Train-AI-To-Earn ecosystem. This initiative focuses on using decentralized networks to improve AI training by emphasizing dataset validation. This method significantly enhances the quality and effectiveness of AI models by ensuring the integrity of the training data.

PublicAI BFT Data Consensus Algorithm

Integrating Byzantine Fault Tolerance (BFT) principles, PublicAI's platform secures data validation and preserves its integrity, countering adversarial threats and bolstering consensus reliability. The platform employs a BFT-based dynamic voting mechanism, adjusting the number of validators and the consensus threshold according to the data's complexity and sensitivity, as assessed by AI. Validators and data contributors earn tokens once consensus is reached, promoting engagement. The BFT consensus algorithm strengthens data validation, maintaining accuracy in the presence of Byzantine faults and keeping the system resilient against malicious attacks. Validators cross-validate data, marking it as "Authentic," "Helpful," or "Unknown" for novel insights, thereby reinforcing the knowledge base.
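The sketch below illustrates a BFT-style tally consistent with the description above. The "Authentic," "Helpful," and "Unknown" labels come from the text; the two-thirds supermajority and the complexity-based validator count are assumptions for illustration, not PublicAI's actual parameters.

```python
from collections import Counter

def required_validators(complexity: float, base: int = 4) -> int:
    """More complex or sensitive data is assigned more validators
    (assumed scaling rule, for illustration only)."""
    return base + int(complexity * 6)   # e.g. complexity 0.5 -> 7 validators

def reach_consensus(votes: list[str], threshold: float = 2 / 3) -> str | None:
    """Return the winning label if it holds a supermajority, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= threshold else None

# Validators cross-validate a submission and cast labels.
votes = ["Authentic", "Authentic", "Authentic", "Unknown", "Authentic"]
result = reach_consensus(votes)
if result is not None:
    print(f"Consensus reached: {result}; contributors and validators are rewarded.")
else:
    print("No supermajority; the data is not accepted into the training set.")
```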

Discussion: Updating Large Language Models (LLMs) with Internet Data

In today's rapidly evolving information landscape, keeping large language models (LLMs) up-to-date with the latest internet content is a significant challenge. The prevailing method involves fine-tuning models on specific internet data to perform inference tasks. This approach is more cost-effective than full retraining, as it does not involve training the model from scratch. However, it has limitations, including a performance ceiling and an inability to guarantee the correctness of pulled content. Notably, robots.txt restrictions prevent GPT models from accessing content on most social media platforms, a crucial drawback. Currently, there are only two ways to address these issues: official updates and fine-tuning.

Official Updates

The official method for updating LLMs involves retraining the entire model on newly collected data, which consumes significant computational resources. This process is so resource-intensive that it is only undertaken during major version upgrades, such as the transition from GPT-4 to GPT-4 Turbo, which moved the training data cutoff from September 2021 to April 2023.

Fine-Tuning

Fine-tuning LLMs with internet data offers a less resource-intensive alternative. By pulling specific content from the internet, this method allows models to infer new information without comprehensive retraining. It's a widespread approach due to its cost-effectiveness, but it faces limitations in the quality and timeliness of the information it can incorporate, especially from social media.
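For illustration, the sketch below fine-tunes a small open causal language model on a couple of freshly collected text snippets using the Hugging Face transformers library. The model choice, toy data, and hyperparameters are placeholders, not a prescribed PublicAI workflow.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Newly collected content (in practice: validated social media posts).
new_texts = [
    "PublicAI rewards contributors whose data passes validator consensus.",
    "Fine-tuning refreshes a model's knowledge without full retraining.",
]

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

for text in new_texts:
    batch = tokenizer(text, return_tensors="pt")
    # For causal LM fine-tuning, the labels are the input ids themselves.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print("fine-tuning pass complete; last loss:", float(outputs.loss))
```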

Conclusion

As we navigate the challenges of keeping LLMs updated with the most recent information, it's clear that both methodologies have their strengths and weaknesses. Official updates provide comprehensive, quality-controlled data at the expense of time and resources, while fine-tuning offers a nimble but potentially less reliable way to refresh a model's knowledge base. Addressing the shortcomings of these approaches, particularly the access limitations imposed by social media platforms, is crucial for the future development and utility of LLMs.
