Red Pajama 2: The Public Dataset With a Whopping 30 Trillion Tokens
Together, the dataset's developer, claims it is the largest public dataset built specifically for language model pre-training.
RedPajama Reproducing LLaMA🦙 Dataset on 1.2 Trillion Tokens, by Angelina Yang
[2311.17035] Scalable Extraction of Training Data from (Production) Language Models
Top 10 List of Large Language Models in Open-Source
RedPajama training progress at 440 billion tokens
RedPajama's Giant 30T Token Dataset Shows that Data is the Next Frontier in LLMs
RedPajama Project: An Open-Source Initiative to Democratizing LLMs - KDnuggets
RLHF: Reinforcement Learning from Human Feedback
Together AI releases RedPajama-Data-v2 dataset, Aleksa Gordić posted on the topic