Subnet 63

Finest Data

Finest Data builds a high-quality, large-scale pretraining dataset for LLMs using Bittensor.

Subnet: SN63 : Alpha Trade Exchange
Description: Inference verification & optimization
Category: Generative AI
Company: Manifold

Finest Data (Subnet 63) starts from the observation that the performance of large language models (LLMs) is deeply tied to the quality and scale of their pretraining data. The datasets behind leading open-weight LLMs such as LLaMA 3 and Mixtral remain undisclosed, and little is known about how they were constructed. FineWeb, a new large-scale dataset built from 96 CommonCrawl snapshots, contains 15 trillion tokens (44 TB on disk) and has outperformed other open pretraining datasets.

The subnet applies the same processing pipeline behind FineWeb to build an even larger, higher-performing dataset, which is then further refined through the decentralized Bittensor network to improve quality and scalability.

The Finest Data subnet employs an optimized mechanism for dataset creation, consisting of two primary roles:

Miners: generate refined datasets from raw crawled data.

Validators: evaluate miners' performance and ensure the quality of the datasets produced.

Both miners and validators are rewarded with TAO based on their scores and trust within the network.
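As a toy illustration of score-and-trust-weighted rewards, the split could be sketched as below. The formula, names, and numbers are assumptions for illustration only; Bittensor's actual TAO emission is governed by its consensus mechanism, not by this arithmetic.

```python
# Toy sketch: split a TAO budget in proportion to score * trust.
# Illustrative only -- not Bittensor's real emission logic.

def split_rewards(participants, total_tao):
    """Return each participant's share of total_tao, weighted by score * trust."""
    weights = {name: score * trust for name, (score, trust) in participants.items()}
    total = sum(weights.values())
    return {name: total_tao * w / total for name, w in weights.items()}

participants = {
    "miner_a": (0.9, 0.8),  # (score, trust) -- hypothetical values
    "miner_b": (0.6, 0.9),
    "miner_c": (0.3, 0.5),
}
rewards = split_rewards(participants, total_tao=10.0)
```

A higher score does not guarantee a higher payout on its own; low network trust discounts it, which is the intended incentive.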

Main Mechanism of the Subnet

Miners receive tasks from the task server via the task retrieval API. This server manages and organizes tasks, primarily splitting the CommonCrawl data and tracking each miner's status. Once a miner completes its task, it uploads the refined dataset to its Hugging Face repository and submits the commit, including the Hugging Face URL, to the blockchain.
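The miner loop above can be sketched as follows. Every function body here is a placeholder: the task-server API, the refinement step, and the commit format are assumptions, not the subnet's actual interfaces.

```python
# Hypothetical sketch of the miner flow: fetch a CommonCrawl shard,
# refine it, upload to Hugging Face, and produce an on-chain commit.

def fetch_task(task_api_url):
    # In practice: an HTTP call to the task retrieval API, which hands
    # out a CommonCrawl shard and records the miner's status.
    return {"task_id": 42, "shard": "raw web text ... raw web text"}

def refine(raw_text):
    # Placeholder for the FineWeb-style filtering/deduplication pipeline.
    return raw_text.replace("...", "").strip()

def upload_to_hf(dataset, repo):
    # Placeholder for a Hugging Face dataset upload; returns the URL.
    return f"https://huggingface.co/datasets/{repo}"

def mine_once(task_api_url, repo):
    task = fetch_task(task_api_url)
    refined = refine(task["shard"])
    url = upload_to_hf(refined, repo)
    # The commit (including the HF URL) is what gets submitted on-chain.
    return {"task_id": task["task_id"], "hf_url": url}

commit = mine_once("https://example.org/tasks", "example-miner/refined-cc")
```

The key design point is that only the lightweight commit (task ID plus dataset URL) goes on-chain; the heavy dataset itself lives on Hugging Face.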

Validators check miners' commits every few blocks, retrieve the new submissions, and evaluate both the elapsed time and the quality of the resulting dataset. Validators then assign on-chain weights to miners according to these scores.
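A hedged sketch of how a validator might turn (quality, elapsed time) pairs into normalized weights is shown below. The scoring formula, the time budget, and the 70/30 quality-versus-speed split are assumptions for illustration, not the subnet's published scoring rule.

```python
# Assumed scoring: reward dataset quality, discounted by how long
# the miner took relative to a time budget.

def score(quality, elapsed_s, time_budget_s=3600.0):
    speed = max(0.0, 1.0 - elapsed_s / time_budget_s)
    return quality * (0.7 + 0.3 * speed)

def assign_weights(submissions):
    """Map miner UID -> normalized weight (weights sum to 1)."""
    scores = {uid: score(q, t) for uid, (q, t) in submissions.items()}
    total = sum(scores.values())
    return {uid: s / total for uid, s in scores.items()}

weights = assign_weights({
    1: (0.92, 600.0),   # (dataset quality, seconds elapsed) -- made up
    2: (0.80, 1800.0),
    3: (0.40, 3000.0),
})
```

Normalizing the weights matters because on-chain weight vectors express relative, not absolute, preference among miners.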

Dataset Evaluation Method

Validators train a small model on the miner's dataset and assess quality via the trained model's accuracy: if the model performs well, the dataset is likely of high quality, while poor performance suggests the dataset is suboptimal.
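The idea of "train a small model, score the data by how well the model performs" can be mirrored with a deliberately tiny stand-in: a unigram language model with Laplace smoothing, scored by average negative log-likelihood on a fixed held-out text (lower is better). The real subnet trains an actual small LLM and uses accuracy benchmarks; this sketch only demonstrates the evaluation principle.

```python
import math
from collections import Counter

# Fixed held-out evaluation text (hypothetical benchmark).
HELD_OUT = "the quick brown fox jumps over the lazy dog".split()

def avg_nll(train_text, held_out=HELD_OUT):
    """Train a Laplace-smoothed unigram LM on train_text and return its
    average negative log-likelihood on the held-out tokens."""
    counts = Counter(train_text.split())
    vocab = set(counts) | set(held_out)
    total = sum(counts.values()) + len(vocab)  # +1 smoothing per vocab word
    return -sum(math.log((counts[w] + 1) / total) for w in held_out) / len(held_out)

clean = "the quick brown fox jumps over the lazy dog " * 20   # on-domain data
noisy = "zzz qqq spam spam click here buy now " * 20          # junk data

# A dataset that matches the evaluation domain yields a lower loss,
# so the validator would score the "clean" miner higher.
```

The same logic scales up: the trained model acts as a proxy, so validators never have to inspect the dataset directly, only the model it produces.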
