Harvest: How We Turn Petabytes of Video into Training Data
Jan 15, 2026
You wouldn't train a world-class barista on instant coffee. The same principle applies to AI: the quality of training data shapes the capability of the model.
Raw video comes in wildly different grades. Some sources are specialty-grade with clean encoding, clear scene boundaries, and consistent caption metadata. Others are commodity-grade with defects, compression artifacts, and inconsistencies.
The challenge isn't just acquiring data at scale; it's curating and processing heterogeneous sources into a consistent, training-grade blend.
We built Harvest to do exactly that.
The Opportunity
When a partner network notified us in December 2025 that over a petabyte of video needed to be collected, we had a narrow window to acquire it. The dataset was distributed across hundreds, if not thousands, of sources, and we needed infrastructure that could complete the acquisition before a February deadline.
Think of it like sourcing beans from every farm across multiple continents, except each farm uses its own grading system, has different harvest windows, and some don't even have reliable shipping addresses.
Quality varies wildly. Some lots are specialty-grade with HD resolution and clean metadata, while others arrive with incomplete archives, inconsistent naming, or mixed file formats. Now imagine doing that for over a petabyte of video, and then categorizing every clip so researchers can search by tasting notes:
"Find me all grande jeté ballet jumps" or "drone shots of urban environments at night."
[Dashboard snapshot: Volume statistics. All-time: 239.56 TiB downloaded / 219.94 TiB uploaded (ratio 0.92). Today: ↓ 3.5 TB / ↑ 3.5 TB. 7-day average rates: ↓ 11.5 TB / ↑ 10.8 TB per day; ↓ 344.3 TB / ↑ 324.6 TB per month; ↓ 4189.45 TiB / ↑ 3949.71 TiB per year. Queue: 0 draft, 11K pending, 38 downloading, 238 uploading, 16K uploaded, 737 failed.]
Building Harvest
We built an end-to-end system that transforms raw video into training-grade data. Harvest covers seven critical stages:
Sourcing – Discovering where data lives across distributed sources, each with its own access patterns and grading schemas
Picking – Managing access requests, coordinating acquisition sequences, and building resilient job systems with retry logic and checkpointing for graceful failure handling
Transporting – Moving large quantities of raw data to high-bandwidth processing infrastructure geolocated for maximum throughput
Processing – Scene detection, clip splitting, and dense captioning that transforms raw footage into training-ready clips
Storing – Moving processed clips to secure, zero-egress storage in strategic geolocations with high bandwidth availability
Delivering – Provisioning secure access keys to partners with controlled, auditable distribution
Cataloging (forthcoming) – Indexing every clip for natural-language search, making it easy to discover and assemble targeted fine-tuning blends
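The resilient job system described in the Picking stage can be sketched roughly as follows. This is a minimal illustration of retry-with-backoff plus durable checkpointing, not Harvest's actual implementation; the checkpoint file name and function names are assumptions.

```python
import json
import time
from pathlib import Path

CHECKPOINT = Path("harvest_checkpoint.json")  # hypothetical checkpoint file

def load_checkpoint() -> set:
    """Return the set of job IDs that already completed."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def save_checkpoint(done: set) -> None:
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def run_with_retries(job_id: str, fetch, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a flaky fetch with exponential backoff before giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(job_id)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

def run_jobs(job_ids, fetch):
    """Process jobs, skipping any that completed on a previous run."""
    done = load_checkpoint()
    for job_id in job_ids:
        if job_id in done:
            continue  # already acquired before a crash or restart
        run_with_retries(job_id, fetch)
        done.add(job_id)
        save_checkpoint(done)  # durable progress after every job
    return done
```

The point of checkpointing after every job is graceful failure handling: a crashed acquisition run resumes where it left off instead of re-downloading terabytes.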
The result: ML teams receive a curated training blend, not raw commodity beans to sort themselves. A stringent grading process removes defects—corrupt frames, watermarks, low-resolution segments, compression artifacts—like a specialty roaster sorts out quakers and insect damage. Upfront curation can cut data pre-processing time for ML teams by 30-40%.
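A grading pass like the one described above might look roughly like this. The thresholds and metadata fields below are hypothetical stand-ins; the production rules are not public.

```python
from dataclasses import dataclass

@dataclass
class ClipMeta:
    clip_id: str
    width: int
    height: int
    corrupt_frames: int
    total_frames: int
    has_watermark: bool

# Hypothetical grading thresholds, for illustration only.
MIN_HEIGHT = 720          # reject sub-HD clips
MAX_CORRUPT_RATIO = 0.01  # tolerate at most 1% corrupt frames

def passes_grading(clip: ClipMeta) -> bool:
    """Apply defect rules: resolution, watermarks, frame corruption."""
    if clip.height < MIN_HEIGHT:
        return False
    if clip.has_watermark:
        return False
    if clip.corrupt_frames / max(clip.total_frames, 1) > MAX_CORRUPT_RATIO:
        return False
    return True

def grade(clips):
    """Split a batch into accepted and rejected lots."""
    accepted = [c for c in clips if passes_grading(c)]
    rejected = [c for c in clips if not passes_grading(c)]
    return accepted, rejected
```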
[Dashboard snapshot: three worker nodes (espresso, cortado, mocha), each online with 10/10 active downloads, completing roughly 190-215 jobs and moving 2.5-2.8 TB per day at a 0.92 all-time ratio, with heartbeats within the last 12 seconds. Per-node controls: Auto Rename, Auto Upload, Delete After Upload, B2 Only Mode, Pause Downloads.]
Processing Video for AI Training
Raw video files aren't directly useful for training. Modern video AI models need carefully prepared datasets. Harvest was built to handle not only acquisition, but the transformation of large quantities of video into even larger quantities of clips:
Scene Detection identifies natural boundaries—cuts, fades, transitions—ensuring each training sample is visually coherent. We leverage a combination of PySceneDetect for rule-based detection and TransNet v2 for learned boundary prediction, achieving high precision across diverse content types. A dataset full of clips that jump from a car chase to a dialogue scene mid-frame isn't useful for teaching a model anything but ADD.
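The frame-differencing idea behind rule-based detectors like PySceneDetect's content detector can be sketched in a few lines of NumPy. The function names and the threshold value here are illustrative only; the production pipeline combines a rule-based detector with TransNet v2's learned boundary prediction.

```python
import numpy as np

def detect_cuts(frames: np.ndarray, threshold: float = 30.0):
    """Flag a cut wherever the mean absolute pixel change between
    consecutive grayscale frames exceeds `threshold` (0-255 scale)."""
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0)).mean(axis=(1, 2))
    return [i + 1 for i, d in enumerate(diffs) if d > threshold]

def split_scenes(num_frames: int, cuts):
    """Turn cut indices into (start, end) scene ranges."""
    bounds = [0] + list(cuts) + [num_frames]
    return list(zip(bounds[:-1], bounds[1:]))
```

Learned detectors earn their keep on gradual fades and dissolves, where a fixed pixel-difference threshold either fires repeatedly or misses the boundary entirely.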
Clip Splitting breaks longer content into appropriately sized segments optimized for your training requirements. We also provide clean start and end frames using the Laplacian operator for blur detection, ensuring crisp boundaries rather than motion-blurred transitions. Too short and you lose context; too long and you waste compute on redundant frames. As models continue to improve, clean start and end frames will become the basis for synthetic datasets that reduce training costs by orders of magnitude.
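The variance-of-Laplacian blur check is a well-known technique and can be sketched with plain NumPy. The function names and synthetic test frames are illustrative, assuming grayscale input; real pipelines typically use `cv2.Laplacian(img, cv2.CV_64F).var()`.

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the discrete 4-neighbor Laplacian.
    Sharp frames have strong edges, hence high variance;
    motion-blurred frames score low."""
    g = gray.astype(np.float64)
    lap = (-4 * g[1:-1, 1:-1]
           + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())

def sharpest_frame(frames) -> int:
    """Pick the index of the crispest candidate boundary frame."""
    return max(range(len(frames)), key=lambda i: laplacian_variance(frames[i]))
```

In a splitting pipeline, this score lets you nudge a clip boundary by a few frames toward the sharpest candidate instead of landing mid-motion-blur.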
Dense Caption Generation produces detailed tasting notes for each clip—not just what's happening, but how: camera movements, lighting, composition, and temporal dynamics. Like a Q grader describing acidity, body, and finish, our Gemini 3 Flash powered annotation system captures the full context of each clip, with minimal human error, at scale.
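One way to picture the annotation output is as a structured record per clip. The field names below are illustrative, not Harvest's actual schema, and the JSON Lines serialization is an assumption about how such records would feed the downstream index.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class DenseCaption:
    """Structured 'tasting notes' for one clip (illustrative fields)."""
    clip_id: str
    summary: str            # what is happening
    camera: str             # e.g. "slow dolly-in"
    lighting: str           # e.g. "low-key, tungsten"
    composition: str        # e.g. "centered close-up"
    temporal_dynamics: str  # e.g. "action accelerates toward the cut"

def to_jsonl(captions) -> str:
    """Serialize caption records as JSON Lines for downstream indexing."""
    return "\n".join(json.dumps(asdict(c)) for c in captions)
```

Keeping camera, lighting, and temporal fields separate from the plain-language summary is what later makes queries like "slow-motion water splashes with high contrast lighting" answerable.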
Semantic Search & Discovery (Forthcoming) is where the catalog comes alive. After sourcing and cupping all those beans, you need a way to find exactly the flavor profile you're looking for. Harvest indexes every clip with its tasting notes and metadata, enabling natural language search across the entire dataset. Need to assemble a fine-tuning blend of "cooking scenes with close-up shots of hands"? Query it directly. Looking for "slow-motion water splashes with high contrast lighting"? It's searchable. This transforms a petabyte archive from an overwhelming warehouse into a curated catalog you can navigate in seconds.
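Assuming caption embeddings already exist, the retrieval step reduces to similarity ranking over the index. This is a brute-force cosine-similarity sketch with a hypothetical function name; a petabyte-scale catalog would use an approximate-nearest-neighbor index instead.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, clip_vecs: np.ndarray, k: int = 3):
    """Rank clips by cosine similarity between the query embedding
    and each clip's caption embedding; return (index, score) pairs."""
    q = query_vec / np.linalg.norm(query_vec)
    m = clip_vecs / np.linalg.norm(clip_vecs, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]
```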
Harvest has processed nearly half a petabyte and counting. We will continue processing hundreds of terabytes per month, running largely autonomously with monitoring and alerting for edge cases.
For frontier labs, this translates to:
Scale: Petabyte-ready infrastructure that grows with your data needs
Quality: Curated clips, not raw volume—specialty-grade training data with defects removed
Speed: 30-40% reduction in your team's data pre-processing time
Cost efficiency: Zero-egress, geolocated storage minimizes transfer overhead
And last but not least, focus. Your ML engineers work on tasting, not pre-processing.
What We Learned
Building Harvest at this scale reinforced something we already knew: in AI, data curation is as critical as model architecture.
Teams that can efficiently source, grade, and continuously deliver training data have a structural advantage in an agentic world that never sleeps.
We operate Harvest so you don't have to. If you're building frontier video models and need petabyte-scale, training-grade data, let's talk.