THVD Dataset

A high-end 4K video dataset for stress-testing and training models.

About

We provide a comprehensive talking-head video dataset with over 50,000 videos, totaling more than 600 hours of footage and featuring 23,841 unique identities from around the world.

Who Can Use It

Intended users and their use cases:

Data Scientists: Training machine learning models for video-based AI applications.

Researchers: Studying human behavior, facial analysis, or video AI advancements.

Businesses: Developing facial recognition systems, video analytics, or AI-driven media applications.

Distribution

Format, size, and structure of the dataset:

Data Volume:
Total Size: 2.5 TB
Total Videos: 47,573
Identities Covered: 20,841
Resolution: 60% 4K (2160p), 33% Full HD (1080p)
Formats: MP4
Full-length videos with visible mouth movements in every frame.
Minimum face size of 400 pixels.
Video durations range from 20 seconds to 5 minutes.
Faces have not been cropped out; videos are full-frame and include backgrounds.
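
As an illustration, the duration and face-size limits listed above could be checked against per-video metadata along these lines. This is a minimal sketch: the `VideoMeta` fields are hypothetical, and it assumes "face size" means a single pixel measurement per video, which the description does not spell out.

```python
from dataclasses import dataclass

# Thresholds taken from the dataset description above.
MIN_FACE_PX = 400
MIN_DURATION_S = 20
MAX_DURATION_S = 5 * 60


@dataclass
class VideoMeta:
    # Hypothetical fields; the real CSV columns may be named differently.
    filename: str
    duration_s: float
    face_size_px: int


def meets_spec(video: VideoMeta) -> bool:
    """Return True if the video satisfies the stated duration and face-size limits."""
    duration_ok = MIN_DURATION_S <= video.duration_s <= MAX_DURATION_S
    face_ok = video.face_size_px >= MIN_FACE_PX
    return duration_ok and face_ok


samples = [
    VideoMeta("a.mp4", duration_s=45.0, face_size_px=512),
    VideoMeta("b.mp4", duration_s=12.0, face_size_px=640),   # too short
    VideoMeta("c.mp4", duration_s=120.0, face_size_px=300),  # face too small
]
valid = [v.filename for v in samples if meets_spec(v)]
print(valid)
```

A check like this may be useful when verifying a delivered batch against the stated specification.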

Usage

This dataset is ideal for a variety of applications:

  • Face Recognition & Verification: Training and benchmarking facial recognition models.
  • Action Recognition: Identifying human activities and behaviors.
  • Re-Identification (Re-ID): Tracking identities across different videos and environments.
  • Deepfake Detection: Developing methods to detect manipulated videos.
  • Generative AI: Training high-resolution video generation models.
  • Lip Syncing Applications: Enhancing AI-driven lip-syncing models for dubbing and virtual avatars.
  • Background AI Applications: Developing AI models for automated background replacement, segmentation, and enhancement.

Coverage

Scope and coverage of the dataset:
  • Geographic Coverage: Worldwide
  • Time Range: The time range and size of each video are recorded in the CSV file.
  • Demographics: The CSV file includes age, gender, and ethnicity, along with format, resolution, and file size.
Languages Covered (Videos):
English: 23,839 videos
Polish: 1,818 videos
Arabic: 1,691 videos
Dutch: 1,668 videos
Japanese: 1,433 videos
Portuguese: 1,359 videos
German: 1,281 videos
Turkish: 1,245 videos
Hindi: 1,194 videos
Indonesian: 1,182 videos
Romanian: 1,144 videos
French: 1,107 videos
Swedish: 1,059 videos
Greek: 1,006 videos
Italian: 1,006 videos
Tagalog: 924 videos
Spanish: 688 videos
Czech: 590 videos
Norwegian: 586 videos
Chinese (cn): 444 videos
Chinese (tw): 241 videos
Bulgarian: 340 videos
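
Since the per-video metadata ships as a CSV file, selecting a subset by language and resolution could look roughly like the sketch below. The column names (`filename`, `language`, `resolution`, `file_size_mb`) and the sample rows are assumptions made for illustration; the actual CSV schema may differ.

```python
import csv
import io

# Hypothetical CSV excerpt; the column names are assumptions, not the real schema.
SAMPLE_CSV = """filename,language,resolution,file_size_mb
v1.mp4,English,2160p,812
v2.mp4,Polish,1080p,240
v3.mp4,English,1080p,198
"""


def select_videos(csv_text, language, resolution):
    """Return the filenames of rows matching the given language and resolution."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["filename"]
        for row in reader
        if row["language"] == language and row["resolution"] == resolution
    ]


print(select_videos(SAMPLE_CSV, "English", "2160p"))
```

For the real file, the same function would work on the text read from disk once the column names are adjusted to match.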

Statistics

Gender

Male: 31,830
Female: 15,509
Others: 234

Age

20-29: 23,904
30-39: 17,003
40-49: 3,561
Others: 3,105

Race

White: 33,280
Asian: 9,123
Black: 3,556
Indian: 1,380

Resolution

2160p: 24,856
1440p: 296
1080p: 21,964
720p: 457
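
The counts above can be converted into percentage shares with simple arithmetic. The sketch below uses the Resolution figures exactly as listed; the variable names are illustrative only.

```python
# Counts copied from the Resolution statistics above.
resolution_counts = {
    "2160p": 24_856,
    "1440p": 296,
    "1080p": 21_964,
    "720p": 457,
}

total = sum(resolution_counts.values())
shares = {name: 100 * count / total for name, count in resolution_counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

The same pattern applies to the Gender, Age, and Race counts.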

Additional Notes

Ensure ethical usage and compliance with privacy regulations. The dataset's quality and scale make it valuable for high-performance AI training. Preprocessing such as cropping or downsampling may be needed for some use cases.

The dataset is not yet complete and expands daily; please contact us for the most up-to-date CSV file. It is divided into 20 GB zipped files and hosted on a private server, with the option to upload to the cloud if needed.

To verify the dataset's quality, please contact us for the full CSV file. We'd be happy to provide example videos selected by the potential buyer.
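
For the downsampling mentioned above, one possible approach is to drive ffmpeg from a small script. The sketch below only builds the command and does not run it; it assumes ffmpeg is installed, and the file names and the `downsample_command` helper are hypothetical.

```python
from pathlib import Path


def downsample_command(src, dst_dir, height=1080):
    """Build an ffmpeg command that rescales a video to the given height.

    scale=-2:<height> keeps the aspect ratio and rounds the width to an even number.
    """
    src = Path(src)
    dst = Path(dst_dir) / f"{src.stem}_{height}p{src.suffix}"
    return [
        "ffmpeg",
        "-i", str(src),
        "-vf", f"scale=-2:{height}",
        "-c:a", "copy",  # leave the audio stream untouched
        str(dst),
    ]


cmd = downsample_command("clip_0001.mp4", "out", height=1080)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps file names with spaces safe when it is later passed to `subprocess.run`.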