“If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.”
Performance of machine learning algorithms, supervised or unsupervised, is often significantly enhanced when a variety of feature families, or multiple views of the data, are available. For example, in the case of web pages, one feature family can be based on the words appearing on the page, and another can be based on the URLs and related connectivity properties. Similarly, videos contain both audio and visual signals where in turn each modality is analyzed in a variety of ways. For instance, the visual stream can be analyzed based on the color and edge distribution, texture, motion, object types, and so on. YouTube videos are also associated with textual information (title, tags, comments, etc.). Each feature family complements others in providing predictive signals to accomplish a prediction or classification task, for example, in automatically classifying videos into subject areas such as sports, music, comedy, games, and so on.
We have released a dataset of over 100k feature vectors extracted from public YouTube videos. These videos are labeled by one of 30 classes, each class corresponding to a video game (with some amount of class noise): each video shows a gameplay of a video game, for teaching purposes for example. Each instance (video) is described by three feature families (textual, visual, and auditory), and each family is broken into subfamilies yielding up to 13 feature types per instance. Neither video identities nor class identities are released.
We hope that this dataset will be valuable for research on a variety of multiview related machine learning topics, including multiview clustering, co-training, active learning, classifier fusion and ensembles.
The data and more information can be obtained from the UCI machine learning repository (multiview video dataset), or from here.