In addition to crawling, syndication, and user contribution, other forms of acquisition include broadcast capture and event-based capture. Broadcast capture may involve analog-to-digital conversion and encoding, but it is becoming more common to capture digital streams directly to disk.
For consumers, broadcast may be received over the air, via cable, via direct broadcast satellite, or through IPTV or Internet TV multicast. Event-based acquisition is used in security applications, where real-time processing detects potential points of interest using video motion detection; detection events then control the recording for later forensic analysis.
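A minimal sketch of motion-triggered capture can be built from frame differencing: record only when consecutive frames differ by more than a threshold. The function names and the threshold value here are illustrative, not part of any particular surveillance product.

```python
import numpy as np


def motion_score(prev: np.ndarray, curr: np.ndarray) -> float:
    """Mean absolute pixel difference between consecutive grayscale frames."""
    return float(np.mean(np.abs(curr.astype(np.int16) - prev.astype(np.int16))))


def detect_motion(frames, threshold=10.0):
    """Yield (index, frame) for frames whose difference from the previous
    frame exceeds the threshold; these events would trigger recording."""
    prev = None
    for i, frame in enumerate(frames):
        if prev is not None and motion_score(prev, frame) > threshold:
            yield i, frame
        prev = frame
```

A production system would typically use a more robust background-subtraction model, but the control flow is the same: a detector gates the recorder.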
It is in the context of these streaming (or live, linear) sources that real-time processing and high availability are paramount. In the earlier examples of acquisition there is effectively a source buffer, so if the acquisition system were to go offline the result would be only a slightly delayed appearance of new content; for streaming acquisition, an outage means irretrievable content loss. Of course, high reliability is also desirable for user-contributed content in order to preserve a satisfying user experience.
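The buffering distinction can be sketched with a bounded queue at the source: a brief consumer outage only delays delivery, and content is lost only once the buffer itself overflows. The class and capacity below are hypothetical, chosen to illustrate the failure mode rather than any specific product.

```python
from collections import deque


class SourceBuffer:
    """Bounded buffer at the source: a brief consumer outage delays
    content rather than losing it, until the buffer itself overflows."""

    def __init__(self, capacity: int):
        self.queue = deque(maxlen=capacity)
        self.dropped = 0  # items lost to overflow -- irretrievable content

    def push(self, item):
        if len(self.queue) == self.queue.maxlen:
            self.dropped += 1  # oldest item is about to be evicted
        self.queue.append(item)

    def drain(self):
        """Consumer comes back online and catches up."""
        while self.queue:
            yield self.queue.popleft()
```

With capacity sized for the longest expected outage, `dropped` stays at zero; a pure live stream behaves like a buffer of capacity zero, where every missed item is gone.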
A particular set of concerns arises with user-contributed content, and appropriate mitigation steps should be taken before inserting UGC into the search engine for distribution. Users may post copyrighted, inappropriate, or offensive material, and may intentionally misrepresent the content with inaccurate metadata.
Sites may employ a review process in which content is not posted until approved, or may rely on other viewers to flag content for review. Fingerprinting technology identifies a particular segment of media from extracted features; this operation can take place during the media processing phase so that copyrighted content is rejected before it is posted.
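The fingerprinting idea can be sketched as hashing a compact feature per fixed-size chunk of the signal and matching a query against a reference set. The sign-pattern feature used here is a deliberately toy stand-in; real systems derive robust perceptual features from the audio or video.

```python
import hashlib


def fingerprint(samples, chunk=4):
    """Toy fingerprint: hash the sign pattern of each fixed-size chunk.
    Real systems use perceptual features robust to re-encoding."""
    prints = []
    for i in range(0, len(samples) - chunk + 1, chunk):
        pattern = tuple(1 if s >= 0 else 0 for s in samples[i:i + chunk])
        prints.append(hashlib.sha256(bytes(pattern)).hexdigest()[:16])
    return prints


def matches(query_prints, reference_prints, min_overlap=2):
    """Flag a match if enough query chunk fingerprints appear in the
    reference set -- the basis for rejecting a copyrighted upload."""
    ref = set(reference_prints)
    return sum(p in ref for p in query_prints) >= min_overlap
```

An upload whose fingerprints overlap a registered reference would be held for review or rejected during the media processing phase.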
Newly acquired media is available in a state that facilitates subsequent operations such as transcoding and metadata extraction. Transcoding engines may operate independently on the content, with the goals of preparing it for delivery (perhaps via streaming) and normalizing it to a single format suitable for archiving. We can consider a path for the media separate from the path for the metadata: the media is positioned on user-facing media servers or origin servers for content distribution, while the metadata flows to the index and storage for use in browsing, with the two paths tied together by content identifiers.
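The split-path design above can be sketched as two stores joined by a content identifier derived from the media itself. The store names and record fields here are hypothetical stand-ins for the origin servers and the search index.

```python
import hashlib
from dataclasses import dataclass


def content_id(media_bytes: bytes) -> str:
    """A stable identifier derived from the media ties the media path
    and the metadata path back together."""
    return hashlib.sha256(media_bytes).hexdigest()[:12]


@dataclass
class MetadataRecord:
    cid: str
    title: str
    duration_s: float


# Hypothetical stores standing in for origin servers and the search index.
origin_store = {}    # cid -> media bytes (delivery path)
metadata_index = {}  # cid -> MetadataRecord (search/browse path)


def ingest(media_bytes: bytes, title: str, duration_s: float) -> str:
    """Route the media and its metadata down separate paths, keyed by
    the same content identifier."""
    cid = content_id(media_bytes)
    origin_store[cid] = media_bytes
    metadata_index[cid] = MetadataRecord(cid, title, duration_s)
    return cid
```

A browse result found in the index then needs only the `cid` to locate the deliverable media on the origin servers.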