Searching requires browsing sets of candidate results. Video is a continuous (or linear) medium: if paused, only a single frame remains and the audio is lost. Text is displayed in a more parallel fashion and can therefore be browsed easily.
Video storage and transmission requirements are several orders of magnitude greater than those for text.
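A back-of-envelope calculation makes the gap concrete. The figures below (a ~300-word text page, one minute of uncompressed 640x480 24-bit video at 30 fps) are illustrative assumptions, not values from the text:

```python
# Illustrative storage comparison (assumed figures, not from the source):
# one page of plain text vs. one minute of uncompressed SD video.

text_bytes = 300 * 6                  # ~300 words of ~6 bytes each
frame_bytes = 640 * 480 * 3           # one 24-bit RGB frame
video_bytes = frame_bytes * 30 * 60   # 30 fps for 60 seconds

ratio = video_bytes / text_bytes
print(f"text page: {text_bytes} B")
print(f"1 min video: {video_bytes / 1e9:.2f} GB")
print(f"ratio: ~10^{len(str(int(ratio))) - 1}")
```

Even before adding audio, the uncompressed video outweighs the text page by roughly six orders of magnitude; compression narrows but does not close the gap.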
Textual features (characters, words) are well defined, can be efficiently encoded, and are limited in number. Video features (edges, colors, motion) and acoustic features (pitch, energy) are less well defined, computationally expensive to extract, and bulky to represent. In fact, there is little consensus on which features are best for a given application.
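To illustrate the contrast, here is a sketch of one of the simplest visual features, a coarse color histogram. The binning scheme and frame representation are assumptions for illustration; real systems extract such features per frame or per shot, and even this simple descriptor is far bulkier than a word token:

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Quantize each RGB channel into a few bins and return normalized counts.

    `pixels` is a flat list of (r, g, b) tuples with values in 0..255;
    4 bins per channel gives a 64-bin histogram.
    """
    step = 256 // bins_per_channel
    hist = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_: count / total for bin_, count in hist.items()}

# A dummy 2x2 "frame": three reddish pixels, one bluish pixel.
frame = [(250, 10, 10)] * 3 + [(10, 10, 250)]
hist = color_histogram(frame)
print(hist)  # reddish bin holds 0.75 of the mass, bluish bin 0.25
```

Unlike a word, which is an exact symbol, this histogram is an approximate, lossy summary, which is one reason matching video features is harder than matching text.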
Furthermore, users can formulate textual queries easily using a keyboard, so that, to a first approximation, the information retrieval problem reduces to a symbol look-up (i.e., find the documents containing this word). For video databases, the query-response cycle is cross-modal (enter text, retrieve video).
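The symbol look-up view can be made concrete with a minimal inverted index (the documents here are invented for illustration): retrieval becomes a dictionary access on the query term.

```python
from collections import defaultdict

# Toy document collection (illustrative).
docs = {
    1: "video retrieval is hard",
    2: "text retrieval is symbol lookup",
    3: "video features are expensive",
}

# Build the inverted index: word -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# "Find the documents containing this word" is now a single look-up.
print(sorted(index["retrieval"]))  # [1, 2]
print(sorted(index["video"]))      # [1, 3]
```

No comparably crisp look-up structure exists for raw video: there is no finite alphabet of visual "words" to index, which is exactly the asymmetry the text is pointing at.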
Query by image content involves building a query by specifying image or video attributes, perhaps with a graphical tool; this is often beyond the patience of the typical user.
Query by example or relevance feedback methods are easier to use but require some seed search to bootstrap the process. Comparing some of the issues faced by video search engine systems to their analogs from the text domain sheds light on the nature and scope of the challenges encountered.
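The text does not name a specific feedback method; a standard one is the Rocchio update, sketched below under the assumption that the query and the marked results are feature vectors. The seed query is pulled toward relevant examples and away from non-relevant ones:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback: move the query vector toward the mean of
    relevant results and away from the mean of non-relevant ones.
    The weights alpha, beta, gamma are conventional defaults, not from the text."""
    dim = len(query)

    def mean(vectors):
        if not vectors:
            return [0.0] * dim
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    rel_mean = mean(relevant)
    non_mean = mean(nonrelevant)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel_mean, non_mean)]

# Seed query plus feedback from one relevant and one non-relevant result.
q_new = rocchio([1.0, 0.0], relevant=[[0.0, 1.0]], nonrelevant=[[1.0, 0.0]])
print(q_new)  # [0.85, 0.75]
```

The need for a seed query is visible in the signature: without an initial `query` vector and a first round of retrieved results to mark, there is nothing to update, which is the bootstrapping problem the text mentions.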