The terms invisible Web and hidden Web refer to Web resources that are not easily indexed by Web search engines.
Search engines use crawlers (also called spiders) to locate content for indexing by following links that they encounter in each page that they parse. However, instead of maintaining large collections of HTML files, many sites generate HTML pages dynamically from content stored in XML files or in relational databases.
The content may be exposed only if users search using a Web form, an action which crawlers cannot easily mimic.
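To make the crawler's limitation concrete, the following minimal sketch extracts the links a crawler could follow from a page and separately records any forms it finds. It uses only the Python standard library. The class name `LinkAndFormParser`, the helper `extract`, and the sample page are illustrative assumptions, not part of any particular search engine.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndFormParser(HTMLParser):
    """Collects followable links and notes forms that hide content behind a query."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []  # absolute URLs taken from <a href="...">
        self.forms = []  # absolute action URLs taken from <form action="...">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # A plain hyperlink: the crawler can queue this URL for a later visit.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "form":
            # A form: the crawler sees where it submits, but not what to type into it.
            self.forms.append(urljoin(self.base_url, attrs.get("action") or ""))


def extract(html, base_url):
    """Return (links, forms) found in the given HTML string."""
    parser = LinkAndFormParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, parser.forms


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """
<html><body>
  <a href="/about.html">About</a>
  <a href="news/today.html">Today</a>
  <form action="/search" method="get"><input name="q"></form>
</body></html>
"""

links, forms = extract(SAMPLE_PAGE, "http://example.com/index.html")
print("Followable links:", links)
print("Forms (content behind these stays hidden):", forms)
```

The two hyperlinks can be queued for crawling, whereas the search form only reveals its action URL. Whatever the site would return for a given query remains out of reach unless the crawler can guess meaningful input values.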
Another problem for crawlers arises from sites that require user registration and authentication to access content. Estimating the size of the invisible Web is difficult since, by definition, the content cannot be seen, but it may be orders of magnitude larger than the surface (visible) Web.
There are also socioeconomic aspects to this issue: surface content is dominated by commercial enterprises and is funded largely by advertising, while hidden Web content is often premium or academic material.
Some would go so far as to dismiss invisible Web content entirely, arguing that because users rely only on search engines to locate content, it does not matter whether content exists beyond the reach of their favorite search engine.
Although the scale is not easily quantifiable, as far as users’ expectations are concerned, the invisible Web phenomenon is more severe for video than for text.
There are cases where Web pages contain links directly to static video files, but this is the exception rather than the norm. Video content is typically accessed through a player with complex scripting used to specify the video asset.
Due to the size of the media objects and the complexities of maintaining news content, publishers typically use asset management or publishing tools that are linked to databases. In some cases the publisher may have rights to publish the content only for a limited time. Professionally produced video entails high production costs, and sites recover the investment through advertising or subscriptions.
Video advertising via forced playlists also foils search engine crawlers.
Video protected by digital rights management (DRM) precludes content-based analysis. Attempts by search engines to circumvent any of these revenue-preserving schemes will not be received favorably by the content owners.
Consumer-produced content posted on sharing sites, on the other hand, is often open to all viewers for free, and sites may have mechanisms to generate permanent links to videos.
Crawlers may encounter these links on other sites, but the links point back to a full page rather than directly to the video file. Stream saver or downloader tools have been developed to work around these issues.
Stale links arise from content being moved or deleted after a crawler has indexed the content. While this is a problem in both the text and video domains, it may be more likely for video files because large file sizes or rights issues may lead sites to remove content after a certain period of time.
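One way an index could detect stale links is to revisit stored URLs periodically and drop those that no longer resolve. The sketch below assumes a simple policy in which a link counts as stale when the server answers 404 or 410, or cannot be reached at all. The function names `http_status` and `find_stale`, the choice of status codes, and the example URLs are illustrative assumptions rather than a description of how any real search engine behaves.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code of a HEAD request, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return None


def find_stale(urls, get_status=http_status):
    """Return the URLs whose content appears to have been moved or deleted.

    get_status is injectable so the policy can be exercised without network access.
    """
    stale = []
    for url in urls:
        status = get_status(url)
        if status is None or status in (404, 410):
            stale.append(url)
    return stale


# Offline demonstration with a stubbed status lookup (hypothetical URLs).
fake_statuses = {
    "http://example.com/video/1": 200,
    "http://example.com/video/2": 404,
    "http://example.com/video/3": 410,
}
stale_links = find_stale(list(fake_statuses), get_status=fake_statuses.get)
print(stale_links)
```

Passing the status lookup as a parameter keeps the staleness policy separate from the network call. A production system would likely add retries and rate limiting, since a single failed request does not prove that content has been removed.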