Information Retrieval is an Arms Race Between Algorithms and Data Growth and Complexity

IR is the focus for an arms race between algorithms to extract information from repositories as those repositories get larger and more complex, and users' demands get harder to satisfy (either in terms of response time or complexity of query).

One obvious issue with respect to IR over the Web is that the Web has no QA authority. Anyone with an ISP account can place a page on the Web, and as is well known the Web has been the site of a proliferation of conspiracy theories, urban legends, trivia and fantasy, as well as suffering from all the symptoms of unmanaged information such as out-of-date pages and duplicates, all the difficulties pertaining to multimedia representations, and all the indeterminacies introduced by the lack of strictly constrained knowledge representation. Understanding exactly what information is available on a page waiting to be retrieved remains a serious problem.

Perhaps more to the point, traditional IR has been used in benign environments where a mass of data was mined for nuggets of sense; typical problems were complexity and lack of pattern. Benchmark collections of documents for IR researchers tend to be high-quality and almost never intentionally misleading, such as collections of scientific papers in particular journals. Other Web-like mini-structures that can be used, such as Intranets, are also characterised by the good faith with which information is presented. But malicious attempts to subvert the very IR systems that support the Web so well are increasingly common. Web-based IR has to cope with not only the scale and complexity of the information, but potential attempts to skew its results with content intended to mislead [139].

Notes:

As the Web grows larger and its data become more complex and, in many respects, less trustworthy, retrieval algorithms will need to grow more sophisticated to keep pace.

Folksonomies: web science, semantic web, information retrieval

A Framework for Web Science (Foundations and Trends® in Web Science)

Book: Berners-Lee, Tim (2006-09-15), A Framework for Web Science (Foundations and Trends® in Web Science), Now Publishers Inc. Retrieved on 2010-11-15.

Source Material [eprints.ecs.soton.ac.uk]

Folksonomies: web science