Seek and meta-seek and ye shall find…
Search engines (SEs) are among the great hallmarks of the information society. Computers that ‘crawl’ the Web cataloguing everything they come across have become indispensable. Google has even become a verb – ‘Let’s Google it’.
There’s Google, of course, and AltaVista, Yahoo!, All The Web, Lycos, even a Hog Search – have you ever heard of it? There are search engines specialised in specific sectors, search engines that specialise in finding people, and there are even search engines that specialise in getting and vetting information from other SEs, called meta-search engines.
Have you got a question? Just feed it to your SE and in a few seconds – after scanning its index containing millions and millions of web pages – most SEs will spit out a list of articles, references, ads, photos, indeed anything and everything that has to do with your query. Not even the best librarian in the biggest, best, most up-to-date library in the world can match this performance.
Of course, some SE responses list tens, hundreds, thousands and, yes, even billions of references – and many of those are as closely related to what you are looking for as a free-association response on a psychoanalyst’s couch. Try looking up ‘The Internet’ on Google; you will get more than one and a half billion responses in less than a quarter of a second. What if the answer you need is in the 3,002,644th reference? Now that’s really useful. Sure!!!
The Internet, the World Wide Web, has become one of the great repositories of human knowledge, perhaps the greatest of all time. In time, the projects to digitalise the world’s knowledge, all the world’s books, will certainly be realised and the value of the Web will multiply accordingly. The Web will have everything; the problem will be finding it. Today, the number of hopeless responses, and of those that merely miss the mark a bit, far surpasses the number of responses that really answer the questions you ask.
Checking for search engines (I ran a quick search using – right – a search engine), I immediately found about 70 sites, but there are probably hundreds. I recently read – I don’t remember where and do not know how true it is – that all the search engines together only cover about 60 per cent of the material on the Web. The article said that none of the major search engines has catalogued more than twenty per cent, and that the search engines all, at least partially, duplicate the results of other SEs. Still, the search engines cover a staggering amount of information and, if you’ve got the time to hunt a bit, you can usually find what you are looking for.
Given the complexity of searching the Web, the number of false responses you get, and the vast number of ways you can get a question wrong, it is a wonder we find as much as we do and it’s astonishing how valuable and time-saving it is. It is no wonder, though, that a generation of meta-search engines has sprung up. A meta-SE has no database of its own; instead, it sends your search terms to a number of SEs. If two heads are better than one, then two, or five or more SEs should be better still. On the other hand, if both heads belong to idiots, the result will be, well, idiotic. For the most part, it seems obvious that despite the fancy technology used to select the results – the clustering, linguistic analysis and textual analysis some of the better meta-SEs employ – the meta-SE’s results can be no better than the material available on the SEs. Meta-SEs cannot turn search engine hip-hop into Shakespearean sonnets.
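A meta-SE’s core mechanism is simple enough to sketch. The snippet below is a minimal illustration, with two stand-in ‘engines’ that return ranked lists of URLs (real meta-SEs query live services and parse their responses – the names and URLs here are invented); results are merged round-robin so each engine’s top pick carries equal weight.

```python
# Two stand-in 'search engines' returning ranked result lists.
# Real meta-SEs would send the query over the network.
def engine_a(query):
    return ["example.org/a", "example.org/b", "example.org/c"]

def engine_b(query):
    return ["example.org/b", "example.org/d"]

def meta_search(query, engines):
    """Interleave ranked lists from several engines, dropping duplicates."""
    lists = [engine(query) for engine in engines]
    merged, seen = [], set()
    # Round-robin over ranks: every engine's first result is
    # considered before any engine's second result, and so on.
    for rank in range(max(len(results) for results in lists)):
        for results in lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

print(meta_search("the internet", [engine_a, engine_b]))
# → ['example.org/a', 'example.org/b', 'example.org/d', 'example.org/c']
```

Better meta-SEs replace the naive round-robin with clustering or textual analysis, but the shape – fan out, merge, de-duplicate – is the same.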
Of course, the problems are well known. Researchers in universities, in start-ups, in major ICT corporations and in the search engine companies themselves have been working long and hard to build better search algorithms.
One way to improve search engines is to personalise their search algorithms so they return results more closely attuned to the information seeker’s interests. If I search for the ‘Web’, I really don’t want to know about spiders – although an entomologist might – unless they are search engine ‘spiders’, also known as crawlers: programmes that roam the Internet gathering information to catalogue and file in the search engine’s database. Algorithms that take account of the searcher’s history are much more likely to return the sort of results required and to leave out unsuitable responses.
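One simple way such personalisation can work is re-ranking: score each result by how much it overlaps with terms drawn from the user’s past queries and clicks. The sketch below is a toy illustration of that idea, not any real engine’s algorithm; the history terms and result titles are invented.

```python
def personalise(results, history_terms):
    """Re-rank result titles by word overlap with the user's interests."""
    def score(title):
        # Count how many words of the title appear in the history.
        return len(set(title.lower().split()) & history_terms)
    return sorted(results, key=score, reverse=True)

# Hypothetical history for a user interested in search technology.
history = {"search", "engine", "crawler", "index"}
results = [
    "Spider anatomy for entomologists",
    "How a search engine crawler builds its index",
]
print(personalise(results, history))
# The crawler article comes first; the entomology page scores zero.
```

The entomologist, with a different history set, would see the opposite ordering – which is precisely the point of personalisation.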
The best-known search engines all give advertisers a certain priority when they display results, so they appear more important – that is, more relevant – than they are. Corporate advertisers pay for your free ride on the search engines, but ‘there is no free lunch’; they rob a bit of the SE’s utility and accuracy in the process.
A group at Colorado State University has developed QueryTracker, a software search agent that mediates between users and ‘normal’ search engines. QueryTracker seeks information that the user wants to track over the long term – be it the stock market, ICT trends or string theory research – by querying the search engines each day and checking for changes and new results on the Web. What makes QueryTracker interesting is that, based on the results of past searches, it can reformulate and improve its queries day to day, refining them to provide much more sharply focused results.
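The flavour of that kind of refinement can be conveyed in a few lines. This is a toy sketch of the general idea, not the Colorado group’s actual algorithm: tomorrow’s query gains the terms that appeared most often in the documents the user actually kept from today’s results. All document text here is invented.

```python
from collections import Counter

def refine_query(query_terms, kept_documents, extra=2):
    """Expand a query with the most frequent new terms from kept results."""
    words = Counter()
    for doc in kept_documents:
        for word in doc.lower().split():
            if word not in query_terms:  # only count genuinely new terms
                words[word] += 1
    return query_terms + [word for word, _ in words.most_common(extra)]

query = ["string", "theory"]
kept = ["string theory branes", "branes m-theory dualities"]
print(refine_query(query, kept))
# → ['string', 'theory', 'branes', 'm-theory']
```

Run daily, a loop like this gradually sharpens a vague standing query into one that tracks the niche the user actually reads about.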
There are many ways to personalise searches, and each provides a certain amount of added value, but none of them comes even close to what users really want – exact answers to imprecise questions of the sort we ask one another, understand intuitively and answer precisely.
There are personalised searches that just look for specified types of pages. The searches might hunt for news about banking industry technology, running shoes or butterflies – whatever interests you. A more sophisticated SE might start by looking for specific pages, but would rank the results based on an analysis of the pages that were actually accessed and read and, perhaps, how much time was spent looking at each page. Specific search criteria and logical search operators such as ‘and’, ‘not’, ‘or’ and ‘if… then’ might also be employed. Tabulations of the number of searchers that have read a certain article, the number of links from other sites and the number of citations in scholarly papers are also used, but none of these techniques – or any other – has been wondrously successful.
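Those signals – clicks, dwell time, inbound links – are typically folded into a single ranking score. The sketch below shows one way that might look; the weights are illustrative assumptions, not any engine’s real formula, and the page data is invented.

```python
def rank_score(page, w_clicks=1.0, w_dwell=0.1, w_links=0.5):
    """Combine usage and citation signals into one ranking score."""
    return (w_clicks * page["clicks"]
            + w_dwell * page["seconds_viewed"]
            + w_links * page["inbound_links"])

pages = [
    {"url": "a", "clicks": 10, "seconds_viewed": 30,  "inbound_links": 4},
    {"url": "b", "clicks": 2,  "seconds_viewed": 300, "inbound_links": 40},
]
ranked = sorted(pages, key=rank_score, reverse=True)
print([p["url"] for p in ranked])
# Page 'b' wins: heavily read and widely linked beats merely clicked.
```

The hard part, of course, is not the arithmetic but choosing weights that predict what a given searcher actually wants – which is why none of these schemes has been wondrously successful.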
Google’s Custom Search Engine service helps anyone, any company, to put a specialised Google search box on their website. The service can access Google’s entire index, but selects its results from a predefined universe of sites and pages. Google provides tools, menus and wizards that let anyone quickly build their own custom search engine. In a way, this is just another manifestation of the social search mechanism, whereby users feed back their opinions and help select the results for other like-minded users within their own community or group. Users appreciate the pre-selections made by their peers; many feel that the search experience is enriched by the feedback from users with similar interests.
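The ‘predefined universe’ idea amounts to a filter over full-index results. A minimal sketch, with an entirely hypothetical allow-list of community-chosen domains:

```python
# Hypothetical allow-list a community might curate for its search box.
ALLOWED = {"docs.example.com", "blog.example.com"}

def restrict(results):
    """Keep only results whose domain is on the community's allow-list."""
    return [url for url in results if url.split("/")[0] in ALLOWED]

hits = [
    "docs.example.com/setup",
    "random.example.net/page",   # filtered out: not in the universe
    "blog.example.com/post",
]
print(restrict(hits))
# → ['docs.example.com/setup', 'blog.example.com/post']
```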
The developers of Wikipedia, the user-written online encyclopaedia, are now developing a ‘people-powered’ search engine that will weigh its results based upon user input.
For the moment, brute force, rather than sophistication, seems to hold the most promise – that is, until someone develops a breakthrough search algorithm. IBM tried something of the sort with its WebFountain technology. WebFountain’s crawlers index many millions of pages per day; it mines mountains of data, using some of the most sophisticated linguistic analysis to ‘understand’ – to discover the meaning of – whatever text it encounters, and tags it all using XML to make it searchable. Still, like all computer programmes, it is fairly good at facts, but not so good at determining what people mean. We all understand when a statement is meant to be ironic, but computer programmes don’t; even the best still interpret statements literally where any of us would assign a wholly different meaning.
WebFountain is pretty good, but the service, which is sold only to corporate clients, lost one of its biggest users because it took too long to fully analyse all of the data it collected. Given the constant growth of computer capacity, things are bound to speed up considerably in the near future and the brute force approach will become better.
Time will tell which approach works best. There are still moments when wildly wrong, inadvertently funny results remind me of a story from the early days of computerised language translation. Many years ago, at the height of the Cold War, the Pentagon was supposedly working on a programme to translate Russian to English and vice versa. To test it, they fed in the proverb, “Out of sight, out of mind”. The resulting sentence in Russian was then fed into the counterpart programme and translated back to English. The result? – “Blind idiot” – a pretty good description of some of today’s search engines.
Our next Connect-World North America Issue will be published later this month.
The issue will be widely distributed to our reader base, as well as at shows where we are one of the main media sponsors, such as ISCe, the International Satellite & Communications Conference (5-7 June, San Diego, USA), and NXTComm (18-21 June, Chicago).
The theme of this issue of Connect-World North America will be The Broadband connection – from enterprise to entertainment.
Broadband has been a hot topic for some time now. Throughout the world, even in the least developed nations, the goal for the universalisation of telecommunications has been raised from telephone access for everyone to a broadband/Internet connection for everyone. Service providers of all types have been delivering broadband for some time, but demand is still growing. TV, for example, has always been hot; today it is hotter. Everyone wants a piece of the advertising, everyone wants a piece of the audience – not just the broadcasters, but also the mobile operators, the wireline telephone operators, the ISPs – in fact, everyone – and this means more broadband: wireless broadband, copper-wired broadband, fibre broadband. Everyone who can pump a signal to a screen – even one the size of two postage stamps – is looking for the content to drive viewers to their own better-mousetrap version of TV.
It doesn’t stop there. Virtual worlds such as Second Life float in a broadband sea and business applications, always broadband intensive, will call for ever broader, ever quicker links. With virtual presence applications, we might be seeing headquarters executives ‘strolling’ down the aisles of their factories halfway around the world. This takes a lot of bandwidth, but this is just the start.
Much of the bulge in bandwidth demand comes from mobile operators, handset producers, turbocharged DSL manufacturers, content producers and traditional telephone companies jockeying for their share of entertainment revenues, but the next generation of business services, and exponentially intensified service outsourcing, will multiply this demand. This is already driving interest in new business models and promises to generate active searches for new partners. Many of the consolidations, the mergers and acquisitions that will take place, would have been considered strange marriages not so very long ago.
The drive for broadband capacity will, once again, revolutionise the industry.