A community's first line of defense is high quality information architecture and navigation, as discussed at the end of the "Content Management" chapter. Users are better at browsing than formulating search queries. A community's second line of defense, however, is a superb full-text search facility. The search database must include both publisher-authored and user-contributed content. Here are some example query categories:
which, by the time the bind variable
select * from content where body like '%' || :user_query || '%'
:user_queryis substituted, turns into
In Oracle this won't pick up a row whose message contains the same word but with a different capitalization. Instead we do
select * from content where body like '%running%'
What if the user typed multiple words? The query
select * from content where upper(body) like upper('%running%')
would not pick up a message that contained the phrase "shoes for running". Instead we'll need multiple where clauses:
select * from content where upper(body) like upper('%running shoes%')
This AND clause isn't quite right. If there are lots of documents that contain both "running" and "shoes", these are the ones that we'd like to see. However, if there aren't any rows with all query terms, we should probably offer the user rows that contain some of the query terms. We might need to use OR, a scoring function, and an ORDER BY so that the rows containing both query terms are returned first. If we insist on the AND clause, we've created a situation in which the more the user tells us about her interests the fewer documents we'll return in response to a search, eventually returning "0 results found" if she keeps adding words. (Note that public search engines circa 2005, such as Google, Yahoo, A9, and MSN, do implicitly use AND and do return 0 results if a user keeps adding words to a query and there aren't any documents in the database that contain each and every one of those words.)
select * from content where upper(body) like upper('%running%') and upper(body) like upper('%shoes%')
There are some deeper problems with the Caveperson SQL Programmer approach to full-text search. Suppose that a message contains the phrase "My brother-in-law Billy Bob ran 20 miles yesterday" but not the word "running". Or a message contains the phrase "My cousin Gertrude runs 15 miles every day". These should be returned as relevant to the query "running", but the LIKE clause won't do the job. What is needed is a system for stemming both the query terms and the indexed terms: "running", "runs", and "ran" would all be bashed down to the stem word "run" for indexing and retrieval.
What about a message saying "I attended the 100th anniversary Boston Marathon"? The LIKE query won't pick that up. What is needed is a system for expanding queries through a thesaurus powerful enough to make the connection between "running" and "marathon".
The RDBMS must examine every row in the
select * from content where body like '%running%'
contenttable to answer this query, i.e., must perform a sequential table scan(O[N] time, where N is the number of rows in the table). Suppose that a standard RDBMS index is defined on the
bodycolumn. The values of
bodywill be used as keys for a B-tree and we could perform
and maybe, depending on the implementation,
select * from content where body = 'running'
in O[logN] time. But the user's interest isn't restricted to documents whose only word is "running" or documents that begin with the word "running". The user wants documents in which the word "running" may be buried. A single B-tree index is not going to help.
select * from content where body like 'running%'
How does it work? Like every other indexing strategy: extra work at insertion time is traded for less work at query time. Consider constructing a big table of every word in the English language next to the database keys of those documents that contain the word:
If we build this as a hash table, we have O access to a row in the table. If we merely keep the rows in sorted order, we have O[log W] access to any row in the table, where W is the number of words in our vocabulary. Performance does not vary with the number of documents in the collection... or does it? Just about every English document will contain the word "the" and therefore simply returning the value of the
Word Document IDs absquatulate 612 bedizen 36, 9211 cryptogenic 9 dactylioglyph 7214 exheredate 57, 812, 4010 feuilleton 87, 349, 1203 genetotrophic 5000 hartebeest 710 inspissate 549, 21, 3987 ... samoyed 17, 91, 1000, 3492 sesquipedalian 723 the 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,... uberous 6, 800 velutinous 45, 2307 widdershins 7300 xenial 3611 ypsiliform 5607 zibeline 4782
document_idscolumn for the word "the" will take O[N] time, where N is the number of documents in the corpus. This row isn't useful anyway because it isn't selective, i.e., we could get the same information almost as fast with a sequential scan of the documents table, collecting all the document IDs. While indexing a document, a full-text search system will refer to a list of stopwords, words that are too common to be worth indexing. For standard English, the stopword list includes such words as "a", "and", "as", "at", "for", "or", "the", etc.
Inserting a new document into the collection will be slow. We'll have to go through the document, word by word, and update as many rows in the index as there are distinct words in the document. But that extra work at insertion time pays off in a reduction in query time from O[N] to O.
Given a data structure of the preceding form, we can quickly find all documents containing the word "running". We can also quickly find all documents containing the word "shoes". We can intersect these result sets quickly, giving us the documents that contain both "running" and "shoes". With some fancier indexing data structures we can restrict our search to documents that contain the contiguous phrase "running shoes" as opposed to documents where those words appear separately. But suppose that there are 1000 documents in the collection containing these two words. Which are the most relevant to the user's query of "running shoes"?
We need a new data structure: the word-frequency histogram. This will tell us which words occur in a document and how frequently they occur in a way that is easily adjusted for the total length of a document.
Here's a word-frequency histogram for the first sentence of Tolstoy's Anna Karenina:
Word Count Frequency all 1 1/16 another 1 1/16 but 1 1/16 each 1 1/16 families 1 1/16 family 1 1/16 happy 1 1/16 in 1 1/16 is 1 1/16 its 1 1/16 one 1 1/16 own 1 1/16 resemble 1 1/16 unhappy 2 2/16 way 1 1/16
One might argue that this sentence makes better literature as "All happy families resemble one another, but each unhappy family is unhappy in its own way," but the full-text search software finds it more useful in this form.
After the crude histogram is made, it is typically adjusted for the prevalence of words in standard English. So, for example, the appearance of "resemble" is more interesting than "happy" because "resemble" occurs less frequently in standard English. Stopwords such as "is" are thrown away altogether. Stemming is another useful refinement. In the index and in queries we convert all words to their stems. The stem word for "families", for example, is "family". With stemming, a query for "families" would match a document containing "family" and vice versa.
Given a body of histograms it is possible to answer queries such as "Show me documents that are similar to this one" or "Show me documents whose histogram is closest to a user-entered string." The inter-document similarity query can be handled by comparing histograms already stored in the text database. The search string "platinum mines in New Zealand" might be processed first by throwing away the stopwords "in" and "new". By using histogram comparison, the software would deliver articles that that have the most occurrences of "platinum", "mines", and "Zealand". Suppose that "Zealand" is a rarer word than "platinum". Then a document with one occurrence of "Zealand" is favored over one with one occurrence of "platinum". A document with one occurrence of each word is preferred to an article where only one of those words shows up. A document that contains only the words "platinum mines Zealand" is a better match than a document that contains 100,000 words, three of which happen to match the query terms.
The power of this kind of system is enticing and raises the question "Can we run our entire Web application from a specialized full-text search database system?" Indeed, why not chuck the RDBMS altogether?
We don't chuck the RDBMS because we put it in to handle the problem of concurrency: two users trying to update the same item simultaneously. A better query tool is nice, but we can't adopt it as our primary database management system unless it handles the concurrency problem as well as the RDBMS.
A pragmatic approach would seem to start by keeping all the documents in the RDBMS: articles, user comments, discussion forum postings, etc. Either once per night or every time a new document was added, update a full-text search system's collection. Pages that are part of the standard user experience and workflow operate from the RDBMS. The search box at the upper right corner of every page, however, queries against the full-text search system. Let's call this a split-system design.
**** insert figure *****
One argument against the split-system approach is that two copies of the document collection are being kept. In an age of $200 disk drives of absurdly high capacity, this isn't a powerful argument. It is nearly impossible to fill a modern disk drive with words typed by humans. One can fill up a disk drive with video or audio streams, but not text. And in any case some full-text search systems can build an index to a document collection without themselves keeping the original document around, i.e., you would in fact have only one copy of the document in the RDBMS.
A second argument against using RDBMS and full-text search systems simultaneously is that the collections will get out of sync. If the Web server crashes in the middle of an RDBMS transaction, all work is rolled back. If the Web server was simultaneously inserting a document into a full-text search system, it is possible that the full-text database will contain a document that is not in fact available on the main pages of the site—the site being generated from the RDBMS. Alternatively, the RDBMS insert might succeed while the full-text insert fails, leading to a document that is available on the site, but not searchable. This argument, too, ultimately lacks power. It is true that the RDBMS is a convenient and nearly foolproof means of managing transactions and concurrency. However, it is not the only way. If one were to hire sufficiently careful programmers and sufficiently dedicated system and database administrators, it would be possible to keep two databases in sync.
A third argument against the split system is the disparity of interfaces. Suppose that our RDBMS is Oracle. The Web developers know how to talk to Oracle through Active Server Pages. The desktop programmers know how to talk to Oracle through the C API. The marketing people know how to talk to Oracle through various reporting tools. Some individual users have figured out to talk to Oracle from standard desktop programs such as Microsoft Excel and Microsoft Access. The cost of bringing in a new programmer grows if you have to teach that person not only about an RDBMS, but also about specialized tools, each with its own library of interfaces.
However, the best argument against using both an RDBMS and a "bag-on-the-side" full-text search system is that the split system does not naturally support the kinds of queries that are necessary:
Consider a modern relational database management system. It offers a way of writing stuff down: CREATE TABLE and INSERT. It offers a way of executing software written in a procedural language: C, Java, or PL/SQL in the case of Oracle; any .NET-supported computer language in the case of Microsoft SQL Server.
Why couldn't one build a full-text search indexer inside the RDBMS? That's exactly what some of the commercial RDBMS vendors have done. Oracle was a pioneer in this area and the relevant Oracle product is called "Oracle Text".
In the preceding example, Oracle Text builds its own index on the
create table content ( content_id integer primary key, refers_to references content_raw, -- who contributed this and when creation_user not null references users, creation_date not null date, modified_date not null date, mime_type varchar(100) not null, one_line_summary varchar(200) not null, body clob, editorial_status varchar(30) check (editorial_status in ('submitted','rejected','approved','expired')) ); -- create an Oracle Text index (the product used to be called -- Oracle Context, hence the CTX prefixes on many procedures) create index content_text on content(body) indextype is ctxsys.context; -- let's look at opinions on running shoes from -- users who registered in the last 30 days, sorting -- results in order of decreasing relevance select score(1), content.content_id, content.one_line_summary, users.first_names, users.last_name from content, users where contains(body, 'running shoes', 1) > 0 and users.registration_date > current_timestamp - interval '30' day and content.creation_us er = users.user_id order by score(1) desc;
bodycolumn of the
contenttable. When a Text index is defined on a table it becomes possible to use the
containsoperator in a WHERE clause. The Oracle RDBMS SQL query processor is smart enough to know how to use the Text index to answer this query without doing a sequential table scan. It is possible to have more than one call to
containsin the same query. Thus the last argument of
containsis an integer identifying the query, in this case "1". It is possible to get a relevance score out in the select list or in an ORDER BY clause with the function
scoreand an argument identifying from which
containscall the score should be pulled.
Oracle Text is one of the more difficult and complex Oracle RDBMS
products to use. For example, if you want to be able to search for a
phrase that occurs in either the
body and combine the relevance score, you need to build a
Notice that the index itself is built on the column
ctx_ddl.create_preference('content_multi','MULTI_COLUMN_DATASTORE'); ctx_ddl.set_attribute('content_multi', 'COLUMNS', 'one_line_summary, body'); create index content_text on content(modified_date) indextype is ctxsys.context parameters('datastore content_multi');
modified_date, which is not itself indexed. The call to
ctx_ddl.set_attributein which the COLUMNS attribute is set is what determines which columns get indexed.
For an example of a system that tackles the challenge of indexing text from disparate Oracle tables, see http://philip.greenspun.com/seia/examples-search/site-wide-search
Oracle Text also has the property that its default search mode is
exact phrase matching. A user who types "zippy pinhead" into a search
engine will expect to find documents that contain the phrase "Zippy
the Pinhead". This won't happen if your script passes the raw user
query right through to the
Contains operator. More
problematic is what happens when a user types a query string that
contains characters that Oracle Text treats specially. This can
result in an error being raised by the SQL query and a "Server Error
500" returned to the user if you don't catch the error in your
procedural script. It would be nice if Oracle Text had a built-in
procedure called "ProcessRawQueryFromWebForm" or something. But it
doesn't, at least we couldn't find one in the documentation for
Oracle version 10g. The next best thing is a procedure called
pavtranslate, available from http://technet.oracle.com/sample_code/products/text/htdocs/query_syntax_translators/query_syntax_translators.html.
Oracle Text, via the "INSO filters" option, has the capability to index a remarkable variety of documents in a BLOB column. For example, the software can recognize a Microsoft Excel spreadsheet, pull the text out and add it to the index. At the same time it is smart enough to know when to ignore a document entirely, e.g., if the BLOB column were filled with a JPEG photograph.
Include your client's answers to Exercise 1 in this document.
Record user search strings in an RDBMS table and let admins see what the popular search terms are (by the day, week, or month). Make sure to highlight any searches that resulted in the user seeing a page "No documents matched your query". Ask yourself whether it would be ethical to implement a facility whereby the site administrators could view a report of search strings and the users who typed them in.
/doc/search file to reflect the addition of
Make sure that your main documentation page links to the docs for this new module.
Notice the user-agent header at the end: Googlebot/2.1, with its included suggestion that Web publishers check http://www.google.com/bot.html for more information. Because some search engines archive what they index, you would not want to provide registration-free access to content that is truly private to members. In theory a
188.8.131.52 - - [10/Feb/2005:02:13:15 -0500] "GET /sql/triggers.html HTTP/1.0" 200 0 "" "Googlebot/2.1 (+http://www.google.com/bot.html)"
<META NAME="ROBOTS" CONTENT="NOARCHIVE">placed in the HEAD of your HTML documents would prevent search engines from archiving the page, but robots are not guaranteed to follow such directives.
Some search engines allow you to provide indexing hints and hints for presentation once a user is looking at a search results page. For example, in the table of contents page for this book, we have the following META tags in the HEAD:
The "keywords" tag adds some words that are relevant to the document, but not present in the visible text. This would help someone who decided to search for "MIT 6.171 textbook", for example. The "description" tag can be used by a search engine when summarizing a page. If it isn't present, a search engine may show the first 20 words on the page or follow some heuristics to build a reasonable summary. These tags have been routinely abused. A publisher might add popular search terms such as "sex" to a site that is unrelated to those terms, in hopes of capturing more readers. A company might add the names of its competitors as keywords. Users wouldn't see these dirty tricks unless they went to the trouble of using the View Source command in their browser. Because of this history of abuse, many public search engines ignore these tags.
<meta name="keywords" content="web development online communities MIT 6.171 textbook"> <meta name="description" content="This is the textbook for the MIT course Software Enginering for Internet Applications">
See http://searchenginewatch.com/resources/metasuits.html for accounts of various lawsuits that have been fought over the contents of meta tags.A particularly destructive practice is "cloaking", in which a Web server is programmed to send entirely different pages to the search engines and human users (identified by having "Mozilla" or "MSIE" in their user-agent headers). An unscrupulous publisher would find out what are the current most popular search terms on public search engines (http://searchenginewatch.com/facts/searches.html offers a list of windows into various search services), string those terms together, and serve a mishmash of those to search engines. Meanwhile, when a regular user came to the site the page presented would be a banal product pitch. Google threatens to ban from their index any site that engages in this practice.
/staging/directory on your server. This content isn't exactly secret, but neither do you want it released before its time. Nor do you want two copies of the same content in the Google index, one copy in the staging area and one copy in its final position on the site.
You need to read the Standard for Web Exclusion, a protocol for
communication between Web publishers and Web crawlers, available from
the publisher put a file on your site, accessible at
with instructions for robots. Here's an example that excludes the
User-agent: * # let's keep the robots away from our half-baked stuff Disallow: /staging
User-agentline specifies for which robots the injunctions are intended. Each
Disallowasks a robot not to look in a particular directory. Nothing requires a robot to observe these injunctions, but the standard seems to have been adopted by all the major indices nonetheless.
Visit http://www.ibm.com/robots.txt to get a bit of insight into how a site may evolve over time.
/robots.txtthat excludes robots from appropriate portions of your server. Put some comments at the top of the file explaining who created this, when it was created, and the rationale behind the exclusions.
If you're doing a 100 percent database-backed content management system, you are free to put the content of the robots.txt file in the RDBMS, just so long as it is served when the URI
Suppose that Joe User visits photo.net and types "At what shutter speeds is a tripod required?" into the search box. Is it reasonable to assume that Joe wants to read a 10,000-word document that contains the answer to this question? Or would Joe rather get ... the answer to his question. The answer "at shutter speeds slower than 1/lens-focal-length" is a lot smaller and quicker to read than a document containing this information.
To get a feel for how a question answering system can be built on top of a full-text indexer, read "Scaling Question Answering to the Web" (Cody Kwok, Oren Etzioni, Dan Weld; WWW10 conference, May 2001; http://www.www10.org/cdrom/papers/pdf/p120.pdf), which describes a system built at the University of Washington. This system includes all of the expected linguistic gymnastics plus code to sort out the Internet-specific problem of noise. Traditional information retrieval systems are designed to work with authoritative documents, e.g., the Encyclopedia Britannica, a binder of corporate policies, or the design notes for a jetliner. The documents in the corpus are presumed to be authoritative. There won't be four different answers, three of them flat wrong, to questions such as "In what year was Gioacchino Rossini born?", "How many signatures are required for a purchase of $57,300?", or "How wide is the wingspan of the airplane?" With user-authored content in an online community, however, it seems safe to assume that while the average answer is likely to be correct, for every 100 correct answers there will be at least three or four incorrect ones. Even when the data require no interpretation, there will be typos. For example, a Google search for "rossini 1792-1868" returned 50,900 documents in February 2005; a search for "rossini 1792-1869" returned 43 documents. A question-answering system built on top of lightly moderated user-authored content will have to exercise the same sort of judgment as do humans: How many documents contain Answer A versus Answer B? What is the relative authority of conflicting documents? Which of two conflicting documents is more recent?
Mobile Internet devices put an even greater stress on information retrieval. Connection speeds are slower. Screens are smaller. It isn't practical for a user to drill down into 20 documents returned by a search engine as possibly relevant to a query, especially if the user is driving a car and using a voice browser.
If you want to emerge as a hero from the dust of the next Internet collapse, work on information retrieval.
The search design and documentation should be a team effort, and take one to two hours.
The luckiest teams will be able to get their search systems up and running in an hour. Unlucky teams using difficult-to-install search systems may require the better part of a day. Teams with a single content table and no static html pages should be able to build the basic page scripts in one to two hours. Additional time will be required for designs that manage content across multiple tables and the filesystem.
The remaining exercises should be doable in 2 to 4 programmer-hours.