Chat

part of the ArsDigita Community System by Philip Greenspun

Why is Chat Useful?

Why is a chat server useful? As traditionally conceived, it isn't. The Internet is good at coordinating people who are separated in space and time. If a bunch of folks could all agree to meet at a specific time, the telephone would probably be a better way to support their interaction.

Commercial chat server vendors these days pitch their products for customer service. Joe User is presumed to be using a computer and his one phone line, so he can't just pick up the phone and call the company because he'd have to drop his Internet connection. Also, a customer service person can handle four or five sessions at once if they are conducted via something like AOL Instant Messenger rather than via the telephone.

Then why can't companies that want to do 1:1 conversations just use AOL Instant Messenger (AIM)? AIM works best with additional software installed on the user's machine. This software is free, it shipped with Netscape 4.x, and it is certainly available to AOL service customers. But it is not universal, and you can't be sure that someone connecting to your Web site has the AIM client. An AIM-compatible Java applet is available. Since a Java applet can only connect to the server from which it was downloaded, you must reference this by including a link to http://toc.oscar.aol.com (the first thing this page does is execute some JavaScript that bounces your browser back one page, so from the user's point of view it looks like the Java client just popped up and they never left the real publisher's page).

A second limitation to the "just use AIM" approach for customer service is that AIM doesn't provide convenient canned responses. In theory, you might be able to come up with 10 or 15 canned responses that would cover 50% of the inquiries. Then the same staff can handle more customers.

A third limitation to the "just use AIM" approach is that you can't have rich content, e.g., in-line images, because AIM is not a Web browser and can't render HTML.

A fourth reason that AIM isn't adequate is that it is tough to measure the effectiveness of your staff. The conversations aren't centrally logged (though your staff could save them individually to disk). The conversations aren't tied to your users table.

A good example of a company that has addressed many of these issues is liveperson.com. For $250 per customer service agent per month (prices as of April 1999), they will just do everything for you on their server farm. There are a few problems with LivePerson:

  1. The information in their database isn't automatically sync'd with the information in your database; you'll have two users tables, one on their server and one on your server.
  2. They don't solve the "public chat room" problem for those who wish to have this; LivePerson is limited to customer service and is best thought of as "a more specialized AIM".

The Big Idea

We have our own chat server because:

A potential limitation of our system is that Oracle's consistency and durability guarantees are expensive. If we had 50 submissions per second and 1000 queries for chatroom updates per second, we would need a super huge Unix machine. In theory, a custom-built chat server ought to be capable of higher performance for a given piece of hardware. In practice, the commercial systems aren't programmed properly and they crash. They also get unreliable and slow when, for example, the number of chat rooms is large.

One Table or Many Tables?

We've had good luck with the /bboard system since 1995. This uses one table to store all the messages, one message per row with a column for topic. This has the virtue of simplicity. This has the virtue of cleanliness in that all the identically structured data in the system is in a single table. This has the virtue of easy searchability since Oracle is designed to build indices on one table at a time. This has the virtue of transactionality for creation of bboard topics and deletion of bboard topics; no tables are created or dropped when topics are created or dropped.

A bad thing about the one-table structure is fragmentation and lack of recovery of disk space when postings are deleted. For example, suppose that you drop an entire bboard topic with 2000 archived messages. This will result in 2000 random deletions from perhaps 1000 Oracle blocks. The table won't shrink any, i.e., it will still consume just as many disk blocks. The free space might not even be used on subsequent inserts, depending on what percentage of the block is now free. According to Oracle8 Tuning, page 146, index entries for deleted rows aren't reclaimed until you do an "alter index index_name rebuild". In practice, it seems that the bboard table on photo.net hasn't suffered too badly from this problem (after three months).

Oracle has b-tree indices that are a maximum of four levels deep (header, two intermediate levels, leaf nodes). So you don't get O(log n) access time through an index if the table has an outrageous number of rows. The most important thing that we'll want to do is query by chat room key and date-time. If we were to build a concatenated index on these values, we'd probably have the header block taken up with all the chat room names. Then the next level would be ranges of times for a particular chat room. Then the third level could be more ranges of times. Then the leaf nodes would point to the rowids for specific dates. This could be pretty darn selective if Oracle is smart about building the index.

Archive or not?

We have to drive the entire system design from a publishing decision: are we interested in seeing archives of chat sessions? If we are, then the one-table structure makes a lot more sense. We'll want to do a single SQL query to see everything that a user has contributed. We'll want to do a single SQL query to see how a customer service person is doing.
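To make the one-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Oracle. The table name chat_msgs, its columns, and the helper functions are hypothetical illustrations, not the actual ACS data model. It shows a concatenated index on room and date-time, plus the kind of single query per user that archiving makes possible.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not the ACS data model.
# SQLite stands in for Oracle so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table chat_msgs (
        chat_msg_id   integer primary key,
        chat_room_id  integer not null,
        user_id       integer not null,
        creation_date text    not null,   -- ISO 8601 timestamp
        msg           text    not null
    );
    -- concatenated index for the most common query: room plus date-time
    create index chat_msgs_by_room_date
        on chat_msgs (chat_room_id, creation_date);
""")

conn.executemany(
    "insert into chat_msgs (chat_room_id, user_id, creation_date, msg)"
    " values (?, ?, ?, ?)",
    [
        (142, 1, "1999-04-01 06:42:14", "Anyone there?"),
        (142, 2, "1999-04-01 06:55:33", "anyone here?"),
        (143, 1, "1999-04-01 07:27:04", "I'm here."),
    ],
)


def recent_in_room(room_id, since):
    """Messages posted to one room at or after a given time, oldest first."""
    return conn.execute(
        "select user_id, msg from chat_msgs"
        " where chat_room_id = ? and creation_date >= ?"
        " order by creation_date",
        (room_id, since),
    ).fetchall()


def everything_by_user(user_id):
    """Everything one user has contributed, across all rooms, in one query."""
    return conn.execute(
        "select chat_room_id, msg from chat_msgs"
        " where user_id = ? order by creation_date",
        (user_id,),
    ).fetchall()
```

With one table per chat room, everything_by_user would instead need one query per room, or a union across all of them.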

A disadvantage of archiving is that it chews up disk space. Suppose we're America Online and 1 million subscribers chat or AIM every day. Let's further assume that each person types 50 rows of stuff, 100 bytes each. That means our table will grow by 50 million rows and 5 GB every day. After 20 days, we'll begin to bump up against the billion-row table size that data warehouse experts suggest as a practical limit.

Does that mean we scrap our simple one-table system? Not yet. Designing systems for the highest volume sites is gratifying in a nerd ego sense, but it is a mistake if it leads to fewer features for the user and the publisher. Remember that the future is not mass media but personalized content and lots of interesting mid-sized communities.

Let's reconsider a community like photo.net where the public bboards get no more than 2000 new messages per day, each one an average of 650 characters. That's an average of 1.3 Mbytes of new content every day, implying roughly 7000 days before a 9 GB hard drive is filled up and 500,000 days before we've built a billion-row table. Obviously this isn't going to be a problem to archive indefinitely.

Let's look at the photo.net community another way. We have about 25,000 daily visits. Suppose that each of those 25,000 people used a photo.net chat or instant messaging service to send as much communication as they send via email. Assume 100 messages per day and 200 bytes per message and all 25,000 people participating. That's 0.5 Gbytes per day. We fill up a modern (April 1999) 36 GB disk drive after two months.
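The back-of-the-envelope estimates in this section can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, a gigabyte is assumed to be 10^9 bytes, and per-row and index overhead in the database is ignored.

```python
GB = 10**9                      # assumption: decimal gigabytes
BILLION_ROWS = 1_000_000_000    # practical table-size limit cited in the text


def daily_growth(messages_per_day, bytes_per_message):
    """Return (rows added per day, bytes added per day)."""
    return messages_per_day, messages_per_day * bytes_per_message


def days_until(limit, growth_per_day):
    """Days before a limit (rows or bytes) is reached at a constant rate."""
    return limit / growth_per_day


# AOL scale: 1 million users, 50 rows each, 100 bytes per row
aol_rows, aol_bytes = daily_growth(1_000_000 * 50, 100)

# photo.net bboards: 2000 messages per day, 650 characters each
bb_rows, bb_bytes = daily_growth(2000, 650)

# photo.net if chat matched email volume: 25,000 users, 100 messages, 200 bytes
chat_rows, chat_bytes = daily_growth(25_000 * 100, 200)

print("AOL:    GB/day =", aol_bytes / GB,
      " days to a billion rows =", days_until(BILLION_ROWS, aol_rows))
print("bboard: MB/day =", bb_bytes / 10**6,
      " days to fill 9 GB =", round(days_until(9 * GB, bb_bytes)),
      " days to a billion rows =", days_until(BILLION_ROWS, bb_rows))
print("chat:   GB/day =", chat_bytes / GB,
      " days to fill 36 GB =", days_until(36 * GB, chat_bytes))
```

Changing the three daily_growth calls is enough to rerun the estimate for a community of a different size.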

So it seems that on popular public sites we won't be able to store everything that users type. At the same time, a real customer's interaction with a real customer service person ought to be archived forever (so that you can ask questions like "show me how many users who talked to Kathy eventually bought an item").

How do we accomplish pruning and tuning?

If we can be sure that we always have at least twice as much disk space as the chat that we want saved, we can do the following:

This is a pretty risky operation and we'd want a PL/SQL program to do it rather than rely on a human dba. Chat could be down for as much as an hour so we'd want to do it on an early Sunday morning at the beginning of each month (or something similar). We'll need to develop the Tcl scripts so that they can say "Chat server is being maintained right now; try back in one hour". The pruning/tuning should be done by an AOLserver ns_schedule_proc that (1) sets the maintenance flag, (2) executes the PL/SQL proc, (3) resets the maintenance flag.
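The three-step flag protocol can be sketched in Python as follows. This is not the AOLserver implementation: the names maintenance, run_maintenance, and serve_chat_page are hypothetical, the PL/SQL procedure is replaced by a do-nothing placeholder, and resetting the flag in a finally clause (so a failed job cannot leave chat locked out) is an added design assumption.

```python
import threading

# Stands in for the maintenance flag; an Event is safe to read from
# many request-handling threads at once.
maintenance = threading.Event()

MAINTENANCE_NOTICE = (
    "Chat server is being maintained right now; try back in one hour"
)


def prune_and_compact():
    """Placeholder for the PL/SQL procedure; does nothing in this sketch."""


def run_maintenance(job=prune_and_compact):
    """The body of the scheduled procedure."""
    maintenance.set()          # (1) set the maintenance flag
    try:
        job()                  # (2) execute the pruning/tuning procedure
    finally:
        maintenance.clear()    # (3) reset the flag, even if the job failed


def serve_chat_page(render):
    """Every chat page checks the flag before touching the chat tables."""
    if maintenance.is_set():
        return MAINTENANCE_NOTICE
    return render()
```

Any page script that reads or writes chat data would go through a check like serve_chat_page, so users see the notice rather than an error while the table is being rebuilt.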

Following this operation, the chat table will be as compact as possible.

Types of chat we need to support

Public chat rooms. These are open to everyone in the users_active view. For moderation, we check perms using the permissions package (where module = "chat" and submodule = chat_room_id).

Private chat rooms. These are open to people in particular user groups. We check perms using the permissions package.

For either kind of chat room, we should support moderated chat. That is, a posting doesn't go live until it has been approved by someone who has the "moderator" or "administrator" role in the user group associated with a private chat room or, in the case of a public chat room, by someone who is a member of the appropriate chat moderation group.

We want to support 1:1 messages for customer support, if nothing else. We need one layer on top of this to make sure that users can find an appropriate chat partner. For example, if Bill User says that he needs support for his widget, the system has to find the least busy authorized widget support person and start a 1:1 chat session between Bill and that person.
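One way to picture the matching layer is the sketch below. The Agent class, its fields, and the assign function are hypothetical, and "least busy" is assumed to mean "fewest open sessions"; a real system would read this information from the database and the permissions package.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    topics: set                       # products this agent may support
    open_sessions: list = field(default_factory=list)


def assign(agents, user, topic):
    """Start a 1:1 session between user and the least busy authorized agent.

    Returns the chosen agent, or None if nobody is authorized for the topic.
    Ties go to whichever qualified agent appears first in the list.
    """
    qualified = [a for a in agents if topic in a.topics]
    if not qualified:
        return None
    agent = min(qualified, key=lambda a: len(a.open_sessions))
    agent.open_sessions.append(user)
    return agent


agents = [
    Agent("Kathy", {"widget"}, ["user-1", "user-2"]),
    Agent("Sam", {"widget", "gadget"}, ["user-3"]),
    Agent("Lee", {"gadget"}),
]
chosen = assign(agents, "Bill User", "widget")
print(chosen.name if chosen else "no agent available")
```

Lee is idle but not authorized for widgets, so the choice is between Kathy and Sam, and Sam has fewer open sessions.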

For public community sites where nothing is being sold or supported, a publisher might wish to limit the load on the server from all of this 1:1 chatting. In that case, we set an ad.ini file parameter to just bounce users over to the AOL Instant Messenger infrastructure.

Options for the publisher

Some options are configurable per-room, e.g., a room's private group and moderation settings. The per-system options are configurable in the ad.ini file. The big items:
[ns/server/yourservername/acs/chat]
EnabledP=1
; SystemName=Chat
; how long to cache the postings to a room (updates force a cache update
; so this theoretically could be 5 hours or whatever)
CacheTimeout=120
; how long will a room's properties (e.g., private group, moderation) be cached
RoomPropertiesCacheTimeout=600
UsersCanCreateRoomsP=0
; set to 1 if you want most recent postings on top; this is the way that 
; theglobe.com and other familiar chat systems do it (keeps users from 
; having to scroll to see new msgs)
MostRecentOnTopP=1
; do we want to offer users the option of sending private messages?
PrivateChatEnabledP=0
; do we offer users a link to a chat room's history?
ExposeChatHistoryP=1
; how many messages to display when users choose short medium or long
NShortMessages=25
NMediumMessages=50
NLongMessages=75
; show a picture at the index page and in individual rooms?
DefaultDecoration=<a href="/images/pcd0865/rachel-mouth-3.tcl"><img HEIGHT=134 WIDTH=196 src="/images/pcd0865/rachel-mouth-3.1.jpg" ALT="Mouth."></a>
; how often the javascript version should refresh itself
JavaScriptRefreshInterval=5

Linking into the system from a static page

If you want to link into the chat system from a static page elsewhere on your site, do it by linking to "enter-room.tcl" rather than the more obvious "chat.tcl". That way other users will see people coming in.

<a href="/chat/enter-room.tcl?chat_room_id=142">Chat</a>

Ensuring high performance

Inevitably a system like this will require some polling, either by Java clients, HTTP Refresh headers, or JavaScript. If we have 100 people in a chat room and they are all polling every 2 seconds, we don't want to have to buy an 8-CPU computer to support 50 queries per second right into Oracle. So we make heavy use of util_memoize. When a new posting is made, we force a cache update with util_memoize_flush. More than 95% of the time, a user is getting results from AOLserver's virtual memory and not from the database.
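The caching pattern described above can be illustrated with a small Python sketch. This is not the code behind util_memoize or util_memoize_flush; the functions memoize and memoize_flush, the in-memory postings dictionary, and the query counter are all stand-ins invented for the illustration. The 120-second maximum age mirrors the CacheTimeout value in the sample configuration.

```python
import time

_cache = {}  # key -> (time stored, value)


def memoize(key, compute, max_age, now=time.monotonic):
    """Return the cached value for key if it is younger than max_age
    seconds; otherwise call compute(), cache the result, and return it."""
    hit = _cache.get(key)
    if hit is not None and now() - hit[0] < max_age:
        return hit[1]
    value = compute()
    _cache[key] = (now(), value)
    return value


def memoize_flush(key):
    """Drop a cached value so the next read goes back to the source."""
    _cache.pop(key, None)


# Stand-ins for the database: a dict of postings and a query counter.
stats = {"db_queries": 0}
postings = {142: ["hi"]}


def fetch_postings(room_id):
    stats["db_queries"] += 1          # each call represents one Oracle query
    return list(postings[room_id])


def room_contents(room_id):
    """What a polling client calls every couple of seconds."""
    return memoize(("room", room_id),
                   lambda: fetch_postings(room_id), max_age=120)


def post(room_id, msg):
    postings[room_id].append(msg)
    memoize_flush(("room", room_id))  # a new posting forces a cache update


for _ in range(100):                  # 100 polls between postings...
    room_contents(142)
print(stats["db_queries"])            # ...cost a single database query
```

Because every new posting flushes the cache entry, the timeout only matters for quiet rooms, which is why the configuration comment says it could in theory be hours.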

Why the HTML version can't autorefresh

The HTML page cannot have a Refresh: header for client-pull autorefresh. If you did this, the user would be at risk of losing what he or she was typing into the post form.

If you care about database performance

An active chat server is going to result in a fair number of Oracle transactions. You'll at least want to keep chat tables on a new separate physical disk drive. You'll want indices on those tables to reside on yet another new disk drive. Since every disk drive on a 24x7 server must be mirrored, that means you need four new disk drives to implement this module.

Practical Experience from photo.net

We tested the service on photo.net. Here's a transcript excerpt from the first day:

Justin (06:42:14) Anyone there?
Justin (06:46:41) alright...I'll assume everyone is asleep. Sleep well. Goodnight.
Tommy (06:55:33) anyone here?
Tommy (06:58:09) so this is supposed to be a camera shopping chat room...
Justin (07:27:04) I'm here.
Justin (07:28:25) But...I guess you were here...like a half hour ago, so I'm going to take my circadain rhythm problem to the donut shop.
Lim (09:59:23) hi
Lim (09:59:51) anybody around?
Eve (10:19:10) Hi there.
Eve (10:24:14) anybody around? Are you still there Lim?

Kind of makes you feel that all those long hours spent programming and maintaining Unix and Oracle were worthwhile...
philg@mit.edu