Scaling Gracefully

part of Software Engineering for Internet Applications by Eve Andersson, Philip Greenspun, and Andrew Grumet; revised February 2005
Let's look again at the passage from A Pattern Language, quoted in the "Planning" chapter:
"It is not hard to see why the government of a region becomes less and less manageable with size. In a population of N persons, there are of the order of N2 person-to-person links needed to keep channels of communication open. Naturally, when N goes beyond a certain limit, the channels of communication needed for democracy and justice and information are simply too clogged, and too complex; bureaucracy overwhelms human process. ...

"We believe the limits are reached when the population of a region reaches some 2 to 10 million. Beyond this size, people become remote from the large-scale processes of government. Our estimate may seem extraordinary in the light of modern history: the nation-states have grown mightily and their governments hold power over tens of millions, sometimes hundreds of millions, of people. But these huge powers cannot claim to have a natural size. They cannot claim to have struck the balance between the needs of towns and communities, and the needs of the world community as a whole. Indeed, their tendency has been to override local needs and repress local culture, and at the same time aggrandize themselves to the point where they are out of reach, their power barely conceivable to the average citizen."

Let's also remind ourselves of the empirical evidence that enormous online communities cannot satisfy every need. America Online has not subsumed all the smaller communities on the Internet. People unsubscribe from mailing lists when the traffic level becomes too high. Early adopters of USENET discussion groups (called "Netnews" or "Newsgroups" in the 1980s and 1990s, and "Google Groups" to most people in 2005) stopped participating because they found the utility of the groups diminished when the community size grew beyond a certain point.

So the good news is that, no matter how large one's competitors, there will always be room for a new online community. The bad news is that growth results in significant engineering challenges. Some of the challenges boil down to simple performance engineering: How can one divide the load of supporting an Internet application among multiple CPUs and disk drives? These can typically be solved with money, even in the absence of any cleverness. The deeper challenges cannot be solved with money and hardware; they concern whether the members of a much larger community can continue to interact in the purposeful manner that makes an online learning community worth having.

In this chapter we will first consider the straightforward hardware and software issues, then move on to the more subtle challenges that grow progressively more difficult as the user community expands.

Tasks in the Engine Room

Here are the fundamental tasks that are happening on the servers of virtually every interactive Internet application:

  * persistence: storing data and retrieving it again, almost always via a relational database
  * abstraction ("business logic"): procedures shared by many pages
  * presentation: merging query results with templates suited to the user's preferences and client software
  * HTTP service: talking to the user's browser
  * transport-layer encryption (where privacy is required)

At a modestly visited site, it would be possible to have one CPU performing all of these tasks. In fact, for ease of maintenance and reliability it is best to have as few, and as simple, servers as possible. Consider your desktop PC, for example. How long has it been since the hardware failed? If you look into a room with 50 simple PCs or single-board workstations, how often do you see one that is unavailable due to hardware failure? Suppose, however, that you combine computers to support your application. If one machine is 99 percent reliable, a site that depends on 10 such machines will be only 0.99^10, or about 90 percent, reliable. The probability analysis here is the same as flipping coins, but with a heavy 0.99 bias towards heads: you need to get 10 heads in a row in order to have a working service. What if you needed 100 machines to be up and running? That's only going to happen 0.99^100 of the time, or roughly 37 percent.
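To check the arithmetic, the database itself is as good a calculator as any; this throwaway query (Oracle syntax) reproduces the two figures above:

select power(0.99, 10) as ten_machines,		-- about 0.904
       power(0.99, 100) as hundred_machines	-- about 0.366
from dual;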

It isn't challenging to throw hardware at a performance problem. What is challenging is setting up that hardware so that the service is working if any of the components are operational rather than only if all of the components are operational.

We'll examine each layer individually.

Persistence Layer

For most interactive Web applications, the persistence layer is a relational database management system (RDBMS). The RDBMS server program is parsing SQL queries, writing transactions to the disk, rooting around on the disk(s) for seldom-used data, gluing together data in RAM, and returning it to the RDBMS client program. The average engineer's top-of-the-head viewpoint is that RDBMS performance is limited by the speed of the disk(s). The programmers at Oracle disagree: "A properly configured Oracle server will run CPU-bound."

Suppose that we have a popular application and need 16 CPUs to support all the database queries. And let's further suppose that we've decided that the RDBMS will run all by itself on one or more physical computers. Should we buy 16 small computers, each with one CPU, or one big computer with 16 CPUs inside? The local computer shop sells 1-CPU PCs for about $500, implying a total cost of $8000 for 16 CPUs. If we visit the Web site for Sun Microsystems (www.sun.com) we find that the price of a 16-CPU Sunfire 6800 is too high even to list, but if the past is any guide we won't get away for less than $200,000. We will pay 25 times as much to get 16 CPUs of the same power, but all inside one physical computer.

Why would anyone do this?

Let's consider the peculiarities of the RDBMS application. The RDBMS server talks to multiple clients simultaneously. If Client A updates a record in the database and, a split-second later, Client B requests that record, the RDBMS is required to deliver the updated information to Client B. If we were to spread the RDBMS server program across multiple physical computers, it is possible that Client A would be served from Computer I and Client B would be served from Computer II. A database transaction cannot be committed unless it has been written out to the hard disk drive, so in principle all that these computers need do is check the disk for updates before returning any results to Client B. But disk drives are 100,000 times slower than RAM. A single computer running an RDBMS keeps an up-to-date version of the commonly used portions of the database in RAM. So our multi-computer RDBMS server that ensures database coherency across processors via reference to the hard disk will start out 100,000 times slower than a single-computer RDBMS server.

Typical commercial RDBMS products, such as Oracle Parallel Server, work via each computer keeping copies of the database in RAM and informing each other of updates via high-speed communications networks. The machine-to-machine communication can be as simple as a high-speed Ethernet link or as complex as specialized circuit boards and cables that achieve memory bus speeds.

Don't we have the same problem of inter-CPU synchronization with a multi-CPU single box server? Absolutely. CPU I is serving Client A. CPU II is serving Client B. The two CPUs need to apprise each other of database updates. They do this by writing into the multiprocessor machine's shared RAM. It turns out that the CPU-CPU bandwidth available on typical high-end servers circa 2002 is 100 Gbits/second, which is 100 times faster than the fastest available Gigabit Ethernet, FireWire, and other inexpensive machine-to-machine interconnection technologies.

Bottom line: if you need more than one CPU to run the RDBMS, it usually makes most sense to buy all the CPUs in one physical box.

Abstraction Layer

Suppose that you have a complex calculation that must be performed in several different places within a computer program. Most likely you'd encapsulate that calculation into a procedure and then call that procedure from every part of the program where the calculation was required. The benefits of procedural abstraction are that you only have to write and debug the calculation code once and that, if the rules change, you can be sure that by updating the single procedure you've updated your entire application.

The abstraction layer is sometimes referred to as "business logic". Something that is complex and fundamental to the business ought to be separated out so that it can be used in multiple places consistently and updated in one place if necessary. Below is an example from an e-commerce system that Eve Andersson wrote. This system offered substantially all of the features of amazon.com circa 1999. Eve expected that a lot of ham-fisted programmers who adopted her open-source creation would be updating the page scripts in order to give their site a unique look and feel. Eve expected that laws and accounting procedures regarding sales tax would change. So she encapsulated the looking up of sales tax by state, the figuring out if that state charges tax on shipping, and the multiplication of tax rate by price into an Oracle PL/SQL function:

create or replace function ec_tax
 (v_price IN number, v_shipping IN number, v_order_id IN integer) 
return number
IS
	taxes			ec_sales_tax_by_state%ROWTYPE;
	tax_exempt_p		ec_orders.tax_exempt_p%TYPE;
BEGIN
	SELECT tax_exempt_p INTO tax_exempt_p
	FROM ec_orders
	WHERE order_id = v_order_id;

	IF tax_exempt_p = 't' THEN
		return 0;
	END IF;	
	
	SELECT t.* into taxes
	FROM ec_orders o, ec_addresses a, ec_sales_tax_by_state t
	WHERE o.shipping_address=a.address_id
	AND a.usps_abbrev=t.usps_abbrev(+)
	AND o.order_id=v_order_id;

	IF nvl(taxes.shipping_p,'f') = 'f' THEN
		return nvl(taxes.tax_rate,0) * v_price;
	ELSE
		return nvl(taxes.tax_rate,0) * (v_price + v_shipping);
	END IF;
END;  
The Web script or other PL/SQL procedure that calls this function need only know the proposed cost of an item, the proposed shipping cost, and the order ID to which this item might be added (these are the three arguments to ec_tax). That sales taxes for each state are stored in the ec_sales_tax_by_state table, for example, is hidden from the rest of the application. If an organization that adopted this software decided to switch to using third-party software for calculating tax, that organization would need to change only this one function rather than wading through hundreds of Web scripts looking for tax-related code.
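For instance, a checkout page script could compute the tax to display with a query like the following; the price, shipping charge, and order number are made-up values:

select ec_tax(29.95, 4.50, 81340) as tax_to_display
from dual;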

Should the abstraction layer run on its own physical computer? For most applications, the answer is "no". These procedures are not sufficiently CPU-intensive to make splitting them off onto a separate computer worthwhile in terms of system administration effort and increased vulnerability to hardware failure. What's more, these procedures often do not even warrant a new execution environment. Most procedures in the abstraction layer of an Internet service require intimate access to relational database tables. That access is fastest when the procedures are running inside the RDBMS itself. All modern RDBMSes provide for the execution of standard procedural languages within the database server. This trend was pioneered by Oracle with PL/SQL and then Java. With the latest Microsoft SQL Server one can supposedly run any .NET-supported computer language inside the database.

When should you consider a separate environment ("application server" process) for the abstraction layer? Suppose that a big bank, the result of several mergers, has an IBM mainframe to manage checking accounts, an Oracle RDBMS for managing credit accounts, and a SQL Server-based customer support system. If Jane Customer phones up the bank and asks to pay her credit card bill from her checking account, a computer program needs to perform a transaction on the mainframe (debit checking), a transaction on the Oracle system (credit Visa card), and a transaction on the SQL Server database (payment handled during a phone call with Agent #451). It is technically possible for, say, a Java program running inside the Oracle RDBMS to connect to these other database management systems, but traditionally this kind of problem has been attacked by a stand-alone "application server", usually a custom-authored C program. The term "application server" has subsequently come to be used to describe the physical computers on which such a program might run and, in the late 1990s, execution environments for Java or C programs that served some function on a Web site other than page presentation or persistence.

Another example of where a separate physical application server might be desirable is where substantial computation must be performed. On most photo sharing sites, every time a photo is uploaded the server must create scaled versions in standard sizes. The performance challenge at the orbitz.com travel site is even more serious. Every user request results in the execution of a Lisp program written by MIT Artificial Intelligence Lab alumni at itasoftware.com. This Lisp program searches through a database of two billion flights and fares. The database machines that are performing transactions such as ticket bookings would collapse if they had to support these searches as well.

If separate physical CPUs are to be employed in the abstraction layer, should they all come in the same box or will it work just as well to rack and stack cheap 1-CPU machines? That rather depends on where state is kept. Remember that HTTP is a stateless protocol. Somewhere the server needs to remember things such as "Registered User 137 wants to see pages in the French language" and "Unregistered user who started Session 6781205 has placed the hardcover edition of The Cichlid Fishes in his or her shopping cart." In a multi-process multi-computer server farm, it is impossible to guarantee that a particular user will always be returned to the same running computer program, if for no other reason than you want the user experience to be robust to the failure of an individual physical computer. If session state is being kept anywhere other than in a cookie or the persistence layer (RDBMS), your application server programs will need to communicate with each other constantly to make sure that their ad hoc database is coherent. In that case, it might make sense to get an expensive multi-CPU machine to support the application server. However, if all the layers are stateless except for the persistence layer, the application server layer can be handled by multiple cheap one-CPU machines. At orbitz.com, for example, racks of cheap computers are loaded with identical local copies of the fare and schedule database. Each time a user clicks to see the options for traveling from New York to London, one of those application server machines is randomly selected for action.
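Here is a minimal sketch of what keeping session state in the persistence layer can look like. The table and column names are our own invention, not part of any particular toolkit; the browser holds nothing but a session ID in a cookie, and whichever front-end machine receives the request looks up the rest:

create table user_sessions (
	session_id	integer primary key,
	user_id		integer,		-- null until the user logs in
	language_pref	char(2),		-- e.g., 'fr'
	cart_contents	varchar(4000),		-- or, better, a separate line-items table
	last_hit	date default sysdate
);

-- run by whichever front-end machine happens to receive the request
select user_id, language_pref, cart_contents
from user_sessions
where session_id = :session_id_from_cookie;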

Presentation Layer

Computer programs in the presentation layer pull information from the persistence layer (RDBMS) and merge those results with a template appropriate to the user's preferences and client software. In a Web application these computer programs are doing an SQL query and merging the results with an HTML template for delivery to the user's Web browser. Such a program is so simple that it is often referred to as a "script". You can think of the presentation layer as "where the scripts execute".

The most common place for script execution is within the operating system process occupied by the Web server. In other words, the script language interpreter is built into the Web server. Examples of this architecture are Microsoft Internet Information Server (IIS) and Active Server Pages, AOLserver and its built-in Tcl interpreter, Apache and the mod_perl add-in. If you've chosen to use one of these popular styles of Web development, you've chosen to merge the presentation layer with the HTTP service layer, and spreading the load among multiple CPUs for one layer will automatically spread it for the other.

The multi-CPU box versus multiple-separate-box decision here should again be based on whether or not the presentation layer holds state. If no session state is held by the running presentation scripts, it is more economical to add CPUs inside separate physical computers.

HTTP Service

HTTP service per se is so simple that it hardly warrants its own layer, unless you're delivering audio and video files to a mass audience. A high performance pure HTTP server program such as Zeus Web Server (see www.zeus.com) can handle more than 6000 requests per second and saturate a 100 Mbps network link on a single 500 MHz Intel Celeron processor (that 100 Mbps link would cost about $50,000 annually as of February 2005, by the way). Why then would anyone ever need to deploy multiple CPUs to support HTTP service of basic HTML pages with embedded images?

The main reason that people run out of capacity on a single front-end Web server is that HTTP server programs are usually packaged with software to support computationally more expensive layers. For example, the Oracle RDBMS server, capable of supporting the persistence layer and the abstraction layer, also includes the necessary software for interpreting Java Server Pages and performing HTTP service. If you were running a popular service directly from Oracle you'd probably need more than one CPU. More common examples are Web servers such as IIS and AOLserver that are capable of handling the presentation and HTTP service layers from the same operating system process. If your scripts involve a lot of template parsing, it is easy to overload a single CPU with the demands of the Web server/script interpreter.

If no state is being stored in the HTTP Service layer it is cheapest to add CPUs in separate physical boxes. HTTP is stateless and user interaction is entirely mediated by the RDBMS. Therefore there is no reason for a CPU serving a page to User A to want to communicate with a CPU serving a page to User B.

Transport-Layer Encryption

Whenever a Web page is served, two application programs on separate computers have communicated with each other. As discussed in the "Basics" chapter, the client opens a Transmission Control Protocol (TCP) connection to the server, specifies the page desired, and receives the data back over that connection. TCP is one layer up from the basic unreliable Internet Protocol (IP). What TCP adds is reliability: if a packet of data is not acknowledged, it will be retransmitted. Neither TCP nor the IP of the 1990s, IPv4, provides any encryption of the data being transmitted. Thus anyone able to monitor the packets on the local-area network of the server or client or on the backbone routers may be able to learn, for example, the particular pages requested by a particular user. If you were running an online community about a degenerative disease, this might cause one of your users to lose his or her job.

There are two ways to protect your users' privacy from packet sniffers. The first is by using a newer version of Internet Protocol, IPv6, which provides native data security as well as authentication. In the glorious IPv6 world, we can be sure of the origin of a packet, whether it is from a legitimate user or a denial-of-service attacker. In the glorious IPv6 world, we can be sure that it will be impractical to sniff credit card numbers or other user-sensitive data from Web traffic. As of spring 2005, however, it isn't possible to sign up for a home IPv6 connection. Thus we are forced to fall back on the 1990s-style approach of adding a layer between HTTP and TCP. This was pioneered by Netscape Communications as Secure Sockets Layer (SSL) and has since been standardized by the IETF as TLS 1.0 (see http://www.ietf.org/html.charters/tls-charter.html).

However it is performed, encryption is processor-intensive. On the client side, that's not a big deal. The client machine probably has a 2 GHz processor that is 98 percent idle. However on the server end performing encryption can tie up a whole CPU per user for the duration of a request.

If you've run out of processing power the only thing to do is ... add processing power. The question is what kind and where. Adding general-purpose processors to a multi-CPU computer is, as mentioned earlier, very expensive. Adding additional single-CPU front-end servers to a two-tier server farm might not be a bad strategy, especially because, if you're already running a two-tier server farm, it requires no new thinking or system administration skills. It is possible, however, that special-purpose hardware will be more cost-effective or easier to administer. In particular it is possible to do encryption in the router for IPv6. SSL encryption for HTTP connections can be done with plug-in boards, an example of which is the Compaq AXL300 PCI card, available in 2005 for about $1,400 and claimed to handle 330 SSL connections per second. Finally, it is possible to interpose a hardware encryption machine between the Web server, which communicates via ordinary HTTP, and the client, which makes requests via HTTPS. This feature is, for example, an option on load-balancing routers from F5 Networks (www.f5.com).

Do you have enough CPUs?

After reading the preceding sections, you've gone out and gotten some computer hardware. How do you know whether or not it will be adequate to support the expected volume of requests? A good rule of thumb is that you can't handle more than 10 requests for dynamic pages per second per CPU. A "dynamic" page is one that involves the execution of any computer program on the server side other than simple HTTP service, i.e., anything other than sending a JPEG or HTML file. The 10-per-second figure assumes that the pages either are not encrypted or that the encryption is done by additional hardware in front of the HTTP server. For example, if you have a 4-CPU RDBMS server handling persistence and abstraction and 4 1-CPU front-end machines handling presentation and HTTP service you shouldn't expect to deliver more than 80 dynamic pages per second.

You might ask: on what CPU speed is this figure of 10 hits per second per CPU based? The answer is that the number is independent of CPU speed! In the mid-1990s, we had 200 MHz CPUs. Web scripts queried the database and merged the results with strings embedded in the script. Everything ran on one physical computer so there was no overhead from copying data around. Only the final credit card processing pages were encrypted. We struggled to handle 10 hits per second. In the late 1990s we had 400 MHz CPUs. Web scripts queried the database and merged the results with templates that had to be parsed. Data were networked from the RDBMS server to the Web server before heading to the user. We secured more pages in response to privacy concerns. We struggled to handle 10 hits per second. In 2000 we had 1 GHz CPUs. Web scripts queried the referer header to find out if the request came from a customer of one of our co-brand partners. The script then selected the appropriate template. We'd freighted down the server with Java Server Pages and Enterprise Java Beans. We struggled to handle 10 hits per second. In 2002 we had 2 GHz CPUs. The programmers had decided to follow the XML/XSLT fashion. We struggled to handle 10 hits per second....

It seems reasonable to expect that hardware engineers will continue to deliver substantial performance improvements and that fashions in software development and business complexity will continue to rob users of any enjoyment of those improvements. So stick to 10 requests per second per CPU until you've got your own application-specific benchmarks that demonstrate otherwise.

Load Balancing

As noted earlier in this chapter, an Internet service with 100 CPUs spread among 15 physical computers isn't going to be very reliable if all 100 CPUs must be working for the overall service to function. We need to develop a strategy for load balancing so that (1) user requests are divided more or less evenly among the available CPUs, (2) when a piece of hardware fails, it doesn't result in too many errors returned to users, and (3) we can reconfigure hardware and network without breaking users' bookmarks and links from other sites.

We will start by positing a two-tier server farm with a single multi-CPU machine running the RDBMS and multiple single-CPU front-end machines, each of which runs the Web server program, interprets page scripts, performs SSL encryption, and generally does any computation not being performed within the RDBMS.


Figure 11.1: A typical server configuration for a medium-to-high volume Internet application. A powerful multi-CPU server supports the relational database management system. Multiple small 1-CPU machines run the HTTP server program.

Load Balancing in the Persistence Layer

Our persistence layer is the multi-CPU computer running the RDBMS. The RDBMS itself is typically a multi-process or multi-threaded application. For each database client, the RDBMS spawns a separate process or thread. In this case, each front-end machine presents itself to the RDBMS as one or more database clients. If we assume that the load of user requests is spread among the front-end machines, the load of database work will be spread among the multiple CPUs of the RDBMS server by the operating system's process or thread scheduler.

Load Balancing among the Front-End Machines

Circa 1995 a popular strategy for high-volume Web sites was round-robin DNS. Each front-end machine was assigned a unique publicly routable IP address. The Domain Name System (DNS) server for the Web site was programmed to give different answers when asked for a translation of the Web server's hostname. For example, www.cnn.com was using round-robin DNS. They had a central NFS file server containing the content of the site and a rack of small front-end machines, each of which was a Web server and an NFS client. This architecture enabled CNN to update their site consistently by touching only one machine, i.e., the central NFS server.

How was the CNN system experienced by users? When a student at MIT requested http://www.cnn.com/TECH/, his or her desktop machine would ask the local name server for a translation of the hostname www.cnn.com into a 32-bit IP address. (Remember that all Internet communication is machine-to-machine and requires numeric IP addresses; alphanumeric hostnames such as "www.amazon.com" or "web.mit.edu" are used only for user interface.) The MIT name server would work its way down from the root name servers to learn the IP addresses of the name servers for the cnn.com domain. The MIT name server would then contact CNN's name servers and learn that "www.cnn.com" was available at the IP address 207.25.71.5. Subsequent users within the same subnetwork at MIT would, for a period of time designated by CNN, get the same answer of 207.25.71.5 without the MIT name server going back to the CNN name servers.

Where is the load balancing in this system? Suppose that a Biology major at Harvard University requested http://www.cnn.com/HEALTH/. Harvard's name server would also contact CNN's name servers to learn the translation of "www.cnn.com". This time, however, the CNN server would provide a different answer: 207.25.71.20, leading that user, and subsequent users within Harvard's network, to a different front-end server than the machine providing pages to users at MIT.

Round-robin DNS is not a very popular load balancing method today. For one thing, it is not very balanced. Suppose that the CNN name server tells America Online's name server that www.cnn.com is reachable at 207.25.71.29. AOL is perfectly free to provide that translation to all of its more than 20 million customers. Another problem with round-robin DNS is the impact on users when a front-end machine dies. If the box at 207.25.71.29 were to fail, none of AOL's customers would be able to reach www.cnn.com until the expiration time on the translation had elapsed—the site would be up and running and providing pages to hundreds of thousands of users worldwide, but not to those users who'd received an unlucky DNS translation to the dead machine. For a typical domain, this period of time might be anywhere from 6 hours to 1 week. CNN, aware of this problem, could shorten the expiration and "minimum time-to-live" on cnn.com but if these were cut down to, say, 30 seconds, the load on CNN's name servers might start approaching the intensity of the load on its Web servers. Nearly every user page request would be preceded by a request for a DNS translation. (In fact, CNN set their minimum time-to-live to 15 minutes.)
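In configuration terms, round-robin DNS is nothing more exotic than publishing several address records for the same hostname with a short time-to-live. A BIND-style zone file fragment might look like the following; the addresses are the ones mentioned above and the 900-second (15-minute) TTL matches CNN's reported setting, but the fragment is purely illustrative:

; illustrative fragment only, not CNN's actual zone file
www.cnn.com.	900	IN	A	207.25.71.5
www.cnn.com.	900	IN	A	207.25.71.20
www.cnn.com.	900	IN	A	207.25.71.29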

A final problem with round-robin DNS is that it does not provide abstraction. Suppose that CNN, whose primary servers were all Unix machines, wished to run some discussion forum software that was only available for Windows. The IP addresses of all of its servers are publicly exposed. The only way to direct users to a different machine for a particular part of the service would be to link them to a different hostname, which could then be translated into a distinct IP address. For example, CNN would link users to "http://forums.cnn.com". Users who enjoyed these forums would bookmark the URL, and other sites on the Internet would insert hyperlinks to this URL. After a year, suppose that the Windows servers were dying and the people who knew how to maintain them had moved on to other jobs. Meanwhile, the discussion forum software had become available for Unix as well. CNN would like to pull the discussion service back onto its main server farm, at a URL of http://www.cnn.com/discuss/. Why should users be aware of this reshuffling of hardware?


Figure 11.2: To preserve the freedom of rearranging components within the server farm, typically users on the public Internet only talk to a load balancing router, which is the "public face" of the service and whose IP address is what www.popularservice.com translates to.

The modern approach to load balancing is the load balancing router. This machine, typically built out of standard PC hardware running a free Unix operating system and a thin layer of custom software, is the only machine that is visible from the public Internet. All of the server hardware is behind the load balancer and has IP addresses that aren't routable from the rest of the Internet. If a user requests www.photo.net, for example, this is translated to 216.127.244.133, which is the IP address of photo.net's load balancer. The load balancer accepts the TCP connection on port 80 and waits for the Web client to provide a request line, e.g., "GET / HTTP/1.0". Only after that request has been received does the load balancer attempt to contact a Web server on the private network behind it.

Notice first that this sort of router provides some inherent security. The Web servers and RDBMS server cannot be directly contacted by crackers on the public Internet. The only ways in are via a successful attack on the load balancer, an attack on the Web server program (Microsoft Internet Information Server suffered from many buffer overrun vulnerabilities), or an attack on publisher-authored page scripts. The router also provides some protection against denial-of-service attacks. If a Web server is configured to spawn a maximum of 100 simultaneous threads, a malicious user can effectively shut down the site simply by opening 100 TCP connections to the server and then never sending a request line. The load balancers are smart about reaping such idle connections and in any case have very long queues.

The load balancer can execute arbitrarily complex algorithms in deciding how to route a user request. It can forward the request to a set of front-end servers in a round-robin fashion, taking a server out of the rotation if it fails to respond. The load balancer can periodically pull load and health information from the front-end servers and send each incoming request to the least busy server. The load balancer can inspect the URI requested and route to a particular server, for example, sending any request that starts with "/discuss/" to the Windows machine that is running the discussion forum software. The load balancer can keep a table of where previous requests were routed and try to route successive requests from a particular user to the same front-end machine (useful in cases where state is built up in a layer other than the RDBMS).

Whatever algorithm the load balancer is using, a hardware failure in one of the front-end machines will generally result in the failure of only a handful of user requests, i.e., those in-process on the machine that actually fails.

How are load balancers actually built? It seems that we need a computer program that waits for a Web request, takes some action, then returns a result to the user. Isn't this what Web server programs do? So why not add some code to a standard Web server program, run the combination on its own computer, and call that our load balancer? That's precisely the approach taken by the Zeus Load Balancer (http://www.zeus.com/products/zlb/) and mod_backhand (http://www.backhand.org/mod_backhand/), a load balancer module for the Apache Web server. An alternative is exemplified by F5 Networks, a company that sells out-of-the-box load balancers built on PC hardware, the NetBSD Unix operating system, and unspecified magic software.

Failover

Remember our strategic goals: (1) user requests are divided more or less evenly among the available CPUs; (2) when a piece of hardware fails it doesn't result in too many errors returned to users; (3) we can reconfigure hardware and network without breaking users' bookmarks and links from other sites.

It seems as though the load-balancing router out front and load-balancing operating system on the RDBMS server in back have allowed us to achieve goals 1 and 3. And if the hardware failure occurs in a front-end single-CPU machine, we've achieved goal 2 as well. But what if the multi-CPU RDBMS server fails? Or what if the load balancer itself fails?

Failover from a broken load balancer to a working one is essentially a network configuration challenge, beyond the scope of this textbook. Basically what is required are two identical load balancers and cooperation with the next routing link in the chain that connects your server farm to the public Internet. Those upstream routers must know how to route requests for the same IP address to one or the other load balancer depending upon which is up and running. What keeps this from becoming an endless spiral of load balancing is that the upstream routers aren't actually looking into the TCP packets to find the GET request. They're doing the much simpler job of IP routing.

Ensuring failover from a broken RDBMS server is a more difficult challenge and one where a large variety of ideas has been tried and found wanting. The first idea is to make sure that the RDBMS server never fails. The machine will have three power supplies, only two of which are required. Each disk drive will be mirrored. If a CPU board fails, the operating system will gracefully fall back to running on the remaining CPUs. There will be several network cards. There will be two paths to each disk drive. Considering the number of moving parts inside, the big complex servers are remarkably reliable, but they aren't 100 percent reliable.

Given that a single big server isn't reliable enough, we can buy a whole bunch of them and plug them all into the same disk subsystem, then run something like Oracle Parallel Server. Database clients connect to whichever physical server machine is available. If they can't get a response from a particular server, the client retries after a few seconds to another physical server. Thus an RDBMS server machine that fails causes the return of errors to any in-process user requests being handled by that machine and perhaps a few seconds of interrupted or slow service for users who've been directed to the clients of that down machine, but it causes no longer term site unavailability.

As discussed in the "Persistence Layer" section of this chapter, this approach entails a lot of wasted CPU time and bandwidth as the physical machines keep each other apprised of database updates. A compromise approach introduced by Oracle in 2000 was to configure a two-node parallel server. The first machine would process online transactions. The second machine would be allowed to lag as much as, say, ten minutes behind the first in terms of updates. If you wanted a CPU-intensive report querying last month's user activity, you'd talk to the backup machine. If Machine #1 failed, however, Machine #2 would notice almost immediately and start rolling its own state forward from the transaction log on the hard disk. Once Machine #2 was up to date with the last committed transaction, it would begin offering service as the primary database server. Oracle proudly stated that, for customers willing to spend twice as much for RDBMS server hardware, the two-node failover configuration was "only a little bit slower" than a single machine.

Hardware Scaling Exercises

Exercise 1: Web Server-based Load Balancer

How can a product like the Zeus Load Balancer work? We were worried about our Web server program becoming overwhelmed, so we added nine extra machines running nine extra copies of the program. Can it then be a good idea to funnel all of our users back through a Web server program running on a single machine, which is more or less how we had things set up in the first place?

Exercise 2: New York Times

Consider the basic New York Times Web site. Ignore any bag-on-the-side community features such as chat or discussion forums. Concentrate on the problem of delivering the core articles and advertising. Every user will see the same articles but with potentially different advertisements. Design a server hardware and software infrastructure that will (1) let the New York Times staff update the site using Web forms with the user experience lagging those updates by no more than one minute, and (2) result in minimum cost of computer hardware and system administration.

Be explicit about the number of computers employed, the number of CPUs within each computer, and the connections among the computers.

Your answer to this exercise should be no longer than half a page of text.

Exercise 3: eBay

Visit www.ebay.com and familiarize yourself with their basic services of auction bidding and user ratings. Assume that you need to support 100 million registered users, 800 million page views per day, 10 million bids per day, 10 million searches per day, and 0.5 million new user ratings per day. Design a server hardware and software infrastructure that will represent a reasonable compromise among reliability (including graceful degradation), initial cost, and cost of administration.

Be explicit about the number of computers employed, the number of CPUs within each computer, and the connections among the computers. If you're curious about the real numbers, remember that eBay is a public corporation and publishes annual reports, which are available at http://investor.ebay.com/.

Your answer to this exercise should be no longer than one page.

Exercise 4: eBay Proxy Bidding

eBay offers a service called "proxy bidding" or "automatic bidding" in which you specify a maximum amount that you're willing to pay and the server itself will submit bids for you in increments that depend on the current high bid. How would you implement proxy bidding on the infrastructure that you designed for the preceding exercises? Rough out any SQL statements or triggers that you would need. Be explicit about where the code for proxy bidding would execute: on which server? in which execution environment?

Exercise 5: Uber-eBay

Suppose that eBay went up to one billion bids per day. How would that change your design, if at all?

Exercise 6: Hotmail

Suppose that Hotmail were an RDBMS-backed Internet service with 200 million active users. What would be the minimum cost hardware configuration that still provided reasonable reliability and maintainability? What is the fundamental difference between Hotmail and eBay?

Note: http://philip.greenspun.com/ancient-history/webmail/ describes an Oracle-backed Web mail system built by Jin S. Choi.

Exercise 7: Scorecard

Provide a one-paragraph design for the server infrastructure behind www.scorecard.org, justifying your decisions.

Moving on to the Hard Stuff

We can build a big server. We can support a lot of users. As the community grows in size, though, can those users continue to interact in the purposeful manner necessary for our service to be an online learning community? How can we prevent the discussion and the learning from devolving into chaos and chat?

Perhaps we can take some ideas from the traditional face-to-face world. Let's look at some of the things that make for good offline communities and how we can translate them to the online world.

Translating the Elements of Good Communities from the Offline to the Online World

A face-to-face community is almost always one in which people are identified, authenticated, and accountable. Suppose that you're a 50-year-old, 6 foot tall, 250 pound guy, known to everyone in town as "Fred Jones". Can you walk up to the twelve-year-old daughter of one of your neighbors and introduce yourself as a thirteen-year-old girl? Probably not very successfully. Suppose that you fly a Nazi flag out in front of your house. Can you express an opinion at the next town meeting without people remembering that you were "the Nazi flag guy"? Seems unlikely.

How do we translate the features of identifiability, authentication, and accountability into the online world? In private communities, such as corporate knowledge management systems or university coordination services, it is easy. We don't let anyone use the system unless they are an employee or a registered student and, in the online environment, we identify users by their full names. Such heavyweight authentication is at odds with the practicalities of running a public online community. For example, would it be practical to schedule face-to-face meetings with each potential registrant of photo.net, where the new user would show an ID? On the other hand, as discussed in the "User Registration and Management" chapter, we can take a stab at authentication in a public online community by requiring email verification and by requiring alternative authentication for people with Hotmail-style email accounts. In both public and private communities, we can enhance accountability simply by making each user's name a hyperlink to the complete record of their contributions to the site.

In the face-to-face world, a speaker gets a chance to gauge audience reaction as he or she is speaking. Suppose that you're a politician speaking to a women's organization, the WAGC ("Women Against Gun Control", www.wagc.com). Your schedule is so heavy that you can't recall what your aides told you about this organization, so you plan to trot out your standard speech about how you've always worked to ensure higher taxes, more government intervention in individuals' lives, and, above all, to make it more difficult for Americans to own guns. Long before you took credit for your contribution to the assault rifle ban, you'd probably have noticed that the audience did not seem very receptive to your brand of paternalism and modified your planned speech. Typical computer-mediated communication systems make it easy to broadcast your ideas to everyone else in the service, but without an opportunity to get useful feedback on how your message is being received. You can send the long email to the big mailing list. You'll get your first inkling as to whether people liked it or not after the first 500 have it in their inbox. You can post your reply to an emotionally charged issue in a discussion forum, but you won't get any help from other community members, at least not through the same software, before you finalize that reply.

Perhaps you can craft your software so that a user can expose a response to a test audience of 1 percent of the ultimate audience, get a reaction back from those sample recipients, and refine the message before authorizing it for delivery to the whole group.

When groups too large for effective discussion assemble in the offline world, there is often a provision for breaking out into smaller groups and then reassembling. For example, academic conferences usually are about half "one to very many" lectures and half breaks and meals during which numerous "handful to handful" discussions are held. Suppose that an archived discussion forum is used by 10,000 people. You're pretty sure that you know the answer to a question, but not sure that your idea is sufficiently polished for exposure to 10,000 people and permanent enshrinement in the database. Wouldn't it be nice to shout out the proposed response to those users who happen to be logged in at this moment and try the idea out with them first? The electronic equivalent of shouting to a roomful of people is typing into a chat room. We experimented at photo.net by comparing an HTML- and JavaScript-based chatroom run on our own server to a simple hyperlink to a designated chatroom on the AOL Instant Messenger infrastructure:

<a href="aim:gochat?RoomName=photonet">photo.net chatroom</a>
This causes a properly configured browser to launch the AIM client (try it). Although the AIM-based chat offered superior interactivity, it was not as successful due to (1) some users not having the AIM software on their computers, (2) some users being behind firewalls that prevented them from using AIM, but mostly because (3) photo.net users knew each other by real names and could not recognize their friends by their AIM screen names. It seems that providing a breakout and reassemble chat room is useful, but that it needs to be tightly integrated with the rest of the online community and that, in particular, user identity must be preserved across all services within a community.

People like computers and the Internet because they are fast. If you want an answer to a question, you turn to the search engine that responds quickest and with the most relevant results. In the offline world, people generally desire speed. A Big Mac delivered in thirty seconds is better than a Big Mac delivered in ten minutes. However, when emotions and stakes are high, we as a society often choose delay. We could elect a president in two weeks, but instead we choose presidential campaigns that last nearly two years. We could have tried and sentenced Thomas Junta immediately after July 5, 2000, when he beat Michael Costin, father of another ten-year-old hockey player, to death in a Boston-area ice rink. After all, the crime was witnessed by dozens of people and there was little doubt as to Junta's guilt. But it was not until January 2002 that Junta was brought to trial, convicted, and sentenced to six to ten years in prison. Instant messaging, chat rooms, and Web-based discussion forums don't always lend themselves to thoughtful discourse, particularly when the topic is emotional.
"As an online discussion grows longer, the probability of a comparison involving Nazis or Hitler approaches 1" — (Mike) Godwin's Law
For some communities it may be appropriate to consider adding an artificial delay in posting. Suppose that you respond to Joe Ranter's message by comparing him to Adolf Hitler. Twenty-four hours later you get an email message from the server: "Does the message below truly represent your best thinking? Choose an option by clicking on one of the URLs below: confirm | edit | discard." You've had some time to cool down and think. Is Joe Ranter a talented oil painter? Was Joe Ranter ever designated TIME Magazine Man of the Year (Hitler made it in 1938)? Upon reflection, the comparison to Hitler was inapt and you choose to edit the message before it becomes public.
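A minimal sketch of the underlying cooling-off queue, with invented table and column names:

create table pending_messages (
	message_id	integer primary key,
	author_id	integer,
	forum_id	integer,
	body		clob,
	submitted_at	date default sysdate,
	status		varchar(12) default 'pending'	-- 'pending', 'confirmed', or 'discarded'
);

-- run periodically: day-old pending messages whose authors should now
-- receive the "confirm | edit | discard" email
select message_id, author_id
from pending_messages
where status = 'pending'
and submitted_at < sysdate - 1;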

How difficult is it in the offline world to find people interested in the issues that are important to us? If you believe that charity begins at home and all politics is local, finding people who share your concerns is as simple as walking around your neighborhood. One way to translate that to the online world would be to build separate communities for each geographical region. If you wanted to find out about the environment in your state, you'd go to massachusetts.environmentaldefense.org. But what if your interests were a bit broader? If you were interested in the environment throughout New England, should you have to visit five or six separate servers in order to find the hot topics? Or suppose that your interests were narrower. Should you have to wade through a lot of threads regarding the heavily populated eastern portion of Massachusetts if you live right up against the New York State border and are worried about a particular chemical plant?

The geospatialized discussion forum, developed by Bill Pease and Jin S. Choi for the scorecard.org service, is an interesting solution to this problem. The same forum can be entered at whatever geographic scope suits the reader, from the national level down to an individual ZIP code.

A user could bookmark the page for any of these scopes and enter the site periodically to participate in as wide a discussion as interest dictated.

Another way to look at geospatialization is of the users themselves. Consider, for example, an online learning community centered around the breeding of African Cichlids. Most of the articles and discussion would be of interest to all users worldwide. However it would be nice to help members who were geographically proximate find each other. Geographical clumps of members can share information about the best aquarium shops and can arrange to get together on weekends to swap young fish. To facilitate geospatialization of users, your software should solicit country of residence and postal code from each new user during registration. It is almost always possible to find a database of latitude and longitude centroids for each postal code in a country. In the United States, for example, you should look for the "Gazetteer files" on www.census.gov, in particular those for ZIP Code Tabulation Areas (ZCTAs).
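As a sketch, and assuming the registration form stores postal_code in the users table (the other names below are invented), "members near me" can then be a simple self-join against the centroid table:

create table postal_code_centroids (
	postal_code	varchar(10) primary key,
	latitude	number,
	longitude	number
);

-- members within roughly one degree of latitude and longitude of the
-- current user; crude, but good enough for finding aquarium-shop neighbors
-- (a production query would compute great-circle distance)
select u.user_id, u.postal_code
from users u, postal_code_centroids z, postal_code_centroids me
where u.postal_code = z.postal_code
and me.postal_code = :current_user_postal_code
and abs(z.latitude - me.latitude) < 1
and abs(z.longitude - me.longitude) < 1
and u.user_id <> :current_user_id;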

Despite applying the preceding tricks, it is always possible for growth in a community to outstrip an old user's ability to cope with all the new users and their contributions. Every Internet collaboration system going back to the early 1970s has drawn complaints of the form "I used to like this [mailing list|newsgroup|MUD|Web community] when it was smaller, but now it is big and full of flaming losers; the interesting thoughtful material is buried under a heavy layer of dross." The earliest technological fix for this complaint was the bozo filter. If you didn't like what someone had to say, you added them to your bozo list and the software would hide their contributions from your view of the community.

In mid-2001 we added an "inverse bozo filter" facility to the photo.net community. If you find a work of great creativity in the photo sharing system or a thoughtful response in a discussion forum you can mark the author as "interesting". On subsequent logins you will find a "Your Friends" section in your personal workspace on the site. The people that you've marked as interesting are listed in order of their most recent contribution to the site. Six months after the feature was added 5,000 users had established 25,000 "I think that other user is interesting" relationships.
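A sketch of the data model behind such a feature; the names are invented and the actual photo.net implementation surely differs in detail:

create table interesting_user_marks (
	marker_id	integer,
	marked_id	integer,
	marked_at	date default sysdate,
	primary key (marker_id, marked_id)
);

-- "Your Friends": people the current user has marked, ordered by their
-- most recent contribution (assumes a contributions view with author_id
-- and posted_at columns)
select m.marked_id, max(c.posted_at) as latest_contribution
from interesting_user_marks m, contributions c
where m.marker_id = :current_user_id
and c.author_id = m.marked_id
group by m.marked_id
order by latest_contribution desc;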

Human Scaling Exercises

Exercise 8: Newspaper's Online Community

Pick a discussion forum server operated by an online newspaper with a national or international audience, e.g., www.nytimes.com, etc. Select a discussion area that is of interest to you. How effectively does this function as an online learning community? What are the features that are helpful? What features would you add if this were your service?

What is it about a newspaper that makes it particularly tough for that organization to act as the publisher of an online community?

Exercise 9: amazon.com

List the features of amazon.com that would seem to lead to more graceful scaling of their online community. Explain how each feature helps.

Exercise 10: Scaling Plan for Your Community

Create a document at the abstract URL /doc/planning/YYYYMMDD-scaling on your server and start writing a scaling plan for your community. This plan should list those features that you expect to modify or add as the site grows. The features should be grouped by phases.

Add a link to your new plan from /doc/ or a planning subindex page.

Exercise 11: Implement Phase 1

Implement Phase 1 of your scaling plan. This could be as simple as ensuring that every time a user's name or email address appears on your service, the text is an anchor to a page showing all of that person's contributions to the community (accountability). Or it could be as complex as complete geospatialization. It really depends on how large a community your client expects to serve in the coming months.

Spam-Proofing Public Online Communities

A public online community is one in which registration is accepted from any IP address on the public Internet and one that serves content back to the public Internet. In a private online community, for example, a corporate knowledge-sharing system that is behind a company firewall and that only accepts members who are employees, you don't have to worry too much about spam, where spam in this case is defined as "Any content that is off-topic, violates the terms of use, is posted multiple times in multiple places, or is otherwise unhelpful to other community members trying to learn."

Let's look at some concrete scenarios. Let's assume that we have a public community in which user-contributed content goes live immediately, without having to be approved by a moderator. The problem of spam is greatly reduced in any community where content must be pre-approved before appearing to other members, but such communities require a larger staff of moderators if discussion is to flow freely.

Scenario 1: Sarah Moneylover has registered as User #7812 and posted 50 article comments and discussion forum messages with links to her "natural Viagra" sales site. Sarah clicked around by hand and pasted in a text string from a word processor open on her desktop, investing about 20 minutes in her spamming activity. The appropriate tool for dealing with Sarah is a set of efficient administration pages. Here's how the clickstream would proceed:

  1. site administrator visits an "all content posted within the last 30 days" link, resulting in page after page of stuff
  2. site administrator clicks a control up at the top to limit the display to only content from newly registered users, who are traditionally the most problematic, and that results in a manageable 5-screen listing
  3. site administrator reviews the content items, each presented with a summary headline at the top and the first 200 words of the body with a "more" hyperlink to view the complete item and a hyperlinked author's name at the end
  4. site administrator clicks on the name "Sarah Moneylover" underneath a posting that is clearly off-topic and commercial spam; this brings up a page summarizing Sarah's registration on the server and all of her contributed content
  5. site administrator clicks the "nuke this user" link on Sarah Moneylover's summary page and is presented with a "Do you really want to delete Sarah Moneylover, User #7812, and all of her contributed content?" confirmation page
  6. site administrator confirms the nuking and a big SQL transaction is executed (sketched below) in which all rows related to Sarah Moneylover are deleted from the RDBMS. Note that this is different from a moderator marking content as "unapproved" and having that content remain in the database but not displayed on pages. The assumption is that commercial spam has no value and that Sarah is not going to be converted into a productive member of the community. In fact the row in the users table associated with User #7812 ought to be deleted as well.
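The nuke transaction itself is nothing fancy. The content table names below are invented (a real system would have one delete per content table), but the shape is representative:

-- illustrative only; the actual tables depend on the toolkit in use
begin
	delete from forum_messages where author_id = 7812;
	delete from article_comments where author_id = 7812;
	delete from users where user_id = 7812;
	commit;
end;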
The site administrator, assuming he or she was already reviewing all new content on the site, spent less than 30 seconds removing content that took the spammer 20 minutes to post, a ratio of 40:1. As long as it is much easier to remove spam than to post it, the community is relatively spam-proof. Note that Sarah would not have been able to deface the community at all had there been a policy of pre-approving content contributed by newly registered users.

Scenario 2: Ira Angrywicz, User #3571, has developed a grudge against Herschel Mellowman, User #4189. In every discussion forum thread where Herschel has posted, Ira has posted a personal attack on Herschel right underneath. The procedure followed to deal with Sarah Moneylover is not appropriate here because Ira, prior to getting angry with Herschel, posted 600 useful discussion forum replies that we would be loath to delete. The right tool to deal with this problem is an administration page showing all content contributed by User #3571, sorted by date. Underneath each content item's headline are the first 200 words of the body so that the administrator can evaluate, without clicking through, whether or not the message is anti-Herschel spam. Adjacent to each content item is a checkbox, and at the bottom of all the content is a button marked "Disapprove all checked items." For every angry reply that Ira had to type, the administrator had to click the mouse only once on a checkbox, perhaps a 100:1 ratio between spammer effort and admin effort.
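The "Disapprove all checked items" button can be backed by a single statement; content_items and approved_p are invented names in the _p style used elsewhere in this chapter, and the IDs would come from the admin form's checkboxes:

update content_items
set approved_p = 'f'
where content_id in (61412, 61488, 61502)	-- made-up IDs gathered from the checkboxes
and author_id = 3571;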

Scenario 3: A professional programmer hired to boost a company's search engine rank writes scripts to insert content all around the Internet with hyperlinks to his client's Web site. The programs are sophisticated enough to work through the new user registration pages in your community, registering 100 new accounts, each with a unique name and email address. The programmer has also set up robots to respond to the email address verification messages sent by your software. Now you've got 100 new (fake) users, each of whom has posted two messages. If the programmer has been a bit sloppy, it is conceivable that all of the user registrations and content were posted from the same IP address, in which case you could defend against this kind of attack by adding an originating_ip_address column to your content management tables and building an admin page letting you view, and potentially delete, all content from a particular IP address. Discovering this problem after the fact, you might deal with it by writing an admin page that would summarize the new user registrations and contributions, with a checkbox bulk-nuke capability to remove those users and all of their content. After cleaning out the spam you'd probably add a "verify that you're a human" step in the user registration process in which, for example, a hard-to-read word is obscured inside a patterned bitmap image and the would-be registrant has to recognize the word amidst the noise and type it in. This would prevent a robot from establishing 100 fake accounts.
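For example, with invented table and column names, the schema change and the admin page's query might look like this:

alter table forum_messages add (originating_ip varchar(45));	-- room for IPv6 text form

-- admin page: which addresses contributed the most content in the last day?
select originating_ip, count(*) as n_items
from forum_messages
where posted_at > sysdate - 1
group by originating_ip
order by n_items desc;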

No matter how carefully and intelligently programmed a public online community is to begin with, it will eventually fall prey to a new clever form of spam. Planning otherwise is like being an American circa 1950 when antibiotics, vaccines, and DDT were eliminating one dreaded disease after another. The optimistic new suburbanites never imagined that viruses would turn out to be smarter than human beings. Budget at least a few programmer days every six months to write new admin pages or other protections against new ideas in the world of spam.


Time and Motion

The hardware scaling exercises should take one half to one hour each. Students not familiar with eBay should plan to spend an extra half hour familiarizing themselves with it. The human scaling exercises might take one to two hours. The time required for Phase I will depend on its particulars.

eve@eveandersson.com, philg@mit.edu, aegrumet@mit.edu