After 40 years, among the most promising and popular uses of the Internet is education, especially informal unstructured education. People get answers with Google, learn from reading Wikipedia, and figure out how to accomplish tasks by watching videos on YouTube. The most dramatic effect of the Internet has been to expand the number of potential teachers. No longer are we restricted to learning from full-time teachers and full-length book authors. Someone who is an expert on how to shuck an oyster can make a two-minute video on the subject and help many a struggling chef at home.
One of the earliest and most effective means of connecting learners and teachers was the online community. Embodied first in mailing lists, then in discussion forums, and finally in comprehensive Web sites, the online community provides a way for people interested in sharing their expertise to answer questions. One of my favorite examples is this question about using red filters with black and white film (from photo.net, a community that I developed in 1993).
For people to take maximum advantage of online communities, however, they need to participate in quite a few. Consider Joe Average, a suburban homeowner and parent, whose hobbies include taking pictures and videos of his kids and flying small airplanes. For Joe to take best advantage of opportunities for informal learning and teaching, he would have to belong to the following online communities:
There are several problems with news readers or RSS aggregators. First, they don't have very good tools for filtering and highlighting, so a person can't reasonably subscribe to very many sources (see "Solar Magnitude Forum" for an idea on how to give readers access to the most interesting parts of a very large discussion). Second, answering a question requires the potential teacher to (1) visit the underlying Web site, (2) remember the username/password for that Web site, (3) type in the username/password, (4) possibly navigate back to the question, and (5) post a response using an infrequently used and unfamiliar interface.
Alternatively, instead of getting all of the world's biggest photo nerds on one server, get all of the world's Internet users on one server and then let them form whatever interest groups they like. This was the AOL strategy in the 1980s and 1990s and is the Facebook strategy now. The problem with this approach is that the enormous umbrella community never captures a large enough percentage of the experts needed and the software lacks whatever specialized tools are needed for a particular topic (e.g., a gardening community may need a collaboratively maintained taxonomy off which to hang photos, discussion, articles, etc.).
Let's accept that the Internet has too many entrenched specialized online communities for Facebook groups to take over. A popular standalone community has active motivated participants who know each other by name and reputation. The archives may stretch back more than a decade. Advertising revenue may be enough to give the publisher a strong motivation to keep improving the service.
This starts with some elements of lightweight aggregation. The canonical repositories for discussion remain on their existing sites. Maybe it is a standalone community such as gardenweb.com. Maybe it is a Google or Yahoo Group. Maybe it is a group within Facebook. With RSS feeds or screen-scraping scripts and using the participant's username/password credentials, the aggregator pulls discussion forum postings from the underlying communities, highlighting those that are likely to be most interesting and combining discussions from multiple sources on one page.
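The combining step can be sketched in a few lines of Python. This is only an illustration, not a design: the Posting class, the aggregate function, and the idea of ranking by a per-community interest level from 1 (most interested) to 5 (least) are assumptions made for the example, and fetching the postings via RSS or screen-scraping is left out entirely.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Posting:
    community: str       # e.g. "photo.net" (hypothetical field names)
    subject: str
    author: str
    posted_at: datetime


def aggregate(feeds, interest_levels, limit=20):
    """Merge postings from several communities into one page-sized list.

    feeds: dict mapping community name -> list of Posting
    interest_levels: dict mapping community name -> 1 (most) .. 5 (least)
    Communities the reader cares about most come first; within one
    interest level the newest postings come first.
    """
    merged = [p for postings in feeds.values() for p in postings]
    # Sort newest first, then stable-sort by interest level so that the
    # newest-first order is preserved inside each level.
    merged.sort(key=lambda p: p.posted_at, reverse=True)
    # Communities with no recorded level get 2 (an assumed default).
    merged.sort(key=lambda p: interest_levels.get(p.community, 2))
    return merged[:limit]
```

A real service would presumably rank on more than one number per community, but even this much puts discussions from several sites on a single page.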
What if the participant, after reading a posting, wants to post a response? He or she can type it into a form on the aggregated page and the service will use his or her username/password to post it in the appropriate place on the underlying community. Suppose the participant wants to see something about the reputation of a poster or other contextual information? He or she clicks through to the underlying community, already conveniently logged in.
Because we've already got all of the text of discussion forums from hundreds or thousands of online communities, it is easy to make a unified mobile phone interface to all of them. Even if the community software hasn't been touched since 1996 and no thought was given to viewing/posting on a 3-inch screen, the discussion will now be usable from a smartphone.
As long as we're integrating communities we can add the best features of the best community software to every community. For example, on photo.net in the late 1990s, as the community grew beyond 100,000 registered users, we added the ability for a reader to tag a forum contributor as an "interesting person". This enabled the system to show the reader a page of all new content from all members that he or she had previously tagged as thoughtful or skilled.
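Here is a rough Python sketch of the "interesting person" feature, applied across communities. It is not the photo.net implementation; the class name, method names, and the dictionary shape of a posting are all hypothetical.

```python
from collections import defaultdict


class InterestingPeople:
    """Remembers which contributors each reader has tagged as interesting."""

    def __init__(self):
        # reader -> set of contributor names (hypothetical in-memory store)
        self._tags = defaultdict(set)

    def tag(self, reader, contributor):
        self._tags[reader].add(contributor)

    def new_content_for(self, reader, postings):
        """Return only the postings written by people this reader tagged.

        postings: iterable of dicts with at least an "author" key.
        """
        tagged = self._tags.get(reader, set())
        return [p for p in postings if p["author"] in tagged]
```

In a real service the tags would live in the database and a contributor would have to be identified per community, since the same username on two sites need not be the same person.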
We can track all questions posted and provide email notifications of responses, as well as highlighting threads that were initiated by or participated in by our customer.
Given that the server is pulling information from thousands of online communities on a daily basis, it should be easy to support a directory of active communities where a question is likely to be answered. Try searching for "aquarium online community" or "aquarium forum". Then sort through the sites that Google returns and see how long it takes to determine which forums are active and have a good percentage of questions adequately answered. Marc Smith and his group at Microsoft Research were able to develop quite a few metrics of online community quality. These can be computed automatically and used to significantly assist a user in finding a helpful discussion forum.
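To make "computed automatically" concrete, here is a Python sketch of two very crude measures: the fraction of threads that drew at least one reply and the number of threads started recently. These are stand-ins invented for illustration, not the Microsoft Research metrics, and a reply count says nothing about whether a question was adequately answered. The function name and the thread fields are assumptions.

```python
from datetime import datetime, timedelta


def community_metrics(threads, now, active_window_days=30):
    """Summarize how lively a discussion forum looks.

    threads: list of dicts with "started_at" (datetime) and
             "reply_count" (int); both field names are hypothetical.
    now: the datetime to measure recency against.
    """
    total = len(threads)
    if total == 0:
        return {"threads": 0, "answered_fraction": 0.0, "recent_threads": 0}
    # A thread counts as "answered" if anyone replied at all.
    answered = sum(1 for t in threads if t["reply_count"] > 0)
    cutoff = now - timedelta(days=active_window_days)
    recent = sum(1 for t in threads if t["started_at"] >= cutoff)
    return {
        "threads": total,
        "answered_fraction": answered / total,
        "recent_threads": recent,
    }
```

A directory could sort candidate forums on numbers like these so that the reader does not have to click through each site to see whether anyone is home.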
Suppose that a publisher of one of the underlying communities does not see things this way and begins to block the IP address of a central aggregation server. Nothing stops all of this software from running inside a browser on a reader's desktop machine, in which case the page requests should not be distinguishable from ordinary browsing and therefore won't be blockable.
If the interface is a sufficient improvement over that of the underlying sites and the service catches on, the potential for profits is enormous because the business need not invest in content. Every new user adds a tiny bit of cost in terms of servers and programming but adds a lot of page views that can support advertising.
If we can come up with a set of features that are mostly valuable to companies trying to manage their brand image and reputation, it should be possible to charge $10,000 or more annually for them to use essentially the same service that consumers are using for free.
The most obvious candidates are Google Groups and Yahoo Groups. These services allow anyone to create a simple online community. With a single username/password, which may already be what one is using to read email, a participant can read and answer questions in multiple communities. Neither Google nor Yahoo seems interested in integration, however. Neither service allows a participant to view threads from multiple groups on a single page. To see what's going on inside 5 groups, the user has to click down into Group 1, look at a list of discussion topics, then click back up to the top level, click down into Group 2, look at a list of topics, etc. Nor does either service allow the importation of discussion threads via RSS from communities hosted elsewhere.
A single sign-on system such as Microsoft Passport or OpenID solves the problem of remembering a lot of different usernames and passwords, but it doesn't solve the problem that a participant in 20 different forums will have to click the mouse at least 40 times to see what's new and whether there is anything worth responding to.
Cesar Brea, a former management consultant at Bain and Monitor and now head of Force Five Partners, threw a few buckets of cold salt water on an earlier draft of this document.
The typical active user of online communities may visit 60 or 70 pages per day (each page being one discussion forum thread), which makes the potential revenue per user about $50 per year. If the aggregation service grows to the size of a single moderately popular online community, about 100,000 active users, that could be $5 million per year in revenue.
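The arithmetic behind those figures can be checked with a short Python sketch. The 65 pages per day is simply the midpoint of "60 or 70", and the revenue per 1000 page views is not a number from the text; it is only what the $50 per year figure implies. The function names are made up for the example.

```python
def implied_revenue_per_thousand_views(pages_per_day=65, revenue_per_user_per_year=50.0):
    """Advertising revenue per 1000 page views implied by the estimate above."""
    page_views_per_year = pages_per_day * 365  # 23,725 at 65 pages per day
    return revenue_per_user_per_year / page_views_per_year * 1000


def annual_revenue(active_users=100_000, revenue_per_user_per_year=50.0):
    """Total yearly revenue for a given number of active users."""
    return active_users * revenue_per_user_per_year
```

At 65 pages per day, $50 per user per year works out to a little over $2 per 1000 page views, and 100,000 active users at $50 each gives the $5 million figure.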
What about costs? Because the service need not crawl the Web or store any information from the underlying communities, the hardware and software infrastructure required will be modest. Given U.S.-based programming, system administration, and hosting, a reasonable long-term budget for technology is probably around $1 million per year. About half of that would be spent on maintenance and security. The other half would be spent on developing new features and interfacing to new custom-coded communities. At least a small offshore labor group would be required so that the cost of interfacing to a new community does not exceed the revenue derived from it.
For a credible out-of-the-gate start, the service should probably interface to at least the 1000 most popular online communities (a big service such as Google Groups, Facebook, or Twitter would count as 1). Let's assume 1000 programmer-days to build those interfaces (some will be tough, but some will be easy due to the use of similar software), at an offshore labor rate of $150 per day. That's $150,000 in startup costs to have the most popular communities set up and ready to go. Add to that another $50,000 to build the core service and we're talking about $200,000 in technical startup costs.
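The cost side can be sketched the same way. The defaults below are the figures from the text ($150 per programmer-day, one day per community on average, $50,000 for the core service, a $1 million yearly technology budget, $50 of revenue per active user); the function names are invented, and marketing costs are left out because they are not estimated here.

```python
def technical_startup_cost(communities=1000, days_per_interface=1,
                           day_rate=150, core_service=50_000):
    """One-time cost of building the community interfaces plus the core service."""
    return communities * days_per_interface * day_rate + core_service


def users_to_cover_technology_budget(annual_budget=1_000_000, revenue_per_user=50):
    """Active users needed for ad revenue to match the yearly technology budget."""
    # Ceiling division, so a fractional user rounds up.
    return -(-annual_budget // revenue_per_user)
```

With these assumptions the service would need about 20,000 active users, a fifth of the 100,000 used in the revenue estimate, before advertising covers the ongoing technology budget.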
As far as hosting goes, this one rather cries out for cloud computing. Customers will have more faith in a cloud computing vendor's ability to provide security than they will in the average ISP's. There is no advantage to having all of the customers on one big computer. This is a more or less personal service and there is no downside to each customer having a dynamically assigned personal server. The nice thing about cloud computing in this case is that the hardware/hosting costs will scale with customers and usage and also that the system could handle a big spike in usage if it became popular.
Marketing methods and costs are tougher to predict. Though I have some ideas of my own, I'm going to leave that as an exercise for the entrepreneur.
Exit Strategy? What if this were funded as a venture capital-backed business? How could the original investors get their cash back? This company would be a natural to sell to any of the big media companies that are good at wringing the last advertising dime out of a page view. Examples include Demand Media, Marchex, and NameMedia.
-- Multi-community data model, by email@example.com, October 2009
-- available under the GNU General Public License

-- create a sequence for user_id
create sequence user_id_sequence start with 1;

create table users (
	user_id			integer primary key,
	first_names		varchar(50),
	last_name		varchar(50) not null,
	email			varchar(100) not null unique,
	-- we encrypt passwords using operating system crypt function
	password		varchar(30) not null,
	registration_date	timestamp(0)
);

-- defined before communities so that the foreign key reference below resolves
create sequence standard_toolkit_id_sequence start with 1;

create table standard_toolkits (
	toolkit_id		integer primary key,
	toolkit_name		varchar(100) not null
	-- we'll need a lot more here!
);

-- multiple users might belong to photo.net or facebook, for example, so we represent
-- everything common about one of those sites in this table

-- create a sequence for community_id
create sequence community_id_sequence start with 1;

create table communities (
	community_id		integer primary key,
	community_name		varchar(4000) not null,
	community_url		varchar(200) not null,	-- the home page, e.g., http://photo.net
	-- here's where the real engineering happens; we need to store all of the code and patterns
	-- necessary to log into a particular community
	-- perhaps our life is easy and this uses a standard toolkit such as Wordpress
	-- note that column may be NULL if the community is custom-coded
	standard_toolkit_id	integer references standard_toolkits
);

-- this next table doesn't really need its own generated key, but it
-- probably makes life easier if using Web development tools that expect
-- to see a single-column key
create sequence uc_map_id_sequence start with 1;

create table user_community_map (
	uc_map_id		integer primary key,
	-- the next two columns, taken together, are the real key (index below enforces their key-ness)
	user_id			integer not null references users,
	community_id		integer not null references communities,
	username		varchar(200),
	password		varchar(200),
	cookie			varchar(4000),	-- this way we won't have to keep logging in
	-- interest level may vary with season, e.g., user will be very interested in skiing community in November, but reduce level in March
	-- from 1 (most interested) to 5 (least), how much stuff does our user want to see from this community?
	interest_level		integer default 2
);

create unique index uc_map_key_idx on user_community_map (user_id, community_id);

-- here we store things that a user wants to pick out from a community even if his interest level is low at the time
-- should we consider a skinny table architecture instead?  With a pattern column and a search_type column?
create sequence user_filter_id_sequence start with 1;

create table user_filters (
	filter_id		integer primary key,
	uc_map_id		integer not null references user_community_map,
	-- one of the following should not be NULL; the rest will be
	-- (the column types here are placeholders)
	-- look for a regexp in the subject line of a discussion forum posting
	subject_regexp		varchar(4000),
	-- look for a regexp anywhere in posting (subject, body, author)
	anywhere_regexp		varchar(4000),
	-- look for a particular author
	author			varchar(200)
);

-- the unique index above makes it fast to ask "to which communities does user 678 belong"; now an index to make it fast to ask
-- "which of our users belong to community 342?"
create index user_community_map_by_cu on user_community_map ( community_id, user_id );
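As a rough illustration of how a row in user_filters might be applied to incoming postings, here is a Python sketch. The dictionary shape of a filter row and of a posting, the function names, and the choice of case-insensitive matching are all assumptions; the regular expression syntax is Python's, which may differ from whatever the database offers.

```python
import re


def filter_matches(user_filter, posting):
    """Apply one user_filters row to one posting.

    user_filter: dict with keys "subject_regexp", "anywhere_regexp", "author";
                 exactly one is expected to be non-None.
    posting: dict with "subject", "body", and "author" strings.
    """
    if user_filter.get("subject_regexp"):
        pattern = user_filter["subject_regexp"]
        return re.search(pattern, posting["subject"], re.IGNORECASE) is not None
    if user_filter.get("anywhere_regexp"):
        pattern = user_filter["anywhere_regexp"]
        haystack = "\n".join([posting["subject"], posting["body"], posting["author"]])
        return re.search(pattern, haystack, re.IGNORECASE) is not None
    if user_filter.get("author"):
        return posting["author"] == user_filter["author"]
    return False  # an empty filter picks out nothing


def picked_postings(user_filters, postings):
    """Postings that at least one of the user's filters picks out."""
    return [p for p in postings if any(filter_matches(f, p) for f in user_filters)]
```

Postings picked out this way would be shown even when the interest_level for their community is set low.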
Philip, I think it is a very good idea. I wonder if statistical linguistic analysis would be a practical and productive adjunct to manual, iterative regexp coding.
Bayesian spam filters like DSPAM have been developed to learn very effectively for their intended purpose, but their capabilities have the potential to extend beyond the realm of spam filtering. If the guts of a spam filter were to be deployed against the 'scraped' content, it should be able to identify the patterns relevant to the subscriber with a high and improving degree of reliability. What's more, the software would be responsive to user feedback - it would learn what mattered to the subscriber.
This might allow the software to develop and continually improve independently of programming hours invested.
-- Richard Hamilton, December 1, 2009
It's a great idea, and I know users would love it. But I have two issues with it.
1.) It takes away the revenue stream from content creators, who are depending on eyeballs of viewers to see their ads, not yours. There could be copyright issues, but regardless, it disincentivizes the content creators or aggregators.
2.) If it works for me, it will work for a spammer. And then it won't work at all. People will demand forums where this isn't allowed to reduce the amount of spam (and drunken tweets that are familiar)
-- Aaron Evans, August 25, 2010