URL Databases: White, Black, and List Creation & Maintenance

White lists are collections of "appropriate" content. Any URL not on the list is inaccessible. They make the most sense when the definition of "appropriate" is very narrow, or when the required degree of "safety" is very high, and they tend to be extremely limiting. Appropriate uses include workstations devoted to specific sites or to pages from specific directories. Yahooligans is a freely available white list put out by Yahoo and Lucent for use with ChoiceNet.

Black lists are collections of "inappropriate" content. Any URL not on the list is accessible. Most filtering solutions use black lists, because in the most common uses of filtering far fewer sites are deemed inappropriate than appropriate. The disadvantage of black lists is that new content is added to the Internet very rapidly, so the list must be updated continuously to keep up with new inappropriate material. This is expensive. As a result, there are no decent freely available black lists, and most of the commercially available ones are not very effective either.
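The difference between the two list types comes down to what happens to a URL that is not on the list. The sketch below is only an illustration of that rule, not how any particular product works: the entry format (a host plus a directory prefix), the function names, and the example hosts are all assumptions made for the example.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Reduce a URL to (lowercase host, path). Assumes the URL includes a scheme."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    return host, path

def matches(url, entries):
    """True if the URL's host is listed and its path falls under a listed directory prefix."""
    host, path = normalize(url)
    return any(host == h and path.startswith(prefix) for h, prefix in entries)

def is_accessible(url, entries, mode):
    """mode 'white': only listed URLs pass. mode 'black': only listed URLs are blocked."""
    listed = matches(url, entries)
    if mode == "white":
        return listed
    if mode == "black":
        return not listed
    raise ValueError("mode must be 'white' or 'black'")

# Hypothetical entries: one whole site, and one directory on another site.
ENTRIES = [("kids.example.org", "/"), ("www.example.com", "/library/")]

for url in ("http://kids.example.org/games",
            "http://www.example.com/library/index.html",
            "http://unknown.example.net/"):
    print(url, is_accessible(url, ENTRIES, "white"), is_accessible(url, ENTRIES, "black"))
```

Note how the unlisted URL is refused under the white list and allowed under the black list. That default is why a white list is safe but limiting, and why a black list is only as good as its latest update.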

Building and maintaining a quality list, whether white or black, requires a lot of infrastructure and ongoing support. At N2H2, for example, we have several database servers, racks of support servers, and more than 100 employees maintaining the system and reviewing sites.

There are two general approaches to list building and maintenance: manual and automated. Using humans to find and review sites results in a relatively small but extremely high quality database. Using machines to find and review sites results in a large but rather low quality database. Most filtering efforts started with the former and moved to the latter once they realized the labor costs involved in maintaining large review staffs. The best approach integrates the two: machines identify the bulk of the potentially inappropriate sites, and humans then make the actual decisions. This gives a large, high quality database at a lower (but still rather expensive) cost.
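The combined approach can be pictured as a two-stage pipeline: a cheap automated pass flags candidates, and a human reviewer makes the final call on each one. The following is a minimal sketch under stated assumptions, not a description of N2H2's system. The keyword weights, the threshold, the gambling-themed example category, and every name in the code are hypothetical, and a real automated pass would be far more elaborate than keyword matching.

```python
from dataclasses import dataclass, field

# Hypothetical keyword weights for one example category (gambling).
KEYWORD_WEIGHTS = {"casino": 0.6, "poker": 0.5, "jackpot": 0.4}
REVIEW_THRESHOLD = 0.5  # assumed cutoff for sending a page to human review

def machine_score(page_text):
    """Crude automated pass: sum the weights of flagged words present, capped at 1.0."""
    words = set(page_text.lower().split())
    return min(1.0, sum(w for k, w in KEYWORD_WEIGHTS.items() if k in words))

@dataclass
class ReviewPipeline:
    queue: list = field(default_factory=list)     # (score, url) pairs awaiting review
    blacklist: set = field(default_factory=set)   # URLs a human has confirmed

    def crawl(self, url, page_text):
        """Machine stage: queue the page only if it looks potentially inappropriate."""
        score = machine_score(page_text)
        if score >= REVIEW_THRESHOLD:
            self.queue.append((score, url))

    def review(self, human_decides):
        """Human stage: highest-scoring candidates first; only confirmed URLs are listed."""
        for _score, url in sorted(self.queue, reverse=True):
            if human_decides(url):
                self.blacklist.add(url)
        self.queue.clear()

pipeline = ReviewPipeline()
pipeline.crawl("http://a.example/", "Play poker and win the jackpot")
pipeline.crawl("http://b.example/", "History of the casino in film")
pipeline.crawl("http://c.example/", "Gardening tips for spring")
# Stand-in for a human reviewer, who here confirms only the first site.
pipeline.review(lambda url: url == "http://a.example/")
print(sorted(pipeline.blacklist))
```

In this sketch the machine stage discards the gardening page without human effort, while the film-history page is flagged by its keyword but rejected on review. That is the kind of false positive a purely automated list would keep.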
