URL Databases: White, Black, and List Creation & Maintenance
Whitelists are collections of "appropriate" content. Any URL not on the
list is not accessible. These make the most sense when the definition of
"appropriate" is very narrow, or the degree of "safety" must be very high.
They tend to be extremely limiting. Appropriate uses include workstations
devoted to specific sites or to pages from specific directories.
Yahooligans is a freely available whitelist put out by Yahoo and Lucent for
use with ChoiceNet.
Blacklists are collections of "inappropriate" content. Any URL not
on the list is accessible. Most filtering solutions use blacklists,
because in the most common uses of filtering far fewer sites are
deemed inappropriate than appropriate. The disadvantage of blacklists
is that new content is added to the Internet very rapidly, so the list
must be updated continuously to include all of the inappropriate material.
This is expensive. As a result, there are no decent freely available
blacklists, and most of the commercially available blacklists are not
very effective either.
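The two list types differ only in what happens to a URL that is not on
the list: a whitelist denies it, a blacklist allows it. The following is a
minimal Python sketch of that difference, not any vendor's implementation.
The function names, the "host/path" entry format, and the rule that an
entry ending in "/" covers everything beneath it are assumptions made for
illustration.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Reduce a URL to 'host/path' with a lowercase host (assumed entry format)."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    return host + path

def matches(url, entries):
    """True if the URL equals an entry, or falls under an entry ending in '/'."""
    key = normalize(url)
    return any(key == e or (e.endswith("/") and key.startswith(e))
               for e in entries)

def allowed_by_whitelist(url, whitelist):
    # Default deny: only listed URLs are accessible.
    return matches(url, whitelist)

def allowed_by_blacklist(url, blacklist):
    # Default allow: only listed URLs are blocked.
    return not matches(url, blacklist)

# Hypothetical lists using reserved example domains.
whitelist = {"kids.example.org/", "library.example.edu/catalog/"}
blacklist = {"bad.example.net/"}

print(allowed_by_whitelist("http://kids.example.org/games/", whitelist))
print(allowed_by_whitelist("http://unknown.example.com/", whitelist))
print(allowed_by_blacklist("http://unknown.example.com/", blacklist))
print(allowed_by_blacklist("http://bad.example.net/page", blacklist))
```

A site that appears on neither list is blocked under the whitelist and
allowed under the blacklist, which is why a blacklist must keep pace with
new content while a whitelist does not.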
Building and maintaining a quality list, whether white or black, requires
a lot of infrastructure and ongoing support. At N2H2, for example, we have
several database servers, racks of support servers, and more than 100
employees maintaining the system and reviewing sites.
There are two general approaches to list building and maintenance: manual
and automated. Using humans to find and review sites results in a
relatively small but extremely high-quality database. Using machines to
find and review sites results in a large but rather low-quality database.
Most filtering efforts started with the former and moved to the latter as
they realized the labor costs involved in maintaining large review staffs.
The best approach integrates the two: machines identify the bulk of the
potentially inappropriate sites, and humans then make the actual
decisions. This gives a large, high-quality database at a lower (but
still rather expensive) cost.
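The combined approach can be sketched as a two-stage pipeline: a cheap
automated pass flags candidates, and a human reviewer decides which
candidates actually go on the list. The Python below is only an
illustration under stated assumptions. The keyword-count scoring, the term
list, the threshold, and all function names are hypothetical stand-ins
for whatever classifier and review tooling a real system would use.

```python
from collections import deque

SUSPECT_TERMS = {"casino", "poker"}   # hypothetical trigger terms
THRESHOLD = 2                         # hypothetical cutoff for human review

def machine_score(page_text):
    """Crude automated score: count of words that are suspect terms."""
    return sum(1 for word in page_text.lower().split()
               if word in SUSPECT_TERMS)

def triage(pages, threshold=THRESHOLD):
    """Stage 1 (machine): queue every URL whose score reaches the threshold."""
    return deque(url for url, text in pages.items()
                 if machine_score(text) >= threshold)

def review(queue, human_decision, blacklist):
    """Stage 2 (human): only URLs a reviewer confirms are added to the list."""
    while queue:
        url = queue.popleft()
        if human_decision(url):
            blacklist.add(url)
    return blacklist

# Hypothetical crawl results: URL -> page text.
pages = {
    "a.example/": "casino poker bonus",
    "b.example/": "the casino scene in the film",
    "c.example/": "gardening tips",
    "d.example/": "casino and poker history museum",
}

queue = triage(pages)
# Stand-in for a reviewer who confirms a.example/ and rejects d.example/.
blacklist = review(queue, lambda url: url == "a.example/", set())
print(sorted(blacklist))
```

The machine stage keeps the review queue small relative to the full
crawl, and the human stage keeps false positives such as the museum page
off the list.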