Data filtering works by examining the content a web server returns
rather than the request for that content. Because this requires
inspecting far more data, these solutions are generally less scalable
and more expensive than those using URL filtering.
The most widely-used type of data filtering is keyword filtering. The
keywords may be chosen by humans or by computers. Unfortunately, this
approach is prone to errors. Everyone has heard how the White House
home page was blocked because a filtering company had "couple" on its
keyword list. In addition, many inappropriate sites consist entirely
of pictures or carry accompanying text in other languages, and
keyword filters handle neither situation well.
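The failure modes above are easy to demonstrate. Below is a minimal
keyword-filter sketch (the blocklist and function names are
illustrative, not taken from any real product); naive substring
matching blocks an innocuous page that mentions "couple", while a
page made up entirely of images sails through untouched.

```python
BLOCKED_KEYWORDS = ["couple", "xxx"]  # hypothetical blocklist

def is_blocked(page_text: str) -> bool:
    """Return True if any blocked keyword appears in the page text."""
    text = page_text.lower()
    return any(kw in text for kw in BLOCKED_KEYWORDS)

# A legitimate page tripped by substring matching:
print(is_blocked("The President and the First Couple host a tour."))  # True

# A page consisting only of pictures slips through entirely:
print(is_blocked("<img src='photo1.jpg'><img src='photo2.jpg'>"))     # False
```

Real products use longer lists and fancier matching, but the
underlying problem is the same: the filter only sees text, and only
text in languages its list covers.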
Some of the limitations of keyword filters can be partially overcome by
using sophisticated AI techniques (most notably neural networks) to build
more complex rules. Ultimately, however, these are still keyword
filters, and they suffer from most of the limitations inherent in
textual analysis.
A few people have tried to implement real-time or near-real-time image
recognition ("nipple filters") with limited success.
Another approach to data filtering is to embed meta-tags in the web
pages or the HTTP reply headers. These self-labelling schemes have
been tried by various parties (PICS, RSACi, SafeSurf) over the past
several years with minimal success. This approach requires complete
acceptance and compliance by all of the relevant content providers,
which is extremely unlikely.
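A self-labelling filter might look for a label roughly as sketched
below. The "PICS-Label" header and http-equiv meta-tag are the
mechanism PICS actually defined, but the parsing here is deliberately
simplified and the function name is made up for illustration. The
sketch also shows the scheme's central weakness: an unlabelled page
yields no information at all, so the filter must either block it or
pass it blindly.

```python
import re

def find_pics_label(headers: dict, html: str):
    """Return the raw PICS label string, or None if the page is unlabelled."""
    # First check the HTTP reply headers.
    label = headers.get("PICS-Label")
    if label:
        return label
    # Fall back to an http-equiv meta-tag embedded in the page itself.
    m = re.search(
        r'<meta\s+http-equiv=["\']PICS-Label["\']\s+content=["\'](.*?)["\']',
        html, re.IGNORECASE)
    return m.group(1) if m else None

# A labelled page is easy to classify:
print(find_pics_label(
    {}, '<meta http-equiv="PICS-Label" content="(pics-1.1 ...)">'))

# But the vast majority of pages carry no label at all:
print(find_pics_label({}, "<html><body>No labels here</body></html>"))  # None
```

Since labelling is voluntary, the second case dominates in practice,
which is why these schemes never achieved the universal compliance
they require.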