Semantics, AI, and the Web

2007-05-17

In the never-ending river of discussion in the W3C HTML Working Group, there are bits of fuzzy and malformed notions floating by that I hope - somehow - we can clean up. In particular, the distinction between "Semantics" in the human sense and "semantics" as used (or misused) within some very shallow and limited application-specific domain is not generally clear.

The expertise within the working group is quite, er, varied - so the lack of a common vocabulary is to be expected.

Roughly three sorts of Semantics can be deduced from the web:

1. What meaning can be derived from reading present-day web content.
2. What meaning is explicitly declared in existing and future web content.
3. Meaning attached in cooperation between human authors and automated agents.

Humans are relatively good at (1), given sufficient domain-specific knowledge relevant to the topic at hand. Computers are less effective, as the software to approach human-level understanding just does not (yet) exist. This is a topic worthy of continued research, but the timeline between now and anything like complete results is probably decades.

Without doubt there are folk who believe (2) is a viable option (it seems to go with a specific personality type). In reality, outside a few noble gardens of well-organized content, the vast majority of content will remain without useful explicit organization.

Given that (1) is hard, and (2) is unlikely, can we find a third feasible choice?

While we may be a long way from (1), that is not the same as saying there is no useful output from decades of AI research. While asking your average web author to classify content within a global deep network of meaning is absurd, asking a small set of "is A like B" questions could be enough to bridge the gap.

There is an array of pragmatic considerations. There has to be one agent asking the questions, not one from every wannabe search engine. Linkage might be explicitly declared in content by the author, but more often is going to be via a look-aside through some shared pool. The conceptual linkage may not have a human-assigned name, and may instead be derived from "is A like B" answers. The deep knowledge structures built over time may change, and must be able to change without involving the human authors.
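
To make the "shared pool" notion a little more concrete, here is a minimal sketch in Python of how unnamed concept groups might fall out of nothing but "is A like B" answers. Everything in it is an assumption for illustration: the `ConceptPool` class, its method names, and the example URLs are hypothetical, and treating "like" as transitive (via a union-find structure) is a simplification that a real agent would probably not get away with.

```python
class ConceptPool:
    """Hypothetical shared pool that groups items by 'is A like B' answers."""

    def __init__(self):
        # item -> parent item in a union-find forest; roots are their own parent
        self._parent = {}

    def _find(self, item):
        # Register unseen items as their own singleton group.
        self._parent.setdefault(item, item)
        # Walk up to the root, shortening the path as we go.
        while self._parent[item] != item:
            self._parent[item] = self._parent[self._parent[item]]
            item = self._parent[item]
        return item

    def record_answer(self, a, b, alike):
        """Store one author answer to the question 'is a like b?'."""
        root_a, root_b = self._find(a), self._find(b)
        # A "yes" merges the two groups. A "no" only registers the items;
        # this sketch does not keep it as a constraint.
        if alike and root_a != root_b:
            self._parent[root_b] = root_a

    def same_concept(self, a, b):
        """True if a and b currently sit in the same unnamed group."""
        return self._find(a) == self._find(b)

    def groups(self):
        """Return the current groups as a list of sets; none has a name."""
        clusters = {}
        for item in list(self._parent):
            clusters.setdefault(self._find(item), set()).add(item)
        return list(clusters.values())


# Hypothetical answers collected from authors by the single asking agent.
pool = ConceptPool()
pool.record_answer("example.org/soup", "example.org/stew", True)
pool.record_answer("example.org/stew", "example.org/chili", True)
pool.record_answer("example.org/soup", "example.org/tax-form", False)
```

The groups here never receive a human-assigned name, and the pool can be regrouped later without going back to the authors; only the recorded answers come from humans.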

In the end, the question comes down to: Is there anything useful the AI community has to say on the subject of Semantics that is relevant to HTML authoring?