Structured Document Markup Languages and Automatic Abstracting

Chris Dent

L505 Essay 5

2001-02-10

Abstracting requires the abstractor to identify the sections of a document (Cremmins, p. 15). It follows that automatic identification of document structure should help in the automatic generation of abstracts, indexes and other modes of representation.

Structure provides clues to the location of important segments of a document. In plain text, where only the meaning of the text provides structural clues, a computer can identify content only through key words or phrases, which it finds either by frequency counting or through instruction from the operator. Here is a segment of a document in an unstructured form:

Finally, and perhaps most importantly, Kiva::User needs to be made more generic. Whereas McFeely is readily available for download and use by whoever chooses to use it, Kiva::User is currently not available in any form that is easy to distribute. It has potential value for other Internet Service Providers and other organizations that need to manage the creation and modification of a large number of users.

Summary

Kiva::User and McFeely have proven immensely valuable to Kiva Networking. Today, adding additional machines to the server architecture or changing user policy is far less of a concern than it was three years ago. Management of those machines and the users on those machines can now be accomplished with a well-defined, stable set of tools and methods (Dent, 2000).

The computer could identify, with some measure of accuracy, that the text following the paragraph consisting of the lone word “Summary” is probably important.
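That heuristic can be sketched in a few lines of Python. This is only an illustration, assuming paragraphs are separated by blank lines and that a heading is a paragraph consisting of a single known word. The function names and the naive sentence splitter are hypothetical, not part of any existing tool.

```python
import re

# The unstructured sample from above, with blank lines between paragraphs.
SAMPLE = """Finally, and perhaps most importantly, Kiva::User needs to be made more generic.

Summary

Kiva::User and McFeely have proven immensely valuable to Kiva Networking. Today, adding additional machines to the server architecture or changing user policy is far less of a concern than it was three years ago.
"""


def paragraph_after_heading(text, heading="Summary"):
    """Return the paragraph that follows a paragraph consisting only of `heading`."""
    # Assumption: blank lines delimit paragraphs in the plain text.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    for i, para in enumerate(paragraphs[:-1]):
        if para.lower() == heading.lower():
            return paragraphs[i + 1]
    return None


def first_sentence(paragraph):
    """Naive split: everything up to the first '.', '!' or '?' followed by whitespace."""
    match = re.search(r"(.+?[.!?])(\s|$)", paragraph.strip(), flags=re.S)
    return match.group(1) if match else paragraph.strip()


candidate = paragraph_after_heading(SAMPLE)
if candidate is not None:
    print(first_sentence(candidate))
```

The weakness is visible in the code: the result depends entirely on a guess about what a heading looks like. A document that labels its closing section “Conclusions”, or that does not separate the heading with blank lines, defeats it.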

To obtain greater accuracy the document itself must provide not just clues to its structure but explicit statements of it. A known structured markup language can indicate, without ambiguity, where certain sections may be found. Here is the same text marked up with an imaginary XML format:

[…]

<P>Finally, and perhaps most importantly, Kiva::User needs to be made more generic. Whereas McFeely is readily available for download and use by whoever chooses to use it, Kiva::User is currently not available in any form that is easy to distribute. It has potential value for other Internet Service Providers and other organizations that need to manage the creation and modification of a large number of users.</P>

</SECTION>

<P>Kiva::User and McFeely have proven immensely valuable to Kiva Networking. Today, adding additional machines to the server architecture or changing user policy is far less of a concern than it was three years ago. Management of those machines and the users on those machines can now be accomplished with a well-defined, stable set of tools and methods.</P>

[…]

</SECTION>

With that kind of format the computer could identify sections, determine where the introduction and summary lie and chain first sentences from sections to create the start of a potential abstract.
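A minimal sketch of that process, using Python's standard xml.etree.ElementTree module, might look like the following. The DOCUMENT wrapper, the TITLE attribute and the “Placeholder” title are assumptions made for the sake of a complete example, because the opening tags of the imaginary format are not shown above. Only the SECTION and P elements come from the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical complete markup: DOCUMENT and TITLE are assumed, not from the essay.
SAMPLE_XML = """<DOCUMENT>
<SECTION TITLE="Placeholder">
<P>Finally, and perhaps most importantly, Kiva::User needs to be made more generic. Whereas McFeely is readily available for download and use by whoever chooses to use it, Kiva::User is currently not available in any form that is easy to distribute.</P>
</SECTION>
<SECTION TITLE="Summary">
<P>Kiva::User and McFeely have proven immensely valuable to Kiva Networking. Today, adding additional machines to the server architecture or changing user policy is far less of a concern than it was three years ago.</P>
</SECTION>
</DOCUMENT>"""


def first_sentence(text):
    """Naive split: everything up to the first '.', '!' or '?' followed by whitespace."""
    match = re.search(r"(.+?[.!?])(\s|$)", text.strip(), flags=re.S)
    return match.group(1) if match else text.strip()


def summary_section(root):
    """Locate the section explicitly labelled as the summary, if there is one."""
    return root.find(".//SECTION[@TITLE='Summary']")


def draft_abstract(root):
    """Chain the first sentence of the first paragraph of every SECTION."""
    sentences = []
    for section in root.iter("SECTION"):
        para = section.find("P")
        if para is not None and para.text:
            sentences.append(first_sentence(para.text))
    return " ".join(sentences)


root = ET.fromstring(SAMPLE_XML)
print(draft_abstract(root))
```

Unlike the plain-text case, nothing here is guessed. A section is whatever the markup says is a section, so the only remaining heuristic is the choice of which sentence to take from it.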

HTML fails as a structured markup language because it is a hybrid of presentation markup and document structure markup. Tags that were originally meant to describe the structure of a document are often used only for their effect on presentation. For example, the H1 tag frequently does not mean “Header of the greatest importance” but instead “Some text that should appear big.”

XML addresses these issues by separating document structure from presentation and by more rigorously enforcing the concept of containers in the markup. One does not mark up section headers; one marks up sections that happen to have headers (see the SECTION tag above).

Getting value out of XML for abstract automation will require some discipline. People tend not to like writing in a structured fashion. Template systems might help, and may in fact offer a way for authors to give greater assistance in the automatic generation of abstracts and indexes. Many publications require certain formats for submission. If these formats were defined by a standardized XML DTD, writing in a somewhat structured form could be a part of the submission and review process. Imagine how much easier abstract generation might be if the SECTION tag described above were somewhat more complex, for instance by carrying a SUMMARY attribute.

If we take the idea of automation a bit further, semi-automated document editors could automate the generation of the SUMMARY attribute of a SECTION. Some editors already have similar capabilities.
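As an illustration only, an editor pass of that kind might fill in missing SUMMARY attributes and then assemble an abstract from them. The sketch below assumes a hypothetical DOCUMENT wrapper around the sections and uses a naive first-sentence rule as the draft summary. A real editor would presumably let the author review and revise each draft.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: the first SUMMARY is author-written, the second is missing.
SAMPLE_XML = """<DOCUMENT>
<SECTION SUMMARY="Kiva::User needs to be made more generic.">
<P>Finally, and perhaps most importantly, Kiva::User needs to be made more generic. Kiva::User is currently not available in any form that is easy to distribute.</P>
</SECTION>
<SECTION>
<P>Kiva::User and McFeely have proven immensely valuable to Kiva Networking. Today, adding additional machines to the server architecture or changing user policy is far less of a concern than it was three years ago.</P>
</SECTION>
</DOCUMENT>"""


def first_sentence(text):
    """Naive split: everything up to the first '.', '!' or '?' followed by whitespace."""
    match = re.search(r"(.+?[.!?])(\s|$)", text.strip(), flags=re.S)
    return match.group(1) if match else text.strip()


def fill_missing_summaries(root):
    """Draft a SUMMARY for each SECTION that lacks one; author-written ones are kept."""
    for section in root.iter("SECTION"):
        if not section.get("SUMMARY"):
            para = section.find("P")
            if para is not None and para.text:
                section.set("SUMMARY", first_sentence(para.text))
    return root


def abstract_from_summaries(root):
    """Join the SUMMARY attributes of all sections, in document order."""
    return " ".join(
        s.get("SUMMARY") for s in root.iter("SECTION") if s.get("SUMMARY")
    )


root = fill_missing_summaries(ET.fromstring(SAMPLE_XML))
print(abstract_from_summaries(root))
```

Keeping author-written summaries and drafting only the missing ones reflects the semi-automated division of labour: the machine proposes, and the author retains the final word.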

REFERENCES

Cremmins, E. The Art of Abstracting. ISI Press, 1982.

Dent, Chris, Matt Liggett, Adrian Hosey and Jeremy Fischer. “Distributed User Management Tools: Developing Kiva::User and McFeely”. USENIX 2001 Paper Submission (redirected with comments to LISA 2001), 2000.