Article submitted by Alex Gretlein. We are running out of articles ! Please help DPOTD and submit good articles about software you like NOW !
Open Text Summarizer is both a library and a command line tool (developed by Nadav Rotem) that, well, summarises text. It is similar to the functionality incorporated into Microsoft Word and available in all native Mac OS X applications. The approach taken by OTS is to use word frequency to prepare a list of keywords and assign priority to sentences based on that frequency. It then outputs a summarised version of your text based on a ratio you supply —the default is 20%, i.e. the summary will be one-fifth the size of the original in terms of number of sentences. An automated process like this can never be perfect, and some texts are more amenable to auto-summarising than others. The reliance on sentences means that a well structured prose text works best, and that it should be somewhat substantial to produce meaning. Auto-summaries can be used as a basis for abstracts or catalogue descriptions, for article summaries in RSS feeds, or for checking keyword frequency for Search Engine Optimisation. Shorter texts, lists, and internally incoherent or structurally inconsistent texts will tend to produce gibberish —which can have its own amusement value. While the performance of OTS may not quite be up to the standards of proprietary alternatives (see this 2003 review), it is —as far as I was able to determine— the only available free or open source (specifically GPL) library for this purpose.
The developer has produced a screencast showing OTS in action. As a sample of the program’s output, this is a 20% summary of the “Ground Rules” section of the Ubuntu Code of Conduct.
This Code of Conduct covers your behaviour as a member of the Ubuntu Community, in any forum, mailing list, wiki, web site, IRC channel, install-fest, public meeting or private correspondence. The Ubuntu Community Council will arbitrate in any dispute over the conduct of a member of the community. We expect members of the Ubuntu community to be respectful when dealing with other contributors as well as with people outside the Ubuntu project, and with users of Ubuntu. Your work should be done transparently and patches from Ubuntu should be given back to the community when they are made, not just when the distribution releases. If you really want to go a different way, then we encourage you to make a derivative distribution or alternative set of packages available using the Ubuntu Package Management framework, so that the community can try out your changes and ideas for itself and contribute to the discussion.
You can run OTS by itself with the command —surprise!— ots
:
Usage: ots [OPTIONS...] [file.txt | stdin]
-r, –ratio=<int> summarization % [default = 20%]
-d, –dic=<string> dictionary to use
-o, –out=<string> output file [default = stdout]
-h, –html output as html
-k, –keywords only output keywords
-a, –about only output the summary
-v, –version show version information
Help options:
-?, –help Show this help message
–usage Display brief usage message
So, for example if I had a document called ucoc
and I wanted a 10% summary of it in a file called ucoc-tiny
, I would run:
$ ots -r 10 -o ucoc-tiny ucoc
The --keywords
option seems to be deprecated. The --html
option outputs an HTML page of the entire text with the elements that would make up the summary highlighted in yellow.
OTS uses XML based dictionary files to provide word recognition for different languages. The latest version includes files for 37 languages —including most of the major languages written in Roman script, as well as Russian and Hebrew. It does not appear to have any means of recognising variant forms of the same word, such as verb conjugation, particularly in languages like Hebrew.
OTS is available in the repositories of Debian from at least sarge on, and of every release of Ubuntu. It is available by itself under the name libots0. To install it use your favourite graphical installer or run:
$ sudo apt-get install libots0
In both distros this is version 0.4.2, released in 2003. There are also libots-dev packages. A new version, 0.5.0 was released in April, 2007. The source code is available from Sourceforge.
As a library, libots can also be used by other programs. There is a list on the project home page of three applications that provide summarising through OTS:
- There was a plug-in in the development version of AbiWord at the time the OTS site was written. That “development version” is fairly ancient and it is a fully integrated part of the version available in current Debian or Ubuntu systems. Abiword itself is a great lightweight alternative to OpenOffice Writer. It’s also the default word processor for Xubuntu, and part of the xubuntu-desktop package. If you’re not running Xubuntu, you can install it with the package name
abiword
. (There are also development and plug-in packages for Abiword.)
- The second application to use OTS is Gnome-Summarizer, a GUI by the author himself. It appears from the screenshot to display the output HTML file and keywords. Even more exciting-looking than this is the “Researcher’s Tool” demonstrated in the screencast of the next version.
- The third program listed is a gedit plug-in by Daniel Brodie.
To this, we may be able to add Haystack (described here), an extension for the Plone framework which identifies related content. It uses a Python wrapper called ots, which is available in Python’s Cheese Shop.
An earlier version of this article was posed at IQAG Notes.