OTS: Command line text auto-summary

June 6th, 2007 edited by ana

Article submitted by Alex Gretlein. We are running out of articles! Please help DPOTD and submit good articles about software you like NOW!

Open Text Summarizer (OTS) is both a library and a command line tool, developed by Nadav Rotem, that summarises text. It is similar to the functionality incorporated into Microsoft Word and available in all native Mac OS X applications. OTS uses word frequency to prepare a list of keywords, then assigns each sentence a priority based on those frequencies. It then outputs a summarised version of your text according to a ratio you supply. The default is 20%, i.e. the summary will contain one-fifth as many sentences as the original.

An automated process like this can never be perfect, and some texts are more amenable to auto-summarising than others. Because OTS works sentence by sentence, well-structured prose works best, and the text needs to be reasonably substantial for the summary to be meaningful. Shorter texts, lists, and internally incoherent or structurally inconsistent texts will tend to produce gibberish, which can have its own amusement value. Auto-summaries can be used as a basis for abstracts or catalogue descriptions, for article summaries in RSS feeds, or for checking keyword frequency for Search Engine Optimisation.

While the performance of OTS may not quite be up to the standards of proprietary alternatives (see this 2003 review), it is, as far as I was able to determine, the only available free or open source (specifically GPL) library for this purpose.
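To make the word-frequency idea concrete, here is a minimal sketch in Python. It is not OTS's actual code: the stop-word list, the sentence splitter and the scoring rule (sum of word counts per sentence) are all simplifying assumptions chosen for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only (not taken from OTS).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "with",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def content_words(sentence):
    """Lower-case words of a sentence, minus the stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, ratio=20):
    """Keep roughly `ratio` percent of the sentences, in original order."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    # Count how often each content word occurs in the whole text.
    freq = Counter(w for s in sentences for w in content_words(s))
    # Assumed scoring rule: a sentence scores the sum of its word counts.
    scores = [sum(freq[w] for w in content_words(s)) for s in sentences]
    keep = max(1, round(len(sentences) * ratio / 100))
    # Pick the highest-scoring sentences; ties go to the earlier sentence.
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:keep]
    return " ".join(sentences[i] for i in sorted(top))
```

Calling summarize(text, ratio=10) on a long prose document would keep about one sentence in ten. Note that this sketch treats "summarise" and "summarised" as unrelated words, and that summing counts favours long sentences; a real tool would need to handle both.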

The developer has produced a screencast showing OTS in action. As a sample of the program’s output, this is a 20% summary of the “Ground Rules” section of the Ubuntu Code of Conduct.

This Code of Conduct covers your behaviour as a member of the Ubuntu Community, in any forum, mailing list, wiki, web site, IRC channel, install-fest, public meeting or private correspondence. The Ubuntu Community Council will arbitrate in any dispute over the conduct of a member of the community. We expect members of the Ubuntu community to be respectful when dealing with other contributors as well as with people outside the Ubuntu project, and with users of Ubuntu. Your work should be done transparently and patches from Ubuntu should be given back to the community when they are made, not just when the distribution releases. If you really want to go a different way, then we encourage you to make a derivative distribution or alternative set of packages available using the Ubuntu Package Management framework, so that the community can try out your changes and ideas for itself and contribute to the discussion.

You can run OTS by itself with the command —surprise!— ots:

Usage: ots [OPTIONS...] [file.txt | stdin]
  -r, --ratio=<int>     summarization % [default = 20%]
  -d, --dic=<string>    dictionary to use
  -o, --out=<string>    output file [default = stdout]
  -h, --html            output as html
  -k, --keywords        only output keywords
  -a, --about           only output the summary
  -v, --version         show version information

Help options:
  -?, --help            Show this help message
  --usage               Display brief usage message

So, for example, if I had a document called ucoc and I wanted a 10% summary of it in a file called ucoc-tiny, I would run:

$ ots -r 10 -o ucoc-tiny ucoc

The --keywords option seems to be deprecated. The --html option outputs an HTML page of the entire text with the elements that would make up the summary highlighted in yellow.
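If you want to call the command line tool from a script, a thin wrapper is enough. The following Python sketch uses only the -r, -o and -h flags from the usage text above; the function names and the range check on the ratio are my own assumptions, not part of OTS.

```python
import shutil
import subprocess


def build_ots_command(input_path, ratio=20, out_path=None, html=False):
    """Build the argument list for ots from the flags in its usage text."""
    # Assumption: the ratio is a percentage, so anything outside 1-100 is rejected.
    if not 0 < ratio <= 100:
        raise ValueError("ratio must be a percentage between 1 and 100")
    cmd = ["ots", "-r", str(ratio)]
    if out_path is not None:
        cmd += ["-o", out_path]   # write the summary to a file
    if html:
        cmd.append("-h")          # HTML output with the summary highlighted
    cmd.append(input_path)
    return cmd


def run_ots(input_path, **options):
    """Run ots and return whatever it prints on standard output."""
    if shutil.which("ots") is None:
        raise FileNotFoundError("ots executable not found on PATH")
    result = subprocess.run(
        build_ots_command(input_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

With this, build_ots_command("ucoc", ratio=10, out_path="ucoc-tiny") produces the same arguments as the command shown earlier, and run_ots("ucoc", ratio=10) would return the summary as a string, provided ots is installed and on your PATH.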

OTS uses XML-based dictionary files to provide word recognition for different languages. The latest version includes files for 37 languages, including most of the major languages written in Roman script, as well as Russian and Hebrew. It does not appear to have any means of recognising variant forms of the same word, such as those produced by verb conjugation, which matters particularly in languages like Hebrew.

OTS is available in the repositories of Debian from at least sarge on, and of every release of Ubuntu. It is available by itself under the name libots0. To install it use your favourite graphical installer or run:

$ sudo apt-get install libots0

In both distros this is version 0.4.2, released in 2003. There are also libots-dev packages. A new version, 0.5.0, was released in April 2007. The source code is available from SourceForge.

As a library, libots can also be used by other programs. There is a list on the project home page of three applications that provide summarising through OTS:

  1. There was a plug-in in the development version of AbiWord at the time the OTS site was written. That “development version” is now fairly ancient, and the plug-in is a fully integrated part of the AbiWord version available in current Debian or Ubuntu systems. AbiWord itself is a great lightweight alternative to OpenOffice Writer. It’s also the default word processor for Xubuntu, and part of the xubuntu-desktop package. If you’re not running Xubuntu, you can install it with the package name abiword. (There are also development and plug-in packages for AbiWord.)
  2. The second application to use OTS is Gnome-Summarizer, a GUI by the author himself. It appears from the screenshot to display the output HTML file and keywords. Even more exciting-looking than this is the “Researcher’s Tool” demonstrated in the screencast of the next version.
  3. The third program listed is a gedit plug-in by Daniel Brodie.

To these, we may be able to add Haystack (described here), an extension for the Plone framework which identifies related content. It uses a Python wrapper called ots, which is available in Python’s Cheese Shop.

An earlier version of this article was posted at IQAG Notes.

Posted in Debian, Ubuntu |

7 Responses

  1. Matteo Says:

    Hi. You may summarise your article and put at the end the result!

    Good work guys.

  2. xc Says:

    20%:
    While the performance of OTS may not quite be up to the standards of proprietary alternatives (see this 2003 review), it is —as far as I was able to determine— the only available free or open source (specifically GPL) library for this purpose.The developer has produced a screencast showing OTS in action. As a sample of the program’s output, this is a 20% summary of the “Ground Rules” section of the Ubuntu Code of Conduct. If you really want to go a different way, then we encourage you to make a derivative distribution or alternative set of packages available using the Ubuntu Package Management framework, so that the community can try out your changes and ideas for itself and contribute to the discussion. The –html option outputs an HTML page of the entire text with the elements that would make up the summary highlighted in yellow.OTS uses XML based dictionary files to provide word recognition for different languages. It does not appear to have any means of recognising variant forms of the same word, such as verb conjugation, particularly in languages like Hebrew.OTS is available in the repositories of Debian from at least sarge on, and of every release of Ubuntu.

  3. James Says:

    I just wrote a php wrapper around this library. It seems like one could use this for doing something really groundbreaking in a web app… now to just figure out what…

  4. John Carter Says:

    Hmm. Need firefox addon version of this…

  5. TuringTest Says:

    I did create a firefox extension for this. Worked great, but it was too slow. I may rescue it from my backup CDs and send it to you, if interested. It only worked in Linux because I couldn’t manage to install OTS on Windows.

  6. Eran Says:

    I’d love a firefox extension for that! Please post a link or an email address…

  7. nam Says:

    I have installed ots-0.5.0.
    when I try to run the article using the ots command (as you have mentioned above), it says
    bash: ots: command not found

    please help