News River Visualization of TIDES Data

1: The Problem

The TIDES news feed provides a firehose of text data sufficient to overwhelm any single human. Even the data summaries average over 11,000 words a day. The full news groups are much larger -- one single group, disease.anthrax, contains over 8,000 articles.


a few messages from the disease.anthrax newsgroup

The problem can be stated in long-term and short-term goals. The long-term goal would be to have an Artificial Intelligence (AI) program that operated as an expert system and understood the articles well enough to analyze, summarize and index them. This version of the problem is currently unsolvable.

The short-term goal is to serve as an assistant to the human in these tasks. We assume that somehow humans are finding the time to read every article, and that their major problem then becomes one of memory -- how do you find something you read last month, or six months ago?

2: Inspirations

The Chinese pictographic writing system suffers from being nearly un-indexable. It lacks even alphabetical order. This has made the categorization efforts of Chinese scholars extremely difficult. The 16th-century Italian Jesuit Matteo Ricci journeyed to China and taught scholars there a mnemonic technique called the Memory Palace, in which knowledge is stored in a real or imaginary building, utilizing humans' natural ability to remember spatially.


The Memory Palace of Matteo Ricci

His work was rediscovered in the 1980s by cognitive scientists and Virtual Reality (VR) researchers. Artificial spaces, or "cyberspaces," can be used as memory palaces.

"People don't read the morning newspaper," Marshall McLuhan once said, "they slip into it like a warm bath."

Inspired by Ricci and by McLuhan's "newspaper bath," as well as by published research from Xerox PARC and Pacific Northwest Laboratories, we have attempted to turn the firehose of data into a river of data.

3: The Plan

First we'd like to emphasise that we are designing the visualization software (in Java) with a plug-in architecture, so that other techniques can be easily tried. The "news river" represents our first set of experiments, and what we end up with may be quite different, and will certainly allow multiple visualization techniques to be used on the same data.

That said, here is what we planned for the news river approach.

Though there is a lot of data in the TIDES newsgroups, only a small fraction of it is new at any point in time. We wanted to create a display in which new data appears at the bottom edge daily, giving the impression of a river of words drifting by. Instead of trying to drink from a firehose, one can fish in the river as it flows by.

Precisely, what we wanted to do is plot word frequencies over time, with the frequencies represented as the horizontal widths of color bands and time as the vertical axis, with the past on top and the present on the bottom.
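
The mapping from counts to bands can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the prototype's code: the class BandLayout, its methods, the fixed column per word, and the centering of each band within its column are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns one day's word counts into horizontal color bands. */
public class BandLayout {

    /** One band in a row: left edge and width, both in pixels. */
    public static final class Band {
        public final int x;
        public final int width;

        Band(int x, int width) {
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Lays out one row (one day). Assumption: each word owns a fixed-width
     * column, in left-to-right word order, and its band is centered in that
     * column. Band width is proportional to the count, clamped to the column.
     */
    public static List<Band> layoutRow(int[] counts, int columnWidth, int maxCount) {
        List<Band> row = new ArrayList<>();
        for (int i = 0; i < counts.length; i++) {
            int clamped = Math.min(Math.max(counts[i], 0), maxCount);
            int width = (maxCount == 0) ? 0 : clamped * columnWidth / maxCount;
            int x = i * columnWidth + (columnWidth - width) / 2;
            row.add(new Band(x, width));
        }
        return row;
    }

    /** Time is the vertical axis: day 0 (the oldest) is the top row. */
    public static int rowY(int dayIndex, int rowHeight) {
        return dayIndex * rowHeight;
    }
}
```

A display would call layoutRow once per day and draw each band at rowY(dayIndex, rowHeight), so that newer days land lower on the screen.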


mock-up display using TIDES summary data

We did some mock-up displays using the TIDES summary data, which was small and easily obtained from email archives. Based on explorations of this data, we selected the words America, Arab, Iran, Iraq, and Laden, because they all appear frequently. (They are plotted in that order, left to right.) Other key features we want are time labels on the plot and the ability to pick a time in the graphic display and open articles from that date.
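
Counting the five selected words in a block of text might look like the following sketch. The class name TrackedWordCounter and the matching rule are assumptions made for illustration: here a token counts only if it equals a tracked word exactly, ignoring case, so "Americans" does not count toward "America". The original analysis may well have matched differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical counter for the five tracked words, in plotting order. */
public class TrackedWordCounter {

    public static final String[] TRACKED = {"America", "Arab", "Iran", "Iraq", "Laden"};

    /**
     * Counts whole-word, case-insensitive occurrences of each tracked word.
     * The returned map keeps the left-to-right plotting order.
     */
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : TRACKED) {
            counts.put(word, 0);
        }
        // Split on anything that is not an ASCII letter (assumption: English text).
        for (String token : text.split("[^A-Za-z]+")) {
            for (String word : TRACKED) {
                if (token.equalsIgnoreCase(word)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Running count over all of one day's text gives one row of the plot; repeating it per day gives the full series.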

4: Prototype Solution

In implementing a Java prototype, we found that the size of the data made it unwieldy to load it all into a browser-deployed Java applet. So we broke it into two pieces: a server-based program that loads and stores the messages from the news feed and does the analysis, creating a data file; and another program (which could easily be an applet) for displaying the results.
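
The hand-off between the two programs can be illustrated with a small data-file sketch. The actual file format of the prototype is not described here, so everything below is an assumption: the class FrequencyFile, the DayRow type, and a layout of one tab-separated line per day (a date string followed by one count per tracked word).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical data file shared by the analysis program and the display program. */
public class FrequencyFile {

    /** One day of results: a date label and one count per tracked word. */
    public static final class DayRow {
        public final String date;
        public final int[] counts;

        public DayRow(String date, int[] counts) {
            this.date = date;
            this.counts = counts;
        }
    }

    /** Server side: writes one tab-separated line per day, oldest first. */
    public static void write(Writer out, List<DayRow> rows) {
        PrintWriter pw = new PrintWriter(out);
        for (DayRow row : rows) {
            StringBuilder line = new StringBuilder(row.date);
            for (int count : row.counts) {
                line.append('\t').append(count);
            }
            pw.println(line);
        }
        pw.flush();
    }

    /** Display side: reads the same lines back, skipping empty ones. */
    public static List<DayRow> read(Reader in) {
        List<DayRow> rows = new ArrayList<>();
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                String[] fields = line.split("\t");
                int[] counts = new int[fields.length - 1];
                for (int i = 1; i < fields.length; i++) {
                    counts[i - 1] = Integer.parseInt(fields[i]);
                }
                rows.add(new DayRow(fields[0], counts));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

Keeping the file this small is what lets the display side stay light: it only ever sees per-day counts, never the messages themselves.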

We got the prototype working late Monday 10 December 2001. Picking is not yet implemented, but we are seeing a scrolling display of plotted word frequencies in the disease.anthrax group, with time labels. (The same words in the same order as above were used.) If we want more information we can manually return to a news reader and look at articles in that group on that date.


plot of word frequencies in Sep./Oct. 2001

And in fact we did just that. We noticed the expected increase in all the word frequencies after September 11, 2001. But there was also a spike around August 11th, so we went and looked.


plot of word frequencies in Aug./Sep. 2001

Sure enough, that was the day the Report of the National Commission on Terrorism was posted to the group, a rather long document that talks about all forms of terrorism.

5: Future Work