The KANT System
System Overview
The research presented in this thesis is a set of algorithms and applications that together form what we call the KANT system. KANT stands for Knowledge Acquisition, iNterpretation and Translation. The system is designed for Japanese-English cross-lingual information retrieval and knowledge acquisition, with a focus on news. An overview of the system can be seen in the figure below. The system is divided into two major parts: the KANT library and an application suite. The main focus of this thesis is the KANT library, which is intended to serve as a foundation for future research. The next two subsections describe the library and the application suite in more detail.
KANT Library
The KANT library is a set of core algorithms designed to deal with news. While the current target is Japanese and English, the library is designed so that other languages can be easily added. It is made up of three modules: Knowledge Acquisition, Knowledge Interpretation and Knowledge Translation. These three modules take care of acquiring news articles and interpreting and translating the knowledge in those articles.
The Knowledge Acquisition module is made up of three submodules: News Monitoring, Special Domain Creation and Comparable Corpus Creation. The News Monitoring submodule watches news feeds and sites for new articles to bring into the system. The Special Domain Creation and Comparable Corpus Creation submodules address the third tenet (Manual creation of corpora is too expensive): they gather corpora that are then mined for statistical information and used for lexicon creation.
The Knowledge Interpretation module is the main module for taking the raw knowledge in news articles and putting it in a form more usable by computers. It is made up of three submodules: Keyword Extraction, Topic Analysis and Named Entity Recognition. Each submodule deals with converting a particular type of raw knowledge.
The Knowledge Translation module deals with translating knowledge from one form to another. This can mean translating from one language to another, or converting different forms of interpreted knowledge into a new representation. The module is made up of two submodules: Machine Translation and Knowledge Network.
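To make the structure of the library easier to see at a glance, the module hierarchy described above can be written down as a small data structure. This is only an illustrative sketch: the module and submodule names are taken from the text, while the dictionary `KANT_LIBRARY` and the helper functions `submodules_of` and `module_of` are hypothetical names introduced here and are not part of the KANT library itself.

```python
from typing import Dict, List

# Hypothetical representation of the KANT library layout.
# Names of modules and submodules follow the text; the structure
# and helper functions are assumptions made for illustration.
KANT_LIBRARY: Dict[str, List[str]] = {
    "Knowledge Acquisition": [
        "News Monitoring",
        "Special Domain Creation",
        "Comparable Corpus Creation",
    ],
    "Knowledge Interpretation": [
        "Keyword Extraction",
        "Topic Analysis",
        "Named Entity Recognition",
    ],
    "Knowledge Translation": [
        "Machine Translation",
        "Knowledge Network",
    ],
}


def submodules_of(module: str) -> List[str]:
    """Return a copy of the submodule names belonging to a module."""
    return list(KANT_LIBRARY[module])


def module_of(submodule: str) -> str:
    """Return the name of the module that contains the given submodule."""
    for module, submodules in KANT_LIBRARY.items():
        if submodule in submodules:
            return module
    raise KeyError(submodule)
```

For example, `module_of("Topic Analysis")` looks up the parent module of a submodule, and `submodules_of("Knowledge Translation")` lists the two submodules of the translation module.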

Application Suite
The application suite is the presentation layer of the system and is what the user interacts with. It is made up of one module: Knowledge Presentation. Currently, two of the most basic and useful applications have been completed: text summarization and digest creation. Text summarization helps the user quickly browse the news acquired by the system, acting as a knowledge browser for the system. It is also used by other presentation submodules, and we expect its use to grow as further applications are added. Digest creation filters the day's news down to the most important stories. This allows users to keep track of the major happenings without having to be glued to their televisions, radios, or computers. In the future, the digest creation system will also aid in tracking events and people.