Technology Today

2010 Issue 2

Raytheon Partners With Universities
for Knowledge Technologies
The intelligence community has cried out, and Raytheon has listened.

According to Lt. Gen. David A. Deptula, Air Force deputy chief of staff for intelligence, surveillance and reconnaissance, "We're going to find ourselves in the not too distant future swimming in sensors and drowning in data." 1

To address this issue, Raytheon has conducted significant research and matured techniques to work at higher orders of cognitive function in the progression from data to information to knowledge, as depicted in Figure 1 (actionable intelligence reference model). Part of that investment has focused at the knowledge level, where algorithms are developed to extract actionable information, or knowledge, from large seas of data and to tie together pieces of knowledge from different sources to increase their value.

Raytheon's investigation of the marketplace has found a lack of existing tools and techniques for manipulating knowledge, so the company has focused its research and development on that level and higher.

This article highlights three collaborative partnerships that Raytheon has with universities to address sharing knowledge, using knowledge tools alongside existing information tools, and merging knowledge from different sources.

Geographic Semantic Schema Matching

Integrating information has proven to be a difficult problem over the last few decades. Researchers at the University of Texas at Dallas, led by Dr. Latifur Khan and Dr. Bhavani Thuraisingham, have developed a method for integrating different geospatial resources that is applicable to combining information from, for example, Google Maps™ and MapQuest™ tools. The method, called GSim, is a two-part process, and it is intended to ultimately work with little or no human intervention.

The first part of GSim compares data between systems using details about their geographic information. As an example of how many geographic locations share the same name, the instance value "Victoria" (depicted in Figure 2) may be a city, a county, a lake or various other features; even within a single geographic source, an identifier like Victoria or Clinton may appear many times, causing great difficulty in matching across multiple sources. After comparing enough details between the two geographic sources, the approach develops statistics indicating which feature types (e.g., city, county, road) most closely match across the data sources.
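
To make the instance-comparison idea concrete, the sketch below (a simplification with made-up feature types, place names and scoring, not the GSim implementation) scores every cross-source pair of feature types by how many instance names they share:

    # Illustrative sketch of instance-level feature-type matching (not the GSim code).
    # Each source maps a feature type to the set of place names recorded under it.
    source_a = {
        "city":   {"Victoria", "Clinton", "Austin"},
        "county": {"Victoria", "Travis"},
    }
    source_b = {
        "populated_place": {"Victoria", "Clinton", "Dallas"},
        "admin_area":      {"Travis", "Victoria"},
    }

    def overlap_score(names_a, names_b):
        """Jaccard overlap of the instance names under two feature types."""
        return len(names_a & names_b) / len(names_a | names_b)

    # Build a score for every cross-source pair of feature types.
    scores = {
        (ta, tb): overlap_score(na, nb)
        for ta, na in source_a.items()
        for tb, nb in source_b.items()
    }

    # Keep the most promising candidate matches for the second GSim stage.
    candidates = [pair for pair, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0.3]
    print(candidates)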

At this point, the set of possible matches is still too large and ambiguous. The second part of the algorithm reduces the potential matches by confirming which feature types have similar meanings, either by checking how well the two features align geographically or by estimating how alike the feature names are using a Google distance calculation, which issues standard Google searches for the feature names and determines how often the potentially matching terms from the two sources appear on the same pages.
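
The page co-occurrence measure described above is in the spirit of the normalized Google distance; a minimal sketch follows, where the hit counts are hypothetical stand-ins for numbers returned by a search engine:

    import math

    def normalized_google_distance(f_x, f_y, f_xy, total_pages):
        """Normalized Google distance computed from search hit counts:
        smaller values mean the two terms co-occur more often on the web.
        f_x, f_y: pages containing each term; f_xy: pages containing both."""
        log_fx, log_fy, log_fxy = math.log(f_x), math.log(f_y), math.log(f_xy)
        return (max(log_fx, log_fy) - log_fxy) / (math.log(total_pages) - min(log_fx, log_fy))

    # Hypothetical hit counts for two candidate feature names.
    print(normalized_google_distance(f_x=5_000_000, f_y=3_000_000,
                                     f_xy=800_000, total_pages=25_000_000_000))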

The GSim algorithm was compared with a method for semantic similarity measurements that uses substrings of length 2, known as 2-grams. The results over two distinct sets of geographic databases showed that GSim performed 25 to 50 percent better in both precision and recall.
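
For reference, the 2-gram baseline scores string similarity from overlapping two-character substrings. The exact scoring used in the comparison is not described here; a Dice-style overlap over 2-gram sets, shown below purely as an illustration, is one common form:

    def two_grams(s):
        """All overlapping substrings of length 2 in a lowercased string."""
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    def bigram_similarity(a, b):
        """Dice coefficient over 2-gram sets: 1.0 for identical strings, 0.0 for disjoint ones."""
        ga, gb = two_grams(a), two_grams(b)
        return 2 * len(ga & gb) / (len(ga) + len(gb))

    print(bigram_similarity("Victoria", "Victoria County"))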

Bridging Knowledge and Information Technologies

As knowledge technologies grow in popularity, there is still a need to work with pre-existing tools and environments. RDF-to-database (R2D) allows knowledge engineers to use new storage approaches, specifically resource description framework (RDF), with existing relational database visualization and analytic tools like Crystal Reports® and Business Objects®.

R2D was devised at the University of Texas at Dallas under the guidance of Dr. Latifur Khan and Dr. Bhavani Thuraisingham. It addresses the problem by providing a bridge between the two approaches to storage.

As shown in Figure 3, the graph-oriented (i.e., links and nodes) structures of RDF are presented in relational database form to the existing tools. This is accomplished without converting any data to relational table form. Rather, all queries in relational table form (e.g., SQL) are converted on the fly into an RDF form (e.g., SPARQL), and then results are converted on the fly into the necessary relational table presentation.
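
A toy sketch of this bridging idea is shown below using the rdflib library; the schema, namespace and query rewriting are illustrative assumptions rather than R2D's actual translation rules. A relational-style request for city names is answered by rewriting it as SPARQL over the RDF graph and shaping the results back into rows:

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.f1, EX.name, Literal("Victoria")))
    g.add((EX.f1, EX.type, Literal("city")))
    g.add((EX.f2, EX.name, Literal("Clinton")))
    g.add((EX.f2, EX.type, Literal("county")))

    # Conceptually: SELECT name FROM Feature WHERE type = 'city'
    # is rewritten on the fly into an equivalent SPARQL query ...
    sparql = """
        SELECT ?name WHERE {
            ?f <http://example.org/name> ?name ;
               <http://example.org/type> "city" .
        }
    """

    # ... and the SPARQL results are presented back as relational rows.
    rows = [(str(r.name),) for r in g.query(sparql)]
    print(rows)   # [('Victoria',)]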

Figure 3. R2D converts semantic representations into relational database table hierarchies

The performance impact of R2D was measured to be a negligible addition to the knowledge store's query time, while enabling the user to leverage the resulting data tables for further analysis.

Random Forest Disambiguation

Determining which names in multiple datasets actually refer to the same person is very challenging and of high importance to the intelligence community. For example, when "John Smith" appears multiple times in a data set, how do we determine whether it always refers to the same person? Solving this problem involves using all the information available in each data source, such as address, job title, list of friends and correspondence. Dr. C. Lee Giles at the Pennsylvania State University has developed a method for this problem and deployed it as part of managing the scientific literature library CiteSeer, hosted by Penn State's College of Information Sciences and Technology.

The approach, named Random Forest, requires enough known truth samples to train it before being used, like other machine learning algorithms. The Random Forest approach uses decision-tree learning as part of the algorithm. The decision-tree approach takes a set of data and subdivides it using features, such as how closely related two names are based on additional attributes, so that the leaves of the tree represent whether or not two names are considered to represent the same person. The Random Forest algorithm modifies this approach by using a random selection of a subset of features as the splitting criterion at each node in the tree, instead of selecting optimally from the full feature set. Once the forest is built, as depicted in Figure 4 (a random forest of training sets), the algorithm simply counts the majority vote (i.e., match/no-match) of the trees in the forest.
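
The sketch below shows a pairwise match/no-match setup of this kind using scikit-learn's random forest; the features and training pairs are illustrative assumptions, not the CiteSeer model:

    from sklearn.ensemble import RandomForestClassifier

    # Each row describes a PAIR of name mentions: [name-string similarity,
    # number of shared coauthors, same affiliation?]; the label is 1 if the
    # pair is known truth for "same person", 0 otherwise.
    X_train = [
        [0.95, 3, 1],
        [0.90, 0, 0],
        [0.40, 0, 0],
        [0.98, 5, 1],
    ]
    y_train = [1, 0, 0, 1]

    # max_features="sqrt" gives the random-subset-of-features splitting rule;
    # each of the 100 trees votes match / no-match and the majority wins.
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    forest.fit(X_train, y_train)

    # Score a new pair of "John Smith" mentions.
    print(forest.predict([[0.92, 2, 1]]))   # e.g., [1] -> judged to be the same person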

The Random Forest approach was compared to the popular support vector machine classifier using the Medline literature database maintained by the U.S. National Library of Medicine, which has more than 18 million articles. The results show Random Forest to be two to three percentage points better than support vector machines in accuracy, and always much faster to train. This represents significant improvement, especially when dealing with extremely large data sets.

Summary

Thus far, our contributions to the application and integration of knowledge technologies include:

  • Automated matching of geographic schema
  • Bridging of knowledge stores to existing database exploitation tools
  • Disambiguation of human identities across multiple sources

Raytheon will continue to mature these technologies to address the intelligence needs of our nation.

Some of Raytheon's 100+ University Partnerships and Projects
Full-Motion Video-Based Control Research Texas A&M
Formal Verification Methods for Security Verification University of Texas, Austin
Semantic-Based Knowledge Extraction UMass - Amherst
Folding MEMS IMU UC Irvine
Advanced RF Image Formation and ATR Ohio State University
Modeling and Prediction of Battery Lifetime in Wireless Sensor Nodes University of Arizona
Energy Security Microgrid Configuration Study New Mexico State University
Synthetic Aperture Radar Automatic Target Recognition (SAR ATR) Cal Poly SLO
Distributed Radar for Weather Detection - Waveforms University of Melbourne
Distributed Radar for Weather Detection - Testbed University of Adelaide
Rapid Grinding and Polishing of SiC and Glass Ceramic Substrates University of Arizona
Novel Passive and Active Mid-IR Fibers for IRCM Applications Clemson University
Low Loss, High Strength Fibers from Improved Chalcogenide Glasses Clemson University
AlGaN/GaN Nanowire Transistors for Low Noise and W-band Applications MIT
Radar Signal Processing Cal Poly Pomona
Silicon Compatible Processing of III-V Devices University of Glasgow
3-D Modeling of Semi-Guiding Fiber University of Rochester
Enhancing Modeling and Simulation Reuse Old Dominion University
Low Defect Density Substrate Technology for Heterogeneous Integration of III-V Devices and Si CMOS MIT
Increasing the Self-Focusing Threshold in High-Peak-Power Fiber Lasers Cornell University
Development of Titanium Foil Reinforced High Temperature Composite GS Fuselage UCLA
Meta and Nano Materials Research UMass-Lowell
Partnership for Cyber Policy Research Georgetown University
High Mechanical Performance and Electromagnetic Interference (EMI) Shielded Multifunctional Composites Florida State University
MBE-Grown, IV-VI Nano-Based, Ultra Thermoelectric Coolers University of Oklahoma
Electrowetting Display Research University of Cincinnati


1 Magnuson, Stew. "Military 'Swimming In Sensors and Drowning in Data.'" National Defense: January 2010.

Authors: Steven Seida; BJ Simpson
Contributors: Jeffrey Partyka, Sunitha Sririam, Dr. Latifur Khan, Dr. Bhavani Thuraisingham, Dr. C. Lee Giles
