Raytheon Partners With Universities for Knowledge Technologies
The intelligence community has cried out, and Raytheon has listened.
According to Lt. Gen. David A. Deptula, Air Force deputy chief of staff for intelligence, surveillance and reconnaissance, "We're going to find ourselves in the not too distant future swimming in sensors and drowning in data." 1
To address this issue, Raytheon has conducted significant research and matured techniques that work at higher orders of cognitive function in the progression from data to information to knowledge, as depicted in Figure 1. Part of that investment has focused on the knowledge level, where algorithms are developed to extract actionable information, or knowledge, from large seas of data and to tie together pieces of knowledge from different sources to increase their value.
Raytheon's investigation of the marketplace has found a lack of existing tools and techniques for manipulating knowledge, so the company has focused its research and development on that level and higher.
This article highlights three collaborative partnerships that Raytheon has with universities to address sharing knowledge, using knowledge tools alongside existing information tools, and merging knowledge from different sources.
Geographic Semantic Schema Matching
Integrating information has proven to be a difficult problem over the last few decades. Researchers at the University of Texas at Dallas, led by Dr. Latifur Khan and Dr. Bhavani Thuraisingham, have developed a method for integrating different geospatial resources that is applicable to combining information from, for example, Google Maps™ and MapQuest™ tools. The method, called GSim, is a two-part process, and it is intended to ultimately work with little or no human intervention.
The first part of GSim compares data between systems using details about their geographic information. Many geographic locations share the same name: the instance value "Victoria" (depicted in Figure 2), for example, may be a city, a county, a lake or various other features. After comparing enough details between the two geographic sources, the approach develops statistics indicating which feature types (e.g., city, county, road) most closely match between the data sources.
At this point, the set of possible matches is still too large and ambiguous. The second part of the algorithm narrows the candidates by confirming which feature types have similar meanings. It does so in one of two ways: by checking how well the two features align on the planet, or by estimating how alike the feature names are using a Google distance calculator. The latter uses standard Google search to find how frequently each feature name occurs and how often the potentially matching terms from the two sources appear on the same pages.
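The page co-occurrence measure described above resembles what is commonly called the Normalized Google Distance (NGD). Assuming that is the measure meant here, the following minimal Python sketch shows how such a distance can be computed from page counts. The function name, the counts and the index size are all made up for illustration; no real search results are used, and this is not the GSim implementation.

```python
import math

def normalized_google_distance(fx: int, fy: int, fxy: int, n_pages: int) -> float:
    """NGD from page counts.

    fx, fy  : number of pages containing each term on its own
    fxy     : number of pages containing both terms
    n_pages : total number of indexed pages (an assumed constant)
    """
    if fx == 0 or fy == 0 or fxy == 0:
        return float("inf")  # the terms never co-occur: treat as maximally distant
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n_pages) - min(log_fx, log_fy)
    if denominator == 0:
        return 0.0  # both terms appear on every page
    return (max(log_fx, log_fy) - log_fxy) / denominator

# Hypothetical counts: two terms that usually share pages, and two that rarely do.
N = 1_000_000
close = normalized_google_distance(fx=1000, fy=1200, fxy=800, n_pages=N)
far = normalized_google_distance(fx=1000, fy=1200, fxy=2, n_pages=N)
print(f"often together: {close:.3f}, rarely together: {far:.3f}")
```

A smaller value means the two terms tend to appear on the same pages, which is taken as evidence that the feature names have similar meanings.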
The GSim algorithm was compared with a method for semantic similarity measurements that uses substrings of length 2, known as 2-grams. The results over two distinct sets of geographic databases showed that GSim performed 25 to 50 percent better in both precision and recall.
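For readers unfamiliar with 2-grams, here is a minimal Python sketch of a 2-gram name similarity. The article does not say which overlap coefficient the baseline method used; the Dice coefficient below is an assumption, and the function names are hypothetical.

```python
def bigrams(name: str) -> set:
    """Return the set of length-2 substrings (2-grams) of a name."""
    s = name.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over 2-gram sets: 1.0 = same 2-grams, 0.0 = none shared."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        # Names shorter than two characters have no 2-grams; compare them directly.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(bigram_similarity("road", "roadway"))  # shares "ro", "oa", "ad"
print(bigram_similarity("road", "street"))   # no 2-grams in common
```

The second call illustrates the limitation of a purely string-based measure: "road" and "street" are close in meaning but share no 2-grams, so they score zero.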
Bridging Knowledge and Information Technologies
As knowledge technologies grow in popularity, there is still a need to work with pre-existing tools and environments. RDF-to-database (R2D) allows knowledge engineers to use new storage approaches, specifically resource description framework (RDF), with existing relational database visualization and analytic tools like Crystal Reports® and Business Objects®.
R2D was devised at the University of Texas at Dallas under the guidance of Dr. Latifur Khan and Dr. Bhavani Thuraisingham. It addresses the problem by providing a bridge between the two approaches to storage.
As shown in Figure 3, the graph-oriented (i.e., links and nodes) structures of RDF are presented in relational database form to the existing tools. This is accomplished without converting any data to relational table form. Rather, all queries in relational table form (e.g., SQL) are converted on the fly into an RDF form (e.g., SPARQL), and then results are converted on the fly into the necessary relational table presentation.
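The on-the-fly translation can be illustrated with a toy Python sketch. It is not R2D's actual mapping: it assumes, purely for illustration, that a table name maps to an RDF class, that each column maps to a predicate, and that only a single `SELECT ... FROM ... WHERE col = 'value'` form is supported. The namespace, function names and sample data are hypothetical.

```python
import re

# Hypothetical namespace, for illustration only.
NS = "http://example.org/"

_SQL = re.compile(
    r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)"
    r"(?:\s+WHERE\s+(\w+)\s*=\s*'([^']*)')?\s*;?\s*",
    flags=re.IGNORECASE,
)

def sql_to_sparql(sql: str) -> str:
    """Translate a very simple SQL SELECT into an equivalent SPARQL query.

    Assumed mapping: table -> rdf:type class, column -> predicate.
    """
    m = _SQL.fullmatch(sql)
    if m is None:
        raise ValueError("unsupported SQL for this sketch")
    cols = [c.strip() for c in m.group(1).split(",")]
    table, where_col, where_val = m.group(2), m.group(3), m.group(4)
    lines = [
        f"PREFIX ex: <{NS}>",
        "SELECT " + " ".join(f"?{c}" for c in cols),
        "WHERE {",
        f"  ?row a ex:{table} .",
    ]
    lines += [f"  ?row ex:{c} ?{c} ." for c in cols]
    if where_col is not None:
        lines.append(f'  ?row ex:{where_col} "{where_val}" .')
    lines.append("}")
    return "\n".join(lines)

def bindings_to_rows(cols, bindings):
    """Turn SPARQL-style result bindings (one dict per solution) into table rows."""
    return [tuple(b[c] for c in cols) for b in bindings]

query = sql_to_sparql("SELECT name, title FROM Person WHERE city = 'Dallas'")
print(query)

# Bindings as a SPARQL engine might return them (made-up data).
rows = bindings_to_rows(["name", "title"], [{"name": "A. Example", "title": "Analyst"}])
print(rows)
```

The point of the design is that the reporting tool only ever sees SQL going in and rows coming out, while the data itself stays in RDF.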
R2D was measured to add negligibly to the query time of the knowledge store, while enabling the user to leverage the resulting data table for further analysis.
Random Forest Disambiguation
Determining which names in multiple data sets actually refer to the same person is very challenging and of high importance to the intelligence community. For example, when "John Smith" appears multiple times in a data set, how do we determine whether this always refers to the same person? Solving this problem involves using all the information available in each data source, such as address, job title, list of friends and correspondence. Dr. C. Lee Giles at the Pennsylvania State University has developed a method for this problem and deployed it as part of managing the scientific literature library CiteSeer, hosted by Penn State's College of Information Sciences and Technology.
Like other machine learning algorithms, the approach, named Random Forest, requires enough known truth samples to train it before it can be used. The Random Forest approach uses decision-tree learning as part of the algorithm. The decision-tree approach takes a set of data and subdivides the data using features such as how closely related two names are, based on additional attributes, so that leaves of the tree represent whether or not two names are considered to represent the same person. The Random Forest algorithm modifies this approach by using a random selection of a subset of features for the splitting criteria at each node in the tree, instead of optimally selecting from the full feature set. Once the forest is built, as depicted in Figure 4, it classifies a pair of names by taking the majority vote (i.e., match/no-match) of the trees in the forest.
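The two ideas described above, a random subset of features at each split and a majority vote across trees, can be sketched in a few lines of Python. This is a simplified illustration, not the Penn State implementation. The three features (name similarity, shared coauthors, same affiliation) and the training pairs are made up, and the Gini impurity criterion and bootstrap sampling of training rows are standard random forest ingredients assumed here rather than taken from the article.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, n_try, rng, depth=0, max_depth=5):
    """Grow one tree; a leaf is a label, an inner node is (feature, threshold, left, right)."""
    if len(set(labels)) == 1 or depth == max_depth:
        return majority(labels)
    best = None
    # Random forest twist: only a random subset of features is considered at this node.
    for f in rng.sample(range(len(rows[0])), n_try):
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # no usable split among the sampled features
        return majority(labels)
    _, f, t, left, right = best
    grow = lambda idx: build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                  n_try, rng, depth + 1, max_depth)
    return (f, t, grow(left), grow(right))

def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def build_forest(rows, labels, n_trees=25, n_try=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample (assumption)
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], n_try, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote of all trees: 1 = same person, 0 = different people."""
    return majority([predict_tree(tree, row) for tree in forest])

# Made-up training pairs: [name_similarity, shared_coauthors, same_affiliation].
X = [[0.95, 3, 1], [0.90, 2, 1], [0.85, 4, 0], [0.92, 1, 1], [0.88, 2, 1], [0.97, 5, 1],
     [0.40, 0, 0], [0.55, 0, 0], [0.30, 0, 1], [0.60, 1, 0], [0.45, 0, 0], [0.35, 0, 0]]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

forest = build_forest(X, y)
likely_match = predict_forest(forest, [0.93, 3, 1])
likely_different = predict_forest(forest, [0.38, 0, 0])
print(likely_match, likely_different)
```

Because each node examines only `n_try` features rather than all of them, every tree is cheaper to grow, and the trees differ enough from one another for the vote to be informative.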
The Random Forest approach was compared to the popular support vector machine classifier using the Medline literature database maintained by the U.S. National Library of Medicine, which has more than 18 million articles. The results show Random Forest to be two to three percentage points better than support vector machines in accuracy, and always much faster to train. This represents a significant improvement, especially when dealing with extremely large data sets.
Thus far, our contributions to the application and integration of knowledge technologies include:
- Automated matching of geographic schema
- Bridging of knowledge stores to existing database exploitation tools
- Disambiguation of human identities across multiple sources
Raytheon will continue to mature these technologies to address the intelligence needs of our nation.
Some of Raytheon's 100+ University Partnerships and Projects
|Full-Motion Video-Based Control Research||Texas A&M|
|Formal Verification Methods for Security Verification||University of Texas, Austin|
|Semantic-Based Knowledge Extraction||UMass - Amherst|
|Folding MEMS IMU||UC Irvine|
|Advanced RF Image Formation and ATR||Ohio State University|
|Modeling and Prediction of Battery Lifetime in Wireless Sensor Nodes||University of Arizona|
|Energy Security Microgrid Configuration Study||New Mexico State University|
|Synthetic Aperture Radar Automatic Target Recognition (SAR ATR)||Cal Poly SLO|
|Distributed Radar for Weather Detection - Waveforms||University of Melbourne|
|Distributed Radar for Weather Detection - Testbed||University of Adelaide|
|Rapid Grinding and Polishing of SiC and Glass Ceramic Substrates||University of Arizona|
|Novel Passive and Active Mid-IR Fibers for IRCM Applications||Clemson University|
|Low Loss, High Strength Fibers from Improved Chalcogenide Glasses||Clemson University|
|AlGaN/GaN Nanowire Transistors for Low Noise and W-band Applications||MIT|
|Radar Signal Processing||Cal Poly Pomona|
|Silicon Compatible Processing of III-V Devices||University of Glasgow|
|3-D Modeling of Semi-Guiding Fiber||University of Rochester|
|Enhancing Modeling and Simulation Reuse||Old Dominion University|
|Low Defect Density Substrate Technology for Heterogeneous Integration of III-V Devices and Si CMOS
|Increasing the Self-Focusing Threshold in High-Peak-Power Fiber Lasers||Cornell University|
|Development of Titanium Foil Reinforced High Temperature Composite GS Fuselage||UCLA|
|Meta and Nano Materials Research||UMass-Lowell|
|Partnership for Cyber Policy Research||Georgetown University|
|High Mechanical Performance and Electromagnetic Interference (EMI) Shielded Multifunctional Composites||Florida State University|
|MBE-Grown, IV-VI Nano-Based, Ultra Thermoelectric Coolers||University of Oklahoma|
|Electrowetting Display Research||University of Cincinnati|
1 Magnuson, Stew. "Military 'Swimming In Sensors and Drowning in Data.'" National Defense: January 2010.
Authors: Steven Seida; BJ Simpson
Contributors: Jeffrey Partyka, Dr. Latifur Khan, Dr. Bhavani Thuraisingham, Dr. C. Lee Giles