Meet Catalyst: IARPA’s Entity and Relationship Extraction Program

Sourced from: Public Intelligence

April 4, 2012 in Featured

A slide from a presentation by the Chief Information Officer of the Office of the Director of National Intelligence depicts examples of “entity extraction” and “relationship extraction” from a piece of intelligence.


The Office of the Director of National Intelligence (ODNI) is building a computer system capable of automatically analyzing the massive quantities of data gathered across the entire intelligence community and extracting information on specific entities and their relationships to one another. The system, called Catalyst, is part of a larger effort by ODNI to create software and computer systems capable of knowledge management, entity extraction and semantic integration, enabling greater analysis and understanding of complex, multi-source intelligence throughout the government.

The intelligence community has been working for years to develop software and analytical frameworks capable of large-scale data analysis and extraction. Technological advances have now made it possible for spy agencies not just to capture the incredible amount of data flowing through public and private networks around the world, but to parse, contextualize and understand the intelligence that is being gathered. Automated software programs are now capable of integrating data into semantic systems, providing context and meaning to names, dates, photographs and practically any kind of data you can imagine.

Many agencies within the intelligence community have already created systems to do this sort of semantic integration. The Office of Naval Intelligence uses a system called AETHER “to correlate seemingly disparate entities and relationships, to identify networks of interest, and to detect patterns.” The NSA runs a program called APSTARS that provides “semantic integration of data from multiple sources in support of intelligence processing.” The CIA has a program called Quantum Leap that is designed to “find non-obvious linkages, new connections, and new information” from within a dataset. ODNI itself has initiated several similar programs, including BLACKBOOK and the Large Scale Internet Exploitation Project (LSIE).

Catalyst is an attempt to create a unified system capable of automatically extracting complex information on entities as well as the relationships between them while contextualizing this information within semantic systems.  According to its specifications, Catalyst will be capable of creating detailed histories of people, places and things while mapping the interrelations that detail those entities’ interactions with the world around them. A study conducted by IARPA states that Catalyst is designed to incorporate data from across the entire intelligence community, creating a centralized repository of available information gathered from all agencies:

Many IC organizations have recognized this problem and have programs to extract information from the resources, store it in an appropriate form, integrate the information on each person, organization, place, event, etc. in one data structure, and provide query and analysis tools that run over this data. Whereas this is a significant step forward for an organization, no organization is looking at integration across the entire IC. The DNI has the charter to integrate information from all organizations across the IC; this is what Catalyst is designed to do with entity data. The promise of Catalyst is to provide, within the security constraints on the data, access to “all that is known” within the IC on a person, organization, place, event, or other entity. Not what the CIA knows, then what DIA knows, and then what NSA knows, etc., and put the burden on the analyst to pull it all together, but have Catalyst pull it all together so that analysts can see what CIA, DIA, NSA, etc. all know at once. The value to the intelligence mission, should Catalyst succeed, is nothing less than a significant improvement in the analysis capability of the entire IC, to the benefit of the national security of the US.

To fully grasp the capabilities of such a system, it is important to understand the concepts of “semantic integration” and “entity extraction” that Catalyst will perform.  Using an example described in the IARPA study, we will follow data through the stages of processing in a Catalyst system:

For example, some free text may include “… Joe Smith is a 6’11″ basketball player who plays for the Los Angeles Lakers…” from which the string “Joe Smith” may be delineated as an entity of class Athlete (a subclass of People) having property Name with value JoeSmith and Height with value 6’11″ (more on this example below). Note that it is important to distinguish between an entity and the name of the entity, for an entity can have multiple names (JoeSmith, JosephSmith, JosephQSmith, etc.).
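The distinction the study draws between an entity and its names can be illustrated with a minimal Python sketch. This is not taken from the IARPA study and says nothing about how Catalyst is actually implemented; the Entity class, the SUBCLASS_OF table and the identifier "e1" are hypothetical names chosen for illustration, and the sketch shows only how an already-extracted entity might be stored, not the extraction step itself.

```python
from dataclasses import dataclass, field

# Hypothetical class hierarchy: Athlete is a subclass of People,
# as in the study's example.
SUBCLASS_OF = {"Athlete": "People"}


@dataclass
class Entity:
    entity_id: str                                   # stable identity, independent of any name
    entity_class: str                                # e.g. "Athlete"
    names: set = field(default_factory=set)          # one entity, many names
    properties: dict = field(default_factory=dict)   # other datatype properties

    def is_a(self, cls: str) -> bool:
        """Return True if this entity's class is cls or a subclass of it."""
        current = self.entity_class
        while current is not None:
            if current == cls:
                return True
            current = SUBCLASS_OF.get(current)
        return False


# The entity delineated from the free text "Joe Smith is a 6'11" basketball player ..."
joe = Entity("e1", "Athlete", names={"JoeSmith"}, properties={"Height": "6'11\""})

# Further names attach to the same entity rather than creating new ones.
joe.names.add("JosephSmith")
joe.names.add("JosephQSmith")
```

Because the identity lives in entity_id rather than in any one name, a later mention of "Joseph Q. Smith" could be merged into the same record instead of producing a second person.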

Once entities and their associated relationship values are determined, the information is then integrated into a knowledge base to produce a semantic graph:

To continue the example, one entry in the knowledge base is the entity of class Athlete with (datatype property) Name having value JoeSmith, another is the entity of class SportsFranchise with Name having value Lakers, and another is an entity of class City having value LosAngeles. If each of these is viewed as a node in a graph, then an edge connecting the node (entity) with Name JoeSmith to the node with Name Lakers is named MemberOf and the edge connecting the node with Name Lakers to the node with Name LosAngeles is named LocatedIn. Such edges, corresponding to relationships (object properties) and have a direction; for example, JoeSmith is a MemberOf the Lakers, but the Lakers are not a MemberOf JoeSmith (there may be an inverse relationship, such as HasMember, that is between the Lakers and JoeSmith).
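The directed edges described in this passage can be sketched as (subject, relationship, object) triples. The following Python is an illustration only, under the assumption that a plain set of tuples is an adequate stand-in for a knowledge base; the INVERSE table and the with_inverses helper are hypothetical and do not come from the study.

```python
# Directed edges from the study's example, stored as
# (subject, relationship, object) triples.
triples = {
    ("JoeSmith", "MemberOf", "Lakers"),
    ("Lakers", "LocatedIn", "LosAngeles"),
}

# Hypothetical table of inverse relationships. The study mentions
# HasMember as a possible inverse of MemberOf.
INVERSE = {"MemberOf": "HasMember"}


def with_inverses(edges, inverse):
    """Return the edges plus the inverse edge for every relationship that has one."""
    result = set(edges)
    for subject, relationship, obj in edges:
        if relationship in inverse:
            result.add((obj, inverse[relationship], subject))
    return result


graph = with_inverses(triples, INVERSE)

# Direction matters: the reversed MemberOf edge is never created.
joe_is_member = ("JoeSmith", "MemberOf", "Lakers") in graph
lakers_is_member = ("Lakers", "MemberOf", "JoeSmith") in graph
```

Here joe_is_member is true and lakers_is_member is false, while the graph does gain the edge from Lakers to JoeSmith under the separate name HasMember.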

Data that has been extracted and integrated can then produce patterns that determine unknown relations between an entity and other entities that may be of concern to a particular intelligence agency:

Another simple pattern could be: JoeSmith Owns Automobile, or Person Owns an instance of the class Automobile with Manufacturer Lexus and LicensePlate VA-123456 or even JoeSmith has-unknown-relationship-with an instance of the class Automobile with Manufacturer Lexus and LicensePlate VA-123456. In these last three examples, one of the entities or the relationship is uninstantiated. Note that JoeSmith Owns an instance of the class Automobile with Manufacturer Lexus and LicensePlate VA-123456 is not a pattern, for it has no uninstantiated entities or relationships. A more complex pattern could be: Person Owns Automobile ParticipatedIn Crime HasUnknownRelationshipWith Organization HasAffiliationWith TerroristOrganization. Any one or more of the entities and the has-unknown-relationship-with relationship (but not all) can be instantiated and it would still be a pattern, such as JoeSmith Owns Automobile ParticipatedIn Crime PerpetratedBy Organization HasAffiliationWith HAMAS.
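The idea of a pattern with uninstantiated parts can be made concrete with a small Python sketch covering the single-edge case. It is an assumption-laden illustration, not the study's method: the entity identifier "Lexus-VA-123456", the ENTITY_CLASS table, and the use of None for an unknown relationship are all hypothetical conventions introduced here.

```python
# Hypothetical knowledge base: each entity's class, and one known fact.
ENTITY_CLASS = {"JoeSmith": "Person", "Lexus-VA-123456": "Automobile"}
CLASSES = set(ENTITY_CLASS.values())
FACTS = [("JoeSmith", "Owns", "Lexus-VA-123456")]

ANY = None  # stands for an uninstantiated (unknown) relationship


def term_matches(term, entity):
    """A term matches an entity by exact name or by the entity's class."""
    return term == entity or term == ENTITY_CLASS.get(entity)


def match(pattern, facts):
    """Return every fact that satisfies a (subject, relationship, object) pattern."""
    subj, rel, obj = pattern
    return [
        (s, r, o)
        for (s, r, o) in facts
        if term_matches(subj, s) and (rel is ANY or rel == r) and term_matches(obj, o)
    ]


def is_pattern(triple):
    """A triple is a pattern only if at least one part is uninstantiated."""
    subj, rel, obj = triple
    return rel is ANY or subj in CLASSES or obj in CLASSES


# The three simple patterns from the passage, each with one uninstantiated part.
owns_some_car = match(("JoeSmith", "Owns", "Automobile"), FACTS)
someone_owns_it = match(("Person", "Owns", "Lexus-VA-123456"), FACTS)
unknown_link = match(("JoeSmith", ANY, "Lexus-VA-123456"), FACTS)
```

All three queries return the one stored fact, whereas is_pattern(("JoeSmith", "Owns", "Lexus-VA-123456")) is false because nothing in that triple is left open. The longer chained patterns in the passage would require matching several edges in sequence, which this sketch does not attempt.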

 

While this example provides only a limited view of Catalyst’s functionality, it nonetheless helps to demonstrate the potential capabilities of the system. Far more detailed explanations of the system, as well as a useful overview of similar government systems across the intelligence community, are provided in IARPA’s 122-page study.

IARPA-Catalyst


2 Responses to “Meet Catalyst: IARPA’s Entity and Relationship Extraction Program”

  1. A couple of comments:

    – the technology behind Catalyst is the same technology driving the architecture of the next version of the web (it goes by many names, including the Semantic Web, Web 3.0, the Web-of-Data and Linked-Data). The architecture can be read about in detail here: http://www.w3.org/standards/semanticweb/ It is the brainchild of Tim Berners-Lee, the inventor of the World Wide Web; it is a brilliant concept and deserves wide support. It is close to mainstream, with IBM having just released DB2 version 10, which supports some capabilities (RDF).

    – it is not proprietary, nor unique to the intelligence community. In future the same capabilities described in this article will be in use by everyone. Just as we now use URLs and browsers, we will in future use URIs and linked-data enabled browsers. Further, the benefits will be so significant that anyone who doesn’t use it will seem out of touch, like those who do not now use the web.

    – finally, use of this technology by the intelligence industry may be REALLY GOOD NEWS. While there is a chance for error and misuse, it is no more to be feared than internet banking. And remember the fear-mongering that preceded that? Yes, there have been a few problems, but I personally think internet banking is one of the best things the internet has enabled.

    So, why might the intelligence community using SemWeb technology be so great? Because it is based on provable, auditable, unsentimental, arguable, defensible computer-based inference that will likely improve information quality dramatically, yielding results that are, if not absolute fact, at least much closer to fact than is currently the case. Removing the black-art-based, unauditable activities the intelligence community currently uses as its method of operation and replacing them with fact-based operations must surely be a positive step.

    Further, it will be automated and immediate as it will be done by computers rather than back-room analysts. Surely this will save money and improve service, just like internet banking has.

    And finally, this technology is available to everyone and equally can monitor everyone. We too can build repositories. It too can be used to monitor politicians and police, criminals and child-pornographers, doctors, lawyers, billionaires and paupers and as it is fact-based it can be used to assess who in each of these categories is truly what they appear to be.

    Wikileaks is about transparency and fact-based analysis and decision making.

    The SemWeb technology behind Catalyst enhances transparency and fact-based analysis and decision making.

    Both sides of this argument, the “big brother has control and wants more” view outlined in this article and the “we are building this technology into Web 3.0 to improve the ease with which we can gather and share facts transparently” view, should be heard.

    The Web is ours, not Big Brother’s, and while Big Brother may want to own it exclusively, the repository they build will one day also be ours. I wonder what that will tell us about our “rulers”.

    A useful sample application of the SemWeb technology might be to run all the WikiLeaks documents (cables, war logs, intelligence emails, everything) through a SemWeb engine and see what nuggets might pop out through computer inference that would be much too onerous for a human to undertake. I would bet that there is still a lot of gold in those documents and that linking and cross-referencing all of them would find some truly amazing “coincidental” relationships. Add to that a few banking documents, continue to enhance the totality with every new leak, and one soon sees how this technology will work in our favour.

    The emperor’s clothes are getting more transparent every day.

    @GregLBean

