In this usage scenario we use CZSaw to solve the Mini-Challenge 1 of VAST 2010. The original problem description can be found here.
This challenge comprises 103 documents, which come from different sources and describe different countries, regions, and people. Within the encompassing story of illegal firearms dealing, there are also several sub-threads. The documents contain many errors and inconsistencies, for example, misspelled names in surveillance reports. It is almost impossible for an analyst to keep track of the whole scenario through reading alone, even with enough time to read all the documents.
CZSaw helps analysts solve large scale problems via flexible data views that provide overviews and details on demand. In addition, CZSaw provides process views to manage the complex analysis process itself.
First, CZSaw's data views allow visualization and manipulation of entities, documents, and relations for use in the sense-making process. These visualizations aid in selective reading of documents to make connections between disparate facts. The Hybrid View is an enhanced graph visualization of entities (nodes) and relations (edges) where the nodes can be visualized with a variety of techniques. The Semantic Zoom View (SZV) examines documents at several levels of detail (overview, document's entities, and detailed text). The Document View allows the analyst to read documents and scan their contained entities.
Second, CZSaw provides an interaction model and history mechanism to support the analysis process. User interactions are recorded and translated into a script language at the task level. Analysts can then replay or reuse their analysis steps to help them understand, explore, and reference their analysis process. CZSaw also creates a model of the analysis process in the form of a dependency graph through which changes can be propagated. Driven by this dependency propagation mechanism, data views automatically update themselves to reflect changes in the data, such as modifications of entities. Finally, CZSaw gives users computational power for data query and management: its functions include managing display states and layout, querying and filtering entities and relations, and refining entities on the fly.
CZSaw relies on extracted entities, as do many similar systems. We therefore asked our colleagues in the SFU Natural Language Lab to run entity extraction algorithms on the original dataset and generate an XML file containing the extracted entities (person, location, date, etc.).
To take advantage of differing analysis approaches and exercise different CZSaw strategies, we began this challenge in two separate teams. To read and analyze all 103 documents and match the desired solution format, both groups adopted a divide-and-conquer data organization strategy, grouping documents and entities by country and event. One team focussed their investigation within the Semantic Zoom View (SZV), grouping documents by country before drilling down to investigate further. The other team used the Hybrid View's node-link graphs and the Document View to read details. After this data exploration stage, we integrated the teams' findings and reported the outcome.
The SZV (Fig. 1) shows documents that can be semantically zoomed to three levels: overview, entities in the document, and detailed text. For the overview, it uses a clustering algorithm to lay out documents: the more entities two documents have in common, the closer they are placed, resulting in clusters of documents about the same set of entities. In this challenge, clusters contained many documents related to arms dealing in the same country. Our first team scanned and searched the clusters and created permanent groupings in the SZV, where each group displayed the documents for one country. Each group was analyzed separately to keep the information flow at a reasonable cognitive load. Like individual documents, groups in the SZV can be displayed as sets of zoomable documents, as combined sets of entities for brushing across the rest of the view, or as the full text of each document.
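The placement rule described above (more shared entities, smaller distance) can be illustrated with a minimal sketch. This is not CZSaw's clustering algorithm, which the text does not specify; the toy documents, the function names, and the `1 / (1 + shared)` distance mapping are all assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical toy data: document id -> set of extracted entity strings.
docs = {
    "doc1": {"Nicolai", "Russia", "textbooks"},
    "doc2": {"Nicolai", "Russia"},
    "doc3": {"Kenya", "drilling equipment"},
}


def shared_entity_counts(docs):
    """Return {(a, b): number of entities documents a and b have in common}."""
    counts = {}
    for a, b in combinations(sorted(docs), 2):
        counts[(a, b)] = len(docs[a] & docs[b])
    return counts


def target_distance(shared):
    """Assumed mapping from a shared-entity count to a desired layout distance.

    More shared entities give a smaller distance, so such documents would be
    placed closer together by whatever layout algorithm consumes this value.
    """
    return 1.0 / (1 + shared)


counts = shared_entity_counts(docs)
distances = {pair: target_distance(n) for pair, n in counts.items()}
```

In this toy data set, `doc1` and `doc2` share two entities and would be pulled together, while `doc3` shares none with either and would sit apart.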
CZSaw users can create sets of entities and relations, and visualize them with custom layouts in the Hybrid View, as our second team did. Examples include all entities in the data set, all entities of a given type (e.g. people), entities filtered by value (e.g. the name "Nicolai"), or entities related to previously defined sets. Fig. 2 shows all reports in a Hybrid View as a graph where two documents are connected if they contain one or more of the same entities. Fig. 3 shows the social network of people, and how they are connected by code words (e.g. textbooks, farming and drilling equipment), arms deals, and money transfers. To create this view, we listed all the people, searched for codes, bank accounts, and money that connect at least two people, and then displayed these connections. With such a selective display method, we can examine connections among people from different perspectives (e.g. country, date, and arms deals). The force-directed layout automatically pulls related entities and documents closer together, producing clusters. We were thus able to examine each cluster by reading a smaller number of documents.
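The document graph of Fig. 2 follows a simple rule: two documents are linked when they contain at least one common entity. A minimal sketch of that rule is given here, together with a connected-components pass that stands in for the visual clusters. The data and function names are hypothetical, and connected components are only a rough proxy for what a force-directed layout shows.

```python
from itertools import combinations


def build_document_graph(docs):
    """Link two documents when their entity sets intersect."""
    adjacency = {doc_id: set() for doc_id in docs}
    for a, b in combinations(docs, 2):
        if docs[a] & docs[b]:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def connected_components(adjacency):
    """Group documents that are reachable from one another (iterative DFS)."""
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


# Hypothetical reports; "Nikolai" is an invented misspelling for illustration.
reports = {
    "r1": {"Nicolai", "Moscow"},
    "r2": {"Nicolai"},
    "r3": {"Nikolai"},
}
graph = build_document_graph(reports)
clusters = connected_components(graph)
```

Here `r3` ends up as an isolated node because its only entity is spelled differently, which is the kind of symptom that points to an extraction or spelling problem.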
The left image in Fig. 2 shows many isolated documents. Scanning through them showed that machine entity extraction was neither fully accurate nor complete. CZSaw allows users to manage entities interactively, which enables them to correct errors and record hypotheses (for example, that two entities are the same) during the analysis process. Operations include:
The user can manually refine entities while reading a document or working within other views (e.g. similar nodes in the Hybrid View's entity network suggest candidate entity merges). CZSaw's dependency propagation mechanism instantly updates content and layout in views to reflect these changes, which may form new clusters or merge existing ones (Fig. 2).
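As an illustration of why merging entities can change the clusters, the following sketch applies an alias table to a toy set of documents. It is an assumption-laden example rather than CZSaw's refinement mechanism: the alias table, the misspelling "Nikolai", and the function name are invented for this purpose.

```python
def merge_entities(docs, aliases):
    """Return a new mapping in which every alias is replaced by its canonical name.

    docs:    document id -> set of entity strings
    aliases: variant spelling -> canonical spelling (the analyst's hypothesis)
    """
    return {
        doc_id: {aliases.get(entity, entity) for entity in entities}
        for doc_id, entities in docs.items()
    }


# Hypothetical reports; r3 uses a variant spelling of the same name.
reports = {
    "r1": {"Nicolai", "Moscow"},
    "r3": {"Nikolai"},
}

shared_before = reports["r1"] & reports["r3"]

# The analyst records the hypothesis that both spellings denote one person.
merged = merge_entities(reports, {"Nikolai": "Nicolai"})
shared_after = merged["r1"] & merged["r3"]
```

Before the merge the two reports have nothing in common; afterwards they share one entity, so a shared-entity graph would now connect them. The original mapping is left untouched, which makes it easy to discard a hypothesis that turns out to be wrong.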
By capturing the analyst's interactions, CZSaw creates a model of the analysis process in the form of a directed acyclic dependency graph capable of propagating changes. Nodes in the graph (variables in CZSaw) are results generated from user interactions. The user interacts with entity or relation variables in views to create the next step's results. Edges indicate dependency relationships among variables. Any content change to a variable triggers the propagation mechanism to update downstream variables and in turn update data views to reflect the change. The graph's root node represents the entire data set. Thus any change to the data set (such as an entity refinement operation) starts the propagation at the root node, potentially changing all analysis results and updating all data views. As entity refinement proceeds, the document network is transformed into a more understandable image (Fig. 2). The analyst can also reuse parts of the analysis process by assigning new data to one node in the middle of the graph.
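The propagation idea can be sketched with a small dependency graph in which each variable is recomputed from its inputs whenever something upstream changes. The class and function names below are hypothetical and do not reflect CZSaw's internal API; the sketch assumes the graph is acyclic, as stated above, and updates downstream variables in topological order so that each is recomputed once.

```python
class Variable:
    """A node in a toy dependency graph (hypothetical, not CZSaw's API)."""

    def __init__(self, name, value=None, compute=None, inputs=()):
        self.name = name
        self.compute = compute          # function of the input values, or None
        self.inputs = list(inputs)      # upstream variables
        self.dependents = []            # downstream variables
        self.value = value
        for parent in self.inputs:
            parent.dependents.append(self)
        if compute is not None:
            self.refresh()

    def refresh(self):
        """Recompute this variable from the current values of its inputs."""
        self.value = self.compute(*(p.value for p in self.inputs))

    def set(self, value):
        """Assign new content, then update every downstream variable once."""
        self.value = value
        for var in downstream_order(self):
            var.refresh()


def downstream_order(start):
    """Variables reachable from start, in topological order, excluding start."""
    order, visited = [], set()

    def visit(var):
        if id(var) in visited:
            return
        visited.add(id(var))
        for dependent in var.dependents:
            visit(dependent)
        order.append(var)

    visit(start)
    order.reverse()  # reverse DFS post-order is a topological order
    return order[1:]


# Root node: the whole (toy) data set. "Nikolai" is an invented misspelling.
dataset = Variable("dataset", value=[
    {"type": "person", "name": "Nicolai"},
    {"type": "person", "name": "Nikolai"},
    {"type": "location", "name": "Moscow"},
])
people = Variable(
    "people",
    compute=lambda ds: sorted({e["name"] for e in ds if e["type"] == "person"}),
    inputs=[dataset],
)
person_count = Variable("person_count", compute=len, inputs=[people])

# An entity refinement changes the root, and the change flows downstream.
refined = [dict(e, name="Nicolai") if e["name"] == "Nikolai" else e
           for e in dataset.value]
dataset.set(refined)
```

After the call to `dataset.set`, both `people` and `person_count` reflect the merged name without being touched directly. Calling `people.set(...)` with new data would similarly update only `person_count`, which corresponds to reusing part of the analysis by assigning new data to a node in the middle of the graph.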
Interactions during the analysis process are transformed into a text script. Replaying this script lets the analyst review the whole analysis process. Editing this script supports fine control of the analysis process, for example to quickly change parameter values used in interactions to get better results.
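To make the record, edit, and replay cycle concrete, here is a minimal sketch with an invented one-command script syntax. CZSaw's actual script language is not shown in this text, so the `filter` command, the `Session` class, and the `replay` function are assumptions for illustration only.

```python
import shlex


class Session:
    """Toy analysis session that logs each interaction as a script line."""

    def __init__(self, entities):
        self.entities = list(entities)
        self.variables = {}   # variable name -> result of an interaction
        self.script = []      # recorded text script, one command per line

    def filter_by_value(self, var_name, substring):
        """Keep entities containing substring, and record the step."""
        self.script.append(f"filter {var_name} {shlex.quote(substring)}")
        self.variables[var_name] = [e for e in self.entities if substring in e]
        return self.variables[var_name]


def replay(script, entities):
    """Re-run a recorded (possibly edited) script in a fresh session."""
    session = Session(entities)
    for line in script:
        command, var_name, argument = shlex.split(line)
        if command == "filter":
            session.filter_by_value(var_name, argument)
        else:
            raise ValueError(f"unknown command: {command}")
    return session


# Hypothetical entity list for illustration.
entities = ["Nicolai", "Nicolai K.", "Boris"]

original = Session(entities)
original.filter_by_value("hits", "Nicolai")

# Edit one parameter in the recorded script, then replay it.
edited_script = [line.replace("Nicolai", "Boris") for line in original.script]
replayed = replay(edited_script, entities)
```

Because the script is plain text, it can be handed to a colleague and replayed against their own copy of the data, or edited to try a different parameter value without repeating the interaction by hand.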
This script also facilitates collaborative analysis. For example, one student analyst recorded the steps to create document groups in the SZV. Fellow team members were then easily able to recreate the groups in their own instance of CZSaw by replaying the script.
CZSaw provides rich features for analyzing real-world documents via clustering and data cleaning. Interaction, data and visualization are tightly integrated by the underlying script and dependency graph.