Extraction of unstructured data to generate crime knowledge graphs: Past, Present & Future

An interdisciplinary research study combining generative AI, natural language processing (NLP), criminology, and graph databases. Does this sound like a Netflix "most watched" top-ranking show to you? It's even better! Watch the presentation and read on.

This work was presented at NODES 2023, a free 24-hour online conference for developers, data scientists, architects, and data analysts across the globe.


NODES 2023 - Whodunit: Getting Answers from the HK Legal Information Institute Text


Past

This was a research study done by DeepDive Labs in collaboration with Prof. Hava Dayan, Senior Lecturer, Faculty of Law, School of Criminology at the University of Haifa. The objective of the study was to retrieve information of interest from legal documents and then represent that information in a knowledge graph, in a query-enabled format. The study highlights an excellent legal application of knowledge graphs and paves the way for many more domain-specific applications of knowledge graphs across the world. The other unique aspect of this study was its interdisciplinary nature, combining generative AI, natural language processing (NLP), criminology, and graph databases.

The story unfolds prior to 2020, with the project involving extraction of information from the Hong Kong femicide legal verdicts publicly available at the Hong Kong Legal Information Institute. The criminological information of interest from the legal verdicts included:

  • Features of the crime, such as:

    • Date of the crime

    • Killing scene (location; was it a secluded location?)

    • Whether overkilling happened

  • Features of the victim, such as:

    • Name of the victim

    • Age of the victim

    • Marital status/gender/origin of the victim

    • Relationship of the victim with the killer

  • Features of the killer/defendant, such as:

    • Name of the killer

    • Age of the killer

    • Marital status/gender/origin of the killer

    • Any substance abuse/chronic illness/financial stress?

    • Whether the killer committed suicide

At that point, this information was structured into different NLP tasks, as shown below:

NLP Tasks:

  • Named Entity Recognition

    • Date of the crime

    • Killing scene (location; was it a secluded location?)

    • Name of the killer, name of the victim

    • Age of the killer, age of the victim

  • Question & Answering (extractive & abstractive)

    • Marital status/gender/origin of victim and killer

    • Any substance abuse/chronic illness/financial stress?

    • Relationship of the victim with the killer

  • Boolean QnA

    • Any substance abuse/chronic illness/financial stress?

    • Any prior criminal record for the killer?

    • Whether the victim was related to the killer

    • Whether the killer committed suicide

    • Whether overkilling happened

As the project progressed, we soon realized that these judicial verdicts were very complex documents that differed in length and structure. Some of the challenges we faced were:

  • Success in extracting information varied, because the legal documents also mention other stakeholders related to the crime, such as police officers and witnesses, who are not necessarily the victim or perpetrator, and the NER extracted these as well.

  • The data provided by Boolean Question & Answering was not consistent, due to the structure of the verdicts themselves.

  • There was not enough labelled data to train these multiple models at the time. Typical training would require thousands of records, while this study had only about 50.

Due to these roadblocks, the study was shelved for a while.


Present

With the vast progress made in large language models (LLMs) in 2023, we decided to revisit the project, now from a prompt-engineering perspective. All the information needed from a legal verdict document could now be requested in the form of questions and answers with context. Listed below are some of the significant learnings from converting the NLP tasks into prompts for GPT-3.5 Turbo:

  1. Straightforward prompts that mentioned the judicial context helped, e.g., "Look for killer or defendant information."

  2. For deductive-reasoning questions, prompts were updated to explicitly respond UNKNOWN, in addition to YES or NO, when the answer was not available in the judicial context.

  3. Furthermore, while extracting killer, crime, and victim information, additional details were extracted to verify and validate the Boolean answers. For example, a prompt to extract whether the killer had a chronic illness (as a Boolean yes/no) was also asked to extract key validation phrases as an "illness indicator," so the answer could be cross-checked against its supporting reason.

  4. When the judicial verdict was too long, outputs would be truncated or the request rejected outright. This led us to split the prompts; at times we created three separate prompts to extract information about the killer, the victim, and the crime.

  5. The output was extracted in JSON format. These outputs were verified against the labelled data through manual validation.
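As a rough illustration of these learnings (not the exact prompts used in the study), here is how one such prompt and its JSON output handling could be sketched in Python. The field names, wording, and the `needs_manual_review` flag are all hypothetical:

```python
import json

def build_killer_prompt(verdict_text: str) -> str:
    """Assemble a hypothetical extraction prompt: judicial context up front,
    an explicit UNKNOWN option, a validation phrase, and a JSON output format."""
    return (
        "You are reading a Hong Kong judicial verdict. "
        "Look for information about the killer or defendant only.\n"
        "Answer each Boolean question with YES, NO, or UNKNOWN if the answer "
        "is not available in the judicial context.\n"
        "Return JSON with the keys: name, age, chronic_illness (YES/NO/UNKNOWN), "
        "illness_indicator (a key phrase from the text supporting the answer).\n\n"
        f"Verdict:\n{verdict_text}"
    )

def parse_llm_output(raw: str) -> dict:
    """Parse the model's JSON reply and flag YES answers that arrived
    without a supporting validation phrase, for manual cross-checking."""
    record = json.loads(raw)
    if record.get("chronic_illness") == "YES" and not record.get("illness_indicator"):
        record["needs_manual_review"] = True
    return record
```

The cross-checking step mirrors learning 3 above: a Boolean answer is only trusted when the model can also point at the phrase in the verdict that supports it.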

The details were then loaded into a knowledge graph, which was a great way to visualize the relationships between the crime, the victim, and the suspect. A knowledge graph is a way of representing and organizing information about the world using a network structure. Some of the key advantages of knowledge graphs are:

  • Enhanced understanding: Knowledge graphs can help to uncover hidden relationships and patterns in data that might not be apparent in a standard database. 

  • Flexibility: Knowledge graphs are schema-free, which means that they can accommodate new data and changes to the data structure without requiring a complete overhaul of the database.

The data model used by the graph is shown here:
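As a rough sketch (not the study's actual loader), a record extracted as JSON could be merged into a Crime-Victim-Killer graph with a parameterized Cypher statement like the one below, built here as a Python string. The node labels, relationship types, and property names are our assumptions based on the entities described above:

```python
def record_to_cypher(record: dict) -> tuple[str, dict]:
    """Turn one extracted record into a parameterized Cypher MERGE statement
    plus its parameter map, ready to run against a graph database."""
    query = (
        "MERGE (c:Crime {date: $crime_date}) "
        "MERGE (v:Victim {name: $victim_name}) "
        "MERGE (k:Killer {name: $killer_name}) "
        "MERGE (v)-[:VICTIM_OF]->(c) "
        "MERGE (k)-[:COMMITTED]->(c) "
        "MERGE (k)-[r:RELATED_TO]->(v) "
        "SET r.relationship = $relationship"
    )
    params = {
        "crime_date": record["crime"]["date"],
        "victim_name": record["victim"]["name"],
        "killer_name": record["killer"]["name"],
        # UNKNOWN mirrors the prompt convention for missing answers
        "relationship": record.get("relationship", "UNKNOWN"),
    }
    return query, params
```

Using MERGE rather than CREATE keeps the load idempotent, so re-running the extraction over the same verdict does not duplicate nodes.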

The graph representation helped visualise the data and the relationships between the entities: the crime, the victim, and the killer. It was found that in many femicides the killer was someone within the family who had a relationship with the victim. Another example of a finding was that the knowledge graph made it possible to visually see patterns, such as whether conditions like "under the influence of a drug" contributed to the crime. Overall, this study was very valuable for investigative and criminal analysis, and for building on it, bottom-up, for tactical, strategic, and administrative analysis. At the highest level, it can inform policy making.
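As an illustration, a Cypher query of the kind that surfaces the family-member pattern mentioned above could look like this, expressed as a Python string; the labels, relationship property, and example values are hypothetical:

```python
# Hypothetical query: femicides where the killer was a family member of the victim.
FAMILY_KILLER_QUERY = """
MATCH (k:Killer)-[r:RELATED_TO]->(v:Victim)-[:VICTIM_OF]->(c:Crime)
WHERE r.relationship IN ['husband', 'father', 'brother', 'son']
RETURN v.name AS victim, k.name AS killer, r.relationship AS relationship,
       c.date AS crime_date
"""
```

Once the extracted records are in the graph, patterns like this become a single query rather than another pass through the documents.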

The impact of this research study was far-reaching and profound, with learnings for us including:

  1. Professor Dayan had initially spent three months reading through the legal documents and manually extracting this information, which GenAI was able to do quickly, given the right prompting technique.

  2. Further, storing the extracted information in a graph created a true knowledge graph, which gives stakeholders improved visualization and invites them to explore the data.

  3. When information is domain specific, today's GenAI is not equipped to extract everything of interest on its own. A subject-matter expert, a human in the loop, is important to guide the AI to extract and surface these structures hidden in unstructured data.


Future

At DeepDive Labs, we see a future where AI is a great augmentation tool for all experts in the field, improving our productivity and freeing us for higher-order tasks. Tell us how you have been using AI in your workplace: are you becoming a Centaur or a Cyborg (pdf)?

