Unlocking crime data for research: An update from 2019's SAGE Concept Grant winners
Last year we announced Text Wash as the winner of the 2019 SAGE Concept Grant program. This new software tool, under development by Dr. Toby Davies, Dr. Bennett Kleinberg and Maximilian Mozes at UCL’s Department of Crime Science, uses machine learning and natural language processing to unlock previously untapped crime data that has so far been inaccessible to researchers because the personally identifiable information it contains must first be anonymized.
One year on, we caught up with the Text Wash team about why we need text anonymization to work with crime data, why the current methods need replacing, and the opportunities that this new software tool will open up for researchers and practitioners.
What crime data do we have available and what are the barriers to using it in research?
There is a great deal of data wrapped up in free text sources across the social sciences and humanities. This data contains a rich amount of information, but it can’t be exploited by the research community either because it’s inaccessible or because it can’t be shared for ethical or legal reasons.
In the field of crime science, the type of data we’re looking at is free text descriptions of crime events, interviews with offenders, and other descriptions of criminality. Since these data sets contain personal data about individuals, they cannot be shared without first being anonymized. If we find an effective way to do this, the data could tell us more about how crime is happening and the types of behaviors that are involved in crime.
What are the issues with current approaches to text anonymization?
The current procedure for anonymizing crime data involves a person with a vetted security level going through text manually and crossing out any information that might reveal an identity. This is problematic because a manual process is difficult to scale up to large volumes of text.
Another problem is that the current tools available are not properly validated, so we don’t really know how good they are at detaching the identity of individuals from the text.
What are the biggest challenges in anonymizing text data?
The two biggest challenges in anonymizing text data are what to anonymize and how to anonymize it.
The what refers to which pieces of information in a text you need to redact for the text to become fully anonymized. This might seem intuitive to us from a human perspective, but in a computer setup this is much more difficult to achieve.
As to the how: If we assume we have found all the words and phrases we need to anonymize, we need to find a way of replacing these pieces of information in a semantically meaningful way. You could, for example, replace all the sensitive words with the letter X, but then the resulting text may not have the same semantic meaning as the original version. What we aim to do with Text Wash is find replacements for these words such that the semantics of the sequence are retained after anonymization, so that it’s still valuable for both quantitative and qualitative text analysis.
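To make the contrast concrete, here is a minimal Python sketch of the two replacement strategies described above. It is an illustration only and not Text Wash's actual implementation: the `SENSITIVE` lookup, the fictional names in it, the category labels, and both function names are assumptions made for this example, and the sketch presumes the sensitive phrases have already been found.

```python
import re

# Hypothetical, hand-written lookup of sensitive phrases and their categories.
# A real system would have to detect these automatically; the entries here
# are fictional and exist only for illustration.
SENSITIVE = {
    "John Smith": "PERSON",
    "Camden": "LOCATION",
    "12 March": "DATE",
}


def _pattern(phrases):
    # Try longer phrases first so they win over any shorter phrase they contain.
    ordered = sorted(phrases, key=len, reverse=True)
    return re.compile("|".join(re.escape(p) for p in ordered))


def mask_with_x(text, sensitive=SENSITIVE):
    """Replace every sensitive phrase with the letter X."""
    return _pattern(sensitive).sub("X", text)


def replace_with_category(text, sensitive=SENSITIVE):
    """Replace each sensitive phrase with a numbered category placeholder.

    Repeated mentions of the same phrase reuse the same placeholder, so a
    reader can still tell that two sentences refer to the same entity.
    """
    assigned = {}  # phrase -> placeholder already given to it
    counters = {}  # category -> number of distinct phrases seen so far

    def substitute(match):
        phrase = match.group(0)
        if phrase not in assigned:
            category = sensitive[phrase]
            counters[category] = counters.get(category, 0) + 1
            assigned[phrase] = f"[{category}_{counters[category]}]"
        return assigned[phrase]

    return _pattern(sensitive).sub(substitute, text)


report = "John Smith was seen in Camden on 12 March. John Smith left Camden by bus."
masked = mask_with_x(report)
categorized = replace_with_category(report)
print(masked)
print(categorized)
```

In the first output a person, a place and a date all collapse into the same X, so it is no longer possible to tell what kind of information was removed. The second output keeps the type of each item and the fact that the same person and place recur, which is one simple way of retaining more of the meaning for later analysis.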
How does Text Wash work?
Text Wash helps researchers and practitioners to anonymize text data on a large scale. It does this by using state-of-the-art techniques from natural language processing, which is the process of getting computers to understand language in the way that humans use it. We combine this with machine learning to derive information from text that we then automatically tag or anonymize so that individuals cannot be identified, whilst at the same time retaining the semantic meaning of the text so that researchers can use it to conduct large-scale automated data analysis.
What will be the benefits to researchers?
By using Text Wash to open up access to vast banks of crime data, researchers will be able to identify new types of crime and new modus operandi (the ways that crimes are committed), which might not have been apparent from the kind of quantitative data that we currently have available.
Our key selling point for Text Wash is that we want to make it accessible for both practitioners and researchers. We aim to do that by providing an open-source software tool that researchers can use and develop further if they want to. At the same time, we are developing a tool that can be put into the hands of practitioners, like the police, who can run it in house on their own computers to anonymize text data and then share it with researchers or other partners who are interested in the data.
The benefits will extend beyond crime science, too. Since text data crops up in other fields, like medicine and epidemiology, we would hope that the tools we are building for our particular research questions will be applicable in these fields as well.