The importance of human-interpretable AI
My colleague called me to say, “They want to know what size dog: big dog or little dog?” Not a question a data scientist gets asked every day. My colleague is a subject matter expert with decades of border security experience. He understands inspectors. He speaks their language. He knows his big dogs from his little dogs. From the silence at my end of the phone, it was clear to him that I didn’t. This article is the story of why a question about the size of a dog led us to upgrade the “ordinary” artificial intelligence (AI) in the Unisys LineSight® advanced targeting system to use more human-interpretable Explainable AI (XAI) and how this brought wider benefits to security at the border.
At the time, my colleague was working on-site, embedded in customs at one of the world’s busiest airports, helping us to develop state-of-the-art AI for advanced targeting analytics. Advanced targeting analytics can automatically identify and target shipments or travellers that present a higher risk and require additional investigation or inspection. In this particular instance, we were evaluating some new ideas for highly specialised AI models for air freight targeting. So, in between training border security officers to use the newly deployed targeting system, my colleague observed officers using it during targeting and inspections. He actively solicited feedback to help us improve the user experience and, in particular, the accuracy of these specialised models. Up until that point, we had been pleased with the results we were getting from this latest AI-based addition to LineSight®. Trained on years of seizure data, these new AI models were assessing vast amounts of shipment data in real time and, in theory, selecting targets as accurately as human domain experts. Indeed, we’d already had successful seizures where our models independently selected the same targets as the targeting team, as well as a notable few that the targeting team had not selected for inspection.
In practice, though, we were aware of a few problems. Some of our AI’s most promising-looking targets, ones flagged as highest risk for the day, were being rejected by border security officers on the ground and ignored. Was the AI’s assessment of these shipments correct or not? We would never know, because the inspections were not carried out. We suspected that it came down to logistics or scheduling: these new models had no concept of how long a particular shipment would take to inspect, nor of how selecting that shipment should be balanced against the alternative of inspecting several others in the same timeframe. We could deal with that in good time, once we’d optimised the accuracy of the models.
Now, one team of inspectors was asking our AI to do something it couldn’t do: to explain, in non-technical terms, the reasoning behind its decision so that they could choose the right-sized dog. My colleague had patiently explained that the size of the sniffer dog needed depends on the type of smuggled commodity and the type of concealment. If they didn’t book the right-sized dog, the inspection would be a waste of time, a furry version of a square peg in a round hole. Our AI needed to tell the officers why it thought a particular shipment was suspicious, not just how high it judged the risk to be. As a data scientist, it was now my turn to know something that my colleague didn’t: this request was potentially bad news, because some AI models are notoriously hard to interpret and their decisions correspondingly hard to explain – especially in non-technical language. Despite this, it was now abundantly clear to both of us that if border security officers were to trust and act upon the AI’s targeting decisions, then the system needed to explain those decisions, and do so in their language. A risk score alone, no matter how accurate, was insufficient for officers to interpret the AI’s decision in an actionable way; they needed to understand the “why” behind the decision.
The point we’d missed in deferring the suspected logistics issues, the ones causing some of our AI models’ best recommendations to be ignored by border security officers, is that someone or something has to make those logistics decisions, or no action can be taken. Choosing the right-sized dog requires information about what the inspectors should be looking for in the target shipment, as do numerous other operational decisions that officers need to make when acting on a targeting recommendation. In all likelihood, our great successes up until that date relied, in part, on officers using their own considerable experience to guess why the AI was targeting particular shipments, enabling them to make the requisite operational decisions. The dog question highlighted that there were situations where the officers could not guess without further information. This is why we turned to Explainable AI (XAI).
XAI solves a problem that didn’t exist before AI. Earlier targeting recommendations may not have been as accurate or as comprehensive in their use of available data, but they were transparent. These largely manual “systems” were in fact sets of 20 or more non-integrated IT systems that human targeters could use to select high-priority targets for the day. The people who made the targeting decisions intrinsically knew why they were selecting a target and could record it, sadly all too often, in operational spreadsheets. If an inspector wanted to understand why a particular shipment was a target, they could simply ask the targeting team for further information.
Then came Rule Engines, which let targeters manually define alerting rules to be run against a variety of shipping data sources. These heuristic rules (rules of thumb) were easy to define and easy to understand when they fired an alert, so long as someone on the team still remembered what the rule was meant to detect. Unfortunately, rules were so easy to define that it was not uncommon to have thousands of legacy rules with cryptic names whose meaning had long been forgotten. No one had the time, or wanted to take the risk, to delete these legacy rules. Near-duplicate rules also tended to proliferate, in part because it was easier to create a brand-new rule than to dig through the archive of existing ones. In terms of performance, manually defined rules tend to give far too many false positives, while their overly sharp thresholds miss the subtleties of cumulative risk. They do, however, still have a useful role if used sparingly – for example, in specific short-term campaigns where rapidly developing new AI models might not be feasible. LineSight® includes a Rule Engine for just such use cases. Within the aforementioned constraints, these rules are transparent and easily interpreted by officers.
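A minimal sketch can make the sharp-threshold problem concrete. The rule logic and field names below (`weight_kg`, `declared_value`, `origin_risk`) are entirely hypothetical, not LineSight® logic: a hand-written rule fires only past a hard cut-off, while a cumulative score lets several weak signals add up to an alert.

```python
# Hypothetical illustration: a sharp-threshold rule vs. cumulative risk.
# All fields and weights are invented for the sketch.

def sharp_rule(shipment):
    """A typical hand-written rule: fires only past hard thresholds."""
    return shipment["weight_kg"] > 500 and shipment["declared_value"] < 100

def cumulative_risk(shipment):
    """Accumulates several weak signals into one score in [0, 1]."""
    score = 0.0
    score += 0.4 * min(shipment["weight_kg"] / 1000, 1.0)   # heavier = riskier
    score += 0.3 * (1.0 if shipment["declared_value"] < 200 else 0.0)
    score += 0.3 * shipment["origin_risk"]                  # route-based risk
    return score

# A borderline shipment: just under the rule's weight threshold,
# yet several weak signals together still make it look risky.
borderline = {"weight_kg": 490, "declared_value": 120, "origin_risk": 0.9}
print(sharp_rule(borderline))                  # no alert from the sharp rule
print(round(cumulative_risk(borderline), 2))   # but a high cumulative score
```

The borderline case is exactly the kind of subtlety a hard threshold misses: no single signal crosses a line, but together they warrant a look.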
Enter AI, which uses historical shipment or passenger data to learn complex rules in the form of models, and which can be kept current automatically by periodically retraining those models on the latest seizure data. Carefully tuning an AI model in consultation with domain experts makes it possible to choose the best trade-off between the number of false positives and overall accuracy in a way that is rarely possible with Rule Engines. The downside is that the resulting AI models and their predictions can be mysteriously opaque to human interpretation. The term AI is often used to refer to the neural network approach to learning models from data. While phenomenally successful, these models are amongst the most opaque and least interpretable by humans. However, in the broader usage of the term, and without straying into sci-fi, AI also refers to Machine Learning (ML) in general – the family of algorithms that learn from data. Many ML algorithms, like decision trees, are inherently transparent and make decisions that can be readily interpreted by humans; others, like neural networks, are not. Either way, ML algorithms do not usually generate explanations alongside their decisions; explanations have to be extracted, either by modifying the ML/AI algorithm or by post-hoc analysis of the algorithm’s behaviour. This process of making ML/AI interpretable by humans is known as Explainable AI (XAI).
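To see why a decision tree is inherently transparent, consider the toy, hand-built tree below. The features (`route_risk`, `weight_vs_declared_ratio`) and thresholds are invented for illustration; a real tree would be learned from seizure data. The point is that a tree’s decision *is* the path taken through it, so a human-readable explanation falls out for free.

```python
# Toy hand-built decision tree over hypothetical shipment features,
# returning both a label and the decision path that produced it.

def classify_with_path(shipment):
    path = []
    if shipment["route_risk"] > 0.5:
        path.append("route_risk > 0.5")
        if shipment["weight_vs_declared_ratio"] > 2.0:
            path.append("weight_vs_declared_ratio > 2.0")
            return "high risk", path
        path.append("weight_vs_declared_ratio <= 2.0")
        return "medium risk", path
    path.append("route_risk <= 0.5")
    return "low risk", path

label, reasons = classify_with_path(
    {"route_risk": 0.8, "weight_vs_declared_ratio": 3.1})
print(label, "because", " and ".join(reasons))
```

A neural network offers no equivalent of `path`: its “reasoning” is spread across thousands of learned weights, which is why post-hoc XAI techniques are needed there.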
XAI tools and techniques have been developed to extend the utility of AI by accompanying predictions with explanations of either exactly why a particular decision was taken or, where that is not possible, what the main contributing factors were. Sometimes XAI even learns a second, approximate but more human-interpretable “meta-model” of the opaque model, so that explanations can be extracted from the meta-model. XAI explanations can take different forms and be presented to users in many different ways: for example, as natural-language explanatory text, as lists of key words, as bar charts comparing the relative contributions of different features, or, for images, by visually highlighting the regions that most influenced the AI. Interactive explanation is also possible, where the user drills down into the details of a decision as far as needed to gain sufficient understanding on a case-by-case basis. There are as yet no standards for explanations, and some vendors claim to support XAI when, in fact, their explanations still need skilled analysts to interpret the results.
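As a sketch of the natural-language form, the snippet below turns per-feature contribution scores of the kind produced by post-hoc XAI methods (such as SHAP or LIME) into a one-sentence explanation. The contribution values and feature phrasings are invented for illustration, not LineSight® output.

```python
# Sketch: rendering hypothetical per-feature contributions (e.g. from a
# post-hoc XAI method) as a concise natural-language explanation.

def explain(score, contributions, top_n=2):
    """Phrase the top_n positive contributors behind a risk score."""
    top = sorted(contributions.items(), key=lambda kv: -kv[1])[:top_n]
    reasons = " and ".join(name for name, value in top if value > 0)
    return f"Risk {score:.2f}: elevated mainly because {reasons}."

msg = explain(0.87, {
    "the declared weight is inconsistent with the goods description": 0.34,
    "the consignee is newly registered": 0.21,
    "the route is commonly used": -0.05,   # negative: reduces the risk
})
print(msg)
```

Even this crude template answers the operational question directly: an officer reading it knows what the inspection should be looking for, and therefore which resources – including which dog – to book.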
In our case, after discussing the available options with border security officers, we selected an XAI method that generated concise natural-language explanations to accompany shipment risk scores in LineSight®. Not only did these explanations immediately solve the dog problem by enabling officers to make better-informed operational decisions in general, but XAI also brought other benefits. Arguably the most important of these was fostering trust in LineSight® by making the reasons for an elevated risk score transparent. The corollary of this transparency was that officers were now able to give us feedback on why they accepted or rejected LineSight® inspection recommendations, to call out unjustified bias in our models, and to suggest improvements. This in turn led to the development of better models in a virtuous cycle. We started logging the explanations, not just the concise natural-language explanations but also a more detailed, quantified explanation, which further helped to improve our models but, more significantly for LineSight® and the border agency, created an audit trail of decision lineage to support ethics and compliance in our use of AI. All in all, not a bad outcome from a question about the size of a dog.