As technology improves, utilising it to assist people becomes more and more valuable. Answering questions about images, for example, can be incredibly helpful when they are asked by people who are affected by sight loss. Visual question answering (VQA) systems are becoming more accurate, but often their correct answers are not actually usable by the people who could benefit the most from these advancements.
In addition to usability, there are a few other vital considerations when developing assistive technology. They are: (1) privacy, as the camera may capture a user’s mail, medication bottles, or other sensitive information; (2) transparency, so that the system’s behaviour can be explained and trusted by users; and (3) controllability, to tailor the system for a particular domain or user group. We have therefore extended a conversational VQA framework, called Aye-saac, with these objectives in mind.
Specifically, we gave Aye-saac the ability to answer visual questions in the kitchen, a particularly challenging area for visually impaired people. Our system can now answer questions about quantity, positioning, and system confidence with regard to 299 kitchen objects. Questions about the spatial relations between these objects are particularly helpful to visually impaired people, and our system output more usable answers than other state-of-the-art end-to-end VQA systems.
At the start of 2020, I co-supervised a group of seven MSc students doing Heriot-Watt University’s Conversational Agents and Spoken Language Processing course. We built the foundations of a voice assistant for blind and partially sighted people. You can read about that work and the students here.
This year’s team worked on this project over a 12-week period as part of the same course. They are:
We couldn’t have done this work without Julian Frabel, Noe Rivals, Guillaume Jeanne, Rachel Yu, Basile Lamarque, and Nejma Belkhanfar.
Malnutrition is a seemingly unrelated health issue that is actually associated with sight loss. Grocery shopping is difficult, eating is difficult, and meal prep can even be dangerous - with a risk of injury and people affected by sight loss reporting that they feel unsafe. With no/reduced spatial awareness or depth perception, sharp knives and extremely hot objects are clearly of concern.
Another consideration is food hygiene. With sight loss, it is very difficult to determine whether food is cooked thoroughly, mouldy, dirty, or expired (see the other group’s work). For these reasons, many people affected by sight loss only prepare hot meals with assistance from friends, family, and carers - or do not prepare hot meals at all. Citations can be found in our paper.
We are therefore very motivated to take steps towards alleviating some of these concerns.
Visual question answering (VQA) systems take both an image and a question as input, and then respond with an answer. The current state-of-the-art systems are end-to-end, requiring huge amounts of data and resources to train and tweak. This doesn’t detract from the fact that they are the most accurate VQA systems today. We just want to highlight that they are not transparent or controllable, and do not focus on the usability of their answers. These properties are very important when developing assistive technology, as we need to be able to explain why incorrect responses were output, quickly tweak the system to avoid these incorrect responses, and not frustrate the user by responding with useless answers that just waste their time.
We decided to build upon Aye-saac, a conversational VQA system that was originally designed with a focus on the above principles. Privacy was also at the heart of our design, as people affected by sight loss will often not know what is in view of the camera. For an in-depth discussion of this, with a comparison of existing human-in-the-loop and end-to-end systems, please check out our previous article.
The original voice assistant that we started with, called Aye-saac, was created by a group of MSc students at Heriot-Watt University. It was built from scratch with a fantastic microservice-based architecture. This allowed us to easily build new components and improve existing ones for our use-case. You can read about their original work here.
You will notice that there are OCR and “Label Extraction” components that we do not discuss below. They were implemented by another group of students - you can read their paper or their Rasa article!
Following the user’s question through this architecture, our system does the following:
Natural Language Understanding
When a person affected by sight loss requires assistance, it is vital that our system accurately determines what they are asking for assistance with. The camera takes a photo as the question is being asked, and the textual transcript of the user’s question is sent to Aye-saac’s NLU component.
We are using Rasa NLU to understand the user’s request and to extract entities from it. For example, we have trained Rasa NLU to output the locate intent and the entity “oven” when asked “Where is the oven?”. Since Rasa is open source, we can use its libraries to build conversational agents, train them using our own data, tweak training pipelines, etc… and still perform on par with paid NLU services like Microsoft’s LUIS.
To reduce error-propagation between these two sub-tasks (intent classification and entity extraction), we opted to use Rasa’s Dual Intent Entity Transformer (DIET) classifier as it has been shown to improve overall performance. We did not spend a huge amount of resources trying to optimise this, so I imagine some readers could advise here, but this is the Rasa pipeline we chose:
- Firstly, with the “SpacyNLP” component, we initialize spaCy’s medium-sized English model trained on written web text (en_core_web_md).
- We then split the text into tokens using spaCy. We do not want to predict multiple intents in our implementation, so we are just using the default “SpacyTokenizer”.
- We then convert these tokens into features that we can feed into our model. The first of these featurizers uses spaCy again: the “SpacyFeaturizer”. This outputs a dense vector representation of the user’s full utterance.
- Another featurizer we are using in our pipeline is the “RegexFeaturizer”, which outputs sparse features representing whether certain regular expressions were found in the user’s utterance.
- We then use the “LexicalSyntacticFeaturizer” to output another set of sparse features. We kept this component’s default settings as we saw no benefit from the optional features available. It therefore returns features that represent whether a token is at the beginning of an utterance, at the end of an utterance, is lowercase, is uppercase, or is one uppercase letter followed by lowercase letters (e.g. Rasa or Edinburgh).
- Our second-to-last featurizer is the “CountVectorsFeaturizer”, which outputs even more sparse features in a bag-of-words representation. This component uses sklearn’s CountVectorizer on each token.
- The last featurizer in our pipeline is also a “CountVectorsFeaturizer”, but with a different configuration. Instead of using sklearn’s CountVectorizer on each token, we use it on multiple n-grams of each token. This lets us represent parts of words that we know the meaning of, even if the word as a whole is unseen (e.g. “bio” in “bioregionalism”). We set the minimum n-gram size to 1, and the maximum to 4.
- All of these features are sent to Rasa’s Dual Intent Entity Transformer (DIET) classification implementation, named “DIETClassifier”. As mentioned earlier, this DIET model outputs both the user’s predicted intent and the recognised entities. As we retrained our pipeline many times, we set the number of training epochs to 100. Increasing this would likely lead to a boost in performance, at a cost of training time.
- We then use the “EntitySynonymMapper” to map the entities extracted by the DIET classifier to their synonyms. This works by mapping synonyms to the same value.
- Finally, we use Rasa’s default “ResponseSelector” to output predicted responses and their confidences.
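The pipeline steps above can be summarised in one structure. The following is a minimal sketch, written as a Python dictionary that mirrors the layout of a Rasa configuration file. The component names and the epoch count come from the description above. The option keys ("model", "analyzer", "min_ngram", "max_ngram", "epochs") follow Rasa's documented options as we understand them and may differ between Rasa versions, and the "char_wb" analyzer value is an assumption about how the n-gram featurizer was set up.

```python
# Sketch of the NLU pipeline as a Python structure mirroring a Rasa config file.
# Option keys and the "char_wb" analyzer are assumptions; check your Rasa version.
pipeline = [
    {"name": "SpacyNLP", "model": "en_core_web_md"},
    {"name": "SpacyTokenizer"},
    {"name": "SpacyFeaturizer"},
    {"name": "RegexFeaturizer"},
    {"name": "LexicalSyntacticFeaturizer"},  # default settings
    {"name": "CountVectorsFeaturizer"},      # bag-of-words over tokens
    {
        "name": "CountVectorsFeaturizer",    # n-grams of sizes 1 to 4
        "analyzer": "char_wb",               # assumed analyzer value
        "min_ngram": 1,
        "max_ngram": 4,
    },
    {"name": "DIETClassifier", "epochs": 100},
    {"name": "EntitySynonymMapper"},
    {"name": "ResponseSelector"},
]

config = {"language": "en", "pipeline": pipeline}


def component_names(cfg):
    """Return the component names in pipeline order."""
    return [component["name"] for component in cfg["pipeline"]]
```

Calling component_names(config) gives a quick way to check that the order matches the list above before training.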
In order to train our Rasa pipeline to classify new intents and extract the correct entities, we crowd-sourced data from fellow students. Examples include:
- Where is the spoon? (intent:locate, entities:[spoon])
- Can you count the number of eggs? (intent:count, entities:[eggs])
- Are you sure? (intent:confidence)
The other group of students also used Rasa which made integration of both team’s work extremely easy! We’d like to thank Rasa for their support when students had questions and were learning how to use all their libraries.
Manager within the Dialogue Manager
Once the photo is taken and the user’s question has been ‘understood’ by our NLU component, the manager selects a path through the various components within Aye-saac. The intent is used for this decision. For example, “Where is the spoon?” does not require OCR but does require object detection - the path reflects this.
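To make the path selection concrete, here is a minimal sketch of an intent-to-path lookup. The component names and the table itself are hypothetical and are not taken from the Aye-saac code base. They only reflect what this article states: counting needs object detection alone, locating needs object detection but not OCR, and confidence questions reuse the previous turn.

```python
from typing import Dict, List

# Hypothetical component names; the real Aye-saac service names may differ.
PATHS: Dict[str, List[str]] = {
    "locate": ["object_detection", "spatial_detection"],
    "count": ["object_detection"],
    "confidence": [],  # no new image processing, reuses the previous turn
}


def select_path(intent: str) -> List[str]:
    """Return the ordered list of components to run for an intent."""
    if intent not in PATHS:
        raise ValueError(f"No path defined for intent: {intent!r}")
    return list(PATHS[intent])
```

Keeping the routing in one small table is one way to keep the system controllable: supporting a new intent means adding one entry.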
As our team was particularly focused on object-related questions, the object detection component was extremely important, and we built the spatial detection component from scratch. OCR was not part of our focus, so we will not discuss it below.
To answer any object-related questions, our system must be able to identify those objects in an image. Not only this, we also want to identify where each object is in the image. It was therefore essential that we integrated a suitable object detection model (not image classification) to provide both of these outputs.
The original group had developed an object detection model within Aye-saac already. It used a Single Shot Detector (SSD) with ResNet50, trained on the COCO dataset. This contained 80 object classes, but only 30 of these were commonly found in the kitchen.
This was very important, as our system can only answer questions about objects that it recognises! Without any changes, we couldn’t locate many standard kitchen utensils, but could answer: “Where is the giraffe?”
Thanks to Aye-saac’s wonderfully transparent and controllable architecture, we could easily update the object detection component to be more kitchen-specific.
We opted to integrate a Faster R-CNN model trained on the EPIC-KITCHENS-55 (EK) dataset. This model can identify 290 distinct objects that are commonly found in the kitchen. This model is a baseline though, with several papers reporting much better object detection performance on the EK dataset. Sadly, none of these better models have been released publicly. As the COCO model outperforms the baseline EK model, we decided to keep it until a state-of-the-art EK model is released (or we train our own).
Aye-saac therefore relies on two object detection models to identify 299 kitchen object classes (21 classes occur in both datasets) - a huge improvement on the original 30.
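The 299 figure follows from counting the overlap once: 290 EK classes plus 30 kitchen-relevant COCO classes, minus the 21 classes that appear in both. A minimal sketch of that arithmetic, and of one possible way to combine the two detectors' outputs, is below. Preferring the COCO result for shared classes is our assumption for illustration (based on the COCO model outperforming the EK baseline), and the detection format and labels are hypothetical.

```python
from typing import Dict, List, Set

Detection = Dict[str, object]  # hypothetical format: {"label": str, "score": float}


def total_classes(ek: int, coco_kitchen: int, shared: int) -> int:
    """Size of the union of two class sets, counting shared classes once."""
    return ek + coco_kitchen - shared


def merge_detections(
    coco: List[Detection], ek: List[Detection], shared_classes: Set[str]
) -> List[Detection]:
    """Combine both detectors' outputs.

    Assumption: for classes known to both models, keep only the COCO result.
    """
    merged = list(coco)
    merged.extend(d for d in ek if d["label"] not in shared_classes)
    return merged
```

For example, total_classes(290, 30, 21) gives the 299 classes mentioned above.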
With improved domain-specific NLU and object detection components, we could now almost answer counting questions. For example, people affected by sight loss can find it difficult to figure out whether they have enough ingredients for a particular recipe. Now when someone asks “How many eggs do I have?”, our NLU component will determine that the user is asking to “count” the entity “eggs”. Then our object detection component will return a number of boxes with labels. The number of boxes labelled as eggs will be our answer. But before we move on to the interpreter and response generation, let’s look at answering locate and confidence questions.
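Counting then reduces to tallying the labelled boxes. The following is a minimal sketch, assuming each detection is a dictionary with "label" and "score" keys (a hypothetical format). The optional score threshold is our own addition for illustration and is not described in the article.

```python
from typing import Dict, List


def count_objects(
    detections: List[Dict[str, object]], label: str, min_score: float = 0.0
) -> int:
    """Count detected boxes with the given label.

    min_score is a hypothetical filter; 0.0 keeps every detection.
    """
    return sum(
        1
        for d in detections
        if d["label"] == label and float(d["score"]) >= min_score
    )


# Illustrative detections, not real model output.
detections = [
    {"label": "egg", "score": 0.92},
    {"label": "egg", "score": 0.81},
    {"label": "bowl", "score": 0.77},
]
eggs = count_objects(detections, "egg")
```

Note that the entity in the question ("eggs") has to be mapped to the detector's label ("egg") first, which is the kind of normalisation the synonym mapping step can provide.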
From the object detection, we get bounding boxes (as seen colourfully in the EK image above). If one object is to the left of another, their bounding boxes will follow suit. Therefore, in a simple example, the centre point of one bounding box will be to the left of the other bounding box’s centre - giving us some spatial relation information from a 2D image.
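The centre-point comparison can be sketched in a few lines. This is a minimal illustration, assuming boxes are given as (x_min, y_min, x_max, y_max) in pixel coordinates with x increasing to the right. The function names and box values are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def centre(box: Box) -> Tuple[float, float]:
    """Return the (x, y) centre point of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def horizontal_relation(query: Box, other: Box) -> str:
    """Describe where the query box sits relative to the other box in the image."""
    query_x, _ = centre(query)
    other_x, _ = centre(other)
    if query_x < other_x:
        return "left of"
    if query_x > other_x:
        return "right of"
    return "aligned with"


# Illustrative boxes: a sink between a refrigerator and an oven.
fridge = (0.0, 50.0, 100.0, 400.0)
sink = (150.0, 200.0, 300.0, 260.0)
oven = (350.0, 180.0, 500.0, 400.0)
```

Here horizontal_relation(sink, fridge) returns "right of" and horizontal_relation(sink, oven) returns "left of". These relations describe positions in the 2D image only, not true depth in the room.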
Although this approach is relatively simple, we can start to add complexity in a controlled manner with a focus on usability.
Firstly, we identified a list of 34 objects that rarely move in a kitchen like the oven and fridge. We called these anchor objects. Sight Scotland taught us that people affected by sight loss know their own homes very well. For example, they would not need to ask a system “Where is my couch?” because it does not move and is never in an unexpected location. These kitchen “anchor objects” follow this logic.
We then attached expected spatial relationships to each anchor object class. These are prepositions that would likely describe the spatial relation between a query object and an anchor. For example, objects are likely to be found “in”, “on”, and “next to” but not “below” a fridge. This allows Aye-saac to describe an object’s location more specifically, related to objects that a visually impaired person would be able to locate themselves.
If no anchor objects are in the picture, we attempt to describe an object’s location in relation to people or hands. Failing this, we return a more standard response that just relies on the absolute position of the object in the image.
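The fallback order described above (anchor objects first, then people or hands, then absolute position) can be sketched as follows. This is a minimal illustration and not the Aye-saac implementation. Only the fridge entry reflects the example in this article; the real list covers 34 anchor classes, and the label names and return values here are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Only the fridge entry comes from the article; other anchors are omitted here.
ANCHOR_PREPOSITIONS: Dict[str, Set[str]] = {
    "fridge": {"in", "on", "next to"},
}
FALLBACK_REFERENCES = ("person", "hand")  # hypothetical label names


def choose_reference(labels: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Pick what to describe the query object's location against."""
    labels = list(labels)
    for label in labels:
        if label in ANCHOR_PREPOSITIONS:
            return ("anchor", label)
    for label in labels:
        if label in FALLBACK_REFERENCES:
            return ("person_or_hand", label)
    return ("absolute", None)  # fall back to position within the image


def is_expected(preposition: str, anchor: str) -> bool:
    """Check whether a preposition is plausible for the given anchor."""
    return preposition in ANCHOR_PREPOSITIONS.get(anchor, set())
```

With this table, is_expected("on", "fridge") is true while is_expected("below", "fridge") is false, matching the fridge example above.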
More details can be found in the paper, but we found that our focus on answer usability was useful as we did return more helpful answers than other VQA systems. For example:
When given the above image and asked “Where is the sink?”, both Pythia and HieCoAtt (VQA systems) responded that it is in the “kitchen”. Aye-saac responded with “I can see a sink, it’s right of the refrigerator and it’s left of the oven in this image”. Our system cannot compete with these other general VQA systems in most other regards, but this answer is clearly more helpful.
Interpreter and Natural Language Generation (NLG)
All of the information output by each of Aye-saac’s components is collected in the interpreter. We have discussed the answers to count and locate questions above, so let’s run through a confidence example in order to illustrate what the interpreter does.
We implemented this confidence intent in the NLU as we believe that system transparency is of high priority, especially when interacting with more vulnerable groups. This was also a good way to take steps towards our desired goal of multi-turn interaction.
To really drive home the importance of transparency: a system recently told a user that a bottle was “seasoning” when, in fact, the item was medication. This is a safety-critical mistake if the user is affected by sight loss!
So, similar to the counting example earlier, the user has taken an image of their kitchen and asked “How many chairs are there?”. Aye-saac’s NLU component uses Rasa to correctly identify that the user’s intent is to count objects, and that they want to count chairs. The Manager component then selects a path through the other components based on the intent. In this case, only the object detection is required. The object detection successfully identifies two objects within the image, a chair and a table. Importantly, the object detection also outputs how confident it is about its detections (note: object detection models can still be confidently wrong). All of this information is sent to the interpreter component, filtered, and sent to the NLG component (which I’ll detail below). Aye-saac therefore responds “I count one chair”.
Now the user asks “Are you sure?”.
Once again, this is sent to our Rasa NLU model, which determines that the user is asking a confidence question. Based on this intent, the manager selects a path (in this case the interpreter just needs to retrieve the previous turn’s info - no new image processing is required). The interpreter then selects the previous object detection confidence scores and sends them to our NLG component.
We use a template-based NLG component here to avoid “NLG hallucination”, whereby a learned generator would produce responses containing incorrect information picked up from its training data. This is again to minimise the chance of making safety-critical mistakes. Our NLG component responds with “My confidence is 90% that there is a chair and 60% that there is a table”. This is very similar to the other team’s NLG component, so we were able to integrate very easily.
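A template for the confidence response can be as simple as string formatting over the stored detection scores. Below is a minimal sketch that reproduces the example sentence above. The function name, the input format, and the wording of the empty case are hypothetical and are not taken from the Aye-saac code base.

```python
from typing import List, Tuple


def confidence_response(detections: List[Tuple[str, float]]) -> str:
    """Fill a fixed template with labels and scores from the previous turn.

    Every word in the output comes from the template or from the detector,
    so nothing can be invented by the generation step itself.
    """
    if not detections:
        # Hypothetical wording for the empty case.
        return "I did not detect any objects in the last image."
    parts = [
        f"{round(score * 100)}% that there is a {label}"
        for label, score in detections
    ]
    return "My confidence is " + " and ".join(parts)


reply = confidence_response([("chair", 0.9), ("table", 0.6)])
```

A fuller version would also handle articles ("an oven") and longer lists, but the principle of filling fixed slots stays the same.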
People affected by sight loss face many challenges in the kitchen. We have tried to highlight the importance of usability when designing assistive technology, and made steps to build a conversational VQA framework with this in mind. In addition, we believe that privacy, transparency, and controllability are necessary goals - especially when answering safety-critical questions.
This team of MSc students were a delight to work with and were very excited to publish this paper. Please do pop back up to the top of this article and check out their social media.