Towards a Face Recognition Model Analyzer

Matthew Johnson, Alexander Schwartzberg, John Angel, Jordan Faas-Bush, Andrew Gaudet

Semantic Web Laboratories
Cognitive Science Department
Rensselaer Polytechnic Institute (RPI), Troy, NY 12180

johnsm21@rpi.edu

Abstract

          Machine learning allows us to learn a model for a given task, such as facial recognition, with a high degree of accuracy. However, once these models are generated they are often treated as black boxes, and their limitations are often unknown to the end user. To address this issue, we have developed a semantically enabled system that allows users to explore the limits of a facial recognition model. We integrate "smart" images with classification results to discover common causes of misclassification, capture the provenance of the model learning process, and describe the structure of the learned model from a data-centric perspective. We evaluated our tool by loading the Labeled Faces in the Wild[1] dataset, enriching the images with Kumar image tags[2], and exploring two popular face recognition models, FaceNet[3] and DLib[4]. Using our tool we discovered several limitations, including that both models have trouble classifying mugshot-like images and images with cranial occlusion, while DLib is significantly better at recognizing people with hair.

 Introduction

        Machine learning, and the amount of money being spent on machine learning research, has exploded over the last 10 years. New machine learning techniques are being applied to everything from video game creation to disease diagnosis to better communication between machines and humans. However, one of the main limitations of machine learning algorithms is that the learned models are often impossible for humans to understand and are often evaluated only for accuracy on a specific dataset for a specific task. As a result, future users treat the model as a black box without really understanding its limitations. Another limitation is the inability to explain how a model arrives at a prediction, which makes it difficult for users to trust pre-learned models. There are use cases where this is perfectly fine; however, it is unacceptable in scenarios where life and death decisions must be made, such as driving a vehicle. Because of this, a potential user choosing a model for such a scenario often has insufficient information to make a truly informed decision about which model to use for a new application.

    To address these issues we developed the Face Recognition Model Analyzer (FRMA) ontology [https://tw.rpi.edu/web/Courses/Ontologies/2018/FRMA], which semantically describes face recognition models, the images used to train and test a model, and the predictions generated by a model. Using this ontology we are able to explore the strengths and weaknesses of various models from an image-attribute perspective using SPARQL queries, allowing future users to better evaluate the effectiveness of a pre-trained model for their use case.

Use Case

        The use case is focused on potential requirements for the development of an application that supports facial recognition model analysis. The application will be used to discover how the image attributes described in an ontology correlate with the test results of a facial recognition model. It must be able to read in model test results provided by a user and link classification results to known “smart” images. “Smart” images use the FRMA ontology to semantically describe the image attributes and the facial features of the people within the image. Using this alignment between datasets and model results, our system can calculate accuracy statistics across image attributes and find correlations between misclassifications and image attributes. For example, the user could choose to see the model’s accuracy on only mugshot images or only images of people with blond hair. The user can then use this information to determine whether the model fits their specific wants and needs, or determine what causes the model to fail and use that information to improve the model in the future.
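As a sketch of how such statistics can be computed, consider the following SPARQL query, which groups classification results by image attribute and reports per-attribute accuracy. The vocabulary (frma:ClassificationResult, frma:evaluatesImage, frma:isCorrect, frma:hasAttribute) is illustrative shorthand, not the published FRMA terms.

    PREFIX frma: <https://tw.rpi.edu/ontology/frma#>

    # Per-attribute accuracy: the fraction of correct classifications
    # among all test images carrying each attribute.
    SELECT ?attr (COUNT(*) AS ?total)
           (SUM(IF(?correct, 1, 0)) / COUNT(*) AS ?accuracy)
    WHERE {
      ?result a frma:ClassificationResult ;
              frma:evaluatesImage ?img ;
              frma:isCorrect ?correct .
      ?img frma:hasAttribute ?attr .
    }
    GROUP BY ?attr
    ORDER BY ASC(?accuracy)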

        The system presents this information as a semantically enhanced hierarchy view of attributes, with statistics for all images that contain a given attribute (e.g., ‘has goatee’) or belong under an attribute category (e.g., ‘has facial hair’). When a user selects a level in this hierarchy, they are presented with a view showing the images associated with that category and an indication of whether each was classified correctly. In addition, the user can switch between all images, correctly classified images, and misclassifications, which causes the hierarchy view to update its statistics and the images presented.

Competency Questions

In the process of developing the overall system, we decided that the system would be evaluated based on how well it could answer a specific set of questions, termed competency questions. The competency questions acted as the guiding force for the development of the system, both conceptually and technically. When deciding which questions should guide our progress, we wanted to make sure that they would demonstrate various forms of semantic knowledge. With these demonstrations of semantic knowledge, the user can be more easily assured that the system is designed well and will not produce errors or unexpected relations between concepts. The variety of problems tested by our competency questions makes the system generally more reliable.

How well would my model work at classifying mugshot photos?

This question was chosen to show how the system could be used in a potential real-life scenario. It also shows how useful a model would be in a more specific use case, where the training data will likely not correlate directly with the user's intended application. It does, however, demonstrate the ability of a user to leverage the system's semantic understanding of a model to derive new information. For example, a user (let's call him Greg) may decide that mugshots are likely to be indoor, posed photos with harsh lighting, often of people with facial hair. If the user receives an answer of, "the model has 75% accuracy on images that match the labels the user associated with mugshots," then the user has an estimate of how useful the model will be for their use case.
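A sketch of how Greg's question might be posed against the knowledge graph, restricting the accuracy computation to images that carry every attribute he associates with mugshots (the attribute IRIs below mirror Kumar-style tags and are hypothetical):

    PREFIX frma: <https://tw.rpi.edu/ontology/frma#>

    # Accuracy restricted to mugshot-like images: indoor, posed,
    # harshly lit photos (attribute IRIs are illustrative).
    SELECT (SUM(IF(?correct, 1, 0)) / COUNT(*) AS ?mugshotAccuracy)
    WHERE {
      ?result a frma:ClassificationResult ;
              frma:evaluatesImage ?img ;
              frma:isCorrect ?correct .
      ?img frma:hasAttribute frma:Indoor ,
                             frma:PosedPhoto ,
                             frma:HarshLighting .
    }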

What type of image attributes does the model have the most trouble classifying?

This question was selected because a user would find this information useful as a preliminary evaluation of a model's performance in relation to the types of images given to the model, especially as a diagnostic tool. In the case of the previous user, Greg, 75% would be an unsatisfactory result, and he may ask this question in order to find out why the accuracy is so low. For example, if 65% of the model's misclassifications occur on images that contain facial hair, then the user would know not to use this model when they expect most of their images to contain facial hair. This question was also selected because it demonstrates a semantically grounded understanding of the capabilities of a machine learning model.
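A minimal sketch of this diagnostic query, again using the illustrative vocabulary from above, ranks attributes by how often they appear on misclassified images:

    PREFIX frma: <https://tw.rpi.edu/ontology/frma#>

    # Rank attributes by their frequency on misclassified images.
    SELECT ?attr (COUNT(*) AS ?misclassifications)
    WHERE {
      ?result a frma:ClassificationResult ;
              frma:evaluatesImage ?img ;
              frma:isCorrect false .
      ?img frma:hasAttribute ?attr .
    }
    GROUP BY ?attr
    ORDER BY DESC(?misclassifications)
    LIMIT 10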

What type of image attributes does the model most associate with George Bush?

This would be useful to a user as a way of testing their system against what they would assume to be typical subject matter for the images, especially when used in conjunction with the previously discussed competency question that Greg has been asking. A potential answer of, "'facial hair' is the attribute most associated with George Bush," would be cause for concern for Greg, especially because the previous question revealed that his model has the most trouble with facial hair. This question demonstrates the same understanding as the previous questions, but with reasoning over the people depicted by the image instead of the image itself.
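A sketch of this question in SPARQL, counting how many images depicting a given person carry each attribute (frma:depicts and the rdfs:label lookup string are illustrative):

    PREFIX frma: <https://tw.rpi.edu/ontology/frma#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Attributes most frequently present on images of one person.
    SELECT ?attr (COUNT(DISTINCT ?img) AS ?imageCount)
    WHERE {
      ?img frma:depicts ?person .
      ?person rdfs:label "George W. Bush" .
      ?img frma:hasAttribute ?attr .
    }
    GROUP BY ?attr
    ORDER BY DESC(?imageCount)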

What type of occlusions does my facial recognition model misclassify most?

This question is also useful to a user for its preliminary evaluative and diagnostic properties. However, instead of evaluating with reference to image attributes, this question helps a user evaluate with reference to the occluding elements within an image that are likely to cause a misclassification; e.g., a person wearing sunglasses has occluded the picture's view of their eyes and thus may skew a model's results. Now let's say that Greg receives an answer of, "your model is most likely to fail when the mouth is covered." Greg can now speculate that the issue with his model is not merely that beards are problematic, but that beards are problematic because the model depends heavily on the mouth being clearly visible. This shows that the system can reason over the potential quality of images with regard to their usefulness for facial recognition.
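A sketch of this query, assuming occlusions are typed instances arranged under a frma:Occlusion class hierarchy (all IRIs illustrative):

    PREFIX frma: <https://tw.rpi.edu/ontology/frma#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Count misclassifications by occlusion type, restricted to
    # types that sit under frma:Occlusion in the class hierarchy.
    SELECT ?occlusionType (COUNT(*) AS ?misses)
    WHERE {
      ?result a frma:ClassificationResult ;
              frma:evaluatesImage ?img ;
              frma:isCorrect false .
      ?img frma:hasOcclusion ?occ .
      ?occ a ?occlusionType .
      ?occlusionType rdfs:subClassOf* frma:Occlusion .
    }
    GROUP BY ?occlusionType
    ORDER BY DESC(?misses)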

Which of these two models is better at classifying people with long hair?

First, this question would be useful because it allows the user to ask a broader question when the data is insufficient to answer a more specific one. Second, it allows a user to directly compare the capabilities of two models over semantically rich information. Now, let us say that Greg has acquired two satisfactory models and wishes to see how they perform when classifying people with long hair. Let us also say that there is not enough data in the system to sufficiently identify images of people with long hair. When the system gives a potential answer of, "FaceNet is 10% better than DLib at classifying people who are not known NOT to have long hair (i.e., not bald)," note that the response has been weakened to reason over images that could show long hair. This is done by filtering out the images that contain an attribute that is mutually exclusive with having long hair, e.g. bald, and reasoning over what is left. We were forced into this looser definition of long hair because of a limitation of our image tags, namely that images are only tagged as having bangs, being bald, or having a receding hairline, leaving all other hairstyles untagged. This question demonstrates the system's ability to leverage its semantic understanding of a model to build new information while using that understanding to compare the proficiency of two models. It also shows how the system is limited by the tags it contains: the FRMA ontology is built on the sometimes limited Kumar tags which, in part because they are automatically generated, miss some human-obvious features and thus can limit queries.
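A sketch of the weakened comparison, implementing the "not known NOT to have long hair" filter by excluding images tagged bald or receding hairline (model and attribute IRIs are illustrative):

    PREFIX frma: <https://tw.rpi.edu/ontology/frma#>

    # Per-model accuracy over people not known NOT to have long hair:
    # drop images tagged bald or receding hairline, keep the rest.
    SELECT ?model (SUM(IF(?correct, 1, 0)) / COUNT(*) AS ?accuracy)
    WHERE {
      VALUES ?model { frma:FaceNet frma:DLib }
      ?result a frma:ClassificationResult ;
              frma:producedBy ?model ;
              frma:evaluatesImage ?img ;
              frma:isCorrect ?correct .
      FILTER NOT EXISTS { ?img frma:hasAttribute frma:Bald }
      FILTER NOT EXISTS { ?img frma:hasAttribute frma:RecedingHairline }
    }
    GROUP BY ?model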

Technical Approach

We used RDF/OWL to encode our information, SPARQL to query over the individuals of the knowledge graph constructed from the RDF triples, and a description logic reasoner to gain additional information that could be inferred from the definitions of the terms in the ontology. "Mugshot photo" and "potentially long hair" are both inferred concepts.
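As a simplified illustration of an inferred concept (the published FRMA axioms may differ), a "mugshot-like image" can be defined as any image carrying all three mugshot-associated attributes; a description logic reasoner will then classify such images even though no "mugshot" tag exists. "Potentially long hair", by contrast, is handled with the closed-world NOT EXISTS filters shown earlier, since under OWL's open-world semantics a complement class would capture only individuals explicitly known not to be bald.

    PREFIX frma: <https://tw.rpi.edu/ontology/frma#>
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>

    # An image is mugshot-like iff it carries all three
    # mugshot-associated attributes; a DL reasoner can then
    # classify images without any explicit "mugshot" tag.
    INSERT DATA {
      frma:MugshotLikeImage owl:equivalentClass [
        a owl:Class ;
        owl:intersectionOf (
          frma:Image
          [ a owl:Restriction ; owl:onProperty frma:hasAttribute ;
            owl:hasValue frma:Indoor ]
          [ a owl:Restriction ; owl:onProperty frma:hasAttribute ;
            owl:hasValue frma:PosedPhoto ]
          [ a owl:Restriction ; owl:onProperty frma:hasAttribute ;
            owl:hasValue frma:HarshLighting ]
        )
      ] .
    }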

Conceptual Approach

        The Face Recognition Model Analyzer Ontology (FRMA) specifically outlines the concepts of “occlusions” and “wearable objects.” Wearable objects are items recognized by facial recognition algorithms (e.g., eyeglasses, earrings, hats). FRMA breaks these items into two categories: headwear (e.g., eyeglasses, hats) and ornaments (e.g., jewelry, neckwear, and makeup). Occlusions model the concept of something that blocks the view of the subject within an image. FRMA breaks occlusions into two further categories: cervical occlusions (anything obstructing the subject’s neck) and facial occlusions (modeling the upper and lower portions of the subject’s face separately).
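Sketched with illustrative IRIs, these two taxonomies look as follows (the published FRMA class names may differ):

    PREFIX frma: <https://tw.rpi.edu/ontology/frma#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # The wearable-object and occlusion hierarchies described above.
    INSERT DATA {
      frma:Headwear             rdfs:subClassOf frma:WearableObject .
      frma:Ornament             rdfs:subClassOf frma:WearableObject .
      frma:CervicalOcclusion    rdfs:subClassOf frma:Occlusion .
      frma:FacialOcclusion      rdfs:subClassOf frma:Occlusion .
      frma:UpperFacialOcclusion rdfs:subClassOf frma:FacialOcclusion .
      frma:LowerFacialOcclusion rdfs:subClassOf frma:FacialOcclusion .
    }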

The FRMA Hair Ontology aims to more precisely describe the hair on the subject’s head and/or face. Because human hair is incredibly varied, the hair ontology focuses on multiple traits to more completely describe a person’s head and/or facial hair.

The FRMA Image Ontology enables the FRMA system to classify the properties of individual image files. This metadata comprises semantic descriptors of the visual image and its subject rather than native image metadata describing camera settings and file information. FRMA is interested in semantically defining the breadth of content within the image rather than merely interpreting the image itself as a flat photograph. This ontology reuses the LIO ontology to describe general image features and the different depicted pictorial elements of an image, such as background and subject.
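A hypothetical "smart" image illustrates the idea; the lio: namespace and the lio:depicts property are assumed from LIO, and all frma: terms are illustrative:

    PREFIX frma: <https://tw.rpi.edu/ontology/frma#>
    PREFIX lio:  <http://purl.org/net/lio#>

    # A "smart" image: semantic descriptors of the depicted content,
    # not camera settings or file metadata.
    INSERT DATA {
      frma:image_0042 a frma:Image ;
          lio:depicts frma:person_42 ;
          frma:hasAttribute frma:Outdoor , frma:Smiling ;
          frma:hasOcclusion [ a frma:UpperFacialOcclusion ;
                              frma:causedBy frma:Sunglasses ] .
    }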

The FRMA Machine Learning Model Ontology allows users to describe the learning process, the structure of the learned model, and the evaluation of the model from a data-centric perspective. This ontology reuses the FIBO arrangements ontology to describe model components as collections. For example, neural networks are described as a collection of layers: fully connected, pooling, inception, etc.

The FRMA Person, Face, and Demographic sub-ontology focuses specifically on the demographic and facial features of a person as they appear in images, including descriptions such as facial expression, age range, and nose shape. This ontology reuses the Uber-anatomy ontology to describe the different sections of the face and the FIBO Agents ontology to describe the attributes of the person within the image.

The FRMA Wearable Things Ontology addresses the fact that, in facial recognition, people in the images may be wearing some type of clothing or accessory. Those items may effectively block, or occlude, part of someone's face. The Wearable Things Ontology was created to keep track of the various clothing and accessories that could disrupt a facial recognition algorithm; it acts as a place to hold potential sources of occlusion that also happen to be wearable things.

For more details, see https://github.com/FRMA-Ontology/Ontology/tree/master/conceptual_map.

Evaluation

To evaluate the system, we used the competency questions that guided the implementation as the basis for the evaluation as well. The basic idea behind this decision was that we implemented the system specifically to answer these questions, so the same questions could also measure the system's value and performance. In deciding whether a competency question has been answered to a satisfactory degree, we used two informal criteria: whether a response was "seemingly correct," that is, whether a user would be willing to accept the answer as a potential evaluation of their model, and whether a response was "useful," that is, whether it provided information the user did not previously possess or have easy access to. Further discussion of why certain responses to competency questions would be considered satisfactory can be found in the aforementioned Competency Questions section.

Our system answers all of the competency questions to a satisfactory level. Unfortunately, this is not as strong a claim as it sounds. Because of how we engineered the system, building exactly to complete the competency questions and occasionally refining the competency questions based on what was possible to engineer, what we consider a satisfactory ("seemingly correct" and "useful") response is not necessarily what a typical user in the real world would consider satisfactory. For this reason, one of our main items of future work is to design and run a full experiment with outside participants to gather a better evaluation of the system.

Discussion

In developing our ontology and our SPARQL queries, we used Kumar image tags[2] that describe attributes of all the images in the Labeled Faces in the Wild training set. We were able to use these attributes, along with the accuracy information (whether or not the person in the image was identified correctly), to create a system that analyzes machine learning models in terms of how well they identify faces with specific attributes. However, there is a limited number of attributes in the data, so for concepts not covered by the attribute data it became a challenge to specify what would constitute an image with that attribute. For example, there is no attribute for long hair, so to answer our competency question about how well models do with people with long hair, the best we could do with the data we had was to look for people with potentially long hair by excluding anyone who has properties known to be disjoint with having long hair. In our tag set, that only excluded people who were bald or had a receding hairline.

Setting aside these small problems, our ontology is able to infer many useful concepts and provides detailed information that can prove useful to people developing machine learning models, as well as to their potential clients. If machine learning model developers want to know how they can improve their model, they can use our system to pinpoint what most commonly causes the model to make mistakes. As for a client deciding which machine learning model to use for a certain task, instead of simply looking at the overall accuracy statistics of the models, they can look at a more detailed analysis of how each model performs on different subsets of the data. For example, the police may be interested in a facial recognition model that performs well specifically on identifying mugshots. Our system would allow the police to see which models perform best on mugshot photos alone.

Value of the Semantics

The “smart” images that we developed gave us access to semantic information about the images that a facial recognition model would be tested against. This wide range of data (the conditions of the photo, the subject matter, what the subject was wearing, even the setting of the image, to list a few) allowed us to understand not only the images themselves but also the subject matter of the images in a semantically rich manner. We then added external semantic understandings of the world, which let us not only answer more easily those questions whose answers could be read directly from the data, but also justifiably infer new information about the model and answer additional questions from that inferred information.

    There are several concrete examples of questions that would have been impossible, or at the very least much more difficult, to answer without a semantic understanding of the subject matter of the models being analyzed and of the external world. For our first example, consider the question, “What type of occlusions does my facial recognition model misclassify most?” The smart images give information about what is being depicted in an image, down to what a person is wearing and their hairstyle, and specifically about whether or not the forehead of a person being depicted in an image is occluded. Without leveraging an understanding of the world of the image, we would have only been able to reason over whether or not the forehead of a person is occluded. Instead, because the system has a semantic understanding of the various accessories that a person in an image may wear, specifically which parts of the body they typically block, and of the various common hairstyles that people wear, specifically how they fall on the face and which facial features they are likely to occlude, the system can infer which parts of a depicted person are occluded by what they are wearing.

    For another example of a question that would be impossible to answer without a semantic understanding of the images and the world in which they were taken, consider the question, “Which of these two models is better at classifying people with long hair?” The smart images have no data that directly correlates to the subject of an image having long hair. Because of this, without an understanding of the nature of various hairstyles, it would simply be impossible to answer such a question. However, because we do have such information, we are able to reason over the data to find an equivalent, or at least broader, answer to the question that can still provide some value to a user. In this specific case, the system understands that certain hairstyles are mutually exclusive with being described as having long hair, being bald for example, and can provide an answer reasoned over the images that cannot be immediately ruled out. With this, the system has created value where previously there would have been none.

    For the rest of the competency questions, we will freely state that a semantic understanding of the world is not directly required to answer them. We will, however, assert that a semantic understanding of the data enhances the answers one gets from these questions. Our reasoning is that many of the image attributes that we get from the smart images form natural is-a hierarchies and disjoint sibling relationships that allow the system to provide a better answer than if those relationships were unknown. For example, consider the competency question, “What type of image attributes does the model have the most trouble classifying?” Naturally, beards are a kind of facial hair, so in the case where 65% of the total misclassifications occur on facial hair and 60% of the total misclassifications occur on beards specifically, it is more correct (read: more valuable) for a system to tell a user that 60% of their misclassifications occur on beards than that 65% of their misclassifications occur on facial hair.
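As a sketch of how the hierarchy supports this, assume the attributes are modeled as classes related by rdfs:subClassOf (e.g., frma:Beard rdfs:subClassOf frma:FacialHair; again illustrative IRIs). Rolling misclassification counts up the hierarchy then lets the tool report at whichever level is most specific and still informative:

    PREFIX frma: <https://tw.rpi.edu/ontology/frma#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Roll misclassification counts up the attribute hierarchy, so
    # the tool can report "60% on beards" rather than only
    # "65% on facial hair".
    SELECT ?category (COUNT(*) AS ?misses)
    WHERE {
      ?result a frma:ClassificationResult ;
              frma:evaluatesImage ?img ;
              frma:isCorrect false .
      ?img frma:hasAttribute ?attr .
      ?attr rdfs:subClassOf* ?category .
    }
    GROUP BY ?category
    ORDER BY DESC(?misses)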

Related Work

There are several research efforts currently trying to improve our understanding of machine learning models. Currently the most successful of these efforts is Google’s What-If Tool[5]. This tool allows users to visualize inference results, explore the effects of a feature on model results, find common attributes across data samples, and discover counterfactual examples. Their approach is entirely model-driven, and no effort is made to understand the semantic attributes of the data the model is operating on; this allows them to remain domain- and task-independent.

Another avenue of research is being explored by Pascal Hitzler’s semantics group at Wright State University. They are exploring how to add explainability to machine learning, and have developed a technique that derives inference rules from the positive and negative examples found in training data[6]. Our approaches are similar in that we both develop an ontology around the input to and the output of machine learning models. We differ in that their focus is on explaining what is learned, while our ontology explores the limitations of a model.

Future Work

One of the first things that we want to do is complete a full experiment in which we validate or challenge our own preliminary evaluations. So far, we have only our own judgments of whether the system passed the evaluation questions, questions that we ourselves designed and decided were complete. Naturally, there is a conflict of interest inherent in this pattern, so we want to design and run a full experiment to collect data that would allow us to strengthen our claims.

A few avenues of progression have shown potential for future exploration. One such avenue, on which we have already begun work, is expanding the semantic vocabulary of the face recognition system. The smart images that we have leveraged to gain our current semantic understanding of images are often too sparse, or occasionally outright inaccurate, for the system to generate results in which we can be fully confident. There would be value in cleaning up the smart images by validating the data within them, adding more image attributes to increase the number of potential use cases, and clarifying the definitions of attributes that are more subjective in nature. However, we recognize that this would likely require an unsustainable number of human work hours.

There is also projected value in further enhancing the models of the world around the images. Many of these ontological models could find use in other applications and are worth maintaining and expanding on their own merit.

Lastly, there is projected value in attempting to show that the successful implementation of a system on top of this ontological tool for describing machine learning models can be repeated for other, disparate fields of study. To put it simply, we have seen how we can describe and analyze facial recognition models; now we want to see how easily we can use the same architecture to create a system that analyzes, for example, natural language processing models.

Conclusions

When working with machine learning, it is a well-known drawback that one often ends up creating incomprehensible models whose inner workings are impossible to directly understand or analyze. This work aimed to make progress toward gaining a semantic understanding of the world around those incomprehensible artifacts, primarily their input, their output, and their world of use, and our preliminary evaluations suggest that we have made valuable progress toward this goal. To demonstrate this, we implemented a system that not only enhances the answers to questions one could already ask about a face recognition model, but also infers answers to questions that could not be asked without a semantic understanding of the model. While we freely state that the current evaluations are not sufficient evidence to declare the project a complete success, we claim that they demonstrate the potential of continuing this line of research.

Acknowledgements

        We thank Prof. Deborah L. McGuinness, Ms. Elisa Kendall, Jim McCusker, and Rebecca Cowan for their support during the “Ontologies” class in the fall of 2018 at Rensselaer Polytechnic Institute (RPI).
        This work was conducted using the Protégé resource, which is supported by grant GM10331601 from the National Institute of General Medical Sciences of the United States National Institutes of Health.

References

[1] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua, “Labeled faces in the wild: A survey,” in Advances in face detection and facial image analysis. Springer, 2016, pp. 189–248.

[2] N. Kumar, A. Berg, P. N. Belhumeur, and S. Nayar, “Describable visual attributes for face verification and image search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 10, pp. 1962–1977, 2011.

[3] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.

[4] D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009.

[5] WIT | Playing with AI Fairness. [Online]. Available: https://pair-code.github.io/what-if-tool/index.html. [Accessed: 09-Dec-2018].

[6] M. K. Sarker, N. Xie, D. Doran, M. Raymer, and P. Hitzler, “Explaining trained neural networks with semantic web technologies: First steps,” arXiv preprint arXiv:1710.04324, 2017.

