#### Summary

*Knowledge graphs and embeddings are gaining popularity in the data science world.* *In this post, we will focus on one class of graphs called Knowledge Graphs, describe some of the problems practitioners are trying to solve using them and explain how machine learning is used in this context.*

#### Introduction

In our previous post, we introduced the world of graphs and networks, giving an overview of the basic concepts used to study and understand them. While these ideas are relevant across the entire spectrum of applications associated with graphs, more is needed to help solve the specific problems researchers and practitioners face in the different areas where graphs are used. This is where the value of tailored machine-learning techniques comes into play.

Mosaic Data Science’s creative approach to solving legacy problems makes us well-positioned to help your business take full advantage of the latest developments in graph analytics and networks, which can be applied to solve various use cases within all sectors. In this next post within our graph analytics series, we focus on one class of graphs called Knowledge Graphs, highlighting some of the complex problems they can help solve when paired with advanced data analytics.

#### Knowledge Graphs

A graph is an abstraction used to represent relationships or interactions between different elements. Figure 1 shows an example of a graph with five nodes (also called vertices), denoted by *{v_{1}, v_{2}, v_{3}, v_{4}, v_{5}}*, and a set of interactions (called links, edges, or arcs), represented by lines connecting pairs of nodes. Some interactions are directed, like the link going from *v_{4}* to *v_{5}*, meaning that the relationship works in one direction only (like a one-way street). In contrast, the rest of the links in the graph are undirected, signaling relations that are valid both ways.

A Knowledge Graph (KG) is a particular class of graph that encodes information by letting nodes represent a heterogeneous set of entities and endowing links with semantic meaning that describes how such entities relate to each other. Let us explore some of the intricacies of knowledge graphs through a series of examples.

Figure 2 shows a small geographic Knowledge Graph, with boxes representing entities and arrows representing links. Some noticeable differences quickly stand out between this KG and the generic graph shown in Figure 1. Firstly, the KG has a semantic meaning attached to each one of its elements, which is articulated in the form of *triples*. A triple, denoted as <*h, r, t*>, is defined as a combination of two nodes, called head (*h*) and tail (*t*), and a relation (*r*) between them. Triples encode facts; thus <*Europe, is a, continent*> and <*Nepal, is a region of, South Asia*> are a couple of facts from our KG.
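To make this concrete, triples can be stored directly as tuples. Here is a minimal sketch in Python with a handful of illustrative facts loosely based on Figure 2 (the exact entity and relation names are assumptions for illustration):

```python
# A tiny knowledge graph stored as a set of (head, relation, tail) triples.
kg = {
    ("Europe", "is a", "Continent"),
    ("Asia", "is a", "Continent"),
    ("Nepal", "is a", "Country"),
    ("Portugal", "is a", "Country"),
    ("Nepal", "is a region of", "South Asia"),
    ("South Asia", "is a region of", "Asia"),
}

# Membership tests amount to fact lookups.
print(("Europe", "is a", "Continent") in kg)              # True: the fact is stored
print(("Armenia", "is a region of", "South Asia") in kg)  # False: the fact is absent
```

Under the open-world assumption, the `False` in the second lookup only tells us the fact is absent, not that it is untrue.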

Secondly, our KG fails to inform us of multiple facts, like what continent Portugal belongs to, whether Poland is in the Iberian Peninsula, or whether Armenia is a region of South Asia. In practice, missing facts are common in KGs, but the interpretation of such an absence depends on whether we have a closed-world or an open-world KG. The assumption behind the former is that any missing relation implies the relation is false, whereas for the latter, missing relations signal ignorance (the fact could be either true or false, but we do not know).

Let us investigate what type of KG is shown in Figure 2. Under the closed-world assumption, the missing fact <*Armenia, is a region of, South Asia* > would be false, which happens to be correct, whereas <*Portugal, is a region of, Europe*> would be false too, which we know is incorrect. Hence, for our KG to be right, it must be an open-world KG. In practice, the type of world assumption is established by the KG creators and communicated to the KG users, so we do not need to deduce it as we did here. However, verifying the coherence between the KG construction and its world type could be an essential data quality step.

The information in our KG allows us to do different things. We could ask what countries are in Asia. A query of this type is easily solved by finding the intersection between the entities that are countries and the entities that are regions of Asia. By doing so, we get Nepal and Armenia, but we don’t get India. A quick inspection uncovers another missing relation: Like Portugal, India is not connected to any continent. How do we determine what continent India belongs to?
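The intersection query described above can be sketched with set comprehensions. The triples below are hypothetical stand-ins for the Figure 2 facts, including India's missing continent link:

```python
# Hypothetical triples: Nepal and Armenia link directly to Asia,
# while India only links to South Asia (its continent is missing).
kg = {
    ("Nepal", "is a", "Country"),
    ("Armenia", "is a", "Country"),
    ("India", "is a", "Country"),
    ("Nepal", "is a region of", "Asia"),
    ("Armenia", "is a region of", "Asia"),
    ("India", "is a region of", "South Asia"),
}

# "What countries are in Asia?" = countries ∩ regions of Asia.
countries = {h for (h, r, t) in kg if r == "is a" and t == "Country"}
in_asia = {h for (h, r, t) in kg if r == "is a region of" and t == "Asia"}

print(sorted(countries & in_asia))  # ['Armenia', 'Nepal'] -- India is missed
```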

Knowledge graphs allow us to apply formal logic rules to deduce facts. Knowing that India is a region of South Asia, which in turn is a region of Asia, lets us conclude what continent India is in. However, the strength of this deduction depends not only on the correctness of the available facts but also on how exhaustive they are. Consider the syllogism a) India is a region of South Asia, b) South Asia is a region of Asia, therefore c) India is a region of Asia. For this to always work, *is a region of* must be exhaustive, i.e., all possible ways in which something is a region of something else should be included. Suppose now that we added the fact <*Azores islands, is a region of, Portugal*> to our KG.
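This kind of deduction amounts to computing a transitive closure over the *is a region of* relation. A small sketch, assuming the relation is stored as head–tail pairs:

```python
def transitive_closure(pairs):
    """Repeatedly apply: if (a, b) and (b, c) hold, infer (a, c)."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        inferred = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if not inferred <= closure:
            closure |= inferred
            changed = True
    return closure

region_of = {("India", "South Asia"), ("South Asia", "Asia")}
print(("India", "Asia") in transitive_closure(region_of))  # True
```

Note that this deduction is only sound if *is a region of* really is transitive and exhaustive, which is exactly the caveat discussed above.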

If we applied the same type of syllogism we used for India, we would conclude that the Azores are a region of the Iberian Peninsula, which is false. What went wrong? In this case, the problem is the ambiguous meaning of “region,” which can be used in a purely geographic or geopolitical sense. Since that distinction does not exist in our KG, we’re equivocating a geopolitical fact (the Azores is a region of Portugal) with a spatial fact (Continental Portugal is in the Iberian Peninsula). Likewise, we implicitly equated Portugal with its continental region, which is inaccurate.

When building and using a KG, it is crucial to understand the conceptual categories used to define the entities and relations and adopt a consistent naming convention. This refers to the ontology or schema of the KG. One of the shortcomings of our geographical KG is its inability to distinguish between spatial and political divisions. Suppose we modify the relations by replacing *is a region of* with *is located in* for spatial information and *is part of* for geopolitical information. This improved schema allows us to easily include new facts with less ambiguity. Figure 3 brings a new geographic KG that incorporates the new relations we just defined, plus some additional entities and facts (some countries are not shown for readability). By choosing this new schema that decouples spatial and geopolitical facts, we can encode things we could not encode before, like the fact that even though Spain is part of Europe, it has two African cities.

Notice also how, under this new schema, there is still some risk — albeit smaller — of wrongly concluding that the Azores are in the Iberian Peninsula. If we knew that the archipelago and peninsula are mutually exclusive concepts, we could confidently conclude that the Azores are not in the Iberian Peninsula. One way this can be achieved is by including a negative relation like *is not a* in the schema. Having such a relation allows us to incorporate <*Archipelago, is not a, Peninsula*> as a fact (remember that this is necessary only because ours is an open-world KG), which is the missing piece we needed before applying the deductive machinery to conclude with certitude that the Azores are not in the Iberian Peninsula.

Before moving on, let us recap what we have learned so far. Knowledge Graphs are specific types of graphs endowed with semantic meaning that allow us to encode information. The constitutive elements of a KG are the triples, which contain facts about pairs of entities, called head and tail, articulated by a relation. For example, in the triple <*Poland, is a, Country*>, Poland is the head, Country is the tail, and *is a* is the relation. Furthermore, we can interrogate a KG by combining entities and relations. For example, if we know the entity Poland and the relation *is a* are part of our KG, we can ask, ‘what is Poland?’, which is solved by retrieving the fact <*Poland, is a, Country*>. Other questions like ‘what continent is Portugal part of?’ cannot be answered by any of the existing facts in our KG, but sometimes it is possible to arrive at an answer using formal logic.

#### Predicting Links

One way we can frame the problem of interrogating a KG is as a link prediction problem. Consider the question, ‘what is Poland?’. We know the answer to such a question must be a fact of the form <*Poland, is a, ?*> where the question mark denotes that we are looking for a tail. If we take all possible tails and plug them into the triple, the task is reduced to testing the veracity of the facts generated.

For example, <*Poland, is a, Peninsula*>, <*Poland, is a, Continent*>, <*Poland, is a, Country*> are three potential facts whose veracity we would like to test. We know that the latter is true because that triple already exists in the KG, but that tells us nothing about the former two facts under the open-world assumption. Moreover, if we wanted to know what continent Portugal is a part of, we would generate triples like <*Portugal, is part of, Europe*> and <*Portugal, is part of, Africa*> — none of which is a fact in our KG — and try to qualify them as true or false.
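Generating and checking candidate triples can be sketched as follows. Under the open-world assumption, absence from the KG leaves a candidate's status unknown rather than false:

```python
# Turn the question <Poland, is a, ?> into candidate facts by plugging
# every known entity into the tail slot. Entity names are illustrative.
entities = ["Country", "Continent", "Peninsula", "Archipelago"]
candidates = [("Poland", "is a", t) for t in entities]

known_facts = {("Poland", "is a", "Country")}
for triple in candidates:
    status = "known true" if triple in known_facts else "unknown (open world)"
    print(triple, "->", status)
```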

As we hinted at in our previous discussion, the application of inductive and deductive reasoning is one tool we can use to investigate the veracity of potential facts. We have neglected to mention that these logic-based mechanisms rely heavily on the properties of the relations. Indeed, as part of the ontology, relations can have different properties that can be exploited as part of the link prediction task. For instance, *is located in* refers to space, and thus it is a transitive relation. If we added <*Madrid, is located in, Continental Spain*> to the KG, we could immediately conclude that <*Madrid, is located in, Iberian Peninsula*> is true.

While defining the schema, it is essential to describe the properties of each relation. For a further treatment of properties and how they are used in logical reasoning, the curious reader is referred to the book “Knowledge Graphs” by Hogan *et al.,* which contains a thorough presentation of inductive and deductive knowledge, among other topics.

Although logic-based reasoning is powerful, it may not always be helpful in practice. One of the challenges this approach faces is scalability. Indeed, size and sparsity tend to go hand in hand in the KG world, which markedly undermines the efficiency and accuracy of logical engines. For this reason, practitioners are inclined to rely on machine-learning approaches to handle large graphs. In the rest of this article, we will focus on embeddings, the workhorse of machine learning on knowledge graphs.

#### Knowledge Graphs Embeddings (KGE)

Facts in a KG are intelligible to humans but not to machines. To learn the facts of a KG, and more importantly, to make predictions, words like Nepal and relations like *is part of* are *embedded* into a multidimensional numerical space where each one of these elements is represented by a vector called an embedding. These Knowledge Graph Embeddings (KGE) are not defined by humans but are learned as part of the machine learning process.

To illustrate how KGEs work, let us consider the principal elements and operations of a machine-learning model on knowledge graphs. Firstly, we need to decide how to represent the entities and relations. For our example, we choose a 2-dimensional space. In this space, entities will be seen as points, whereas relations will be translations denoted by vectors. This representation corresponds to TransE, the simplest of the translation-based KGE models and arguably the easiest to understand.

Figure 4A represents the typical first step in machine learning, where elements are initialized at random. For simplicity, we only show five entities and one relation. Translation models generally work by making a composition of the type *head + relation,* as illustrated in Figure 4B. For example, if we apply *is a* to the Azores in our embedding space, we obtain a new point (denoted by a star) in that same space. This point is the embedding of the tail, which is obtained after translating the head.

Remember that this new tail exists in the embedding space as a numerical vector, which is meaningless to humans; hence, we need to map it back to the original entity space where we can make sense of it in the form of a triple. One way to do the mapping is by calculating the distance between the star and the five entities shown in the figure and selecting the closest entity to the star. As we can see, the star is very far from any entity, which is expected since we have not trained anything yet, and the best triple we can produce based on distance is a self-referential <*Azores, is a, Azores*>.
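The composition-then-nearest-neighbor step can be sketched with NumPy. The embeddings here are randomly initialized toy values, so before training the nearest entity is essentially arbitrary:

```python
import numpy as np

# Toy 2-D embeddings, randomly initialized as in Figure 4A.
rng = np.random.default_rng(0)
entities = ["Azores", "Archipelago", "Poland", "Country", "Spain"]
ent_emb = {e: rng.uniform(-1, 1, size=2) for e in entities}
rel_emb = {"is a": rng.uniform(-1, 1, size=2)}

# Translate the head by the relation: the "star" in Figure 4B.
star = ent_emb["Azores"] + rel_emb["is a"]

# Map the star back to an entity by nearest Euclidean distance.
nearest = min(entities, key=lambda e: np.linalg.norm(star - ent_emb[e]))
print(nearest)  # with untrained embeddings, this is essentially arbitrary
```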

However, since we know the correct tail for the pair (*Azores*, *is a*), we can formulate an error measure or loss as |*head + relation – tail*| with respect to the known tail (Archipelago), where |·| is either the Manhattan or the Euclidean distance. This can be extended over many examples to estimate an average loss, which an optimizer can use to reposition the entities and change the size and angle of the relations until a stopping criterion is achieved. Figure 5 offers an idea of how the embeddings may look after the training process converges (for a better illustration, we have included more entities). Notice that Poland, Nepal, and Spain are close to each other, so that the translation for *is a*, when applied to the embedding of any of these three entities, results in an embedding of a nearby Country. A similar thing happens with the archipelagos. More importantly, Country and Archipelago are far from each other, so there is no risk of confusion between the two concepts.
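A minimal sketch of this training loop, using plain gradient descent on the squared Euclidean loss for two hypothetical facts (real TransE training also uses negative sampling and embedding normalization, omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(42)
entities = ["Azores", "Archipelago", "Poland", "Country"]
facts = [("Azores", "is a", "Archipelago"), ("Poland", "is a", "Country")]

# Random initialization of 2-D embeddings.
ent = {e: rng.uniform(-1, 1, size=2) for e in entities}
rel = {"is a": rng.uniform(-1, 1, size=2)}

lr = 0.1
for step in range(200):
    for h, r, t in facts:
        # Gradient of 0.5 * ||head + relation - tail||^2 w.r.t. each embedding.
        diff = ent[h] + rel[r] - ent[t]
        ent[h] -= lr * diff
        rel[r] -= lr * diff
        ent[t] += lr * diff

# After training, head + relation lands very close to the correct tail.
for h, r, t in facts:
    print(h, "->", np.linalg.norm(ent[h] + rel[r] - ent[t]))
```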

In the previous section, we talked about predicting links. We can generalize the idea to predict triples in the embedding space. For each entity in the KG, the question <*entity, is a, ?*> can be approached by scoring all possible triples involving *is a* in the embedding space. Scoring here is done with the same loss function used to train the KGE. For any given head, the most likely answer is the triple with the best score. Ideally, all facts included in the training set should get the best scores. Moreover, we would expect new true facts to receive high scores. Notice that a similar process can be followed for questions <*entity, ?, entity*> and <*?, is a, entity*>.
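Scoring candidate tails with the same distance can be sketched as follows; the embeddings below are hand-picked to mimic a trained model rather than learned:

```python
import numpy as np

# Hand-picked 2-D embeddings standing in for a trained model.
ent = {
    "Poland":      np.array([0.0, 0.0]),
    "Country":     np.array([1.0, 1.0]),
    "Archipelago": np.array([-2.0, 3.0]),
    "Continent":   np.array([3.0, -1.0]),
}
rel = {"is a": np.array([1.0, 1.0])}

# Score every candidate tail for <Poland, is a, ?>:
# a lower ||head + relation - tail|| means a more plausible fact.
star = ent["Poland"] + rel["is a"]
scores = {t: float(np.linalg.norm(star - ent[t]))
          for t in ("Country", "Archipelago", "Continent")}
best = min(scores, key=scores.get)
print(best)  # Country: distance 0 with these hand-picked values
```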

Suppose that we decided to include a new relation *is neighbor of* in our geographic KG. We could use such a relation to include the fact <*Spain, is neighbor of, Portugal*>. This is a symmetric relation, so <*Portugal, is neighbor of, Spain*> is equally valid. However, if we trained our KGE with TransE, we would fail to capture both facts, because symmetry would imply the ability to move from head to tail and then back to head by adding the same vector, which is impossible unless the vector is zero. But even facts that TransE can represent may still prove challenging. For example, Indonesia, Fiji, and the Philippines are both countries and archipelagos. TransE would struggle to represent these facts with a single *is a* translation while keeping a good separation between entities that are both and entities that are only one or the other.
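The symmetry problem can be seen algebraically: requiring both *head + relation ≈ tail* and *tail + relation ≈ head* forces the relation vector to zero, which in turn forces head and tail to coincide. A quick numerical check with hypothetical embeddings:

```python
import numpy as np

# Hypothetical 2-D embeddings for Spain and Portugal.
h = np.array([0.3, -0.7])  # Spain
t = np.array([0.9, 0.2])   # Portugal

# TransE needs  h + r = t  and  t + r = h  simultaneously.
# The least-squares r satisfying both is the average of the two residuals:
r = ((t - h) + (h - t)) / 2
print(r)                      # [0. 0.] -- the relation collapses to zero
print(np.allclose(h + r, t))  # False unless Spain and Portugal coincide
```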

Other models within the translation-based family address these limitations, with pros and cons in inference ability, computational efficiency, and accuracy (for surveys on KGE models, see the articles by Dai *et al.* and Wang *et al.*). In general, practitioners explore different KGE models and embedding dimensions, and even combine them with ensemble methods, to search for more accurate predictions on knowledge graphs.

#### Conclusion

In this post, we introduced the topic of knowledge graphs and knowledge graph embeddings. Our treatment of these topics is far from comprehensive; instead, we have focused on helping the reader understand the main concepts and some of the intricacies and challenges that lie beneath the surface when these technologies are adopted.

While we explained how some of the mechanisms used to perform predictions on knowledge graphs work, we have yet to deal with how prediction can be used in practice. In our next post of this series, we will look at the different applications of knowledge graphs, what is being predicted and why, and how this impacts specific industry sectors such as the biomedical space, where knowledge graphs and embeddings are used for drug discovery and patient journey visualization.

Mosaic Data Science applies data science and machine learning to generate value for your business, helping companies across every industry fill their data science talent gap with a flexible engagement model. Our highly skilled team of data scientists can match specific data analytics solutions to your unique needs. If you are considering using machine learning to derive value from knowledge graphs, do not hesitate to reach out.