77. GUILTY SHROUD

For too many of my caregiving years, looking back I see guilt’s shroud wrapped dark about my shoulders. Guilt that I did not always smile, act nice while I pushed a fork of scrambled egg between…

Smartphone

独家优惠奖金 100% 高达 1 BTC + 180 免费旋转




The case for approximate graph search

Introducing Fornax: A new way to explore knowledge graphs using Python

The idiosyncratic nature of noisy knowledge graph data makes it inherently hard to search. Where the user may reasonably expect some relationships to be present, the representation can be different to expectations. This necessitates a fuzzy approach where labels and relationships need to be specified approximately rather than strictly.

Take for example the social graph below centred around Queen Victoria and Prince Albert produced using data from Wikidata.

Figure 1: A social graph of European royalty focused on Queen Victoria and Prince Albert. Purple nodes: individuals, red nodes: alternative names, blue nodes: descriptions. Some nodes have been omitted for clarity.

Queen Victoria is represented by a node labelled simply as “Victoria” and the graph contains 15 other alternative names (or aliases) such as “Queen Victoria”, “Victoria Hanover”, “Queen of Great Britain” etc. There is no single correct or canonical name for Queen Victoria, just a set of subjective alternatives where the user can be left with no a priori way of knowing how this choice was made. For a larger graph “Victoria” is also unlikely to be unique.

Queen Victoria’s mother also had the given name Victoria but she is somewhat inconsistently represented by the label “Princess Victoria” rather than “Victoria”. One of her alternative names is “Victoria of Saxe-Coburg-Saalfeld” however “Victoria of Saxe Coburg Saalfeld” (without hyphens) is not listed as an alternative name despite seeming like an equally reasonable name.

The bottom line is that, unexpectedly, the following patterns do not exist in the graph and strict queries for those patterns will return an empty result:

In order to search the graph it feels like we need intimate familiarity with its content, a kind of Catch 22 situation which will not scale to large social graphs or any other large graphs for that matter.

NeMa and Fornax are best understood by example. Below is a simple query graph which can be interpreted as “I’m looking for an entity ‘Victoria’ who is related to the entity ‘Victoria of Saxe Coburg Saalfeld’”

As we have seen the label “Victoria of Saxe Coburg Saalfeld” does not exist in the graph but there is a similar label “Victoria of Saxe-Coburg-Saalfeld”. This kind of discrepancy amounts to a string discrepancy or label discrepancy. A suitable similarity metric between labels could be Levenshtein distance or the Jaccard index. Fornax allows the user to use any such metric since this choice can be very domain specific.

We have also seen there is no node with a label similar to “Victoria of Saxe Coburg Saalfeld” which has a neighbour “Victoria” (or a distance of 1 between them), but there is a Victoria which has a distance or 2. This amounts to a topological discrepancy.

Fornax is tolerant of label and topological discrepancies and will return the top-k subgraphs in figure 1 (the target graph) that resemble figure 2 (the query graph). For example two sensible results would be

Figure 3: A subgraph of figure 1 that is similar to figure 2

Fornax is a Python library. It exposes a GraphHandle class that can be used to manipulate target and query graphs. The relationships between nodes are stored in either SQLite or Postgres via a Connection interface. Graphs can be loaded directly into the database or via the Python interface provided. When a query is executed Fornax submits a SQL query to the database and then processes the result. This pushes a lot of the work to the database and its index rather than the python interpreter so Fornax is more performant.

The creation of a graph typically looks like this:

The execution of a search typically looks like this

The label similarity scores are provided directly by the user for maximum flexibility. The scores can be calculated by a simple python function, a full-text search engine such as an ElasticSearch cluster running over many nodes, or anything in between.

Fornax makes graph search accessible to data scientists and developers who want to mine knowledge graphs for information even when the data is inhomogeneous and noisy.

A special thanks goes to Matt Dean at Digital Catapult for his help.

Khan, Arijit, et al. “Nema: Fast graph search with label similarity.” Proceedings of the VLDB Endowment. Vol. 6. №3. VLDB Endowment, 2013.

Add a comment

Related posts:

Top UX Design System or design guideline for top organisation.

Material Design System (Google.Inc). “Top Design System” is published by Abin Raj M™.

5 Things I Do To Be At Peace In Any Situation

Do you want to be at peace always? Do you want to keep yourself serene, still, calm in any situation? Do you want to live a life in tune with Nature? A few years ago, I eagerly wanted to live a…

Fated to Chaos. I Chose Wisdom.

In Chinese astrology, it is said that a snake born in winter inherently will have a stormy life and tempestuous personality. I am still, day by day, unpacking our so-called love and what it really…