Friday, April 19, 2024

Exploring new features of Apache TinkerPop 3.6.x in Amazon Neptune

Amazon Neptune version 1.2.1.0 now supports the Apache TinkerPop 3.6.x release line, which offers a number of major new features and improvements to existing functionality. New features include additions to the Gremlin language itself, like the TextP.regex predicate for filters and the mergeV() and mergeE() steps, which should help simplify complex upsert-like functionality.

In this post, we give an overview of the most critical and interesting changes as they pertain to Neptune, to help you understand the implications of upgrading to and using this new release.

New Gremlin syntax

It has been a long time since TinkerPop expanded the Gremlin language as much as it did in 3.6.x. The following sections outline those additions. The examples shown in these sections utilize a small, manufactured dataset inspired by the air routes dataset. All examples are written in Groovy unless otherwise noted.

mergeV() and mergeE()

One of the most compelling reasons to upgrade to TinkerPop 3.6.x is the introduction of mergeV() and mergeE() steps. As mentioned earlier, these steps unwind the complexity of the fold().coalesce(unfold(), …) pattern used when doing graph mutations that require upsert-like functionality for vertices and edges. Because fold-coalesce-unfold has been prescribed for many years in TinkerPop, documented in posts and examples many times over, and advanced in a multitude of ways (bulk loads, streaming use cases, and more), it’s quite likely that if you modify your graph data in your code, you’re using this pattern somewhere. It’s worth your time to identify where you are using it and refactor to these new steps where possible. It will make your code more readable and allow Neptune to better optimize for mutation performance in certain cases.

As an example of mergeV(), consider the fold-coalesce-unfold pattern for upserting a Vertex:

gremlin> g.V().has('airport','code','ATL').
......1> fold().
......2> coalesce(unfold(),
......3> addV('airport').property(id, '215').property('code','ATL')).
......4> valueMap(true)
==>{code=[ATL], id=215, label=airport}

Consider the same example using mergeV() in 3.6.x:

gremlin> g.mergeV([(T.label): 'airport', (T.id): '215', code: 'ATL']).
......1> valueMap(true)
==>{code=[ATL], id=215, label=airport}

The use of a Map as an argument to mergeV() is convenient because it’s a common generic form an application will have its data in. Folks who might be new to Gremlin don’t have to recognize a pattern of steps and their interactions to understand the code, and the readability improves considerably for all.
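To make the match-or-create idea concrete, the following is a plain-Java model of it, not TinkerPop code. It assumes that every entry of the Map is used when searching and that, when nothing matches, the same Map is used to create the element. The class UpsertModel and its methods are hypothetical names introduced only for this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of match-or-create; it does not use TinkerPop.
class UpsertModel {
    private final List<Map<String, Object>> vertices = new ArrayList<>();

    // Returns the first stored element that contains every entry of `search`.
    // If none exists, stores a copy of `search` as a new element and returns it.
    Map<String, Object> merge(Map<String, Object> search) {
        for (Map<String, Object> vertex : vertices) {
            if (vertex.entrySet().containsAll(search.entrySet())) {
                return vertex;
            }
        }
        Map<String, Object> created = new LinkedHashMap<>(search);
        vertices.add(created);
        return created;
    }

    // Number of elements stored so far.
    int size() {
        return vertices.size();
    }
}
```

Calling merge() twice with the same Map leaves a single element in the model, which is the property an upsert is meant to give you. The real steps offer more control than this sketch; refer to the TinkerPop reference documentation for the details.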

The mergeE() step offers an even more dramatic improvement in readability. For example, assume the addition of one more airport vertex in addition to the ATL one added previously:

gremlin> g.addV('airport').property(id, '203').property('code','AUS')
==>v[203]

Upserting a route edge from ATL to AUS using fold-coalesce-unfold would look like the following code:

gremlin> g.V().has('airport','code','ATL').as('v').
......1> V().has('airport','code','AUS').
......2> coalesce(inE('route').where(outV().as('v')),
......3> addE('route').from('v').property(id, '219').property('dist',813)).
......4> elementMap()
==>{id=219, label=route, IN={id=203, label=airport}, OUT={id=215, label=airport}, dist=813}

With mergeE() in 3.6.x, the preceding Gremlin simplifies to the following:

gremlin> g.V().has('airport','code','ATL').
......1> mergeE([(T.label): 'route', (T.id): '219', (to): '203', dist: 813]).
......2> elementMap()
==>{id=219, label=route, IN={id=203, label=airport}, OUT={id=215, label=airport}, dist=813}

Note that mergeV() and mergeE() were designed to cover the widest number of upsert use cases possible without greatly overcomplicating the step usage and thereby compromising an intuitive feel. As a result, you may find cases where these steps aren’t perfect replacements for fold-coalesce-unfold and therefore retaining the old pattern may be necessary. Fold-coalesce-unfold remains a useful Gremlin pattern and shouldn’t be construed as something to avoid using given the emergence of these new steps.

The examples provided here are simple. See the TinkerPop documentation on mergeV() and mergeE() for further details, including the reference documentation for each step. As usage of these steps in real-world applications grows, it will be interesting to see how they are used and what new patterns emerge. Just as fold-coalesce-unfold became a pattern by combining three steps to satisfy a common use case, new patterns that combine mergeV() and mergeE() may find their way into the Gremlin toolbox.

element()

The element() step allows you to traverse from a property back to its parent element (Vertex, Edge, or VertexProperty):

gremlin> g.V().properties('code').as('p').
......1> element().out().as('v').
......2> select('p','v')
==>{p=vp[code->AUS], v=v[2]}
==>{p=vp[code->AUS], v=v[5]}
==>{p=vp[code->DFW], v=v[3]}
==>{p=vp[code->DFW], v=v[4]}
==>{p=vp[code->LAX], v=v[1]}
==>{p=vp[code->LAX], v=v[2]}
==>{p=vp[code->LAX], v=v[4]}
==>{p=vp[code->ATL], v=v[2]}
==>{p=vp[code->ATL], v=v[4]}

This step offers a major convenience because it saves you from having to track the parent element if you need to traverse to the property, leading to more readable Gremlin queries.

fail()

The fail() step provides a way to stop a traversal with an exception when a particular branch of Gremlin runs. If you currently flag such a branch by returning a marker value with constant(), but the situation is really an error that should halt the traversal, consider refactoring that branch to use fail() instead. See the following code:

gremlin> g.V().choose(has('code',startingWith('A')), values('code'), constant('Not Starting with A'))
==>AUS
==>Not Starting with A
==>Not Starting with A
==>Not Starting with A
==>ATL
gremlin> g.V().choose(has('code',startingWith('A')), values('code'), fail('Not Starting with A'))
{"requestId":"2de62d83-2fc4-407b-b524-c0d71ba28309","code":"InternalFailureException","detailedMessage":"Exception processing a script on request [RequestMessage{, requestId=2de62d83-2fc4-407b-b524-c0d71ba28309, op='eval', processor='', args={gremlin=g.V().choose(has('code',startingWith('A')), values('code'), fail('Not Starting with A')), userAgent=Gremlin Console/3.5.4, batchSize=64}}]."}
Type ':help' or ':h' for help.
Display stack trace? [yN]n

TextP.regex()

The regex() option on TextP allows you to construct predicates from a regular expression, which offers a great deal of flexibility when filtering string values:

gremlin> g.V().values('code')
==>AUS
==>DFW
==>LAX
==>JFK
==>ATL
gremlin> g.V().has('code', regex('A')).values('code')
==>AUS
==>LAX
==>ATL
gremlin> g.V().has('code', regex('^A')).values('code')
==>AUS
==>ATL
gremlin> g.V().has('code', regex('^A|J')).values('code')
==>AUS
==>JFK
==>ATL
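The results above are consistent with partial-match semantics: the pattern only has to be found somewhere in the value, which is why regex('A') keeps LAX. The following plain-Java sketch reproduces the three filters with java.util.regex under that assumption. It is a model of the behavior, not TinkerPop code, and the class name RegexFilterModel is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java model of the filters above; assumes partial-match (find) semantics.
class RegexFilterModel {
    // Keeps the values in which the pattern is found anywhere in the string.
    static List<String> filter(List<String> values, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            if (pattern.matcher(value).find()) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> codes = List.of("AUS", "DFW", "LAX", "JFK", "ATL");
        System.out.println(filter(codes, "A"));      // [AUS, LAX, ATL]
        System.out.println(filter(codes, "^A"));     // [AUS, ATL]
        System.out.println(filter(codes, "^A|J"));   // [AUS, JFK, ATL]
    }
}
```

Note that in '^A|J' the anchor applies only to the first alternative, so the pattern reads as "starts with A, or contains J anywhere". With the five codes above this makes no visible difference, but a made-up code such as SJC would match '^A|J' and would not match '^(A|J)'.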

property(Map)

Adding multiple properties to an element with the property() step involves chaining them one after the other, calling it once for each key-value pair being added:

gremlin> g.addV('airport').property(id, '203').property('code','AUS')
==>v[203]

Of course, data from applications often comes in the shape of a Map, which means that you have to unroll the Map either in a loop in your code or in Gremlin itself. TinkerPop 3.6.x offers a new overload to the property() step that will directly take a Map, thereby saving you this added step:

gremlin> g.addV('airport').property([(T.id): '203', code: 'AUS'])
==>v[203]

Behavior of by()

The by() modulator is used with a variety of steps in Gremlin to configure them with additional options. Before 3.6.x, the behavior that by() triggered within these steps varied considerably, and often caused confusion when exceptions were raised as a result of the arguments given to it. In 3.6.x, the behavior is more consistent. If a by() produces a result, that result is used in the parent step. If it does not produce a result, the traverser to which it was applied is filtered out.
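One way to picture the 3.6.x rule is with a plain-Java model, shown here as a sketch rather than TinkerPop code. It treats a by() as a function that returns an Optional and drops the input when the Optional is empty. The class ByModel, the method applyBy, and the adjacency data are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical model of the rule: a by() that yields nothing filters the traverser.
class ByModel {
    // Applies `by` to each input; keeps the result when present, otherwise drops the input.
    static <T, R> List<R> applyBy(List<T> traversers, Function<T, Optional<R>> by) {
        List<R> results = new ArrayList<>();
        for (T traverser : traversers) {
            by.apply(traverser).ifPresent(results::add);
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up adjacency: vertex 2 has no outgoing edges.
        Map<Integer, List<Integer>> out =
                Map.of(1, List.of(2), 2, List.<Integer>of(), 3, List.of(4));
        // Take the first outgoing neighbor, if there is one.
        List<Integer> firstOut = applyBy(List.of(1, 2, 3),
                v -> out.get(v).stream().findFirst());
        System.out.println(firstOut); // [2, 4]: vertex 2 is filtered, no exception
    }
}
```

In this model the input without a result simply disappears from the output, which is the change to look for in your own query results after upgrading.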

It’s hard to say what changes your Gremlin queries will need to have when you upgrade because your query semantics may or may not have taken advantage of the older behaviors. If your application was somehow relying on exception behavior for a failed by(), then that is likely an area of code for you to look at first when you upgrade. Consider the following example in Java code where there is a possibility that there are no out() edges for some vertices:

// Assumes static imports of out() and in() from the __ class.
List<Object> l;
try {
    l = g.V().aggregate("a").by(out()).cap("a").toList();
} catch (Exception ex) {
    // Before 3.6.x: The provided traverser does not map to a value: v[2]->[VertexStep(OUT,vertex)]
    l = g.V().aggregate("a").by(in()).cap("a").toList();
}

If you have code that looks like the preceding example, meaning that you rely on the exception thrown as a result of by() that doesn’t produce a result for all cases given to it, you’ll want to make some adjustments after upgrade because the exception will no longer be thrown. Of course, cases like this are few and far between because it’s typical to encounter this sort of exception during development. The usual course is to account for the problematic by() in the query itself rather than use an exception as a switch for modifying the query’s behavioral flow. In any event, it’s worth a solid review of your code and query results to ensure that all is behaving as expected after upgrade.

Given the number of steps that use by(), it’s a bit too much to examine them all in this post. For examples for all the steps affected, refer to Consistent by() Behavior.

gremlin-driver and host availability

When using the Java gremlin-driver to connect to Neptune, you have likely encountered a NoHostAvailableException at some point. The exception means that all the hosts configured in the driver are considered unreachable; until the driver detects a working host, further requests result in the same exception. While examining the causes of NoHostAvailableException during development of 3.5.5 and 3.6.2, it was noted that the driver was a bit pessimistic when evaluating host health. As a result, it fell into states where it could have been sending requests but instead produced NoHostAvailableException.

For 3.6.2 and 3.5.5, the driver was altered to offer a more optimistic strategy for determining host availability. The driver now looks at errors in borrowing connections from the pool more skeptically and offers more immediate and reliable retries before giving up on the host. Being more optimistic about the server staying online helps make the driver more resilient to temporary failures. It is also worth noting that logging for the driver was greatly improved, which should help isolate problems more easily should they occur. To read more details on the exact nature of the change, refer to gremlin-driver Host Availability.

Neptune users of the Java driver should look for opportunities to upgrade to 3.5.5 or 3.6.2, whichever matches the Neptune version in use.

Compilation and dependencies

If upgrading from earlier versions of TinkerPop, there are a number of changes in 3.6.x that may create compilation problems when moving to this release. Fortunately, these issues should be straightforward to address.

Java Gremlin DSL dependencies – If you have developed a Gremlin DSL in Java, note that the libraries that support this functionality no longer exist in the gremlin-core module. They have been extracted to the gremlin-annotations module. Package names and class names were retained in this refactoring, therefore simply adding the gremlin-annotations dependencies should resolve any compilation problems.
Groovy dependency removal – The dependency on Groovy in gremlin-driver, which was declared as <optional> in 3.5.x, has been wholly removed, along with support for JsonBuilder serialization, which was deprecated in 3.5.2.
Removal of Gryo support – All Gryo MessageSerializer implementations have been removed; all of them had been deprecated by 3.4.3. You should now prefer GraphBinary for network serialization needs. Unless you’re upgrading from a very old version of gremlin-driver or have a specific configuration for Gryo, you’re likely already using GraphBinary when communicating with Neptune. You won’t be able to use older versions of gremlin-driver with Gryo configured to communicate with Neptune 1.2.1.0. Attempts to do so will result in an error on the client.
GraphBinary as a default – GraphBinary serialization is now the default serialization option across all officially supported programming languages. GraphSON 3.0 serialization is still supported, but not recommended.
Moved GroovyTranslator – GroovyTranslator, a class meant to translate Gremlin traversals to a String form, has been moved out of the gremlin-groovy module. If you use this class, you should drop your dependency on gremlin-groovy (assuming you have no other use for it) and rely on gremlin-core, where the class can now be found in the org.apache.tinkerpop.gremlin.process.traversal.translator package.
Step naming in Python – The following steps were renamed with the standard underscore suffix to avoid conflicts with certain Python keywords: filter_(), id_(), max_(), min_(), range_(), and sum_(). References to these steps during upgrade will require the steps to be renamed or otherwise aliased. Also, note that camelcase steps of gremlinpython now have more Pythonic equivalents (for example, valueMap() now also has the more Python friendly naming of value_map()). Although the old camelcase naming remains, it’s recommended that you prefer the more Pythonic approach because the camelcase naming may be removed in the future.

Conclusion

The TinkerPop 3.6.x release line has many features that Neptune users will be excited to have. As always, refer to the TinkerPop upgrade documentation and its CHANGELOG for a full listing of all of the changes through the most current release of 3.6.2. Please upgrade to Neptune 1.2.1.0 to take advantage of these important new features.

About the author

Stephen Mallette is a member of the Amazon Neptune team at AWS. He has developed graph database and graph processing technology for many years. He is a decade-long contributor to the Apache TinkerPop project, the home of the Gremlin graph query language.
