Data Import Decisions: Why Choose Kettle for Neo4j Data Import?

Jennifer Reif
6 min read · May 14, 2019


I recently began exploring a couple of options in Neo4j’s massive list of integration tools and wrote a developer guide on importing relational data into Neo4j.

Data import is one of the most time-consuming and complex steps in any project, and there are a variety of ways to go about it. One of the ways to get data into Neo4j is with the Neo4j plugins in Kettle. Of course, you can import or export all kinds of data using Kettle, but I had never dealt much with the tool, so I wanted to see how it worked and why you might choose it over the alternatives.

Kettle

Kettle started as a project within Pentaho and was completely open-sourced. It lets you build a visual process for how data moves through your system, essentially giving you a detailed data flow diagram and the steps in the process. Whether you are dealing with multiple databases, batch processes, flat files, or data manipulation and transformation procedures, Kettle can help you see what is happening to your data and where it goes.

There are also quite a few customization options for various situations and unique data that allow the user to handle the data entirely within the tool and accurately map the process for their team, department, or company (rather than sending pieces outside the tool to feed back in later).

By adding Neo4j to the mix, users can maximize these efforts by making it the source and receiving an overall representation of data across sources. Kettle could also simply be a means to get massive amounts of data into Neo4j for transactional data or analytics. No matter the use case, there are reasons to choose Kettle over other Neo4j import methods such as LOAD CSV, APOC, or plain Cypher. We will briefly cover a couple.
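As a point of comparison, here is a minimal sketch of the LOAD CSV approach mentioned above; the file name, label, and property names are invented for illustration and are not from any particular dataset.

// Hypothetical baseline: importing a CSV with hand-written Cypher instead of Kettle.
// The file, label, and properties below are made up for illustration.
LOAD CSV WITH HEADERS FROM 'file:///customers.csv' AS row
MERGE (c:Customer {customerId: row.customerId})
SET c.name = row.name, c.city = row.city;

Kettle’s plugins, covered below, can generate or execute this kind of statement for you as one step in a larger visual flow.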

Why go with Kettle?

  1. Your developers are already familiar with the tool. There are very few reasons to start completely from scratch with any technology if you don’t have to, so the less you rip and replace, the easier it is to onboard current developers. Kettle is a popular enterprise tool, so if it’s already in your group’s pipeline, take advantage of it!
  2. You need to import massive amounts of data and/or import from a variety of data sources. Because Kettle was designed to work with large systems, it handles large volumes of data from all kinds of different systems and sources very well. You might have a piece of data that gets filtered and formatted a certain way for a report in one area, while other data feeds your data analytics system and another slice is shipped to a batch process for payroll. No matter how you slice and dice it, Kettle seems able to accommodate it.
  3. You need to handle different kinds of transformations or data formats. The nice thing about this tool is that it can handle a variety of scenarios within itself. There is no passing data out to an external process or other system to manipulate or calculate it in a particular way before passing the formatted results back in. For clean and robust data processes, Kettle seems to do the job well.
  4. You have a complex data flow process that is difficult to maintain and explain. Kettle can help organize and depict the data flow process and the steps data takes to reach various outputs. This helps you document and maintain the process as a whole, as well as update and fix specific steps or sub-processes.

Setup for Neo4j and Kettle

Now that we have a foundation for when Kettle is useful and why, we can walk through the steps to get Kettle downloaded and ready to go. There are a few plugins for using Kettle with Neo4j: Neo4j Output, Neo4j Cypher, and Neo4j Graph Output. These three handle quite a bit of the interaction with Neo4j, though more options are in progress, so keep an eye on announcements and blog posts for the latest. Documentation and explanation of these plugins can also be found in the GitHub project.

Neo4j Kettle plugins

The Neo4j Output plugin lets users specify a node, two nodes, or two nodes with a relationship, along with details about what should be updated, and the plugin generates and executes the necessary Cypher statements to accomplish it. If your developers are not Cypher gurus yet, or less-technical users are interacting with the data, this plugin lets them make changes to the data without technical intervention.
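To make that concrete, the statements the plugin produces are ordinary Cypher along these lines; this is a hand-written sketch with invented labels, properties, and parameter names, not the plugin’s literal output.

// Sketch of the kind of Cypher a two-nodes-with-relationship update boils down to.
// Labels, properties, and parameter names are hypothetical.
MERGE (c:Customer {customerId: $customerId})
MERGE (o:Order {orderId: $orderId})
MERGE (c)-[:PLACED]->(o)
SET o.total = $total;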

In contrast to the previous option, the Neo4j Cypher plugin allows users to write Cypher statements and execute them over the Bolt protocol. With this option, you can read, write, or read/write in your queries and write Cypher to your heart’s content. There are also parameters you can specify to batch the results or to pass in or unwind the incoming data.
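For example, a statement written in this step might unwind a batch of incoming rows in a single round trip; the $rows parameter and field names below are assumptions for illustration, standing in for whatever the step actually passes along.

// Sketch of a batched write that unwinds incoming rows.
// The $rows parameter and field names are hypothetical.
UNWIND $rows AS row
MERGE (p:Product {sku: row.sku})
SET p.name = row.name, p.price = toFloat(row.price);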

The third option is the Neo4j Graph Output plugin, which is still being developed and improved. It maps input fields to a graph data model and skips null values, so you don’t get halted by errors or data cleaning. As of right now, this plugin defines the model locally, but that will likely change in the future. You can edit and define the mappings to build the data model that fits your needs.

Installing and starting Kettle

Kettle has a nice UI and built-in plugins that come bundled with the download on SourceForge’s Kettle project page. It’s a pretty large download, so be sure to have some space on your hard drive before you start it! :)

Once you download the .zip file and unzip the contents, the next step is to add the Neo4j plugins. The neo4j.kettle.be link points to the source where all of the latest code can be found. Click on the Neo4jOutput .zip file to download the plugins, unzip it, and move the unzipped folder into the data-integration/plugins/ folder from your Kettle download.

Next, you can open the Spoon (or Data Integration) application, which launches the Kettle UI. If you’re using a Mac, the application may not open without adjusting a security setting (Kettle was downloaded from the internet, and Apple is picky about external applications). To allow it, type the command below in the Mac terminal, open a Finder window where the Data Integration app is located, drag the app onto your terminal window (this pastes the path to the app at the end of the command), execute it, and you should be good to go!

sudo xattr -dr com.apple.quarantine <Data Integration application path>

The steps for this can also be found in a blog post here. Once you complete those steps (Mac users only), you can open the Data Integration app again, and it should load. When the app first loads, you should see the main screen of Kettle, where you can start creating your data process and steps.

Highlights

In this post, we discussed Kettle as a data import option for Neo4j, looked at the reasons for choosing this method, and covered how to get started with the tool. Kettle has a suite of plugins for specific scenarios, including a few for Neo4j. The initial set of plugins was designed to ease the import process both for those comfortable with Neo4j and Cypher and for those new to graphs and how to interact with them.

There are many ways to handle data import with Neo4j using built-in methods or integration tools such as Kettle. No matter your need, there is probably a great solution out there for you. The difficulty is in finding it. Hopefully, this post will aid in your decision and encourage you to test Kettle out as a data import solution to see if it’s right for your project.


Jennifer Reif

Jennifer Reif is an avid developer and problem-solver. She enjoys learning new technologies, sometimes on a daily basis! Her Twitter handle is @JMHReif.