1. Data Loader¶
This is the user documentation for the Data Loader module contained in the ICDC-Dataloader utility.
1.1. Introduction¶
The Data Loader module is a versatile Python application used to load data into a Neo4j database. The application is capable of loading data from either a system directory or from an Amazon Web Services (AWS) S3 Bucket
The Data Loader can be found in this Github Repository: ICDC-Dataloader
1.2. Pre-requisites¶
Python 3.6 or newer
An initialized and running Neo4j database
AWS S3 bucket, if applicable
If the Neo4j database is running remotely, SSH login permissions are required.
On the system running the Neo4j database, the
<Neo4j Home>/bin
directory must be added to thePATH
environment variable in order to perform the backup operation.
1.3. Dependencies¶
Run pip3 install -r requirements.txt
to install dependencies. Or run pip install -r requirements.txt
if you are using virtualenv. The dependencies included in requirements.txt
are listed below:
pyyaml
neo4j - version 1.7.6
boto3
requests
1.4. Inputs¶
A Data Loader configuration file
Neo4j endpoint and credentials
YAML formatted schema file and properties files
If loading from an AWS S3 bucket, the S3 folder and bucket name
The dataset directory or a local temporary folder if loading from an AWS S3 bucket
1.5. Outputs¶
The Data Loader module loads data into the specified Neo4j database, and log messages to console as well as a log file inside tmp/
folder.
1.6. Data File Format Specifications¶
Files must be in TSV format with
.tsv
or.txt
extensionFiles must contain a
type
column indicates what node type of the record/nodeAny column with a
parent_node_type.parent_id_field
formatted heading will be used as parent of current record/nodeAny column with a
relationship_type$field_name
formatted heading will be treated as a property on that relationshipAll other columns will be treated as regular properties for the record/node
1.7. Configuration File¶
All the inputs of Data Loader can be set in a YAML format configuration file by using the fields defined below. Using a configuration file can make your Data Loader command significantly shorter.
An example configuration file can be found in config/data-loader-config.example.yml
neo4j:uri
: Address of the target Neo4j endpointneo4j:user
: Username to be used for the Neo4j databaseneo4j:password
: Password to be used for the Neo4j databaseschema
: The file path(s) of the YAML formatted schema file(s)prop_file
: The file containing the properties for the specified schemacheat_mode
: Disables data validation before loading datadry_run
: Runs data validation only, disables loading datawipe_db
: Clears all data in the database before loading the datano_backup
: Skips the backing up the database before loading the databackup_folder
: Location to store database backupno_confirmation
: Automatically confirms any confirmation prompts that are displayed during the data loadingmax_violations
: The maximum number of violations (per data file) to be displayed in the console output during data loadingno_parents
: Does not save parent node IDs in children nodessplit_transactions
: Splits the database load operations into separate transactions for each files3_bucket
: The name of the S3 bucket containing the data to be loadeds3_folder
: The name of the S3 folder containing the data to be loadedloading_mode
: The loading mode to be useddataset
: The directory containing the data to be loaded, a temporary directory if loading from an S3 bucket
1.8. Command Line Arguments¶
All the command line arguments can be specified in the configuration file. If an argument is specified in both the configuration file and the command line then the command line value will be used.
Configuration File
The YAML file containing the configuration details for the Data Loader execution
Command :
<configuration file>
Not Required
Default Value :
N/A
Neo4j URI
Address of the target Neo4j endpoint
Command :
-i/--uri <URI>
Not required
Default Value :
bolt://localhost:7687
Neo4j Username
Username to be used for the Neo4j database
Command :
-u/--user <username>
Not required
Default Value :
neo4j
Neo4j Password
Password to be used for the Neo4j database
Command :
-p/--password <password>
Not required if specified in the configuration file, or in environment variable
NEO_PASSWORD
Default Value :
N/A
Schema File(s)
The file path(s) of the YAML formatted schema file(s). Use multiple –schema arguments (one for each schema file), if more than one schema files are needed
Command :
-s/--schema <schema1> -s/--schema <schema2> …
Not required if specified in the configuration file
Default Value :
N/A
Properties File
The file containing additional properties for the specified schema such as type mappings, id field identifications, and plurals mappings.
Command :
--prop-file <properties file>
Not required if specified in the configuration file
Example files can be found under
config/
folder, such asconfig/props-bento-ext.yml
for Bento reference implementation, andconfig/props-ctdc.yml
for CTDC etc.Default Value :
N/A
Enable Cheat Mode
Disables data validation before loading data
Command :
-c/--cheat-mode
Not required
Default Value :
false
Enable Dry Run
Runs data validation only, disables loading data
Command :
-d/--dry-run
Not required
Default Value :
false
Wipe Database
Clears all data in the database before loading the data
Command :
--wipe-db
Not required
Default Value :
false
Disable Backup
Skips the backing up the database before loading the data
Command :
--no-backup
Not required
Default Value :
false
Database Backup Folder
The folder where the database backup will be stored.
Command :
--backup-folder
Required unless the backup operation is disabled by the
--no-backup
commandDefault Value :
N/A
Enable Auto-Confirm
Automatically confirms any confirmation prompts that are displayed during the data loading
Command :
-y/--yes
Not Required
Default Value :
false
Maximum Violations to Display
The maximum number of violations (per data file) to be displayed in the console output during data loading
Command :
-M/--max-violations <number>
Not Required
Default Value :
10
S3 Bucket Name
The name of the S3 bucket containing the data to be loaded
Command :
-b/--bucket <bucket name>
Not required if data is not being loaded from an S3 bucket
Default Value :
N/A
S3 Folder Name
The name of the S3 folder containing the data to be loaded
Command :
-f/--s3-folder <folder name>
Not required if data is not being loaded from an S3 bucket
Default Value :
N/A
Mode Selection
The loading mode, valid inputs are
upsert
,new
,delete
Command :
-m/--mode <mode>
Not Required
Default Value :
UPSERT_MODE
Enable No Parent IDs Mode
Does not save parent node IDs in children nodes
Command :
--no-parents
Not Required
Default Value :
false
Enable Split Transactions Mode
Creates a separate database transactions for each file while loading
Command :
--split-transactions
Not Required
Default Value :
false
Dataset Directory
The directory containing the data to be loaded, a temporary directory if loading from an S3 bucket
Command :
--dataset <dir>
Not required if specified in the configuration file
Default Value :
N/A
1.9. Usage Example¶
Below is an example command to run the Model Converter:
python3 loader.py config/config.yml -p secret -s tests/data/icdc-model.yml -s tests/data/icdc-model-props.yml --prop-file config/props-icdc.yml --no-backup -–dataset /data/Dataset-20191119
1.9.1. Example Inputs¶
Neo4j Password
secret
Schema Files
tests/data/icdc-model.yml
tests/data/icdc-model-props.yml
Configuration File
config/config.yml
Properties File
config/props-icdc.yml
Disable Backup
Set to
true
with--no-backup
argument
Dataset Directory
/data/Dataset-20191119