3. File Copier
This is the user documentation for the File Copier module contained in the ICDC-Dataloader utility.
3.1. Introduction
The File Copier copies files from a source URL to a designated AWS S3 Bucket. It has 3 modes of operation:
Master mode - The File Copier will read all of the file information from the pre-manifest, push jobs onto the job queue, and then listen to the results queue for the loading results.
Slave mode - The File Copier will grab jobs from the job queue, perform the copy job, and then push the job result to the result queue.
Solo mode - The File Copier will read all of the file information from the pre-manifest and then copy all of the files to the destination S3 bucket.
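The division of work between the three modes can be illustrated with a minimal sketch. It is not the File Copier's actual implementation: in-memory queue.Queue objects stand in for the SQS FIFO queues, and the function names (run_master, run_slave, run_solo, collect_results) and the copy_file callback are hypothetical names chosen for illustration.

```python
import queue


def run_master(file_infos, job_queue):
    """Push one job per pre-manifest entry; return the number of jobs queued."""
    for info in file_infos:
        job_queue.put(info)
    return len(file_infos)


def run_slave(job_queue, result_queue, copy_file):
    """Take jobs until the job queue is empty and report each outcome."""
    while True:
        try:
            job = job_queue.get_nowait()
        except queue.Empty:
            break
        try:
            copy_file(job)
            result_queue.put({"job": job, "status": "success"})
        except Exception as exc:  # report the failure instead of stopping
            result_queue.put({"job": job, "status": "failed", "error": str(exc)})


def collect_results(result_queue, expected):
    """Master side: read exactly as many results as jobs were queued."""
    return [result_queue.get() for _ in range(expected)]


def run_solo(file_infos, copy_file):
    """Solo mode: copy every file directly, with no queues involved."""
    for info in file_infos:
        copy_file(info)
```

In this sketch the master never copies anything itself; it only queues jobs and reads results, while any number of slave processes could drain the same job queue.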
The File Copier can be found in this GitHub repository: ICDC-Dataloader
3.2. Pre-requisites
Python 3.6 or newer
An initialized destination AWS S3 bucket
AWS Command Line Interface (CLI)
Initialized Job and Result SQS FIFO Queues (master and slave modes only)
An adapter to process information read from the pre-manifest
3.3. Dependencies
Run pip3 install -r requirements.txt to install the dependencies, or pip install -r requirements.txt if you are using virtualenv. The dependencies included in requirements.txt are listed below:
pyyaml
neo4j - version 1.7.6
boto3
requests
3.4. Inputs
The location of the files to be copied
The name of the destination S3 bucket
A File Copier config file
The module name and class name of the adapter for the data being transferred
A pre-manifest file (in TSV format)
The names of the job and result SQS FIFO queues (master and slave modes only)
3.5. Outputs
The File Copier module produces the following outputs:
Copies files into the specified S3 bucket
Generates two manifest files in the same location as the pre-manifest file: one for DCF/IndexD and one for the Neo4j database.
Logs messages to the console as well as to a log file inside the tmp/ folder.
3.6. Configuration file
All the inputs of File Copier can be set in a YAML format configuration file by using the fields defined below.
An example configuration file can be found in config/file-copier-config.example.yml
domain: The domain name of the project.
adapter_module: The module name of the adapter that will be used by the File Copier during operation.
adapter_class: The class name of the adapter that will be used by the File Copier during operation.
adapter_params: An object which contains parameters for the adapter's constructor. Only available in the configuration file, not as a CLI argument.
bucket: The files in the source S3 Bucket will be copied into this destination S3 Bucket.
prefix: Prefix for files being copied into the destination bucket.
first: The first line to load. Lines are indexed starting with 1 and header lines are not counted.
count: The number of files to copy; a value of -1 will copy all files.
retry: The number of times that the File Copier will retry the copy operation.
mode: The mode in which the File Copier will run; the only valid inputs are master, slave, and solo.
job_queue: The File Copier will send jobs to the job SQS queue with the name specified by this input.
result_queue: The results of the File Copier jobs will be sent to the result SQS queue with the name specified by this input.
pre_manifest: The TSV file containing the details of the files to be copied.
overwrite: Overwrites files even if they already exist in the destination and are the same size.
dryrun: Runs checks on original files but does not perform the copy operation.
verify_md5: Verify that the size and MD5 hash of the original file and the generated copy are the same.
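The first and count fields together select a window of pre-manifest rows. The following is a minimal sketch of that selection as described above (1-based indexing, header not counted, -1 meaning all files). It is an illustration only: the function name select_rows and the column names in the usage example are hypothetical, and the File Copier's own code may differ.

```python
import csv
import io


def select_rows(pre_manifest_text, first=1, count=-1):
    """Return the pre-manifest rows selected by `first` and `count`.

    `first` is 1-based and does not count the header line.
    `count` of -1 selects every row from `first` onward.
    """
    if first < 1:
        raise ValueError("first must be 1 or greater")
    if count < -1:
        raise ValueError("count must be -1 or a non-negative number")
    # DictReader consumes the header line, so rows[0] is data line 1.
    reader = csv.DictReader(io.StringIO(pre_manifest_text), delimiter="\t")
    rows = list(reader)
    start = first - 1
    if count == -1:
        return rows[start:]
    return rows[start:start + count]
```

For example, with a three-row pre-manifest, first=2 and count=1 would select only the second data row, and first=2 with the default count would select the second and third rows.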
3.7. Command Line Arguments
Configuration File
The YAML file containing the configuration details for the File Copier execution.
Command: <configuration file>
Required: Yes
Default Value: N/A
Destination S3 Bucket Name
The files in the source S3 Bucket will be copied into this destination S3 Bucket.
Command: -b/--bucket <S3 bucket name>
Required: Yes
Default Value: N/A
Project Domain Name
The domain name of the project.
Command: --domain <domain name>
Required: When not in slave mode
Default Value: N/A
File Prefix
Prefix for files being copied into the destination bucket.
Command: -p/--prefix <prefix>
Required: When not in slave mode
Default Value: N/A
First Line
The first line to load. Lines are indexed starting with 1 and header lines are not counted.
Command: -f/--first <index of first line>
Required: No
Default Value: 1
Number of Files to Copy
The number of files to copy; a value of -1 will copy all files.
Command: -c/--count <number of files to copy>
Required: No
Default Value: -1
Enable Overwrite
Overwrites files even if they already exist in the destination and are the same size.
Command: --overwrite
Required: No
Default Value: false
Enable Dry Run
Runs checks on original files but does not perform the copy operation.
Command: -d/--dryrun
Required: No
Default Value: false
Verify Original MD5
Verify that the size and MD5 hash of the original file and the generated copy are the same.
Command: -v/--verify-md5
Required: No
Default Value: false
Number of Times to Retry
The number of times that the File Copier will retry the copy operation.
Command: -r/--retry
Required: No
Default Value: 3
Running Mode
The mode in which the File Copier will run; the only valid inputs are master, slave, and solo.
Command: -m/--mode
Required: Yes
Default Value: N/A
Job SQS Queue Name
The File Copier will send jobs to the job SQS queue with the name specified by this input.
Command: --job-queue
Required: When not in solo mode
Default Value: N/A
Result SQS Queue Name
The results of the File Copier jobs will be sent to the result SQS queue with the name specified by this input.
Command: --result-queue
Required: When not in solo mode
Default Value: N/A
Pre-manifest File
The TSV file containing the details of the files to be copied.
Command: --pre-manifest
Required: When not in slave mode
Default Value: N/A
Adapter Module Name
The module name of the adapter that will be used by the File Copier during operation.
Command: --adapter-module
Required: When not in slave mode
Default Value: N/A
Adapter Class Name
The class name of the adapter that will be used by the File Copier during operation.
Command: --adapter-class
Required: When not in slave mode
Default Value: N/A
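The arguments and their mode-dependent requirements listed above can be summarized in a short argparse sketch. This is an illustration of the documented interface, not the actual parser in file_copier.py: the function names build_parser and missing_arguments are hypothetical, and the sketch assumes that no option is marked as required in argparse itself (since values may also come from the configuration file) and that requirements are checked afterwards.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags and default values."""
    parser = argparse.ArgumentParser(description="File Copier arguments (sketch)")
    parser.add_argument("config_file", help="YAML configuration file")
    parser.add_argument("-b", "--bucket", help="destination S3 bucket name")
    parser.add_argument("--domain", help="project domain name")
    parser.add_argument("-p", "--prefix", help="prefix for copied files")
    parser.add_argument("-f", "--first", type=int, default=1)
    parser.add_argument("-c", "--count", type=int, default=-1)
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("-d", "--dryrun", action="store_true")
    parser.add_argument("-v", "--verify-md5", action="store_true")
    parser.add_argument("-r", "--retry", type=int, default=3)
    parser.add_argument("-m", "--mode", choices=["master", "slave", "solo"])
    parser.add_argument("--job-queue")
    parser.add_argument("--result-queue")
    parser.add_argument("--pre-manifest")
    parser.add_argument("--adapter-module")
    parser.add_argument("--adapter-class")
    return parser


def missing_arguments(args):
    """Return the names of arguments the documented rules require but lack."""
    required = ["bucket", "mode"]
    if args.mode != "slave":
        required += ["domain", "prefix", "pre_manifest",
                     "adapter_module", "adapter_class"]
    if args.mode != "solo":
        required += ["job_queue", "result_queue"]
    return [name for name in required if getattr(args, name) is None]
```

Under these rules a slave-mode invocation needs only the bucket, the mode, the two queue names, and the configuration file, while solo mode needs everything except the queue names.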
3.8. Usage Examples
Below are example commands to run the File Copier.
3.8.1. Solo Mode
file_copier.py -b example_bucket --domain example_domain -p example_prefix -m solo --pre-manifest example_file.tsv --adapter-module example_module --adapter-class example_class example_config.yml
3.8.2. Master Mode
file_copier.py -b example_bucket --domain example_domain -p example_prefix -m master --job-queue example_job_queue --result-queue example_result_queue --pre-manifest example_file.tsv --adapter-module example_module --adapter-class example_class example_config.yml
3.8.3. Slave Mode
file_copier.py -b example_bucket -m slave --job-queue example_job_queue --result-queue example_result_queue example_config.yml
3.8.4. Example Inputs
Destination S3 Bucket Name
example_bucket
Project Domain Name
example_domain
File Prefix
example_prefix
Running Mode
solo, master, or slave
Job SQS Queue Name
example_job_queue
Result SQS Queue Name
example_result_queue
Pre-manifest File
example_file.tsv
Adapter Module Name
example_module
Adapter Class Name
example_class
Configuration File
example_config.yml