3. File Copier¶
This is the user documentation for the File Copier module contained in the ICDC-Dataloader utility.
3.1. Introduction¶
The File Copier copies files from a source URL to a designated AWS S3 Bucket. It has 3 modes of operation:
Master mode - The File Copier will read all of the file information from the pre-manifest, push jobs onto the job queue, and then listen to the results queue for the loading results.
Slave mode - The File Copier will grab jobs from the job queue, perform the copy job, and then push the job result to the result queue.
Solo mode - The File Copier will read all of the file information from the pre-manifest and then copy all of the files to the destination S3 bucket.
The File Copier can be found in this Github Repository: ICDC-Dataloader
3.2. Pre-requisites¶
Python 3.6 or newer
An initialized destination AWS S3 bucket
AWS Command Line Interface (CLI)
Initialized Job and Result SQS FIFO Queues (
master
andslave
modes only)An adapter to process information read from pre-manifest
3.3. Dependencies¶
Run pip3 install -r requirements.txt
to install dependencies. Or run pip install -r requirements.txt
if you are using virtualenv. The dependencies included in requirements.txt
are listed below:
pyyaml
neo4j - version 1.7.6
boto3
requests
3.4. Inputs¶
The location of the files to be copied
The name of the destination S3 bucket
A File Copier config file
The module name and class name of the adapter for the data being transferred
A pre-manifest file (in TSV format)
The names of the job and result SQS FIFO queues (
master
andslave
modes only)
3.5. Outputs¶
The File Copier module will produce following outputs
Copies files into the specified S3 bucket
Generates two manifest files in the same place as pre-manifest file, one for DCF/IndexD, the other for Neo4j database.
Log messages to console as well as a log file inside
tmp/
folder.
3.6. Configuration file¶
All the inputs of File Copier can be set in a YAML format configuration file by using the fields defined below.
An example configuration file can be found in config/file-copier-config.example.yml
domain
: The domain name of the project.adapter_module
: The module name of the adapter that will be used by the File Copier during operation.adapter_class
: The class name of the adapter that will be used by the File Copier during operation.adapter_params
: An object which contains parameters for the adapter’s constructor. Only available in configuration file, not as CLI arguments.bucket
: The files in the source S3 Bucket will be copied into this destination S3 Bucket.prefix
: Prefix for files being copied into the destination bucket.first
: The first line to load. Lines are indexed starting with 1 and header lines are not counted.count
: The number of files to be copy, a value of-1
will copy all files.retry
: The number of times that the File Copier will retry the copy operation.mode
: The mode that the File Copier will run, the only valid inputs aremaster
,slave
, andsolo
.job_queue
: The File Copier will send jobs to the job SQS queue with the name specified by this input.result_queue
: The results of the File Copier jobs will be sent to the result SQS queue with the name specified by this input.pre_manifest
: The TSV file containing the details of the files to be copied.overwrite
: Overwrites files even if they already exist in the destination and are the same size.dryrun
: Runs checks on original files but does not perform the copy operation.verify_md5
: Verify that the size and MD5 hash of the original file and the generated copy are the same.
3.7. Command Line Arguments¶
Configuration File
The YAML file containing the configuration details for the File Copier execution
Command :
<configuration file>
Required
Default Value:
N/A
Destination S3 Bucket Name
The files in the source S3 Bucket will be copied into this destination S3 Bucket.
Command:
-b/--bucket <S3 bucket name>
Required
Default Value:
N/A
Project Domain Name
The domain name of the project.
Command:
--domain <domain name>
Required when not in
slave
modeDefault Value:
N/A
File Prefix
Prefix for files being copied into the destination bucket.
Command:
-p/--prefix <prefix>
Required when not in
slave
modeDefault Value:
N/A
First Line
The first line to load. Lines are indexed starting with 1 and header lines are not counted.
Command:
-f/--first <index of first line>
Not Required
Default Value:
1
Number of Files to Copy
The number of files to be copy, a value of
-1
will copy all files.Command:
-c/--count <number of files to copy>
Not Required
Default Value:
-1
Enable Overwrite
Overwrites files even if they already exist in the destination and are the same size.
Command:
--overwrite
Not Required
Default Value:
false
Enable Dry Run
Runs checks on original files but does not perform the copy operation.
Command:
-d/--dryrun
Not Required
Default Value:
false
Verify Original MD5
Verify that the size and MD5 hash of the original file and the generated copy are the same.
Command:
-v/--verify-md5
Not Required
Default Value:
false
Number of Times to Retry
The number of times that the File Copier will retry the copy operation.
Command:
-r/--retry
Not Required
Default Value:
3
Running Mode
The mode that the File Copier will run, the only valid inputs are
master
,slave
, andsolo
.Command:
-m/--mode
Required
Default Value:
N/A
Job SQS Queue Name
The File Copier will send jobs to the job SQS queue with the name specified by this input.
Command:
--job-queue
Required when not in
solo
modeDefault Value:
N/A
Result SQS Queue Name
The results of the File Copier jobs will be sent to the result SQS queue with the name specified by this input.
Command:
--result-queue
Required when not in
solo
modeDefault Value:
N/A
Pre-manifest File
The TSV file containing the details of the files to be copied.
Command:
--pre-manifest
Required when not in
slave
modeDefault Value:
N/A
Adapter Module Name
The module name of the adapter that will be used by the File Copier during operation.
Command:
--adapter-module
Required when not in
slave
modeDefault Value:
N/A
Adapter Class Name
The class name of the adapter that will be used by the File Copier during operation.
Command:
--adapter-class
Required when not in
slave
modeDefault Value:
N/A
3.8. Usage Examples¶
Below are example commands to run the File Copier.
3.8.1. Solo Mode¶
file_copier.py -b example_bucket --domain example_domain -p example_prefix -m solo --pre-manifest example_file.tsv --adapter-module example_module --adapter-class example_class example_config.yml
3.8.2. Master Mode¶
file_copier.py -b example_bucket --domain example_domain -p example_prefix -m master --job-queue example_job_queue --result-queue example_result_queue --pre-manifest example_file.tsv --adapter-module example_module --adapter-class example_class example_config.yml
3.8.3. Solo Mode¶
file_copier.py -b example_bucket -m slave --job-queue example_job_queue --result-queue example_result_queue example_config.yml
3.8.4. Example Inputs¶
Destination S3 Bucket Name
example_bucket
Project Domain Name
example_domain
File Prefix
example_prefix
Running Mode
solo
master
slave
Job SQS Queue Name
example_job_queue
Result SQS Queue Name
example_result_queue
Pre-manifest File
example_file.tsv
Adapter Module Name
example_module
Adapter Class Name
example_class
Configuration File
example_config.yml