Import data from CSV files
You can bulk import data to node and relationship tables from CSV files
using the COPY FROM command. It is highly recommended to use COPY FROM if you are creating large
databases. You can use COPY FROM to import data into an empty table or to append data to an existing table.
There is a set of CSV configurations you can set during COPY FROM (as well as LOAD FROM); we cover
these parameters below in the CSV configurations section. This page first covers
examples of loading into node and relationship tables. See the Ignore erroneous rows section
for how to skip erroneous CSV rows during import.
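For example, appending can be done by simply running COPY FROM again on the same table. Below is a minimal sketch, using the User table defined in the next section and two hypothetical files, user_part1.csv and user_part2.csv, with identical columns:
```cypher
// Hypothetical files with the same columns as the User table.
// The first COPY FROM fills the empty table; the second appends to it.
COPY User FROM "user_part1.csv" (header=true);
COPY User FROM "user_part2.csv" (header=true);
```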
Import to node table
Create a node table User as follows:
```cypher
CREATE NODE TABLE User(name STRING PRIMARY KEY, age INT64, reg_date DATE)
```
The CSV file user.csv contains the following fields:
```csv
name,age,reg_date
Adam,30,2020-06-22
Karissa,40,2019-05-12
...
```
The following statement will load user.csv into the User table:
```cypher
COPY User FROM "user.csv" (header=true);
```
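If you first want to preview the file's contents, you can scan it with LOAD FROM, which accepts the same CSV parameters as noted above. A quick sketch:
```cypher
// Preview the first few rows of user.csv before copying it into the table.
LOAD FROM "user.csv" (header=true) RETURN * LIMIT 3;
```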
Import to relationship table
When loading into a relationship table, Ladybug assumes the first two columns in the file are:
- FROM Node Column: The primary key of the FROM nodes.
- TO Node Column: The primary key of the TO nodes.
The rest of the columns correspond to relationship properties.
Create a relationship table Follows using the following Cypher query:
```cypher
CREATE REL TABLE Follows(FROM User TO User, since DATE)
```
This reads data from the below CSV file follows.csv:
```csv
Adam,Karissa,2010-01-30
Karissa,Michelle,2014-01-30
...
```
The following statement loads the follows.csv file into the Follows table:
```cypher
COPY Follows FROM "follows.csv";
```
Note that the header wasn't present in the CSV file, hence the header parameter is not set.
To skip the first 3 lines of the CSV file, you can use the SKIP parameter as follows:
```cypher
COPY Follows FROM "follows.csv" (SKIP=3);
```
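Since LOAD FROM accepts the same CSV parameters, a sketch like the following can verify how many rows remain after skipping:
```cypher
// Count the rows that remain once the first 3 lines are skipped.
LOAD FROM "follows.csv" (SKIP=3) RETURN COUNT(*);
```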
Import multiple files to a single table
It is common practice to divide a large CSV file into several smaller files for cleaner data management. Ladybug can read multiple files with the same structure, consolidating their data into a single node or relationship table. You can specify that multiple files are loaded in the following ways:
Glob pattern
This is similar to the Unix glob pattern, where you specify file paths that match a given pattern. The following wildcard characters are supported:
| Wildcard | Description |
|---|---|
| * | Match any number of any characters (including none) |
| ? | Match any single character |
| [abc] | Match any one of the characters enclosed within the brackets |
| [a-z] | Match any one of the characters within the range |
```cypher
COPY User FROM "User*.csv"
```
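The bracket wildcards work the same way. For example, the sketch below (assuming hypothetical files named User0.csv through User9.csv) matches file names with exactly one digit:
```cypher
// Matches User0.csv, User1.csv, ..., User9.csv, but not User10.csv.
COPY User FROM "User[0-9].csv"
```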
List of files
Alternatively, you can simply specify a list of files to be loaded:
```cypher
COPY User FROM ["User0.csv", "User1.csv", "User2.csv"]
```
CSV configurations
There is a set of configurations that can be set when importing CSV files, such as
whether the CSV file has a header that should be skipped during loading, or which delimiter character
separates the columns of the CSV. See below for the list of all supported configurations. These
configurations can be set manually by specifying parameters inside ( ) at the
end of the COPY FROM clause. Several of the supported configurations, such as the header and the delimiter character,
are automatically detected if they are not manually specified at the end of the COPY FROM clause.
See the subsections below for more details on how Ladybug automatically detects these configurations.
The following configuration parameters are supported:
| Parameter | Description | Default Value |
|---|---|---|
| HEADER | Whether the first line of the CSV file is the header. Can be true or false. | false |
| DELIM or DELIMITER | Character that separates columns in a line. | , |
| QUOTE | Character that starts a string quote. | " |
| ESCAPE | Character used within string quotes to escape the QUOTE character and other characters, e.g., a line break. See the note about line breaks below. | \ |
| SKIP | Number of rows to skip from the input file. | 0 |
| PARALLEL | Whether to read CSV files in parallel. | true |
| IGNORE_ERRORS | Skips malformed rows in CSV files if set to true. Use the SHOW_WARNINGS function to view information about malformed rows, and the CLEAR_WARNINGS function to clear them. See the Warnings table section for more on inspecting skipped rows. | false |
| auto_detect | Turns the auto detection of configurations on or off (more details below). | true |
| sample_size | The number of sample CSV lines to use when auto detecting CSV configurations (more details below). | 256 |
| NULL_STRINGS | The strings that should be treated as NULL values in the CSV file. | "" (empty string) |
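As an illustration of the IGNORE_ERRORS parameter, the sketch below skips malformed rows and then inspects them; it assumes SHOW_WARNINGS is invoked with CALL, which is how Ladybug's built-in table functions are typically called:
```cypher
// Skip malformed rows instead of aborting the import.
COPY User FROM "user.csv" (header=true, IGNORE_ERRORS=true);
// Inspect the rows that were skipped (assumes SHOW_WARNINGS is called via CALL).
CALL SHOW_WARNINGS() RETURN *;
```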
For example, the query below specifies that the CSV delimiter is | and also that the header row exists.
```cypher
COPY User FROM "user.csv" (HEADER=true, DELIM="|");
```
If any of the following configuration options are not manually specified at the end of the COPY FROM statement,
by default Ladybug will try to automatically detect them:
- HEADER
- DELIM
- QUOTE
- ESCAPE
If you specify a subset of these manually but not the others, then only those that have not been specified will be automatically detected.
You can turn off auto-detection by setting (auto_detect=false) as a parameter, in which case Ladybug falls back to the default values
for any unspecified configurations. For example, consider the example from above again:
```cypher
COPY User FROM "user.csv" (HEADER=true, DELIM="|");
```
In this case (which is equivalent to COPY User FROM "user.csv" (HEADER=true, DELIM="|", auto_detect=true)),
Ladybug will try to automatically detect the QUOTE and ESCAPE characters.
It will not try to automatically detect if the first line is a header line or the DELIM character,
since those configurations are manually specified in the query.
If instead the query was:
```cypher
COPY User FROM "user.csv" (HEADER=true, DELIM="|", auto_detect=false);
```
Then, Ladybug will use the default values of QUOTE and ESCAPE, which are " and \ respectively (and use
the manually specified configurations for HEADER and DELIM).
Sample size parameter
By default, Ladybug will use the first 256 lines of the CSV file to auto-detect unspecified configurations.
If you want to use a different number of lines, you can specify the sample_size parameter.
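For example, to base auto detection on the first 1024 lines instead of the default 256:
```cypher
// Use 1024 sample lines when auto detecting CSV configurations.
COPY User FROM "user.csv" (sample_size=1024);
```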
You can find more information on how Ladybug automatically tries to detect these configurations below.
Header auto detection
Ladybug parses the first line of the CSV into columns and checks if each column can be cast to the data type of the target column in the node or rel table that is being copied into. If so, the line is assumed to be a valid “data” line and inserted as a record into the target table. Otherwise, it is assumed to be a header line and skipped.
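For example, assuming the User table and user.csv from above, you can omit the HEADER parameter entirely and let detection decide. A minimal sketch:
```cypher
// The first line of user.csv is "name,age,reg_date"; "age" cannot be cast to
// the INT64 age column, so the line is detected as a header and skipped.
COPY User FROM "user.csv";
```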
Delimiter, quote and escape character auto detection
Ladybug uses the first sample_size lines to auto detect any configuration that has not been manually specified.
The candidate values considered for each configuration are:
- DELIM: `,`, `|`, `;`, `\t`
- QUOTE: `"`, `'`, and no quote character
- ESCAPE: `"`, `'`, `\`, and no escape character
Null strings handling
By default, Ladybug treats only empty strings ("") as NULL values. However, in certain scenarios, the default behaviour may not be sufficient. For example, if you’re working with a CSV file exported by a tool that uses the string "NULL" to represent nulls, you can modify Ladybug’s behaviour by setting the NULL_STRINGS parameter to include both the empty string and the string "NULL":
```cypher
LOAD FROM 'xxx.csv' (null_strings=[null_str1, null_str2, null_str3]) RETURN *;
```
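For the concrete scenario described above, a sketch with a hypothetical file name users_export.csv:
```cypher
// Treat both the empty string and the literal string "NULL" as NULL values.
LOAD FROM 'users_export.csv' (null_strings=['', 'NULL']) RETURN *;
```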
Compressed CSV files
To reduce file size, CSV files are often distributed in compressed formats. Ladybug supports directly scanning *.csv.gz files (compressed with gzip)
without requiring manual decompression. Simply specify the file path in the LOAD FROM or COPY FROM clause.
Assume that you compress the user.csv from above with the gzip command:
```bash
gzip -k user.csv
```
Then, you can use the LOAD FROM clause to scan the compressed file directly:
```cypher
LOAD FROM 'user.csv.gz' RETURN *;
```
```
┌─────────┬───────┬────────────┐
│ name    │ age   │ reg_date   │
│ STRING  │ INT64 │ DATE       │
├─────────┼───────┼────────────┤
│ Adam    │ 30    │ 2020-06-22 │
│ Karissa │ 40    │ 2019-05-12 │
└─────────┴───────┴────────────┘
```
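COPY FROM accepts the compressed file in the same way. A minimal sketch, assuming the same table and file as above:
```cypher
// Copy directly from the gzip-compressed file; no manual decompression needed.
COPY User FROM 'user.csv.gz' (header=true);
```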
Ignore erroneous rows
See the Ignore erroneous rows section for more details.