LOAD DATA INFILE [options] [file_location]fname[.gz] INTO TABLE name
LOAD DATA INFILE [options] [file_location]fname[.gz] INTO FILE fname
options:
(ROW_KEY_COLUMN '=' column_specifier ['+' column_specifier ...]
| TIMESTAMP_COLUMN '=' name |
| HEADER_FILE '=' '"' filename '"'
| ROW_UNIQUIFY_CHARS '=' n
| NO_ESCAPE
| DUPLICATE_KEY_COLUMNS
| IGNORE_UNKNOWN_COLUMNS
| SINGLE_CELL_FORMAT
| RETURN_DELETES)*
column_specifier =
[ column_format ] column_name
column_format
"%0" int
| "%-"
| "%"
file_location:
"dfs://" | "file://"
The LOAD DATA INFILE command provides a way to bulk load data from an
optionally compressed file or stdin (fname of "-"), into a table. The input
is assumed to start with a header line that indicates the format of the lines
in the file. The header can optionlly be stored in a separate file and
referenced with the HEADER_FILE option. The header is expected to have the
following format:
header =
[ '#' ] single_cell_format
| [ '#' ] multi_cell_format
single_cell_format =
"row" '\t' "column" '\t' "value" '\n'
| "timestamp" '\t' "row" '\t' "column" '\t' "value" '\n'
multi_cell_format =
column | string ( '\t' ( column | string ) )*
column = column_family [ ':' column_qualifier ]
Two basic tab-delimited formats are supported, a single cell format in which each line contains a single cell, and a multi-cell format in which each line can contain a list of cells. The following example shows the single-cell format:
row column value
1127071 query guardianship
1127071 item:rank 8
1127071 click_url http://adopting.adoption.com
1246036 query polish american priests association
1246036 item:rank 6
1246036 click_url http://www.palichicago.org
12653 query lowes
12653 item:rank 1
12653 click_url http://www.lowes.com
1270972 query head hunters
1270972 item:rank 2
1270972 click_url http://www.headhunters.com
2648672 query jamie farr
2648672 item:rank 1
2648672 click_url http://www.imdb.com
...
An optional initial timestamp column can be included which represents the cell timestamp, for example:
timestamp row column value
2009-08-12 00:01:08 1127071 query guardianship
2009-08-12 00:01:08 1127071 item:rank 8
2009-08-12 00:01:08 1127071 click_url http://adopting.adoption.com
2009-08-12 00:01:18 1246036 query polish american priests association
2009-08-12 00:01:18 1246036 item:rank 6
2009-08-12 00:01:18 1246036 click_url http://www.palichicago.org
2009-08-12 00:01:14 12653 query lowes
2009-08-12 00:01:14 12653 item:rank 1
2009-08-12 00:01:14 12653 click_url http://www.lowes.com
2009-08-12 00:01:10 1270972 query head hunters
2009-08-12 00:01:10 1270972 item:rank 2
2009-08-12 00:01:10 1270972 click_url http://www.headhunters.com
2009-08-12 00:01:17 2648672 query jamie farr
2009-08-12 00:01:17 2648672 item:rank 1
2009-08-12 00:01:17 2648672 click_url http://www.imdb.com
...
The multi-line format assumes that each tab delimited field represents a cell value and the column header specifies the name of the column. Unless otherwise specified, the first column is assumed to be the rowkey. For example:
anon_id query item:rank click_url
3613173 batman signal images 18 http://www.icomania.com
1127071 guardianship 8 http://adopting.adoption.com
1270972 head hunters 2 http://www.headhunters.com
465778 google 1 http://www.google.com
12653 lowes 1 http://www.lowes.com
48785 address locator 2 http://www.usps.com/ncsc/
48785 address locator 3 http://factfinder.census.gov
2648672 jamie farr 1 http://www.imdb.com
1246036 polish american 6 http://www.palichicago.org
605089 dachshunds for sale 2 http://www.houstonzone.org
760038 stds 1 http://www.ashastd.org
When loaded into a table with a straight LOAD DATA INFILE command, the above
file will produce a set of cells equivalent to Example 1 above.
ROW_KEY_COLUMN = column_specifier [ + column_specifier ... ]
The LOAD DATA INFILE command accepts a number of options. The first is the
ROW_KEY_COLUMN option. This is used in conjunction with the multi-cell
input file format. It provides a way to select which column in the input
file should be used as the row key (default is first column). By separating
two or more column names with the '+' character, multiple column values will
be concatenated together, separated by a single space character to form the
row key. Also, each column specifier can have one of the following prefixes
to control field width and justification:
| Prefix | Description |
|---|---|
%0<n> |
For numeric columns, specifies a field width of <n> and right-justify with '0' padding |
%-<n> |
Specifies a field width of <n> and right-justification with ' ' (space) padding | %<n> |
Specifies a field width of <n> and left-justification with ' ' (space) padding |
For example, assuming the data in Example 3 above is contained in a filed
named "query-log.tsv", then the following LOAD DATA INFILE command:
LOAD DATA INFILE ROW_KEY_COLUMN="%09anon_id"+query "query-log.tsv" INTO TABLE 'anon-id-query';
will populated the 'anon-id-query' table with the following content:
000012653 lowes item:rank 1
000012653 lowes click_url http://www.lowes.com
000048785 address locator item:rank 3
000048785 address locator item:rank 2
000048785 address locator click_url http://factfinder.census.gov
000048785 address locator click_url http://www.usps.com/ncsc/
000465778 google item:rank 1
000465778 google click_url http://www.google.com
000605089 dachshunds for sale item:rank 2
000605089 dachshunds for sale click_url http://www.houstonzone.org
000760038 stds item:rank 1
000760038 stds click_url http://www.ashastd.org
001127071 guardianship item:rank 8
001127071 guardianship click_url http://adopting.adoption.com
001246036 polish american item:rank 6
001246036 polish american click_url http://www.palichicago.org
001270972 head hunters item:rank 2
001270972 head hunters click_url http://www.headhunters.com
002648672 jamie farr item:rank 1
002648672 jamie farr click_url http://www.imdb.com
003613173 batman signal images item:rank 18
003613173 batman signal images click_url http://www.icomania.com
TIMESTAMP_COLUMN = column_name
The TIMESTAMP_COLUMN option is used in conjunction with the multi-cell input
file format to specify which field of the input file should be used as the
timestamp. The timestamp extracted from this field will be used for each cell
in the row. The timestamp field is assumed to have the format YYYY-MM-DD HH:MM:SS
HEADER_FILE = "filename"
The HEADER_FILE option is used to specify an alternate file that contains
the header line for the data file. This is useful in situations where you have
log files that roll periodically and/or you want to be able to concatenate
them. This option allows the input files to just contain data and the header
to be specified in a separate file.
ROW_UNIQUIFY_CHARS = n
The ROW_UNIQUIFY_CHARS option provides a way to append a random string of
characters to the end of the row keys to ensure that they are unique. The
maximum number of characters you can specify is 21 and each character
represents 6 random bits. It is useful in situations where the row key isn't
discriminating enough to cause each input line to wind up in its own row.
For example, let's say you want to dump a server log into a table, using the
timestamp as the row key. However, as in the case of an Apache log, the
timestamp usually only has resolution down to the second and there may be
many entries that fall within the same second.
NO_ESCAPE
The NO_ESCAPE option provides a way to disable the escaping mechanism.
The newline and tab characters are escaped and unescaped when transferred in
and out of the system. The LOAD DATA INFILE command will scan the input
for the two character sequences '\' 'n', '\' 't', and '\' '\' and will convert them
into a newline, tab, and backslash, respectively. The NO_ESCAPE option
disables this conversion.
DUPLICATE_KEY_COLUMNS
Normally input fields that represent the row key (the first field or the
ones designated in the ROW_KEY_COLUMN option) are not also inserted as cell
data. This option causes the system to also insert the row key fields
as cell data.
IGNORE_UNKNOWN_COLUMNSSkip input lines that refer to unknown (non-existent) column families.
SINGLE_CELL_FORMAT
The LOAD DATA INFILE command will attempt to detect the format of the input
file by parsing the first line if it begins with a '#' character. It assumes
that the remainder of the line is a tab-delimited list of column names.
However, if the file format is the 3-field single-cell format
(i.e. row,column,value) and lacks a header, and the first character of the
row key in first line happens to be the '#' character, the parser will get
confused and interpret the first cell as the format line. The
SINGLE_CELL_FORMAT option provides a way to tell the interpreter that there
is no header format line and the format of the load file is one of the single
cell formats (e.g. row,column,value or timestamp,row,column,value) and have
it determine the format by counting the tabs on the first line.
RETURN_DELETESThe RETURNDELETES option is used internally for debugging. When data is deleted from a table, the data is not actually deleted right away. A delete key will get inserted into the database and the delete will get processed and applied during subsequent scans. The RETURNDELETES option will return the delete keys in addition to the normal cell keys and values. This option can be useful when used in conjuction with the DISPLAY_TIMESTAMPS option to understand how the delete mechanism works.
If the data file name starts with the location specifier dfs:// then the file is
read from the DFS over the DfsBroker. If it begins with the specifier file:// then
it is read from the local FS (this is the default in the absence of a location
specifier).
If the name of the input file ends with a ".gz", the file is assumed to be compressed and will be streamed in through a decompressor (gzip).