SELECT

EBNF

SELECT [CELLS] ('*' | (column_predicate [',' column_predicate]*))
  FROM table_name
  [where_clause]
  [options_spec]

where_clause:
    WHERE where_predicate [AND where_predicate ...]

where_predicate:
  cell_predicate
  | row_predicate
  | timestamp_predicate

relop: '=' | '<' | '<=' | '>' | '>=' | '=^'

column_predicate
  = column_family
  | column_family ':' column_qualifer
  | column_family ':' '/' column_qualifier_regexp '/'

cell_spec: row ',' column

cell_predicate:
  [cell_spec relop] CELL relop cell_spec
  | '(' [cell_spec relop] CELL relop cell_spec
        (OR [cell_spec relop] CELL relop cell_spec)* ')'

row_predicate:
  [row_key relop] ROW relop row_key
  | '(' [row_key relop] ROW relop row_key
          (OR [row_key relop] ROW relop row_key)* ')'
  | ROW REGEXP '"'row_regexp'"'

timestamp_predicate:
  [timestamp relop] TIMESTAMP relop timestamp

options_spec:
  (REVS revision_count
  | OFFSET row_offset
  | LIMIT row_count
  | CELL_OFFSET cell_offset
  | CELL_LIMIT max_cells
  | CELL_LIMIT_PER_FAMILY max_cells_per_cf
  | OFFSET row_offset
  | CELL_OFFSET cell_offset
  | INTO FILE [file_location]filename[.gz]
  | DISPLAY_TIMESTAMPS
  | KEYS_ONLY
  | NO_ESCAPE
  | RETURN_DELETES
  | SCAN_AND_FILTER_ROWS)*

timestamp:
  'YYYY-MM-DD HH:MM:SS[.nanoseconds]'

file_location:
  "dfs://" | "file://"

Description

SELECT is used to retrieve cells from a table. The retrieved cells are filtered with predicates for row keys, timestamps or cell values. The parser only accepts a single timestamp predicate. The '=^' operator is the "starts with" operator. It will return all rows that have the same prefix as the operand. Use of the value_predicate without the "CELLS" modifier to the "SELECT" command is deprecated.

If your query selects several independent ranges by specifying multiple row predicates (i.e. WHERE ROW < 'a' OR ROW > 'c') then the LIMIT, CELLLIMIT, OFFSET, CELLOFFSET predicates are applied to each range independently.

Options

REVS revision_count

Each cell in a Hypertable table can have multiple timestamped revisions. By default all revisions of a cell are returned by the SELECT statement. The REVS option allows control over the number of cell revisions returned. The cell revisions are stored in reverse-chronological order, so REVS=1 will return the most recent version of the cell.

OFFSET row_offset

Skips the first row_offset rows returned by the SELECT statement. This option cannot be combined with CELL_OFFSET and currently applies independently to each row (or cell) interval supplied in the WHERE clause.

LIMIT row_count

Limits the number of rows returned by the SELECT statement to row_count. The limit applies independently to each row (or cell) interval specified in the WHERE clause.

CELL_OFFSET cell_offset

Skips the first cell_offset cells returned by the SELECT statement. This option cannot be combined with OFFSET and currently applies independently to each row (or cell) interval supplied in the WHERE clause.

CELL_LIMIT max_cells

Limits the total number of cells returned by the query to max_cells (applied after CELL_LIMIT_PER_FAMILY). The limit applies independently to each row (or cell) interval specified in the WHERE clause.

CELL_LIMIT_PER_FAMILY max_cells_per_cf

Limits the number of cells returned per row per column family by the SELECT statement to max_cells_per_cf.

OFFSET row_offset

Skips the first row_offset rows returned by the SELECT statement. Not allowed in combination with CELL_OFFSET.

CELL_OFFSET cell_offset

Skips the first cell_offset cells returned by the SELECT statement. Not allowed in combination with OFFSET.

INTO FILE [file://|dfs://]filename[.gz]

The result of a SELECT command is displayed to standard output by default. The INTO FILE option allows the output to get redirected to a file.
If the file name starts with the location specifier dfs:// then the output file is assumed to reside in the DFS. If it starts with file:// then output is sent to a local file. This is also the default location in the absence of any location specifier. If the file name specified ends in a .gz extension, then the output is compressed with gzip before it is written to the file. The first line of the output, when using the INTO FILE option, is a header line, which will take one of the two following formats. The second format will be output if the DISPLAY_TIMESTAMPS option is supplied.

 #row '\t' column '\t' value

 #timestamp '\t' row '\t' column '\t' value

DISPLAY_TIMESTAMPS

The SELECT command displays one cell per line of output. Each line contains three tab delimited fields, row, column, and value. The DISPLAY_TIMESTAMPS option causes the cell timestamp to be included in the output as well. When this option is used, each output line will contain four tab delimited fields in the following order:

 timestamp, row, column, value

KEYS_ONLY

The KEYS_ONLY option suppresses the output of the value. It is somewhat efficient because the option is processed by the RangeServers and not by the client. The value data is not transferred back to the client, only the key data.

NO_ESCAPE

The output format of a SELECT command comprises tab delimited lines, one cell per line, which is suitable for input to the LOAD DATA INFILE command. However, if the value portion of the cell contains either newline or tab characters, then it will confuse the LOAD DATA INFILE input parser. To prevent this from happening, newline, tab, and backslash characters are converted into two character escape sequences, described in the following table.

 Character   Escape Sequence 
 backslash \
 '\' '\' 
 newline \n 
 '\' 'n' 
 tab \t
 '\' 't' 
 NUL \0
 '\' '0' 

The NO_ESCAPE option turns off this escaping mechanism.

RETURN_DELETES

The RETURN_DELETES option is used internally for debugging. When data is deleted from a table, the data is not actually deleted right away. A delete key will get inserted into the database and the delete will get processed and applied during subsequent scans. The RETURN_DELETES option will return the delete keys in addition to the normal cell keys and values. This option can be useful when used in conjuction with the DISPLAY_TIMESTAMPS option to understand how the delete mechanism works.

SCAN_AND_FILTER_ROWS

The SCAN_AND_FILTER_ROWS option can be used to improve query performance for queries that select a very large number of individual rows. The default algorithm for fetching a set of rows is to fetch each row individually, which involves a network roundtrip to a range server for each row. Supplying the SCAN_AND_FILTER_ROWS option tells the system to scan over the data and filter the requested rows at the range server, which will reduce the number of network roundtrips required when the number of rows requested is very large.

Examples

SELECT * FROM test WHERE ('a' <= ROW <= 'e') and
                         '2008-07-28 00:00:02' < TIMESTAMP < '2008-07-28 00:00:07';
SELECT * FROM test WHERE ROW =^ 'b';
SELECT * FROM test WHERE (ROW = 'a' or ROW = 'c' or ROW = 'g');
SELECT * FROM test WHERE ('a' < ROW <= 'c' or ROW = 'g' or ROW = 'c');
SELECT * FROM test WHERE (ROW < 'c' or ROW > 'd');
SELECT * FROM test WHERE (ROW < 'b' or ROW =^ 'b');
SELECT * FROM test WHERE "farm","tag:abaca" < CELL <= "had","tag:abacinate";
SELECT * FROM test WHERE "farm","tag:abaca" <= CELL <= "had","tag:abacinate";
SELECT * FROM test WHERE CELL = "foo","tag:adactylism";
SELECT * FROM test WHERE CELL =^ "foo","tag:ac";
SELECT * FROM test WHERE CELL =^ "foo","tag:a";
SELECT * FROM test WHERE CELL > "old","tag:abacate";
SELECT * FROM test WHERE CELL >= "old","tag:abacate";
SELECT * FROM test WHERE "old","tag:foo" < CELL >= "old","tag:abacate";
SELECT * FROM test WHERE ( CELL = "maui","tag:abaisance" OR
                           CELL = "foo","tag:adage" OR
                           CELL = "cow","tag:Ab" OR
                           CELL =^ "foo","tag:acya");
SELECT * FROM test INTO FILE "dfs:///tmp/foo";
SELECT col2:"bird" from RegexpTest WHERE ROW REGEXP "http://.*"; 
SELECT col1:/^w[^a-zA-Z]*$/ from RegexpTest WHERE ROW REGEXP "m.*\s\S";
SELECT CELLS col1:/^w[^a-zA-Z]*$/ from RegexpTest WHERE VALUE REGEXP \"l.*e\";
SELECT CELLS col1:/^w[^a-zA-Z]*$/ from RegexpTest WHERE ROW REGEXP \"^\\D+\" 
    AND VALUE REGEXP \"l.*e\";",