Never did any man yet repent of having spoken too little, whereas many have been sorry they spoke so much.
Introduction
Filter plays an important role in querying data which can be used to reduce a mountain of unwanted data, especially in this data explosive era.
There are plenty of filters provided by HBase to meet basic requirements, available filters in HBase.
Client can implement their own filter, but my personal experience is that it is not quite comprehensive to implement one, that’s why i want to write it down after summary.
Filter
Following are the core methods in Filter
.
public abstract class Filter {
// Call between rows
void reset() throws IOException;
// Filter cell based on row key
boolean filterRowKey(Cell firstRowCell) throws IOException;
// Done with filtering
boolean filterAllRemaining() throws IOException;
// ReturnCode of filter result
ReturnCode filterCell(final Cell c) throws IOException;
// If transformation needed, not usual case
Cell transformCell(final Cell v) throws IOException;
// The results from filtering, this method provides a way to modify the results before returning to client side
void filterRowCells(List<Cell> kvs) throws IOException;
// The last chance to determine whether filter the entire row
boolean hasFilterRow();
// Called after hasFilterRow()
boolean filterRow() throws IOException;
// Fast filtering by returning a target cell to seek wanted position
Cell getNextCellHint(final Cell currentCell) throws IOException;
// Called at initialization
boolean isFamilyEssential(byte[] name) throws IOException;
// ReturnCode from method filterCell()
enum ReturnCode {
// Include the Cell
INCLUDE,
// Include the Cell and seek to the next column skipping older versions.
INCLUDE_AND_NEXT_COL,
// Skip this Cell
SKIP,
// Skip this column. Go to the next column in this row.
NEXT_COL,
/**
* Seek to next row in current family. It may still pass a cell whose family is different but
* row is the same as previous cell to {@link #filterCell(Cell)} , even if we get a NEXT_ROW
* returned for previous cell.
*/
NEXT_ROW,
// Seek to next key which is given as hint by the filter.
SEEK_NEXT_USING_HINT,
// Include KeyValue and done with row, seek to next. See NEXT_ROW.
INCLUDE_AND_SEEK_NEXT_ROW,
}
}
Sequence
But it is still confusing, so following it is the call sequence.
isFamilyEssential
is called first, but it does no filtering, it just helps to lock on the specified column family which may bring performance benefits by avoiding unnecessary column family scanning.filterRowKey
andfilterCell
can be regarded as entirety, first key then value, this philosophy is straightforward for a nosql system.filterRowKey
is called to determine an entire row, whilefilterCell
is called to determine a cell which returns aReturnCode
to indicate the next step.filterAllRemaining
may be called between filter key and filter value, but it is based on concrete implementation. LikePageFilter
does filtering based on size of page, after a page is filled,filterAllRemaining
is directly called to speed up filtering.getNextCellHint
is just another kind offilterCell
, but it does further filtering based on targeting to wanted cell, and it will be called only afterfilterAllRemaining
returns false andfilterCell
returnsSEEK_NEXT_USING_HINT
.transformCell
is called to transform a cell into user wanted only afterfilterAllRemaining
returns false andReturnCode
isINCLUDE***
, but it is not often used.hasFilterRow
,filterRow
andfilterRowCells
can be regarded as an entirety. IfhasFilterRow
returns false, those two left methods will not be called.filterRowCells
is called to modify the pending return results list to client, it is not often used either. ThenfilterRow
is called, the last chance to filter out entire row. LikeSingleColumnValueFilter
though found matched column, but didn’t find matched value, and this row would be filtered out byfilterRow
method.- 1~6 should finish an entire row, and
reset
is called between rows to reset filter status and continues next row.
Thoughts
- In server side code base, methods are called scatteredly, it can’t be regard as a good looking framework.
- Considering combination of
Coprocessor
framework, it becomes more complicated, not to mention the procedure logic of scan/get.