How to solve the problem of selective MPP queries over > 10 TB tables without performing a full table scan?

The Problem

Partitions are a great optimization if we know which columns we're going to filter by and what kind of questions are going to be asked on that table.

But sometimes we don't know what the most common BI questions on a new kind of data are going to be. And sometimes we know them, but the huge variety of values for a column makes it impossible to partition by. For example, let's say I have 1,000,000 event-driven sensors (they send data every time a certain event occurs around them). Every hour I get records streamed to my Hive table from only 1,000 of them. After one year I'll have a really big table.

My table is partitioned by hourly DT (date time). That helps when I want to ask questions that filter on time. But what if I want to perform a certain aggregation on the values I got from a specific sensor in the past year?

I couldn't partition my table by the sensor ID, because I have 1,000,000 of those and that's too many.

If I want to perform the aggregation above with Impala/Presto/Drill, a full table scan will occur. That will be too expensive and will probably fail, as the whole table (>10TB) does not fit in memory.

So you get the problem now: analysts need to perform selective queries over really big tables, and their queries cause full (or almost full) table scans.


The Solution: Partition Index

We found a simple solution to that problem. We created something we call a partition index.

[Image: Partition Index.png]


What is a Partition Index?

To explain the idea I'll use the sensors example again. We created a dataset that is basically a dictionary in which the key is the sensor ID and the value is a list of all the DTs this sensor ID appears in. That dataset is called a partition index.

It looks like this:

[Image: Partition Index code.png]


Of course there is a much more optimized model for such an index, but for the simplicity of this article we'll stick to the dictionary model.
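As a rough illustration of the dictionary model, here is a minimal Python sketch. The sensor IDs, the `YYYY-MM-DD-HH` DT format, and the names `partition_index` and `partitions_for` are assumptions invented for this example; they are not taken from the actual dataset.

```python
# Hypothetical partition index: sensor ID -> list of hourly DT partitions
# in which that sensor has at least one record. IDs and DT format are invented.
partition_index = {
    "sensor_000017": ["2016-01-03-14", "2016-02-11-09", "2016-07-29-22"],
    "sensor_004242": ["2016-03-08-01"],
    "sensor_918273": ["2016-01-03-14", "2016-12-24-18"],
}


def partitions_for(index, sensor_id):
    """Return the DTs that contain the sensor, or an empty list if it is unknown."""
    return index.get(sensor_id, [])


print(partitions_for(partition_index, "sensor_918273"))
```

A plain dictionary keeps lookups simple; a more compact encoding would matter once the index itself grows large.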


Generating & Maintaining the Index

The process of creating this partition index for the first time is pretty heavy, as it has to process each and every record in the table. It can be done with a relatively simple Spark job.

After that's done, the only thing we need to remember is to keep updating this partition index as new partitions are added to the table.
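The build-and-maintain cycle can be sketched in plain Python. The job described above runs on Spark; this single-process stand-in only shows the logic. The record layout (`(dt, sensor_id)` pairs) and the function names are assumptions made for the example.

```python
from collections import defaultdict


def build_index(records):
    """Full build: scan every (dt, sensor_id) record once."""
    index = defaultdict(set)
    for dt, sensor_id in records:
        index[sensor_id].add(dt)
    return index


def add_partition(index, dt, sensor_ids):
    """Incremental update: register one newly added partition."""
    for sensor_id in set(sensor_ids):
        index[sensor_id].add(dt)


# Hypothetical sample data.
records = [
    ("2016-01-01-00", "s1"),
    ("2016-01-01-00", "s2"),
    ("2016-01-01-01", "s1"),
]
index = build_index(records)
add_partition(index, "2016-01-01-02", ["s2", "s3"])
print({sensor: sorted(dts) for sensor, dts in index.items()})
```

Using sets means that re-processing the same partition is harmless, which keeps the incremental update safe to retry.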


What Kind of Tables Need a Partition Index?

A partition index is for tables with a large number of partitions and diverse values among them.

It's important to note that this is not a solution for all use cases. For example, if your data is not partitioned by DT, or if it is but each partition contains all the IDs (in our example, the sensor IDs), this solution is not going to work.
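One way to sanity-check whether a table is a good candidate is to measure, per ID, the share of partitions it appears in. The function below is a hypothetical heuristic added for illustration, not part of the original setup. Values close to 1.0 mean every ID is in almost every partition, so the index would prune almost nothing.

```python
def average_partition_fraction(index, total_partitions):
    """Mean share of partitions that each key appears in."""
    if not index or total_partitions <= 0:
        raise ValueError("need a non-empty index and a positive partition count")
    fractions = [len(set(dts)) / total_partitions for dts in index.values()]
    return sum(fractions) / len(fractions)


# Invented examples: four partitions named h1..h4.
sparse = {"s1": ["h1"], "s2": ["h2"], "s3": ["h3"], "s4": ["h4"]}
dense = {"s1": ["h1", "h2", "h3", "h4"], "s2": ["h1", "h2", "h3", "h4"]}

print(average_partition_fraction(sparse, 4))  # 0.25
print(average_partition_fraction(dense, 4))   # 1.0
```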


Using The Partition Index

We created a simple application that keeps the partition index in memory and can be queried through a REST API. For optimal performance, the partition index is loaded into memory when the service starts.
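A service of this kind could look roughly like the sketch below, which uses only the Python standard library. The endpoint path `/partitions`, the `sensor_id` query parameter, the port, and the sample data are all assumptions; the article does not describe the real API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical index, loaded once at startup and kept in memory.
PARTITION_INDEX = {"s1": ["2016-01-01-00", "2016-01-01-05"]}


def lookup(index, sensor_id):
    """Build the JSON response body for one sensor ID."""
    return json.dumps({"sensor_id": sensor_id, "dts": index.get(sensor_id, [])})


class IndexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Assumed request form: GET /partitions?sensor_id=s1
        url = urlparse(self.path)
        params = parse_qs(url.query)
        if url.path != "/partitions" or "sensor_id" not in params:
            self.send_error(400, "use /partitions?sensor_id=<id>")
            return
        body = lookup(PARTITION_INDEX, params["sensor_id"][0]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Blocking call; run it from a script or a dedicated thread."""
    HTTPServer(("", port), IndexHandler).serve_forever()


print(lookup(PARTITION_INDEX, "s1"))
```

Calling `serve()` starts the server; the lookup logic lives in a separate function so it can be exercised without opening a socket.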

So now imagine I have a 10TB table, partitioned by DT, and I want to perform an aggregation on the values of a specific sensor from the past year. I first query the partition index with the sensor ID and get all the relevant DTs. Then I perform the Impala query while filtering on those DTs. Instead of a full table scan, I scan only the relevant partitions (a fraction of the data) and get the answer 100x faster.
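The two-step flow (look up the DTs, then filter on them) can be sketched as follows. The table name `sensor_events`, the columns `value` and `dt`, and the `AVG` aggregation are assumptions chosen for illustration. The SQL is assembled by string formatting only to keep the sketch short; real code should validate or escape its inputs.

```python
def build_filtered_query(table, sensor_id, dts):
    """Return SQL restricted to the given DT partitions, or None if there are none."""
    if not dts:
        return None  # the sensor never appears, so there is nothing to scan
    dt_list = ", ".join("'{}'".format(dt) for dt in sorted(dts))
    return (
        "SELECT AVG(value) FROM {t} "
        "WHERE sensor_id = '{s}' AND dt IN ({d})"
    ).format(t=table, s=sensor_id, d=dt_list)


# Hypothetical index content for one sensor.
index = {"s1": ["2016-01-01-05", "2016-01-01-00"]}

query = build_filtered_query("sensor_events", "s1", index.get("s1", []))
print(query)
```

Because the `dt IN (...)` list names partition values explicitly, the engine can skip every partition that is not listed.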


Infrastructure Implementation

We found the partition index really useful, but we wanted our analysts to simply perform a query, without even knowing that the partition index exists.

So our implementation was an application layer in the load balancer, between the client and the Impala daemons, that analyzes the incoming query and generates a query that uses the partition index.

Basically, the user sends a query to the load balancer:

[Image: query to the load balancer.png]


And in the load balancer we added code that takes the query and checks if:

  • It's on a table that has a partition index.
  • It filters by the relevant column (i.e. sensor_id).

Then it uses the partition index to generate and submit an optimized query with the relevant partitions in the where clause:

[Image: optimized query.png]
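The two checks and the rewrite can be sketched in Python as below. This is a simplified illustration, not the code that ran in the load balancer: it uses regular expressions instead of a real SQL parser, and it only handles single-table queries whose WHERE clause is the last clause. The registry `INDEXED_TABLES`, the table name `sensor_events`, and the `dt` column name are assumptions.

```python
import re

# Hypothetical registry: table name -> (indexed column, partition index).
INDEXED_TABLES = {
    "sensor_events": ("sensor_id", {"s1": ["2016-01-01-00", "2016-01-01-05"]}),
}


def rewrite_query(sql):
    """Append a DT filter when the query qualifies; otherwise return it unchanged."""
    table_match = re.search(r"\bFROM\s+(\w+)", sql, re.IGNORECASE)
    if not table_match or table_match.group(1) not in INDEXED_TABLES:
        return sql  # check 1 failed: no partition index for this table
    column, index = INDEXED_TABLES[table_match.group(1)]
    filter_match = re.search(
        r"\b" + re.escape(column) + r"\s*=\s*'([^']*)'", sql, re.IGNORECASE
    )
    if not filter_match:
        return sql  # check 2 failed: no equality filter on the indexed column
    dts = index.get(filter_match.group(1), [])
    if not dts:
        return sql  # unknown ID: leave the query as the user wrote it
    dt_list = ", ".join("'{}'".format(dt) for dt in dts)
    return "{} AND dt IN ({})".format(sql.rstrip().rstrip(";"), dt_list)


original = "SELECT AVG(value) FROM sensor_events WHERE sensor_id = 's1'"
print(rewrite_query(original))
```

Returning the query unchanged whenever a check fails keeps the layer transparent: queries it cannot optimize still run exactly as written.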


That way, analysts experience 100x faster performance on selective queries over big tables without any change in their workflow.

The partition index is a component used in an application layer between the client and the MPP engine that makes selective queries run much faster by reading only the relevant partitions.

This idea can be implemented in various ways, but overall it's a pretty easy solution. It requires no change to the data.

Original post can be found here.

Siddharth Garg
Software Development Engineer
