Diary of an Engineer: Delivering 45x faster percentiles using Postgres, Citus, & t-digest

2020-09-24 07:52:18

When working on the internals of Citus (https://github.com/citusdata/citus), an open source extension that transforms Postgres into a distributed database, we often get to talk with customers that have interesting challenges you won't find everywhere. Just a few months back, I encountered an analytics workload that was a really good fit for Citus.

But we had one problem: the percentile calculations on their data (over 300 TB) could not meet their SLA of 30 seconds.

To make things worse, the query performance was not even close to the target: the percentile calculations were taking about 6 minutes instead of the required 30 seconds.

Figuring out how to meet the 30-second Postgres query SLA was a challenge because we didn't have access to our customer's data, and also because my customer didn't have the cycles to compare the performance of the different approaches I was considering. So we had to find ways to estimate which types of percentile calculations would meet their SLA, without having to spend the engineering cycles to implement each approach.

This post explores how, with the help of the Postgres open source community, I was able to reduce the time to calculate percentiles by 45x by using the t-digest (https://github.com/tvondra/tdigest) extension to Postgres.

Importance of calculating percentiles in analytics workloads

My customer operates a multi-datacenter web application with a real-time analytics dashboard that displays statistics about a variety of signals, and they store the analytics data in Hyperscale (Citus) (https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/azure-database-for-postgresql-hyperscale-citus-now-generally/ba-p/1014865) on our Azure Database for PostgreSQL (https://azure.microsoft.com/services/postgresql/) managed service. They ingest over 2 TB of data per hour and needed their queries over a 7-day period to finish in under 30 seconds. This analytics dashboard is used by their engineers to debug and root cause customer-reported issues. So they query metrics like latency, status codes, and error codes based on dimensions such as region, browser, data center, and the like.

Latency is of course an important metric for understanding these types of issues. However, average latency can be very misleading, which is where percentiles come in. If 1% of your users are experiencing super slow response times, the average query response time may not change much, leading you to (incorrectly) think that nothing is wrong. However, you would see a notable difference in P99, allowing you to isolate issues much faster.

Which is why metrics like P99 are so important when monitoring analytics workloads. A P99 query response time of 500 ms means that 99% of your queries respond in under 500 ms.

%3CH2%