The TPC-DS benchmark is a decision support benchmark originally published in 2006. Since then, no one had published a fully audited result of the TPC-DS benchmark. Until now. In March of 2018, Cisco published the first audited result of the benchmark, run on Cisco UCS Integrated Infrastructure for Big Data and Analytics with Transwarp Data Hub v5.1. The result is available at http://www.tpc.org/tpcds/results/tpcds_advanced_sort.asp.
A key question presents itself: why did it take so long? What is it about the TPC-DS benchmark that makes it so challenging?
TPC-DS Test Process

The TPC-DS benchmark workload includes 99 queries, but the benchmark test is not just about those queries. The test process and metrics comprise the following six steps (refer to the TPC-DS Specification for more detail):
Data Load test (LD).
Power test (PT), which runs the 99 queries in one stream.
Throughput test 1 (TT1), which runs the 99 queries in multiple streams.
Maintenance test 1 (DM1), which updates the databases with the refresh functions.
Throughput test 2 (TT2), which reruns the 99 queries in multiple streams.
Maintenance test 2 (DM2), which updates the databases again with the refresh functions.
The 99 queries are defined as templates. Each query has several parameters, which are determined by dsqgen with a SEED that “is selected as the timestamp of the end of the database load time (Load End Time) expressed in the format mmddhhmmsss”. This timestamp “guarantees that the query substitution parameter values are not known prior to running”. In addition to the parameters, the query execution order differs from stream to stream. Figure 1 shows the execution order.
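The seed derivation can be sketched as follows. This is an illustration only: dsqgen is a C tool shipped with the TPC-DS kit, the helper name here is ours, and we read the quoted format as the ten-digit mmddhhmmss digit string.

```python
from datetime import datetime

def load_end_seed(load_end: datetime) -> int:
    """Derive a dsqgen-style numeric seed from the load end time.

    Assumes the quoted spec format corresponds to the digit string
    month-day-hour-minute-second, i.e. strftime("%m%d%H%M%S").
    """
    return int(load_end.strftime("%m%d%H%M%S"))

# Example: a load that finished on March 5 at 14:07:09
seed = load_end_seed(datetime(2018, 3, 5, 14, 7, 9))
print(seed)  # 305140709 (the leading zero of "03" drops when cast to int)
```

Because the seed is only fixed when the load finishes, query substitution parameters cannot be pre-computed or pre-tuned before the run.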
The TPC-DS performance result depends on the time taken by each step (time to generate the SQL is not part of the result). To get better results, the system needs better I/O throughput to load the data, better SQL performance for the power and throughput tests, and better transaction throughput for the maintenance tests. Not surprisingly, there are trade-offs between the different test steps on the way to an optimal result.
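The way those step timings combine can be sketched as follows. This is a toy rendering of the primary metric per our reading of the specification's metric clause; the function name and exact weighting are our assumptions, so consult the spec for the authoritative formula.

```python
from math import floor

def qphds(sf, streams, t_power, t_tt1, t_tt2, t_dm1, t_dm2, t_load):
    """Sketch of the TPC-DS v2 primary metric QphDS@SF (timings in hours).

    The four step timings enter a geometric mean, so a weak result in
    any one step (load, power, throughput, maintenance) drags down the
    whole score -- hence the trade-offs between steps.
    """
    q = streams * 99                 # total queries across all streams
    t_pt = t_power * streams         # power test time, scaled by stream count
    t_tt = t_tt1 + t_tt2             # both throughput runs
    t_dm = t_dm1 + t_dm2             # both maintenance runs
    t_ld = 0.01 * streams * t_load   # load time, down-weighted
    return floor(sf * q / (t_pt * t_tt * t_dm * t_ld) ** 0.25)

# With every adjusted timing equal to one hour, the metric is SF * 99:
print(qphds(1, 1, 1, 0.5, 0.5, 0.5, 0.5, 100))  # 99
```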
Key Factors for DBMS

According to the TPC-DS test process, SQL syntax and transaction support, optimization, stability, and scalability are the key factors for the DBMS.
SQL Syntax Support

The TPC-DS Power and Throughput tests involve 99 queries. The queries include many advanced features, such as correlated and uncorrelated sub-queries, common sub-expressions, HAVING with sub-queries, EXISTS, INTERSECT, and so on. Here is a summary of SQL syntax support for typical Hadoop-based systems and traditional vendors.
DBMS | ANSI SQL | Stored Procedures | TPC-DS 99 query templates
Hive | SQL'92 + some extensions from SQL'99 and SQL:2003 | No | No TPC-approved templates in public
Impala | SQL'92 + some extensions from SQL'99 and SQL:2003 | No | No TPC-approved templates in public
SparkSQL | SQL'92 + some extensions from SQL'99 and SQL:2003 | No | No TPC-approved templates in public
Transwarp Inceptor | SQL:2003 compatible | Yes | 99 query templates with minimal changes (approved by TPC)
Traditional DBMS (Oracle, DB2) | SQL:2003 | Yes | TPC provides standard templates for Oracle, DB2, SQL Server, etc.

According to the TPC-DS specification, only minimal changes to the query templates are allowed for the test.
Transaction Support

Data maintenance tests are performed as a mandatory part of the benchmark execution. The tests simulate the periodic synchronization with source OLTP databases, an integral part of the data warehouse lifecycle. Maintenance tests include a set of refresh runs. Refresh functions are defined in Section 5 of the TPC-DS Specification. Here is Method 2 as an example:
Delete rows from *_returns with corresponding rows in *_sales
where d_date between Date1 and Date2
Delete rows from *_sales
where d_date between Date1 and Date2
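Instantiated for the store channel, Method 2 might look like the following toy sketch. SQLite is used here only to show the SQL shape; the real TPC-DS schema resolves d_date through a date dimension, which is flattened away for brevity.

```python
import sqlite3

# Toy versions of the store channel tables; d_date is denormalized
# into the fact table here instead of joining date_dim.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_sales   (ss_ticket_number INT, ss_item_sk INT, d_date TEXT);
CREATE TABLE store_returns (sr_ticket_number INT, sr_item_sk INT);
INSERT INTO store_sales   VALUES (1, 10, '2002-09-01'), (2, 20, '2002-10-15');
INSERT INTO store_returns VALUES (1, 10), (2, 20);
""")

date1, date2 = '2002-09-01', '2002-09-14'

# Method 2, step 1: delete returns whose corresponding sale falls in the window.
conn.execute("""
DELETE FROM store_returns
 WHERE EXISTS (SELECT 1 FROM store_sales s
                WHERE s.ss_ticket_number = store_returns.sr_ticket_number
                  AND s.ss_item_sk       = store_returns.sr_item_sk
                  AND s.d_date BETWEEN ? AND ?)
""", (date1, date2))

# Method 2, step 2: delete the sales rows themselves.
conn.execute("DELETE FROM store_sales WHERE d_date BETWEEN ? AND ?",
             (date1, date2))

print(conn.execute("SELECT COUNT(*) FROM store_sales").fetchone()[0])    # 1
print(conn.execute("SELECT COUNT(*) FROM store_returns").fetchone()[0])  # 1
```

Note the ordering: the returns must be deleted first, while the matching sales rows still exist to correlate against.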
According to the specification, all the transformations must be SQL-based. Transaction support is therefore required of the DBMS, and represents a big challenge in a distributed environment. Here is a summary of transaction support.
DBMS | Transaction Support
Hive | Yes, with some limitations
Impala | No
SparkSQL | No
Transwarp Inceptor | Yes
Traditional DBMS | Yes

Optimization

There are 99 queries in the TPC-DS Power and Throughput tests. The workloads are classified into four categories: ad-hoc, reporting, iterative OLAP, and data mining. Most queries are characterized by high CPU and I/O load, so the DBMS must be smart enough to handle all of these cases.
Most DBMSs support some general optimizations such as Predicate Pushdown (PPD), Cost-Based Optimization (CBO), Rule-Based Optimization (RBO), Partition Pruning (PP), Vectorization (VEC), and Code Generation (CG). The system depends on the optimizer to achieve the best result.
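To make one of these concrete: partition pruning consults partition metadata to skip whole partitions that cannot satisfy a predicate, instead of filtering every row after a full scan. A minimal sketch of the idea, not any particular engine's implementation:

```python
# Fact data partitioned by year, as a sales table often is. In a real
# engine the partition key lives in catalog metadata, not in memory.
partitions = {
    2000: [("item-a", 2000), ("item-b", 2000)],
    2001: [("item-c", 2001)],
    2002: [("item-d", 2002), ("item-e", 2002)],
}

def scan(pred_year):
    # Prune first: keep only partitions whose key can match the predicate...
    pruned = [rows for year, rows in partitions.items() if year == pred_year]
    # ...then scan just those partitions.
    return [row for part in pruned for row in part]

print(scan(2002))  # [('item-d', 2002), ('item-e', 2002)] -- 1 of 3 partitions touched
```

For a predicate like `d_year = 2002`, two thirds of the data is never read at all, which is where most of the win comes from on large fact tables.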
Materialized Views (MV) and CUBEs are well-established techniques in traditional DBMSs for accelerating query performance. They are widely used in the TPC-H benchmark test, but they are hard to apply in the TPC-DS benchmark, because:
The TPC-DS specification restricts the use of materialization to the catalog sales channel.
The time to prepare and refresh materialization should be part of the performance result.
There are many possible combinations, and the SQL parameters differ from stream to stream.
DBMS | CBO/RBO/PPD/PP/CG/VEC | MV/CUBE
Hive | Yes | No
Impala | Yes | No
SparkSQL | Yes | No
Transwarp Inceptor | Yes | Yes
Traditional DBMS (Oracle, DB2) | Yes | Yes

For traditional systems, architectural limitations lead to a single-node bottleneck when there are many data exchanges. To reduce the impact, some vendors introduce special hardware to handle the data exchanges, which makes such systems expensive.
Stability

According to the TPC-DS execution rules, if any test fails in any step the benchmark run is invalid, and “the System Under Test (SUT), including any database server(s), shall not be restarted at any time after the power test begins until after all tests have completed.” The total process lasts a long time, so stability is another key factor, particularly for systems that run 24x7.
Many newer systems rely on aggressive in-memory computation and various kinds of caches to improve performance. Stability is a challenge for such systems when running the TPC-DS workloads without a break.
To improve stability, Transwarp Inceptor carefully tunes memory usage; for example, data is spilled to disk when the total size exceeds a threshold. For complex WITH ... AS queries, it creates temporary tables to hold intermediate results rather than inlining everything. This is the kind of trade-off between performance and stability that must be made.
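The spill strategy can be sketched like this. It is a toy buffer; Transwarp's actual mechanism is internal and certainly more sophisticated, but the trade-off is the same: bounded memory in exchange for extra disk I/O.

```python
import tempfile, pickle

class SpillableBuffer:
    """Keep rows in memory up to a threshold, then spill batches to disk."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self.rows = []
        self.spill_files = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Write the current batch out and free the memory it occupied,
        # at the cost of re-reading it from disk later.
        f = tempfile.NamedTemporaryFile(delete=False)
        pickle.dump(self.rows, f)
        f.close()
        self.spill_files.append(f.name)
        self.rows = []

    def __iter__(self):
        for name in self.spill_files:
            with open(name, "rb") as f:
                yield from pickle.load(f)
        yield from self.rows

buf = SpillableBuffer(max_in_memory=3)
for i in range(8):
    buf.add(i)
print(len(buf.spill_files), list(buf))  # 2 [0, 1, 2, 3, 4, 5, 6, 7]
```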
Scalability

Hadoop is a distributed computing framework, designed to scale easily as data volume grows. Traditional systems, by contrast, require data to be carefully deployed among the nodes; their scalability is poor, and it is hard to extend them to handle continuously growing data.
Overall

Due to limitations in SQL or transaction support, most Hadoop-based vendors cannot run the full TPC-DS benchmark. Due to architectural and optimization restrictions, traditional DBMS vendors find it difficult to pass the test or cannot achieve good results, especially in a big-data environment.
The first ever audited result of the TPC-DS benchmark was run on Cisco UCS Integrated Infrastructure for Big Data and Analytics with Transwarp Data Hub v5.1. At a 10,000 GB scale factor, it achieves a composite query-per-hour metric of 1,580,649 QphDS and a price/performance of $0.64 USD / QphDS.
More information about the TPC-DS is available at http://www.tpc.org/tpcds/default.asp
More Information about the Transwarp Data Hub can be found at: http://www.transwarp.io/?lang=en
For more information about Cisco UCS Integrated Infrastructure for Big Data and Analytics, visit: http://www.cisco.com/go/bigdata