4 EVALUATION

In this section, we evaluate the proposed IGS-TLB with the gem5 simulator. We first describe the experimental setup, then present and analyze the experimental results, and finally present the sensitivity studies.

4.1 Experimental Setup

We implement IGS-TLB based on the gem5 simulator [26]. The GPU model in gem5 simulates the GCN3 architecture proposed by AMD.

Simulator. We have extended gem5 to model IGS-TLB. Specifically, we decouple the L1 TLB from the CU and add an interconnection network (crossbar) between the L1 TLBs and the CUs, incorporate the record table structure of the aggregation module into the L1 TLB, and integrate the address partitioning scheme and the request aggregation method into the TLB hierarchy (a simplified sketch of this lookup path is given at the end of this subsection). All execution latencies are accounted for in our evaluation.

GPU System. In the GCN3 model, every four CUs share an instruction cache, and within a CU all threads share the same fetch/decode component. The other default system parameters for the experiment are listed in Table 1.

Benchmarks. We select a diverse set of workloads from the GPU application suites Rodinia [14], Pannotia [13], DNNmark [15], and HIPExamples [5]. Specifically, the workloads are KM, NW, DWT, Mis, Softmax, Pool, MM, and MT. A description of each workload and its input data set is given in Table 2.

To verify the effectiveness of the IGS-TLB optimization scheme, we compare IGS-TLB with the traditional GPU address translation architecture (Baseline) and the current state-of-the-art TLB sharing schemes (Valkyrie and NeiDty):

Baseline: The traditional GPU address translation architecture described in Section 2.1. Each L1 TLB is private to its CU and does not communicate with the others.

Valkyrie [10]: Valkyrie probes remote L1 TLBs to exploit data sharing among them, using a bidirectional ring network that interconnects all L1 TLBs in each SE (16 L1 TLBs). The authors also propose several optimizations to reduce the probing overhead; these optimizations are simulated in this paper as well.

NeiDty [16]: NeiDty is a neighboring-TLB sharing scheme based on a shared directory table. NeiDty is similar to Valkyrie but has the following advantages: 1) it limits the sharing scope of L1 TLBs and only shares among neighboring L1 TLBs; 2) it adds a directory table to store the shared information, reducing the probing overhead.
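To make the decoupled lookup path concrete, the sketch below shows one way the group-interleaved address partitioning and the request-aggregation record table described above could operate. It is a minimal illustration under stated assumptions, not the actual gem5 implementation: the group size, the partition function (VPN modulo group size), and names such as home_tlb and RecordTable are placeholders introduced here for illustration.

```python
# Illustrative sketch of IGS-TLB's lookup path (assumed details, not the
# actual gem5 model): each virtual page maps to exactly one L1 TLB in its
# group, and concurrent requests to the same page are aggregated.

GROUP_SIZE = 4          # assumed number of L1 TLBs per sharing group
PAGE_SHIFT = 12         # 4 KB pages

def home_tlb(vaddr, group_size=GROUP_SIZE):
    """Address partitioning: pick the owning L1 TLB within a group.

    Simple VPN interleaving is assumed here; the real partition
    function may differ.
    """
    vpn = vaddr >> PAGE_SHIFT
    return vpn % group_size

class RecordTable:
    """Request aggregation: merge in-flight requests to the same page so
    that only one miss per page is forwarded to the L2 TLB."""

    def __init__(self):
        self.pending = {}            # vpn -> list of waiting CU requests

    def lookup_or_record(self, vpn, cu_req):
        if vpn in self.pending:      # another CU already missed on this page
            self.pending[vpn].append(cu_req)
            return False             # do not send another L2 TLB request
        self.pending[vpn] = [cu_req]
        return True                  # first miss: forward it to the L2 TLB

    def fill(self, vpn, translation):
        """On the L2 TLB response, wake up every aggregated requester."""
        return [(req, translation) for req in self.pending.pop(vpn, [])]

# Usage: a CU sends its request over the crossbar to home_tlb(vaddr); on an
# L1 miss, lookup_or_record() decides whether the L2 TLB sees the request.
```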
4.2 Experimental Results

In the following, we present our evaluation of the IGS-TLB optimization method. The evaluation covers the address translation performance improvement, the L1 TLB hit ratio, the L2 TLB traffic, and the overall power overhead of IGS-TLB.

Performance results. Figure 11 shows the speedup in average address translation latency for the four schemes (Baseline, Valkyrie, NeiDty, and IGS-TLB), normalized to the Baseline. Compared with Valkyrie and NeiDty, IGS-TLB achieves better performance on most workloads, providing an additional 19% and 22% speedup, respectively.

Specifically, NW has a very high TLB miss rate (88% for the L1 TLB and 84% for the L2 TLB), and only 6% of its page table entries are duplicated, distributed across adjacent L1 TLBs. These two factors prevent Valkyrie and NeiDty from increasing the effective L1 TLB capacity; they even degrade address translation performance (by 16% and 21%, respectively) due to the added access overhead among L1 TLBs. IGS-TLB, in contrast, still effectively increases the L1 TLB capacity when the sharing degree is low, providing a 45% performance speedup. For MT, due to poor locality, the speedups of Valkyrie and NeiDty are negligible (0.6% and 0.5%, respectively). For the other workloads, Valkyrie, NeiDty, and IGS-TLB all provide an effective speedup. However, NeiDty exploits only the sharing degree between adjacent L1 TLBs, while for many workloads the duplication is not confined to adjacent L1 TLBs (see Section 2.2.2); its speedup is therefore limited for some workloads (such as MM and KM). Valkyrie outperforms NeiDty thanks to its wider sharing scope, but both schemes remove only a limited number of duplicate L1 TLB entries and access the shared information at a certain cost. IGS-TLB directly eliminates duplication within a group and thus provides the best overall performance speedup.

Hit ratio of L1 TLB. As shown in Figure 12, IGS-TLB improves the L1 TLB hit ratio for all workloads; compared with the Baseline, the average hit ratio increases by 18%. The results also show that Valkyrie and NeiDty can hardly improve the L1 TLB hit ratio, and for some workloads such as NW, MM, and Softmax the hit ratio is even lower than the Baseline. Although Valkyrie and NeiDty exploit the sharing characteristics among L1 TLBs to relieve the request pressure on the L2 TLB, they probe a remote L1 TLB only after the local L1 TLB misses. On the one hand, this increases the traffic among L1 TLBs; on the other hand, these methods cannot effectively reduce the duplicate page table entries among L1 TLBs, so capacity is still wasted. IGS-TLB instead completely eliminates duplication among the L1 TLBs within a group while avoiding network contention for inter-TLB communication, and therefore effectively improves the L1 TLB hit ratio.

Traffic of L2 TLB. We collect statistics on the traffic between the L1 TLBs and the L2 TLB; the results, normalized to the Baseline, are shown in Figure 13. Across all workloads, IGS-TLB reduces the traffic by 37% on average. For DWT and Softmax, IGS-TLB reduces L2 TLB traffic by 64% and 61%, respectively, and even for MT it reduces the traffic by 5%. Compared to Valkyrie and NeiDty, IGS-TLB removes more L2 TLB traffic: 18% more than Valkyrie and 22% more than NeiDty on average. In particular, for NW, Valkyrie and NeiDty cannot effectively relieve the pressure on the L2 TLB, while IGS-TLB reduces the L2 TLB traffic by 47%. This confirms the efficient utilization of the L1 TLB by IGS-TLB.

Power results. To analyze the scalability of IGS-TLB, we use DSENT to estimate the overall power consumption. Specifically, we use gem5 to collect the configuration and activity of all components and feed the collected information into DSENT to compute the total power. The results are shown in Figure 14. For IGS-TLB, apart from energy increases of 8% for MT and 3% for Softmax, almost no additional energy is consumed on the other workloads. Since IGS-TLB improves the utilization of the L1 and L2 TLBs, the power consumption of many workloads even decreases. This indicates that the hardware overhead of IGS-TLB is low enough to support larger-scale deployment.
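As a rough illustration of how the numbers behind Figure 14 can be assembled, the sketch below parses per-component activity counters from a gem5 stats.txt dump and combines them with per-access energy figures. This is only a simplified stand-in for the actual flow: the real evaluation feeds the collected configuration and activity into DSENT, whose interface is not reproduced here, and the stat names and energy values in the sketch are hypothetical placeholders.

```python
# Simplified, illustrative power-accounting sketch. The stat names and the
# per-access energy numbers below are placeholders; the paper's evaluation
# derives component power with DSENT rather than with fixed constants.
import re

# Hypothetical per-access energy (pJ) for each modelled component.
ENERGY_PJ = {
    "l1_tlb_access": 1.0,
    "l2_tlb_access": 3.0,
    "crossbar_flit": 0.5,
}

def read_gem5_stats(path):
    """Parse 'name value ...' lines from a gem5 stats.txt dump."""
    stats = {}
    pattern = re.compile(r"^(\S+)\s+([\d.eE+-]+)")
    with open(path) as f:
        for line in f:
            m = pattern.match(line)
            if m:
                stats[m.group(1)] = float(m.group(2))
    return stats

def estimate_energy(stats, counter_map):
    """counter_map maps a component key in ENERGY_PJ to a stat name
    (hypothetical names, e.g. 'system.l1_tlb0.accesses')."""
    total_pj = 0.0
    for component, stat_name in counter_map.items():
        total_pj += stats.get(stat_name, 0.0) * ENERGY_PJ[component]
    return total_pj / 1e12   # joules
```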
4.3 Sensitivity Studies

In the sensitivity experiments, we vary the L1 TLB size and the L2 TLB size to analyze the performance of the IGS-TLB optimization scheme under different configurations.

L1 TLB size. To explore the effect of IGS-TLB under different L1 TLB sizes, we conduct experiments with L1 TLB capacities of 16, 32, and 64 page table entries, collecting the address translation performance and the reduction in L2 TLB traffic, and normalize the results to the Baseline.

From Figure 15, we observe that for most workloads the address translation speedup of IGS-TLB remains stable as the capacity increases. For Softmax and KM, the speedup decreases when the L1 TLB capacity reaches 64 entries, because the Baseline itself benefits more from the larger TLB.

Figure 16 shows that for most workloads the reduction in L2 TLB traffic grows as the L1 TLB capacity increases: with more capacity, the L1 TLB serves more requests at a higher hit ratio, so the traffic reduction becomes more pronounced. For DWT, the traffic reduction does not improve with larger L1 TLBs, but the fluctuation is small. For MT, the fluctuation is relatively large due to its poor data locality, yet IGS-TLB still reduces the L2 TLB traffic.

L2 TLB size. Similarly, we conduct sensitivity experiments with varying L2 TLB capacities; the results are shown in Figure 17 and Figure 18. We make two observations. First, for most workloads, varying the L2 TLB capacity has little effect on either the performance or the traffic between the L1 TLBs and the L2 TLB, because IGS-TLB mainly optimizes the L1 TLB. IGS-TLB achieves a similar performance speedup under different L2 TLB capacities, which confirms its effectiveness. Second, the overall performance of Softmax is very sensitive to the L2 TLB capacity: as the capacity grows, IGS-TLB achieves a higher address translation speedup, mainly because a large number of Softmax's requests hit in the L2 TLB.
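The normalization used throughout the evaluation and the sensitivity study (Figures 11, 13, 15, and 16) reduces to a small amount of arithmetic; the sketch below shows the assumed computation of speedup and traffic reduction relative to the Baseline. The variable names and the example numbers are ours, not values taken from the simulator.

```python
# Assumed post-processing of per-configuration results: speedup and L2 TLB
# traffic reduction are reported relative to the Baseline configuration.

def speedup(baseline_avg_latency, scheme_avg_latency):
    """Speedup of average address translation latency over the Baseline."""
    return baseline_avg_latency / scheme_avg_latency

def traffic_reduction(baseline_l2_requests, scheme_l2_requests):
    """Fraction of L1-to-L2 TLB traffic removed relative to the Baseline."""
    return 1.0 - scheme_l2_requests / baseline_l2_requests

# Example with made-up numbers: lowering the average translation latency from
# 400 to 320 cycles is a 1.25x speedup, and cutting L2 TLB requests from
# 1.0M to 0.63M corresponds to a 37% traffic reduction.
print(speedup(400, 320), traffic_reduction(1_000_000, 630_000))
```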