Sunday, March 7, 2021

Oracle Temp File Read Time Stats Problem

From time to time, we observe large discrepancies in Temp File Read Time stats between different AWR reports. Here are two examples of the AWR section "IOStat by Filetype summary" for Temp File, both with almost identical workloads. The first one shows that each Large Read takes 31.37ms, whereas the second shows only 303.76us, a difference of about 100 times.

IOStat by Filetype summary

    Filetype Name Reads: Data Reqs per sec Data per sec Writes: Data Reqs per sec Data per sec Small Read Large Read
    Temp File 640M 2.06 .177M 641M 0.85 .177M 309.84ms 31.37ms


    IOStat by Filetype summary

      Filetype Name Reads: Data Reqs per sec Data per sec Writes: Data Reqs per sec Data per sec Small Read Large Read
      Temp File 725M 1.71 .204M 725M 1.71 .204M 4.12ms 303.76us
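The size of the gap can be checked with simple arithmetic on the two averages (a quick sketch using the figures above):

```python
# Average Large Read latency from the two AWR reports above
first_report_ms = 31.37            # 31.37 ms per Large Read
second_report_ms = 303.76 / 1000   # 303.76 us per Large Read, in ms

ratio = first_report_ms / second_report_ms
print(f"ratio: {ratio:.0f}x")      # roughly 100x
```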

      In this blog, we will run tests to demonstrate that the Temp File Megabytes Read and Read Requests stats are accurately counted, but the Read Time stats are not precisely measured.

      Note: Tested in Oracle 19.7 on Solaris with usual parameters below:
         db_file_multiblock_read_count   32
         _db_file_optimizer_read_count   32 (calculate optimizer query cost at parse time)
         _db_file_exec_read_count        32 (control the multiblock reads at execution time)
         disk_asynch_io                  TRUE
         db_block_size                   8192
      
      At first, we create a GTT table of 700,000 rows:
      
      set time on timing on
      drop table test_gtt_tab purge; 
      create global temporary table test_gtt_tab on commit preserve rows as select level x, rpad('ABC', 1000, 'X') y from dual connect by level <= 7*1e5; 
      
      The GTT table has 100,096 data blocks and takes about 780 MB:
      
      select s.sid, s.serial#, tablespace, contents, segtype, segfile#, segblk#, extents, blocks, segrfno#, ts#, sql_id_tempseg
      from v$session s, v$tempseg_usage t where s.saddr=t.session_addr and sid = 189;
      
        SID    SERIAL# TABLESPACE CONTENTS  SEGTYPE    SEGFILE#  SEGBLK#  EXTENTS  BLOCKS  SEGRFNO# TS# SQL_ID_TEMPSE
      ----- ---------- ---------- --------- --------- --------- -------- -------- ------- --------- --- -------------
        189      36349 TEMP       TEMPORARY DATA           3073   477312      391  100096         1   3 0m2mbh7cxvfnr
        189      36349 TEMP       TEMPORARY TEMP_UNDO      3073   373632        1     256         1   3 0m2mbh7cxvfnr
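The BLOCKS figure of the DATA temp segment confirms the quoted segment size; converting with the 8 KB block size from the test setup:

```python
blocks = 100096                # BLOCKS of the DATA temp segment above
block_size = 8192              # db_block_size
size_mb = blocks * block_size / 1024 / 1024
print(f"{size_mb:.0f} MB")     # 782 MB, i.e. about 780 MB
```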
      
      select sid, s.serial#, p.spid, round(pga_used_mem/1024/1024) used_mb, round(pga_alloc_mem/1024/1024) alloc_mb,round(pga_max_mem/1024/1024) max_mb
      from v$session s, v$process p where s.paddr=p.addr and sid = 189;
      
        SID  SERIAL# SPID     USED_MB   ALLOC_MB     MAX_MB
      ----- -------- ----- ---------- ---------- ----------
        189    36349 27944          9          9         53
      
      Now we run a query on the GTT table and watch its read stats:
      
      SQL> select sysdate, large_read_megabytes, large_read_reqs, large_read_servicetime, small_read_megabytes, small_read_reqs, small_read_servicetime
             from v$iostat_file v where filetype_name = 'Temp File';
      
      SYSDATE  LARGE_READ_MEGABYTES LARGE_READ_REQS LARGE_READ_SERVICETIME SMALL_READ_MEGABYTES SMALL_READ_REQS SMALL_READ_SERVICETIME
      -------- -------------------- --------------- ---------------------- -------------------- --------------- ----------------------
      06:13:35               278097         1113328                   2916                  507           63543                    817
      
      SQL> select count(*) from test_gtt_tab t;
        COUNT(*)
      ----------
          700000
      
      Elapsed: 00:00:00.33
      
      SQL> select sysdate, large_read_megabytes, large_read_reqs, large_read_servicetime, small_read_megabytes, small_read_reqs, small_read_servicetime
             from v$iostat_file v where filetype_name = 'Temp File';
      
      SYSDATE  LARGE_READ_MEGABYTES LARGE_READ_REQS LARGE_READ_SERVICETIME SMALL_READ_MEGABYTES SMALL_READ_REQS SMALL_READ_SERVICETIME
      -------- -------------------- --------------- ---------------------- -------------------- --------------- ----------------------
      06:13:36               278878         1116453                   2916                  507           63545                    817
      
      We can see that the full scan of test_gtt_tab increases LARGE_READ_REQS by 3125 (1116453-1113328) and LARGE_READ_MEGABYTES by 781 (278878-278097), but LARGE_READ_SERVICETIME is not updated and still remains the same (2916).

      Each large read request fetches 32 blocks, so 3125*32 makes about 100,000 blocks (781 MB). (There are also 2 (63545-63543) SMALL_READ_REQS. For Temp File reads, we mainly focus on LARGE_READ.)
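The deltas cross-check: multiplying the request delta by the 32-block read size reproduces the megabyte delta (arithmetic on the figures above):

```python
large_req_delta = 1116453 - 1113328   # LARGE_READ_REQS delta = 3125
mb_delta = large_req_delta * 32 * 8192 / 1024 / 1024
print(large_req_delta, f"{mb_delta:.2f} MB")  # 3125 requests -> 781.25 MB
```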

      If we re-run the above query and at the same time truss its process:
      
      $ > truss -c -t pread,read,readv -p 27944
        syscall               seconds   calls  errors
        read                     .000       3
        pread                    .184    3127
                             --------  ------   ----
        sys totals:              .184    3130      0
        usr time:                .288
        elapsed:                9.680
      
      The summary report shows 3127 preads to read test_gtt_tab from the Temp File; these 3127 pread system calls correspond to the 3125 LARGE_READ_REQS and 2 SMALL_READ_REQS. All 3127 preads took 0.184 seconds of system time; usr time is 0.288 seconds.
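From the truss totals we can derive the average system-call cost of one pread (a rough figure; the per-I/O wall-clock service time would be larger):

```python
sys_seconds = 0.184   # total pread system time reported by truss
calls = 3127          # number of pread calls
avg_us = sys_seconds / calls * 1e6
print(f"{avg_us:.1f} us per pread")  # ~58.8 us
```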

      We can also truss each individual system call; the output looks like (repeated pread lines are removed):
      
      $ > truss -ADE -t pread,read,readv -p 27944
      
      /1:     read(17, 0x7FFFBDE7658E, 8208)  (sleeping...)
      /1:     06:35:26.287987     7.510987     7.510987    read(17, "\0\001 h06\0\0\0\0\011 i".., 8208)       = 360
      /1:     06:35:26.289297     0.001353     0.000065    pread(268, "10A2\0\080 H G\094 5 ` N".., 8192, 0xE9100000) = 8192
      /6:     06:35:26.289979     7.513009     0.000409    pread(268, "06A2\0\081 H G\0DA 0 ` N".., 253952, 0xE9102000) = 253952
      /7:     06:35:26.290062     7.513088     0.000433    pread(268, "06A2\0\0A0 H G\0DA 0 ` N".., 262144, 0xE9140000) = 262144
      /4:     06:35:26.290445     7.513482     0.000117    pread(268, "06A2\0\0C0 H G\0DA 0 ` N".., 262144, 0xE9180000) = 262144
      /5:     06:35:26.290516     7.513563     0.000089    pread(268, "06A2\0\0E0 H G\0DA 0 ` N".., 262144, 0xE91C0000) = 262144
      /6:     06:35:26.290725     0.000764     0.000088    pread(268, "06A2\0\0\0 I G\0DA 0 ` N".., 262144, 0xE9200000) = 262144
      /7:     06:35:26.290862     0.000822     0.000132    pread(268, "06A2\0\0   I G\0DA 0 ` N".., 262144, 0xE9240000) = 262144
      /4:     06:35:26.291015     0.000580     0.000093    pread(268, "06A2\0\0 @ I G\0DA 0 ` N".., 262144, 0xE9280000) = 262144
      /5:     06:35:26.291158     0.000643     0.000117    pread(268, "06A2\0\0 ` I G\0DA 0 ` N".., 262144, 0xE92C0000) = 262144
      ...
      /6:     06:35:26.727446     0.000574     0.000096    pread(268, "06A2\0\0\0B5 E\0DA 0 ` N".., 262144, 0xB6A00000) = 262144
      /7:     06:35:26.727545     0.000559     0.000086    pread(268, "06A2\0\0  B5 E\0DA 0 ` N".., 8192, 0xB6A40000) = 8192
      /1:     06:35:26.728299     0.438996     0.000563    read(17, "\0\0\01506\0\0\0\0\00305".., 8208)       = 21
      /1:     06:35:26.746981     0.018691     0.018558    read(17, "\0\001F206\0\0\0\0\003 ^".., 8208)       = 498
      /1:     read(17, 0x7FFFBDE9505E, 8208)  (sleeping...)
      
        Description:
          Column 1: LWP
          Column 2: time stamp at the completion of the system call (option -A)
          Column 3: time stamp delta between two calls on the same LWP (option -D), e.g. for /5, 06:35:26.291158 - 06:35:26.290516 => 0.000643
          Column 4: elapsed time between the beginning and end of a system call (wall clock time) (option -E)
          Column 5: system call with arguments and return value
      
      There are 3127 pread system calls on file descriptor 268. Two of them read 8192 bytes (1 data block, SMALL_READ): the first and the last pread. All other 3125 read 262144 bytes (32 blocks), except one which read 253952 bytes (31 blocks). There are also three read system calls on file descriptor 17. (For Oracle physical reads with pread, read and readv system calls, see the book linked on the right: Oracle Database Performance Tuning (Studies . Practices . Researches), Chapter 1: Data Access - Physical Read.)
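Summing the per-call byte counts from the truss output reproduces the reads: 3,124 full 32-block preads, one 31-block pread, and two single-block preads (arithmetic from the listing above):

```python
total_bytes = 3124 * 262144 + 1 * 253952 + 2 * 8192
total_blocks = total_bytes // 8192
print(total_blocks, f"{total_bytes / 1024 / 1024:.2f} MB")  # 100001 blocks, ~781 MB
```

These 100,001 blocks reappear as pr=100001 in the TKPROF row-source statistics below.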

      The two file descriptors in the above truss output are:
      
      $ > pfiles 27944
      
        17: S_IFSOCK mode:0666 dev:16377,0 ino:30968 uid:0 gid:0 size:0
            O_RDWR FD_CLOEXEC
              sockname: AF_INET6 ::ffff:10.116.10.50  port: 1522
              peername: AF_INET6 ::ffff:10.55.101.112  port: 52955
              SOCK_STREAM
      
       268: S_IFREG mode:0640 dev:285,65562 ino:3052 uid:100 gid:101 size:3925876736 (4GB)
            O_RDWR|O_DSYNC|O_LARGEFILE FD_CLOEXEC
            /oratestdb/oradata/testdb/temp01.dbf
            offset:0
      
      If we make a SQL trace of the same query:
      
      alter session set events='10046 trace name context forever, level 12' tracefile_identifier='read_time_stats_1';
      select count(*) from test_gtt_tab t;
      alter session set events='10046 trace name context off';
      
      The TKPROF output looks like:
      
      SQL ID: a4qgnmwbw6nxf Plan Hash: 2907084990
      select count(*) from test_gtt_tab t
      
      call     count       cpu    elapsed       disk      query    current        rows
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      Parse        1      0.00       0.00          0          1          0           0
      Execute      1      0.00       0.00          0          0          0           0
      Fetch        2      0.34       4.02     100001     100010          0           1
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      total        4      0.34       4.03     100001     100011          0           1
      
      Rows (1st)  Row Source Operation
      ----------  ---------------------------------------------------
               1  SORT AGGREGATE (cr=100010 pr=100001 pw=0 time=4029979 us starts=1)
          700000   TABLE ACCESS FULL TEST_GTT_TAB (cr=100010 pr=100001 pw=0 time=517483 us starts=1 cost=30 size=0 card=8168)
      
      Elapsed times include waiting on following events:
        Event waited on                             Times   Max. Wait  Total Waited
        ----------------------------------------   Waited  ----------  ------------
        db file sequential read                         1        0.07          0.07
        direct path read temp                         535        0.14          3.62
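With asynchronous I/O, a single 'direct path read temp' wait can reap several outstanding preads, which is why 535 waits cover the 3,125 large preads seen by truss (a rough ratio from the figures above):

```python
preads = 3125   # large preads counted by truss
waits = 535     # 'direct path read temp' waits in the trace
preads_per_wait = preads / waits
avg_wait_ms = 3.62 / waits * 1000
print(f"{preads_per_wait:.1f} preads per wait, {avg_wait_ms:.2f} ms average wait")
```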
      
      And the raw trace file contains 535 lines with nam='direct path read temp' (repeated lines removed):
      
      WAIT #140736367740528: nam='direct path read temp' ela= 21302 file number=3073 first dba=477313 block cnt=31 obj#=4062216 tim=2558073933872
      WAIT #140736367740528: nam='direct path read temp' ela= 21723 file number=3073 first dba=477440 block cnt=32 obj#=4062216 tim=2558073956542
      WAIT #140736367740528: nam='direct path read temp' ela= 863   file number=3073 first dba=475520 block cnt=32 obj#=4062216 tim=2558073957967
      ...
      WAIT #140736367740528: nam='direct path read temp' ela= 115   file number=3073 first dba=375744 block cnt=32 obj#=4062216 tim=2558077865590
      WAIT #140736367740528: nam='direct path read temp' ela= 127   file number=3073 first dba=375808 block cnt=32 obj#=4062216 tim=2558077865921
      
      Now we make a further test to select only 7000 rows:
      
      SQL> select sysdate, large_read_megabytes, large_read_reqs, large_read_servicetime, small_read_megabytes, small_read_reqs, small_read_servicetime
             from v$iostat_file v where filetype_name = 'Temp File';
      
      SYSDATE  LARGE_READ_MEGABYTES LARGE_READ_REQS LARGE_READ_SERVICETIME SMALL_READ_MEGABYTES SMALL_READ_REQS SMALL_READ_SERVICETIME
      -------- -------------------- --------------- ---------------------- -------------------- --------------- ----------------------
      07:30:35               284347         1138328                   9766                  507           63559                    887
      
      SQL> select count(*) from test_gtt_tab t where rownum <=7*1e3;
        COUNT(*)
      ----------
            7000
      
      SQL> select sysdate, large_read_megabytes, large_read_reqs, large_read_servicetime, small_read_megabytes, small_read_reqs, small_read_servicetime
             from v$iostat_file v where filetype_name = 'Temp File';
      
      SYSDATE  LARGE_READ_MEGABYTES LARGE_READ_REQS LARGE_READ_SERVICETIME SMALL_READ_MEGABYTES SMALL_READ_REQS SMALL_READ_SERVICETIME
      -------- -------------------- --------------- ---------------------- -------------------- --------------- ----------------------
      07:30:35               284355         1138361                   9766                  507           63560                    887
      
      There are 33 (1138361-1138328) LARGE_READ and 1 (63560-63559) SMALL_READ requests to read 8 (284355-284347) MB of data, but both LARGE_READ_SERVICETIME and SMALL_READ_SERVICETIME remain the same before and after.

      truss also shows 34 pread.
      
      $ > truss -c -t pread,read,readv -p 27944
        syscall               seconds   calls  errors
        read                     .000       3
        pread                    .002      34
                             --------  ------   ----
        sys totals:              .002      37      0
        usr time:                .004
        elapsed:                4.770
      
      If we repeat the same query 1000 times:
      
      SQL> select sysdate, large_read_megabytes, large_read_reqs, large_read_servicetime, small_read_megabytes, small_read_reqs, small_read_servicetime
             from v$iostat_file v where filetype_name = 'Temp File';
      
      SYSDATE  LARGE_READ_MEGABYTES LARGE_READ_REQS LARGE_READ_SERVICETIME SMALL_READ_MEGABYTES SMALL_READ_REQS SMALL_READ_SERVICETIME
      -------- -------------------- --------------- ---------------------- -------------------- --------------- ----------------------
      07:39:27               383269         1534394                   9848                  601           75561                    887
      
      SQL> begin
             for i in 1..1000 loop
               for c in (select count(*) from test_gtt_tab t where rownum <=7*1e3) loop
                 null;
               end loop;
             end loop;
           end;
           /
      
      Elapsed: 00:00:04.42
      
      SQL> select sysdate, large_read_megabytes, large_read_reqs, large_read_servicetime, small_read_megabytes, small_read_reqs, small_read_servicetime
             from v$iostat_file v where filetype_name = 'Temp File';
      
      SYSDATE  LARGE_READ_MEGABYTES LARGE_READ_REQS LARGE_READ_SERVICETIME SMALL_READ_MEGABYTES SMALL_READ_REQS SMALL_READ_SERVICETIME
      -------- -------------------- --------------- ---------------------- -------------------- --------------- ----------------------
      07:39:31               391512         1567394                   9851                  609           76561                    887
      
      There are 33000 (1567394-1534394) LARGE_READ and 1000 (76561-75561) SMALL_READ requests, and LARGE_READ_MEGABYTES increases by 8243 (391512-383269) MB. These are almost exactly 1000 times the figures of one single execution.

      LARGE_READ_SERVICETIME increases by 3ms (9851-9848) and SMALL_READ_SERVICETIME remains the same before and after.

      truss also shows 34000 preads, with sys totals of 1.938 seconds and usr time of 4.093 seconds.
      
      $ > truss -c -t pread,read,readv -p 27944
        syscall               seconds   calls  errors
        read                     .000       2
        pread                   1.938   34000
                             --------  ------   ----
        sys totals:             1.938   34002      0
        usr time:               4.093
        elapsed:                8.350
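The mismatch is now plain: the per-request time seen by the OS and the per-request time recorded in v$iostat_file differ by roughly three orders of magnitude (a comparison of the figures above, assuming the SERVICETIME columns are in milliseconds as documented):

```python
# truss: 34,000 preads consumed 1.938 s of system time
os_us_per_pread = 1.938 / 34000 * 1e6     # ~57 us per pread
# v$iostat_file: LARGE_READ_SERVICETIME grew only 3 ms over 33,000 requests
iostat_ns_per_req = 3e6 / 33000           # 3 ms = 3e6 ns -> ~91 ns per request
print(f"{os_us_per_pread:.0f} us vs {iostat_ns_per_req:.0f} ns per request")
```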
      
      If we make a SQL trace of the same query:
      
      alter session set events='10046 trace name context forever, level 12' tracefile_identifier='read_time_stats_2';
      begin
        for i in 1..1000 loop
          for c in (select count(*) from test_gtt_tab t where rownum <=7*1e3) loop
            null;
          end loop;
        end loop;
      end;
      /
      alter session set events='10046 trace name context off';
      
      TKPROF shows:
      
      SQL ID: 4nca0gxwjjwkx Plan Hash: 2658366304
      SELECT COUNT(*) FROM TEST_GTT_TAB T WHERE ROWNUM <=7*1E3
      
      call     count       cpu    elapsed       disk      query    current        rows
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      Parse        1      0.00       0.00          0          2          0           0
      Execute   1000      0.03       0.03          0          0          0           0
      Fetch     1000      4.19       4.40    1056000    1003000          0        1000
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      total     2001      4.23       4.44    1056000    1003002          0        1000
      
      Rows (1st) Rows (avg) Rows (max)  Row Source Operation
      ---------- ---------- ----------  ---------------------------------------------------
               1          1          1  SORT AGGREGATE (cr=1003 pr=1056 pw=0 time=4961 us starts=1)
            7000       7000       7000   COUNT STOPKEY (cr=1003 pr=1056 pw=0 time=5612 us starts=1)
            7000       7000       7000    TABLE ACCESS FULL TEST_GTT_TAB (cr=1003 pr=1056 pw=0 time=4484 us starts=1 cost=26 size=0 card=7000)
      
      Elapsed times include waiting on following events:
        Event waited on                             Times   Max. Wait  Total Waited
        ----------------------------------------   Waited  ----------  ------------
        reliable message                             1000        0.00          0.07
        db file sequential read                      1000        0.00          0.01
        direct path read temp                        2184        0.00          0.14
      
      We can also collect an AWR report by:
      
      exec sys.dbms_workload_repository.create_snapshot('ALL');
      begin
        for i in 1..1000 loop
          for c in (select count(*) from test_gtt_tab t where rownum <=7*1e3) loop
            null;
          end loop;
        end loop;
      end;
      /
      exec sys.dbms_workload_repository.create_snapshot('ALL');
      

      IOStat by Filetype summary

        Filetype Name Reads: Data Reqs per sec Data per sec Writes: Data Reqs per sec Data per sec Small Read Large Read
        Temp File 8.1G 6013.44 1.4G 0M 0.00 0M .00ns 151.52ns

        The whole elapsed time is about 6 seconds (8.1/1.4 or 34000/6013.44), with a total LARGE_READ time of 5 ms (34000*151.52ns). However, the system tool truss shows that the 34000 pread system calls alone take 1.938 seconds, with a usr time of 4.093 seconds; Oracle-level SQL trace shows a total elapsed time of 4.44 seconds, and the Total Waited of the 2184 "direct path read temp" events accounts for 0.14 seconds. All of those are much bigger than the disproportionately small total LARGE_READ time of 5 ms.

        Here is a quick query to list the Temp File Read Time stats. By looking at the fluctuation of the column ms_per_large_req, we can detect the discrepancies.
        
        select round(l_time_d/nullif(l_req_d, 0),8) ms_per_large_req, round(s_time_d/nullif(s_req_d, 0),8) ms_per_small_req, v.* 
         from (
           select h.end_interval_time ts, f.snap_id,  
                  large_read_megabytes, large_read_reqs, large_read_servicetime,
                  large_read_megabytes-lag(large_read_megabytes)     over (order by f.snap_id) l_mb_d,
                  large_read_reqs-lag(large_read_reqs)               over (order by f.snap_id) l_req_d,
                  large_read_servicetime-lag(large_read_servicetime) over (order by f.snap_id) l_time_d,
                  small_read_megabytes, small_read_reqs, small_read_servicetime,
                  small_read_megabytes-lag(small_read_megabytes)     over (order by f.snap_id) s_mb_d,
                  small_read_reqs-lag(small_read_reqs)               over (order by f.snap_id) s_req_d,
                  small_read_servicetime-lag(small_read_servicetime) over (order by f.snap_id) s_time_d
             from dba_hist_iostat_filetype f, dba_hist_snapshot h 
             where f.snap_id = h.snap_id and filetype_name = 'Temp File'
            order by f.snap_id) v
        -- where l_req_d > 0 or s_req_d > 0
        ;
        
        For the above AWR report, the output about LARGE_READ shows that MS_PER_LARGE_REQ is .00015152 ms, i.e. 151.52 ns.
        
        MS_PER_LARGE_REQ  TS       SNAP_ID LARGE_READ_MEGABYTES LARGE_READ_REQS LARGE_READ_SERVICETIME L_MB_D L_REQ_D L_TIME_D
        ---------------- --------- ------- -------------------- --------------- ---------------------- ------ ------- --------
                          08:28:19   68202               457449         1831394                   9924      0       0        0
               .00015152  08:28:25   68203               465691         1864394                   9929   8242   33000        5
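MS_PER_LARGE_REQ is simply the servicetime delta divided by the request delta, which confirms the 151.52 ns figure from the AWR report:

```python
l_time_d = 9929 - 9924       # LARGE_READ_SERVICETIME delta, in ms
l_req_d = 33000              # LARGE_READ_REQS delta
ms_per_large_req = round(l_time_d / l_req_d, 8)
print(ms_per_large_req, f"= {ms_per_large_req * 1e6:.2f} ns")  # 0.00015152 ms = 151.52 ns
```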
        
        v$iostat_file (dba_hist_iostat_filetype) is based on the underlying fixed view x$ksfdstfile. The above test shows that LARGE_READ_SERVICETIME is hardly updated, and is not coordinated with LARGE_READ_MEGABYTES and LARGE_READ_REQS. It is not clear how x$ksfdstfile collects the Temp File Read Time stats and when they are updated.

        In the same DB, there is also one snapshot in which MS_PER_LARGE_REQ is an implausible 5592.17391 ms (more than 5 seconds per request). It is not yet clear how to reproduce such a case.
        
        MS_PER_LARGE_REQ  TS       SNAP_ID LARGE_READ_MEGABYTES LARGE_READ_REQS LARGE_READ_SERVICETIME  L_MB_D L_REQ_D L_TIME_D
        ---------------- --------- ------- -------------------- --------------- ---------------------- ------- ------- --------
                          12:00:23   68148                  559            2330                  26162                 
              5592.17391  12:27:19   68149                  628            2606                1569602      69     276  1543440
        
        By the way, for the TEST_GTT_TAB "obj#=4062216" in the above raw trace file, we can run the query below to map it to TEMP_DATA_OBJECT_ID 4671616 in x$ktsso (the underlying fixed view of v$tempseg_usage):
        
        SQL> select o.object_name, o.object_id, ktssoobjd temp_data_object_id, ktssotsnum ts#, ktssofno segfile#,
                    ktssorfno segrfno#, ktssobno segblk#, ktssoexts extents, ktssoblks blocks
               from x$ktsso t, dba_objects o
              where t.ktssoobjn=o.object_id and (o.object_name = 'TEST_GTT_TAB' or ktssoobjd=4062216);
        
        OBJECT_NAME     OBJECT_ID TEMP_DATA_OBJECT_ID  TS#   SEGFILE#   SEGRFNO#    SEGBLK#  EXTENTS  BLOCKS
        -------------- ---------- ------------------- ---- ---------- ---------- ---------- -------- -------
        TEST_GTT_TAB      4062216             4671616    3       3073          1     477312      391  100096
        
        
        SQL> select objd, file#, block#, class#, status, dirty, temp, ts# from v$bh where objd=4671616;
        
            OBJD  FILE#  BLOCK#  CLASS# STATUS     D T  TS#
        -------- ------ ------- ------- ---------- - - ----
         4671616      1  477312       4 xcur       N Y    3
        
        The temp_data_object_id 4671616 is also referred to in v$bh as OBJD (CLASS# 4 stands for 'segment header').