Ok. Simple yet proves the point if anyone is interested. Pick your own
table and/or query etc... this is just proof of concept. Most of the
code is for formatting the output.
set serveroutput on size 100000
execute sample_rand(&log_flt_id,&percent)
************************************************************************
****
create or replace procedure sample_rand
(in_logical_flight_id IN NUMBER,
in_percent IN NUMBER) AS
--
************************************************************************
****-- * Script: SAMPLE_RAND.SQL
-- * Author: Joel Patterson
-- * Usage: SAMPLE_RAND(<logical_flight_id>, <percent>)
-- * Function: selects in_percent rows from total number of rows
--
************************************************************************
****
c_dep_date varchar2(24);
c_sch_arr varchar2(24);
c_sch_dep varchar2(24);
n_rows number;
n_row rowid;
cursor c_sample is
select rowid
from (
select rowid, count(*) over() total
from logical_flights
where logical_flight_id = in_Logical_flight_id
order by dbms_random.random()
)
where rownum <= n_rows;
cursor c_rows is
select * from logical_flights
where rowid = n_row;
-- cursor c_sample is
-- select *
-- from logical_flights
-- where logical_flight_id = In_logical_flight_id
-- and dbms_utility.get_hash_value(dump(rowid),0,n_rows) <=
in_percent
-- order by departure_date;
begin
--> RETRIEVE TOTAL NUMBER OF ROWS (will always return som
select ceil(count(*) * in_percent/100) into n_rows
from logical_flights
where logical_flight_id = in_Logical_flight_id;
dbms_output.put(rpad('DEPARTURE_DATE',21,' '));
dbms_output.put(rpad(lpad('DO_AVS',6,' '),7,' '));
dbms_output.put(rpad('FCC',4,' '));
dbms_output.put(rpad('FLT_STAT',9,' '));
dbms_output.put(rpad(lpad('LFID',6,' '),7,' '));
dbms_output.put(rpad('SCH_ARRIVAL',21,' '));
dbms_output.put(rpad('SCH_DEPARTURE',21,' '));
dbms_output.put(rpad('STA_STATUS',11,' '));
dbms_output.put_line(' comment');
dbms_output.put(rpad('-',20,'-')||' ');
dbms_output.put(rpad('-',6,'-')||' ');
dbms_output.put(rpad('-',3,'-')||' ');
dbms_output.put(rpad('-',8,'-')||' ');
dbms_output.put(rpad('-',6,'-')||' ');
dbms_output.put(rpad('-',20,'-')||' ');
dbms_output.put(rpad('-',20,'-')||' ');
dbms_output.put(rpad('-',10,'-')||' ');
dbms_output.put_line(rpad('-',19,'-'));
FOR cs in c_sample LOOP
n_row := cs.rowid;
FOR cr in c_rows LOOP
c_dep_date := to_char(cr.departure_date, 'yyyy-mon-dd
hh24:mi:ss');
c_sch_arr := to_char(cr.sch_arrival, 'yyyy-mon-dd hh24:mi:ss');
c_sch_dep := to_char(cr.sch_departure, 'yyyy-mon-dd hh24:mi:ss');
dbms_output.put(rpad(c_dep_date,21,' '));
dbms_output.put(rpad(lpad(to_char(cr.do_avs),6,' '),7,' '));
dbms_output.put(rpad(nvl(cr.flight_cancel_code,' '),4,' '));
dbms_output.put(rpad(cr.flight_status,9,' '));
dbms_output.put(rpad(lpad(to_char(cr.logical_flight_id),6,' '),7,'
'));
dbms_output.put(rpad(c_sch_arr,21,' '));
dbms_output.put(rpad(c_sch_dep,21,' '));
dbms_output.put_line(rpad(cr.station_status,11,' '));
END LOOP;
END LOOP;
dbms_output.put_line(chr(10));
dbms_output.put_line(n_rows||' Rows');
end; -- sample_rand.sql
/
show errors
-----Original Message-----
From: Anthony Molinaro
[mailto:oracledba-ezmlmshield-x25314062.[Email address protected]
Sent: Friday, July 29, 2005 10:30 AM
To: LazyDBA Discussion
Subject: RE: GET HASH VALUE
Joel,
SAMPLE it "tricky".
So, straight from the oracle doc:
"Row sampling reads rows without regard to their physical placement on
disk. This method provides the most random data for estimates, but it
can result in reading more data than necessary. For example, in the
worst case a row sample might select one row from each block, requiring
a full scan of the table or index."
So, the value from sample depends on how much is read during the
sampling (sorta like using "estimate statistics") different executions
may read different amounts of data so each exec returns slightly
different info.
"Block sampling reads a random sample of blocks and uses all of the rows
in those blocks for estimates. This method reduces the amount of I/O
activity for a given sample size, but it can reduce the randomness of
the sample if rows are not randomly distributed on disk."
Unless you specified how many rows per block in your table, block
Sampling aint gonna be 100% either.
So, the "problem" if you wanna call it that, is SAMPLE, not you ;)
As for hash_value, you can get creative and do all sorts of things with
these functions, but not sure hashing is gonna give you what you want.
I mean, hash functions in general exist to take strings and create Some
numeric value for them based on whatever algorithm you've chosen To do
so. Anyone who's taken a data structures and/or algorithms Class knows
there's many ways to hash, but in general It's done in such a way that
similar values are hashed further Apart. For example:
SQL> select dbms_utility.get_hash_value('joe',0,100) from dual;
DBMS_UTILITY.GET_HASH_VALUE('JOE',0,100)
----------------------------------------
18
SQL> select dbms_utility.get_hash_value('jon',0,100) from dual;
DBMS_UTILITY.GET_HASH_VALUE('JON',0,100)
----------------------------------------
42
only one char different but very different hashed values.
SQL> select dbms_utility.get_hash_value('jay',0,100) from dual;
DBMS_UTILITY.GET_HASH_VALUE('JAY',0,100)
----------------------------------------
93
SQL> select dbms_utility.get_hash_value('jan',0,100) from dual;
DBMS_UTILITY.GET_HASH_VALUE('JAN',0,100)
----------------------------------------
5
again, same thing.
Now, lemme ask you this. I selected against dual,
Which we all know has one row, but based on the
Values returned from those queries I just showed,
would you know that? How would you get 5% of the rows
from dual? You can't, Know what I mean?
So, to get a hash size of 100, you can use a base of 1 and a hash_size
of 100 (use 100 as the 3rd parameter to get_hash_value). Now, once you
do that, you're saying to the function, return Me some number between 1
and 100 based on the string I am passing you. Ok, so, it returns some
random number based on how it hashed your string And the start and end
points you've defined,
how do you turn that into N percent?
You can't really, cuz the values returned are some random number, It
could be 1, could be 99, who knows, but it has nothing to do with The
number of rows in the table.
so, if you say:
'where get_hash_value < 5'
you are asking for rows where the string hashed is less than the number
5. maybe only 1 row hashed to 5, maybe all of them, you don't know, so,
there's no direct correlation between the value returned from a hash
function and the cardinality of the set the value it is hashing from.
The cardinality of the set is what you need to do an accureate % sample.
Or at the very least, you need to read a minimum amt of rows To give you
N%. For example, if your table has 10 rows, then you Couldn't even get
5%, but if it' had 100 rows and you wanted A random 5% you need to read
a minimum of 20 rows (5% of 20 is 1, the smallest Random sample you can
return). So, really there's no correlation Between hashed strings and
pcts of a sets rows. Know what I mean?
- a
-----Original Message-----
From: Patterson Joel
[mailto:oracledba-ezmlmshield-x14639030.[Email address protected]
Sent: Friday, July 29, 2005 9:40 AM
To: LazyDBA Discussion
Subject: RE: GET HASH VALUE
Yes, once again you pull through with a superior solution. I like it
better... But whenever I stumble across something new I try and make
it work, (even if it is intended to work the way I am using it), and it
becomes rather frustrating when you don't really know the intentions
behind the feature. ie why doesn't sample return the right amount of
rows? And if not, why not? (surely this is intentional or oracle
wouldn't put it out there... would they?) Or, why can't I understand
what I'm doing with get_hash_value? (ok some of you: if it's not good
don't say it).
So since I'm going to go towards your solution (and believe me there is
more to this problem than has been revealed), all that is left on this
tangent is to gain some insight into the get_hash_value.
I realize that I was going to get the same rows each time, (which
instinctively is bad, but not sure whether was unacceptable) -- except
when I concatenate the get_time function to the first parameter, then I
can get a different number of rows depending upon when I ran it.
------
I'm starting to suspect that I'm asking to much of this function as to
my problem, but now I want to understand it a little better.
Let's start with I want a number (percent) from 0 to 100. so do I use a
hash size of 100? (not a power of 2). If I want 5% do I ask 'where
get_hash_value < 5'?
Now I have probably already answered both of these questions in that I
am not getting what I want, so this really means I'm not using the
function for it's intended purpose. This has never stopped me from
trying to bend steel before, as is the case now. I'll give up shortly
after I have exhausted any input from this discussion as your solution
is clearly easier.
Sorry for being verbose.
--------
website: http://www.LazyDBA.com
Please don't reply to RTFM questions
Oracle documentation is here: http://tahiti.oracle.com
To unsubscribe: see http://www.lazydba.com/unsubscribe.html
To subscribe: see http://www.lazydba.com
By using this list you agree to these
terms:http://www.lazydba.com/legal.html
Oracle LazyDBA home page