Sorry, the formatting got removed. So resending the question.
Hello,
I have 2,000,000 records in an RDBMS (Oracle 9i/10g or SQL Server 2000)
Let me define the question in the following points:
1. I have 2 Million records like the following
TABLE NAME: NamesTable
Columns: NameId, NameString, length,
Example Data:
Row1: "R2ZRWRTY",
"RRRARAZZZRAAZRZARWZWZRWRZWZAZRRRZWRZRRRRZRWRRRZAARAZZRWRZWZZWZRRRRARZWRZRRR
ZAWRRZZRWRZRRZAZAARWRRRRANZRRRA", 106
Now in the above structure of table assume you have 2 million records.
NameString length is never more than 200.
2. In the "NamesTable" Table Remove all the DUPLICATE "NameString".
3. In the table left after removing duplicates, REMOVE ALL THE SUB -
NameString.
For Example, If NameString1="WORLDISGREAT" and NameString2 = "GREAT",
then you can clearly see that NameString2 exists in NameString1. So REMOVE
NameString2 because it is contained in NameString1.
So remove all Sub-NameStrings from the table.
How do you think we should do this? We can use Oracle 9i/10g or Sql Server
2000. Considering that there is so much of data and there are so many
possible combinations.
What is the best possible and the fastest solution to solve this kind of
DUPLICATE AND SUB-STRING REMOVAL PROBLEM.
Please Help
Thanks and Regards
Ranjit
DB2 & UDB email list listserv db2-l LazyDBA home page