A relatively small change to the way SAP represented characters four years ago is threatening to complicate upgrades to the latest edition of its software. SAP's implementation of Unicode - now widely adopted - has massively expanded the amount of data held in the databases underpinning many customers' SAP systems, increasing …
I wonder why they need four bytes? Unicode only requires two bytes. I recall that there was a pre-Unicode standard that simply mashed all the known character sets together, and this required four bytes - perhaps that's what they've done?
A terabyte disk costs a few hundred bucks. That's enough for 4000 unicode chars on everyone in the UK.
It's hard not to conclude that this is another instance of the IT industry making a huge meal of a very simple problem.
"Macro4 SAP product specialist Markus Fehr told The Reg..."
"... that Unicode requires four bytes of memory per character"
Is he a Unicode expert then? I think not. I don't believe that; I'm not convinced at all.
AFAIK most (Western) Unicode code points can be represented with two bytes (exceptions are Han and its ilk, I understand), and if most of your data is pre-Unicode then it's presumably ASCII, which can be represented as UTF-8 with no change in representation at all.
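A quick check in Python bears this out (the sample strings are my own, purely illustrative):

```python
# ASCII text is byte-for-byte identical in UTF-8, so pre-Unicode data
# needn't grow at all; most Western accented characters take 2 bytes,
# Han characters 3.
s = "legacy ascii data"
assert s.encode("utf-8") == s.encode("ascii")   # no change in representation
assert len("é".encode("utf-8")) == 2            # Latin-1 supplement: 2 bytes
assert len("漢".encode("utf-8")) == 3           # CJK: 3 bytes in UTF-8
```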
And if "It's static or historical information" then it doesn't need changing at all at all.
They couldn't possibly have gone for UTF-32 without considering the implications, could they?
Oracle seems to have gone for UTF-8 whereas SAP has gone for UTF-32.
"Macro4 SAP product specialist Markus Fehr told The Reg that Unicode requires four bytes of memory per character"
It does, if you use UTF-32, which is an odd choice, as most Unicode based systems use UCS-2 or UTF-8. Perhaps SAP needs to support Old Persian and Mahjong tiles?
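A sketch of why the "four bytes per character" line only holds for UTF-32 (sample strings are mine, not from the article; the `-le` codec variants avoid Python adding a BOM):

```python
# UTF-32 really is a flat 4 bytes per code point; UCS-2/UTF-16 and UTF-8
# are much smaller for typical Western business data.
s = "Hello SAP"                            # 9 ASCII characters
assert len(s.encode("utf-32-le")) == 36    # UTF-32: always 4 bytes/char
assert len(s.encode("utf-16-le")) == 18    # UTF-16: 2 bytes for BMP chars
assert len(s.encode("utf-8")) == 9         # UTF-8: 1 byte for ASCII
# Old Persian, on the other hand, genuinely needs 4 bytes even in UTF-8:
assert len("𐎠".encode("utf-8")) == 4      # U+103A0, OLD PERSIAN SIGN A
```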
As someone who has recently been looking into the Windows API, I can say Unicode is a massive headache all around. Well, actually it's pretty straightforward as a concept, and fine as long as everything's using it, but any programs you have written that make silly assumptions, like how big a character is (strangely this comes up occasionally when trying to process a string), need to be re-written with this in mind.
@rich, corporate storage bears no relation to real world storage. You have to take into account daily backups and transaction logs, possibly needing to be kept for years. Suddenly doubling, or quadrupling, the size of every character will have a massive financial impact.
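A toy back-of-envelope calculation makes the point (the numbers are mine, assuming full daily backups retained for a year, not anything from the thread):

```python
# Quadrupling every character multiplies the whole retained pile,
# not just the live database.
live_tb = 1.0                    # assumed: 1 TB of live text data
copies = 1 + 365                 # live DB plus a year of daily full backups
before_tb = live_tb * copies     # retained storage before conversion
after_tb = before_tb * 4         # worst case if every byte becomes four
print(before_tb, after_tb)       # 366.0 1464.0
```

So the "few hundred bucks a terabyte" figure gets multiplied by retention policy before you even start.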
"Macro4 SAP product specialist Markus Fehr told The Reg that Unicode requires four bytes of memory per character" ... which as a sound bite is technically correct re. UTF-32
Trouble is, SAP uses UTF-16 on the application server layer and either UTF-8, CESU-8, or UTF-16 on the database layer.
Perhaps the 'SAP product specialist' should read SAP's FAQs.
Clarification on the impact of Unicode encoding for SAP upgrades
Unicode can involve using up to 4 bytes for some characters, depending on the encoding scheme used. For SAP systems the encoding scheme in the database varies from vendor to vendor. For example, Oracle uses CESU-8 and MS SQL Server uses UTF-16.
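This "up to 4 bytes, depending on the encoding" point can be illustrated quickly (sample characters are mine; CESU-8 isn't in Python's stdlib codecs, so UTF-8 stands in for it here - the two differ only for supplementary-plane characters):

```python
# A supplementary-plane character costs 4 bytes in every scheme,
# while an ASCII character varies from 1 to 4 bytes.
for ch in ("A", "𝄞"):                      # U+0041 vs U+1D11E
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(ch, enc, len(ch.encode(enc)))
```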
The key issue with upgrades to ERP 6.0 however is that all data in the database needs to be converted to Unicode - and this means additional downtime for the SAP system. Therefore a data archiving strategy is best implemented before an upgrade as it will reduce the overall volume of data and hence the costly conversion time.
The fact that the database can be larger after the upgrade because of the Unicode encoding can lead to additional data management problems. Again data archiving could help here by reducing volumes.
Aside from the Unicode issue, we are finding that SAP users generally can make their upgrade process easier by using archiving to reduce the size of their database.
SAP following the MS strategy of ever increasing hardware requirements?
Bloat the software and force a hardware upgrade. Makes $en$e to me.