Data de-duplication feature: Breaking the mould

The ever-increasing demand for data storage from regulators and business users, who want to know more about the customer for cross-selling purposes, has seen a rapid expansion in the amount of data being kept by financial institutions, but storing all of this data is expensive. Justin Quillinan asks whether data de-duplication technologies can still help, and whether they can be part of re-moulding how firms handle data.

A piece of mouldy cheese kept in a fridge takes up space and costs money to keep cool for no useful purpose - even a mouse would turn its nose up at it. The financial sector, which generates vast quantities of transactional data, email and other documents every year, is as guilty as anyone when it comes to storing things that in all probability will never be used again, only in its case it is data that's kept unnecessarily.

Storage is expensive whether data is kept on tape, on disk or virtually on remote servers. Beyond the cost, data can easily become jumbled up, and untangling it is hard work. Sorting the wheat from the chaff matters to the bottom line of a financial institution, and it can be done retrospectively with data de-duplication technologies which, as the name suggests, delete multiple copies and try to cut down on unnecessary storage on whatever medium. The approach is also sometimes referred to as 'intelligent compression' or 'single instance storage'. It can also be operationally beneficial, speeding up cross-selling by delivering single copies of data more quickly, which matters when customers connected to call centres want to get off the phone. Retrieving information consistently, with no confusing duplicates, also impresses regulators, and usually indicates that you are protecting your customers' records adequately by not leaving personal details around that can go missing or fall into the hands of fraudsters.
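To make 'single instance storage' concrete, here is a minimal Python sketch of the idea: each document is identified by a hash of its contents, so identical copies filed under different names share one physical copy. The class name, file names and sample data are hypothetical, chosen purely for illustration, and this is not a description of how any vendor's product works.

```python
import hashlib


class SingleInstanceStore:
    """Toy content-addressed store: identical payloads are kept only once."""

    def __init__(self):
        self._blobs = {}  # content digest -> bytes (one physical copy each)
        self._names = {}  # logical file name -> content digest

    def put(self, name: str, payload: bytes) -> bool:
        """Store payload under name; return True if a new physical copy was written."""
        digest = hashlib.sha256(payload).hexdigest()
        is_new = digest not in self._blobs
        if is_new:
            self._blobs[digest] = payload
        self._names[name] = digest
        return is_new

    def get(self, name: str) -> bytes:
        """Return the contents filed under a logical name."""
        return self._blobs[self._names[name]]

    def physical_bytes(self) -> int:
        """Total bytes actually held, after de-duplication."""
        return sum(len(blob) for blob in self._blobs.values())


store = SingleInstanceStore()
statement = b"Quarterly statement for account 0001"
store.put("mail/alice/statement.txt", statement)
store.put("mail/bob/statement.txt", statement)      # same bytes, no second copy
store.put("archive/2009/statement.txt", statement)  # same again
print(store.physical_bytes(), "bytes held for 3 logical files")
```

Three logical files are retrievable, but only one copy occupies storage. A sketch like this only catches byte-for-byte identical copies; near-duplicates need the approximate matching discussed below.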

The unnecessary duplication of data, sometimes via human error or because the backup procedures in a business continuity process have kept too many copies, is a common complaint. The answer could be data de-duplication because it weeds out errors and possibly entries with malevolent intent - for instance, a disgruntled employee might try to copy a proprietary piece of code, but if only master copies are available then this should be easy to spot in a well-ordered data function.

Fuzzy logic
One technology that falls under the 'de-duping' umbrella is known as 'fuzzy logic'. This is software that can spot accidental or deliberate errors by comparing records approximately rather than demanding an exact match. Mistakes are easy to make. The name 'Jack' is currently the most popular boys' name, for example, but it can also mean John, as it did for JFK, and 'Leslie/Lesley' can mean either a man or woman if the input to a database is slightly incorrect. One of the most popular names being registered in the UK is Mohammed, but this didn't even appear in the top ten published by the Office for National Statistics because it can be spelled in so many different ways, potentially causing confusion at a bank or insurer and thereby opening up the scope for multiple copies or fraud attempts by criminals.
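As a rough illustration of how spelling variants can be caught, the Python sketch below scores pairs of names with the standard library's difflib.SequenceMatcher and flags those above a similarity threshold. The function names and the 0.7 threshold are assumptions made for this example; commercial matching tools typically combine several techniques (phonetic codes, nickname tables, address checks) that are not shown here.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring upper/lower case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_duplicates(names, threshold=0.7):
    """Return (name, name, score) for every pair scoring at or above threshold."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = similarity(first, second)
            if score >= threshold:
                pairs.append((first, second, round(score, 2)))
    return pairs


# Hypothetical customer first names, using spellings mentioned in the article.
customers = ["Mohammed", "Muhammad", "Mohamed", "Lesley", "Leslie", "Jack"]
for first, second, score in likely_duplicates(customers):
    print(f"{first} / {second}: {score}")
```

The threshold is the business decision: set it too low and distinct customers are flagged as one, too high and genuine variants slip through. A purely character-based score also cannot know that 'Jack' may stand for 'John', which is why nickname lists are usually maintained separately.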

A recent report from the Financial Services Authority (FSA) entitled Financial services firms' approach to UK sanctions recommends that the sector keep its data more up-to-date, clean and accurate, and adopt fuzzy-matching systems that enable duplicates to be identified and automatically merged according to pre-defined business rules or policies. Despite their often traditional approach and reliance on legacy systems, the retail banks have been forced to embrace new technology to remain competitive.

The drivers for banks and others to use 'data de-dupe' technology are therefore present, and some of those contacted by FST expect great strides in its development over the coming years, especially in its integration into wider data management strategies that might use more modern technology, such as virtualisation. Mastering your data is only going to become more central to the operation of a successful financial institution in the future. Clydesdale & Yorkshire Banks, for example, both part of the National Australia Group (NAG), tell me that they are currently working on a project to cleanse and improve their data, but declined to give details at this stage.

Barclays says that it too takes a proactive approach to mastering its data, especially for compliance with the Basel II capital adequacy requirements. Amongst other objectives, the bank particularly wanted to gain a better understanding of the inter-relationships between its 50,000 'buy-to-let' mortgage customers, their properties and their associations, to ensure more appropriate risk management. That was sadly lacking at many banks when these mortgages were allowed to build up without stringent checks on borrowers' ability to repay. Without the ability to analyse such data, Barclays feels that it could not adequately manage risk within this segment now. DataFlux, a subsidiary of the software vendor SAS, worked with Barclays on its solution. The company has also carried out research within the financial sector at large and found:

• 86 per cent of institutions view data as an extremely important strategic asset - the main driver of investment being compliance with industry regulation.
• The majority of organisations are equipped to process data against criminal lists; however, a 'significant minority' report no such process in place.
• Two thirds of organisations are currently implementing or planning to implement a data governance programme.

DataFlux's managing director Colin Rickard puts a personal slant on the de-dupe issue. "I often appear on databases as 'Colin Richard' as though we were two different people," he jokes. The recent rash of mergers and acquisitions in the finance sector makes things even more complex, with much data convergence going on this year - for example at Lloyds as it absorbs HBOS. "If company A has seven different systems for seven different product types, and company B has something similar then you have a lot of knitting to unravel to create a single de-duplicated view because you can't assume you're putting apples together with apples," says Rickard.

On a more serious note he points out that duplication can help terrorists and other criminals to hide behind inaccurate data that can be spotted by fuzzy-logic techniques. Fraudulent mortgages and insurance claims are more commonplace crimes; this illegal money can go to numerous 'dodgy' causes. "You've got to know who you're dealing with because there are organisations out there that very cleverly vary parts of their data to avoid being identified as the same individual or company."

While many banks are unwilling to discuss the way they handle data, others are more forthcoming about their use of de-duplication technology, including Danske Bank in Denmark, the State Bank of India and PGGM, the pension fund administrator in the Netherlands. All use Data Protector, a de-dupe and back-up system from HP. Sudhir Rao, assistant general manager at the State Bank of India, comments: "If I had to summarise the system's benefits in just one word, it would be 'ease.' Ease of management because our distributed setups have now been tied together, ease of response due to bigger backup windows and ease of use with noticeable leaps in productivity." PGGM reports that data de-dupe, which is part of the package, enables the organisation to make better use of storage media such as disk and tape.

Erik Moller, HP's director of marketing for information management, points out that finance sector organisations typically have a lot of remote offices and branch offices - known as RoBo in the trade. The people in these RoBos may well know how to broker an insurance deal or open an account, but he questions whether they are necessarily IT-savvy. "Data de-dupe comes in because the amount of information now being kept is very large. There's often too much information to transfer from local copies and back-ups to a central site such as a regional head office. Data de-duplication can help because it dramatically reduces the amount of data that needs to be transferred from the RoBo."

Technology
There are three main ways of moving information around - by disk, tape, or server virtualisation, with the latter gaining ground in recent years, allied to 'mirroring' techniques for business continuity purposes. The aim is always to minimise the movement of data and, according to Moller, up to 70 per cent of the material stored on databases, and around 50 per cent of text documents, can be stripped out. Images are a different matter, with little duplication because of the richness of their content.

Choices surround the quantity, quality and expense of storing information - and for how long. Some documents need to be stored for decades and many are on tape as it's the older medium. As has been shown recently with various breaches of security and indeed theft in the UK, tape should be kept under lock and key in air-conditioned premises. Even then tape doesn't last forever. De-duping information helps by reducing what needs to be stored in the first place.

Options
Moller explains that there are two different types of de-duplication - dynamic and accelerated. "Dynamic de-dupe starts as soon as data goes into the device and looks for common data blocks and compares them to what is already there, deciding whether you need to keep it or not. Accelerated de-dupe models move information from back-up and then de-duplicates it, which can be done more intelligently based upon previous records."
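The 'dynamic' approach Moller describes can be sketched in a few lines of Python: data is cut into blocks as it arrives, each block is hashed, and a block is written only if that hash has not been seen before. The fixed 8-byte block size, the function names and the sample back-ups are assumptions made to keep the example readable; real products use far larger blocks and may split data differently.

```python
import hashlib

BLOCK_SIZE = 8  # bytes; unrealistically small so the example is easy to follow


def ingest(stream: bytes, block_store: dict) -> list:
    """De-duplicate a stream block by block as it arrives.

    Returns the 'recipe': the ordered block digests needed to rebuild the stream.
    """
    recipe = []
    for offset in range(0, len(stream), BLOCK_SIZE):
        block = stream[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in block_store:  # keep only blocks not already held
            block_store[digest] = block
        recipe.append(digest)
    return recipe


def restore(recipe: list, block_store: dict) -> bytes:
    """Rebuild the original stream from its recipe."""
    return b"".join(block_store[digest] for digest in recipe)


block_store = {}
monday = b"AAAAAAAABBBBBBBBCCCCCCCC"   # hypothetical Monday back-up
tuesday = b"AAAAAAAABBBBBBBBDDDDDDDD"  # Tuesday: only the last block changed
monday_recipe = ingest(monday, block_store)
tuesday_recipe = ingest(tuesday, block_store)
print(len(block_store), "unique blocks stored for 6 blocks ingested")
```

Here the second back-up adds only one new block to the store, yet both can be restored in full. The 'accelerated' variant would run the same comparison as a separate pass after the back-up has landed, rather than while it is being written.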

Mark Galpin, international product marketing manager at the vendor Quantum, points out that trading on the Stock Exchange is governed by a financial institution's ability to invoke disaster recovery and business continuity, and strict standards have been drawn up. "There's nothing wrong with tape and we sell a lot of it because it's by far the cheapest and most convenient form of media [for record keeping]," he says. The problems associated with tape are the transport, handling and storage of it, and there are now more modern and flexible alternatives available, such as virtualisation, which can be deployed depending upon the needs of the business. There's still a place for tape, and indeed disk, though, especially, Galpin believes, as the efficiency of de-duping can be impaired by how quickly server software understands the data that is being ingested. He doesn't think real-time is deliverable just yet, although great strides are being made. Nonetheless, virtualisation and data mirroring are increasingly the option for huge multi-national banks, but tape and disk are still important to them, as is having a clean, lean database. Data de-duplication software can deliver this. Each individual firm needs to decide at what stage it deploys de-duplication during its data management and storage procedures. For smaller banks and insurers, tape and disk still tend to predominate, and keeping copies down to a minimum is just as important to them as it is to any large organisation.

The final word goes to Frank Bunn, a director at the Storage Networking Industry Association (SNIA Europe) trade body, who says: "Without de-duping the amount of hardware and wasted management time will grow exponentially, which is disastrous when people are looking to drive down costs at the moment. When it comes to long-term archiving, after 10-15 years when the hardware is no longer supported you'll have to migrate it to newer systems anyway, and the more you have to migrate the greater the pain."
