
Cracking the Big Data Problem with APTARE StorageConsole File Analytics


© 2012 APTARE, Inc. ALL RIGHTS RESERVED. This document is the exclusive property of APTARE, Inc. APTARE and StorageConsole are registered trademarks of APTARE, Inc.

One doesn’t need to be an industry expert to know we are experiencing a tidal wave in the amount of data stored in our datacenters. What is surprising is that this growth is now driven largely by unstructured data – files containing email, word processing documents, presentations, digital media, and the like. IDC estimates that unstructured data now accounts for more than 90% of our digital universe with “1.8 trillion gigabytes in 500 quadrillion ‘files’ – and more than doubling every two years.” [1] The volume of data has become so immense that IT analysts and the trade press have coined a new term for it – big data. But what is big data? McKinsey Global Institute defines big data as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” [2] Remember this definition. We will return to it shortly.

With big data come big challenges. First and foremost is the sheer volume of data. Too much volume is a storage issue, but too much data is also a massive analysis issue. Gartner claims that solving the big data challenge involves more than just managing the volume of data: “while volume is a significant challenge in managing big data, business and IT leaders must focus on information volume, variety, and velocity.” [3] Gartner expands on the volume concern with two further challenges – variety, the increase in the types of data consumed, and velocity, how fast data is being produced. Measuring the volume, variety, and velocity of data is important. However, the key to cracking the big data problem is determining the value of data.

Unfortunately, not all data is of equal value. An enterprise has critical transactional and non-transactional data sources essential to running the business. Other data sources are unimportant and may actually carry negative value (e.g., copyright infringements). These data sources waste IT resources.

The real question, then, is “how can we determine the value of data?” We have a reasonable grasp on the corporate value of structured data stored in databases managed by database administrators, but we have difficulty answering this question for data stored in the millions of files occupying storage resources throughout the IT environment. Some of these files are necessary to the operation of the enterprise, while many others represent duplicate data, stale data, or projects long since completed. Backing up and replicating those files only exacerbates an already bad situation.

The difficulty is determining the value of each individual file stored in our datacenters. A reasonable estimate of a file’s value would allow us to act optimally based on its derived value to the enterprise, with actions that run the gamut from outright deletion to media archival to storage-tier optimization.

[1] John Gantz and David Reinsel, “Extracting Value from Chaos”, IDC IVIEW, June 2011.
[2] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers, “Big data: The next frontier for innovation, competition, and productivity”, McKinsey Global Institute, May 2011.
[3] Yvonne Genovese and Stephen Prentice, “Pattern-Based Strategy: Getting Value from Big Data”, Gartner Research, 17 June 2011.


The quickest way to assess the value of unstructured data is to use the file-level metadata associated with each file. Value criteria can be based on age, type, owner, size, or a combination of these attributes. What’s required is the establishment of value criteria for unstructured data and, once criteria are established, performing that value assessment across all the files, folders, and directories contained in the enterprise.
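The kind of metadata-driven value assessment described above can be sketched in a few lines. The thresholds, extensions, and resulting actions below are illustrative assumptions, not APTARE's actual criteria:

```python
import os
import time
from pathlib import Path

# Hypothetical value criteria based on the attributes named above
# (age, type, size); thresholds are illustrative assumptions.
STALE_DAYS = 365 * 3
LOW_VALUE_TYPES = {".mp3", ".mp4", ".avi", ".iso"}

def assess(path: Path) -> str:
    """Classify one file from its metadata alone (no content read)."""
    st = path.stat()
    age_days = (time.time() - st.st_atime) / 86400
    if path.suffix.lower() in LOW_VALUE_TYPES:
        return "review"      # potentially non-corporate media
    if age_days > STALE_DAYS:
        return "archive"     # stale: candidate for tape or a lower tier
    if st.st_size == 0:
        return "delete"      # empty files only waste catalog entries
    return "keep"

def scan(root: str):
    """Walk a tree and yield (path, assessment) pairs."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            p = Path(dirpath) / name
            yield str(p), assess(p)
```

In practice the same four attributes would be combined with owner information and weighted per the enterprise's own policy; the point is only that metadata alone is enough to drive the triage.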

Given the sheer volume of files contained on a single server, performing these value assessments manually is not a reasonable approach. Multiply this problem by the number of servers in our datacenter – both real and virtual – and this task becomes nearly impossible in human terms. This may be one of the primary reasons we continue to preserve the status quo – letting the volume of unstructured data occupying our storage resources continue to grow beyond our control. What can we do to get our hands around the problem of assessing the value of the files stored in our datacenters?

McKinsey’s definition of the big data problem is correct – “datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” [4] Capturing, storing, managing, and analyzing the millions of files stored in our datacenters requires a new solution. Automated tools have tried to solve this problem but have failed because their architecture and design limit their scalability and performance when faced with the magnitude of this big data problem. These solutions typically require vast amounts of time and resources to “scan all the data”, use a cumbersome relational database to hold the results, cannot provide meaningful growth statistics since they are point solutions with no aggregated view over time, and then require individual and lengthy queries to analyze the data. The architecture and design of prevailing file profiling solutions are their Achilles’ heel when faced with solving the big data problem for unstructured data.

Required is a new way of thinking about how to solve this problem – a solution that will scale and perform regardless of the volume of data to be processed and analyzed. The answer is APTARE® StorageConsole® File Analytics – a state-of-the-art data profiling and data analysis solution tailored to cracking the big data problem in managing unstructured data.

APTARE StorageConsole File Analytics is different from other data profiling tools, representing a paradigm shift in the way it acquires, processes, aggregates, analyzes and exploits file-level metadata. The solution removes the barriers to profiling and analyzing vast numbers of corporate files by applying advanced patented technologies to the handling of file-level metadata. Unique to the APTARE StorageConsole File Analytics solution are:

• An optimized agentless data collector tuned to specific server and storage environments
• The APTARE Catalog Engine (ACE), a patented, purpose-built database engine for maintaining volumes of file-level metadata, scalable in three dimensions: capacity, performance, and index structure
• Built-in file-level knowledge aggregators that can immediately glean relevant information in a single pass over profile information stored in the ACE database and persist this aggregated information in the APTARE StorageConsole portal relational database
• Standard and custom reporting analytics for point-in-time analysis and timeframe-based trending statistics of the aggregated file metrics through a Web 2.0 browser interface
• A File List Export feature allowing users to extract actionable file-level detail from the ACE database.

[4] Manyika et al., May 2011.

Figure 1. APTARE StorageConsole File Analytics Architecture

The process of collecting file-level metadata using APTARE StorageConsole File Analytics is easy. It starts with the definition of a data collector in the APTARE portal server. An administrator simply logs into the APTARE StorageConsole web portal and adds a data collector definition by giving it a name, an optional password, a domain name, and update characteristics. Once this initial task is complete, the administrator can proceed to install the data collector on a server within the datacenter. This server can be a physical server or a virtual machine with sufficient processing, memory, and storage resources. [5] Key to the deployment of this data collector is IP connectivity both to the systems to be profiled and to the APTARE portal server.

[5] For recommended data collector configurations, see “APTARE StorageConsole Data Collector Sizing Guidelines”.


Deployment of an actual data collector requires downloading it from the APTARE portal server to the data collector server (via copy or FTP) and installing it on that server. Following installation, the administrator configures the data collector settings on the APTARE portal server. This configuration process allows the administrator to tailor the file-level metadata collection through a predefined collection schedule and to identify the systems and shares to be profiled along with their protocols and credentials (see Figure 2). The File Analytics data collector can be configured for:

• CIFS – enables data collection from a CIFS share
• NetApp storage system – enables data collection from a NetApp storage system using Data ONTAP API services
• Host-attached storage – enables data collection from Windows systems via the Master File Table or from Linux/Unix systems via SSH.

File Analytics also supports an import feature for large datacenter environments with hundreds of shares, allowing users to set up entire data collector configurations through bulk import from a comma-separated-values (CSV) file. The entire data collector configuration and installation process typically takes less than one hour.
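As a rough illustration of what such a bulk import involves, the sketch below parses and validates a hypothetical CSV layout, one row per share to profile. The column names and validation rules are assumptions for illustration; the real File Analytics import format is defined in the product documentation:

```python
import csv
from io import StringIO

# Hypothetical columns for a bulk share import; the real layout
# comes from the File Analytics documentation.
FIELDS = ("host", "share", "protocol")
VALID_PROTOCOLS = {"CIFS", "ONTAP", "SSH"}

def load_collector_config(csv_text: str):
    """Validate rows, returning (accepted configs, per-row errors)."""
    configs, errors = [], []
    # start=2: the header occupies line 1 of the file.
    for lineno, row in enumerate(csv.DictReader(StringIO(csv_text)), start=2):
        missing = [f for f in FIELDS if not row.get(f)]
        if missing:
            errors.append((lineno, f"missing {missing}"))
        elif row["protocol"] not in VALID_PROTOCOLS:
            errors.append((lineno, f"unknown protocol {row['protocol']}"))
        else:
            configs.append({f: row[f] for f in FIELDS})
    return configs, errors
```

Validating up front, with line numbers in the error report, is what makes a hundreds-of-shares import practical to debug.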

Figure 2. APTARE StorageConsole File Analytics Data Collector Configuration


Once installed, the data collector connects on startup to the APTARE portal server, where it obtains its configuration definition. If automatic update is enabled, the data collector also checks whether updates are available for the collector itself. If so, the data collector literally reinstalls itself without administrator intervention. This unique feature of the APTARE data collector allows updates to be pushed out to all installed environments automatically, ensuring that every deployed data collector is up-to-date. Profiling of the specified systems by each data collector uses multi-threading techniques to parallelize file-level metadata collection, thereby attaining a high degree of efficiency.
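The multi-threaded collection idea can be illustrated with a small sketch: one worker per share, each walking its tree and emitting compact metadata records. The record layout and worker count are assumptions for illustration, not APTARE's implementation:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def profile_share(root: str):
    """Walk one share and collect (path, size, mtime) for every file."""
    records = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            records.append((full, st.st_size, int(st.st_mtime)))
    return records

def profile_all(shares, workers=8):
    """Profile many shares in parallel; I/O-bound stat calls overlap well."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_share = list(pool.map(profile_share, shares))
    # Flatten per-share results into a single metadata stream.
    return [rec for share in per_share for rec in share]
```

Because metadata collection is dominated by filesystem round trips rather than CPU, threads (not processes) are enough to keep many shares in flight at once.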

The first step in cracking the big data problem for unstructured data is to have a data collection mechanism that is fully automated, self-tuning, simple to install and configure, non-disruptive to the profiled systems, and fast enough to profile systems in hours, not days. APTARE StorageConsole File Analytics answers this challenge with a file-level metadata collection mechanism that is highly efficient, lightweight, and tuned to maximize collection performance on each profiled system.

Through APTARE’s unique hub and spoke architecture, a single APTARE portal server can easily support datacenters located throughout an enterprise. The data collection design for the entire APTARE StorageConsole suite of products is to use agentless data collectors (spokes) employing push technology to convey system and storage information over a secure IP connection to data receivers within a single APTARE portal server (hub). In the case of APTARE StorageConsole File Analytics the file-level metadata is pushed to an APTARE portal server data receiver where this metadata is prepared for insertion into the ACE database.

The purpose-built APTARE Catalog Engine (ACE) is the heart of APTARE StorageConsole File Analytics. ACE represents a whole new way of solving the storage and retrieval of billions of entries containing file-level metadata. Its patented design uses unique hashing and index-caching algorithms to reduce each entry to as few bytes as possible, while still retaining all the information in each metadata entry and full index-search capability across all file attributes. The result is a database that is highly scalable in storage economics, exceptional in performance, and still able to search efficiently across billions of entries.
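ACE's actual algorithms are patented and not described here, but the general idea of shrinking metadata entries can be illustrated: intern each directory path once, then pack every file entry into a compact binary record that references its directory by a short hash. Everything below is a simplified sketch, not the ACE design:

```python
import hashlib
import struct

class TinyCatalog:
    """Toy catalog: directory paths are stored once, file entries are
    fixed-prefix packed records referencing the directory by hash."""

    def __init__(self):
        self.dirs = {}     # 8-byte dir hash -> full directory path
        self.entries = []  # packed records: dir ref + size + mtime + name

    def add(self, directory: str, name: str, size: int, mtime: int):
        h = hashlib.blake2b(directory.encode(), digest_size=8).digest()
        self.dirs[h] = directory  # interning: shared dirs cost 8 bytes each
        # 8-byte dir ref + 8-byte size + 4-byte mtime, then the file name.
        self.entries.append(struct.pack("!8sQI", h, size, mtime) + name.encode())

    def lookup_dir(self, entry: bytes) -> str:
        h = struct.unpack_from("!8s", entry)[0]
        return self.dirs[h]
```

Even this toy version shows the payoff: a million files under a handful of directories carry the full directory strings only a handful of times, while each entry stays a few dozen bytes.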

The next and most critical step in cracking the big data problem for unstructured data is a highly proficient database engine to store vast quantities of file-level metadata. Using a general-purpose relational database is not the answer. APTARE StorageConsole File Analytics answers this challenge with a patented, purpose-built database technology – the APTARE Catalog Engine – providing the repository for storing, aggregating and searching vast quantities of file-level metadata.

One of the key features of APTARE StorageConsole File Analytics is its built-in file profiling and information classification mechanism: knowledge-based aggregators. As represented in Figure 3, these aggregators can quickly glean in-depth information on unstructured data contained in the ACE database. In a single pass across the ACE database, these knowledge-based aggregators produce relevant summary information on specific characteristics of file-level metadata, such as:

• File addition trending
• Capacity utilization by volume and share
• Host server summaries
• Ownership usage
• Size of files
• Delineation by file types
• Delineation by file categories
• Duplicate files
• Delineation by file date categories.
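As an illustration of the single-pass idea, the sketch below computes several of these summaries (ownership usage, file-type delineation, duplicate candidates) in one traversal of a metadata stream. The record layout and duplicate heuristic are assumptions for illustration:

```python
from collections import defaultdict

def aggregate(records):
    """One pass over (path, owner, ext, size) records, several summaries out."""
    by_owner = defaultdict(int)         # ownership usage, in bytes
    by_type = defaultdict(int)          # delineation by file type, in bytes
    dup_candidates = defaultdict(list)  # same (name, size) -> possible dupes
    total = 0
    for path, owner, ext, size in records:
        total += size
        by_owner[owner] += size
        by_type[ext] += size
        name = path.rsplit("/", 1)[-1]
        dup_candidates[(name, size)].append(path)
    dupes = {k: v for k, v in dup_candidates.items() if len(v) > 1}
    return {"total": total, "by_owner": dict(by_owner),
            "by_type": dict(by_type), "duplicates": dupes}
```

The design point is that every summary is maintained incrementally during the same traversal, so adding another aggregator costs nothing extra in I/O over the catalog.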

Figure 3. APTARE StorageConsole File Analytics Knowledge-based Aggregators

This aggregated information is persisted in the APTARE StorageConsole relational database. Two key objectives are achieved by persisting aggregated information in a relational database. First, it provides summary information across the entire profiled environment. Second, it provides a historical archive for each profiled volume and share, permitting accurate trending forecasts as those shares and volumes grow.

The next step in cracking the big data problem for unstructured data is the ability to aggregate vast quantities of file-level metadata efficiently. Running long-winded SQL queries against a general-purpose relational database is not the answer. APTARE StorageConsole File Analytics answers this challenge through a set of knowledge-based aggregators that, in a single traversal of the APTARE Catalog Engine database, can provide aggregated information for the entire profiled environment in a matter of seconds and persist this summary information in the APTARE StorageConsole relational database.

The power of APTARE StorageConsole File Analytics is realized by providing users with meaningful information on their unstructured data environments. The file-level metadata collection, cataloging, and aggregation are only the precursors to the main event – making the collected information intelligent. Through a web browser, storage administrators can glean intelligence from this aggregated data to understand how their unstructured storage shares and volumes are being utilized. Users simply view built-in summary reports on their unstructured data through an intuitive Web 2.0 user interface.

The set of summary reports delivered with APTARE StorageConsole File Analytics covers usage metrics, file attribute metrics, and date metrics, as shown in Figure 4. Through these reports users can quickly get a comprehensive understanding of what is stored in their file systems based on aggregated attributes, how their unstructured data is growing, and what is contributing to this growth. One of the key features of the APTARE StorageConsole web portal is its ability to create custom dashboards from a set of reports tailored to a specific user. Dashboard creation is accomplished simply by dragging reports into a dashboard template and then saving the template. These reports can be further tailored by filtering the information displayed to the user or by setting alert notifications when certain user-settable thresholds are exceeded. Another key feature of the APTARE StorageConsole web portal is the ability to email reports on a schedule, giving users systematic feedback on File Analytics metrics. Additionally, users can export reports for further analysis using spreadsheet applications or other tools.

Figure 4. APTARE StorageConsole File Analytics Standard Reports


Figure 5. APTARE StorageConsole File Analytics File Categories Report

One of the unique capabilities of the entire APTARE StorageConsole suite of products is the ability to produce custom reporting through APTARE’s Report Template Designer. The Report Template Designer allows users to identify the scope of the report, create their own queries against the APTARE StorageConsole portal database, determine the results set for the report, and, finally, specify how this information is displayed within the produced report. The flexibility of this Report Template Designer gives users of File Analytics control over how the aggregated information persisted in the APTARE portal database is visualized to provide the greatest value tailored to an enterprise’s specific requirements.


Figure 6. APTARE StorageConsole File Analytics Access Dates Report

All these capabilities give users of APTARE StorageConsole File Analytics the information necessary to make intelligent decisions regarding their unstructured data environments. Illustrated in Figures 5 and 6 are screenshots of actual results produced from profiling and classifying unstructured data using APTARE StorageConsole File Analytics. Users can immediately gain an understanding of their unstructured data and realize the potential of APTARE StorageConsole File Analytics.

The next step in cracking the big data problem for unstructured data is the ability to make the profiled information of the environment humanly consumable. Again, running long-winded SQL queries against the entire set of file-level metadata to construct summary information is not the answer. APTARE StorageConsole File Analytics answers this challenge through built-in and custom reports conveyed through a Web 2.0 user interface that is intuitive to use and customizable to address each user’s specific needs.

Now that we have the ability to distill relevant information on our unstructured data environments, we need to be able to take action on this information. Producing a set of summary reports on shares and volumes is useless if users cannot make this information actionable at the individual file level. A major differentiator between APTARE StorageConsole File Analytics and other file profiling tools is its File List Export feature. With this feature, File Analytics users can extract file-level detail specific to their requirements across the entire set of profiled file-level metadata.

Figure 7. APTARE StorageConsole File Analytics Add Export Feature

To use the File List Export feature, users simply click the “File List Export” tab in the Tools pull-down menu and are presented with a dialogue panel listing the currently defined file exports. Next, one clicks the “New” button to create a new export request, which presents the “Add Export Request” dialogue panel shown in Figure 7. This panel gives the user a rich choice of qualifying attributes, date ranges, file size specifications, and owner identification to build a File List Export request specific to the user’s needs. Clicking “OK” queues the File List Export to execute as a background task, allowing users to continue exploring File Analytics while the request runs. Once the request completes, its status changes from “Queued” to “Available”. Users can then download the results as a comma-separated-values (CSV) file, which can be made actionable as input to a data mover, shell script, report, or other tool. File List Export is simple to use and gives users the ability to comb and mine their profiled data to put intelligence around the file-level metadata maintained in the ACE database.
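Once the CSV is downloaded, turning it into action is ordinary scripting. The sketch below assumes hypothetical `path` and `size` column names; the actual columns depend on the export request defined in the portal:

```python
import csv

def plan_actions(export_path: str, min_size: int = 0):
    """Read a File List Export result (hypothetical columns: path, size)
    and emit (action, path) pairs for a downstream data mover or script."""
    actions = []
    with open(export_path, newline="") as fh:
        for row in csv.DictReader(fh):
            if int(row["size"]) < min_size:
                continue  # skip files too small to be worth moving
            actions.append(("archive", row["path"]))
    return actions
```

The same pattern works for deletion lists or tier moves: the export supplies the file-level detail, and a short script supplies the policy.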

The final step in cracking the big data problem for unstructured data is the ability to extract relevant file-level detail from the unstructured data environment and take action on it. A key differentiator of the APTARE StorageConsole File Analytics solution over other file profiling solutions is its ability to mine the APTARE Catalog Engine (ACE) database based on user-specified criteria through the File List Export feature and generate an actionable result set.

Let’s look at a few ways APTARE StorageConsole File Analytics has cracked real world big data problems in helping enterprises manage their unstructured data.


Problem: The legal department of a publicly traded U.S. company discovered their IT organization had hundreds of video titles stored on corporate NAS volumes. Their investigation determined these video titles were “non-corporate information” [6] and a potential liability due to copyright infringement. The company’s problem was to find a way to quickly sift through millions of files stored throughout the organization, uncover non-corporate files, and remove them, all without disturbing legitimate video files such as corporate training media.

Solution: This company employed APTARE StorageConsole File Analytics to resolve the problem immediately. In a matter of days, File Analytics had profiled all the NAS volumes in their datacenter. Once this was accomplished, it was easy to use File Analytics’ File List Export feature to produce a list of all video files stored on these NAS volumes. Corporate-owned video files were identified and removed from this list through manual inspection (a process easier than originally thought, as the corporate-owned video files were immediately identifiable by their directory structure and name). The result was a complete list of the offending files, which was used as input to a script that removed them from the NAS volumes. The entire process, from installation of APTARE StorageConsole File Analytics to deletion of the offending files, took less than five days with minimal impact on both human and physical resources. A side benefit was the recovery of nearly 2 TB of storage.
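A cull like this one reduces to filtering the exported file list. The extensions and excluded corporate directory below are hypothetical stand-ins for the criteria the company actually used:

```python
# Illustrative assumptions: which extensions count as video, and which
# directory prefixes hold legitimate corporate media.
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".wmv", ".mkv"}
CORPORATE_PREFIXES = ("/vol/shared/training/",)

def offending_videos(paths):
    """From an exported path list, keep video files outside corporate areas."""
    out = []
    for p in paths:
        ext = "." + p.rsplit(".", 1)[-1].lower() if "." in p else ""
        if ext in VIDEO_EXTS and not p.startswith(CORPORATE_PREFIXES):
            out.append(p)
    return out
```

The resulting list is exactly the input the deletion script in the story would consume.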

This company now uses APTARE StorageConsole File Analytics once a month to ensure that only appropriate company information is stored on their NAS volumes.

Problem: A new CIO at a Fortune 1000 company realized his costs for acquiring and managing the unstructured data on NetApp storage systems exceeded his IT budget. He found provisioning additional capacity and moving data required constant IT attention and often resulted in application unavailability. Data migrations due to consolidations and upgrades were painful and time-consuming, often requiring business downtime. Further, backup times were growing increasingly lengthy and were exceeding their backup windows. His problem was getting a handle on this critical portion of the IT environment he inherited. As he stated, “you can’t measure what you can’t see.”

Solution: This CIO selected APTARE to perform a proof of concept on a single representative NetApp storage system using APTARE StorageConsole File Analytics. He was so impressed by the depth and breadth of information uncovered by File Analytics during this proof of concept that he acquired the product for his entire IT organization spread across five datacenters. What he then discovered was even more striking. Nearly thirty percent of the files had not been accessed in over three years. Over fifty percent of the files had not been accessed in over one year. Unstructured data was growing exponentially in the organization, but no one was managing this growth.

The CIO and his staff immediately put an action plan in place. They identified three storage tiers with different data protection schemes. Using APTARE StorageConsole File Analytics’ File List Export feature, they easily identified (1) files not accessed in over three years, which were archived to tape; (2) files not accessed in over one year, which were moved onto lower-cost SATA drives with a monthly backup plan; and (3) all active files, which were moved to Fibre Channel drives with both snapshot and nightly backup protection.

[6] Quote from the company’s Request for Information.

The improvement in the operation of this organization’s NetApp storage after implementing this plan was immediate. They freed some 40 terabytes of NetApp storage, eliminated the constant drain on IT resources to provision new storage, and reduced application downtime due to storage migrations. Backups no longer exceed their backup windows.
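The three-tier, age-based triage in this case can be sketched as a simple classification over last-access timestamps. The thresholds mirror the ones in the story, while the tier names are illustrative:

```python
import time

DAY = 86400

def tier_for(last_access, now=None):
    """Map a file's last-access time to a storage tier.
    Tiers: tape (archive), sata (monthly backup), fc (snapshots + nightly)."""
    now = time.time() if now is None else now
    idle_days = (now - last_access) / DAY
    if idle_days > 3 * 365:
        return "tape"
    if idle_days > 365:
        return "sata"
    return "fc"
```

Applied to a File List Export stream, a classifier like this produces the three move lists the staff fed to their migration tooling.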

Currently, this organization has a single centralized APTARE StorageConsole portal with a data collector installed in each of their datacenters. They are using APTARE StorageConsole File Analytics to understand utilization and trending information for planning purposes by re-profiling all their NetApp storage systems each weekend. This organization continues to be impressed with the low overhead, outstanding performance, and meaningful metrics obtained by using the File Analytics product.

Problem: A large international financial institution required complete management of their master data spread across the organization for regulatory control purposes. This financial institution had a complete understanding of the master data stored within structured databases. However, they had limited understanding of where master data stored as files was spread throughout the organization. Manual processes put in place to control unstructured master data met with little success and required an inordinate amount of human resources to keep current and correct. Required was a software solution that could catalog the institution’s master data on a monthly basis and ensure accuracy for auditing and regulatory compliance.

Solution: This financial institution was sold on APTARE StorageConsole File Analytics the moment they recognized that File Analytics could delineate file categorizations by file type. The institution had imposed a clever mechanism to distinguish their clean master data files by appending a unique file type suffix (e.g., “contract.docx.xyz”). This suffix was applied to all their regulatory filings, legal contracts, marketing communications, human resource information, and the like, identifying those files as master data critical to the operation of the company.

By using this unique file type suffix as the qualifier on a File Analytics File List Export request, they could immediately determine where their entire universe of master data files was stored. Further, they could control any proliferation of this data to ensure there was only a single clean master copy of each file. This was accomplished by profiling their unstructured data environment weekly and then extracting master data file information by refreshing their File List Export. What used to be a resource-draining manual process with uncertain outcomes now takes less than one hour, with total confidence in the produced results.
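The suffix-based qualification can be sketched as a filter over exported paths. The `.xyz` marker here stands in for whatever suffix the institution actually chose:

```python
# Hypothetical marker suffix denoting a clean master-data file.
MASTER_SUFFIX = ".xyz"

def master_data_files(paths):
    """Split exported paths into master files and extra copies: the first
    occurrence of each base name is the master, later ones are duplicates."""
    masters, seen, duplicates = [], {}, []
    for p in paths:
        if not p.endswith(MASTER_SUFFIX):
            continue
        base = p.rsplit("/", 1)[-1]
        if base in seen:
            duplicates.append(p)  # proliferation: more than one copy
        else:
            seen[base] = p
            masters.append(p)
    return masters, duplicates
```

The duplicates list is what makes the weekly refresh useful: any second copy of a master file surfaces automatically for cleanup.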

Problem: A circuit design company has a problem when one of their key engineers leaves the company, typically to one of their competitors. They hold a “fire drill” to find all the design files owned by this individual to assess their intellectual property exposure. Given the mobility of engineering resources in this company’s competitive environment these “fire drills” were


happening with alarming frequency and consuming an inordinate amount of resources. What was needed was a way to cull, based on file ownership, through the millions of design, implementation, and test-result files that constituted the company's intellectual property.

Solution: The company needed to ensure, first, that its intellectual property was protected and, second, that the departing engineer's designs were reassigned to another engineer in an orderly way. They tested the APTARE StorageConsole File Analytics solution on one of their NetApp storage systems. Once the system was profiled, they could immediately identify file ownership through the "Usage by Owner" standard report, which gave them the information they required at a high level. They also wanted this information broken out by owner and by file type, which was easily accommodated by building a custom report with the APTARE StorageConsole Report Template Designer feature. Now, when an engineer leaves the company, they use the File List Export feature to assess intellectual property exposure and to reassign ownership to another engineer.
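A minimal sketch of the kind of aggregation such a by-owner, by-file-type report performs (assumed illustration only; the record layout `(path, owner, size)` and the sample data are hypothetical, not the ACE schema):

```python
import os
from collections import defaultdict

def usage_by_owner_and_type(files):
    """Aggregate total bytes per (owner, file extension) pair,
    as a 'usage by owner, by file type' breakout might."""
    totals = defaultdict(int)
    for path, owner, size in files:
        ext = os.path.splitext(path)[1] or "<none>"
        totals[(owner, ext)] += size
    return dict(totals)

# Hypothetical file-level metadata records: (path, owner, size in bytes)
files = [
    ("/eng/cpu/core.v", "alice", 4096),
    ("/eng/cpu/alu.v", "alice", 2048),
    ("/eng/cpu/notes.txt", "alice", 512),
    ("/eng/gpu/shader.v", "bob", 8192),
]

report = usage_by_owner_and_type(files)
print(report[("alice", ".v")])  # → 6144 bytes of .v files owned by alice
```

Keyed this way, a single scan of the metadata answers both questions at once: which files a departing engineer owns, and how their holdings break down by design file type.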

This customer has since deployed APTARE StorageConsole File Analytics across all of their engineering NetApp storage systems. They use the APTARE solution not only to manage the unstructured data representing their intellectual property, but also to evaluate data protection for that property and growth requirements for their NAS environment.

These are just some of the business benefits our customers have realized by using APTARE StorageConsole File Analytics to solve the big data problem for their unstructured data. In every case, the solution immediately paid for itself by addressing a specific unstructured data requirement, and customers realized additional benefits as they explored their unstructured data environments with APTARE StorageConsole File Analytics.

Cracking the big data problem for unstructured data requires a new way to capture, store, manage, and analyze data volumes beyond what typical database software tools can handle. An off-the-shelf relational database solution will not work when faced with the enormous volume of metadata found in today's enterprise environments. Solving this problem demands new approaches that not only scale to enormous workload volumes but also provide nearly uniform performance across those workloads. In other words, the solution must keep pace with exponential data growth while maintaining acceptable performance.

This is the design premise for APTARE StorageConsole File Analytics: deliver a scalable solution with outstanding performance that allows users to effectively manage their unstructured data environments. From the file-level metadata collectors and the APTARE Catalog Engine database, through the knowledge-based aggregators and the intuitive Web 2.0 user experience, to making this metadata actionable, APTARE StorageConsole File Analytics represents a paradigm shift in cracking the big data problem for unstructured data.

The table below summarizes the challenges the big data problem presents for an enterprise's unstructured data, and how APTARE StorageConsole File Analytics meets each challenge through new and novel methods. The result is a solution that cracks our customers' real-world unstructured data problems and delivers real, ongoing benefit and value.


Big Data Challenge: Data Collection Mechanism
APTARE StorageConsole File Analytics Solution:
- Easy to configure and install
- Automated, agentless
- Tuned to profiled environments
- Non-impacting on profiled systems
- Fast and efficient

Big Data Challenge: File-level Metadata Assimilation
APTARE StorageConsole File Analytics Solution:
- Patented APTARE Catalog Engine (ACE) design
- Purpose-built for storing file-level metadata
- Scalable: handles billions of files
- Efficient: hashes entries to minimize storage
- High-performance, cached indices

Big Data Challenge: Data Aggregation
APTARE StorageConsole File Analytics Solution:
- Knowledge-based aggregation
- Single ACE database scan
- Fast and efficient
- Results maintained in the APTARE portal database
- Aggregated information persisted over time, permitting historical analysis

Big Data Challenge: Reporting
APTARE StorageConsole File Analytics Solution:
- Intuitive Web 2.0 design, accessible via a standard browser
- Comprehensive views of file analytics metrics
- Historical trending analysis over user-specified timeframes
- Custom dashboards
- Information filtering and alerting capability
- Information export capability
- Custom reporting via APTARE's Report Template Designer feature

Big Data Challenge: Data Mining
APTARE StorageConsole File Analytics Solution:
- File List Export feature
- Extracts file-level data based on attributes, file size, file dates, and owner
- Runs as a background task to maximize efficiency
- Results downloadable as a CSV file
- Repeatable