
Disadvantages of Hadoop

http://www.quora.com/What-are-the-limitations-of-Hadoop

1. Security Concerns
Just managing a complex application such as Hadoop can be challenging. A classic example can be seen in the Hadoop security model, which is disabled by default due to sheer complexity. If whoever's managing the platform lacks the know-how to enable it, your data could be at huge risk. Hadoop is also missing encryption at the storage and network levels, a feature that is a major selling point for government agencies and others that prefer to keep their data under wraps.

2. Vulnerable by Nature
Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The framework is written almost entirely in Java, one of the most widely used yet controversial programming languages in existence. Java has been heavily exploited by cybercriminals and, as a result, implicated in numerous security breaches. For this reason, several experts have suggested dumping it in favor of safer, more efficient alternatives.

3. Not Fit for Small Data
While big data isn't exclusively made for big businesses, not all big data platforms are suited to small-data needs. Unfortunately, Hadoop happens to be one of them. Due to its high-capacity design, the Hadoop Distributed File System (HDFS) lacks the ability to efficiently support the random reading of small files. As a result, it is not recommended for organizations with small quantities of data.

4. Potential Stability Issues
Hadoop is an open-source platform. That essentially means it is created by the contributions of the many developers who continue to work on the project. While improvements are constantly being made, like all open-source software, Hadoop has had its fair share of stability issues. To avoid these issues, organizations are strongly recommended to make sure they are running the latest stable version, or to run it under a third-party vendor equipped to handle such problems.

5. General Limitations
When it comes to making the most of big data, Hadoop may not be the only answer. Apache Flume, MillWheel, and Google's own Cloud Dataflow have been suggested as possible alternatives. What each of these platforms has in common is the ability to improve the efficiency and reliability of data collection, aggregation, and integration.

http://www.sas.com/en_us/insights/big-data/hadoop.html

What are the challenges of using Hadoop?

1. MapReduce programming is not a good match for all problems. It's good for simple information requests and problems that can be divided into independent units, but it's not efficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because the nodes don't intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates multiple files between MapReduce phases and is inefficient for advanced analytic computing (a minimal illustration of a single MapReduce pass follows this list).

2. There's a widely acknowledged talent gap. It can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce. That's one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop. It is much easier to find programmers with SQL skills than MapReduce skills. And Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware and Hadoop kernel settings.

3. Data security. Another challenge centers on fragmented data security, though new tools and technologies are surfacing. The Kerberos authentication protocol is a great step forward for making Hadoop environments secure.

4. Full-fledged data management and governance. Hadoop does not have easy-to-use, full-featured tools for data management, data cleansing, governance and metadata. Especially lacking are tools for data quality and standardization.
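To illustrate what one map-shuffle/sort-reduce pass involves, here is a version of the classic word-count job written in Java against Hadoop's MapReduce API; input and output paths are assumed to be passed on the command line. An iterative algorithm such as k-means or PageRank would have to chain many jobs like this, writing intermediate files to HDFS between passes, which is exactly the inefficiency described above. This is a minimal sketch, not a production job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after the shuffle/sort groups identical words, sum their counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // One complete map-shuffle/sort-reduce pass. An iterative algorithm would submit
        // a chain of jobs like this, persisting intermediate results to HDFS between passes.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}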

http://www.tutorialspoint.com/avro/avro_tutorial.pdf

Disadvantages of Hadoop Serialization
To serialize Hadoop data, there are two ways: you can use the Writable classes provided by Hadoop's native library, or you can use Sequence Files, which store the data in binary format.

The main drawback of these two mechanisms is that Writables and Sequence Files have only a Java API; they cannot be written or read in any other language. Therefore, files created in Hadoop with either of these mechanisms cannot be read by programs written in other languages, which turns Hadoop into a closed box. To address this drawback, Doug Cutting created Avro, a language-independent data serialization format.
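To make the drawback concrete, here is a minimal, hypothetical Writable record in Java (the class name and its fields are invented for illustration). Its wire format is defined only by the order of the write/readFields calls, so nothing except a JVM program carrying this exact class can decode the bytes.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical record serialized with Hadoop's Writable mechanism.
// The byte layout exists only in this Java code, so non-JVM tools cannot read it.
public class PageViewWritable implements Writable {
    private final Text url = new Text();
    private long viewCount;

    @Override
    public void write(DataOutput out) throws IOException {
        url.write(out);          // field order defines the format
        out.writeLong(viewCount);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        url.readFields(in);      // must mirror write() exactly
        viewCount = in.readLong();
    }
}

By contrast, an Avro schema is an ordinary JSON document (for example, a record with a string field for the URL and a long field for the view count), so any language with an Avro library can read and write the same data.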

http://www.dezyre.com/article/big-data-is-an-illusion-there-is-no-big-data-/133

As the backbone of so many implementations, Hadoop is almost synonymous with big data. Offering distributed storage, superior scalability, and ideal performance, many view it as the standard platform for high-volume data infrastructures. But as an article on Google's big data expertise suggests, Hadoop isn't necessarily the end-all, be-all of big data.

1. Security Concerns
Just managing a complex application such as Hadoop can be challenging. A classic example can be seen in the Hadoop security model, which is disabled by default due to sheer complexity. If whoever's managing the platform lacks the know-how to enable it, your data could be at huge risk. Hadoop is also missing encryption at the storage and network levels, a feature that is a major selling point for government agencies and others that prefer to keep their data under wraps.

2. Vulnerable by Nature
Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The framework is written almost entirely in Java, one of the most widely used yet controversial programming languages in existence. Java has been heavily exploited by cybercriminals and, as a result, implicated in numerous security breaches. For this reason, several experts have suggested dumping it in favor of safer, more efficient alternatives.

3. Not Fit for Small Data
While big data isn't exclusively made for big businesses, not all big data platforms are suited to small-data needs. Unfortunately, Hadoop happens to be one of them. Due to its high-capacity design, the Hadoop Distributed File System (HDFS) lacks the ability to efficiently support the random reading of small files. As a result, it is not recommended for organizations with small quantities of data.

4. Potential Stability Issues
Hadoop is an open-source platform. That essentially means it is created by the contributions of the many developers who continue to work on the project. While improvements are constantly being made, like all open-source software, Hadoop has had its fair share of stability issues. To avoid these issues, organizations are strongly recommended to make sure they are running the latest stable version, or to run it under a third-party vendor equipped to handle such problems.

5. General Limitations
One of the most interesting highlights of the Google article referenced earlier is that when it comes to making the most of big data, Hadoop may not be the only answer. The article introduces Apache Flume, MillWheel, and Google's own Cloud Dataflow as possible solutions. What each of these platforms has in common is the ability to improve the efficiency and reliability of data collection, aggregation, and integration. The main point the article stresses is that companies could be missing out on big benefits by using Hadoop alone.

Now that the flaws of Hadoop have been exposed, will you continue to use it for your big data initiatives, or swap it for something else?

"5 Big Disadvantages of Hadoop for Big Data", by Big Data Companies: http://www.bigdatacompanies.com/5-big-disadvantages-of-hadoop-for-big-data/

10 Reasons Why Hadoop Is Not The Best Big Data Platform All The Time!



Tuesday, November 11, 2014: Hadoop has become the backbone of several applications, and big data can hardly be imagined without it. Hadoop offers distributed storage, scalability and high performance, and it is considered the standard platform for high-volume data infrastructures. But there are several reasons why Hadoop is not always the best solution for all purposes. Let's discuss ten disadvantages of Hadoop here:

1. Pig vs. Hive:

Hive UDFs cannot be used in Pig, and HCatalog is required to access Hive tables from Pig. Likewise, Pig UDFs cannot be used in Hive. And when extra functionality is required in Hive, a Pig script is not always the preferred way to add it.

2. Security concerns:

Managing a complex application such as Hadoop is a challenge in itself. Hadoop's security model is complex, and for that reason it is disabled by default. Data is at huge risk because Hadoop lacks encryption at the storage and network levels; without encryption, data can easily be compromised.

3. Big Data cravings:

Hadoop is most attractive when a business is built on a genuinely large data set. But before adopting Hadoop, you need answers to certain questions: how many terabytes of data you actually have, whether you have a steady and large flow of incoming data, and how much of that data will really be operated upon.

4. Shared libraries forcefully stored in HDFS:

Hadoop keeps repeating this pattern: if a Pig script is stored in HDFS, it is assumed that the JAR files it depends on will be stored there too. The same theme recurs in Oozie and other tools. Storing shared libraries in HDFS is not a bad idea in itself, but keeping them consistent across a huge organisation is a painful task.

5. Vulnerable by nature:

Hadoop is always risky when it comes to security. The framework is written in Java, a hugely popular language that is also heavily targeted and exploited by cybercriminals. That automatically leaves Hadoop deployments more exposed to data breaches.

6. Oozie:

Debugging is not a fun job. If there is an error, it does not always mean you have done something wrong; it can also be a protocol error arising from a configuration typo or a schema validation failure, and these kinds of errors fail on the server. In such cases Oozie is often not much help, especially if the workflow is not set up properly.

7. Unsuitable for small data:

Big data doesn't always mean big businesses, and big data platforms are not always suited to small-data needs. Hadoop is one such platform that is not a good fit for small data: because of its high-capacity design, the Hadoop Distributed File System (HDFS) cannot efficiently support random reads of small files. Hence, Hadoop is not the best solution for organisations that deal with small amounts of data. (A common workaround, packing small files into a single SequenceFile, is sketched below.)
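The workaround mentioned above can be sketched as follows, assuming Hadoop's standard SequenceFile API; the class name and command-line paths are hypothetical placeholders. Many small files are concatenated into one SequenceFile keyed by filename, which HDFS and MapReduce handle far more efficiently than thousands of tiny files.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: pack the contents of a directory of small files into one SequenceFile,
// keyed by the original filename, so HDFS stores one large file instead of many tiny ones.
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);  // directory full of small files (hypothetical)
        Path packed = new Path(args[1]);    // output SequenceFile (hypothetical)

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) continue;
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                try (InputStream in = fs.open(status.getPath())) {
                    IOUtils.copyBytes(in, buffer, conf, false);
                }
                // key = original filename, value = raw file contents
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(buffer.toByteArray()));
            }
        }
    }
}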

8. Stability issues:

Hadoop, being an open-source platform, has been developed by many contributors who are still working on the project. Like any other open-source software, it sees constant improvement, but it has also had its share of stability issues. Organisations are advised to run the latest stable version of Hadoop to avoid these kinds of issues.

9. Documentation:

The documentation of the Hadoop system is not very polished and contains several errors. The examples it shares are not always verified, which leads to mistakes. The worst offender is the documentation for Oozie, whose examples do not even pass schema validation.

10. Repository management:

If you have done any installation from the Hadoop repositories, you will know that the repositories do not always behave properly, as they are poorly managed. Compatibility is not always checked when a new component is installed.

http://www.efytimes.com/e1/fullnews.asp?edid=152456

http://iianalytics.com/research/evaluating-hadoop-for-enterprise-big-data-etl

7 Limitations Of Big Data In Marketing Analytics

Big data: the cutting edge of modern marketing or an overhyped buzzword? Columnist Kohki Yamaguchi dives into some of the limitations of user-centered data.

Kohki Yamaguchi, February 12, 2015 at 10:25 am

As everyone knows, big data is all the rage in digital marketing nowadays. Marketing organizations across the globe are trying to find ways to collect and analyze user-level or touchpoint-level data in order to uncover insights about how marketing activity affects consumer purchase decisions and drives loyalty. In fact, the buzz around big data in marketing has risen to the point where one could easily get the illusion that utilizing user-level data is synonymous with modern marketing.

This is far from the truth. Case in point, Gartner's hype cycle as of last August placed big data for digital marketing near the apex of inflated expectations, about to descend into the trough of disillusionment. It is important for marketers and marketing analysts to understand that user-level data is not the end-all, be-all of marketing: as with any type of data, it is suitable for some applications and analyses but unsuitable for others. Following is a list describing some of the limitations of user-level data and the implications for marketing analytics.

1. User Data Is Fundamentally Biased
The user-level data that marketers have access to covers only individuals who have visited your owned digital properties or viewed your online ads, which is typically not representative of the total target consumer base. Even within the pool of trackable cookies, the accuracy of the customer journey is dubious: many consumers now operate across devices, and it is impossible to tell for any given touchpoint sequence how fragmented the path actually is. Furthermore, those who operate across multiple devices are likely to be from a different demographic compared to those who only use a single device, and so on. User-level data is far from being accurate or complete, which means there is inherent danger in assuming that insights from user-level data apply to your consumer base at large.

2. User-Level Execution Only Exists in Select Channels
Certain marketing channels are well suited for applying user-level data: website personalization, email automation, dynamic creatives, and RTB spring to mind. In many channels, however, it is difficult or impossible to apply user data directly to execution except via segment-level aggregation and whatever other targeting information is provided by the platform or publisher. Social channels, paid search, and even most programmatic display are based on segment-level or attribute-level targeting at best. For offline channels and premium display, user-level data cannot be applied to execution at all.

3. User-Level Results Cannot Be Presented Directly
More accurately, they can be presented via a few visualizations such as a flow diagram, but these tend to be incomprehensible to all but domain experts. This means that user-level data needs to be aggregated up to a daily segment level or property level at the very least in order for the results to be consumable at large.

4. User-Level Algorithms Have Difficulty Answering "Why"
Largely speaking, there are only two ways to analyze user-level data: one is to aggregate it into a smaller data set in some way and then apply statistical or heuristic analysis; the other is to analyze the data set directly using algorithmic methods. Both can result in predictions and recommendations (e.g. move spend from campaign A to B), but algorithmic analyses tend to have difficulty answering "why" questions (e.g. why should we move spend) in a manner comprehensible to the average marketer. Certain types of algorithms, such as neural networks, are black boxes even to the data scientists who designed them. Which leads to the next limitation:

5. User Data Is Not Suited for Producing Learnings
This will probably strike you as counter-intuitive. Big data = big insights = big learnings, right? Wrong! For example, let's say you apply big data to personalize your website, increasing overall conversion rates by 20%. While certainly a fantastic result, the only learning you get from the exercise is that you should indeed personalize your website. This result raises the bar on marketing, but it does nothing to raise the bar for marketers. Actionable learnings that require user-level data (for instance, applying a look-alike model to discover previously untapped customer segments) are relatively few and far between, and require tons of effort to uncover. Boring ol' small data remains far more efficient at producing practical real-world learnings that you can apply to execution today.

6. User-Level Data Is Subject to More Noise
If you have analyzed regular daily time series data, you know that a single outlier can completely throw off analysis results. The situation is similar with user-level data, but worse. In analyzing touchpoint data, you will run into situations where, for example, a particular cookie received, for whatever reason, a hundred display impressions in a row from the same website within an hour (this happens much more often than you might think). Should this be treated as a hundred impressions or just one, and how will it affect your analysis results? Even more so than smaller data, user-level data tends to be filled with so much noise and so many potentially misleading artifacts that it can take forever just to clean up the data set in order to get reasonably accurate results. (A small cleanup sketch follows after this article excerpt.)

7. User Data Is Not Easily Accessible or Transferable
Because of security concerns, user data cannot be made accessible to just anyone, and requires care in transferring from machine to machine, server to server. Because of scale concerns, not everyone has the technical know-how to query big data in an efficient manner, which causes database admins to limit the number of people who have access in the first place. Because of the high amount of effort required, whatever insights are mined from big data tend to remain a one-off exercise, making it difficult for team members to conduct follow-up analyses and validation. All of these factors limit agility of analysis and the ability to collaborate.

So What Role Does Big Data Play?
So, given all of these limitations, is user-level data worth spending time on? Absolutely: its potential to transform marketing is nothing short of incredible, both for insight generation and for execution. But when it comes to marketing analytics, I am a big proponent of picking the lowest-hanging fruit first: prioritizing analyses with the fastest time to insight and largest potential value. Analyses of user-level data fall squarely in the high-effort and slow-delivery camp, with variable and difficult-to-predict value. Big data may have the potential to yield more insights than smaller data, but it will take much more time, consideration, and technical ability to extract them. Meanwhile, there should be plenty of room to gain learnings and improve campaign results using less granular data. I have yet to see such a thing as a perfectly managed account, or a perfectly executed campaign.

So yes, definitely start investing in big data capabilities. Meanwhile, let's focus as much, if not more, on maximizing value from smaller data.

Note: In this article I treated big data and user-level data synonymously for simplicity's sake, but the definition of big data can extend to less granular but more complex and varied data sets.
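As a concrete, purely hypothetical illustration of the cleanup step mentioned under point 6 above, the Java sketch below collapses bursts of impressions from the same cookie on the same site within a one-hour window into a single event before analysis; the record type and field names are invented for the example, and the input is assumed to be sorted by timestamp.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative cleanup step: treat a burst of impressions from the same cookie
// and site within one hour as a single impression before further analysis.
public class ImpressionDeduper {
    record Impression(String cookieId, String site, long epochMillis) {}

    static final long WINDOW_MILLIS = 60L * 60L * 1000L;  // one hour

    static List<Impression> collapseBursts(List<Impression> sortedByTime) {
        Map<String, Long> lastKept = new HashMap<>();
        List<Impression> cleaned = new ArrayList<>();
        for (Impression imp : sortedByTime) {
            String key = imp.cookieId() + "|" + imp.site();
            Long prev = lastKept.get(key);
            // Keep the impression only if nothing from the same cookie/site was kept
            // within the last hour; otherwise treat it as part of the same burst.
            if (prev == null || imp.epochMillis() - prev >= WINDOW_MILLIS) {
                cleaned.add(imp);
                lastKept.put(key, imp.epochMillis());
            }
        }
        return cleaned;
    }
}

Whether a burst should really count as one event or a hundred is an analytical judgment; the point of the sketch is only that such judgments have to be encoded somewhere before user-level data becomes usable.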


http://marketingland.com/7-limitations-big-data-marketing-analytics-117998

The Challenges of Real-Time Big Data Analytics

Of course, real-time big data analytics is not only positive; it also presents some challenges.

It requires special computing power: the standard version of Hadoop is, at the moment, not yet suitable for real-time analysis, so new tools need to be bought and used. There are, however, quite a few tools available to do the job, and Hadoop is expected to be able to process data in real time in the future.

Using real-time insights requires a different way of working within your organisation: if your organisation normally receives insights only once a week, which is very common, receiving these insights every second will require a different approach and way of working. Insights require action, and instead of acting on a weekly basis, that action is now required in real time. This will have an effect on the culture. The objective should be to make your organisation an information-centric organisation.
https://datafloq.com/read/the-power-of-real-time-big-data/225

The ultimate limitation of big data for development

Speed read:
Big data are based on what happened in the past, which restricts forecasting ability.
But development policy aims to create a future that is distinct from the past.
Making predictions about an unprecedented future requires theory-driven models.

Big data can only capture the past; without theory, they cannot predict into a changing future, says Martin Hilbert.

Recently, much has been written, talked, and done about the usefulness of big data for development. The UN Economic and Social Council recognises that big data have the potential to produce more relevant and more timely statistics than traditional sources of official statistics, such as survey and administrative data sources, while the OECD is convinced that big data now represents a core economic asset that can create significant competitive advantage. [1,2]

At the same time, obstacles and perils have been noted, mostly well-known challenges previously discussed in the context of the digital divide, including shortages in skills and infrastructure, and privacy concerns.

But there is one ultimate, theoretical limitation on what big data can do and what it cannot do, and it is particularly relevant for development work. This is a limitation inherent to big data, and should make advocates alert and cautious when working with and trusting in it.

Data from the past

The gist behind this limitation is known as the Lucas critique in economics, as Goodhart's law in finance and as Campbell's law in education. All date back to 1976, when US economist Robert Lucas criticised colleagues who used sophisticated statistics to make economic predictions (econometrics). He argued that no useful information can emerge from such predictions because any policy change will also change the econometric models. [3]

The reasoning is that all kinds of data, including econometric or big, are from the past or, at best, the real-time present. So any analysis that uses them can only tell us about what has already happened. Where the past, present and future follow the same logic, this is useful. However, if significant changes occur in the dynamic of the system being described, empirical statistics are, at best, limited.

Development work aims to create such changes. Its explicit goal is to create a future that significantly differs from the past. Given the complexity and uniqueness of each social, economic, cultural and natural system that is subject to development interventions, the result of such interventions is almost always novel and unique for each case. It is, in essence, a reality that has never been: different from the past and different from other cases. So what could the past possibly tell us about the future in this case?

Predicting a changing future

To predict a future that has never been, theory-driven models are necessary. These allow variables to be adjusted with values based on theory that have never existed in statistically observable reality. New variables can even be included, such as those introduced by a development intervention.

This is especially important in social systems. Complex social dynamics are notoriously stacked with non-linear properties that defy most methods of statistical extrapolation, which are linear. A developed Africa will not simply be an extrapolated version of Europe's past development trajectory.

Think about big data this way. Facebook, Google and Amazon can predict your future behaviour better than any psychologist, but only if your future behaviour follows the same logic as your past behaviour. No theory is required; big data is sufficient. This is often referred to as "the end of theory" due to big data. [4]

But if you fall in love, or change your job, or change the country where you live, predictions from past data will be limited, if not deceiving. In that case, a psychologist or an economist who has a theory-driven model of you will still be able to make predictions, by changing the model's variables according to the changed environment.

For example, if you worked as a bartender in Brazil last year and then became a data analyst in Germany, big data on your last year's behaviour will be limited in predicting your future behaviour, while a more comprehensive model of your preferences might still be able to give insights about your shifting interests from caipirinhas to Hadoop analytics.

The same holds for development. For example, when Google search habits changed, the ability of Google Flu Trends to predict epidemics became limited. [5] The model can, of course, be constantly adjusted (after the fact), but without some theory one cannot predict into a changing future.

Millions of variables

Thankfully, the digital revolution is not limited to producing big data; it also helps with modelling. While in the past theory-driven models had only a few variables, today's computational power allows for thousands or even millions of variables.

Computer simulations (such as agent-based models) do not have any conceptual limitations regarding the achievable level of detail and precision. The behaviour of individuals and organisations can be adjusted to and in response to an ever-changing reality.

Beyond accuracy, the biggest advantage of computer simulation models for development is their modular flexibility. Each line of their code defines some kind of behaviour or characteristic. Adding them together and letting things interact recreates a social complexity often similar to the one we see in reality. Reusing the code allows us to create tailor-made models for concrete problems in specific, local- and context-dependent settings.
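As a purely illustrative sketch of this modular, behaviour-per-line style (not taken from the article), the toy agent-based model below gives each agent one behavioural rule: at every step it adopts the majority opinion of a few randomly sampled peers. All names and parameters are hypothetical, and a real development model would encode far richer behaviours.

import java.util.Random;

// Minimal, illustrative agent-based model: agents hold a binary opinion and
// repeatedly adopt the majority opinion of a few randomly chosen peers.
public class ToyAgentModel {
    public static void main(String[] args) {
        Random rng = new Random(7);
        int agents = 1000;
        int steps = 50;
        int peersPerStep = 3;

        boolean[] opinion = new boolean[agents];
        for (int i = 0; i < agents; i++) opinion[i] = rng.nextBoolean();

        for (int step = 0; step < steps; step++) {
            boolean[] next = opinion.clone();
            for (int i = 0; i < agents; i++) {
                int agree = 0;
                for (int p = 0; p < peersPerStep; p++) {
                    if (opinion[rng.nextInt(agents)]) agree++;
                }
                // Behavioural rule: adopt the majority view of the sampled peers.
                next[i] = agree * 2 > peersPerStep;
            }
            opinion = next;
        }

        int adopters = 0;
        for (boolean o : opinion) if (o) adopters++;
        System.out.println("Agents holding opinion A after " + steps + " steps: " + adopters);
    }
}

Swapping in a different behavioural rule, or adding new variables introduced by a hypothetical intervention, only requires changing or adding a few lines, which is the modular flexibility the text describes.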

Similarly to the creation of different versions of the city-building video game SimCity, the computer simulation of a unique community in Africa can use existing software modules while evaluating a context-dependent future that is different from the past. The image of SimCity highlights an additional benefit: that multimedia visualisation can be used to engage and convince policy makers without sophisticated statistical or economic training.

This is the ultimate goal of computational social science: providing understandable and scalable solutions to embrace both the complexity of development and the uniqueness of its ever-changing paths.

Social science future

But does this sound like a scary future in which the behaviour of entire societies, communities and each of its members is replicated in real-time computer simulations that are constantly adjusted with the big data incessantly collected about each one of us? Scary or not, it is certainly the future of social science. Some branches of science, industry and governments are working on it at high speed.

For example, the UN Environment Programme has teamed up with Microsoft Research for the past three years to create a computer model that simulates all ecological life on Earth. [6] And the US city of Portland has simulated the daily behaviour of its 1.6 million residents over 180,000 locations in order to optimise the roll-out of light-rail infrastructure and to simulate epidemics. [7]

We should make these powerful tools work for development policy. While we have spent significant effort focusing on big data, we are far from having models that can be used in an ever-changing reality of ongoing challenges in health, education, economic growth, poverty or social cohesion. Much more effort has to be put into such theory-driven models.

If not, we run the risk of falling into the same traps that Lucas's colleagues did in the 1970s, some four decades before the big data revolution.

Martin Hilbert is part of the faculty of the Department of Communication at the University of California, Davis, United States. He can be contacted at [email protected] or via http://www.martinhilbert.net

http://www.scidev.net/global/data/opinion/ultimate-limitation-big-data-development.html

http://www.nytimes.com/2014/04/07/opinion/eight-no-nine-problems-with-big-data.html?_r=0

Eight (No, Nine!) Problems With Big Data
By Gary Marcus and Ernest Davis, April 6, 2014


BIG data is suddenly everywhere. Everyone seems to be collecting it, analyzing it, making money from it and celebrating (or fearing) its powers. Whether we're talking about analyzing zillions of Google search queries to predict flu outbreaks, or zillions of phone records to detect signs of terrorist activity, or zillions of airline stats to find the best time to buy plane tickets, big data is on the case. By combining the power of modern computing with the plentiful data of the digital era, it promises to solve virtually any problem (crime, public health, the evolution of grammar, the perils of dating) just by crunching the numbers.

Or so its champions allege. "In the next two decades," the journalist Patrick Tucker writes in the latest big data manifesto, "The Naked Future," "we will be able to predict huge areas of the future with far greater accuracy than ever before in human history, including events long thought to be beyond the realm of human inference." Statistical correlations have never sounded so good.

Is big data really all it's cracked up to be? There is no doubt that big data is a valuable tool that has already had a critical impact in certain areas. For instance, almost every successful artificial intelligence computer program in the last 20 years, from Google's search engine to the I.B.M. "Jeopardy!" champion Watson, has involved the substantial crunching of large bodies of data. But precisely because of its newfound popularity and growing use, we need to be levelheaded about what big data can and can't do.

The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: both went down sharply. But it's hard to imagine there is any causal relationship between the two. Likewise, from 1998 to 2007 the number of new cases of autism diagnosed was extremely well correlated with sales of organic food (both went up sharply), but identifying the correlation won't by itself tell us whether diet has anything to do with autism.

Second, big data can work well as an adjunct to scientific inquiry but rarely succeeds as a wholesale replacement. Molecular biologists, for example, would very much like to be able to infer the three-dimensional structure of proteins from their underlying DNA sequence, and scientists working on the problem use big data as one tool among many. But no scientist thinks you can solve this problem by crunching data alone, no matter how powerful the statistical analysis; you will always need to start with an analysis that relies on an understanding of physics and biochemistry.

Third, many tools that are based on big data can be easily gamed. For example, big data programs for grading student essays often rely on measures like sentence length and word sophistication, which are found to correlate well with the scores given by human graders. But once students figure out how such a program works, they start writing long sentences and using obscure words, rather than learning how to actually formulate and write clear, coherent text.
Even Google's celebrated search engine, rightly seen as a big data success story, is not immune to "Google bombing" and "spamdexing," wily techniques for artificially elevating website search placement.

Fourth, even when the results of a big data analysis aren't intentionally gamed, they often turn out to be less robust than they initially seem. Consider Google Flu Trends, once the poster child for big data. In 2009, Google reported to considerable fanfare that by analyzing flu-related search queries, it had been able to detect the spread of the flu as accurately and more quickly than the Centers for Disease Control and Prevention. A few years later, though, Google Flu Trends began to falter; for the last two years it has made more bad predictions than good ones.

As a recent article in the journal Science explained, one major contributing cause of the failures of Google Flu Trends may have been that the Google search engine itself constantly changes, such that patterns in data collected at one time do not necessarily apply to data collected at another time. As the statistician Kaiser Fung has noted, collections of big data that rely on web hits often merge data that was collected in different ways and with different purposes, sometimes to ill effect. It can be risky to draw conclusions from data sets of this kind.

A fifth concern might be called the echo-chamber effect, which also stems from the fact that much of big data comes from the web. Whenever the source of information for a big data analysis is itself a product of big data, opportunities for vicious cycles abound. Consider translation programs like Google Translate, which draw on many pairs of parallel texts from different languages (for example, the same Wikipedia entry in two different languages) to discern the patterns of translation between those languages. This is a perfectly reasonable strategy, except for the fact that with some of the less common languages, many of the Wikipedia articles themselves may have been written using Google Translate. In those cases, any initial errors in Google Translate infect Wikipedia, which is fed back into Google Translate, reinforcing the error.

A sixth worry is the risk of too many correlations. If you look 100 times for correlations between two variables, you risk finding, purely by chance, about five bogus correlations that appear statistically significant even though there is no actual meaningful connection between the variables. Absent careful supervision, the magnitudes of big data can greatly amplify such errors.

Seventh, big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions. In the past few months, for instance, there have been two separate attempts to rank people in terms of their historical importance or cultural contributions, based on data drawn from Wikipedia. One is the book "Who's Bigger? Where Historical Figures Really Rank," by the computer scientist Steven Skiena and the engineer Charles Ward. The other is an M.I.T. Media Lab project called Pantheon.

Both efforts get many things right (Jesus, Lincoln and Shakespeare were surely important people), but both also make some egregious errors. "Who's Bigger?" claims that Francis Scott Key was the 19th most important poet in history; Pantheon has claimed that Nostradamus was the 20th most important writer in history, well ahead of Jane Austen (78th) and George Eliot (380th). Worse, both projects suggest a misleading degree of scientific precision with evaluations that are inherently vague, or even meaningless. Big data can reduce anything to a single number, but you shouldn't be fooled by the appearance of exactitude.

FINALLY, big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like "in a row"). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.

To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams such as "dumbed-down escapist fare" that had never before appeared anywhere in all the petabytes of text indexed by Google. To witness the limitations that big data can have with novelty, Google-translate "dumbed-down escapist fare" into German and then back into English: out comes the incoherent "scaled-flight fare." That is a long way from what Mr. Lowe intended and from big data's aspirations for translation.

Wait, we almost forgot one last problem: the hype. Champions of big data promote it as a revolutionary advance. But even the examples that people give of the successes of big data, like Google Flu Trends, though useful, are small potatoes in the larger scheme of things. They are far less important than the great innovations of the 19th and 20th centuries, like antibiotics, automobiles and the airplane.

Big data is here to stay, as it should be. But let's be realistic: it's an important resource for anyone analyzing data, not a silver bullet.

Gary Marcus is a professor of psychology at New York University and an editor of the forthcoming book "The Future of the Brain." Ernest Davis is a professor of computer science at New York University.
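The arithmetic behind the sixth worry (roughly five bogus "significant" correlations per 100 tests) can be checked with a small simulation. The Java sketch below is illustrative only; it assumes 30 observations per variable and uses the approximate two-tailed 5% critical value of about 0.36 for the correlation coefficient at that sample size.

import java.util.Random;

// Illustrative simulation of the multiple-comparisons problem: correlate 100 pairs
// of completely unrelated random variables and count how many look "significant".
public class SpuriousCorrelations {
    // Pearson correlation of two equal-length arrays.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;
        double vx = sxx - sx * sx / n;
        double vy = syy - sy * sy / n;
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int trials = 100;        // 100 independent variable pairs
        int n = 30;              // 30 observations per variable
        double critical = 0.36;  // approx. |r| threshold for p < 0.05 (two-tailed) at n = 30
        int spurious = 0;
        for (int t = 0; t < trials; t++) {
            double[] x = new double[n], y = new double[n];
            for (int i = 0; i < n; i++) { x[i] = rng.nextGaussian(); y[i] = rng.nextGaussian(); }
            if (Math.abs(pearson(x, y)) > critical) spurious++;
        }
        // Expect roughly 5 "significant" correlations out of 100, purely by chance.
        System.out.println("Spurious significant correlations: " + spurious + " / " + trials);
    }
}

Running it typically reports a handful of "significant" correlations even though every pair is pure noise, which is the effect the op-ed warns is amplified at big data scale.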

http://www.enterprisemanagement360.com/wp-content/files_mf/1360922634PARACCEL1.pdf