Or big change in our world?
I think that the answer can be all of the above. “Hype” you might be thinking? Well, here is the deal. Our world has changed in unimaginable ways. The amount of information created daily is reaching levels that just a few years ago would’ve been considered science fiction or even plain old crazy.
Lots and Lots of Data
To make it even more interesting, a lot of it is unstructured data. That can be kind of a problem if we think about it, because the success of relational databases has taught a lot of us to think in terms of tables, columns and relations.
And this is not bad… at all. It is nice to have all your data and metadata organized neatly. You can use select, join, where, group by and more to get what you need.
But the success of relational databases can also create a blind spot for many. Just a few days ago I was talking with the VP and cofounder of a company that builds migration and artificial intelligence software and that has had success (as well as a few failures, or learning experiences) in several world-class projects. They have lots of data that they obtain from their automated code conversion tools, and what are they doing with it? They are normalizing it into a database.
I don’t think it is a bad approach; however, it is not the one that I would take. Long story short, I would store the logs as is, in their raw format, and then use any of the available projects to analyse them in multiple ways, looking for key points, failures, trends and more. But what you do with the data is the topic of another post or a Pluralsight training. Let’s go back to our main point.
Mountains of data are being generated daily, and the amount will not just continue to grow, it will explode.
Unstructured Data “Just Happens”
If you had to structure all your data, can you imagine what the cost would be? Just go ask your manager for an Oracle system and some servers to process all of your web server logs and put them in tables. The cost would be exorbitant.
And besides cost, sometimes you may not know the structure of your data. And that is one of the beautiful parts of Big Data. You can just store your logs in raw format and later come back and do your work, modelling your data in different ways. And what if you have too much data and the process is taking longer than expected?
Well, just add a few more servers and get the job done in parallel. Hadoop runs on commodity hardware, so you can get many relatively inexpensive machines to work together and process your data according to your needs.
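To make the idea a bit more concrete, here is a minimal sketch in plain Python. It is not Hadoop and does not use any Hadoop API; it only imitates the pattern on one machine. The sample log lines, the function names (`map_chunk`, `split`, `count_statuses`) and the use of a thread pool as a stand-in for worker machines are all my own assumptions for illustration. The raw lines are kept as they are, the structure is applied only when we read them, and the work is split into chunks whose partial results are merged at the end.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical raw web server log lines, stored exactly as they were written.
RAW_LOGS = [
    '10.0.0.1 - - [01/Jan/2015:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2015:10:00:01 +0000] "GET /missing HTTP/1.1" 404 0',
    '10.0.0.1 - - [01/Jan/2015:10:00:02 +0000] "POST /login HTTP/1.1" 500 0',
    '10.0.0.3 - - [01/Jan/2015:10:00:03 +0000] "GET /index.html HTTP/1.1" 200 512',
    'this line is garbage and does not match',
]

# The "schema" lives in the reader, not in the storage: here we only
# care about the three-digit HTTP status that follows the quoted request.
STATUS_PATTERN = re.compile(r'" (\d{3}) ')


def map_chunk(lines):
    """Map step: count HTTP status codes found in one chunk of raw lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
        else:
            # Raw data is messy; keep track of what we could not parse.
            counts["unparsed"] += 1
    return counts


def split(lines, n_chunks):
    """Deal the lines out round-robin into n_chunks lists."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def count_statuses(lines, workers=2):
    """Run the map step on each chunk, then merge the partial counts."""
    chunks = split(lines, workers)
    # Threads stand in for the separate machines of a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Reduce step: add the partial counts together.
    return total


if __name__ == "__main__":
    print(count_statuses(RAW_LOGS, workers=2))
```

If tomorrow you want to look at the same logs by URL or by client address instead, you only change the pattern in the reader; the stored data stays untouched. And because each chunk is counted independently, adding workers does not change the result, only how the work is divided.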
The Cloud and the Bar
And even better, remember “the cloud”! A few years ago, if you were a startup and needed beefy computing power, you would need a lot of upfront cash to cover expenses. Now with AWS and Azure we have the possibility of turning on a few virtual machines, getting a cluster up and running, crunching the data, getting the result, turning the machines off and paying only for the time we used.
And this change has lowered the entry bar for innovation. Now many brilliant ideas can be tested, or theories can be analysed, at a much lower cost, benefiting all mankind. For example, it is possible to run analysis on medical treatments to help cure cancer or many other diseases. Sometimes answers to hard questions lie right there in the data; they just need to be discovered.
Hype or Go Figure This Hadoop Thing Out
But what about hype? Let me make this clear: I don’t think Big Data is hype. I do think that there is a lot of hype around it. Even though we are able to do great things with Big Data, the greater public does not yet fully understand what can be done and how. So I have taken on a personal mission to help developers and the public in general understand Big Data (and Search).
So then it is time to ask ourselves this question:
What Are My Choices for Getting Started with Big Data?