What is Hadoop?

Hadoop is getting a lot of attention now in  the database and content management circles, but many people in the industry still don’t really know what it is and or how it can be best applied.

The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textural and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google’s innovations were incorporated intoNutch, an open source project, and Hadoop was later spun-off from that. Yahoo has played a key role developing Hadoop for enterprise applications.

The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.

Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples.

Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization’s data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because there are multiple copy stores, data stored on a server that goes offline or dies can be automatically replicated from a known good copy.

In a centralized database system, you’ve got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That’s MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.

Architecturally, the reason you’re able to deal with lots of data is because Hadoop spreads it out. And the reason you’re able to ask complicated computational questions is because you’ve got all of these processors, working in parallel, harnessed together.

It’s fair to say that a current Hadoop adopter must be more sophisticated than a relational database adopter. There are not that many “shrink wrapped” applications today that you can get right out of the box and run on your Hadoop processor. It’s similar to the early ’80s when Ingres and IBM were selling their database engines and people often had to write applications locally to operate on the data.

That said, you can develop applications in a lot of different languages that run on the Hadoop framework. The developer tools and interfaces are pretty simple. Some of our partners — Informatica is a good example — have ported their tools so that they’re able to talk to data stored in a Hadoop cluster using Hadoop APIs. There are specialist vendors that are up and coming, and there are also a couple of general process query tools: a version of SQL that lets you interact with data stored on a Hadoop cluster, and Pig, a language developed by Yahoo that allows for data flow and data transformation operations on a Hadoop cluster.

Hadoop’s deployment is a bit tricky at this stage, but the vendors are moving quickly to create applications that solve these problems. I expect to see more of the shrink-wrapped apps appearing over the next couple of years.

Still using Stored Procedures…

I’m sure we’ve all heard, over and over, that inline SQL is generally a bad practice, and that we should use Stored Procedures when possible. But let’s be realistic for a minute. Who wants to write a stupid stored procedure for every stupid little simple query needed.

Have you ever worked on a system where someone decreed* that all database calls must be Stored Procedures, and SQL is strictly verboten? I have, and this decision leads to incredible development pain:

  1. Stored Procedures are written in big iron database “languages” T-SQL (Microsoft). These so-called languages are archaic, and full of the crazy, incoherent design choices that always result from the torturous evolution of ten years of backwards compatibility. You really don’t want to be writing a lot of code in this stuff. For context, JavaScript is a giant step up from T-SQL.
  2. Stored Procedures typically cannot be debugged in the same IDE you write your UI. Frequently transitioning between two totally different IDEs, with completely different interfaces and languages, is not exactly productive.
  3. Stored Procedures don’t provide much feedback when things go wrong. Unless the proc is coded interally with weird T-SQL exception handling, we get cryptic ‘errors’ returned based on the particular line inside the proc that failed, such as Table has no rows. Uh, ok?
  4. Stored Procedures can’t pass objects. So, if you’re not careful, you can end up with a zillion parameters. If you have to populate a table row with 20+ fields using a proc, say hello to 20+ parameters. Worst of all, if I pass a bad parameter– either too many, not enough, or bad datatypes– I get a generic “bad call” error. SQL can’t tell me which parameters are in error! So I have to pore over 20 parameters, by hand, to figure out which one is the culprit.
  5. Stored Procedures hide business logic. I have no idea what a proc is doing, or what kind of cursor (DataSet) or values it will return to me. I can’t view the source code to the proc (at least, without resorting to #2 if I have appropriate access) to verify that it is actually doing what I think it is– or what the designer intended it to do. Inline SQL may not be pretty, but at least I can see it in context, alongside the other business logic.

So why use Stored Procedures at all? Conventional wisdom says we do it because:

  • Stored procedures generally result in improved performance because the database can optimize the data access plan used by the procedure and cache it for subsequent reuse.
  • Stored procedures can be individually secured within the database. A client can be granted permissions to execute a stored procedure without having any permissions on the underlying tables.
  • Stored procedures result in easier maintenance because it is generally easier to modify a stored procedure than it is to change a hard-coded SQL statement within a deployed component.
  • Stored procedures add an extra level of abstraction from the underlying database schema. The client of the stored procedure is isolated from the implementation details of the stored procedure and from the underlying schema.
  • Stored procedures can reduce network traffic, because SQL statements can be executed in batches rather than sending multiple requests from the client.

However, there’s one small problem: none of these things are true in practice. The benefits are marginal, but the pain is substantial. And I’m not the only person that feels this way.

The new (as of SQL Server 7.0) cached execution plan optimization in SQL Server looks to me a lot like JIT compilation. If this is, in fact, the case it seems to me that the only overhead that would be associated with dynamic SQL would be:

  1. The amount of bandwidth + time it takes to transmit the dynamic SQL text to the database.
  2. The amount of time it takes to calculate the hash of the dynamic SQL text to look up the cached execution plan.

I can imagine quite a few scenarios where the above overhead would disappear into the noise of the network roundtrip. What upsets me are the folks who spout forth anecdotal arguments that claim stored procedures have “much better” performance than dynamic SQL.

For modern databases and real world usage scenarios, I believe a Stored Procedure architecture has serious downsides and little practical benefit. Stored Procedures should be considered database assembly language: for use in only the most performance critical situations. There are plenty of ways to design a solid, high performing data access layer without resorting to Stored Procedures; you’ll realize a lot of benefits if you stick with parameterized SQL and a single coherent development environment.

Follow

Get every new post delivered to your Inbox.

Join 132 other followers