In the old, data-is-scarce model, companies had to decide what to collect first, and then collect it. A traditional enterprise data warehouse might have tracked sales of widgets by color, region, and size. This act of deciding what to store and how to store it is called designing the schema, and in many ways, it’s the moment where someone decides what the data is about. It’s the instant of context.
That needs repeating:
You decide what data is about the moment you define its schema.
With the new, data-is-abundant model, we collect first and ask questions later. The schema comes after the collection. Indeed, Big Data success stories like Splunk, Palantir, and others are prized because of their ability to make sense of content well after it’s been collected—sometimes called a schema-less query. This means we collect information long before we decide what it’s for.
And this is a dangerous thing.