Data Lake Top Tips

I’ve called this Data Lake Top Tips because if I called it Data Warehouse Top Tips, then nobody would read it (because nobody is building Data Warehouses any more, right?). The reality is that Data Lakes, Data Warehouses, Data Marts and Data Science (and others) all have to work together to deliver the business need, and so the real top tips are how all those can work together.

Architecture Strategy

  1. Always focus on solving specific business problems. The old adage of “collect the data and they will come” doesn’t work.
  2. Your data lake will not be enough on its own – complement it with whatever other tools are needed to get the job done, especially for business user access
  3. View your data lake as a continually evolving system – if it isn’t changing, then it’s probably already dead – so build in change control from day 1
  4. Work with key business end-users from the start and build them into the project. They will become your best advocates.
  5. Governance isn’t something that can be left until later – it will be too late. Start with the basics – every dataset must have a business and technical owner – and work on from there.
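
As a minimal sketch of the "every dataset must have a business and technical owner" rule, a check like the following could run whenever a dataset is registered. The `DatasetRecord` class, its field names and the sample catalogue are hypothetical, for illustration only, and do not refer to any particular catalogue tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical catalogue entry for one dataset in the lake."""
    name: str
    business_owner: Optional[str] = None
    technical_owner: Optional[str] = None


def missing_owners(datasets: List[DatasetRecord]) -> List[str]:
    """Return the names of datasets lacking a business or a technical owner."""
    return [
        d.name
        for d in datasets
        if not (d.business_owner and d.technical_owner)
    ]


# Illustrative data only.
catalogue = [
    DatasetRecord("sales_orders", "Head of Sales", "Data Engineering"),
    DatasetRecord("web_clicks", None, "Data Engineering"),
]

unowned = missing_owners(catalogue)
print(unowned)
```

Failing the load, or at least raising an alert, when `unowned` is not empty is one way to stop ownership gaps building up.
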
The Data Ecosystem

Data Acquisition

  1. Your lake should have two discrete areas – “raw” data and “enriched” data
  2. Acquire data in as ‘raw’ a form as possible
  3. Use change data capture from the source where possible
  4. Use a complete operational copy to source data where appropriate (you may need it to identify daily changes)
  5. Retain your raw source data if possible and if space permits
  6. Run operational reporting off the source copy (ignore the old rules about not reporting off a “staging” area), but be aware of potential model changes in the source
  7. Design for real-time (or micro-batch) loading. Use daily loads only as a last resort
  8. Provide an easily accessible dashboard where business users can see what data has been loaded and when (and if possible include data quality metrics too)
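
Where change data capture is not available and you hold a complete operational copy, daily changes can be found by comparing two snapshots. The sketch below assumes each snapshot fits in memory as a dict keyed by primary key. The function name and the sample records are made up for illustration.

```python
def diff_snapshots(previous, current):
    """Compare two full copies keyed by primary key.

    Returns three dicts: rows that are new, rows whose values changed,
    and rows that have disappeared since the previous snapshot.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    deletes = {k: v for k, v in previous.items() if k not in current}
    return inserts, updates, deletes


# Illustrative data only.
yesterday = {
    1: {"name": "Acme", "tier": "gold"},
    2: {"name": "Globex", "tier": "silver"},
}
today = {
    1: {"name": "Acme", "tier": "platinum"},
    3: {"name": "Initech", "tier": "bronze"},
}

inserts, updates, deletes = diff_snapshots(yesterday, today)
print(sorted(inserts), sorted(updates), sorted(deletes))
```

For large tables the same comparison would normally be done in the database or processing engine, for example by joining on the key and comparing row hashes, rather than in memory.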

Data Enrichment

  1. Don’t be prescriptive about data enrichment – this may be augmentation, summarisation, aggregation, enrichment through machine learning or anything else that works for you
  2. You may need to duplicate data for ease of access. For example, the data model for business dashboards and the data model for data science will be very different, but they may hold exactly the same data.
  3. Don’t limit yourself to a purist view of technology – use whatever storage, database structures and database types are best – and there will likely be more than one needed
  4. Standardise and expose the business rules that you apply in the enrichment process
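
One possible way to standardise and expose business rules is to register each rule with a plain-language description, so the same registry drives both the enrichment code and the documentation shown to users. This is a sketch only: the `business_rule` decorator, the `net_revenue` rule and the field names are assumptions for illustration.

```python
BUSINESS_RULES = {}


def business_rule(name, description):
    """Register an enrichment function along with its business description."""
    def register(func):
        BUSINESS_RULES[name] = {"description": description, "apply": func}
        return func
    return register


@business_rule("net_revenue", "Gross revenue minus refunds, per order")
def net_revenue(row):
    # Return a new row so that the raw record is left untouched.
    return {**row, "net_revenue": row["gross"] - row["refunds"]}


def describe_rules():
    """Expose the rule catalogue as name -> description for business users."""
    return {name: rule["description"] for name, rule in sorted(BUSINESS_RULES.items())}


enriched = net_revenue({"order_id": 1, "gross": 100.0, "refunds": 15.0})
print(enriched["net_revenue"])
print(describe_rules())
```

The output of `describe_rules()` could be published alongside the business glossary so that users can see exactly which rules were applied.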

Data Model

  1. Use a standard industry data model if it makes sense (3NF models are very hard to maintain, but sometimes they are worth the effort)
  2. Recognise critical data entities and make them the core of your data mart. For example, Customer and Product are common across most businesses, but identify the right ones for your business
  3. Map the sources to the high level model to provide clear data lineage (because your business users will want it)
  4. Use just-in-time modelling – only model the data as you acquire it – don’t attempt to model the entire business
  5. Star schemas and OLAP models are still relevant for performance and ease of user access (the hierarchical drill model of OLAP is closer to the way that business users think than any technical model)
  6. Use a reference data management tool outside the data lake (a data lake is not a reference data management system)
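
To make the star schema point concrete, here is a small sketch using Python's built-in `sqlite3` module, with Customer and Product as dimensions around a single fact table. The table names, column names and rows are invented for illustration. A real data mart would sit in whatever database suits your platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    amount       REAL
);
INSERT INTO dim_customer VALUES (1, 'Acme', 'North'), (2, 'Globex', 'South');
INSERT INTO dim_product  VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO fact_sales   VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 25.0);
""")

# Roll sales up to region: one join from the fact table to one dimension.
sales_by_region = conn.execute("""
    SELECT c.region, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_key = f.customer_key
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

print(sales_by_region)
```

Drilling down from region to customer only means adding `c.name` to the `SELECT` and `GROUP BY`, which is why this shape suits business user access.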

Data Quality

  1. Build data quality in from the start
  2. Check data quality before any enrichment occurs – if the data quality is unusually low, then reject the data
  3. Agree standard business definitions for derived data
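
A quality gate ahead of enrichment might look like the sketch below. It uses completeness of required fields as the only metric and treats "unusually low" as falling more than a tolerance below an agreed baseline. The metric, the threshold values and the function names are all assumptions for illustration, and real checks would usually cover more dimensions of quality.

```python
def completeness(rows, required_fields):
    """Fraction of rows in which every required field is present and non-empty."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(field) not in (None, "") for field in required_fields)
        for row in rows
    )
    return complete / len(rows)


def quality_gate(rows, required_fields, baseline, tolerance=0.1):
    """Accept the batch only if its score is within tolerance of the baseline."""
    score = completeness(rows, required_fields)
    return score >= baseline - tolerance, score


# Illustrative batch: two of the four rows are missing a required value.
batch = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": ""},
    {"customer_id": None, "email": "c@example.com"},
    {"customer_id": 4, "email": "d@example.com"},
]

accepted, score = quality_gate(batch, ["customer_id", "email"], baseline=0.95)
print(accepted, score)
```

A rejected batch stays in the raw area for investigation, and the score can be fed to the data quality dashboard.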

Business Users

  1. For business user dashboards, build a star-schema based representation layer in a data mart, or export to an OLAP engine
  2. Provide an enterprise class dash-boarding & drill tool that can transparently access data from your data lake or data marts or OLAP
  3. Provide an easily searchable report catalogue
  4. Ensure the dash-boarding tool can link to the business glossary, data lineage, data quality dashboard, business rules

Data Science & Power Users

  1. Provide ad-hoc analysis and data visualisation tools
  2. Provide sufficient capacity for ad-hoc analysts
  3. Segregate users into expertise levels, with different complexity views
  4. Build a data lab experimentation area for power users

Downstream Feeds

  1. Don’t try to do everything in the data lake – be open about providing downstream feeds
  2. Provide an automated data extract service that can be requested by business users
  3. Provide an automated process for building data marts
  4. Build standard summary data sets for statistical analysis & downstream feeds for critical or widely shared data

Metadata & Governance

  1. Ensure data lineage across the model is both available and accessible
  2. Provide a business glossary
  3. Enable the business glossary to be extended through crowd sourcing
  4. Use a single metadata repository to cover all your data stores, not just the data lake
  5. Always provide a (very) high level data model of the complete data lake to help users understand the scope and context
  6. Work with business users to establish a governance framework
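
As a rough sketch of accessible lineage, the mapping from each dataset to its direct sources can be held as simple metadata and walked back to the original systems on request. The dataset names below are hypothetical, and the code assumes the lineage graph has no cycles.

```python
# Hypothetical lineage metadata: each dataset maps to its direct sources.
LINEAGE = {
    "mart.customer_dim": ["enriched.customer"],
    "enriched.customer": ["raw.crm_accounts", "raw.billing_customers"],
}


def upstream_sources(dataset, lineage):
    """Return the set of root sources feeding a dataset (assumes no cycles)."""
    parents = lineage.get(dataset, [])
    if not parents:
        # No recorded parents, so this dataset is itself an original source.
        return {dataset}
    roots = set()
    for parent in parents:
        roots |= upstream_sources(parent, lineage)
    return roots


sources = upstream_sources("mart.customer_dim", LINEAGE)
print(sorted(sources))
```

Keeping this mapping in the single metadata repository means the same lookup works across the data lake, data marts and any other store.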
