Wednesday, October 4, 2023

Approaching Unfamiliar Data Sets

I asked ChatGPT, "what is the best way for a data scientist to approach a large, structured data set when they don't necessarily understand the contents of the data?"

To which it gave a lengthy answer (as it tends to do):


Approaching a large, structured data set when you don't understand its contents can be a challenging but manageable task for a data scientist. Here's a step-by-step guide on how to approach this situation:

  1. Initial Data Exploration:

    • Begin by obtaining a high-level understanding of the data. Review any available documentation, data dictionaries, or metadata that might provide context about the dataset. 
  2. Data Profiling:

    • Perform basic data profiling to get a sense of the data's characteristics. This can include examining the number of rows and columns, data types, missing values, and basic summary statistics. 
  3. Data Sampling:

    • Take a random sample of the data to get a smaller, manageable subset for initial exploration. This can help you explore the data without overwhelming yourself. 
  4. Visualizations:

    • Create simple data visualizations such as histograms, scatter plots, or bar charts to visualize the distribution of variables. Visualization can reveal patterns and outliers that might not be obvious from raw data. 
  5. Statistical Analysis:

    • Conduct basic statistical analyses like mean, median, standard deviation, and correlations between variables. These can provide insights into relationships within the data. 
  6. Data Cleaning:

    • Address missing data, outliers, and inconsistencies. Data cleaning is crucial to ensure the data is in a usable format for analysis. 
  7. Domain Expert Consultation:

    • If possible, consult with subject matter experts or individuals who might have a better understanding of the data. They can provide valuable insights and context. 
  8. Feature Engineering:

    • Create new features or derive meaningful insights from existing ones based on your initial explorations and domain knowledge. 
  9. Unsupervised Learning:

    • Apply unsupervised learning techniques like clustering to identify natural groupings within the data. Clustering can help uncover hidden patterns. 
  10. Iterative Exploration:

    • Continue to explore the data iteratively, refining your understanding as you go. This may involve repeating steps 3-9 as you gain more insights. 
  11. Documentation:

    • Keep detailed notes throughout your exploration process. Document assumptions, findings, and any insights gained. This documentation will be valuable as you gain a deeper understanding of the data. 
  12. Collaboration:

    • Collaborate with colleagues or team members, especially if they have expertise in data analysis or the domain the data comes from. Different perspectives can lead to better insights. 
  13. Modeling and Hypothesis Testing:

    • As your understanding of the data improves, start building predictive or explanatory models and test hypotheses related to the data. 
  14. Validation and Testing:

    • Validate your models and hypotheses rigorously, and assess their performance on appropriate metrics. 
  15. Feedback Loop:

    • Continuously incorporate feedback and insights gained from modeling and analysis into your understanding of the data. 
  16. Document Your Process:

    • Document your entire data exploration process, including any assumptions, methodologies, and insights. This documentation will be crucial for reproducibility and for communicating your findings to others.

Remember that exploring and understanding the contents of a large, unfamiliar dataset can be time-consuming, so patience and persistence are key. The goal is to progressively build a deeper understanding of the data as you work through it.
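
To its credit, the mechanical steps really are mechanical. Basic profiling (step 2), for instance, is a handful of pandas calls. A minimal sketch, assuming the data loads into a DataFrame called `df` (the file name is a placeholder):

```python
import pandas as pd

# Placeholder file name; substitute the real source.
df = pd.read_csv("unfamiliar_data.csv")

# Row/column counts, dtypes, and memory footprint in one call.
df.info()

# Summary statistics for numeric and non-numeric columns alike.
print(df.describe(include="all"))

# Missing values per column, worst offenders first.
print(df.isna().sum().sort_values(ascending=False))

# Cardinality per column: near-constant columns and ID-like
# columns both jump out at the extremes.
print(df.nunique().sort_values())
```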
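
Sampling (step 3) is a one-liner; pinning `random_state` keeps the subset reproducible between sessions:

```python
# A reproducible 5% sample for fast, cheap exploration.
sample = df.sample(frac=0.05, random_state=42)
```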
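
For the visualization step (step 4), histograms of every numeric column are a fast first pass (assuming matplotlib is installed):

```python
import matplotlib.pyplot as plt

# One histogram per numeric column; skewed distributions and
# suspicious spikes (sentinel values like -999) show up immediately.
df.select_dtypes("number").hist(bins=50, figsize=(12, 8))
plt.tight_layout()
plt.show()
```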
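
Correlations (step 5) are similarly cheap, and they're often the first hint of how columns relate when the data dictionary is silent. The 0.7 cutoff below is an arbitrary choice:

```python
# Pairwise correlations among numeric columns.
corr = df.select_dtypes("number").corr()

# Keep only strong relationships, masking the self-correlations on
# the diagonal. Each pair appears twice (the matrix is symmetric).
strong = corr.where((corr.abs() > 0.7) & (corr.abs() < 1.0)).stack()
print(strong)
```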
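
Cleaning (step 6) depends entirely on what profiling turned up, but the mechanics tend to look like this. Every threshold here is a judgment call, not a rule:

```python
# Drop columns that are mostly empty (here: more than half missing).
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))

# Fill remaining numeric gaps with each column's median.
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Clip extreme outliers to the 1st/99th percentile band per column.
df[num_cols] = df[num_cols].clip(
    lower=df[num_cols].quantile(0.01),
    upper=df[num_cols].quantile(0.99),
    axis=1,
)
```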
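
Feature engineering (step 8) can't be generic, but the derived columns usually start this simply. The column names below are entirely hypothetical:

```python
# Hypothetical columns; ratios and date parts are common first features.
df["unit_price"] = df["amount"] / df["quantity"]
df["created_at"] = pd.to_datetime(df["created_at"])
df["created_month"] = df["created_at"].dt.month
```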
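
For the unsupervised step (step 9), k-means over the scaled numeric columns is the usual first look, assuming scikit-learn is available. k=5 is a placeholder; in practice you'd sweep a range of k and inspect what comes back:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale first so no single column dominates the distance metric.
# Assumes missing values were already handled in the cleaning step.
X = StandardScaler().fit_transform(df.select_dtypes("number"))

# k=5 is arbitrary; sweep several values and compare.
df["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

# Per-cluster means are often the first real "shape" you see in the data.
print(df.groupby("cluster").mean(numeric_only=True))
```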
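
And once there's something to predict, steps 13 and 14 can start as a throwaway baseline; even a quick random forest doubles as an exploration tool via its feature importances. The `target` column here is a stand-in, not a real field:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# `target` is hypothetical; substitute whatever you're trying to predict.
X = df.select_dtypes("number").drop(columns=["target", "cluster"], errors="ignore")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

# Importances hint at which columns actually carry signal.
print(pd.Series(model.feature_importances_, index=X.columns).sort_values())
```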


All good suggestions, and as the sketches above show, the mechanical steps are cheap to run. But it still doesn't answer the basic question of how to gain an understanding of unfamiliar data. Not really. Yes, doing the above will begin to give some insight, but I still need more.

The problem we're trying to solve here is how to deliver real insights that can power a company from vast amounts of data, while keeping those insights simple to consume without spending a lot of time educating the consumer. Yes, there's a data dictionary, and yes, it describes how the tables are related, but even very skilled data scientists still need a lot of time, explanation, and hand-holding, up to and including having the queries written for them. And while that's something I am willing to do for these clients, it just isn't scalable.

So how do you do it?