Data Morph: A Cautionary Tale of Summary Statistics

Friday 2:30 PM–3:00 PM in Door 12 / Goldfields Theatre

Part of the Scientific Python specialist track

Statistics do not come intuitively to humans; they always try to find simple ways to describe complex things. Given a complex dataset, they may feel tempted to use simple summary statistics like the mean, median, or standard deviation to describe it. However, these numbers are not a replacement for visualizing the distribution.

To illustrate this fact, researchers have generated many datasets that are very different visually, but share the same summary statistics. In this talk, I will discuss Data Morph, an open source package that builds on previous research using simulated annealing to perturb an arbitrary input dataset into a variety of shapes, while preserving the mean, standard deviation, and correlation to multiple decimal points. I will showcase how it works, discuss the challenges faced during development, and explore the limitations of this approach.

This talk introduces Data Morph, a new open source Python package I built that can be used to morph an input dataset of 2D points into select shapes, while preserving the summary statistics to a given number of decimal points through simulated annealing. Data Morph extends research from Autodesk to create the Datasaurus Dozen, and is intended to be used as a teaching tool for illustrating why you can’t rely solely on summary statistics. Come learn how it works and what it took to translate the research into an open-source library.

Stefanie Molin she/her • @StefanieMolin

Stefanie Molin is a software engineer at Bloomberg in New York City, where she tackles tough problems in information security, particularly those revolving around data wrangling/visualization, building tools for gathering data, and knowledge sharing. She is also a core developer of numpydoc, creator of the numpydoc-validation pre-commit hook, and the author of “Hands-On Data Analysis with Pandas: A Python data science handbook for data collection, wrangling, analysis, and visualization,” which is currently in its second edition and has been translated into Korean and Chinese. She holds a bachelor’s of science degree in operations research from Columbia University's Fu Foundation School of Engineering and Applied Science, as well as a master’s degree in computer science, with a specialization in machine learning, from Georgia Tech. In her free time, she enjoys traveling the world, inventing new recipes, and learning new languages spoken among both people and computers.