
Projit: An Open Source Python Tool for Decoupled Data Science

Friday 4:10 PM–4:20 PM in Door 12 / Goldfields Theatre

Part of the Scientific Python specialist track

Data science projects occupy an unusual space between coding/hacking and methodologically rigorous experimentation. They require careful discipline to prevent problems like target leakage, over-fitting or p-hacking. Typically, data scientists use custom workflows or proprietary cloud systems to automate and standardise elements such as the management of data sets, scripts, model artefacts and results. The result is a general absence of both standardisation and easy migration of processes for flexible and repeatable data science work.

In this talk we will outline a lightweight open source Python package that can be used to manage project metadata in a way that allows easy sharing, migration and collaboration for data scientists working in the Python ecosystem. We will discuss some of the design principles, inspired by a combination of the UNIX command line and the git source control utility. We will then demonstrate basic usage of the package with examples from scientific research papers in which it has been used.

https://pypi.org/project/projit/

John Hawkins

John Hawkins is an Australian data scientist and the author of the book Getting Data Science Done. He is the Chief Scientist at the ad-tech company Playground XYZ, and an affiliate researcher with the Transitional AI Group at UNSW. He has 20 years of experience solving problems in industry and academia, delivering data science solutions for organisations in software development, banking, insurance, media, ad-tech, and bio-medical research. He holds a PhD in Computer Science from the University of Queensland and a Bachelor of Arts (Honours I) in Philosophy of Science from the University of Newcastle. He has written more than 30 peer-reviewed academic articles and presented at academic and industry conferences around the world.