How do you get those insights? The Bardess Zero2Hero stack is one way, but the stack needs a skilled data scientist behind it to really make things happen.
Serena Peruzzo is the newest data scientist on the Bardess team, but she brings a wealth of high-level expertise from around the globe.
We recently had the opportunity to sit down with Peruzzo and get insight into the work she’s done and where things go from here now that she’s joined the Bardess team.
Tell us a little about your background
I am originally from Italy, where I did my studies in statistics. After finishing my master’s in 2011, I moved to Australia, where I spent three years working for a Melbourne-based management consulting firm specializing in data and predictive analytics. The company was small but well established, and we worked with a broad range of industries across the entire country. At that point data science wasn’t even a thing yet, but a lot of what I know about practical data science and business I learned there.
After moving back to Europe in mid-2015, I spent most of my time in London, working with startups and helping them set up their data science practice. I enjoyed the challenges of greenfield problems and the fast-paced environment. I also spent a year as a researcher at the Eindhoven University of Technology in the Netherlands, where I mostly focused on machine learning for predictive maintenance.
In your opinion what is the practice of data science and what do you believe makes a good data scientist?
There isn’t a straightforward answer to this question. Data science is an umbrella term that covers a lot of different roles and specializations. For instance, some data scientists focus on the mathematical aspects of the job, researching solutions and building models; others are stronger coders and focus on productionizing those models; still other roles blend into data engineering.
Most data scientists specialize in one of these areas but have at least some understanding of the other two. It’s a very cross-functional role, so the most successful people in the industry have a good mix of math and coding skills. But that’s not enough. Communication is fundamental to succeeding in a role that requires constant collaboration with different people, from high-level strategic minds to software engineers.
Another skill that I think is very important is business acumen. Understanding the business’s priorities allows a data scientist to tackle problems that have a significant impact on the company. There is a myriad of interesting problems to solve, but not all of them necessarily create value for a business. Being able to identify the ones that do is incredibly valuable.
Last but not least, curiosity and out-of-the-box thinking. Data science is a fast-moving industry, with a huge amount of innovation and research coming out every day. A drive to learn new things and constantly improve your skillset is definitely a common trait of many successful data scientists.
What is the most interesting project you have worked on? How did you process it and what were the results?
One of my absolute favorites is a project I worked on while learning about Natural Language Processing (NLP). I was working through Natural Language Processing with Python, which introduces the fundamentals of NLP using a popular library called nltk, but I wanted something more challenging than the toy problems in the book to test my knowledge, so I gave myself a challenge: could NLP techniques be used to identify the difference between Shakespeare’s comedies and tragedies?
The solution uses a technique called Latent Dirichlet Allocation (LDA) to describe each play as a mixture of topics, then clusters the plays into two groups based on this representation. LDA is a generative statistical model, i.e. it describes the probabilistic procedure used to assign words to a document, and it does so by identifying topics (or themes) in the corpus and assigning each document a probability for each topic. The results were very interesting. Instead of identifying the split between comedies and tragedies, the algorithm identified the stylistic difference between the plays written under Elizabeth I and the ones written under James I.
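The pipeline described above can be sketched in a few lines with scikit-learn: a bag-of-words representation feeds LDA, and the resulting per-document topic mixtures become the features for clustering. The tiny corpus below is purely illustrative, standing in for the Shakespeare plays, and k-means stands in for whichever clustering step was actually used.

```python
# Sketch: describe each document as an LDA topic mixture, then cluster
# documents into two groups on that representation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

# Toy stand-ins for the plays (illustrative words only)
plays = [
    "love jest wedding mirth garden feast",
    "jest mirth love marriage feast garden",
    "blood murder ghost crown revenge death",
    "crown blood betrayal death ghost revenge",
]

# Bag-of-words counts, the standard input for LDA
counts = CountVectorizer().fit_transform(plays)

# Represent each play as a mixture over a small number of topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mix = lda.fit_transform(counts)  # shape: (n_plays, n_topics)

# Cluster the plays into two groups using the topic mixtures as features
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(topic_mix)
print(labels)
```

Each row of `topic_mix` is a probability distribution over topics, which is exactly the "mixture of topics" representation mentioned above.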
In terms of industry applications, a project I enjoyed at my first consulting job was mapping the three-phase network for an electricity wholesaler. This project was interesting both because it required some understanding of how electricity production and distribution work, and because it was greenfield, so we had to come up with an ad hoc solution. The goal was to identify which customers were attached to the same phase of the network using the voltage time series from their smart meters. We designed a custom distance metric based on the correlation between the trend and noise components of the time series, and used hierarchical clustering to identify similar patterns of usage and group them together.
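The shape of that solution can be illustrated with SciPy. The sketch below uses synthetic "voltage" series and a plain correlation distance as a simplified stand-in for the custom trend/noise metric described above; the clustering step is the same hierarchical approach.

```python
# Simplified sketch: correlation-based distance between time series,
# then hierarchical clustering to group customers on the same phase.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)

# Synthetic smart-meter series: two groups of customers, each group
# sharing the same underlying signal plus independent noise
phase_a = [np.sin(t) + 0.1 * rng.standard_normal(t.size) for _ in range(3)]
phase_b = [np.cos(t) + 0.1 * rng.standard_normal(t.size) for _ in range(3)]
series = np.array(phase_a + phase_b)

# Distance = 1 - Pearson correlation, so similar series are close
dist = 1 - np.corrcoef(series)
np.fill_diagonal(dist, 0.0)
condensed = squareform(dist, checks=False)

# Average-linkage hierarchical clustering, cut into two groups
labels = fcluster(linkage(condensed, method="average"), t=2, criterion="maxclust")
print(labels)
```

With a real dataset, the distance computation would be replaced by the project-specific metric, but the clustering machinery stays identical.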
How do you handle situations when data is missing or more data is needed, and what techniques do you recommend? What are the typical pain points in a data science project?
There are several hurdles in a data science project, from setting expectations to the choice of validation scheme. However, the problem that can undermine everything else is missing data, followed closely by poor data quality.
As a data scientist, there is only so much you can do about missing data. There are several statistical techniques that can be applied to fill in missing values, but most of them rely on the assumption that the probability of a value being missing does not depend on the missing value itself (this is called the missing at random, or MAR, assumption). The good news here is that once you’ve decided what approach to take, the problem is no longer time consuming. On the other hand, when the MAR assumption doesn’t hold, i.e. there is a specific mechanism causing the data to be missing, there isn’t much you can do about it, and the only options are to integrate other data sources or work with an incomplete representation of the problem.
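As a concrete example of the "fill in missing values" techniques mentioned above, here is a minimal sketch using scikit-learn's `SimpleImputer` with mean imputation; the data is made up for illustration, and more sophisticated options (median, k-NN, or model-based imputation) plug into the same fit/transform pattern.

```python
# Minimal sketch: fill missing values with the column mean, a simple
# imputation approach that is reasonable under a MAR-style assumption.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [np.nan, 40.0],
])

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```

The key point from the answer above still applies: the imputer can only be trusted if the missingness mechanism doesn't depend on the values that are missing.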
The other pain point is messy data. In this case a lot can be done about it, and that is what we call data cleaning. Examples are different units of measurement used for the same variable, or different spellings of the same name across records. Data quality issues add a degree of uncertainty to a project, especially the first time a dataset is used. There is no way of knowing in advance what inconsistencies will be in the data, and there is no silver bullet, so the process of eliminating them tends to be time consuming.
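The two examples given above, inconsistent units and inconsistent spellings, look something like this in practice. The records, names, and unit mappings below are invented for illustration.

```python
# Sketch of two common data-cleaning steps: harmonizing spellings of the
# same name and converting a variable to a single unit of measurement.
import pandas as pd

df = pd.DataFrame({
    "customer": ["ACME Corp", "Acme corp.", "Globex"],
    "weight": [1200.0, 1.2, 0.8],
    "unit": ["g", "kg", "kg"],
})

# Harmonize spellings with an explicit mapping (hypothetical names)
name_map = {"ACME Corp": "Acme Corp", "Acme corp.": "Acme Corp"}
df["customer"] = df["customer"].replace(name_map)

# Convert everything to a single unit of measurement (kg)
df.loc[df["unit"] == "g", "weight"] /= 1000.0
df["unit"] = "kg"
print(df)
```

The time-consuming part is rarely the code itself; it is discovering which inconsistencies exist in the first place, which is why cleaning a new dataset is hard to estimate up front.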
What are the organizational pain points for setting up a data science team?
In the early days of data science we encountered a lot of distrust. I’m referring to my previous consulting experience, back in 2012-13. There was significant interest in data-driven decision making, but data science wasn’t mature as a field and certainly not in the spotlight like it is now, so we worked with organizations that came to us for help getting started, but we also encountered varying degrees of resistance. It wasn’t just about the solution; it was also about helping them with change management. There were always skeptics, and part of our job was convincing them that yes, math works, and no, it’s not black magic.
With data science in the spotlight this has now changed. Organizations are more aware of the benefits of data science and we encounter less resistance. In my experience, communication and integration with the rest of the team are the key issues nowadays. They really are two aspects of the same problem. A data science team never operates as an isolated unit, so being able to establish relationships with various stakeholders in the business is vital to its success. Some of these stakeholders are technical, e.g. developers who need to integrate data science models into their own work, but others might be at the opposite end of the spectrum, e.g. internal users with little to no math background whose primary requirements are interpretability and usability. In this context it is fundamental that the team can receive information and translate it into model requirements, as well as clearly communicate the outcomes of the process, whether it’s a stand-alone analysis or a productionized solution.
A data scientist spends 80% of their time on data cleaning/wrangling: good or bad?
I don’t disagree with the numbers: if you’re starting a new project, you’ll spend an awful lot of time collecting, cleaning, and pre-processing data. What I disagree with is the tendency to use this statistic to argue that the step is a waste of time.
The truth is that when you’re cleaning your data, you’re not just dealing with missing values, errors, noise, or inconsistencies; you’re fulfilling the two main requirements for building a good model: understanding the problem you’re trying to solve, and gaining perspective on the data available in that domain.
Preprocessing may seem like a series of trivial tasks, but it is the very first step in transforming data into actionable insight. It is at this stage that raw data is turned into variables that can be analyzed to reveal statistical relationships, and the decisions made here can make the difference between introducing bias and facilitating the training of the final model.
What are some challenges you encountered in your role and how did you overcome them?
The first thing that comes to mind is communication issues. I have worked in several countries, and each of them posed some level of language barrier. In my first job after university, the challenge came from getting comfortable using English in a professional environment and understanding the differences from my native language. The gap between business jargon and conversational language is wider in Italian than it is in English, so it took a little time and some discomfort to work out the appropriate way of addressing a client. More recently, I worked for a Korean company and found myself on the other side of the fence, being more comfortable with English than most of my colleagues.
A lesson I learned early on in my career as a data scientist is that you don’t always have the answer, and that’s okay. It might be more acute in consulting, because you take on the role of an advisor for your clients, but I think every young data scientist struggles a bit with it. It takes experience to internalize the fact that it isn’t about knowing all the answers, but rather knowing how to find them. Expecting to always have the answer can lead to imposter syndrome, i.e. the inability to recognize one’s achievements and the fear of being exposed as a fraud. Falling into this negative pattern is especially easy in an industry that is so tightly connected to research, and full of talented people solving hard problems.
On a more practical level, something I found challenging is shifting priorities. In my experience this is a frequent problem in startups, but many established companies can experience the same issue when establishing a data science team. When the direction isn’t clear, priorities are subject to frequent and abrupt changes, and that makes identifying high-value projects much harder, without even taking into account the waste of resources.
There are many success factors in this industry, but the ability to liaise with different types of people and establish functional, productive relationships is one of the less obvious ones. Emotional intelligence draws the line between someone who can apply complex methods and someone who can take problem solving a step further, identifying the problems to prioritize and impacting multiple areas of the organization.
Thank you for your time, Serena. Your experience is a great fit here at Bardess, and you’re bringing amazing value to our clients. Data science is an absolute must in modern business.
Bardess is a technology consulting and solutions company focused exclusively on leading-edge data analytics. With scientists like Peruzzo and our full team of solution experts and data architects, you can trust Bardess to bring you the valuable insights you need from business intelligence.
To learn more about Bardess or schedule a consultation, call us at 973-584-9100, email firstname.lastname@example.org, or visit the website at www.bardess.com.
Sara Gizinski is part of the Hero Enablement team at Bardess Ltd. She is passionate about and committed to enabling data heroes for our clients.