Data science is a field that sells very well.
It has generated great expectations for organizations, businesses, and professionals, and it has brought new products and solutions for improvement through data exploration. The capabilities of Big Data are there to be exploited.
My challenge in this post is to present ideas on how to extract information that can improve software quality, using data science techniques and tools.
Where the story begins
After a software system (or ERP) has existed for many years, it is very common for hundreds (or thousands) of lines of log files to be generated daily.
These are logs of different types, sizes, and shapes, each addressing a situation, a layer, a service, an integration, a web service, etc.
This data can be like treasure waiting to be discovered. I began to bet that there may be a hidden treasure, and asked myself:
- What information is hidden?
- How can it be validated in order to extract relevant information?
- Can this information serve software quality?
- Which new data can be generated to help improve the operation of my software?
The log data generated by the system may serve to give a “boost” to the performance of the system itself, just as the exhaust gases generated by car engines are harnessed by the turbocharger, providing that “vruuum”.
Does this abstraction make sense? Let’s continue…
How does Data Science sell itself?
“Data Science is the extraction of knowledge from large volumes of data that are structured or unstructured, which is a continuation of the field data mining and predictive analytics, also known as knowledge discovery and data mining (KDD). “Unstructured data” can include emails, videos, photos, social media, and other user-generated content. Data science often requires sorting through a great amount of information and writing algorithms to extract insights from this data.”
Jeff Leek says the data is the second most important thing:
“The second most important is the data. The most important thing in data science is the question. Often the data will limit or enable the questions. But having data can’t save you if you don’t have a question.”
Source: Data Scientist Toolbox.
So we are talking about a science whose challenge is to ask the right questions and then use the data to look for answers. The scientific approach leads us to define the questions first, and only then go looking for possible answers.
My initial approach was already wrong: I was starting from the data to generate results. But we can see that the data is the second most important thing. The most important thing is the question.
How to improve software quality?
So, with the data in second place and the questions in first, I went “back to the drawing board”. The questions that motivate this challenge are:
- Which parts of the system are most accessed by users?
- What are the slowest parts of the system?
- Is the exchange of data between applications (client apps, server apps, integrations) efficient?
- Which SQL queries are executed most often?
- Which SQL queries are the slowest?
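To make the last two questions more concrete, here is a minimal sketch in Python of how they might be answered from a log file. The log line format, the sample lines, and the function names (summarize_sql, most_executed, slowest) are all assumptions made up for illustration; a real system's logs will almost certainly look different and need their own parsing rule.

```python
import re
from collections import defaultdict

# Hypothetical log format (an assumption for illustration only):
#   2015-09-28 10:15:02 SQL 132ms SELECT * FROM orders WHERE id = ?
LINE_RE = re.compile(r"^\S+ \S+ SQL (?P<ms>\d+)ms (?P<query>.+)$")


def summarize_sql(lines):
    """Return {query: (executions, average_ms)} for lines matching LINE_RE."""
    totals = defaultdict(lambda: [0, 0])  # query -> [count, total_ms]
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a SQL entry
        entry = totals[match.group("query")]
        entry[0] += 1
        entry[1] += int(match.group("ms"))
    return {q: (count, total / count) for q, (count, total) in totals.items()}


def most_executed(summary, n=3):
    """Queries ordered by number of executions, highest first."""
    return sorted(summary, key=lambda q: summary[q][0], reverse=True)[:n]


def slowest(summary, n=3):
    """Queries ordered by average duration, highest first."""
    return sorted(summary, key=lambda q: summary[q][1], reverse=True)[:n]


# Made-up sample lines in the hypothetical format above.
sample = [
    "2015-09-28 10:15:02 SQL 132ms SELECT * FROM orders WHERE id = ?",
    "2015-09-28 10:15:03 SQL 98ms SELECT * FROM orders WHERE id = ?",
    "2015-09-28 10:15:04 SQL 870ms SELECT * FROM invoices",
    "2015-09-28 10:15:05 INFO user login",
]

summary = summarize_sql(sample)
print("Most executed:", most_executed(summary, 1))
print("Slowest:", slowest(summary, 1))
```

The point of the sketch is that each question maps to a simple aggregation (a count or an average) once the log lines have been parsed into fields, which is why the question has to come before the data.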
And the challenge continues…
In future posts I intend to continue presenting these ideas:
- More details of the challenge;
- What is expected of a data scientist;
- Data collection tools;
- Data cleanup techniques;
- Storage of evidence;
- Generation of dashboards and charts;
- Automation of collection operations.
Have you thought about these questions before? Feel free to leave a comment, or even to help me with my challenge.
I end the post with a quote from John Tukey:
“The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”
Fabiano de Freitas Silva