Data Science Landscape
Education
13.07.2023

A walkthrough of the data science landscape - roles, algorithms, tools, pipelines, and processes, all summed up in a high-level picture.

What is Data Science?

  • Data science is not a field of its own but a unification of statistics, analytical models, and programming.
  • A data scientist makes sense of data in order to provide added value for a given use case.
  • Data scientists commonly work closely with business people. The data scientist's role is to understand the business need and translate it into a data science solution.
  • Data science is not an ordinary programming field (e.g. developing a mobile application); it leans more towards research and development, or trial and error.

Data

  • A data scientist usually deals with two main categories of data:
    • Structured (e.g. tables, JSON, CSV files)
    • Unstructured (e.g. free text, images, videos)
  • Before applying any algorithm, it is also important to know whether the data is labeled. Labeling means assigning a semantic value to a given data instance (e.g. “an image of a cat”, “a scientific text”).
  • For most algorithms, data needs to be organized into data points, with or without a label. These data points need to be homogeneous: structured data points share the same “features” (e.g. transactional records), and unstructured data points are all of one kind (e.g. only images, not mixed with text).
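
As a rough illustration of the points above, the sketch below checks that a set of structured data points is homogeneous and separates labeled from unlabeled ones. The field names (`amount`, `country`), the labels, and both helper functions are made up for this example.

```python
# Hypothetical data points: a dict of features plus an optional label.
data_points = [
    {"features": {"amount": 120.0, "country": "RO"}, "label": "legit"},
    {"features": {"amount": 9800.0, "country": "DE"}, "label": "fraud"},
    {"features": {"amount": 45.5, "country": "RO"}, "label": None},  # unlabeled
]


def is_homogeneous(points):
    """Return True if every data point has exactly the same feature names."""
    feature_sets = {frozenset(p["features"]) for p in points}
    return len(feature_sets) <= 1


def split_by_label(points):
    """Separate data points that carry a label from those that do not."""
    labeled = [p for p in points if p["label"] is not None]
    unlabeled = [p for p in points if p["label"] is None]
    return labeled, unlabeled


labeled, unlabeled = split_by_label(data_points)
print(is_homogeneous(data_points), len(labeled), len(unlabeled))
```

Labeled points like these could feed a supervised algorithm, while the unlabeled ones would only suit an unsupervised one.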

Algorithms

  • Data science algorithms fall into two main categories: supervised and unsupervised learning. Supervised learning tries to “learn” a given human-annotated label, whereas unsupervised learning also “learns” but without a given “ground truth” (the label).
  • To oversimplify, all of these algorithms are advanced statistical models, also known as “machine learning”.
  • They further divide into regression and classification for supervised learning, and clustering and anomaly detection for unsupervised learning.
  • Some notable algorithms are SVMs and decision trees for “classical” machine learning, and neural networks for deep learning, which come in so many variations that they form a field of their own.
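
To make the supervised/unsupervised split concrete, here is a minimal sketch in plain Python. It does not use any of the algorithms named above: the supervised part is a toy nearest-centroid classifier, the unsupervised part is a simple z-score anomaly detector, and all data and function names are invented for illustration.

```python
from statistics import mean, pstdev


def fit_centroids(values, labels):
    """Supervised: learn one centroid (mean value) per human-annotated label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: mean(vals) for label, vals in groups.items()}


def predict(centroids, value):
    """Classify a value with the label of the closest centroid."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def find_anomalies(values, threshold=2.0):
    """Unsupervised: flag values far from the mean; no labels are involved."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Toy, made-up data: a single numeric feature per data point.
heights_cm = [30.0, 35.0, 32.0, 60.0, 65.0, 62.0]
animals = ["cat", "cat", "cat", "dog", "dog", "dog"]
centroids = fit_centroids(heights_cm, animals)
print(predict(centroids, 34.0))  # 34.0 is closest to the "cat" centroid

amounts = [10.0, 12.0, 11.0, 9.0, 10.0, 11.0, 500.0]
print(find_anomalies(amounts))  # only 500.0 exceeds the threshold
```

The classifier needs the `animals` labels to learn anything, while the anomaly detector works on the raw `amounts` alone.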

Processes

  • This section covers the methods applied once the data for model development has been established. Real-life data is usually messy and needs further processing before it is usable.
  • Such processes include imputing, which deals with missing values; smoothing, which usually means transforming data that varies too much; and cleaning, which adjusts individual features (e.g. deleting meaningless columns in tabular data, fixing upper/lower case in text, splitting a column in two).
  • Another important process is feature generation (e.g. aggregating data, TF-IDF, PCA, one-hot encoding).
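
The following sketch shows three of these processes on made-up columns: mean imputation, a simple text clean-up, and one-hot encoding. It uses only the Python standard library, and the helper names are assumptions for this example; in practice a library such as Pandas or sklearn would usually do this work.

```python
from statistics import mean


def impute_mean(values):
    """Imputing: replace missing values (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def clean_text(values):
    """Cleaning: normalise letter case and strip surrounding whitespace."""
    return [v.strip().lower() for v in values]


def one_hot(values):
    """Feature generation: one-hot encode a categorical column."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


# Made-up raw columns with a missing value and inconsistent casing.
ages = [34, None, 28, 40]
cities = [" Cluj", "bucharest", "CLUJ ", "Iasi"]

print(impute_mean(ages))
encoded, categories = one_hot(clean_text(cities))
print(categories)
print(encoded)
```

Note that cleaning runs before encoding here: without it, " Cluj" and "CLUJ " would become two separate categories.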

Modelling

  • Modelling is the core activity of the data scientist, and the model is the “end product”. To model something, multiple algorithms are put on “trial” for “learning”.
  • This is usually not straightforward: multiple iterations are necessary, as is going back to previous steps.
  • In a nutshell, a model is a “black box” that is able to generalize over a given topic and provide highly probable, good-quality results (e.g. which images are likely cats, whether a customer is going to churn, whether this person is unlike the others).
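
Putting algorithms on “trial” can be sketched as fitting several candidates on training data and comparing them on held-out data. The example below is a deliberately small illustration: the churn data is synthetic and noise-free, and both candidate “algorithms” (a majority-class baseline and a single cut-off rule) are invented for this example.

```python
def accuracy(model, rows):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def fit_majority(train):
    """Baseline candidate: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def fit_threshold(train):
    """Second candidate: pick the cut-off on x with the best training accuracy."""
    best_cut = max(
        (x for x, _ in train),
        key=lambda cut: accuracy(lambda x: x >= cut, train),
    )
    return lambda x: x >= best_cut


# Synthetic data: (months of inactivity, churned?), churn starts at 6 months.
rows = [(months, months >= 6) for months in range(12)] * 5

# Deterministic hold-out split: every fifth row is kept for testing.
train = [row for i, row in enumerate(rows) if i % 5 != 0]
test = [row for i, row in enumerate(rows) if i % 5 == 0]

candidates = {"majority": fit_majority, "threshold": fit_threshold}
scores = {name: accuracy(fit(train), test) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

On real, noisy data the scores would be much closer together, which is where the multiple iterations and the return to earlier steps come in.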

Pipelines

  • Apart from the different processes that may arise with the problem at hand, the data science process has a string of pipelines that can be generalized.
  • ETL stands for Extract, Transform, Load. It is the data acquisition step that bundles together all the data, possibly from multiple sources.
  • EDA stands for Exploratory Data Analysis. It gives visual insight into the data (e.g. graphics, statistics, dependencies).
  • DQ stands for Data Quality. Its scope is to evaluate the quality of the data the data scientist received, and to fix or drop what falls short.
  • Serving and deployment are usually done by an MLOps team, most of the time in tight collaboration with the data scientist.
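
A toy version of the ETL, EDA, and DQ steps might look like the sketch below. The two CSV “sources”, their column names, and the quality rules are all assumptions made for this example, and the EDA step prints summary counts rather than drawing charts.

```python
import csv
import io

# Two made-up sources standing in for real systems (files, databases, APIs).
CRM_CSV = "customer_id,age\n1,34\n2,\n3,28\n4,45\n"
BILLING_CSV = "customer_id,monthly_spend\n1,20.5\n2,35.0\n3,-1\n4,50.0\n"


def extract(text):
    """Extract: read raw rows from one source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(crm_rows, billing_rows):
    """Transform: join both sources on customer_id and cast strings to numbers."""
    spend = {r["customer_id"]: float(r["monthly_spend"]) for r in billing_rows}
    return [
        {
            "customer_id": int(r["customer_id"]),
            "age": int(r["age"]) if r["age"] else None,
            "monthly_spend": spend.get(r["customer_id"]),
        }
        for r in crm_rows
    ]


def explore(rows):
    """EDA: summary counts that hint at problems in the data."""
    return {
        "rows": len(rows),
        "missing_age": sum(r["age"] is None for r in rows),
        "negative_spend": sum(r["monthly_spend"] < 0 for r in rows),
    }


def data_quality(rows):
    """DQ: drop rows with missing or impossible values."""
    return [r for r in rows if r["age"] is not None and r["monthly_spend"] >= 0]


# Load: here simply an in-memory list; a real pipeline would write to a store.
loaded = transform(extract(CRM_CSV), extract(BILLING_CSV))
summary = explore(loaded)
clean = data_quality(loaded)
print(summary)
print([r["customer_id"] for r in clean])
```

Here the quality step drops offending rows; fixing them (for example by imputing the missing age) would be an equally valid choice depending on the use case.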

Tools

  • In terms of tools, there is a large open-source ecosystem; the ones below are widely recommended and have some of the biggest communities.
  • The backbone of data science is the Python programming language.
  • For algorithms, there is TensorFlow for deep learning and scikit-learn (sklearn) for many others.
  • Pandas and SQL are musts for ETL.
  • Plotly and Matplotlib are popular graphical libraries for EDA.
  • Spark, Hadoop, and Kafka are solutions for handling large volumes of data and streaming, as well as ETL.
  • Airflow is a widely used solution for orchestrating and automating the workflows that train and deploy production-ready models.