ETL is normally a continuous, ongoing process with a well-defined workflow. ETL first extracts data from homogeneous or heterogeneous data sources. Then, data is cleansed, enriched, transformed, and stored either back in a data lake or in a data warehouse.
ELT (Extract, Load, Transform) is a variant of ETL in which the extracted data is first loaded into the target system, and transformations are performed after the data is loaded into the data warehouse. ELT typically works well when the target system is powerful enough to handle the transformations; analytical databases like Amazon Redshift and Google BigQuery are common ELT targets for exactly that reason.
What is S3?
Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites.
What is Kafka?
Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a massively scalable pub/sub message queue designed as a distributed transaction log, making it highly valuable for enterprise infrastructures that process streaming data.
What is RedShift?
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. The first step to create a data warehouse is to launch a set of nodes, called an Amazon Redshift cluster. After you provision your cluster, you can upload your data set and then perform data analysis queries. Regardless of the size of the data set, Amazon Redshift offers fast query performance using the same SQL-based tools and business intelligence applications that you use today.
In other words, S3 is an example of a final data store where data might be loaded at the end of an ETL process, while Redshift is an example of a data warehouse product, provided specifically by Amazon.
Data Validation
Data Validation is the process of ensuring that data is present, correct & meaningful. Ensuring the quality of your data through automated validation checks is a critical step in building data pipelines at any organization.
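For example, one common automated check confirms that a target table actually contains rows after a load step. Below is a minimal sketch of such a check; the get_record_count argument is a hypothetical helper standing in for whatever query mechanism the pipeline uses.
# A minimal data quality check: fail loudly if the table is empty or missing.
def check_greater_than_zero(table, get_record_count):
    records = get_record_count(table)  # hypothetical helper supplied by the pipeline
    if records is None or records < 1:
        raise ValueError(f"Data quality check failed: {table} returned no rows")
    print(f"Data quality check passed: {table} has {records} rows")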
• Directed Acyclic Graphs (DAGs): DAGs are a special subset of graphs in which the edges between nodes have a specific direction, and no cycles exist. When we say "no cycles exist," we mean that the nodes cannot create a path back to themselves.
• Nodes: A step in the data pipeline process.
• Edges: The dependencies or relationships between nodes.
Apache Airflow
Airflow is a platform to programmatically author, schedule, and monitor workflows. Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
Scheduler orchestrates the execution of jobs on a trigger or schedule. The Scheduler chooses how to prioritize the running and execution of tasks within the system. You can learn more about the Scheduler from the official Apache Airflow documentation.
Work Queue is used by the scheduler in most Airflow installations to deliver tasks that need to be run to the Workers.
Worker processes execute the operations defined in each DAG. In most Airflow installations, workers pull from the work queue when they are ready to process a task. When a worker completes a task, it will continue to pull and process work from the queue until no work remains; when new work arrives in the queue, the worker will begin to process it.
Database saves credentials, connections, history, and configuration. The database, often referred to as the metadata database, also stores the state of all tasks in the system. Airflow components interact with the database with the Python ORM, SQLAlchemy.
Web Interface provides a control dashboard for users and maintainers. Throughout this course you will see how the web interface allows users to perform tasks such as stopping and starting DAGs, retrying failed tasks, and configuring credentials. The web interface is built using the Flask web-development microframework.
Order of Operations For an Airflow DAG
• The Airflow Scheduler starts DAGs based on time or external triggers.
• Once a DAG is started, the Scheduler looks at the steps within the DAG and determines which steps can run by looking at their dependencies.
• The Scheduler places runnable steps in the queue.
• Workers pick up those tasks and run them.
• Once the worker has finished running the step, the final status of the task is recorded and additional tasks are placed by the scheduler until all tasks are complete.
• Once all tasks have been completed, the DAG is complete.
Creating a DAG
Creating a DAG is easy. Give it a name, a description, a start date, and an interval.
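For example, a DAG with those four attributes might look like the sketch below; the name, description, and dates are illustrative, and the import paths assume an Airflow 1.x-style installation.
import datetime
from airflow import DAG

# Define the DAG with a name, description, start date, and schedule interval.
dag = DAG(
    'sample_dag',
    description='An example DAG that runs once per day',
    start_date=datetime.datetime(2019, 1, 1),
    schedule_interval='@daily')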
Creating Operators to Perform Tasks
Operators define the atomic steps of work that make up a DAG. Instantiated operators are referred to as Tasks.
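For example, instantiating a PythonOperator produces a task within the DAG. The sketch below assumes Airflow 1.x import paths and the dag object defined above.
from airflow.operators.python_operator import PythonOperator

def hello_world():
    print('Hello World')

# Instantiating the operator creates a task named 'hello_world' in the DAG.
hello_world_task = PythonOperator(
    task_id='hello_world',
    python_callable=hello_world,
    dag=dag)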
Schedules
Schedules are optional, and may be defined with cron strings or Airflow Presets. Airflow provides the following presets:
• @once - Run a DAG once and then never again
• @hourly - Run the DAG every hour
• @daily - Run the DAG every day
• @weekly - Run the DAG every week
• @monthly - Run the DAG every month
• @yearly - Run the DAG every year
• None - Only run the DAG when the user initiates it
Start Date: If your start date is in the past, Airflow will run your DAG as many times as there are schedule intervals between that start date and the current date.
End Date: Unless you specify an optional end date, Airflow will continue to run your DAGs until you disable or delete the DAG.
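For example, the schedule, start date, and end date are all passed when the DAG is defined. The sketch below uses illustrative dates and a preset; a cron string would work the same way.
import datetime
from airflow import DAG

# Runs on the first of each month from January through December 2019.
scheduled_dag = DAG(
    'scheduled_dag',
    start_date=datetime.datetime(2019, 1, 1),
    end_date=datetime.datetime(2019, 12, 31),
    schedule_interval='@monthly')  # equivalently, the cron string '0 0 1 * *'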
Operators
Operators define the atomic steps of work that make up a DAG. Airflow comes with many Operators that can perform common operations. Here are a handful of common ones:
• PythonOperator
• PostgresOperator
• RedshiftToS3Operator
• S3ToRedshiftOperator
• BashOperator
• SimpleHttpOperator
• Sensor
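For example, the BashOperator simply runs a shell command as a task. The sketch below assumes Airflow 1.x import paths and the dag object defined earlier; the command itself is only an illustration.
from airflow.operators.bash_operator import BashOperator

# Runs a shell command as a task in the DAG.
list_tmp_files = BashOperator(
    task_id='list_tmp_files',
    bash_command='ls -l /tmp',
    dag=dag)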
Task Dependencies
In Airflow DAGs:
• Nodes = Tasks
• Edges = Ordering and dependencies between tasks
Task dependencies can be described programmatically in Airflow using >> and <<
• a >> b means a comes before b
• a << b means a comes after b
hello_world_task = PythonOperator(task_id='hello_world', ...)
goodbye_world_task = PythonOperator(task_id='goodbye_world', ...)
...
# Use >> to denote that goodbye_world_task depends on hello_world_task
hello_world_task >> goodbye_world_task
Task dependencies can also be set with set_downstream and set_upstream:
• a.set_downstream(b) means a comes before b
• a.set_upstream(b) means a comes after b
hello_world_task = PythonOperator(task_id='hello_world', ...)
goodbye_world_task = PythonOperator(task_id='goodbye_world', ...)
...
hello_world_task.set_downstream(goodbye_world_task)
Connection via Airflow Hooks
Connections can be accessed in code via hooks. Hooks provide a reusable interface to external systems and databases. With hooks, you don’t have to worry about how and where to store these connection strings and secrets in your code.
Airflow comes with many Hooks that can integrate with common systems. Here are a few common ones:
• HttpHook
• PostgresHook (works with RedShift)
• MySqlHook
• SlackHook
• PrestoHook
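For example, the PostgresHook (which also works with Redshift) can be used inside a PythonOperator callable to run SQL against a configured connection. The sketch below assumes Airflow 1.x import paths, the dag object defined earlier, and a connection named 'redshift' stored in Airflow; the connection name and table are illustrative.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

def count_rows():
    # The hook looks up the 'redshift' connection stored by Airflow,
    # so no credentials appear in the code itself.
    redshift_hook = PostgresHook(postgres_conn_id='redshift')
    records = redshift_hook.get_records('SELECT COUNT(*) FROM my_table')
    print(f"my_table has {records[0][0]} rows")

count_rows_task = PythonOperator(
    task_id='count_rows',
    python_callable=count_rows,
    dag=dag)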