Data Bytes

Wednesday, August 12, 2020

Python Language Notes

Jupyter Link:

https://hub.gke.mybinder.org/user/pdeitel-pythonfullthrottle-0jjlesh2/lab

Forked Git repo

https://github.com/selvarajruban/PythonFullThrottle.git

a - Insert cell above the selected one

b - Insert cell below the selected one

dd - Delete Cell

Magic command

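%save writes selected input history lines to a .py script, for example: %save my_script 1-10. To convert a whole .ipynb notebook to a .py script, run the following from a terminal: jupyter nbconvert --to script notebook.ipynb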
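
As a rough illustration of what a notebook-to-script conversion does, here is a minimal sketch using only the standard library. It assumes the notebook is stored in the usual nbformat 4 JSON layout (a top-level "cells" list whose entries have "cell_type" and "source"). The function name notebook_to_script is made up for this note, and the sketch is not a replacement for nbconvert.

```python
import json
from pathlib import Path


def notebook_to_script(ipynb_path, py_path):
    """Write the code cells of a notebook (nbformat 4 JSON) to a .py file.

    Returns the number of code cells written.
    """
    notebook = json.loads(Path(ipynb_path).read_text(encoding="utf-8"))
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # "source" may be stored as one string or as a list of lines
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            chunks.append(source.rstrip("\n"))
    # Separate cells with one blank line and end the file with a newline
    Path(py_path).write_text("\n\n".join(chunks) + "\n", encoding="utf-8")
    return len(chunks)
```

Cells that contain IPython magics (lines starting with %) are copied as-is, so the resulting file may need manual cleanup before it runs as plain Python.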

Tuesday, August 11, 2020

Five books every data scientist should read that are not about data science

1) Incerto: This is a collection of books by Nassim Taleb, the most famous of which is ‘The Black Swan’ and the best of which, IMO, is ‘Antifragile.’ Taleb is the greatest modern thinker on risk, uncertainty and the problems with quantitative modeling. He is also a Twitter troll known for calling out people who are ‘intellectual yet idiot’ (IYI). By background, he is an immigrant derivatives trader turned mathematical philosopher. You will either love him or hate him because he consistently challenges your assumptions in all of his writing. If he writes anything, you should put it on your reading list immediately.

2) Fortune’s Formula: The story of the birth of a formula (the Kelly Criterion) during MIT’s early days that is claimed to be behind an enormous amount of financial success. You will learn about the father of information theory (Claude Shannon) and the beginnings of the card-counting shenanigans that later became famous in Ed Thorp’s ‘Beat the Dealer.’ Thorp is now considered the godfather of quantitative hedge funds. Most importantly, this book shows how a good model cannot be ignored forever but bad ones can burn you. The story is also one of the first times in history where computer science and mathematics teamed up to solve a real-world problem (it just happens to be for gambling). This story foreshadows the data science industry 60 years before its creation.

3) Chaos - Making a New Science: The detailed history of the youngest of sciences. Both a history of chaos and an accessible review of the topic. This book will give the reader an understanding of the limitations of our ability to model the real world. Many of the deep learning models being developed and deployed today cannot be genuinely understood due to the nature of non-linear processes. This book will help you comprehend these limitations. Also, the comprehensive review of the life and work of Benoit Mandelbrot alone makes this a must-read for any data scientist. James Gleick is a fantastic author and has many other excellent books you can add to your reading list.

4) Dark Pools: The story of a programmer who changed stock market trading forever. Today prediction models are deployed in the world of high-frequency trading, where decisions are made at nanosecond speeds. This book walks through the creation of this hidden but powerful ecosystem. The fantastic thing about this story is that it illuminates how a great many problems can be solved when you know some code. It also demonstrates that creating real value means doing something truly innovative and not relying on existing assumptions. Sometimes you have to be a little crazy to solve a hard problem.

5) The Theory That Would Not Die: The history of Bayes' formula and Bayesian statistics, as well as its rival, frequentist statistics. Both a history of statistics and a plain-language review of critical technical topics make this book vital. You will learn about some of the greatest minds in history, like Pierre-Simon Laplace and R.A. Fisher, along with how their philosophies shaped the world’s approach to data for centuries.

These five books, while not exhaustive, will help to build a philosophical foundation for a data scientist working on real-world problems. Do not make the same mistakes the quants did a decade ago. Seek to understand techniques and models philosophically, not just mechanically, and our profession will become invaluable.

Thursday, June 20, 2019

Machine Learning for Twitter sentiment analysis

Main Post : Click this link in order to view the details on data loading for machine learning

Abstract :

This is a mini-project that a small team is working on together to analyze opinion in Twitter data about some xyz company. This blog post covers the part of the project where machine learning is used to derive the opinion analysis. Please click the link above to view the main post, which covers the entire project in detail.

Description :

The best way to get started with machine learning is to work with a simple model. Let's begin the sentiment analysis with a simple model and move to more advanced models over time.
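
To make "a simple model" concrete, here is a minimal sketch of one possible starting point: a bag-of-words Naive Bayes classifier with add-one smoothing, written in plain Python. The post does not say which model the project uses, so this choice is an assumption, and the tweets in TOY_TWEETS are invented purely for illustration.

```python
import math
from collections import Counter

# Invented toy data, only to show the mechanics (not project data)
TOY_TWEETS = [
    ("love the new phone great battery", "positive"),
    ("great service very happy", "positive"),
    ("terrible support very slow", "negative"),
    ("hate the update battery is terrible", "negative"),
]


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stop words)."""
    return text.lower().split()


def train(labelled_tweets):
    """Count words per label from (text, label) pairs."""
    word_counts = {}          # label -> Counter of word frequencies
    label_counts = Counter()  # label -> number of tweets
    for text, label in labelled_tweets:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                # add-one (Laplace) smoothing avoids log(0)
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TOY_TWEETS)
print(predict(model, "great phone"))
print(predict(model, "terrible update"))
```

A model this small is mainly useful as a baseline: once real labelled tweets are loaded, the same train and predict steps give a reference accuracy that more advanced models have to beat.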

Monday, April 29, 2019

Kafka Quick Start Guide

Installing Kafka on Linux (Ubuntu) machines. This simple step-by-step guide explains how to install Kafka on a local Linux machine and run a simple end-to-end test of the message flow.

----------------------------------------
1. Create the kafka user on the server
----------------------------------------
1. Create a Linux user for the Kafka setup. As root, create a user named kafka
useradd kafka -m

2. Set a password for the user kafka
passwd kafka

3. Add the user to the sudo group so it has the privileges required to install the Kafka binaries and their dependencies.
adduser kafka sudo

----------------------------------------
2. Install Java
----------------------------------------

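1. Refresh the package index
sudo apt-get update

2. Install the distribution's default Java runtime (Java 8 on Ubuntu 16.04; newer releases install a later version, so install the openjdk-8-jre package instead if Java 8 is required)
sudo apt-get install default-jre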

3. Verify Java version
java -version

----------------------------------------
3. Install ZooKeeper
----------------------------------------

1. Install ZooKeeper from the repository
sudo apt-get install zookeeperd

2. Verify that ZooKeeper is listening on its default port (2181)
telnet localhost 2181
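
If telnet is not installed, the same reachability check can be sketched in a few lines of Python. This is only a minimal sketch: it confirms that something accepts TCP connections on the port, not that the service behind it is healthy. The helper name port_is_open is made up for this guide.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution errors
        return False


if __name__ == "__main__":
    # 2181 is the ZooKeeper port used throughout this guide
    print("ZooKeeper reachable:", port_is_open("localhost", 2181))
```

The same helper can be pointed at port 9092 later on to check that the Kafka broker has started.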

---------------------------------------------
4. Downloading and installing Kafka binaries
---------------------------------------------
1. Create a Downloads directory in the home directory
mkdir -p ~/Downloads

2. Download Kafka from the Apache repository to the local machine (superseded releases such as 0.11.0.1 are moved to archive.apache.org/dist/kafka/, so use that host if the mirror below no longer serves the file)
wget "http://www-eu.apache.org/dist/kafka/0.11.0.1/kafka_2.11-0.11.0.1.tgz" -O ~/Downloads/kafka.tgz

3. Create the kafka directory and extract the downloaded tar file into it
mkdir -p ~/kafka && cd ~/kafka

tar -xvzf ~/Downloads/kafka.tgz --strip-components=1

---------------------------------------------
5. Configuring Kafka server
---------------------------------------------

vi ~/kafka/config/server.properties

To allow topics to be deleted on the Kafka server, add the line below to the server.properties file

delete.topic.enable = true
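
For scripted setups, the same edit can be made without opening an editor. The following is a minimal sketch, assuming the file uses simple key=value lines; Java .properties files also allow other separators and line continuations, which this sketch does not handle. The helper name set_property is made up for this guide.

```python
from pathlib import Path


def set_property(path, key, value):
    """Set key=value in a .properties file.

    Replaces an existing uncommented entry for key, otherwise appends one.
    Returns True if an existing entry was replaced.
    """
    file = Path(path)
    lines = file.read_text(encoding="utf-8").splitlines() if file.exists() else []
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith(("#", "!")):
            continue  # leave comment lines untouched
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return replaced


# Example call (path taken from the steps above):
# set_property(Path.home() / "kafka/config/server.properties",
#              "delete.topic.enable", "true")
```

Commented-out entries are deliberately left in place and a new line is appended, so the file stays close to what a manual edit would produce.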

---------------------------------------------
6. Start the Kafka server
---------------------------------------------
nohup ~/kafka/bin/kafka-server-start.sh ~/kafka/config/server.properties > ~/kafka/kafka.log 2>&1 &

---------------------------------------------
7. Create a topic in Kafka server
---------------------------------------------
Create the topic test, then list the topics to confirm that it exists

~/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

~/kafka/bin/kafka-topics.sh --list --zookeeper localhost:2181

---------------------------------------------
8. Start the Kafka producer console
---------------------------------------------

Start the producer console and send messages to the topic test. In parallel, you can start the consumer in another terminal and watch the messages flow in real time.

~/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a test message

---------------------------------------------
9. Start the Kafka consumer console
---------------------------------------------
Start the consumer console and see the messages from the topic test displayed in the terminal

~/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is a test message