Introduction 1
About This Book 1
Foolish Assumptions 3
Icons Used in This Book 4
Beyond the Book 4
Where to Go from Here 5
Book 1: Defining Data Science 7
Chapter 1: Considering the History and Uses of Data Science 9
Considering the Elements of Data Science 10
Considering the emergence of data science 10
Outlining the core competencies of a data scientist 11
Linking data science, big data, and AI 12
Understanding the role of programming 12
Defining the Role of Data in the World 13
Enticing people to buy products 13
Keeping people safer 14
Creating new technologies 15
Performing analysis for research 16
Providing art and entertainment 17
Making life more interesting in other ways 18
Creating the Data Science Pipeline 18
Preparing the data 18
Performing exploratory data analysis 18
Learning from data 19
Visualizing 19
Obtaining insights and data products 19
Comparing Different Languages Used for Data Science 20
Obtaining an overview of data science languages 20
Defining the pros and cons of using Python 22
Defining the pros and cons of using R 23
Learning to Perform Data Science Tasks Fast 25
Loading data 26
Training a model 26
Viewing a result 26
Chapter 2: Placing Data Science within the Realm of AI 29
Seeing the Data to Data Science Relationship 30
Considering the data architecture 30
Acquiring data from various sources 31
Performing data analysis 32
Archiving the data 33
Defining the Levels of AI 33
Beginning with AI 34
Advancing to machine learning 39
Getting detailed with deep learning 43
Creating a Pipeline from Data to AI 47
Considering the desired output 47
Defining a data architecture 47
Combining various data sources 47
Checking for errors and fixing them 48
Performing the analysis 48
Validating the result 49
Enhancing application performance 49
Chapter 3: Creating a Data Science Lab of Your Own 51
Considering the Analysis Platform Options 52
Using a desktop system 53
Working with an online IDE 53
Considering the need for a GPU 54
Choosing a Development Language 56
Obtaining and Using Python 58
Working with Python in this book 58
Obtaining and installing Anaconda for Python 59
Defining a Python code repository 64
Working with Python using Google Colaboratory 69
Defining the limits of using Azure Notebooks with Python and R 71
Obtaining and Using R 72
Obtaining and installing Anaconda for R 72
Starting the R environment 73
Defining an R code repository 75
Presenting Frameworks 76
Defining the differences 76
Explaining the popularity of frameworks 77
Choosing a particular library 79
Accessing the Downloadable Code 80
Chapter 4: Considering Additional Packages and Libraries You Might Want 81
Considering the Uses for Third-Party Code 82
Obtaining Useful Python Packages 83
Accessing scientific tools using SciPy 84
Performing fundamental scientific computing using NumPy 85
Performing data analysis using pandas 85
Implementing machine learning using Scikit-learn 86
Going for deep learning with Keras and TensorFlow 86
Plotting the data using matplotlib 87
Creating graphs with NetworkX 88
Parsing HTML documents using Beautiful Soup 88
Locating Useful R Libraries 89
Using your Python code in R with reticulate 89
Conducting advanced training using caret 90
Performing machine learning tasks using mlr 90
Visualizing data using ggplot2 91
Enhancing ggplot2 using esquisse 91
Creating graphs with igraph 91
Parsing HTML documents using rvest 92
Wrangling dates using lubridate 92
Making big data simpler using dplyr and purrr 93
Chapter 5: Leveraging a Deep Learning Framework 95
Understanding Deep Learning Framework Usage 96
Working with Low-End Frameworks 97
Chainer 97
PyTorch 98
MXNet 98
Microsoft Cognitive Toolkit/CNTK 99
Understanding TensorFlow 100
Grasping why TensorFlow is so good 101
Making TensorFlow easier by using TFLearn 102
Using Keras as the best simplifier 102
Getting your copy of TensorFlow and Keras 103
Fixing the C++ build tools error in Windows 106
Accessing your new environment in Notebook 108
Book 2: Interacting with Data Storage 109
Chapter 1: Manipulating Raw Data 111
Defining the Data Sources 112
Obtaining data locally 112
Using online data sources 117
Employing dynamic data sources 121
Considering other kinds of data sources 123
Considering the Data Forms 124
Working with pure text 124
Accessing formatted text 125
Deciphering binary data 126
Understanding the Need for Data Reliability 128
Chapter 2: Using Functional Programming Techniques 131
Defining Functional Programming 132
Differences with other programming paradigms 132
Understanding its goals 133
Understanding Pure and Impure Languages 134
Using the pure approach 134
Using the impure approach 134
Comparing the Functional Paradigm 135
Imperative 135
Procedural 136
Object-oriented 136
Declarative 136
Using Python for Functional Programming Needs 137
Understanding How Functional Data Works 138
Working with immutable data 139
Considering the role of state 139
Eliminating side effects 140
Passing by reference versus by value 140
Working with Lists and Strings 142
Creating lists 144
Evaluating lists 144
Performing common list manipulations 146
Understanding the Dict and Set alternatives 147
Considering the use of strings 148
Employing Pattern Matching 150
Looking for patterns in data 150
Understanding regular expressions 152
Using pattern matching in analysis 155
Working with pattern matching 156
Working with Recursion 159
Performing tasks more than once 159
Understanding recursion 161
Using recursion on lists 162
Considering advanced recursive tasks 163
Passing functions instead of variables 164
Performing Functional Data Manipulation 165
Slicing and dicing 166
Mapping your data 167
Filtering data 168
Organizing data 169
Chapter 3: Working with Scalars, Vectors, and Matrices 171
Considering the Data Forms 172
Defining Data Type through Scalars 173
Creating Organized Data with Vectors 174
Defining a vector 175
Creating vectors of a specific type 175
Performing math on vectors 176
Performing logical and comparison tasks on vectors 176
Multiplying vectors 177
Creating and Using Matrices 178
Creating a matrix 178
Creating matrices of a specific type 179
Using the matrix class 181
Performing matrix multiplication 181
Executing advanced matrix operations 183
Extending Analysis to Tensors 185
Using Vectorization Effectively 186
Selecting and Shaping Data 187
Slicing rows 188
Slicing columns 188
Dicing 189
Concatenating 189
Aggregating 194
Working with Trees 195
Understanding the basics of trees 195
Building a tree 196
Representing Relations in a Graph 198
Going beyond trees 198
Arranging graphs 199
Chapter 4: Accessing Data in Files 201
Understanding Flat File Data Sources 202
Working with Positional Data Files 203
Accessing Data in CSV Files 205
Working with a simple CSV file 205
Making use of header information 208
Moving On to XML Files 209
Working with a simple XML file 209
Parsing XML 211
Using XPath for data extraction 212
Considering Other Flat-File Data Sources 214
Working with Nontext Data 215
Downloading Online Datasets 218
Working with package datasets 218
Using public domain datasets 219
Chapter 5: Working with a Relational DBMS 223
Considering RDBMS Issues 224
Defining the use of tables 225
Understanding keys and indexes 226
Using local versus online databases 227
Working in read-only mode 228
Accessing the RDBMS Data 228
Using the SQL language 229
Relying on scripts 231
Relying on views 231
Relying on functions 232
Creating a Dataset 233
Combining data from multiple tables 233
Ensuring data completeness 234
Slicing and dicing the data as needed 234
Mixing RDBMS Products 234
Chapter 6: Working with a NoSQL DBMS 237
Considering the Ramifications of Hierarchical Data 238
Understanding hierarchical organization 238
Developing strategies for freeform data 239
Performing an analysis 240
Working around dangling data 241
Accessing the Data 243
Creating a picture of the data form 243
Employing the correct transiting strategy 244
Ordering the data 247
Interacting with Data from NoSQL Databases 248
Working with Dictionaries 249
Developing Datasets from Hierarchical Data 250
Processing Hierarchical Data into Other Forms 251
Book 3: Manipulating Data Using Basic Algorithms 253
Chapter 1: Working with Linear Regression 255
Considering the History of Linear Regression 256
Combining Variables 257
Working through simple linear regression 257
Advancing to multiple linear regression 260
Considering which question to ask 262
Reducing independent variable complexity 263
Manipulating Categorical Variables 265
Creating categorical variables 266
Renaming levels 267
Combining levels 268
Using Linear Regression to Guess Numbers 269
Defining the family of linear models 270
Using more variables in a larger dataset 271
Understanding variable transformations 274
Doing variable transformations 275
Creating interactions between variables 277
Understanding limitations and problems 282
Learning One Example at a Time 283
Using Gradient Descent 283
Implementing Stochastic Gradient Descent 283
Considering the effects of regularization 287
Chapter 2: Moving Forward with Logistic Regression 289
Considering the History of Logistic Regression 290
Differentiating between Linear and Logistic Regression 291
Considering the model 291
Defining the logistic function 292
Understanding the problems that logistic regression solves 294
Fitting the curve 295
Considering a pass/fail example 296
Using Logistic Regression to Guess Classes 297
Applying logistic regression 297
Considering when classes are more 298
Defining logistic regression performance 300
Switching to Probabilities 301
Specifying a binary response 301
Transforming numeric estimates into probabilities 302
Working through Multiclass Regression 305
Understanding multiclass regression 305
Developing a multiclass regression implementation 306
Chapter 3: Predicting Outcomes Using Bayes 309
Understanding Bayes' Theorem 310
Delving into Bayes' history 310
Considering the basic theorem 312
Using Naïve Bayes for Predictions 313
Finding out that Naïve Bayes isn't so naïve 314
Predicting text classifications 315
Getting an overview of Bayesian inference 318
Working with Networked Bayes 324
Considering the network types and uses 324
Understanding Directed Acyclic Graphs (DAGs) 327
Employing networked Bayes in predictions 328
Deciding between automated and guided learning 332
Considering the Use of Bayesian Linear Regression 332
Considering the Use of Bayesian Logistic Regression 333
Chapter 4: Learning with K-Nearest Neighbors 335
Considering the History of K-Nearest Neighbors 336
Learning Lazily with K-Nearest Neighbors 337
Understanding the basis of KNN 337
Predicting after observing neighbors 338
Choosing the k parameter wisely 341
Leveraging the Correct k Parameter 342
Understanding the k parameter 342
Experimenting with a flexible algorithm 343
Implementing KNN Regression 345
Implementing KNN Classification 347
Book 4: Performing Advanced Data Manipulation 351
Chapter 1: Leveraging Ensembles of Learners 353
Leveraging Decision Trees 354
Growing a forest of trees 356
Seeing Random Forests in action 358
Understanding the importance measures 360
Configuring your system for importance measures with Python 361
Seeing importance measures in action 361
Working with Almost Random Guesses 364
Understanding the premise 365
Bagging predictors with AdaBoost 366
Meeting Again with Gradient Descent 369
Understanding the GBM difference 369
Seeing GBM in action 371
Averaging Different Predictors 372
Chapter 2: Building Deep Learning Models 373
Discovering the Incredible Perceptron 374
Understanding perceptron functionality 375
Touching the nonseparability limit 376
Hitting Complexity with Neural Networks 378
Considering the neuron 379
Pushing data with feed-forward 381
Defining hidden layers 383
Executing operations 384
Considering the details of data movement through the neural network 386
Using backpropagation to adjust learning 387
Understanding More about Neural Networks 390
Getting an overview of the neural network process 391
Defining the basic architecture 391
Documenting the essential modules 393
Solving a simple problem 396
Looking Under the Hood of Neural Networks 399
Choosing the right activation function 399
Relying on a smart optimizer 401
Setting a working learning rate 402
Explaining Deep Learning Differences with Other Forms of AI 402
Adding more layers 403
Changing the activations 405
Adding regularization by dropout 406
Using online learning 407
Transferring learning 407
Learning end to end 408
Chapter 3: Recognizing Images with CNNs 409
Beginning with Simple Image Recognition 410
Considering the ramifications of sight 410
Working with a set of images 411
Extracting visual features 417
Recognizing faces using Eigenfaces 419
Classifying images 423
Understanding CNN Image Basics 427
Moving to CNNs with Character Recognition 429
Accessing the dataset 430
Reshaping the dataset 431
Encoding the categories 432
Defining the model 432
Using the model 433
Explaining How Convolutions Work 435
Understanding convolutions 435
Simplifying the use of pooling 439
Describing the LeNet architecture 440
Detecting Edges and Shapes from Images 446
Visualizing convolutions 447
Unveiling successful architectures 449
Discussing transfer learning 450
Chapter 4: Processing Text and Other Sequences 453
Introducing Natural Language Processing 454
Defining the human perspective as it relates to data science 454
Considering the computer perspective as it relates to data science 455
Understanding How Machines Read 456
Creating a corpus 457
Performing feature extraction 457
Understanding the BoW 458
Processing and enhancing text 459
Maintaining order using n-grams 461
Stemming and removing stop words 462
Scraping textual datasets from the web 465
Handling problems with raw text 470
Storing processed text data in sparse matrices 473
Understanding Semantics Using Word Embeddings 478
Using Scoring and Classification 482
Performing classification tasks 482
Analyzing reviews from e-commerce 485
Book 5: Performing Data-Related Tasks 491
Chapter 1: Making Recommendations 493
Realizing the Recommendation Revolution 494
Downloading Rating Data 495
Navigating through anonymous web data 496
Encountering the limits of rating data 499
Leveraging SVD 506
Considering the origins of SVD 506
Understanding the SVD connection 508
Chapter 2: Performing Complex Classifications 509
Using Image Classification Challenges 510
Delving into ImageNet and COCO 511
Learning the magic of data augmentation 513
Distinguishing Traffic Signs 516
Preparing the image data 517
Running a classification task 520
Chapter 3: Identifying Objects 525
Distinguishing Classification Tasks 526
Understanding the problem 526
Performing localization 527
Classifying multiple objects 528
Annotating multiple objects in images 529
Segmenting images 530
Perceiving Objects in Their Surroundings 531
Considering vision needs in self-driving cars 531
Discovering how RetinaNet works 532
Using the Keras-RetinaNet code 534
Overcoming Adversarial Attacks on Deep Learning Applications 538
Tricking pixels 539
Hacking with stickers and other artifacts 541
Chapter 4: Analyzing Music and Video 543
Learning to Imitate Art and Life 544
Transferring an artistic style 545
Reducing the problem to statistics 546
Understanding that deep learning doesn't create 548
Mimicking an Artist 548
Defining a new piece based on a single artist 549
Combining styles to create new art 550
Visualizing how neural networks dream 551
Using a network to compose music 551
Other creative avenues 552
Moving toward GANs 553
Finding the key in the competition 554
Considering a growing field 556
Chapter 5: Considering Other Task Types 559
Processing Language in Texts 560
Considering the processing methodologies 560
Defining understanding as tokenization 561
Putting all the documents into a bag 562
Using AI for sentiment analysis 566
Processing Time Series 574
Defining sequences of events 574
Performing a prediction using LSTM 575
Chapter 6: Developing Impressive Charts and Plots 579
Starting a Graph, Chart, or Plot 580
Understanding the differences between graphs, charts, and plots 580
Considering the graph, chart, and plot types 582
Defining the plot 583
Drawing multiple lines 584
Drawing multiple plots 584
Saving your work 586
Setting the Axis, Ticks, and Grids 587
Getting the axis 587
Formatting the ticks 590
Adding grids 590
Defining the Line Appearance 591
Working with line styles 592
Adding markers 593
Using Labels, Annotations, and Legends 594
Adding labels 595
Annotating the chart 596
Creating a legend 598
Creating Scatterplots 599
Depicting groups 599
Showing correlations 600
Plotting Time Series 603
Representing time on axes 604
Plotting trends over time 605
Plotting Geographical Data 608
Getting the toolkit 608
Drawing the map 609
Plotting the data 613
Visualizing Graphs 615
Understanding the adjacency matrix 615
Using NetworkX basics 615
Book 6: Diagnosing and Fixing Errors 619
Chapter 1: Locating Errors in Your Data 621
Considering the Types of Data Errors 622
Obtaining the Required Data 624
Considering the data sources 624
Obtaining reliable data 625
Making human input more reliable 626
Using automated data collection 628
Validating Your Data 629
Figuring out what's in your data 629
Removing duplicates 631
Creating a data map and a data plan 632
Manicuring the Data 634
Dealing with missing data 634
Considering data misalignments 639
Separating out useful data 640
Dealing with Dates in Your Data 640
Formatting date and time values 641
Using the right time transformation 641
Chapter 2: Considering Outrageous Outcomes 643
Deciding What Outrageous Means 644
Considering the Five Mistruths in Data 645
Commission 645
Omission 646
Perspective 646
Bias 647
Frame-of-reference 648
Considering Detection of Outliers 649
Understanding outlier basics 649
Finding more things that can go wrong 651
Understanding anomalies and novel data 651
Examining a Simple Univariate Method 653
Using the pandas package 653
Leveraging the Gaussian distribution 655
Making assumptions and checking out 656
Developing a Multivariate Approach 657
Using principal component analysis 658
Using cluster analysis 659
Automating outliers detection with Isolation Forests 661
Chapter 3: Dealing with Model Overfitting and Underfitting 663
Understanding the Causes 664
Considering the problem 664
Looking at underfitting 665
Looking at overfitting 666
Plotting learning curves for insights 668
Determining the Sources of Overfitting and Underfitting 670
Understanding bias and variance 671
Having insufficient data 671
Being fooled by data leakage 672
Guessing the Right Features 672
Selecting variables like a pro 673
Using nonlinear transformations 676
Regularizing linear models 684
Chapter 4: Obtaining the Correct Output Presentation 689
Considering the Meaning of Correct 690
Determining a Presentation Type 691
Considering the audience 691
Defining a depth of detail 692
Ensuring that the data is consistent with audience needs 693
Understanding timeliness 693
Choosing the Right Graph 694
Telling a story with your graphs 694
Showing parts of a whole with pie charts 694
Creating comparisons with bar charts 695
Showing distributions using histograms 697
Depicting groups using boxplots 699
Defining a data flow using line graphs 700
Seeing data patterns using scatterplots 701
Working with External Data 702
Embedding plots and other images 703
Loading examples from online sites 703
Obtaining online graphics and multimedia 704
Chapter 5: Developing Consistent Strategies 707
Standardizing Data Collection Techniques 707
Using Reliable Sources 709
Verifying Dynamic Data Sources 711
Considering the problem 712
Analyzing streams with the right recipe 714
Looking for New Data Collection Trends 715
Weeding Old Data 716
Considering the Need for Randomness 717
Considering why randomization is needed 718
Understanding how probability works 718
Index 721