AI Geek Programmer – https://aigeekprogrammer.com/ – Machine learning / computer vision engineer and manager

Welcome tomorrow – how AI will shape the world by 2032

When I started this blog in 2019, I allowed myself to be a little controversial by writing “AI will change the world more than the industrial revolution.” Of course, prediction is very difficult, especially if it’s about the future (Niels Bohr), but recent developments in machine learning have made me think about how AI will change the world in the next 10 years. And it will change quite a lot – you can take that to the bank. After all, 10 years is a long time, especially if technology is on an exponential growth path.

It is worth noting at this point that there is a potential trap – it is very easy to be overly optimistic in predictions. Just look at movies or articles from 30 years ago predicting what the world would look like in the 2020s. They were full of predictions about robots, flying vehicles and the colonization of the Moon. None of this came true. It is therefore a good idea to take my predictions with a grain of salt.

Before we come back to the future, let’s first take a look at some important AI developments in 2022. Machine learning is already such a wide and dynamically developing field that it is basically impossible to track all valuable projects. By important developments I mean the following three projects that stood out from the crowd for me: Stable Diffusion, GitHub Copilot and ChatGPT, the last of which was made available only in the final days of November 2022. This post is not about any of these projects. However, I’d like to take a quick look at each of them. If you haven’t seen them in action yet, it’s worth taking a closer look – the future is knocking on the door.

Stable Diffusion allows you to generate high-quality images from text. It was built by several organizations, with the leading roles played by CompVis – the Computer Vision & Learning research group at the Ludwig Maximilian University of Munich – and Stability AI. Below are some impressive samples from https://stability.ai/blog/stable-diffusion-v2-release. Text-to-image models are nothing new, but Stable Diffusion released both the code and the model parameters as open source. This and similar models have a chance to revolutionize creative work. BTW: there are already models that generate videos.

Example from https://stability.ai/blog/stable-diffusion-v2-release

GitHub Copilot is an AI tool aimed at software developers. In a nutshell: you start writing code or indicate what you want to create, and Copilot prompts you with a proposal for the rest of the code. At the moment, it’s like a more interactive, more intelligent and easier-to-use Stack Overflow, but obviously without people commenting and engaging in discussions. It takes some time to get used to, but it seems to have great potential to increase efficiency. Of course, there are some controversies: copyright questions, how much the generated code should be checked and supervised, the fact that some people do not feel comfortable constantly inspecting code that is not their own, etc. There are also concerns about whether such tools will take programmers’ jobs. I’m “from the industry”, so this is where the topic starts to get a bit awkward ;-). So let’s move on.

An example from the GitHub Copilot website at https://github.com/features/copilot

ChatGPT is so new that it doesn’t even have a Wikipedia page as of this writing. It is worth mentioning that this is a model based on GPT-3.5, but with an interface optimized for conversation. According to the authors: “The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.” ChatGPT went viral pretty fast, especially on Twitter. In a nutshell: it is a conversational model, which means you can talk to it about various topics, ask it to write a short essay, enter into a conversation about difficult existential issues, etc. There are plenty of interesting examples of model interactions on Twitter. Some of them are very funny, some are very thought-provoking. A large part of them, including mine below, are not conversations, but rather requests to generate a text that meets specific, sometimes strange requirements.

You can try the chat by registering at https://openai.com/blog/chatgpt/. There are millions of ways to play with the model. For example, I asked it to write a short story about a gecko who went to Mars, using a private acquaintance with Elon Musk.

Generated on https://chat.openai.com/chat BTW: 96 words, lol

Do you want to write an essay in English? No problem! Write down ideas / instructions / expectations …

You have just got back from holidays in the UK. After returning, you realized that you took, by mistake, an item belonging to a friend from Denmark, with whom you rented a room.
Write a letter in which:
• Explain what the mistake is and how it happened.
• Describe the mistakenly taken item, giving at least two of its characteristics.
• Ask your friend to check whether they have taken a similar item belonging to you, and specify the feature that makes the two items different.
• Promise to return the wrongly taken item and ask for the same.
Remember to keep the proper form and style of the letter. Do not include any addresses. Sign as XYZ. The length of the letter should be between 120 and 150 words.

… and ask ChatGPT to write a letter according to these guidelines. Here is the result:

Generated on https://chat.openai.com/chat

My son (who attends a Polish high school) and I wondered for a moment how much we could charge other students at his school for writing their English homework, but we quickly came to the conclusion that the news would spread quickly and the business would collapse. Plus, it could be received badly by the school officials. 😉

Okay, we know what’s on the table today, so let’s go on a journey through time…

Due to some stupid experiments with space-time that you conduct in the garage on weekend evenings, you wake up unexpectedly in the year 2032. You nervously type google.com on your phone to check what’s going on in the world. “Hmm. We’re having trouble finding that site.” – WHAT?!? Has Google gone out of business?

Just kidding! Everything will be fine with Google. They have great AI teams and they are surely working on a new search model – one that can be monetized as effectively as the current one. Theoretically, however, if Google does little about the new search paradigm, then someone in 2032 who wants to find out who became the president of the United States in 2028, and why, will basically be able to ask ChatGPT in its version 7, or whatever version is current then – just as we can already ask the current version of ChatGPT about the 2016 elections.

Generated on https://chat.openai.com/chat

But seriously, the way we search for information will be completely different in 2032 and will be mostly AI-based, probably with a conversational interface.

The way we solve complex problems in virtually all fields of science will also change significantly. Currently, the commonly accepted scheme is to create a team of high-class specialists in a given field who build on or use the theoretical foundations of a given science to create a new artifact of the physical world: a new drug, a new chemical substance, or a new device. At the beginning of the next decade, we should see a shift in this pattern towards first building an AI model that is able to propose a theoretical or practical solution to a given problem, with a team of specialists only then taking over to verify and implement the idea in practice.

The use of AI advisors will be widespread. Our ChatGPT is already able to solve problems of various kinds (or at least try to – see below). Stable Diffusion can inspire us with an interesting idea for a graphic design. And GitHub Copilot can suggest code. Below, ChatGPT made a fairly obvious and simple mistake. But this is a linguistic-conversational model, not a mathematical one. What’s most impressive is the iterative approach to problem-solving.

Generated on https://chat.openai.com/chat

In the coming years, specialized models will be created that will be able to support us in solving difficult problems. The simpler, cheaper or free ones (e.g. simple mathematical, language and history models) will limit the amount of school homework. Since a teenager will be able to solve a set of equations or generate an essay on a mobile phone, there will be no point in continuing to assign homework. This will eventually force a change in the way children and young people are taught. There will be a departure from the model based on memorizing information, often useless for the child at a given moment, towards a model that promotes the ability to find information and use it to solve a given problem. Obviously, there are already countries that have reformed education in this direction. Hopefully, this will become more common.

The wave of specialized models (so-called Narrow AI) will also affect other industries, often less technical than IT, physics, chemistry or biology. I can imagine an AI robot analyzing the legal situation in a given area in 2032 and advising a professional lawyer on the approach to some criminal case. In countries where access to health care is limited, the use of AI medical services will be common. This may seem unacceptable to the people of the Western world today, but people in many regions of our planet simply may not have the physical opportunity to consult a dermatologist or psychologist directly, or simply cannot afford it. We are currently seeing a similar shift from traditional finance to next-generation services in countries that are experiencing financial repression for various reasons. Hyperinflation, oppressive governments and lack of simple access to banking make people naturally turn to Bitcoin payments. Life abhors a vacuum, so if there is no access to a doctor but there is access to the Internet, you know what will happen.

Generated on https://chat.openai.com/chat

Obviously, the AI revolution is not going to spare the IT industry itself. Are we sawing off the branch we are comfortably sitting on? The general pattern of how we develop software has changed little in the last 20 years. Business requirements still need to be agreed with the sponsor/client, then translated into technical requirements, the architecture needs to be designed, and only then may we start coding. Code is the source of truth: truth about how the system works and truth about how good a programmer is. Can you code well? You will live well! Obviously, programming languages have changed, thousands of advanced tools have been created, agile has dominated the style of work, but the general methodology has remained unchanged. In addition, as of 2022 we have an employee’s market, the demand for IT services remains consistently high and almost everyone is doing well. What can go wrong? 😉

Currently, copilot-type solutions act more like a local Stack Overflow. However, whether we like it or not, within a few years various code assistants will become more and more advanced. Coding will gradually evolve from typing almost every character, interrupted by periodic copy-pasting, towards describing the expected result in comments and analyzing the copilot’s proposed solution. A very good knowledge of the language will still be necessary, as will understanding and manually programming the information flow at a higher level, but the effectiveness of creating simple functionalities, be it business back-end or front-end, will be much higher. AI-based QA and debugging tools will also be much more advanced. Let’s remember that in 2032, a conversation like the one below is going to be much more effective, faster and more precise.

Source: https://openai.com/blog/chatgpt/

“It’s 2022, I’m a programmer, what do I do now? How will I be able to support my family?” I think that fears that AI will take away a significant amount of work from us – IT specialists – are a bit exaggerated or premature. Nevertheless, as a programmer, I intend to watch these tools with curiosity and try to include them in my skillset. And as a manager, I’m going to watch these tools with interest and encourage developers to gradually incorporate them into their skillset. 😉

Back to the future. A very large country (let’s call it C) that wanted to regain control of a certain small island, but feared the military domination of another very large country (let’s call it A), is in the process of carrying out a major reform of its army. This reform was initiated after the detailed notes C took during Russia’s invasion of Ukraine in 2022. That war ended with Russia’s defeat and the fall of the regime, but it also drew everyone’s attention to the efficiency of small semi-autonomous combat units, which were called drones at that time. Country C, whether we Westerners like it or not, spent the entire 2020s building more and more AI-supported military systems. Now – in 2032 – C is believed to be militarily capable of gaining control over the small island, even though the island is still supported by A and by other Western countries.

The situation described above took the Western establishment somewhat by surprise. Although country A also has very advanced military solutions, even more advanced than C’s, the distance separating the small island from A means its assistance can be very limited. Europe basically doesn’t matter at that point. The social protests that emerged in Europe in the late 2020s – driven by concerns about AI taking jobs and by the huge demand of AI for the energy needed to train ever larger models – drew the attention of European populist politicians looking to gain popularity. By invoking the social good and ecology, they won the support of a large part of European voters and began to inhibit the development of AI in Europe.

Have you read Dune? For me it’s one of the best sci-fi novels. The brilliant Herbert in 1963 (sic!) described the profession of Mentat – a human trained to be the equivalent of a computer. Humans were trained to be Mentats because, after the war between humans and machines, it was forbidden by law to create machines in the image of the human mind. I hope that Frank was just a genius and not some kind of prophet.

Big changes await in the field of creativity. The use of generative models that create images, video and music will become commonplace. Personally, I don’t think generative models are a big threat to artists and creators. Rather, they will support their natural predispositions and become an additional, very important tool. However, they can lead to less demand for human labor in areas that do not require super high quality: graphic design for simple websites, blog images instead of photos taken by photographers, graphic elements for indie games. I see no obstacles to all of them being generated by AI models, under the supervision of a graphic designer. I would also bet that AI will increase accessibility for creative people who currently do not have sufficient manual or technical skills. Something like rap music, which opened music up to people who can’t necessarily sing well but can create interesting sampled beats and write good lyrics. Generated art will be a recognized and accepted branch of art. And works made traditionally, by hand, will be perceived as more valuable.

Example from https://stability.ai/blog/stable-diffusion-v2-release

AI solutions will approach the Holy Grail of artificial intelligence in the next decade – Artificial General Intelligence (AGI). Currently, most models operate in a fairly narrow field: language models, image recognition, generative models, expert models – this is the so-called Narrow AI. The next natural step is the bottom-up strategy, i.e. an attempt to create even more elaborate solutions that combine the functionality of many narrow models and are thus able to perform much more complex, multi-domain tasks. But will this allow them to be called AGI?

The industry is still struggling with a strict definition of AGI. There are at least a few proposed tests, like the Turing Test, the Coffee Test, the Robot College Student Test and the Employment Test, but there is no consensus on the matter. With the emergence of more and more advanced solutions, more formal definitions will probably emerge and big players will enter the final phase of the race to be the very first organization to create AGI.

I’m also wondering whether the digital realm alone is enough for us to create AGI. Won’t training have to go beyond compressing the internet and reach into the physical world? Perhaps models that are able to smell, recognize tastes and also touch will naturally gain new skills and thus expand their artificial consciousness? Perhaps in the 2030s we will see the first restaurant with meals prepared according to AI recipes. Will it get a Michelin star?

Does any of this make sense? It’s time for a short summary…

The pace of change and the complexity of processes in virtually every field is so high that predicting what will happen three quarters from now is very difficult. Just look at economics – a science with nearly 250 years of history, counting from Adam Smith – with its current (December 2022) doubts about what awaits the global economy in 2023. And what about predicting changes over a decade in such an innovative field as machine learning? Perhaps my predictions will turn out to be totally inaccurate. So do they make any sense? I just think that from time to time it’s good to let your imagination run wild.

Recent developments in the field of artificial intelligence, some of which I have highlighted above (Stable Diffusion, GitHub Copilot, ChatGPT), confirm my belief that the AI revolution has begun for good. Doubts about whether another AI winter awaits us can be considered dispelled for some time. Of course, the industry still faces many challenges. One of the most fundamental is how to monetize AI. How do we use models built by engineers in the real economy? This is neither obvious nor simple, and it can also meet considerable social resistance. And yet, the hopes associated with AI in the economic context are huge. The three decades since the 1990s have been disinflationary mainly due to three forces: technological progress, the availability of cheap energy and globalization resulting from the opening of the economies of China and Eastern Europe. They flooded the world with cheap labor and, in parallel, created strong developing markets for the products and services of the old economies. In the early 2020s, with the saturation of these markets, along with Covid and the war in Ukraine, it seems that at least two of these forces have been temporarily or even irreversibly lost. Will AI take over? Yet another issue for AI is how we train models. Is gradient descent the main roadblock to AGI? Perhaps until a better method is found, AI will be in some kind of limbo.

Please take my predictions with a grain of salt. They were just an interesting intellectual exercise for me. They were also an easy excuse to “get out of the box” and step away from the everyday routine. I especially ask forgiveness of those for whom the above predictions have caused some level of discomfort or anxiety ;-). Finally, I hope that the less positive predictions will not come true, and that the more optimistic ones will surprise us with the scale at which they manifest.


PyTorch: dividing dataset, transformations, training on GPU and metric visualization

In machine learning, designing the structure of the model and training the neural network are relatively small elements of a longer chain of activities. We usually start with understanding business requirements, collecting and curating data, dividing it into training, validation and test subsets, and finally serving the data to the model. Along the way, there are things like data loading, transformations, training on a GPU, as well as metrics collection and visualization to determine the accuracy of our model. In this post, I would like to focus not so much on the model architecture and the learning itself, but on those few “along the way” activities that often require quite a lot of time and effort from us. I’ll be using the PyTorch library for coding.

In this post you will learn:

  • How can a dataset be divided into training, validation and test subsets?
  • How to transform a dataset (e.g. normalize the data), and how to reverse this process when needed?
  • How to use a GPU in PyTorch?
  • How to calculate the most popular metric – accuracy – for the training, validation and test loops?

Load and transform

At the beginning, some “formalities”, i.e. necessary imports with short explanations in the comments.

# main libraries
import torch
import torchvision

# All datasets in torchvision.dataset are subclasses
# of torch.utils.data.Dataset, thus we may use MNIST directly in DataLoader
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

# an optimizer and a loss function
from torch.optim import Adam
from torch.nn import CrossEntropyLoss

# required for creating a model
from torch.nn import Conv2d, BatchNorm2d, MaxPool2d, Linear, Dropout
import torch.nn.functional as F

# tools and helpers
import numpy as np
from timeit import default_timer as timer
import matplotlib.pyplot as plt
from torch.utils.data import random_split
from torchvision import transforms

We use the torchvision library, which offers classes for loading the most popular datasets. This is the easiest way to experiment. The class that loads the CIFAR10 dataset, which we are about to use, takes a torchvision.transforms object as one of its parameters. It allows us to perform a series of transformations on the loaded dataset, such as converting data to tensors, normalizing, adding padding, cutting out image fragments, rotations, perspective transformations, etc. They are useful both in simple cases and in more complex ones, e.g. when you want to do data augmentation. Additionally, transformations can be chained together using torchvision.transforms.Compose.

Here we just need to transform the data into a tensor and normalize it, hence:

# Transformations, including normalization based on mean and std
mean = torch.tensor([0.4915, 0.4823, 0.4468])
std = torch.tensor([0.2470, 0.2435, 0.2616])
transform_train = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])

Three notes to the above:

  • mean and std are the per-channel mean and standard deviation calculated over the dataset (a short sketch of how they can be computed follows this list),
  • as you can see, we define the transform separately for the training and the test datasets. The transforms are identical because we do not use data augmentation for the training set. We could have used a single transform in both cases, but for the sake of clarity, and in case we want to change one of them later, we keep both transforms separate,
  • the mean and std values will be useful later for de-normalization when displaying sample images, because after normalization the images are no longer readable to the human eye.
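If you prefer to compute these statistics instead of hard-coding them, a minimal sketch could look like the one below. It is my own illustration (not code from the original post) and assumes the dataset is first loaded with ToTensor() only, so the pixel values are still in the [0, 1] range:

# Sketch: compute per-channel mean and std of CIFAR10 (values still in [0, 1])
raw_dataset = CIFAR10('./', train=True, download=True, transform=transforms.ToTensor())
raw_loader = DataLoader(raw_dataset, batch_size=1000, shuffle=False)

channel_sum = torch.zeros(3)
channel_sq_sum = torch.zeros(3)
n_pixels = 0
for images, _ in raw_loader:
    # images: [B, 3, 32, 32] -> sum over batch, height and width for each channel
    channel_sum += images.sum(dim=[0, 2, 3])
    channel_sq_sum += (images ** 2).sum(dim=[0, 2, 3])
    n_pixels += images.size(0) * images.size(2) * images.size(3)

computed_mean = channel_sum / n_pixels
computed_std = torch.sqrt(channel_sq_sum / n_pixels - computed_mean ** 2)
print(computed_mean, computed_std)  # should be close to the values hard-coded above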

Having the transforms, we can load the dataset. The CIFAR10 class is a subclass of torch.utils.data.Dataset – more about what that means in this post.

# Download CIFAR10 dataset
dataset = CIFAR10('./', train=True, download=True, transform=transform_train)
test_dataset = CIFAR10('./', train=False, download=True, transform=transform_test)

dataset_length = len(dataset)
print(f'Train and validation dataset size: {dataset_length}')
print(f'Test dataset size: {len(test_dataset)}')
>>>Train and validation dataset size: 50000
>>>Test dataset size: 10000

# The output of torchvision datasets is PILImage images in the range [0, 1],
# but here they have been normalized and converted to tensors
dataset[0][0][0]
>>> tensor([
>>> [-1.0531, -1.3072, -1.1960, ..., 0.5187, 0.4234, 0.3599],
>>> [-1.7358, -1.9899, -1.7041, ..., -0.0370, -0.1005, -0.0529],
>>> [-1.5930, -1.7358, -1.2119, ..., -0.1164, -0.0847, -0.2593],
>>> ...,
>>> [ 1.3125, 1.2014, 1.1537, ..., 0.5504, -1.1008, -1.1484],
>>> [ 0.8679, 0.7568, 0.9632, ..., 0.9315, -0.4498, -0.6721],
>>> [ 0.8203, 0.6774, 0.8521, ..., 1.4395, 0.4075, -0.0370]])

De-normalize and display

It is good practice to preview some images from the training dataset. The problem, however, is that they have been normalized, i.e. the values of individual pixels have been changed, and converted to tensors, which in turn changed the order of the image channels. Below is a class that de-normalizes the data and restores its original shape.

# Helper callable class that will un-normalize image and
# change the order of tensor elements to display image using pyplot.
class Detransform():
  def __init__(self, mean, std):
    self.mean = mean
    self.std = std
  
  # PIL images loaded into dataset are normalized.
  # In order to display them correctly we need to un-normalize them first
  def un_normalize_image(self, image):
    un_normalize = transforms.Normalize(
        (-self.mean / self.std).tolist(), (1.0 / self.std).tolist()
    )
    return un_normalize(image)
  
  # If 'ToTensor' transformation was applied then the PIL images have CHW format.
  # To show them using pyplot.imshow(), we need to change it to HWC with 
  # permute() function.
  def reshape(self, image):
    return image.permute(1,2,0)

  def __call__(self, image):
    return self.reshape(self.un_normalize_image(image))


# Create de-transformer to be used while printing images
detransformer = Detransform(mean, std)

We also need a dictionary that would translate class numbers into their names, as defined by the CIFAR10 set, and a function to display a few randomly selected images.

# Translation between class id and name
class_translator = {
    0 : 'airplane',
    1 : 'automobile',
    2 : 'bird',
    3 : 'cat',
    4 : 'deer', 
    5 : 'dog',
    6 : 'frog', 
    7 : 'horse', 
    8 : 'ship', 
    9 : 'truck',
}


# Helper function printing 9 randomly selected pictures from the dataset
def print_images():
  fig = plt.figure()
  fig.set_size_inches(fig.get_size_inches() * 2)
  for i in range(9):
    idx = torch.randint(0, 50000, (1,)).item()
    picture = detransformer(dataset[idx][0])
    ax = plt.subplot(3, 3, i + 1)
    ax.set_title(class_translator[dataset[idx][1]] + ' - #' + str(idx))
    ax.axis('off')
    plt.imshow(picture)
  plt.show()

Well, let’s take a look at a few elements of this dataset …

print(f'The first element of the dataset is a {class_translator[dataset[0][1]]}.')
>>>The first element of the dataset is a frog.


image = detransformer(dataset[0][0])
plt.imshow(image)

 

cifar10 frog image

This is a frog, right? RIGHT?!?

print_images()

cifar10 few example images

A few randomly selected images from the CIFAR10 dataset


Divide into training, testing and validation subsets

Note that the CIFAR10 dataset constructor allows us to retrieve either a training or a test subset. But what if we want to separate a validation subset that will allow us to determine the accuracy during training? We have to separate it out from the training dataset ourselves. The random_split function from the torch.utils.data package will be helpful here.

validation_length = 5000
# Split training dataset between actual train and validation datasets
train_dataset, validation_dataset = random_split(dataset, [(dataset_length - validation_length), validation_length])

Having three objects of the Dataset class: train_dataset, validation_dataset and test_dataset, we can define DataLoaders that will serve the data in batches.

# Batch size used by all DataLoaders (matches the statistics printed below)
batch_size = 256

# Create DataLoaders
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
validation_dataloader = DataLoader(validation_dataset, batch_size=batch_size, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


# Print some statistics
print(f'Batch size: {batch_size} data points')
print(f'Train dataset (# of batches): {len(train_dataloader)}')
print(f'Validation dataset (# of batches): {len(validation_dataloader)}')
print(f'Test dataset (# of batches): {len(test_dataloader)}')
>>> Batch size: 256 data points
>>> Train dataset (# of batches): 176
>>> Validation dataset (# of batches): 20
>>> Test dataset (# of batches): 40

Build a model

In order not to focus too much on the network architecture – as that is not the purpose of this post – we will use the network designed in this post on convolutional neural networks. It is worth noting, however, that one of the issues in network design is data dimensioning. To check the size of the input vector that will be served by a DataLoader, the following code can be run:

# Before CNN definition, let's check the sizing of input tensor
data, label = next(iter(train_dataloader))
print(data.size())
print(label.size())
>>> torch.Size([256, 3, 32, 32])
>>> torch.Size([256])

So here we have a batch of size 256, then three RGB channels of the image, each of size 32 by 32.

This script can help in dimensioning a convolutional network. Of course, it is necessary to adapt it to the needs of a given model, but it’s a good start.
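Just to illustrate the idea (this is my own sketch, not the linked script): the spatial output size of a convolution or pooling layer follows the standard formula out = floor((in + 2·padding − kernel) / stride) + 1, and with it you can trace how the 32×32 CIFAR10 images shrink to the 4×4 maps that are flattened into the 16·4·4 input of the first linear layer in the model below:

# Sketch: trace the spatial size through the network defined below
def layer_out(size, kernel, stride=1, padding=0):
    return (size + 2 * padding - kernel) // stride + 1

size = 32  # CIFAR10 images are 32x32
# conv layers use padding='same' with stride 1, so they keep the size;
# only the 2x2 max-pooling layers (stride 2) shrink it
size = layer_out(size, kernel=2, stride=2)  # after pool1 -> 16
size = layer_out(size, kernel=2, stride=2)  # after pool2 -> 8
size = layer_out(size, kernel=2, stride=2)  # after pool3 -> 4
print(size)  # 4, hence the 16*4*4 features fed to the first Linear layer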

Eventually, our architecture will look like this:

class CifarNN(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.conv1 = Conv2d(3, 128, kernel_size=(5,5), stride=1, padding='same')  # [B, 128, 32, 32]
    self.bnorm1 = BatchNorm2d(128)
    self.conv2 = Conv2d(128, 128, kernel_size=(5,5), stride=1, padding='same')  # [B, 128, 32, 32]
    self.bnorm2 = BatchNorm2d(128)
    self.pool1 = MaxPool2d((2,2))  # [B, 128, 16, 16]
    self.conv3 = Conv2d(128, 64, kernel_size=(5,5), stride=1, padding='same')  # [B, 64, 16, 16]
    self.bnorm3 = BatchNorm2d(64)
    self.conv4 = Conv2d(64, 64, kernel_size=(5,5), stride=1, padding='same')  # [B, 64, 16, 16]
    self.bnorm4 = BatchNorm2d(64)
    self.pool2 = MaxPool2d((2,2))  # [B, 64, 8, 8]
    self.conv5 = Conv2d(64, 32, kernel_size=(5,5), stride=1, padding='same')  # [B, 32, 8, 8]
    self.bnorm5 = BatchNorm2d(32)
    self.conv6 = Conv2d(32, 32, kernel_size=(5,5), stride=1, padding='same')  # [B, 32, 8, 8]
    self.bnorm6 = BatchNorm2d(32)
    self.pool3 = MaxPool2d((2,2))  # [B, 32, 4, 4]
    self.conv7 = Conv2d(32, 16, kernel_size=(3,3), stride=1, padding='same')  # [B, 16, 4, 4]
    self.bnorm7 = BatchNorm2d(16)
    self.conv8 = Conv2d(16, 16, kernel_size=(3,3), stride=1, padding='same')  # [B, 16, 4, 4]
    self.bnorm8 = BatchNorm2d(16)    
    self.linear1 = Linear(16*4*4, 32)
    self.drop1 = Dropout(0.15)
    self.linear2 = Linear(32, 16)
    self.drop2 = Dropout(0.05)
    self.linear3 = Linear(16, 10)
  def forward(self, x):
    # the first conv group
    x = self.bnorm1(self.conv1(x))
    x = self.bnorm2(self.conv2(x))
    x = self.pool1(x)
    # the second conv group
    x = self.bnorm3(self.conv3(x))
    x = self.bnorm4(self.conv4(x))
    x = self.pool2(x)
    # the third conv group
    x = self.bnorm5(self.conv5(x))
    x = self.bnorm6(self.conv6(x))
    x = self.pool3(x)
    # the fourth conv group (no maxpooling at the end)
    x = self.bnorm7(self.conv7(x))
    x = self.bnorm8(self.conv8(x))
    # flatten
    x = x.reshape( -1, 16*4*4) 
    # the first linear layer with ReLU
    x = self.linear1(x)
    x = F.relu(x)
    # the first dropout
    x = self.drop1(x)
    # the second linear layer with ReLU
    x = self.linear2(x)
    x = F.relu(x)
    # the second dropout
    x = self.drop2(x)
    # the output layer logits (10 neurons)
    x = self.linear3(x)
  
    return x

 

Move to a GPU and calculate accuracy

Training a convnet on a CPU doesn’t make much sense. A simple test I did on Google Colab showed that it takes around 2600 seconds to complete one epoch on a CPU, while on a GPU it took 66 seconds to do the same work. These times obviously depend on many factors beyond our control, such as which machine the Google Colab engine assigns us, but the conclusions will always be similar – learning on a GPU can be much faster.

So what’s the easiest way to switch from CPU to GPU in PyTorch? Of course, in Google Colab we have to go through Runtime-> Change runtime type and change it to GPU. But the major changes need to be made in the code. Fortunately, there aren’t a lot of them, and they’re pretty straightforward.

The main issue is to establish what kind of environment we are dealing with. The code snippet below is a common good practice.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
>>> cuda

In the next step, we create a model and move it to the currently available device.

model = CifarNN()
model = model.to(device)

Before we start the training we need to define a few parameters: the number of epochs, the learning rate and the optimizer, as well as the method used to calculate the error. There are also two lists in which we are going to record the accuracy for each epoch. This will allow us to draw a nice graph afterwards.

epochs = 40
learning_rate = 0.001
train_accuracies = []  # cumulated accuracies from training dataset for each epoch
val_accuracies = []  # cumulated accuracies from validation dataset for each epoch
optimizer = Adam( model.parameters(), lr=learning_rate)
criterion = CrossEntropyLoss()

The next code snippet is the main training loop. There are a few things happening here that may be of interest in the context of this post, so I assigned indexes to some lines and commented on them below:

(1) – we are going to measure time elapsed.
(2) – it is a good practice to signal to the PyTorch engine when a training takes place, and when we only evaluate on validation or test datasets. This significantly improves the performance of evaluation parts.
(3) – we move the data (batch) to a GPU.
(4) – values we get after passing the input through the network (here: yhat) are the so-called logits, i.e. values theoretically ranging from plus to minus infinity. Target on the other hand (here: y) contains the numbers 0 through 9 indicating a correct class. The function that calculates error (here: CrossEntropyLoss) internally handles the corresponding comparison of those values. However, when calculating accuracy – in point (5) – we must first calculate the network answer ourselves. We use the argmax function for this. It returns the index where the strongest (highest in value) network response occurs. This index will also be the number of the class to which the network assigned the input value. This way we get prediction – a vector containing the class assignment for each element of the currently processed batch.
(5) – the most convenient way to calculate accuracy based on data in two vectors: y and prediction is to use numpy – excellent for vectorized operations. In order for the data on a GPU to end up in the list processed in a CPU environment, we need to use .detach().cpu().numpy() command.
(6) – for each training epoch we process the validation subset and calculate its accuracy to compare it with the accuracy calculated for the training subset. This way we’ll see whether the training process is overfitting or not.

start = timer()    # (1)
for epoch in range(epochs):
   model.train() # (2)
   train_accuracy = []
   for x, y in train_dataloader:
      x = x.to(device) # (3)
      y = y.to(device)  # (3)
      optimizer.zero_grad()
      yhat = model.forward(x)
      loss = criterion(yhat, y)
      loss.backward()
      optimizer.step()
      prediction = torch.argmax(yhat, dim=1) # (4)
      train_accuracy.extend((y == prediction).detach().cpu().numpy()) # (5)
   train_accuracies.append(np.mean(train_accuracy)*100)


   # for every epoch we do a validation step to assess accuracy and overfitting
   model.eval() # (2)
   with torch.no_grad(): # (2)
      val_accuracy = []  # accuracies for each batch of validation dataset
      for vx, vy in validation_dataloader: # (6)
         vx = vx.to(device) # (3)
         vy = vy.to(device) # (3)
         yhat = model.forward(vx)
         prediction = torch.argmax(yhat, dim=1) # (4)
         # to numpy in order to next use the vectorized np.mean
         val_accuracy.extend((vy == prediction).detach().cpu().numpy()) # (5)
      val_accuracies.append(np.mean(val_accuracy)*100)
   # simple logging during training
   print(f'Epoch #{epoch+1}. Train accuracy: {np.mean(train_accuracy)*100:.2f}. '
         f'Validation accuracy: {np.mean(val_accuracy)*100:.2f}')
end = timer() # (1)

As a result of training on 40 epochs, we get the following metrics:

>>> Epoch #1. Train accuracy: 34.20. Validation accuracy: 47.32
>>> Epoch #2. Train accuracy: 51.58. Validation accuracy: 57.00
>>> Epoch #3. Train accuracy: 58.11. Validation accuracy: 61.56
>>> Epoch #4. Train accuracy: 62.18. Validation accuracy: 64.16
................................................................
>>> Epoch #38. Train accuracy: 90.86. Validation accuracy: 73.86
>>> Epoch #39. Train accuracy: 91.42. Validation accuracy: 73.30
>>> Epoch #40. Train accuracy: 91.68. Validation accuracy: 73.40

As you can see, the difference between the training (91%) and the validation accuracy (73%) is considerable. The model fell into overfitting, which is visible on the graph below.

print(f'Processing time on a GPU: {end-start:.2f}s.') 
>>> Processing time on a GPU: 3113.36s.
plt.plot(train_accuracies, label="Train accuracy")
plt.plot(val_accuracies, label="Validation accuracy")
leg = plt.legend(loc='lower right')
plt.show()

Training & validation accuracy
The problem of overfitting is significant and there are several methods that can be used to reduce it. Some of them have already been applied to the above model (like the Dropout layers). More about preventing overfitting in this post.
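One common remedy that is not used here is data augmentation applied in the training transform. Below is a hedged sketch of what transform_train could look like with simple augmentations; the exact parameters are illustrative, not taken from the post:

# Sketch: augmentation in the training transform only (illustrative parameters)
transform_train_aug = transforms.Compose([
    transforms.RandomCrop(32, padding=4),     # random 32x32 crops from a padded image
    transforms.RandomHorizontalFlip(p=0.5),   # flip left-right with 50% probability
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])

Note that in this post the validation subset is split off from the same dataset object as the training data, so with this approach augmentation would also touch validation images unless the split is reorganized (e.g. two CIFAR10 objects with different transforms).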

At the end of the training process, we check how the model is doing on the test dataset, i.e. on the data that the model has never seen before.

# calculate accuracy on the test dataset that the model has never seen before
model.eval()
with torch.no_grad():
  test_accuracies = []
  for x, y in test_dataloader:
    x = x.to(device)
    y = y.to(device)    
    yhat = model.forward(x)
    prediction = torch.argmax(yhat, dim=1)
    test_accuracies.extend((prediction == y).detach().cpu().numpy())  # we store accuracy using numpy
  test_accuracy = np.mean(test_accuracies)*100  # to easily compute mean on boolean values
print(f'Accuracy on the test set: {test_accuracy:.2f}%')  
>>>Accuracy on the test set: 72.78%

Quick summary

This post was all about tools and techniques. We focused on a few areas that are sometimes technically more difficult than building the model architecture itself. We saw how to load data and divide it into three subsets: training, validation and test. We took a quick look at the transformations that can be applied using the transforms class and at how to display images from the dataset by inverting the transformations. After a short stop at network dimensioning, our attention shifted to training in the GPU environment and calculating accuracy. The final accuracy of 72% on the test dataset is obviously not a top result for CIFAR10, but that was not the goal of the post. For those of you interested in increasing the accuracy and fighting overfitting, I recommend my post on data augmentation. BTW: the script used in this post is available in my github repo.

Data preparation with Dataset and DataLoader in PyTorch

Preparing data for machine learning is not a task that most AI professionals are particularly fond of. Data comes in varying quality; most often it requires very thorough analysis, sometimes manual review, and certainly selection and initial preprocessing. In the case of classification tasks, the division of a dataset into classes may be inappropriate or insufficiently balanced. Often, a dataset is also simply too small and has to be artificially augmented. To sum up: it is not easy.

Nevertheless, it is a necessary step, and often a more important one than the subsequent tuning of the training algorithm. Part of the data manipulation stage is serving the data from a previously prepared dataset to the training algorithm, most often in batches. In this post I would like to take a closer look at some of the data manipulation methods offered by the PyTorch library.

In this post you will learn:

  • How to prepare the conda environment to work with PyTorch?
  • What are the Dataset and DataLoader classes for?
  • How to use them to work with one of the predefined datasets provided by the PyTorch library?
  • How to use Dataset and DataLoader to import your own dataset?
  • How to deal with a dataset consisting of multiple files, such as image files?

But first, let’s prepare an environment…

Below is a short set of instructions on how to prepare the environment for PyTorch. This makes it easier to recreate the code presented below. Of course, if you already have the environment ready, you can skip this part of the post.

I use conda and in my local environment I do not have a CUDA compatible card, so I need a build for the CPU. At the beginning, it is always worth checking the current version of conda:

> conda -V
> conda update -n base -c defaults conda

Then we create the environment for our work, activate it and install the necessary packages. Note the cpuonly – if you have CUDA, the installation should not include this parameter. The -c pytorch parameter may also be important, indicating the dedicated channel that will contain the appropriate versions of PyTorch and torchvision.

> conda create --name pytorch_env
> conda activate pytorch_env
> conda install jupyter pytorch torchvision numpy matplotlib cpuonly -c pytorch

Finally, we go to the working directory where we intend to save the scripts, run Jupyter Notebook and perform a simple test.

> cd <<my working directory>>
> jupyter notebook

import torch
x = torch.rand(5, 3)
print(x)
>>> tensor([[0.3425, 0.0880, 0.5301],
            [0.5414, 0.2990, 0.5740],
            [0.3530, 0.0147, 0.5289],
            [0.2170, 0.3744, 0.7805],
            [0.6985, 0.5344, 0.7144]])

If you encounter any problems while doing the above, do not have Anaconda installed, or want to use CUDA, please refer to this manual.

Data preparation – the simplest scenario

PyTorch offers two classes for data processing: torch.utils.data.Dataset and torch.utils.data.DataLoader. To simplify somewhat, Dataset‘s task is to retrieve a single data point together with its label from a dataset, while DataLoader wraps the data retrieved by Dataset with an iterator, ensures that it is served in batches, runs in multiple threads to speed up the retrieval of data for training if necessary, and supports such operations as data shuffling.

PyTorch also provides many sample datasets you can easily use in your learning time. So let’s start with such a scenario and prepare the data for training for the already known MNIST dataset. Below, we import the torch library, the Dataset class and the torchvision.datasets package containing many sample datasets from the computer vision space. Each dataset in the torchvision.datasets is a subclass of Dataset, which means that the __getitem__ and __len__ methods are implemented for us, more on that later.

import torch
from torch.utils.data import Dataset
from torchvision import datasets

When we import data from any dataset, we most often need to transform it in some way (e.g. normalize it). The torchvision package, as well as the other PyTorch packages with sample datasets, provides predefined transforms in its transforms module. In our example, we will use one of them to convert the data taken from the dataset to a PyTorch tensor.

from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt

Downloading the dataset is very simple: we create an object of the given dataset class (in our example MNIST) by passing a few parameters. Here: the local directory to which the data will be downloaded, an indication of whether we want the test or the training subset, the transforms we want to apply – and we can provide several of them – and a flag telling whether to save the dataset to disk, so that it does not have to be downloaded every time this instruction is executed.

training_dataset = datasets.MNIST(root='mnistdata', train=True, transform=ToTensor(), download=True)

Further use of the dataset boils down to indexing the object, which returns a pair (tuple) – data and label:

image, label = training_dataset[100]
print(type(image))
print(image.size())
print(type(label))

>>> <class 'torch.Tensor'>
>>> torch.Size([1, 28, 28])
>>> <class 'int'>

We can display the retrieved image:

plt.imshow(image.squeeze())
plt.title(label)
plt.show()

Data preparation with Dataset and DataLoader in Pytorch

Enter DataLoader. We wrap the created Dataset object with additional functionality useful for machine learning:

from torch.utils.data import DataLoader
dataloader = DataLoader(
   dataset=training_dataset,
   batch_size=5
)

After the DataLoader object is created, we can freely iterate over it, and each iteration will provide us with the appropriate amount of data – in our case, a batch of 5:

images, labels = next(iter(dataloader))
print(type(images), type(labels))
print(images.size(), labels.size())

>>> <class 'torch.Tensor'> <class 'torch.Tensor'>
>>> torch.Size([5, 1, 28, 28]) torch.Size([5])

Let’s assume that we want to display the content of the second image in the batch:

idx = 2
label = labels[idx].item()
image = images[idx]
plt.imshow(image.squeeze())
plt.title(label)
plt.show()

Use of Dataset and DataLoader classes in Pytorch

Now, we may use such data structures in the training process. We will discuss how to do this in another post, and now let’s see how we can use Datasets and DataLoaders in more practical scenarios.

How to create your own Dataset?

It would be more practical, for example, to use your own dataset rather than a sample dataset bundled with the PyTorch package. For simplicity, let’s assume that our dataset will be 500 readings of 10 integers, along with their classification into 10 classes, marked with numbers from 0 to 9.

The first step in the process of preparing your own dataset is to define our own class, which inherits from the “abstract” Dataset class. The implementation is simple because such a class requires only two methods to be overwritten: __getitem__ and __len__. Plus of course you should provide the code for the method that initializes the object (__init__). Since our dataset will be randomly generated, the constructor will accept 4 parameters: the beginning and the end of the integer interval for the number generator and the size of the dataset, here 500 rows with 10 values each. We also initialize labels in the constructor – randomly as well:

import torch
from torch.utils.data import Dataset, DataLoader

class RandomIntDataset(Dataset):
  def __init__(self, start, stop, x, y):
    # we randomly generate an array of ints that will act as data
    self.data = torch.randint(start, stop, (x, y))
    # we randomly generate a vector of ints that act as labels
    self.labels = torch.randint(0, 10, (x,))

  def __len__(self):
    # the size of the set is equal to the length of the labels vector
    return len(self.labels)

  def __str__(self):
    # we combine both data structures to present them in the form of a single table
    return str(torch.cat((self.data, self.labels.unsqueeze(1)), 1))

  def __getitem__(self, i):
    # the method returns a pair: data – label for the index number i
    return self.data[i], self.labels[i]

In the next step, we create an object of the RandomIntDataset class by providing the appropriate parameters, and we check the size of the generated dataset:

dataset = RandomIntDataset(100, 1000, 500, 10)
len(dataset)
>>> 500

Let’s see what our newly created dataset looks like – the last column shows the class of a single data sample:

print(dataset)
>>> tensor([[627, 160, 881, ..., 485, 457, 9],
            [705, 511, 947, ..., 744, 465, 5],
            [692, 427, 701, ..., 639, 378, 9],
             ...,
            [601, 228, 749, ..., 155, 823, 4],
            [599, 627, 802, ..., 179, 693, 4],
            [740, 861, 697, ..., 286, 160, 4]])

After we have created the object, we may use it by wrapping it, as in the previous example, in a DataLoader, and then iterating over batches of data – in our case, batches of 4.

dataset_loader = DataLoader(dataset, batch_size=4, shuffle=True)
data, labels = next(iter(dataset_loader))
data
>>> tensor([[724, 232, 501, 555, 369, 142, 504, 226, 849, 924],
            [170, 510, 711, 502, 641, 458, 378, 927, 324, 701],
            [838, 482, 299, 379, 181, 394, 473, 739, 888, 265],
            [945, 421, 983, 531, 237, 106, 261, 399, 161, 459]])
labels
>>> tensor([3, 6, 9, 7])

Retrieving data from files

In computer vision tasks, we often deal with data that are provided as files. Inheriting from the Dataset abstract class and overriding its methods will allow us to process such files in exactly the same way:

  • create a class inheriting from Dataset,
  • define __init__, __getitem__ and __len__ methods, plus any other helper methods, if necessary,
  • create an object of this class and pass it to the DataLoader.

Now let’s look at how you can implement data retrieval for the Facial Key Point Detection Dataset. After downloading and unpacking the file, we will get the images directory containing 5000 files, cropped to the same size, and a JSON file containing the coordinates of 68 key face points for each of the files. These key points usually identify the eyes, lip line, eyebrows, and the oval of the face.

The dataset was prepared by Prashant Arora as a subset of the original, much larger Flickr-Faces-HQ dataset, created by the NVIDIA team and made available under the Creative Commons BY-NC-SA 4.0 license.

We import the necessary libraries and create the class inheriting from Dataset, in which we implement the three required methods. The __init__ method sets a variable pointing to the name of the data directory. It should be in the directory from which we run this script, and the downloaded file should be unpacked into it. The __len__ method returns the size of the variable with the coordinates of the key points, which also happens to be the size of the entire dataset. The __getitem__ method first gets the name of the file with index i from the variable with coordinates, and then loads the image from the appropriate file located in the images directory.

import torch
from torch.utils.data import Dataset, DataLoader
import json # we need to import json file with key points coordinates
import numpy as np
import matplotlib.image as img
import matplotlib.pyplot as plt

class FacialDetection(Dataset):
  def __init__(self, dataset_directory="FacialKeyPoint"):
    # set root directory for your dataset
    self.dataset_directory = dataset_directory

    # read json file with annotations
    annotations_file = open(self.dataset_directory + "\\all_data.json")
    self.annotations = json.load(annotations_file)

  def __len__(self):
    return len(self.annotations)

  def __getitem__(self, i):
    image_filename = self.annotations[str(i)]['file_name']
    image_path = self.dataset_directory + "\\images\\" + image_filename
    image = img.imread(image_path)

    points = self.annotations[str(i)]['face_landmarks']

    return image, np.array(points)

We can now create an object of our new class and check if we really have a set of 5000 elements:

dataset = FacialDetection()
len(dataset)
>>> 5000

Let’s retrieve one of the images, with index 888, and display it:

image, key_points = dataset.__getitem__(888)
plt.imshow(image)
plt.show()

Use Dataset to retrieve data from image files

And the same picture with key points applied:

plt.imshow(image)
plt.scatter(key_points[:, 0], key_points[:, 1], marker='o', c='y', s=5)
plt.show()

Dataset and DataLoader - a standardized way of serving data for machine learning

If you want to serve this dataset for machine learning, just wrap it in the DataLoader class and iterate over the returned object.

dataset_loader = DataLoader(dataset, batch_size=4, shuffle=True)
data, labels = next(iter(dataset_loader))
data.size()
>>> torch.Size([4, 512, 512, 3])

labels.size()
>>> torch.Size([4, 68, 2])

Time for a short summary

The Dataset and DataLoader classes offer a simple and, importantly, standardized way of accessing data and processing it further for machine learning. Apart from the standardization itself (which greatly simplifies programming in many applications), these classes also give easy access to the datasets available in the PyTorch library. Importantly, torchvision, torchtext and torchaudio let you use predefined transforms (here is an example for torchvision) and use them in the DataLoader. You can also use these transforms in your own class or write your own transforms – I did not cover this topic in the post, but it is worth mentioning as an additional advantage. Yet another benefit of using Dataset and DataLoader is the possibility to parameterize parallel processing on many CPUs, as well as to optimize data transfer between the CPU and GPU, which is critical when processing very large amounts of data.
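As a small, hedged illustration of those last two points (my own sketch, not code from the post): a custom Dataset can accept a transform and apply it on the fly in __getitem__, and the DataLoader can load data in several worker processes with pinned memory to speed up transfers to the GPU. The class name and parameter values below are hypothetical:

# Sketch: a transform passed to a custom Dataset, plus parallel loading in DataLoader
class TransformedRandomIntDataset(Dataset):
  def __init__(self, start, stop, x, y, transform=None):
    self.data = torch.randint(start, stop, (x, y)).float()
    self.labels = torch.randint(0, 10, (x,))
    self.transform = transform  # any callable, e.g. a torchvision transform

  def __len__(self):
    return len(self.labels)

  def __getitem__(self, i):
    sample = self.data[i]
    if self.transform is not None:
      sample = self.transform(sample)  # applied on the fly, per sample
    return sample, self.labels[i]

def scale_to_unit(t):
  # simple custom transform: scale values from [100, 1000) roughly into [0, 1]
  return t / 1000.0

dataset = TransformedRandomIntDataset(100, 1000, 500, 10, transform=scale_to_unit)
dataset_loader = DataLoader(dataset, batch_size=4, shuffle=True,
                            num_workers=2,    # parallel data loading in worker processes
                            pin_memory=True)  # speeds up host-to-GPU copies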

YOLO fast object detection and classification

Computer Vision is one of the most interesting and my favorite application area for artificial intelligence. A big challenge for image analysis algorithms is fast detection and classification of objects in real time. The problem of detecting objects is much more difficult than the classification that I have discussed many times on my blog. That’s because not only do we have to indicate what the object is, but also where it is located. You can easily list dozens of applications for the object detection algorithms, but in general it can be assumed that a machine (e.g. autonomous car, industrial robot, detection / evaluation system) should be able to identify visible objects in real time (people, signage, industrial facilities, other machines, etc.) in order to adapt its subsequent behavior or generated signals to the situation in the environment. This is where You Only Look Once (YOLO) comes in.

YOLO was proposed by Joseph Redmon et al., and its most recent version as of the day of writing this post, version 3, is described in YOLOv3: An Incremental Improvement. I also recommend the video of Redmon’s TEDx speech.

The three most important features of the YOLO algorithm that distinguish it from the competition are:

  • Using a grid instead of a single window moving across the image – as in the case of Fast(er) R-CNN. Thanks to this approach, the neural network sees the entire picture at once, not just a small part of it. Consequently, it can not only analyze the whole image faster, but also draw conclusions from its entire informational content, rather than from a fragment that does not always carry contextual information. Thanks to the latter property, YOLO makes far fewer mistakes of taking the background for an object – one of the main problems of the competing Fast(er) R-CNN algorithm.
  • Reducing the complex problem of classification and localization of an object to one regression problem, when the output vector contains both the class probabilities and the coordinates of the area containing the object (the so-called bounding box).
  • Very effective generalization of knowledge. As a curiosity confirming this feature, the authors show that YOLO trained on pictures showing nature is perfectly capable of detecting objects in works of art.

As a result, we get a statistical model that is not only able to process over 45 frames per second, but also gives a similar (though slightly lower) detection efficiency to definitely slower solutions.

YOLO benchmark

Source: YOLOv3: An Incremental Improvement. Joseph Redmon, Ali Farhadi, University of Washington

YOLO fast object detection and classification – how does it work?

Traditional methods of detecting objects most often divide the entire process into several stages. For example, Faster R-CNN first uses a convolutional neural network to extract the desired features of the image (so-called feature extraction). Then the output, in the form of a feature map, is the input to another neural network whose task is to suggest image regions where objects may be located. Such a network is called a Region Proposal Network (RPN) and it is both a classifier (indicating the probability that a given region contains an object) and a regression model (describing the region of the image containing a potential object). The output of the RPN is passed to a third neural network, whose task is to predict object classes and bounding boxes. As you can see, it is quite a complicated, multi-stage process that takes quite a long time, at least compared to YOLO.

YOLO takes a completely different approach. First of all, it treats the detection and classification problems as a single regression problem. It does not divide the analysis into stages. Instead, a single convolutional neural network simultaneously predicts multiple bounding boxes and determines the class probabilities for each of the areas in which the object has been detected.

In the first step, YOLO puts a grid with the size of S x S on the image. For example, for S = 4, we get 16 cells, as in the image below.

YOLO - fast object detection and classification

The YOLO grid for S=4

For each cell YOLO predicts B objects. B is usually a small number, like 2. This means we assume that in each cell YOLO will identify at most 2 objects. Of course, there may be more overlapping objects in the image, but when 3 or more objects overlap in a given cell, they become very difficult to identify – especially considering that S is usually greater than in our example (4), and therefore the applied grid is finer.

Thus, for each cell of the grid, YOLO predicts whether there are objects in it, and for each of them it determines the coordinates of the rectangle surrounding the object (the bounding box). This means that in the output vector we need to reserve room for B * 5 values. Why 5? Because a bounding box is defined by 4 values: the two coordinates of the object’s center (relative to the analyzed cell) and the width and height of the rectangle (as a fraction of the image size). The fifth value is a confidence score expressing how certain the model is that the box actually contains an object.

Finally, the class probabilities should be added to the output vector – for each object separately. For example, if we are going to predict C = 10 classes for a maximum of B = 2 objects in one grid cell, then the final output tensor will be: S x S x (B * (5 + C)). In our example: 4 x 4 x (2 * (5 + 10)) = 4 x 4 x 30.
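As a quick sanity check of this arithmetic, the output size can be computed with a trivial, purely illustrative helper (the function name is made up; it only encodes the convention used in this post):

def yolo_output_size(S, B, C):
    # each cell predicts B boxes, each with 4 coordinates + 1 confidence value,
    # plus C class probabilities per box (the convention described above)
    return (S, S, B * (5 + C))

yolo_output_size(4, 2, 10)
>>> (4, 4, 30)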

So, we train the network and make predictions assuming a tensor on the output – hence it is easy to understand why the problem of object identification and classification was reduced by YOLO to the regression problem.

Two important points that may have caught your attention:

  1. If the algorithm identifies an object in a grid cell, is the bounding box somehow related to the grid cell? Yes and no. Yes, because this cell includes the center of the bounding box. No, because of course the actual surrounding rectangle will hardly ever coincide with the boundaries of the grid cell.
  2. Because the grid has the same number of rows and columns, the analyzed images must be square and match the size expected by the given YOLO implementation. If images are rectangular or do not match the expected size, many YOLO implementations resize them to aspect ratio = 1, e.g. by padding the image with black borders (letterboxing) – a minimal sketch of this is shown right after this list.
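As an illustration of the second point, here is a minimal letterboxing sketch with Pillow – a toy example, not the preprocessing of any particular YOLO implementation; the 416-pixel target size is just an assumption:

from PIL import Image

def letterbox(image, size=416):
    # scale the longer edge to `size`, keeping the aspect ratio
    scale = size / max(image.width, image.height)
    resized = image.resize((round(image.width * scale), round(image.height * scale)))
    # paste the result onto a black square canvas (aspect ratio = 1)
    canvas = Image.new("RGB", (size, size))
    canvas.paste(resized, ((size - resized.width) // 2, (size - resized.height) // 2))
    return canvas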

YOLO – the architecture

In the third version of the algorithm, the authors used the pretty extensive convolutional neural network with the architecture presented below. BTW: if you want to explore convolutional neural networks from scratch, there is a four-part tutorial on my blog.

The network has 53 convolutional layers, hence it was called Darknet-53.

YOLO - architecture of the convolutional neural network

Darknet-53: convolutional neural network – the architecture

YOLO in practice

Let’s see how YOLO behaves in practice. Installation instructions are available on the author’s website, but I was not able to complete them successfully on Windows 10. So I recommend switching to Linux or Mac right away. The following tips are for Ubuntu 20.04.

To build the classifier you will need to be able to compile the code with gcc. If the compiler is not installed on the operating system (check with gcc --version), I suggest executing these three commands:

$ sudo apt update
$ sudo apt install build-essential
$ gcc --version

If the compiler version is displayed correctly, we can proceed with downloading and compiling Darknet:

$ git clone https://github.com/pjreddie/darknet
$ cd darknet
$ make

In the repository you will find the code, obviously, but the weights for the trained network have to be downloaded separately. The following command should be executed from the darknet directory and the file with network weights should also be saved there.

$ wget https://pjreddie.com/media/files/yolov3.weights

Depending on how much RAM you have (my Ubuntu machine had only 512 MB of RAM), the swap file may have to be enlarged. I increased it to 2 GB, but lower values will probably be enough too:

$ sudo fallocate -l 2G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile

At this point, we should be able to run a prediction for one of the sample photos in the data directory. Below is an example for the kite.jpg file. Note: it may be necessary to run prediction as a super user – hence sudo has been added:

$ sudo ./darknet detect cfg/yolov3.cfg yolov3.weights data/kite.jpg

The result is saved on the server in the predictions.jpg file. Here is the result for kites – input and output pictures combined:

YOLO fast object detection and classification - example no 1

YOLO in practice – example no 1

And below my photo with horses on the paddock:

$ sudo ./darknet detect cfg/yolov3.cfg yolov3.weights data/horses-square.jpg

YOLO fast object detection and classification - example no 2

YOLO in practice – example no 2

As you can see, the horses were mostly located and correctly classified. Two of the three horses in the far background were not found, but it must be admitted that it would not be easy for the human eye either. A humorous touch is the classification of the white gas tank as a horse, though it is hard to disagree that from this perspective the tank does look a bit like a horse’s rump seen from a distance. 😉

This mistake with the gas tank is, paradoxically, a confirmation of one of YOLO’s strong features – the algorithm looks at the image as a whole and infers contextually from its entire content, not from a narrow fragment. YOLO has some problems with detecting small objects and does worse on scenes with many overlapping objects, but overall it is a brilliant architecture, written in C using CUDA, which makes it a very fast and effective tool for detecting and classifying objects in real time.

The post YOLO fast object detection and classification appeared first on AI Geek Programmer.

Artificial intelligence and blockchain https://aigeekprogrammer.com/artificial-intelligence-and-blockchain/ https://aigeekprogrammer.com/artificial-intelligence-and-blockchain/#respond Sat, 14 Nov 2020 16:07:42 +0000 https://aigeekprogrammer.com/?p=8773 Looking at the advances in technology over the past few years, it’s hard to name two more breakthrough technologies than artificial intelligence and blockchain. The former opened up entirely new possibilities in the fields of data analysis, predictions and robotics. The later one elevated decentralization,...

Looking at the advances in technology over the past few years, it’s hard to name two more breakthrough technologies than artificial intelligence and blockchain. The former opened up entirely new possibilities in the fields of data analysis, prediction and robotics. The latter elevated decentralization, transparency and security thanks to the immutability built into the blockchain. Both technologies are still at a fairly early stage of adoption, but the decade that is just beginning may be a period of full bloom for them. The new economy based on data, artificial intelligence and decentralized systems is a promise both tempting and giving rise to certain concerns. Can artificial intelligence and blockchain cross their paths permanently? Can their cooperation bring a synergy effect, or will their further development be more siloed, with only a few points of contact? What are the fields for potential collaboration and what are the benefits of using both technologies simultaneously?

Artificial intelligence – technology of the (near) future

Paradoxically, AI is a pretty old science – its origins date back to the 1950s. The field has gone through many difficult times, mainly for two reasons: too little data and insufficient computing power. In the meantime, it managed to develop many methods that relied heavily on statistics and mathematical models – everyone heard about machine learning, right? However, more advanced AI has flourished in recent years. First, huge amounts of data have emerged, mainly collected by corporations and state organizations in the US and China. Second, the availability of high computing power is at its highest ever. We can even get it for free, for non-commercial and scientific use, or buy it in any quantity in the cloud.

Finally, the idea of Deep Learning emerged. Using neural networks powered by huge amounts of data, processed in highly efficient environments, Deep Learning made it possible to obtain amazing results in many fields where human dominance had been unquestionable so far: chess, GO, computer games, image recognition, speech processing, data analytics, medical diagnosis, production of new drugs – the list could be much longer. Interestingly, deep neural networks partially resolved one of the fundamental problems of machine learning, which was the laborious process of a human describing data before presenting it to an algorithm. Neural networks do it largely by themselves during the learning process (so-called feature extraction).

Artificial intelligence and blockchain

Figure 1 – AI vs. machine learning vs. neural networks vs. Deep Learning

Artificial intelligence is currently in a phase of narrow practical applications – the so-called narrow AI. This means that the practical applications of artificial intelligence are perfect for narrow areas, some of which I mentioned above. However, let’s face it, in the privacy of well-guarded corporate and government laboratories, work is underway to create a general AI that will have the autonomy and ability to generalize problems and abstract them to higher levels, similar to that of humans. This will completely change the balance of power and the nature of the global economic order to a much greater extent than covid-19 did, although the pace of implementing this change will be much slower and more controllable. This transformation is probably inevitable and is of great concern to most of us. After all, as a species, we have been used to overwhelming intellectual domination over the rest of the species on earth for millennia. The prospect of a creature that can surpass us intellectually is deeply disturbing.

If you want to learn more about AI and some basic concepts related to it, please read one of my previous posts.

Blockchain? Wait, what’s that?

When someone starts talking about blockchain, one of the first associations is likely to be cryptocurrencies, especially bitcoin. This is understandable, because the blockchain was first used in practice on a massive scale by Satoshi Nakamoto (according to many, this pseudonym was used by the now-deceased Hal Finney – but the question of Nakamoto’s identity, as well as the question of whether anyone has access to his wallet’s private key 😉 is a topic for a completely separate post). Satoshi created the first independent, decentralized and fully electronic currency. Blockchain technology is easy to explain using the example of bitcoin, but if someone thinks that blockchain equals cryptocurrencies, you have to tell them: not so fast!

The foundations of blockchain technology are cryptography, a peer-to-peer network and so-called Distributed Ledger Technology (DLT), i.e. a distributed database managed by many participants, without one distinguished coordinator. But let’s start from the beginning: where did this name even come from? Blockchain is one way of storing information. As in traditional databases, this information is grouped into blocks – hence the word “block”. These blocks are, in turn, bound together in a serial chain by a cryptographic hash function – hence the word “chain”. To put it simply: mining a block means calculating the hash of its content (bitcoin uses SHA-256) and writing the hash value into the next block, where this value becomes one of the components used to calculate the hash of that new block. And so the chain moves forward. Currently (October 2020), the bitcoin blockchain stores approximately 580 million transactions, and the total blockchain size for bitcoin is approximately 300 GB.

Artificial intelligence and blockchain

Figure 2 – blockchain simplified diagram

Bitcoin has been operating since 2009 and during this time no one has successfully carried out an attack that would change the data of even a single transaction. Why? Because the hash of each block must meet certain specific criteria, e.g. contain N zeros at the beginning. To get a valid hash, miners change the content of the block by randomly modifying the dedicated nonce field, or – if none of the nonce values yields the desired hash – by slightly changing the content of the block in some other acceptable way; more about the practice of bitcoin block mining can be found here. As a result, mining a single bitcoin block is computationally very expensive. If someone wanted to change a transaction in the bitcoin blockchain that was registered several blocks ago, they would have to change the content of that particular block (to alter the transaction), recalculate its hash, and – due to the referential binding of subsequent blocks with previous hashes – also calculate new hashes for all subsequent blocks! And all of this while the network keeps producing new blocks. Theoretically, such an attack is possible, but in practice it is technically infeasible and economically unjustified.
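To make the mechanics of mining more tangible, here is a toy sketch in Python – a deliberately simplified model that has nothing in common with the real Bitcoin block format, but it shows the idea of searching for a nonce that produces a hash with a required number of leading zeros:

import hashlib

def mine(block_content, previous_hash, difficulty=4):
    # brute-force a nonce until the SHA-256 hash starts with `difficulty` zeros
    nonce = 0
    while True:
        data = f"{previous_hash}{block_content}{nonce}".encode()
        block_hash = hashlib.sha256(data).hexdigest()
        if block_hash.startswith("0" * difficulty):
            return nonce, block_hash
        nonce += 1

# toy usage: the hash of this block would become "previous_hash" of the next one
nonce, block_hash = mine("Alice pays Bob 1 BTC", previous_hash="00ab")
print(nonce, block_hash)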

Bitcoin’s resistance to attacks is a confirmation of one of the most important features of blockchain technology – the guarantee of the immutability of entries in DLT. This is of great importance, obviously for financial transactions, but also for all other potential applications where the certainty that we see true and unchanged data is crucial, e.g. medical test results, voting results, the content of the land and mortgage register of the real estate that we plan to purchase, forensic evidence, data analysis on the basis of which the company’s management makes key decisions, etc.

When discussing the most important features of blockchain technology, it is also worth noting that:

  • The public blockchain is a fully distributed and independent structure, in the sense that it does not depend on any central organization (central bank, government, corporation), so these organizations cannot influence DLT records in any way. This does not mean, of course, that governments, central banks and corporations cannot use this technology. They are using it now and the prospects for using it are becoming more and more far-reaching. The dispersion of the blockchain network also provides resilience to failures and protects against any single participant gaining too much influence over its operation.
  • All transactions in the public blockchain are public and available to everyone. This does not mean that everyone can look into our wallet, unless they know its address. There are actually blockchains available that ensure that some data is hidden in the block, which significantly increases privacy.
  • All transactions are not only immutable, but also indisputably time-stamped – this is due to the nature of the blockchain, which is a structure that only adds further entries. Archival records cannot be modified.

The above is only a modest part of the entire ecosystem of blockchain technology, and discussing the properties of this technology in detail is beyond the scope of this article. For those interested in deepening their knowledge about bitcoin in particular, I recommend this article, in which the author managed to explain the basics of this technology in a very accessible and consistent way.

As an interesting fact, I would like to add that the content of my blog is protected by the WordProof plugin, which uses blockchain technology to protect intellectual property.

Artificial intelligence and blockchain – how to combine it?

Let’s take a moment to consider how the two technologies fit together in terms of key features.

One of the main uses of AI is to predict the future. Assessing the situation on the road of an autonomous car, predicting the next move of a chess player, or forecasting economic behavior. Blockchain, meanwhile, focuses its attention on a permanent record of the past – entries in DLT, once agreed by the network, will remain unchanged until the end of its existence.

AI is an extreme data consumer. To be effective, it needs data just like people need oxygen to live. Data in huge amounts. Blockchain is basically a database whose main task is to record data and serve it in a decentralized and transparent manner. The amount of data consumed by blockchain is relatively small.

Artificial intelligence is extremely dynamic in the sense that it adapts its operation and the results it generates to a changing environment. It is largely based on statistics and probability. Blockchain is rather static – it could even be described as deterministic or predictable.

The feature that definitely binds both technologies is their huge appetite for more and more computing power.

I don’t know about you, but it seems to me that artificial intelligence and blockchain have features that can perfectly complement each other. Their skillful use in one solution can give a strong synergistic effect. You may say that artificial intelligence and blockchain are like fire and water, although in my opinion hydrogen and oxygen make a more accurate comparison. The first is dynamic and highly explosive. The second itself sustains combustion. Combined together, they make water – a substance necessary for life.

However, abandoning the philosophical and chemical reflections, let us consider in which areas we can count on the greatest synergy.

Is centralization good for the development of artificial intelligence?

Let me start with the most important area in my opinion – the need for greater decentralization and democratization of AI. The current state, in which most of the revolutionary AI solutions are created by the largest corporations and a few governments, can lead to nothing good. At best, it will further increase social inequality and accumulate most of the world’s wealth in the hands of a small percentage of the total population. Social inequalities are inevitable in a capitalist economy, but when they become excessive they generally lead to social unrest, populist governments, and sometimes revolution.

The risk of AI centralization comes from the concentration of data, computing power, and the most fertile minds in the hands of the largest corporations and few governments. As a result, solutions will be created that will serve those governments (for control) and those corporations (for further capital accumulation). Do we want AI development to go in this direction? Or would we prefer AI development to be more democratized?

One of the proponents of decentralized AI, Ben Goertzel, humorously noted that a few years ago people were convinced that artificial intelligence would kill us as soon as it surpassed the intellectual level of its creators. Partially this was the aftermath of movies like The Matrix or Terminator. Nowadays, people are more worried about AI taking their jobs. This interesting shift in perceptions of AI may mean that humans have come to terms with the vision of AI destroying our species, but they just wouldn’t like to be unemployed in those last few years. 😉 I recommend watching Ben Goertzel’s speech at TEDx.

Goertzel and others like him promote the ideas of solutions based on AI and blockchain in search of decentralized artificial intelligence and ultimately Artificial General Intelligence (AGI) – a creature with a level of thinking equal to human beings. Such developed artificial intelligence will not be controlled by one organization, but could be managed and developed in a democratic and decentralized manner.

On the basis of these beliefs, many projects have been created, out of which I have selected two that should be briefly mentioned. The first is the DAIA foundation, which brings together organizations working on decentralized artificial intelligence using blockchain. This is intended to counteract the concentration of AI power in one place. If someone is looking for information about Decentralized Artificial Intelligence projects, DAIA may be a good place to start. One of the assumptions of its founders is the conviction that AGI can be achieved not through one network, but through many distributed networks, communicating with each other and sharing data. And these possibilities can obviously be provided by blockchain technology. This assumption is also adopted by the second project – SingularityNET. One of its goals is to provide a decentralized exchange of artificial intelligence solutions to which you can upload an AI model and announce it in a decentralized network so that the company that needs such a service can easily, quickly and safely use it.

While these projects appear to be somewhat underinvested and attract little attention at present, they are extremely important in terms of democratizing AI and giving a wider social group influence over the shape and form of further AI development. If they could be compared to something, it might be – to some extent – the open-source movement in its early stages of development and the later impact it had on the democratization and demonopolization of many IT solutions.

How to protect your data and their privacy while making money on them?

Artificial intelligence feeds on data. In today’s economy data can be more valuable than gold. Especially the data that is scarce, contains sensitive data, and is well prepared for use in machine learning. Why? One can risk a statement that in machine learning projects between 30% and 60% of the time and costs are spent on the initial data collection, initial analysis, business discussion, checking whether data meets various criteria (e.g. related to the protection of sensitive data), then data cleaning, quality assurance and transformation to the desired form.

What if we have a properly developed set of rare medical data that we would like to share, so that other scientists can use them in their research? What if we would like to sell such a dataset? Will a dataset be copied or changed in an unauthorized way? How will unauthorized changes affect results of research by other teams? Several projects in the field of artificial intelligence and blockchain have addressed the above problems. One of such projects is the Ocean Protocol. In a nutshell, using the exchange provided by Ocean Protocol, you can create something like an AI asset on it. Such an asset may be a properly prepared dataset, but it may also be an already trained model, ready for use. After publishing it in Ocean Protocol, assets are registered in a blockchain and receive an unique identifier. All assets have metadata describing them and the owner obviously receives a private key confirming ownership. Ocean Protocol also provides resources to store AI assets. Interestingly, in order to protect privacy, the model can be processed in the Ocean Protocol infrastructure without disclosing the data to any third party, which is especially important for sensitive and private data.

Unused computing power

As I mentioned above, artificial intelligence and blockchain certainly share one trait – the greedy consumption of computing power. One of the potential fields for cooperation seems to be the sharing of unused computational power. Currently, most AI computation takes place in centralized data centers. These have specific performance limits, and the corporations that provide computing power regulate access to it with pricing mechanisms. This can severely restrict innovation outside the world of wealthy organizations, especially among academics and entrepreneurs in countries with currencies of lower purchasing power. And the demand for computing power related to artificial intelligence is growing more than linearly every year. Blockchain could naturally democratize access to computing power, given that the vast majority of blockchains act as computationally powerful peer-to-peer networks.

Testing smart contracts

In the area of cryptocurrencies, 2020 will be remembered as the year of decentralized financial solutions. This movement is known as Decentralized Finance, or DeFi for short. It is an interesting technological branch, as it democratizes one of the basic needs of modern man – financial management. By democratization, I mean independence from a single central institution that supervises and controls all financial operations. This control is multi-level, because banks are governed by supervisory institutions and central banks. That, of course, has huge advantages, but also considerable disadvantages, and going into the pros and cons of DeFi is beyond the scope of this article.

What matters from our point of view is that DeFi, like many other blockchain-based applications, bases its operation on so-called smart contracts, i.e. applications published on the blockchain whose task is to automatically execute transactions of a specific type. They are saved in the blockchain and thus do not require the supervision of a trusted third party. In 2020, DeFi is at a very early stage of development. Solutions are popping up everywhere, and quite frequent implementation errors are costly for developers and for the – so far mainly speculative – investors. There are currently many hacker attacks in the sector, based either on backdoors or on deliberately or unintentionally hidden vulnerabilities, often ending in a so-called rug pull or exit scam. Nevertheless, the DeFi market is booming and more and more money is flowing into it; some projects can attract millions of dollars in speculative investments in a short time.

This is where AI can enter the scene, opening up opportunities in two areas. The first is fast and automated testing of new solutions, especially since those solutions seem to be heavily underinvested in terms of quality assurance. Testing with artificial intelligence can rely on machine-assisted formal verification or on support for automated testing. The second area where AI can work great for applications published on the blockchain is actively monitoring the behavior of smart contracts to detect threats and data leaks in real time. Of course, no AI can replace human testing for now, but the implementation of such tools may mean that at least we will not have to deal with testing entirely in production.

Artificial intelligence and blockchain – the future is at hand

In this article, I tried to introduce the nature of AI and blockchain technology, point out the similarities and differences and give examples of fields for potential cooperation. Both artificial intelligence and blockchain are still in early stages of business use. The degree of their joint use in practical solutions is in its infancy. This stage can be compared to the internet in the early 90s, when it was still a technological curiosity and only a small part of companies and organizations tried to use the internet for business purposes. Currently, nobody questions the role of the internet in the democratization of knowledge. And the largest companies base their power on the internet. Will this also happen with artificial intelligence and blockchain? I think so. Will it be the result of close cooperation of both technologies in the area of Decentralized Artificial Intelligence – time will tell.


Did you like the post? I’d be grateful if you could recommend it and/or share it.

If you want to be informed about new articles on my blog, please follow my Twitter account.

The post Artificial intelligence and blockchain appeared first on AI Geek Programmer.

Artificial intelligence – a few key concepts https://aigeekprogrammer.com/artificial-intelligence-a-few-key-concepts/ https://aigeekprogrammer.com/artificial-intelligence-a-few-key-concepts/#respond Tue, 25 Aug 2020 20:43:00 +0000 https://aigeekprogrammer.com/?p=8686 Until recently, a large part of the key concepts in the field of artificial intelligence was not so clearly defined. Some of them, such as Deep Learning, were even referred to as “buzzwords”, term used mainly by marketing and not strictly translated into scientific areas....

Until recently, a large part of the key concepts in the field of artificial intelligence was not clearly defined. Some of them, such as Deep Learning, were even referred to as “buzzwords” – terms used mainly in marketing, without a strict scientific meaning. Now the basic concepts seem to have taken hold, and most AI professionals agree on what they mean. As certain definitions appear throughout most of my blog posts, as well as in articles, tutorials and courses available on the web, I decided to lay them out for you as clearly as possible.

After reading this post:

  • You will organize your knowledge of four key concepts of AI: artificial intelligence, machine learning, neural networks and deep learning.
  • You will learn the differences between supervised and unsupervised learning.
  • You will be able to tell what the training, test and validation datasets are and what overfitting is.
  • You will be able to define what hyperparameters and model parameters are.

Artificial intelligence vs machine learning vs deep learning

The most questionable is the distinction between the concepts of artificial intelligence, machine learning, deep learning, their relationships, and how extremely popular neural networks fit into it all.

Artificial intelligence is the broadest concept that defines a de facto new field of science. Similar to mathematics, physics or chemistry. This concept has a more theoretical and philosophical meaning than a practical one and has existed since the mid-1950s, when the first unsuccessful attempts at mathematical modeling of the functioning of the human brain were made. I will give you my favorite, simple definition:

AI is the science of how to build a machine that will be able to perform tasks in a way that can be called intelligent

Currently operating artificial intelligence systems cover narrow domains and are often referred to as “narrow” or “applied”. For example, artificial intelligence is able to win GO or chess against a grandmaster. It can handle speech and writing very well. It will enable quick and effective recognition of the surrounding environment, e.g. objects on and around the road. However, the holy grail of AI is the so-called “general AI”, which is to enable the machine to effectively solve a large group of various problems, i.e. to behave largely like a human. So far, no one has even managed to get close to this goal. Some predict that the first general AI models will emerge as a combination of multiple “narrow” models, which seems to be the only reasonable and feasible route at the moment. Although who knows what is happening in the privacy of the strictly protected rooms of Chinese and American corporations. Fortunately, the road to the real Skynet is very far, if at all achievable.

Machine learning is only a part of the broader scientific domain of artificial intelligence. The most distinctive feature of machine learning is the ability to automatically learn and improve through the acquisition of new knowledge to solve a problem, but without implementing a dedicated algorithm.

I find this last part of the sentence particularly important in distinguishing machine learning from other “intelligent” software. You can write a very effective, dedicated algorithm that will predict the occurrence of precipitation based on the current look of the sky, but it will not be machine learning. Machine learning will be collecting a large number of photos of the sky, along with information whether there has been precipitation or not, and processing of this data by one of the machine learning algorithms (e.g. logistic regression, KNN, neural network, etc.), in order to obtain a model effectively predicting the occurrence of rainfall.
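Purely as an illustration of the difference, here is a minimal sketch of such a machine learning approach – the “sky photos” are replaced by made-up two-number descriptors (say, cloud cover and brightness), so the data and the result mean nothing beyond showing the mechanics:

import numpy as np
from sklearn.linear_model import LogisticRegression

# each row is a made-up sky descriptor: [cloud cover, brightness]
X = np.array([[0.9, 0.2], [0.8, 0.3], [0.7, 0.4], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])
y = np.array([1, 1, 1, 0, 0, 0])   # 1 = it rained afterwards, 0 = it did not

model = LogisticRegression().fit(X, y)   # the algorithm builds the model from data
print(model.predict([[0.85, 0.25]]))     # prediction for a new, unseen "sky"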

Neural networks are one of the algorithms / approaches of machine learning. This algorithm uses mathematical structures whose behavior resembles the actions of human neurons. Such artificial neurons, connected in a network, receive signals at the input, perform a relatively simple operation on them and emit an output signal, which is passed to the next layer of neurons or to the output of the network as its result. The network input can be, for example, the values of individual pixels from a photo of the sky, or the data of a credit application, appropriately processed into digital form.

Each input signal to an artificial neuron has a weight that either strengthens or weakens the output of that neuron. During network training, using the feedback, the neural network training algorithm modifies the weights assigned to individual neurons so that the network response is burdened with the lowest possible error. The error is calculated by comparing the network response to the dataset (e.g. sky appearance) with the correct response – this is the so-called supervised learning. Computational methods, including gradient descent algorithms, are used to implement the feedback, i.e. the appropriate correction of neuron weights, thanks to which the network “learns” to better recognize data. All this in the hope that by showing new data to the network, e.g. the current appearance of the sky, not previously seen by the network, we will get a correct forecast.

And finally we come to deep learning – a fascinating branch of machine learning that uses very complex neural networks, huge – let me emphasize it again – HUGE amounts of data, and computing power that until recently was unattainable, to teach the computer things that seemed within reach of the human mind only. In recent years, deep learning has proved to be very effective in solving problems related to image recognition, speech, broadly understood interaction with the environment (robots, cars), but also in medicine, game testing and fraud detection.

The relationship between the above-mentioned terms is illustrated in the figure below.

Artificial intelligence - a few key concepts

Supervised vs. unsupervised learning

Supervised learning is like a teacher working with a child. A child receives a set of pictures with information about what is in each picture: here is a horse, here is a tree, here is a car – let’s call this stage the training phase. After enough trials, you can ask the child “what’s in the picture?”, showing a slightly different car, a different species of tree, or a horse of a different color. We could call this the testing phase – checking whether the child has learned the concepts correctly and, more importantly, is able to generalize the acquired knowledge. In other words, is it still able to recognize the object as a tree when seeing a tree, but not the same tree as in the training phase? If it is not, we repeat the training process until it is successful.

For humans, supervised learning is natural, and we are very good at it. In the digital world, supervised learning is based on a training dataset which, on the one hand, shows the input data and, on the other, its description, often called a label or target. The input data is usually laboriously prepared in advance by a human. For example, a human describes each of the tens of thousands of pictures that she/he intends to use in the training process. The algorithm learns to classify or predict values based on the input data of the training set and labels. Training takes place on the basis of sequential trials, error evaluation and feedback so that the algorithm can correct its operation, if necessary. Then, after a machine is trained, we can present the dataset without a target and ask for a class or value prediction.

Unsupervised learning is like working without a teacher. We receive a set of data and our task is to group them or find certain structures and dependencies (sometimes hidden), without first indicating what we are looking for or what classes we should divide the data into. Imagine that we have received a group of photos of various microbes and, without knowing biology, we have to group them, taking into account, for example, similarities and differences in appearance or observable behavior.

The data used for unsupervised learning does not have to be labeled. Therefore, it is much easier to prepare than data for supervised learning. Unfortunately, unsupervised learning can be effectively used only in fairly narrow applications. Examples include clustering, which enables the division of a dataset into groups of similar data, or so-called autoencoders, thanks to which – without knowing the specifics of the analyzed dataset – we can compress it into a form that retains the most valuable information (e.g. for noise removal).

Types of datasets and overfitting

While discussing the machine learning types above, I referred to two types of datasets: training and test. In fact, we have one more – the validation dataset. It is worth taking a moment and look closer at those datasets.

The division into such three sets can be well explained by analogy to learning the subject in college. In the first phase, students are presented with material during lectures. Students listen to lecturers, take notes, try to understand the examples given. In other words, they work on the training dataset. I will ignore the fact that the lectures are sometimes optional, 😉 here it is not the best analogy to machine learning, where the training phase simply cannot be skipped.

Then, students go to exercise classes where they solve problems on their own, but still under the supervision of the lecturer. This is work on the validation dataset. It is meant to show how well the student has absorbed the knowledge from the lecture, whether his or her current skills are sufficient, or whether the material should be explained further. The training process on the training (lecture) and validation (exercise) datasets is repeated many times. In the case of machine learning, it has to be repeated much more often, even tens or hundreds of thousands of times. After all, it has long been known that no machine is as effective as a student just before the exam. 😉 At the end of the semester comes that well-liked moment – the exam. Final confirmation of the skills acquired during the lectures and exercises. Students receive material that they have not seen before and have to demonstrate the appropriate skills on their own. This is work on the test dataset.

In the problem that we want to solve using a selected machine learning algorithm, we must have a dataset at our disposal – often a very large one. In practice, such a dataset is divided into a training dataset and a test dataset, usually in the proportion of 80/20. This procedure is necessary to be able to assess, after training the model, whether it has acquired the ability to generalize – in other words, whether it will also correctly evaluate data that it has not seen during training. If we do not set aside the test dataset, we will not be able to determine whether the trained model solves the task well, or whether it has simply learned to recognize data from the training dataset and will not do so well when presented with completely new data.
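In scikit-learn this split is a one-liner. A minimal sketch with placeholder data (the random arrays below merely stand in for a real dataset):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)           # placeholder features
y = np.random.randint(0, 2, 1000)      # placeholder labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
len(X_train), len(X_test)
>>> (800, 200)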

By the way: the phenomenon that occurs when the model remembers the training dataset, instead of generalizing it, is called overfitting. The algorithm has over-adjusted to the data and thus cannot handle data outside the training dataset as well.

Often, the validation dataset is separated from the training dataset. The algorithm does not learn on the validation dataset, but only on the training one, and the validation dataset is used to check the effects during training process, in order to properly adjust the model hyperparameters (more on them below). There is generally a lot of confusion with the definition of the validation dataset. Different specialists define it a bit differently. Different libraries implement splitting differently. But the general sense remains the same: the validation dataset is to evaluate the training process as it progresses.

Can we use the validation dataset as a test one to evaluate the effectiveness of the trained model? Unfortunately not. Although the validation dataset is not directly used in the training process, by influencing the model’s hyperparameters, it has a significant impact on the way the model works and thus is not suitable for an objective assessment of its effectiveness.

What are hyperparameters actually?

Since we have already referred to hyperparameters several times, it is worth taking a closer look at them. Before we move on to hyperparameters, however, it is worth determining what the model parameters are, because distinguishing between them is often problematic.

The primary goal of machine learning is to process a dataset in such a way that a certain mathematical model is created that we can then use for prediction. During the construction of the model, we use optimization algorithms so that the model predictions become more and more accurate. These algorithms work largely by tuning a series of variables describing the model. These variables are called model parameters. They are not set manually, but result from the analysis of the training dataset and their automatic optimization by the algorithm.

Examples of model parameters are the weights assigned to individual neurons in a neural network; in the case of logistic regression, these will be the ω coefficients of the following linear combination, which defines a hyperplane dividing the data into classes:

y = ω1*x1 + ω2*x2 + … + ωn*xn

Unlike model parameters, hyperparameters are external to the model and are not derived from the data in the training dataset. They are usually determined by the engineer supervising the construction of the model, and their optimal value is unknown. They are most often initially determined on the basis of best practices and then their values can be adjusted to the training outcomes accordingly.

Examples of hyperparameters include k, i.e. the number of neighbors in the k-nearest neighbors algorithm, the number of neural network layers and the number of neurons in each layer, or the learning rate, which determines how quickly we move through the loss function to find its minimum.

To sum up, the model parameters are set automatically on the basis of the training dataset, and the hyperparameters are set manually based on good practices and the results achieved by the model, and their primary role is to de facto optimize the algorithm so that it determines the best model parameters.
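As a side note, scikit-learn’s API reflects this distinction quite literally: hyperparameters are passed to the estimator’s constructor, while learned model parameters show up after fitting as attributes with a trailing underscore. A short illustrative sketch with dummy data:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 3)                       # dummy data, illustration only
y = (X[:, 0] > 0.5).astype(int)

model = LogisticRegression(C=0.5, max_iter=200)  # hyperparameters, chosen by the engineer
model.fit(X, y)
print(model.coef_, model.intercept_)             # model parameters, learned from the data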

 

I hope this post has introduced you to the basic concepts of artificial intelligence and machine learning. Good understanding of them will certainly help you in effective further learning.


Do you have any questions? You can ask them in comments.

Did you like my post? I will be grateful for recommending it.

See you soon on my blog, discussing another interesting topic!


Are you an NBA fan? Check my free NBA Games Ranked service and enjoy watching good games only.


The post Artificial intelligence – a few key concepts appeared first on AI Geek Programmer.

k-nearest neighbors for handwriting recognition https://aigeekprogrammer.com/k-nearest-neighbors-handwriting-recognition/ https://aigeekprogrammer.com/k-nearest-neighbors-handwriting-recognition/#respond Mon, 18 May 2020 19:37:19 +0000 https://aigeekprogrammer.com/?p=8544 If I had to indicate one algorithm in machine learning that is both very simple and highly effective, then my choice would be the k-nearest neighbors (KNN). What’s more, it’s not only simple and efficient, but it works well in surprisingly many areas of application....

If I had to indicate one algorithm in machine learning that is both very simple and highly effective, my choice would be the k-nearest neighbors (KNN). What’s more, it is not only simple and efficient, but it works well in surprisingly many areas of application. In this post I decided to check its effectiveness in handwriting recognition. KNN may not be the first thing that comes to mind when you try to recognize handwriting, but as it turns out, using this method is not unreasonable. Intrigued just like me? I invite you to read further.

From this post you will learn:

  • What are the most important features of KNN and in which classes of problems can it be used?
  • How does the k-nearest neighbors algorithm work?
  • How can you effectively classify handwriting using KNN?

What do you definitely need to know about k-nearest neighbors?

KNN is a very simple algorithm to understand and implement. At the same time, it offers surprisingly high efficiency in practical applications. Its simplicity, which is certainly its advantage, is also a feature that makes it sometimes unnecessarily overlooked when solving more complex problems. Interestingly, potential use of the algorithm is very wide. We can use it to provide both unsupervised and supervised learning, and the latter in both: regression and classification.

Reading the formal definition, one can see that KNN is characterized as a non-parametric and “lazy” algorithm. Non-parametric in this case does not mean that we have no hyperparameters – we have at least one, the number of neighbors. It only means that the algorithm does not assume that the data follows a particular distribution. This is a very useful property, because in the real world the available data is often not easily linearly separable, or does not follow a normal or any other well-defined distribution. Hence KNN can be useful, especially in cases where the relationship between the input data and the result is unusual or very complex, which means that classic methods such as linear or logistic regression may not be able to do the job.

Lazy in turn means that the algorithm does not build a model that generalizes a given problem in the learning phase. Learning, you can say, is deferred until the query about new data goes into the model. Two conclusions can be drawn from the above. Firstly, the testing / prediction phase will last longer than the “training” phase. Secondly, in most cases, the KNN will need all the data from the training set to be available during the prediction – this, in turn, imposes quite large requirements on the amount of memory for large datasets.

How does it work – a simple example.

The principle of the algorithm is best explained graphically and in two-dimensional space. Suppose we have a dataset described by two features (represented here by coordinates on a plane) and classified into two classes: circles and triangles.

k-nearest neighbors - a simple explanation

As you can see, this dataset is quite easily linearly separable, so it would probably be enough to use e.g. logistic regression for binary classification to get great results, but for the purpose of explaining the principles of the algorithm, this example will be perfect.

Suppose now that with such a dataset, we want to classify the new object (the red spot) as a triangle or circle:

k-nearest neighbors - a simple explanation

KNN works in such a way that it looks at “the closest area” of the new data and checks what class of objects are there and on the basis of this decides to describe it either as a triangle or as a circle. Okay, but what do we mean by “the closest area”? And here comes the k parameter, which determines how many closest neighbors we want to look at. If we look at one (which is a rather rare value of the k), it turns out that our new data is a triangle.

KNN classification for k equal to 1

If we consider three neighbors, our red sample will change into a circle – just like in quantum physics 😉

KNN classification for k equal to 3

For k = 5 we are again dealing with the classification as a triangle, because in the immediate vicinity there are three triangles and two circles:

KNN classification for k equal to 5

There are two points to note here. As you can see, I adopted odd k values only. This is justified because we will not come across a case in which a new data point is lying in the vicinity of two circles and two triangles. This does not mean that such k values are not used in practice. Simply for the above example, this would unnecessarily complicate our reasoning. Secondly, the new data appeared on the border between two classes. Therefore, it can be either a circle or a triangle, depending on the value of the parameter k. However, if a new object appears in a different, more “triangular” place, then regardless of the value of the parameter k, it will remain a triangle:

k-nearest neighbors - classification example

A slightly deeper look at the k-nearest neighbors algorithm

Although KNN is extremely simple, there are a few issues that are worth exploring, because as usual the devil is in the details.

How do we calculate the distance?

It seems quite obvious, but it is worth emphasizing at this point that the necessary condition to apply the k-nearest neighbors algorithm on a given dataset is the ability to calculate the distance between elements of this set. It does not have to be the Euclidean distance, but the inability to calculate the distance basically eliminates the KNN from the analysis of such a dataset.

In the two-dimensional Euclidean space, which we considered above, the distance can be calculated from the Pythagorean theorem. However, since we rarely analyze a dataset with only two features (i.e. two-dimensional), we should generalize the problem of distance to n-dimensional space. In that case the distance function can be defined by the following metric:

d(x, y) = √( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )
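For reference, the same metric takes one line of numpy (equivalent to numpy.linalg.norm of the difference of the two vectors):

import numpy as np

def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))   # same as np.linalg.norm(x - y)

euclidean_distance(np.array([0, 0]), np.array([3, 4]))
>>> 5.0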

This is the most common way of calculating distance, but not the only one. The scikit-learn library, which we’ll use later, offers the possibility of using several different metrics, depending on the type of data and the type of task.

How to choose the value of k?

The number k is a hyperparameter and, as with other hyperparameters in machine learning, there is no rule or formula for determining its value. Therefore, it is best to determine it experimentally by assessing the accuracy of prediction for different k values. However, for large datasets it can be time consuming. So some tips could be helpful:

  • Quite obviously, small k values carry a greater risk of incorrect classification.
  • On the other hand, large k values give more reliable results, but are much more computationally demanding. In addition, a k value set to, say, 80 intuitively conflicts with the idea of the nearest neighbors – after all, the point is to look at the immediate surroundings of the new data.
  • In binary classification it is good to use odd values to avoid having to make a random selection in the event of an even split of votes.
  • One of the commonly recommended k values to try is the square root of the number of elements in the training set. For example, if our training set has 1000 elements, then we should consider 32 as one of the k values; for 50,000 it will be 224. Personally, I am not convinced by this method. I did not find its origin or a scientific explanation (which does not mean that one does not exist), and I believe the values obtained this way are too high.
  • The method that seems more sensible to me is the fourth root of the number of samples in the training set. For 500 samples it gives 5, for 1000 it is k = 6, for 50,000 it is 15, and we would use k values above 30 only for datasets with more than a million samples.
  • There are also methods that vote not on the basis of the k nearest neighbors, but on the basis of votes collected from data within a specified radius of the examined point. Such a classifier – Radius Neighbors Classifier – is also offered by the scikit-learn library.
  • However, it should be clearly stated that the optimal value of k depends strongly on the data and the problem at hand, so the above guidelines and formulas should be approached with caution and the best value should be determined experimentally.

What if the number of neighbors is the same in two or more classes?

This is an interesting case, because in such a situation it is equally likely that the examined sample belongs to any of these two or more classes. The solution can be a random selection, or assigning a weight to each of the nearest neighbors: the closer a neighbor is, the more its vote counts. The scikit-learn library offers a special parameter “weights”, which, set to ‘uniform’, assumes that each neighbor has the same weight, and set to ‘distance’ assigns each neighbor a weight inversely proportional to its distance from the examined data point.

And what about calculation efficiency?

With the increase in the amount of data (N) in the training set, the efficiency of the calculations becomes more important. The least sophisticated approach is ‘brute force’, i.e. calculating the distance between the examined point and all points in the training set. The cost of such a method grows linearly with N for every single query, so for sets of several thousand samples and larger, using ‘brute force’ quickly becomes impractical or unreasonable. Hence, more sophisticated methods are often used, based on tree structures, which boil down to pre-selecting points lying close to the examined data point. To put it simply: if the X point is very far from the Y point and the Z point is very close to the Y point, it can be assumed that the X and Z points are far apart and there is no need to calculate the distance between them. Scikit-learn implements two such structures: K-D Tree and Ball Tree. The library decides which one to use based on the characteristics of the dataset, unless we intentionally impose one of the three abovementioned methods. This is described in detail in this section.
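
If you want to force a particular search structure yourself, it can be selected when creating the classifier – a minimal, illustrative sketch:

from sklearn.neighbors import KNeighborsClassifier

knn_auto = KNeighborsClassifier(n_neighbors=5, algorithm='auto')        # the library decides
knn_kdtree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')   # K-D Tree
knn_ball = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')   # Ball Tree
knn_brute = KNeighborsClassifier(n_neighbors=5, algorithm='brute')      # brute force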

k-nearest neighbors in the handwriting classification

Okay, after a solid dose of theory let’s see how KNN deals with a practical problem. The scikit-learn library provides extensive tools for the k-nearest neighbors algorithm – it would be a shame not to use them. As a warm-up, we’ll take one of the toy datasets included in scikit-learn – the Digits dataset.

For the purposes of the following task, it’s best to create a new virtual conda environment. If you want to learn more about virtual environments, I invite you to read this post. On Windows, just run Anaconda prompt, go to the directory where you want to save the scripts and execute three simple commands:

conda create --name <name>
conda activate <name>
conda install numpy matplotlib scikit-learn jupyter keras

The environment configured in this way can be used by running jupyter notebook or by configuring a project based on this environment in your favorite IDE.

When performing the classification, we will have to use a number of components that must be imported at the beginning of the script. It is also a good test of whether the environment has been created and activated correctly:

from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split
from keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np
from random import randint
import time

In the first step, we imported test datasets, including Digits, as well as the knn classifier. Later in the post, we will also need a function to divide the dataset into a training and test one, and a more advanced MNIST dataset, which can be downloaded from Keras. Matplotlib will provide us with graphical display of results, and of course we will use numpy for some operations on the dataset. Finally, two imports of helper functions related to random number generation and time measurement.

If all imports have been made correctly, we can proceed with further coding. To download the Digits dataset we’ll use a helper method offered by scikit-learn:

# load Digits data set divided into data X and labels y
X, y = datasets.load_digits(return_X_y=True)

# check data shapes - data is already flattened
print("X shape:", X.shape[0:])
print("y shape:", y.shape[0:])
>>> X shape: (1797, 64)
>>> y shape: (1797,)

The load_digits method, properly parameterized, returned the data to the variable X and the labels to the variable y. Let’s see a dozen or so randomly selected elements of the Digits dataset:

# let's see some random data samples.
pics_count = 16
digits = np.zeros((pics_count,8,8), dtype=int)
labels = np.zeros((pics_count,1), dtype=int)
for i in range(pics_count):
  idx = randint(0, X.shape[0]-1)
  # as data is flattened we need them to be reshaped to the original 2D shape
  digits[i] = X[idx].reshape(8,8)
  labels[i] = y[idx]

# then we print them all
fig = plt.figure()
fig.suptitle("A sample from the original dataset", fontsize=18)
for n, (digit, label) in enumerate(zip(digits, labels)):
  a = fig.add_subplot(4, 4, n + 1)
  plt.imshow(digit)
  a.set_title(label[0])
  a.axis('off')
fig.set_size_inches(fig.get_size_inches() * pics_count / 7)
plt.show()

scikit-learn digits

Well, it may not look like graphics in the World of Tanks, 😉 but most shapes are fairly simple to classify with a human eye. By the way, I wonder how many digits we would be able to quickly classify ourselves and whether this classification would be better than the results obtained by the algorithm. Hmmm…

We divide the dataset into a training subset (70% of the data) and a test subset (30% of the data) to simulate the situation in which we receive new data – the test set will play the role of the new data – and then we classify it based on the training set. Let’s remember that in the case of KNN we do not have a classic learning phase, because the classifier is “lazy”:

# splitting into train and test data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Let’s check the shapes of the variables obtained this way:

# checking shapes
print("X train shape:", X_train.shape[0:])
print("y train shape:", y_train.shape[0:])
print("X test shape:", X_test.shape[0:])
print("y test shape:", y_test.shape[0:])
>>> X train shape: (1257, 64)
>>> y train shape: (1257,)
>>> X test shape: (540, 64)
>>> y test shape: (540,)

Having the data loaded into the variables, we will define a classification function (lets_knn), which we’ll later also use for another dataset. The function accepts both data and labels as input, creates a classifier, processes the training set (knn.fit), and then performs the prediction and determines the quality of the classification (accuracy). Finally, we display the results of the classification and also identify which data points were classified incorrectly (wrong_pred), with what predicted label (wrong_labels) and what the correct one is (correct_labels). The function also accepts, as parameters, the number of neighbors, the method of determining distance weights, and a flag specifying whether to present the incorrect predictions in graphic form:

def lets_knn(X_train, y_train, X_test, y_test, n_neighbors=3, weights='uniform', print_wrong_pred=False):
    t0 = time.time()
    # creating and training knn classifier
    knn = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    knn.fit(X_train, y_train)
    t1 = time.time()

    # predicting classes and comparing them with actual labels
    pred = knn.predict(X_test)
    t2 = time.time()
    # calculating accuracy
    accuracy = round(np.mean(pred == y_test)*100, 1)

    print("Accuracy of", weights ,"KNN with", n_neighbors, "neighbors:", accuracy,"%. Fit in",
          round(t1 - t0, 1), "s. Prediction in", round(t2 - t1, 1), "s")

    # selecting wrong predictions with correct and wrong labels
    wrong_pred = X_test[(pred != y_test)]
    correct_labels = y_test[(pred != y_test)]
    wrong_labels = pred[(pred != y_test)]

    if print_wrong_pred:
        # then we print the first 16 of them
        fig = plt.figure()
        fig.suptitle("Incorrect predictions", fontsize=18)
        # in order to print different sized photos, we need to determine to what shape we want to reshape
        size = int(np.sqrt(X_train.shape[1]))
        for n, (digit, wrong_label, correct_label) in enumerate(zip(wrong_pred, wrong_labels, correct_labels)):
            a = fig.add_subplot(4, 4, n + 1)
            plt.imshow(digit.reshape(size,size))
            a.set_title("Correct: " + str(correct_label) + ". Predicted: " + str(wrong_label))
            a.axis('off')
            if n == 15:
                break
        fig.set_size_inches(fig.get_size_inches() * pics_count / 7)
        plt.show()

Let’s use the above function to make predictions for the Digits dataset:

lets_knn(X_train, y_train, X_test, y_test, 5, 'uniform', print_wrong_pred=True)
>>> Accuracy of uniform KNN with 5 neighbors: 98.0 %. Fit in 0.0 s. Prediction in 0.1 s

The function was called for k = 5 neighbors, with uniform weights (‘uniform’) regardless of distance, and obtained a decent accuracy of 98%.

Let’s look at the data that was classified incorrectly:

KNN - incorrect classification for Digits dataset

Well, I don’t know. 😉 I would disagree with some “correct” labels or at least I would have to think about the correct answer a little bit longer. As you can see, KNN did its job pretty well, as the above are all incorrect predictions on 540 test samples.
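
If you'd like to go one step beyond the accuracy number, a confusion matrix shows which digits get mistaken for which. A minimal sketch, assuming the X_train, y_train, X_test, y_test variables from the Digits example above (the prediction is simply recomputed here, since pred is local to lets_knn):

from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# rows are the true digits, columns the predicted ones
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(confusion_matrix(y_test, knn.predict(X_test)))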

Let’s check now how our classifier will deal with the much larger and more demanding MNIST dataset. If anyone wants to know a little more about the MNIST dataset, I wrote about it at the very beginning of the post about handwriting classification. The simplest way to load the MNIST dataset is to use the loader from the Keras library:

# now let's play with MNIST data set
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# checking initial shapes
print("X train initial shape:", X_train.shape[0:])
print("y train initial shape:", y_train.shape[0:])
print("X test initial shape:", X_test.shape[0:])
print("y test initial shape:", y_test.shape[0:])
>>> X train initial shape: (60000, 28, 28)
>>> y train initial shape: (60000,)
>>> X test initial shape: (10000, 28, 28)
>>> y test initial shape: (10000,)

As you can see, we have 60000 samples in the training set and 10000 in the test set. In addition, each element of the set is a 28 x 28 image, which, compared to the Digits dataset with its 8 x 8, gives roughly 12 times more features per sample (784 vs. 64). Classification on such an amount of data will take a lot of time. As discussed above, the k-nearest neighbors algorithm is a lazy classifier, which means that most calculations take place in the prediction phase. Therefore, it is worth limiting the amount of data in the test set, which should significantly speed up the whole process. Of course, if you have a sufficiently strong computing environment or a lot of time, you can experiment with larger values. Unfortunately, the scikit-learn library does not offer GPU support:

# Reducing the size of testing data set, as it's the most time-consuming
X_train = X_train[:60000]
y_train = y_train[:60000]
X_test = X_test[:1000]
y_test = y_test[:1000]

We have loaded the data as two-dimensional 28 x 28 arrays. The classifier expects each sample to be a flat vector, hence we must reshape the data:

# reshaping
X_train = X_train.reshape((-1, 28*28))
X_test = X_test.reshape((-1, 28*28))

# checking shapes
print("X train shape:", X_train.shape[0:])
print("y train shape:", y_train.shape[0:])
print("X test shape:", X_test.shape[0:])
print("y test shape:", y_test.shape[0:])
>>> X train shape: (60000, 784)
>>> y train shape: (60000,)
>>> X test shape: (1000, 784)
>>> y test shape: (1000,)

We are now ready to run our classifier. To make it more interesting, we’ll run it for different k values and for two ways of determining the weights of the nearest neighbors:

# lets run it with different parameters to check which one is the best
for weights in ['uniform', 'distance']:
  for n in range(1,11):
     lets_knn(X_train, y_train, X_test, y_test, n_neighbors=n, weights=weights, print_wrong_pred=True)

>>> Accuracy of uniform KNN with 1 neighbors: 96.2 %. Fit in 49.2 s. Prediction in 63.5 s
>>> Accuracy of uniform KNN with 2 neighbors: 94.8 %. Fit in 50.4 s. Prediction in 61.4 s
>>> Accuracy of uniform KNN with 3 neighbors: 96.2 %. Fit in 47.6 s. Prediction in 61.4 s
>>> Accuracy of uniform KNN with 4 neighbors: 96.4 %. Fit in 46.1 s. Prediction in 61.0 s
>>> Accuracy of uniform KNN with 5 neighbors: 96.1 %. Fit in 47.2 s. Prediction in 60.7 s
>>> Accuracy of uniform KNN with 6 neighbors: 95.9 %. Fit in 46.1 s. Prediction in 60.7 s
>>> Accuracy of uniform KNN with 7 neighbors: 96.2 %. Fit in 46.4 s. Prediction in 60.8 s
>>> Accuracy of uniform KNN with 8 neighbors: 95.8 %. Fit in 47.5 s. Prediction in 60.8 s
>>> Accuracy of uniform KNN with 9 neighbors: 95.2 %. Fit in 46.3 s. Prediction in 61.8 s
>>> Accuracy of uniform KNN with 10 neighbors: 95.4 %. Fit in 50.8 s. Prediction in 62.8 s
>>> Accuracy of distance KNN with 1 neighbors: 96.2 %. Fit in 48.7 s. Prediction in 63.1 s
>>> Accuracy of distance KNN with 2 neighbors: 96.2 %. Fit in 53.2 s. Prediction in 63.5 s
>>> Accuracy of distance KNN with 3 neighbors: 96.5 %. Fit in 51.5 s. Prediction in 63.0 s
>>> Accuracy of distance KNN with 4 neighbors: 96.4 %. Fit in 50.3 s. Prediction in 60.5 s
>>> Accuracy of distance KNN with 5 neighbors: 96.4 %. Fit in 46.0 s. Prediction in 60.4 s
>>> Accuracy of distance KNN with 6 neighbors: 96.5 %. Fit in 48.5 s. Prediction in 61.5 s
>>> Accuracy of distance KNN with 7 neighbors: 96.4 %. Fit in 48.6 s. Prediction in 61.5 s
>>> Accuracy of distance KNN with 8 neighbors: 96.4 %. Fit in 48.3 s. Prediction in 61.5 s
>>> Accuracy of distance KNN with 9 neighbors: 95.7 %. Fit in 48.1 s. Prediction in 61.6 s
>>> Accuracy of distance KNN with 10 neighbors: 95.7 %. Fit in 48.4 s. Prediction in 61.5 s

The results are very similar for all configurations. I did not paste the results for larger k, but raising the parameter value, even to 20, does not change the situation. It seems that for the MNIST dataset slightly better results are obtained with ‘distance’ weighting – for which we recorded the two best classification results: 96.5% for k = 3 and the same accuracy for k = 6. It is also worth noting the much longer processing times than in the case of the Digits dataset. The “training” phase takes about 49 s, and the prediction phase about 61 s.

Finally, let’s take a look at some examples of misclassification:

k-nearest neighbors - MNIST dataset classification

With the exception of a few items: (row 1, column 3), (2, 1), (3, 3), (4, 3) and (4, 4), errors are slightly more obvious than in the case of the Digits dataset. Nevertheless, a classification of 96.5% for such a simple algorithm can make a good impression.

In this post I have introduced you to the theoretical foundations behind the k-nearest neighbors algorithm. We also used the algorithm in practice, solving two classification problems. k-nearest neighbors, despite its simplicity and ease of use, gives very good results on a number of problems, and I hope I have encouraged you to use this method more often.


Do you have any questions? You can ask them in comments.

Did you like my post? I will be grateful for recommending it.

See you soon on my blog, discussing another interesting topic!

Anaconda cron on Amazon Linux https://aigeekprogrammer.com/anaconda-cron-on-amazon-linux/ Sat, 14 Mar 2020 16:10:00 +0000

If you are a Python programmer and use the AWS and Anaconda environments, sooner or later you will come across the need to run a Python script as a cron process on Amazon Linux in an Anaconda environment. This shouldn’t be difficult, right? Hmmm, unfortunately it is. Since I spent some time configuring cron on Amazon Linux EC2 so that it would use an Anaconda virtual environment, and it wasn’t trivial, I would like to share how it can be done simply and quickly.

There is, of course, quite a lot of content on the web regarding the configuration of Python cron processes. Some even apply to EC2 and Amazon Linux, but somehow none of these posts completely solved all the problems I encountered.

OK, let’s define individual steps that are necessary to configure Python cron on AWS:

  • We need to have SSH access to Amazon Linux running on AWS EC2 – a trivial task, I will not describe it here.
  • Anaconda installed and initiated – installation is also simple. At the end of the process, the installer asks whether to initialize conda – select “yes”. As a result, after logging in via SSH, we always have an active environment (base). This can be annoying for some, but it simplifies many issues if the operating system user is mainly used to execute Python scripts. It should look like this after logging in.

anaconda cron on amazon linux AWS

  • First, we create a virtual conda environment (let’s name it test-env). How to do this with a few simple commands is described here. You can theoretically execute a script in the base environment, but this is strongly discouraged. It’s worth simply creating a dedicated environment and installing the modules we need.
  • Then we create a Python script (let’s name it test.py) that we want to run in cron. For testing purposes it will display information about which conda environment is active. Wait, display? Cron process? Of course, the cron process does not have a terminal, but its output will then be redirected to a file and there you will be able to view the result of execution for testing purposes.

import os
print(os.environ['CONDA_DEFAULT_ENV'])
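
If you want each cron run to be easy to spot in the log file, a slightly extended, hypothetical variant of test.py could also print a timestamp (not part of the original setup, just a convenience):

# hypothetical variant of test.py: print a timestamp next to the active conda environment,
# so consecutive cron runs are easy to tell apart in test.log
import os
from datetime import datetime

print(datetime.now().isoformat(), "-", os.environ.get('CONDA_DEFAULT_ENV', 'no conda environment active'))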

  • Now the most important step – not so obvious, although ultimately trivial. We create a shell script (let’s name it test.sh) in which we activate the test-env environment and run the test.py script. The first line is crucial for proper operation, the others are quite obvious. It assumes the default Amazon Linux user (ec2-user) and an Anaconda installation in the anaconda3 directory. Remember to check and, if necessary, grant Unix execute permissions to the script (e.g. chmod +x test.sh).

source /home/ec2-user/anaconda3/bin/activate
conda activate test-env
# If you want to check in the script which environment is active: echo $CONDA_DEFAULT_ENV
python test.py
conda deactivate

  • Almost at the end we create a file with the definition of the cron job (let’s name it test.cron). The following entry runs our test.sh (and thus, finally, test.py) every day at 10 o’clock server time – note that Amazon Linux instances use UTC by default, so adjust the hour accordingly unless you change the instance’s time zone. The script’s output is redirected to the test.log file, where the execution result can be viewed.

00 10 * * * bash test.sh >> /home/ec2-user/test.log 2>&1

  • The last task for us is to register the job in the system cron.

crontab test.cron

  • The current cron settings can be checked with the command

crontab -l

And that’s all. I invite you to read my other posts and recommend my website – thanks.

Convolutional neural network 4: data augmentation https://aigeekprogrammer.com/convolutional-neural-network-4-data-augmentation/ Sat, 14 Mar 2020 12:39:24 +0000

In the previous three parts of the tutorial, we learned about convolutional networks in detail. We looked at the convolution operation, the convolutional network architecture, and the problem of overfitting. In the classification of the CIFAR-10 dataset we achieved 81% on the test set. To go further we would have to change the architecture of our network, experiment with hyperparameters or get more data. I leave the first two solutions for you 😉 to experiment with, and in this part of the tutorial I want to feed our network with more data. I will use the so-called data augmentation, i.e. the artificial generation of large amounts of new data.

In the fourth part of the tutorial you will learn:

  • What is data augmentation?
  • How to use the data generator from the Keras library?
  • How to artificially generate new data for the CIFAR-10 set?
  • And how well will our model do on the set of artificially generated (augmented) data?

What is data augmentation?

As I mentioned in the previous part of the tutorial, if we are dealing with a closed dataset, i.e. one that cannot be significantly enlarged, or enlarging it would be very expensive, we can reach for so-called data augmentation. This is a particularly valuable technique for image analysis. Why? Because images tolerate minor modifications well: the modified versions will be new data for the algorithm, although to the human eye they remain basically the same. Moreover, such “minor modifications” occur in the real world. Standing in front of a car, we can look at it head-on or slightly from the side. It will still be the same vehicle, and it will definitely still be a car for our brain. For the algorithm, looking at the object from a different perspective is valuable information that allows the training process to generalize better.

What can we actually do with an image we want to artificially process? In theory, we have infinitely many options: we can slightly rotate the image in any direction and at any angle; shift it left, right, up or down; change its colors, or make other more or less subtle changes that will give the model tons of new data. In practice, a collection of tens of thousands of images can become a collection with millions of elements. This is a field where the possibilities are really great. As a curiosity: technologies related to autonomous vehicles are also trained on artificially generated datasets, e.g. using realistic game environments such as GTA.
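
To make this a bit more concrete: even before reaching for a dedicated tool, a couple of such modifications can be produced with plain numpy. A minimal sketch, assuming we already have an image loaded as a (height, width, 3) array (the random array below is just a stand-in for a real photo):

import numpy as np

# a stand-in for a real photo loaded as a (height, width, 3) uint8 array
image = np.random.randint(0, 256, size=(302, 403, 3), dtype=np.uint8)

flipped = np.fliplr(image)                  # mirror the image horizontally
shifted = np.roll(image, shift=30, axis=1)  # shift 30 pixels to the right (wrapping around)
darker = (image * 0.8).astype(np.uint8)     # a simple brightness change

Dedicated tools do the same kind of thing, only more conveniently and with many more options – which is exactly what the Keras generator described below offers.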

Generating data with the Keras library

The Keras library offers a set of helpful tools for generating data. Let’s try to process the previously seen picture of a building in Crete with this generator. First, we make the necessary imports and define a function that will load the image from a file and convert it to a numpy array:

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

def convert_image(file):
  return np.array(Image.open(file))

We load a picture that you can download here, and display the shape of the numpy array as well as the picture itself:

image = convert_image(r'<<path-to-the-file-on-disc>>\house-small.jpg')
image.shape
>>> (302, 403, 3)

plt.imshow(image)

Image for data augmentation

To generate data we will use the flow(x, y) method from the ImageDataGenerator class. To be able to use it correctly, we have to import the class – that’s pretty obvious – but also adapt the data accordingly. The method expects a tensor x whose first axis is the sample index. In our case there will be only one element, but the method still requires that axis. The y inputs are labels, which we don’t really need for this simple experiment, so we just pass a dummy one. Hence:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

x = np.expand_dims(image, 0)
x.shape
>>> (1, 302, 403, 3)

y = np.asarray(['any-label'])

Then we create the generator object and pass the appropriate parameters. There are tons of them available in the specification; below are only a few examples:

datagen = ImageDataGenerator(
  width_shift_range=0.2, # shift along the x axis
  height_shift_range=0.2, # shift along the y axis
  rotation_range=20,
  horizontal_flip=True,
  vertical_flip = True,
  rescale=1./255,
  shear_range=0.25,
  zoom_range=0.25,
)

Now we just call the flow(x, y) method, passing the prepared data to it, and then receive and display the generated images.

figure = plt.figure()
i = 0
for x_batch, y_batch in datagen.flow(x, y):
  a = figure.add_subplot(5, 5, i + 1)
  plt.imshow(np.squeeze(x_batch))
  a.axis('off')
  if i == 24: break
  i += 1
figure.set_size_inches(np.array(figure.get_size_inches()) * 3)
plt.show()

The result? Literally a bit upside down 😉 and a little “exaggerated”, because some parameters are set to high values. But it well reflects the capabilities of the generator. You can experiment with the settings yourself.

Data augmentation with ImageDataGenerator in Keras

Data augmentation on CIFAR-10

Armed with a generator, we can once again approach the classification of the CIFAR-10 dataset. Most of the code has already been discussed in the previous parts of the tutorial, so I will only provide it here for consistency and clarity. At the beginning we make the necessary imports, load the dataset and build a model:

import numpy as np

%tensorflow_version 1.x
import tensorflow

import matplotlib.pyplot as plt
%matplotlib inline

from tensorflow import keras
print(tensorflow.__version__)
print(keras.__version__)
>>> 1.15.0
>>> 2.2.4-tf
from tensorflow.keras.datasets import cifar10
(x_train,y_train), (x_test,y_test) = cifar10.load_data()
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.layers import Convolution2D, MaxPool2D, Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras import regularizers
from tensorflow.keras.utils import to_categorical
model = Sequential([
  Convolution2D(filters=128, kernel_size=(5,5), input_shape=(32,32,3), activation='relu', padding='same'),
  BatchNormalization(),
  Convolution2D(filters=128, kernel_size=(5,5), activation='relu', padding='same'),
  BatchNormalization(),
  MaxPool2D((2,2)),
  Convolution2D(filters=64, kernel_size=(5,5), activation='relu', padding='same'),
  BatchNormalization(),
  Convolution2D(filters=64, kernel_size=(5,5), activation='relu', padding='same'),
  BatchNormalization(),
  MaxPool2D((2,2)),
  Convolution2D(filters=32, kernel_size=(5,5), activation='relu', padding='same'),
  BatchNormalization(),
  Convolution2D(filters=32, kernel_size=(5,5), activation='relu', padding='same'),
  BatchNormalization(),
  MaxPool2D((2,2)),
  Convolution2D(filters=16, kernel_size=(3,3), activation='relu', padding='same'),
  BatchNormalization(),
  Convolution2D(filters=16, kernel_size=(3,3), activation='relu', padding='same'),
  BatchNormalization(),
  Flatten(),
  Dense(units=32, activation="relu"),
  Dropout(0.15),
  Dense(units=16, activation="relu"),
  Dropout(0.05),
  Dense(units=10, activation="softmax")
])
optim = RMSprop(lr=0.001)
model.compile(optimizer=optim, loss='categorical_crossentropy', metrics=['accuracy'])

After preparing and successfully compiling the model, we define the generator. We assume that the data will be rotated by up to 10 degrees; we allow horizontal flips, but not vertical ones, in order not to artificially put things upside down. The generator will also shift the images vertically and horizontally by up to 10%. A small zoom and shear are also allowed. Let’s remember that the images are small, and major modifications could make them difficult to recognize even for a person:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
  rotation_range=10,
  horizontal_flip=True,
  vertical_flip = False,
  width_shift_range=0.1,
  height_shift_range=0.1,
  rescale = 1. / 255,
  shear_range=0.05,
  zoom_range=0.05,
)

We also need one-hot encoding for training and test labels. We set the size of the batch and the generator is basically ready to use:

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

batch_size = 64
train_generator = datagen.flow(x_train, y_train, batch_size=batch_size)

The above generator will be the source of data for the training process. But what about the validation set that will allow us to track progress? Well, we need to define a separate generator. However, this one will not modify the source images in any way:

datagen_valid = ImageDataGenerator(
  rescale = 1. / 255,
)

x_valid = x_train[:100*batch_size]
y_valid = y_train[:100*batch_size]

x_valid.shape[0]
>>>6400

valid_steps = x_valid.shape[0] // batch_size
validation_generator = datagen_valid.flow(x_valid, y_valid, batch_size=batch_size)

As you can see above, the dataset that the training process will use for validation contains 100 batches (100 × batch_size = 6400 samples). We also need to calculate the number of validation steps – both will be needed to start the training.

history = model.fit_generator(
  train_generator,
  steps_per_epoch=len(x_train) // batch_size,
  epochs=120,
  validation_data=validation_generator,
  validation_freq=1,
  validation_steps=valid_steps,
  verbose=2
)

Note that we are not using the fit() method as before, but the fit_generator() method, which accepts a training data generator and (optionally) a validation data generator. With so much data, we will train for 120 epochs instead of 80, hoping to avoid overfitting.
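
A side note: in the TensorFlow version used here fit_generator() is the way to go, but in newer releases (roughly TensorFlow 2.1 and later) fit() accepts generators directly and fit_generator() is deprecated. Treat the following as a sketch for those newer versions, not as what was actually run here:

# equivalent call for TensorFlow >= 2.1, where fit() accepts generators directly
history = model.fit(
  train_generator,
  steps_per_epoch=len(x_train) // batch_size,
  epochs=120,
  validation_data=validation_generator,
  validation_freq=1,
  validation_steps=valid_steps,
  verbose=2
)

Returning to the fit_generator() run above, here is the training output: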

>>> Epoch 1/120
>>> Epoch 1/120
>>> 781/781 - 49s - loss: 1.8050 - acc: 0.3331 - val_loss: 1.5368 - val_acc: 0.4581
>>> Epoch 2/120
>>> Epoch 1/120
>>> 781/781 - 41s - loss: 1.3230 - acc: 0.5249 - val_loss: 1.1828 - val_acc: 0.5916
>>> Epoch 3/120

(...)

>>> 781/781 - 39s - loss: 0.1679 - acc: 0.9473 - val_loss: 0.1484 - val_acc: 0.9463
>>> Epoch 119/120
>>> Epoch 1/120
>>> 781/781 - 38s - loss: 0.1708 - acc: 0.9466 - val_loss: 0.1538 - val_acc: 0.9538
>>> Epoch 120/120
>>> Epoch 1/120
>>> 781/781 - 39s - loss: 0.1681 - acc: 0.9486 - val_loss: 0.1379 - val_acc: 0.9534

We obtained an accuracy of about 95% for both sets. This can also be seen in the chart below:

print(history.history.keys())
>>> dict_keys(['loss', 'acc', 'val_loss', 'val_acc'])

plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Valid'], loc='upper left')
plt.show()

Accuracy for a model with a data generator

Due to the lack of overfitting, we could theoretically further increase the number of epochs.

Let’s check how the trained model will cope with the test dataset that it has not yet seen.

x_final_test = x_test / 255.0
eval = model.evaluate(x_final_test, y_test)
>>> 10000/10000 [==============================] - 3s 314us/sample - loss: 0.5128 - acc: 0.8687

We achieved an accuracy of 87%, 6 percentage points more than the version of the model trained without a data generator.

Most importantly, however, the model is eager to continue learning, without compromising accuracy on the validation and test sets.


This is the last post in this tutorial. I hope that I was able to bring some interesting topics related to convolutional neural networks. If you liked the above post and the entire tutorial, please share it with people who may be interested in the subject of machine learning.

Convolutional neural network 3: convnets and overfitting https://aigeekprogrammer.com/convnets-and-overfitting/ Fri, 31 Jan 2020 20:46:28 +0000

Convolutional neural networks are among the most effective neural network architectures in the field of image classification. In the first part of the tutorial, we discussed the convolution operation and built a simple densely connected neural network, which we used to classify the CIFAR-10 dataset, achieving an accuracy of 47%. In the second part of the tutorial, we familiarized ourselves in detail with the architecture and parameters of a convolutional neural network, built our own network and obtained ~70% accuracy on the test set. As it turned out, however, we encountered the problem of overfitting, which prevented us from getting better results. In this part of the tutorial, we’ll take a closer look at convnets and overfitting and inspect various regularization techniques, i.e. ways of preventing excessive fitting to the training set. We will end the post with a list of practical tips that can be useful when building a convolutional neural network.

From the third part of the tutorial you will learn:

  • What is overfitting?
  • How to deal with the problem of overfitting?
  • What is internal covariate shift?
  • How to apply batch normalization?
  • What is dropout?
  • And some practical tips for building convolutional neural networks.

What is overfitting?

Let’s look again at the results we got in the second part of the tutorial. Figure 1 shows the classification results on the training set, which eventually reached up to ~95% (the blue line). Below there is the classification result for the validation set (the orange line). As you can see, the results for both sets began to diverge already around the 15th epoch, and the final difference for the 80th epoch was as high as ~25%.

Convnets and overfitting

Figure 1 – learning outcomes for training and validation sets

We call this situation overfitting. The network has learned to classify a training set so well that it has lost the ability to effectively generalize, i.e. the ability to correctly classify data that it has not previously seen.

To better understand overfitting, imagine a real-life example. A professional basketball player must have shoes of the highest quality. He works with a footwear company, and this company prepares shoes that are perfectly suited to the shape and construction of his feet. This requires not only matching the shoes to the shape of his feet, but above all a special insole. Now the basketball player feels great in the new shoes and his game is even better. Does this mean that such shoes will be equally good for another basketball player or for amateur players? In the vast majority of cases, probably not. These shoes have been fitted so well to the feet of this particular basketball player that they will not perform well on other feet. This is overfitting, and companies that produce footwear try to design shoes in such a way that the shapes of the shoes and insoles fit the greatest possible number of feet, while ensuring the greatest comfort of play.

Yet another example – this time graphic. Let’s assume that we want to build a classifier that will correctly classify the data into “circular” and “triangular”.

Convnets and overfitting

Figure 2 – overfitting vs. better generalizing model

If we adjust the classifier too much to the training data, it will not be able to correctly classify the new data, because it is unlikely that these new data will fit perfectly into the distribution of training data. Therefore, it is better for the model to be less complex. Although it will achieve slightly worse results on the training set, it will probably generalize the problem better, so it will also classify new data more correctly.

How to solve the problem of overfitting?

OK, so how do you counteract overfitting? There are at least several effective methods for this. Below I will describe the most important ones and we will try to use some of them in our classifier.

We collect more data – this is often the most effective method of preventing overfitting. If the model sees more data, it will be able to generalize its response better. Let’s remember that neural networks, and machine learning in general, love huge amounts of data and high computing power. Unfortunately, this method is often the most difficult to use in practice, or even impossible – as in our case, where we have a closed dataset.

If we can’t collect more data, we can sometimes create it ourselves. Although it may sound far-fetched, and we may wonder whether artificially generated data will improve the model’s response, in practice this method brings good results. Especially in image processing we have a wide range of possibilities in that area. We can slightly rotate or shift an image, change its colors, or make other more or less subtle changes that will give the model tons of new data. From a logical point of view: having an original photo of a horse, we can mirror it or change its colors and it will still be a photo of a horse. This technique is called data augmentation, and the leading libraries offer ready-to-use tools for it. We will use one of them in the next part of this tutorial.

As I mentioned in the second part of the tutorial, each neural network has many so-called hyperparameters. They have a significant impact on the way the network works. They are part of the model architecture, and by controlling them you can get better or worse results. When building a model, it’s worth experimenting to find an architecture that gives better results. Sometimes reducing the complexity of the architecture gives surprisingly good results. An overly complex architecture will start to overfit fairly quickly, because it is easier for such a network to fit the training set exactly.

Let’s start with this simple move. The network from the second part of the tutorial consists of convolutional and densely connected subnets. The convolutional subnet is not densely connected, and we should rather try to increase its complexity than reduce it, because then it will be able to capture more features of the image. Therefore, to reduce the complexity of the architecture, it is a good idea to start with the densely connected part.

From the model in the following form (the second part of the tutorial):

Dense(units=512, activation="relu"),
Dense(units=64, activation="relu"),
Dense(units=10, activation="softmax")

we will move to a much simpler one:

Dense(units=32, activation="relu"),
Dense(units=16, activation="relu"),
Dense(units=10, activation="softmax")

(...)
>>> Epoch 78/80
>>> loss: 0.5725 - accuracy: 0.7968 - val_loss: 0.7897 - val_accuracy: 0.7367
>>> Epoch 79/80
>>> loss: 0.5667 - accuracy: 0.8014 - val_loss: 0.8373 - val_accuracy: 0.7259
>>> Epoch 80/80
>>> loss: 0.5611 - accuracy: 0.8019 - val_loss: 0.8255 - val_accuracy: 0.7220

eval = model.evaluate(x_test, to_categorical(y_test))
>>> loss: 0.8427 - accuracy: 0.7164

Convnets - training outcome

Figure 3 – Training results after reducing the complexity of a densely connected network

As you can see there are some benefits. About 2% higher classification accuracy on the validation set. Faster training, because the network is less computationally demanding. And also reduced, though not eliminated, overfitting – currently at around 10%.

The architecture of our first version of the network, proposed in the second part of the tutorial, assumed the processing of each image by three convolution “modules”, with 64, 32 and 16 filters, respectively. Such a complexity of the convolutional network allowed us to obtain about 80% accuracy on the training set, which translated into ~72% on the test set. For the record, it looked like this:

Convolution2D(filters=64, kernel_size=(3,3), input_shape=(32,32,3), activation='relu', padding='same'),
Convolution2D(filters=64, kernel_size=(3,3), activation='relu', padding='same'),
MaxPool2D((2,2)),
Convolution2D(filters=32, kernel_size=(3,3), activation='relu', padding='same'),
Convolution2D(filters=32, kernel_size=(3,3), activation='relu', padding='same'),
MaxPool2D((2,2)),
Convolution2D(filters=16, kernel_size=(3,3), activation='relu', padding='same'),
Convolution2D(filters=16, kernel_size=(3,3), activation='relu', padding='same'),

In order for us to get better classification results, we should improve in two areas. First, increase the accuracy of classification on the training set, because as you can see the accuracy for the test set is always lower than the accuracy for the training set. Secondly, we should reduce overfitting. A network that learns to generalize well will achieve much better results on data it has not previously seen. In addition, we will be able to train it for more than 80 epochs. Currently, this does not make much sense, because although the accuracy on the training set can still increase, the same parameter on the validation set indicates that this learning is not generalizing, but fitting to the training set.

How to improve the accuracy of classification on the training set? One way is to deepen the convolutional subnet. By adding more layers and increasing the number of filters, we give the network the ability to capture more features and thus greater accuracy in classification. To achieve this, we will add one more convolutional “module” with an increased number of filters:

Convolution2D(filters=128, kernel_size=(5,5), input_shape=(32,32,3), activation='relu', padding='same'),
Convolution2D(filters=128, kernel_size=(5,5), activation='relu', padding='same'),
MaxPool2D((2,2)),
Convolution2D(filters=64, kernel_size=(5,5), activation='relu', padding='same'),
Convolution2D(filters=64, kernel_size=(5,5), activation='relu', padding='same'),
MaxPool2D((2,2)),
Convolution2D(filters=32, kernel_size=(5,5), activation='relu', padding='same'),
Convolution2D(filters=32, kernel_size=(5,5), activation='relu', padding='same'),
MaxPool2D((2,2)),
Convolution2D(filters=16, kernel_size=(3,3), activation='relu', padding='same'),
Convolution2D(filters=16, kernel_size=(3,3), activation='relu', padding='same'),

Attempting to reduce overfitting requires the introduction of two new elements: batch normalization and dropout technique.

Convnets and overfitting: batch normalization

Batch normalization aims to reduce so-called internal covariate shift. To understand the idea behind batch normalization, you must first understand what the internal covariate shift is.

Covariate is a fairly widely used term, mainly in statistics, and means an independent variable, in other words an input variable. On the basis of input variables, output (dependent) variables are determined. By analogy, in machine learning, covariate will mean the input variable / input / X / feature. In our example, covariates are the values of the color components of individual pixels of processed images.

Each dataset has a certain distribution of input data. For example, if in the CIFAR-10 dataset we analyzed the distribution of average brightness of images depicting aircraft, it would probably be different from the brightness of images depicting frogs. If we superimposed these two distributions, they would be shifted from each other. This shift is called covariate shift.

Although the datasets we use for machine learning are usually well balanced, the division of a set into training, validation and test subsets means that these subsets have slightly different input data distributions. For this reason (among others) we usually get a lower accuracy on the test set compared to the training set.

Convnets and overfitting - covariate shift

Figure 4 – covariate shift

Covariate shift occurs not only when splitting a set or enriching it with new data, but also as a result of passing input data through subsequent layers of the neural network. The network modifies data naturally by imposing weights assigned to connections between neurons in the network. As a consequence, each subsequent layer must learn data that has a slightly different distribution than the original input. This not only slows down the training process but also makes the network more susceptible to overfitting. The phenomenon of input data distribution shift in the neural network has been described by Ioffe and Szegedy and called internal covariate shift.

Ioffe and Szegedy proposed a method of data normalization performed between layers of the neural network as part of its architecture, thanks to which the phenomenon of internal covariate shift can be minimized. It should be noted here that some researchers dealing with the issue indicate that batch normalization does not so much reduce internal covariate shift, but rather “smoothes” the target function, thus accelerating and improving the training process.

To sum up: batch normalization speeds up learning – it allows fewer iterations to get the same results as a network without batch normalization. It allows the use of higher learning rates without running into the vanishing gradient problem, and it also helps reduce overfitting. Most machine learning libraries, including Keras, have built-in batch normalization functions.
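
For reference, the transformation that a batch normalization layer applies to each activation x within a mini-batch B (following Ioffe and Szegedy) can be written as:

\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta

where \mu_B and \sigma_B^2 are the mini-batch mean and variance, \epsilon is a small constant added for numerical stability, and \gamma and \beta are learnable scale and shift parameters.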

For those interested: a wiki entry and a scientific article by Sergey Ioffe and Christian Szegedy, who proposed and described the batch normalization method. The article is rather technical, with a large dose of mathematics, but the abstract, introduction and summary are easily understandable.

Convnets and overfitting: dropout

The second very useful technique that effectively fights overfitting is so-called dropout. It was proposed by Geoffrey E. Hinton et al. in the work Improving neural networks by preventing co-adaptation of feature detectors. It is a relatively simple, but also very effective technique for preventing overfitting. It consists of randomly removing individual neurons from the network (from internal layers, sometimes also the input layer) during training. Because complex networks (and deep neural networks are undoubtedly complex), especially those with relatively small amounts of training data, tend to fit the data too closely, this regularization method forces them to learn in a more generalized way.

In each training round, each neuron is either removed from or kept in the network, according to a specified drop probability. In the original work this probability was 50% for each neuron. Nowadays we set it ourselves, and it may differ between layers.

Convnets and overfitting - dropout technique

Figure 5 – dropout as a technique to minimize overfitting
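
To make the mechanics concrete, here is a minimal numpy sketch of (inverted) dropout applied to a small vector of activations – purely illustrative; in Keras the Dropout layer does this for you during training:

import numpy as np

rng = np.random.default_rng(0)
activations = np.array([0.2, 1.5, 0.7, 0.9])  # example outputs of some layer
rate = 0.5                                    # probability of dropping a neuron

mask = rng.random(activations.shape) >= rate  # True = keep, False = drop
dropped = activations * mask / (1.0 - rate)   # rescale the kept values ("inverted dropout")
print(dropped)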

In practice, using dropout means that the network architecture changes dynamically during training. Effectively, one dataset is used to train many networks with different architectures, and at test time we use a single network whose weights approximate an average over this ensemble.

Using dropout in Keras comes down to adding another layer called Dropout(rate), whose hyperparameter is the probability with which a neuron will be removed from the network. We add dropout to the densely connected subnet. Its use in the convolutional subnet is less common and basically misses the idea behind convolutions.

In the convolutional part, we will use batch normalization, which is obtained in Keras by adding the BatchNormalization() layer. As a result, we get the following new architecture:

model = Sequential([
  Convolution2D(filters=128, kernel_size=(5,5), input_shape=(32,32,3), activation='relu', padding='same'),
  BatchNormalization(),
  Convolution2D(filters=128, kernel_size=(5,5), activation='relu', padding='same'),
  BatchNormalization(),
  MaxPool2D((2,2)),
  Convolution2D(filters=64, kernel_size=(5,5), activation='relu', padding='same'),
  BatchNormalization(),
  Convolution2D(filters=64, kernel_size=(5,5), activation='relu', padding='same'),
  BatchNormalization(),
  MaxPool2D((2,2)),
  Convolution2D(filters=32, kernel_size=(5,5), activation='relu', padding='same'),
  BatchNormalization(),
  Convolution2D(filters=32, kernel_size=(5,5), activation='relu', padding='same'),
  BatchNormalization(),
  MaxPool2D((2,2)),
  Convolution2D(filters=16, kernel_size=(3,3), activation='relu', padding='same'),
  BatchNormalization(),
  Convolution2D(filters=16, kernel_size=(3,3), activation='relu', padding='same'),
  BatchNormalization(),
  Flatten(),
  Dense(units=32, activation="relu"),
  Dropout(0.15),
  Dense(units=16, activation="relu"),
  Dropout(0.05),
  Dense(units=10, activation="softmax")
])

optim = RMSprop(lr=0.001)

As you can see above, I also proposed to change the optimizer from SGD to RMSprop, which, as shown by my tests, worked slightly better for the above architecture.

Here a small digression: you may be wondering where all these changes come from? Well, they come from two sources: from collected experience and experiments with a given network. I spent at least a dozen hours on the solution that I will finally present in this tutorial, trying different architectures and hyperparameters values. This is how it looks in practice, so if you spend the second day with your model and you have no idea what to do next, then you must know that it is completely normal and in a moment (or after a short break) you will probably go on with your work.

The network has been trained for 80 epochs and as a result we have achieved a classification accuracy of 81%.

Epoch 77/80
42500/42500 - 19s - loss: 0.0493 - accuracy: 0.9888 - val_loss: 1.7957 - val_accuracy: 0.8119
Epoch 78/80
42500/42500 - 19s - loss: 0.0523 - accuracy: 0.9879 - val_loss: 1.2465 - val_accuracy: 0.8016
Epoch 79/80
42500/42500 - 19s - loss: 0.0499 - accuracy: 0.9880 - val_loss: 1.7057 - val_accuracy: 0.8137
Epoch 80/80
42500/42500 - 18s - loss: 0.0490 - accuracy: 0.9880 - val_loss: 1.5880 - val_accuracy: 0.8175

eval = model.evaluate(x_test, to_categorical(y_test))
>>> 10000/10000 [==============================] - 2s 167us/sample - loss: 1.5605 - accuracy: 0.8112

A look at the chart for the training and validation sets gives mixed feelings. On the one hand, we were able to increase the classification accuracy for all three sets, including the most important, i.e. test set, by nearly 10% (from 71% to 81%). On the other hand, strong overfitting has appeared again, which means that the network is again “learning the training set” more than generalizing the classification.

Convolutional neural networks - classification result

Figure 6 – the result of the classification (changed architecture)

If I wanted to get a better result than 81%, I would choose one of three ways. First, I could experiment with different architectures. I could refer to one of the reference architectures that obtained very good results on CIFAR-10 or a similar dataset. Secondly, I could examine the network’s response to other hyperparameter settings – tedious and time-consuming work, but sometimes a few simple changes give good results. The third way is a further fight against overfitting, but with a slightly different method, which I have already mentioned above – data augmentation. We’ll take a look at it in the next part of the tutorial.

Convnets – some practical tips

At the very end of this part of the tutorial, I have put together some practical, loosely related tips that you can take into account when building your convnet.

  • If you can, use the proven network architecture and, if possible, adapt it to your needs.
  • Start with overfitting and introduce regularization.
  • Place dropout in the densely connected layers and add batch normalization to the convolutional subnet. However, do not stick to this rule rigidly. Sometimes a non-standard move can give unexpectedly good results.
  • The kernel size should be much smaller than the size of the image.
  • Experiment with different hyperparameter settings, then experiment even more.
  • Use the GPU. If you don’t have a computer with a suitable graphics card, use Google Colaboratory.
  • Collect as much training data as possible. There is no such thing as “too much data.”
  • If you cannot collect more data, use the data generator when possible – more on this in the fourth part of the tutorial.
  • A very deep and extensive network will have a strong tendency to overfit. Use as shallow a network as possible. In particular, do not overdo the number of neurons and layers in the densely connected subnet.
  • Ensure that training and test sets are well balanced and have a similar distribution. Otherwise, you will always get much worse results on the test set than on the training set.
  • Once you feel good in the Convnet world, start reading more advanced scientific studies. They will allow you to better understand how convolutional networks work, and they will introduce new techniques and architectures to your toolbox.

Good luck! 🙂


I hope you found the above post interesting. If so, share it with your friends.

I invite you to the fourth part of the tutorial.
