# Python's Summer of Code 2016 Updates

## June 25, 2016

### Vikram Raigur (MyHDL)

#### GSoC Mid Term summary

This post is a brief summary of my GSoC experience so far.

By my mid-term evaluation I had built a Run Length Encoder; the Quantizer module is still in progress. I have built a divider for the Quantizer module and still have to write a top-level module to finish it.

I made a PR for my work in the main repo this week. The PR had 40 commits.

The Run Length Encoder takes a block of 8×8 pixel values and outputs the run-length-encoded data.

Example output for Run Length Encoder Module:

Sample input:

red_pixels_1 = [
1, 12, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 10, 2, 3, 4, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0
]

red_pixels_2 = [
0, 12, 20, 0, 0, 2, 3, 4,
0, 0, 2, 3, 4, 5, 1, 0,
0, 0, 0, 0, 0, 0, 90, 0,
0, 0, 0, 10, 0, 0, 0, 9,
1, 1, 1, 1, 2, 3, 4, 5,
1, 2, 3, 4, 1, 2, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0
]

green_pixels_1 = [
11, 12, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 10, 2, 3, 4, 0,
0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 1, 1, 2, 3, 4, 5,
1, 2, 3, 4, 1, 2, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0
]

green_pixels_2 = [
13, 12, 20, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 32, 4, 2
]

blue_pixels_1 = [
11, 12, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 2, 3, 4, 5,
1, 2, 3, 4, 1, 2, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1
]

blue_pixels_2 = [
16, 12, 20, 0, 0, 2, 3, 4,
0, 0, 2, 3, 4, 5, 1, 0,
0, 0, 0, 0, 0, 0, 90, 0,
0, 0, 0, 10, 0, 0, 0, 9,
1, 1, 1, 1, 2, 3, 4, 5,
1, 2, 3, 4, 1, 2, 0, 1,
1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 32, 4, 2
]

Sample output:

============================
runlength 0 size 1 amplitude 1
runlength 0 size 4 amplitude 12
runlength 15 size 0 amplitude 0
runlength 2 size 4 amplitude 10
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 15 size 0 amplitude 0
runlength 15 size 0 amplitude 0
runlength 7 size 1 amplitude 1
runlength 0 size 0 amplitude 0
runlength 0 size 0 amplitude 0
============================
runlength 0 size 1 amplitude -2
runlength 0 size 4 amplitude 12
runlength 0 size 5 amplitude 20
runlength 2 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 2 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 3 amplitude 5
runlength 0 size 1 amplitude 1
runlength 7 size 7 amplitude 90
runlength 4 size 4 amplitude 10
runlength 3 size 4 amplitude 9
runlength 0 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 3 amplitude 5
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 0 amplitude 0
runlength 0 size 0 amplitude 0
=============================

runlength 0 size 4 amplitude 11
runlength 0 size 4 amplitude 12
runlength 15 size 0 amplitude 0
runlength 2 size 4 amplitude 10
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 5 size 1 amplitude 1
runlength 5 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 3 amplitude 5
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 14 size 1 amplitude 1
runlength 0 size 0 amplitude 0
runlength 0 size 0 amplitude 0
==============================
runlength 0 size 2 amplitude 2
runlength 0 size 4 amplitude 12
runlength 0 size 5 amplitude 20
runlength 15 size 0 amplitude 0
runlength 15 size 0 amplitude 0
runlength 15 size 0 amplitude 0
runlength 7 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 3 size 1 amplitude 1
runlength 0 size 6 amplitude 32
runlength 0 size 3 amplitude 4
runlength 0 size 2 amplitude 2
runlength 0 size 0 amplitude 0
==============================
runlength 0 size 4 amplitude 11
runlength 0 size 4 amplitude 12
runlength 15 size 0 amplitude 0
runlength 15 size 0 amplitude 0
runlength 3 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 3 amplitude 5
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 15 size 0 amplitude 0
runlength 2 size 1 amplitude 1
runlength 0 size 0 amplitude 0
==============================
runlength 0 size 3 amplitude 5
runlength 0 size 4 amplitude 12
runlength 0 size 5 amplitude 20
runlength 2 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 2 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 3 amplitude 5
runlength 0 size 1 amplitude 1
runlength 7 size 7 amplitude 90
runlength 4 size 4 amplitude 10
runlength 3 size 4 amplitude 9
runlength 0 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 3 amplitude 5
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 0 size 2 amplitude 3
runlength 0 size 3 amplitude 4
runlength 0 size 1 amplitude 1
runlength 0 size 2 amplitude 2
runlength 1 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 6 size 1 amplitude 1
runlength 0 size 1 amplitude 1
runlength 3 size 1 amplitude 1
runlength 0 size 6 amplitude 32
runlength 0 size 3 amplitude 4
runlength 0 size 2 amplitude 2
runlength 3 size 4 amplitude 9
==============================

If the module counts more than 15 consecutive zeros, it stalls its inputs.
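
The behaviour visible in the sample output can be approximated in a few lines of plain Python. This is only a sketch inferred from the numbers above, not the MyHDL module itself: the name `rle_block`, the cap of 15 zeros per marker, the `value - 1` handling of negative values and the single end-of-block marker are all assumptions.

```python
def rle_block(pixels, prev_dc=0, max_run=15):
    """Encode one flat block as (runlength, size, amplitude) triples."""

    def triple(run, value):
        # size: number of bits needed for the magnitude of the value.
        # Assumption: negative values are stored as value - 1, which is
        # what the amplitude of -2 in the sample output suggests.
        size = abs(value).bit_length()
        amplitude = value if value >= 0 else value - 1
        return (run, size, amplitude)

    # The first sample is coded as a difference to the previous block.
    encoded = [triple(0, pixels[0] - prev_dc)]

    run = 0
    for value in pixels[1:]:
        if value == 0:
            run += 1
            continue
        # Assumption: a run longer than max_run is split into markers.
        while run > max_run:
            encoded.append((max_run, 0, 0))
            run -= max_run
        encoded.append(triple(run, value))
        run = 0

    if run:
        # Only zeros remain: emit an end-of-block marker.
        encoded.append((0, 0, 0))
    return encoded


# Same values as red_pixels_1 above, written compactly.
red_pixels_1 = [1, 12] + [0] * 17 + [10, 2, 3, 4] + [0] * 37 + [1, 0, 0, 0]
for runlength, size, amplitude in rle_block(red_pixels_1):
    print("runlength", runlength, "size", size, "amplitude", amplitude)
```

Passing the first value of the previous block as `prev_dc` gives the differential first triple seen in the second red block.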

I tried to git rebase my repo and messed things up, but everything seems fine now.

As per my timeline, I have to finish the Quantizer and the Run Length Encoder by the 28th of this month. I hope to finish them on time.

New Checkpoints:

Quantizer : 30th June

Huffman : 7th July

Byte Stuffer : 15th July

JFIF Header Generator : 25th July

Control Unit and Documentation : Remaining time.

I learned today that when indexing or slicing, Python excludes the upper bound whereas Verilog includes it.
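
A small plain-Python sketch of the difference follows. The helper `bit_slice` is hypothetical and only mimics the half-open convention that Python slices use; the Verilog equivalents are given in the comments.

```python
def bit_slice(value, upper, lower):
    """Return bits lower .. upper-1 of value (upper bound excluded)."""
    width = upper - lower
    return (value >> lower) & ((1 << width) - 1)


word = 0b10110110

low_nibble = bit_slice(word, 4, 0)    # Verilog would write word[3:0]
high_nibble = bit_slice(word, 8, 4)   # Verilog would write word[7:4]

samples = list(range(8))
first_four = samples[0:4]             # elements 0, 1, 2, 3; index 4 is excluded
```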

I made the RLE module generic, so that it can take any number of pixels.

The RLE module has two major parts:

1. RLE Core
2. RLE Double Buffer

The RLE Core processes the data and stores it in the RLE Double Buffer. While the Huffman module reads from one buffer, we can write into the second buffer.
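
The double-buffer idea can be sketched in software as two banks that swap roles. The class and method names below are hypothetical and say nothing about how the MyHDL module is actually organised.

```python
class DoubleBuffer:
    """Two banks: one is written while the other one is read."""

    def __init__(self, depth):
        self.banks = [[None] * depth, [None] * depth]
        self.write_bank = 0  # index of the bank currently being filled

    def write(self, address, value):
        # The producer (the RLE core in this sketch) fills the write bank.
        self.banks[self.write_bank][address] = value

    def read(self, address):
        # The consumer (the Huffman stage) reads from the other bank.
        return self.banks[1 - self.write_bank][address]

    def swap(self):
        # Called once a block is complete: the banks exchange roles.
        self.write_bank = 1 - self.write_bank


buffer = DoubleBuffer(depth=4)
buffer.write(0, (0, 4, 12))   # producer stores a triple from block 1
buffer.swap()                 # block 1 is now visible to the consumer
buffer.write(0, (0, 2, 3))    # producer already works on block 2
first_read = buffer.read(0)   # consumer still sees block 1
```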

This week I set up a Travis build for my repo. Things didn't work well with Travis initially, because I imported the FIFO from the RHEA folder.

As soon as cfelton released a version of the RHEA files that uses the MyHDL 1.0 block decorator, the Travis build started working.

The RLE core initially had an issue with negative numbers, which I eventually fixed.

The code coverage for the RLE module is 100 percent according to pytest.

Landscape gives the code around 90 percent health.

Coveralls reports around 100 percent coverage for the code.

I added conversion tests for all the modules. I wrapped each module in a dummy wrapper so that I can check whether it converts.

I was facing an issue with nested interfaces: I learned that MyHDL does not support nested interfaces as ports. It declares their signals as reg or wire, but not as input or output.

So, I made some modifications to my interfaces.

The Quantizer module has a divider at its heart, built from a multiplier. We send the divisor to a ROM as an address and read back its stored reciprocal. We then multiply the dividend by this reciprocal to get the quotient.

I have been following the same architecture as the reference design for the Quantizer module.

Also, I put separate clock and reset modules in a common folder so that I can access them easily, and added a reference implementation to the same folder.

I will finish the modules as per the checkpoints.
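
The reciprocal trick can be sketched with integer arithmetic. The number of fractional bits, the table size and the value ranges below are assumptions chosen for illustration, not the widths used in the reference design, and the sketch only covers unsigned values.

```python
SHIFT = 20  # fractional bits kept for each reciprocal (assumed width)

# "ROM": one pre-computed reciprocal per divisor 1..255, rounded up.
# Entry 0 is a dummy so that the divisor can be used as the address.
RECIPROCAL_ROM = [0] + [((1 << SHIFT) + d - 1) // d for d in range(1, 256)]


def divide(dividend, divisor):
    """Divide by multiplying with the stored reciprocal and shifting."""
    reciprocal = RECIPROCAL_ROM[divisor]
    return (dividend * reciprocal) >> SHIFT


quotient = divide(1000, 16)  # same result as 1000 // 16
```

With these particular assumptions the result equals integer division for dividends from 0 to 2047. A narrower reciprocal would save multiplier bits at the cost of occasional off-by-one results.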

Stay tuned for the next update.

#### GSoC Third week

I have been suffering from viral fever, so I was not able to work properly for the whole week. Over the course of the week I made the first version of the Run Length Encoder.

Things worth discussing:

To convert a number to unsigned in Python (MyHDL), all we have to do is:

a = a & 0xFF

This converts a to an unsigned 8-bit value.
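
The masking only works with the bitwise `&` operator; the keyword `and` is a logical operator and simply returns one of its operands. A short sketch, where the helper name `to_unsigned` and its `width` parameter are hypothetical:

```python
def to_unsigned(value, width=8):
    """Reinterpret a possibly negative integer as an unsigned value."""
    return value & ((1 << width) - 1)


masked = to_unsigned(-3)      # two's complement of -3 in 8 bits
wide = to_unsigned(-1, 12)    # all twelve bits set
logical = -3 and 0xFF         # logical "and": returns 0xFF, no masking
```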

Also, I familiarised myself with pylint and flake8. Flake8 checks your code against the PEP 8 coding guidelines. They really help a lot in making the code look good. Also, Chris told me to make the code more modular, so that we can run-length encode a block of any size.

Overall, the week was a decent one.

### SanketDG (coala)

#### Summer of Code Midterm Updates

Talking about short updates, I have successfully completed parsing routines for Python and Java. My next task would be to use the coalang functionality to implement parsing routines which are completely implicit and to provide support for C and C++ documentation styles.

So instead of passing the parameter and return symbols as strings through a function, they would be extracted from the coalang files. A strong API to access coalang files would help here.

Support for multiple params also needs to be kept in mind. A documentation style can support many types of formats.

After this is done, I will start working on the big thing i.e. DocumentationBear! The first feature of capitalizing sentences has been already implemented (needs a little bit of improvement.)

The second thing to do is to implement checking the docs against a specified style. This can be done in two ways:

• The first one involves the user supplying a regex to check against the documentation.

• Another way would be to define some predefined styles that are generally followed as conventions in most projects, and then check the documentation against them. For example, for Python docstrings, two conventions seem to rule:

:param x:          blablabla
:param muchtolong: blablabla

:param x:
    blablabla
:param muchtolong:
    blablabla


Supporting these two conventions as predefined styles would avoid most projects writing a complex regex.
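
As a rough illustration, the two conventions could be told apart with two small regular expressions. Everything here (the names `INLINE`, `NEXT_LINE` and `param_style`, and the returned labels) is hypothetical and not part of coala's API.

```python
import re

# ":param x: text" with the description on the same line.
INLINE = re.compile(r"^\s*:param\s+\w+:\s+\S")
# ":param x:" alone, with the description on the following line.
NEXT_LINE = re.compile(r"^\s*:param\s+\w+:\s*$")


def param_style(docstring):
    """Return 'inline', 'next-line', 'mixed' or 'none' for a docstring."""
    styles = set()
    for line in docstring.splitlines():
        if INLINE.match(line):
            styles.add("inline")
        elif NEXT_LINE.match(line):
            styles.add("next-line")
    if len(styles) == 1:
        return styles.pop()
    return "mixed" if styles else "none"


aligned = ":param x:          blablabla\n:param muchtolong: blablabla"
wrapped = ":param x:\n    blablabla\n:param muchtolong:\n    blablabla"
```

A style check could then flag docstrings whose detected style differs from the one configured for the project.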

Then I would go forward with more functionality like indentation checking and wrapping of characters in a long line to subsequent lines. I will also check for grammar within documentation!

If there is time available after all this, I would go forward with refactoring all the classes related to documentation extraction and improve the parsing routines to make them more dynamic. I would also like to tackle the problem where languages and docstyles have different fields for extracting, not only the current three (description, parameters and return values).

On a final note, I have an issue tracker at GitLab. Also, to help me organize my work, I have opened a public Trello board. The board is empty right now, but I will start filling it up from tomorrow.

### meetshah1995 (MyHDL)

#### Let's Silicon

As the title may suggest, most of the next part of my GSoC will be spent making hardware modules for RISC-V cores and interfacing them to make a processor!

I already have a working and tested MyHDL-based decoder in place. I am now in discussions with my mentor to finalize a RISC-V module which I can port to MyHDL. This will begin the next phase of my coding in GSoC.

We will be selecting an RV32I-based core to implement in the coming weeks, as the HDL decoder fully supports RV32I at present.

I have also shifted my development to the dev branch, keeping my master up to date with the main repository.

See you next week.
MS

### tsirif (Theano)

This week I am going to present in detail my pull request in Theano’s libgpuarray project. As I mentioned in my previous blog post, this pull request will provide multi-GPU collectives support in libgpuarray. The API exposed for this purpose is described in two header files: collectives.h and buffer_collectives.h.

#### Some libgpuarray API

Before explaining the collectives API, I must describe some libgpuarray structures that the user has to handle in order to develop functioning software.

• gpucontext: This structure is declared in buffer.h.

As the name suggests, it describes a GPU context. A GPU context is a concept which represents a process running on a GPU. In general, a context can be “pushed” to a GPU, and all kernel operations scheduled while that context is active will be executed accordingly. A context keeps track of state related to a GPU process (distinct memory address space, allocations, kernel definitions). A context is “popped” when the user does not want to use it anymore. In libgpuarray, a gpucontext is assigned to a single GPU on creation and is also used to refer to the GPU which will be programmed. A call to gpucontext_init creates an instance, and at least one call is necessary to make use of the rest of the library.

gpucontext* gpucontext_init(const char* name, int dev, int flags, int* ret);

• gpudata: This structure is declared in buffer.h.

It represents allocated data in a device which is handled by a single gpucontext. A call to gpudata_alloc will return an allocated gpudata which refers to a buffer of size sz (in bytes) in the GPU selected through the ctx provided. Optionally, a pointer data to host memory can be provided, along with GA_BUFFER_INIT as flags, to copy sz bytes from the host to the newly allocated buffer in the GPU.

gpudata* gpudata_alloc(gpucontext* ctx, size_t sz, void* data, int flags, int* ret);

• GpuArray: This structure is declared in array.h.

It represents an ndarray in the GPU. It is a container, similar to NumPy’s, which attaches specific vector space attributes to a gpudata buffer. It contains the number and size of dimensions, strides, offset from the original device pointer in gpudata, data type, and flags which indicate if a GpuArray is aligned, contiguous and well-behaved. It can be created in 4 ways: as an empty array, as an array filled with zeros, using previously allocated gpudata, or using an existing host ndarray. All of them need information about the number and size of dimensions, strides (for the first two, through the data order) and data type. We will use the following two:

int GpuArray_empty(GpuArray* a, gpucontext* ctx, int typecode,
unsigned int nd, const size_t* dims,
ga_order ord);
int GpuArray_copy_from_host(GpuArray *a, gpucontext *ctx, void *buf, int typecode,
unsigned int nd, const size_t *dims,
const ssize_t *strides);
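
To make the relation between dimensions, data order and strides concrete, here is a small Python sketch that derives C-order (row-major) strides in bytes. The helper `c_order_strides` is hypothetical and not part of libgpuarray or pygpu, and a 4-byte int is assumed in the usage line.

```python
def c_order_strides(dims, itemsize):
    """Byte strides of a contiguous row-major array with the given dims."""
    strides = []
    step = itemsize
    # The last dimension varies fastest, so walk the dims backwards.
    for extent in reversed(dims):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


# A 32 x 16 array of 4-byte ints: one row is 16 * 4 = 64 bytes long.
strides = c_order_strides((32, 16), 4)
```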


#### Collectives API on GPU buffers

I will now explain how to use the buffer-level API which exists in buffer_collectives.h. For convenience, I am going to do this by presenting the test code as an example.

First of all, since we are going to examine a multi-GPU example, a parallel framework is needed, because nccl requires some of its API to be called in parallel for each GPU to be used. In this example I am going to use MPI. I will omit the initialization of MPI and its ranks and use MPI_COMM_WORLD. Each process will handle a single GPU device, and in this example the rank of an MPI process will be used to select a hardware device number.

gpucontext* ctx = gpucontext_init("cuda", rank, 0, NULL);
gpucommCliqueId comm_id;
gpucomm_gen_clique_id(ctx, &comm_id);


A gpucontext is initialized and a unique id for gpu communicators is produced with gpucomm_gen_clique_id.

MPI_Bcast(&comm_id, GA_COMM_ID_BYTES, MPI_CHAR, 0, MPI_COMM_WORLD);
gpucomm* comm;
gpucomm_new(&comm, ctx, comm_id, num_of_devs, rank);


The unique id is broadcast using MPI so that it is the same among all GPU communicators. A gpucomm instance is created, which represents the communicator of a single GPU in a group of GPUs which will participate in collective operations. It is declared in buffer_collectives.h. gpucomm_new needs to know about the ctx to be used and the user-defined rank of ctx’s device in the newly created group. The rank in a GPU group is user defined and is independent of the hardware device number or the MPI process rank. For convenience, in this test example they are equal.

int err;
int* A = calloc(1024, sizeof(char));
int i, count = SIZE / sizeof(int);  // SIZE: buffer size in bytes (1024 in this excerpt)
for (i = 0; i < count; ++i)
  A[i] = comm_rank + 2;  // comm_rank: this GPU's rank in the group
int* RES = calloc(1024, sizeof(char));
int* EXP = calloc(1024, sizeof(char));

gpudata* Adev = gpudata_alloc(ctx, 1024, A, GA_BUFFER_INIT, &err);
gpudata* RESdev = gpudata_alloc(ctx, 1024, NULL, 0, &err);


Initialize buffers for input, expected and actual output.

gpucomm_reduce(Adev, 0, RESdev, 0, count, GA_INT, GA_PROD, 0, comm);
MPI_Reduce(A, EXP, count, MPI_INT, MPI_PROD, 0, MPI_COMM_WORLD);


For convenience, all collective operations are checked against the results of the corresponding MPI collective operations. All collectives require a gpucomm as an argument and sync implicitly, so every gpucomm that participates in a GPU group must call the collective function. Collective operations and their documentation exist in buffer_collectives.h. In that file you will also find the definition of _gpucomm_reduce_ops, one of which is the GA_PROD used in this example. Notice the similarity between the MPI and gpucomm signatures.

int gpucomm_reduce(gpudata* src, size_t offsrc, gpudata* dest,
size_t offdest, size_t count, int typecode,
int opcode, int root, gpucomm* comm);
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, int root,
MPI_Comm comm);
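
What the reduce collective computes in this test can be mimicked in plain Python, without GPUs or MPI. The function `simulate_reduce` is a hypothetical stand-in that only shows the semantics: element-wise combination of one buffer per rank, with the result delivered to the root only.

```python
from functools import reduce
from operator import mul


def simulate_reduce(buffers, op, root=0):
    """Combine one buffer per rank element-wise; only root gets the result."""
    combined = [reduce(op, column) for column in zip(*buffers)]
    return [combined if rank == root else None for rank in range(len(buffers))]


num_ranks, count = 3, 4
# As in the test code, every rank fills its buffer with rank + 2.
inputs = [[rank + 2] * count for rank in range(num_ranks)]
results = simulate_reduce(inputs, mul)  # product, like GA_PROD / MPI_PROD
```

With three ranks every element on the root becomes 2 * 3 * 4 = 24, while the other ranks receive nothing.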


Currently supported collective operations are all operations supported by nccl, as of now:

• gpucomm_reduce
• gpucomm_all_reduce
• gpucomm_reduce_scatter
• gpucomm_broadcast
• gpucomm_all_gather

if (rank == 0) {
  // Reading from RESdev gpudata to RES host pointer

  int res;
  MAX_ABS_DIFF(RES, EXP, count, res);
  if (!(res == 0)) {
    PRINT(RES, count);  // print RES array
    PRINT(EXP, count);  // print EXP array
    ck_abort_msg("gpudata_reduce with GA_INT type and GA_PROD op produced max "
                 "abs err %d", res);
  }
}


Result from root’s GPU is copied back to host and then the expected and actual results are compared.

free(A);
free(RES);
free(EXP);
gpudata_release(RESdev);
gpucomm_free(comm);
gpucontext_deref(ctx);


Finally, resources are released.

The complete testing code can be found in the main.c, device.c, communicator.c and check_buffer_collectives.c files. The libcheck framework is used for C testing. The actual testing code contains setup and teardown functions, as well as preprocessor macros and tricks for easily testing all data and operation types. The example above omits crucial error checking for brevity.

#### Collectives API on GPU ndarrays

gpucontext* ctx = gpucontext_init("cuda", rank, 0, NULL);
gpucommCliqueId comm_id;
gpucomm_gen_clique_id(ctx, &comm_id);

MPI_Bcast(&comm_id, GA_COMM_ID_BYTES, MPI_CHAR, 0, MPI_COMM_WORLD);
gpucomm* comm;
gpucomm_new(&comm, ctx, comm_id, num_of_devs, rank);

int(*A)[16];
A = (int(*)[16])calloc(32, sizeof(*A));
int(*RES)[16];
RES = (int(*)[16])calloc(32, sizeof(*RES));
int(*EXP)[16];
EXP = (int(*)[16])calloc(32, sizeof(*EXP));

size_t indims[2] = {32, 16};
size_t outdims[2] = {32, 16};
const ssize_t instrds[ND] = {sizeof(*A), sizeof(int)};
const ssize_t outstrds[ND] = {sizeof(*RES), sizeof(int)};
size_t outsize = outdims[0] * outstrds[0];
size_t i, j;
for (i = 0; i < indims[0]; ++i)
  for (j = 0; j < indims[1]; ++j)
    A[i][j] = comm_rank + 2;

GpuArray Adev;
GpuArray_copy_from_host(&Adev, ctx, A, GA_INT, ND, indims, instrds);
GpuArray RESdev;
GpuArray_empty(&RESdev, ctx, GA_INT, ND, outdims, GA_C_ORDER);


First create a gpucomm as before. Then initialize the arrays in host and device to be used in the test. The code above may seem difficult to read, or a pain to write explicitly every time an array must be made, but pygpu, the Python interface to libgpuarray, makes it easy and readable.

if (rank == 0) {
  GpuArray_reduce(&Adev, &RESdev, GA_SUM, 0, comm);
} else {
  GpuArray_reduce_from(&Adev, GA_SUM, 0, comm);
}
MPI_Reduce(A, EXP, 32 * 16, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

if (rank == 0) {
// Reading from RESdev gpudata to RES host pointer
int res;
COUNT_ERRORS(RES, EXP, 32, 16, res);
ck_assert_msg(res == 0,
"GpuArray_reduce with GA_SUM op produced errors in %d places",
res);
}


As before, results are checked against the results of the MPI collectives. Collective operations for GpuArrays and their documentation exist in collectives.h. In this example, GpuArray_reduce is the function used to perform the reduce collective operation on GpuArrays, while GpuArray_reduce_from is a function which non-root gpucomm ranks can use to participate in this collective.

int GpuArray_reduce_from(const GpuArray* src, int opcode,
                         int root, gpucomm* comm);
int GpuArray_reduce(const GpuArray* src, GpuArray* dest,
                    int opcode, int root, gpucomm* comm);


Currently supported collective operations on GpuArrays:

• GpuArray_reduce_from
• GpuArray_reduce
• GpuArray_all_reduce
• GpuArray_reduce_scatter
• GpuArray_broadcast
• GpuArray_all_gather

GpuArray_clear(&RESdev);
free(A);
free(RES);
free(EXP);
gpucomm_free(comm);
gpucontext_deref(ctx);


Finally, resources are released again.

#### In general and near future

Using this part of libgpuarray requires having nccl installed, as well as CUDA >= v7.0 and GPUs of at least Kepler architecture, as suggested on nccl’s GitHub page. Currently, as there is no collectives framework for OpenCL, collective operations are supported only for CUDA gpucontexts. If nccl exists in a default path in your system (one whose bin directory is contained in the PATH environment variable), then this part of the library will be built automatically when invoking, for example, make relc. Otherwise, you need to specify its location through the variable NCCL_ROOT_DIR.

If you want to test, you need to have MPI and libcheck installed, as well as have Makefile.conf file properly setup to declare how many and which GPUs you want to use in order to test across many GPUs in your machine.

I want to note that testing with MPI and libcheck gave me a headache when I tried to execute the test binaries for the first time. The MPI processes raised a segmentation fault, reporting that the memory address space was not correct. For anybody who may attempt a similar approach for multi-process testing: I did not know that libcheck forks and runs the tests in a subprocess. This subprocess is not the “registered” MPI process, so an error occurs when an MPI command is issued with the expected MPI comm. To solve this, I turned off forking through the libcheck API before running the tests. See this.

Right now I am working on completing Python support for the libgpuarray collectives API in pygpu. There will be a follow-up blog post as soon as I finish.

Till then, have fun coding!
Tsirif, 24/06/2016

## June 24, 2016

### shrox (Tryton)

#### Working ODT file

Hurray! I can now generate an ODT file that works just fine and can be opened in LibreOffice without having to repair it! This is a great step forward for me in my project and I am really happy!

Since I wrote the last blog post, I have indeed come a long way. I have cleaned up my code, for one. It now looks like code should look and is fairly human readable.

Next I have also successfully generated the manifest.xml file that is associated with ODT files. This file keeps a list of all the files, the various xml files as well as the images that the final odt needs to display.

Two very useful, handy additions to my code are the usage of the StringIO and zipfile libraries. StringIO lets me make an “in memory” folder. Earlier I used to generate the files in the folder that my .py file was located. I used to create folders in them, the xml files as well as the images. Then I would manually zip the file and rename to odt. But now, my Python program does all of that, without using the system at all. Hence, I do not even need to ‘import os’ in my program to access the file system :)
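
A minimal sketch of the in-memory approach, written for Python 3, where `io.BytesIO` plays the role StringIO has in the post. The helper names, the simplified manifest and the placeholder payload are assumptions; the result is a correctly structured package but not a complete document that an office suite would open.

```python
import io
import zipfile

ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"
MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"


def build_manifest(files):
    """Return a simplified manifest listing every packaged file."""
    entry = ' <manifest:file-entry manifest:full-path="{}" manifest:media-type="{}"/>'
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<manifest:manifest xmlns:manifest="{}">'.format(MANIFEST_NS),
        entry.format("/", ODT_MIMETYPE),
    ]
    for path, (media_type, _payload) in files.items():
        lines.append(entry.format(path, media_type))
    lines.append("</manifest:manifest>")
    return "\n".join(lines)


def build_package(files):
    """Zip the given files into an ODT-style package held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The mimetype entry goes first and is stored uncompressed.
        archive.writestr("mimetype", ODT_MIMETYPE,
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/manifest.xml", build_manifest(files))
        for path, (_media_type, payload) in files.items():
            archive.writestr(path, payload)
    return buffer.getvalue()


# Hypothetical placeholder payload; a real content.xml is far larger.
package = build_package({"content.xml": ("text/xml", "<placeholder/>")})
```

The returned bytes can be written to a single .odt file at the very end, or handed on without ever touching the file system.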

### Shubham_Singh (italian mars society)

As planned in the schedule, I developed the GUI for the project first and later linked the front end to the database, which is built on MongoDB.
Various files were developed at different stages of the project. Some of them are:

a) health_index.ui: This is the basic user interface file generated using qt-designer, which describes the layout of the application's interface.

b) health_index.py: This file is the Python equivalent of health_index.ui, generated with pyuic4 using the command:
pyuic4 health_index.ui -o health_index.py

c) output.py: This is the executable Python program, generated using the command:
pyuic4 health_index.ui -x -o output.py

To run the GUI, use the following commands:
a) Start the MongoDB daemon:
sudo mongod

b) Navigate to the health_index directory and start the application using the command below:
python test_output.py

If all the dependencies are installed correctly, we see the initial GUI window, which consists of different LineEdit and Textbox widgets for accepting the input.
Time Stamp and Time Interval are the two inputs required for plotting the graph.
To enter the input values and parameters, click on the "select the parameter" button.

Selecting the parameter and its value is done through the input dialog window; the time stamp is fetched from the system time and can be changed if needed.
All the required inputs are then inserted into the database:

db.hi_values.insert_one(
    {
        'parameter': dataset[0],
        'value': dataset[1],
        'timestamp': dataset[2],
        'timeinterval': dataset[3]
    }
)

They are similarly fetched from the database in the summary tab.

I also worked on the project documentation, covering the prerequisites, dependencies and installation. It also describes the steps to run the GUI and the flow of control between the different files in the project.
From next week I will start with the HI calculations and graph plotting with pyqtgraph.

Cheers !
Shubham

### srivatsan_r (MyHDL)

#### Completed HDMI Cores!

Finally, after working for almost 3 weeks, I have completed coding the HDMI cores and they are working fine. I took more time to complete the Receiver core because it had lots of modules. Testing them after writing the code was again very tedious. I had to trace each and every signal’s waveform and find where it was going wrong.

MyHDL allows only certain classes and functions to be converted into Verilog code. So one doesn’t have the freedom to use all the cool functionality of Python in MyHDL if the code has to be convertible. After I had coded the cores in MyHDL, I faced many errors when I tried to convert them to Verilog. For example, I was not able to compare strings inside a function decorated with MyHDL decorators. My mentor suggested that I compare the strings outside the function and assign the result to a boolean variable, which can then be tested inside the function.

So, after debugging all the errors, the code was successfully converted to Verilog. Initially I was trying to use the verify_convert() function of MyHDL to check the correctness of the generated Verilog code. But since my code contained many Xilinx primitives, I was not able to compile it, as the libraries were not available in the Icarus Verilog simulator. So I had to drop that idea and just use the convert() function to check that the code converts successfully. The converted Verilog code contained more than 3000 lines of code!!
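
The pattern my mentor suggested can be sketched in plain Python. The names `make_encoder` and `mode` are hypothetical, and the inner function stands in for a function that would carry a MyHDL decorator in real code.

```python
def make_encoder(mode):
    """Build the logic for one channel; mode is a configuration string."""
    # The string comparison happens once, outside the inner function,
    # while the design is being put together.
    is_video = (mode == "video")

    def logic(data):
        # Only the plain boolean is used in here, no string operations.
        if is_video:
            return data ^ 0xFF
        return data

    return logic


video_logic = make_encoder("video")
audio_logic = make_encoder("audio")
inverted = video_logic(0x0F)
unchanged = audio_logic(0x0F)
```

Because the comparison result is fixed before the inner function ever runs, the converter only has to deal with a constant boolean.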

Most of the Open source projects hosted on Github will contain some badges in their README file. These badges are images provided by some integration tools displaying their corresponding stats or scores.

Some important integration tools are Travis CI, landscape.io, coveralls.io and readthedocs.

Travis CI

landscape.io

Landscape integrates with your existing development process to give you continuous code metrics seamlessly. Each code push is checked automatically, making it easy to keep on top of code quality. You just have to sign up at landscape.io, get your badge’s Markdown and add it to the README file. The badge will show the health of your project. You can also configure landscape with a YAML file, for example to make it ignore some warning checks. Here is a configuration file from my HDMI-Source-Sink-Modules project.

coveralls.io

Coveralls integrates with your code development and tells you how many lines of your code are actually covered by your tests, giving you the percentage of code covered. You just have to sign up on their website and add your project repository. You have to modify a line in your .travis.yml file to make this work, and you have to run your tests with the coverage command. Here is a Travis CI configuration file from my HDMI-Source-Sink-Modules project. The badge provided will contain the percentage of code covered by the tests.

Readthedocs lets you host the documentation of the project online. Again, you have to sign up and add your repository. You can use Sphinx for generating the documentation of your project. The autodoc feature of Sphinx allows you to generate the documentation of your code from docstrings written in rST. For Python projects which use Google style docstrings, i.e. which are not written in rST format, there is a package called napoleon which automatically converts the Google style docstrings into rST for Sphinx to use. Readthedocs looks for the conf.py file generated by Sphinx and creates the documentation according to it.

All these badges help any developer to get a quick summary of the status of your project.
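
For reference, enabling autodoc and napoleon usually comes down to a few lines in conf.py. The fragment below is a sketch that assumes a Sphinx version which ships napoleon as `sphinx.ext.napoleon`; the two option values are just one possible choice.

```python
# conf.py (fragment)

# autodoc pulls documentation from docstrings; napoleon converts
# Google style docstrings to rST before autodoc processes them.
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
]

# Parse Google style sections such as "Args:" and "Returns:".
napoleon_google_docstring = True
# NumPy style parsing is switched off here (an arbitrary choice).
napoleon_numpy_docstring = False
```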

### Karan_Saxena (italian mars society)

Time flies!!

It looks as if it was only yesterday that coding period started. Even my exams came and almost (2 more to go :P) passed by :D

1) PyKinect2 is [finally] working. Woosh!!
2) I am now able to ping the hardware via the .py script. Yay.
3) The coordinates are being dumped in a temporary file.

See this in action
 My body being tracked

Finally, big thanks to my sub-org admin Antonio and my mentor Ambar for allowing me to accommodate my exams in between.

Here's to next 2 months working on the project full time (y)

Onwards and upwards!!

### ghoshbishakh (dipy)

#### Google Summer of Code Progress June 24

This is midterm period and the dipy website has a proper frontend now! And more improvements coming.

##### Progress so far

The custom content management system is improved and has a better frontend now.

The documentation generation script is also updated to upload the documentation of different versions separately to GitHub. The Django site is then updated by checking the contents of the GitHub repository through the GitHub API.

The current pull requests are #9 and #1082

You can visit the site under development at http://dipy.herokuapp.com/

##### Details of content management system

The custom CMS now allows maximum flexibility. It is possible to edit almost all content in the website. Fixed sections, documentation versions, pages, publications, gallery images, news feeds, carousel images: everything can be edited from the admin panel.

One of the most important additions in the CMS is that now we can create any number of pages and we will get a url for that page. So this allows us to create any custom page and link it from anywhere. Also with a single click that page can be put in the nav bar.

To make the nav bar dynamic I had to pass the same context to every template. Thankfully this can be achieved with “context_processors” in a DRY way. This also allows the documentation links to be changed without changing the template.

For now the documentation is hosted in the dipy_web GitHub repository. Different versions of the documentation are linked automatically into Django by checking the content of the repository using the GitHub API.

There is also an option for excluding some documentation versions from the admin panel.

##### Details of the frontend
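
A context processor is just a function that takes the request and returns a dictionary which is merged into every template context. The sketch below is hypothetical: the page list stands in for a database query, and the dotted path in the settings fragment is an invented module name.

```python
# Stand-in for pages stored in the CMS (a real site would query a model).
PAGES = [
    {"title": "Documentation", "url": "/documentation/", "show_in_nav": True},
    {"title": "Gallery", "url": "/gallery/", "show_in_nav": True},
    {"title": "Draft page", "url": "/draft/", "show_in_nav": False},
]


def nav_pages(request):
    """Context processor: make the nav bar entries available everywhere."""
    return {"nav_pages": [page for page in PAGES if page["show_in_nav"]]}


# settings.py fragment: register the processor once for all templates.
TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        "APP_DIRS": True,
        "OPTIONS": {
            "context_processors": [
                "django.template.context_processors.request",
                "website.context_processors.nav_pages",  # hypothetical path
            ],
        },
    },
]
```

Every template can then loop over `nav_pages` without any view having to pass it explicitly.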

Although most parts of the website now have a basic styling, the frontend is still under constant improvement. The progress can be best visualized through some screenshots:

### What’s next

We have to automate the documentation generation process. A build server will be triggered whenever there is a new commit, and the documentation will be automatically updated on the website. There are also some command line tools for which the docs must be generated.

I have to include a Facebook feed on the home page. Also, the honeycomb gallery is cool, but a carousel with up-to-date content like upcoming events and news would be more useful. The styling of all parts of the website can be improved.

Then I have to clean up the code a bit, add some more documentation and start testing the internals. After that we have to think about deployment and things like search engine optimization and caching.

Will be back with more updates soon! :)

### Ranveer Aggarwal (dipy)

#### Making a Slider

The next UI element is a slider. It’s not as difficult as it looks since I already have most of the framework set up.
Basically, this time when the interactor gets the coordinates of the click, I need to pass these to the UI element in some way so that the callback can pick it up. And this is already set up as ui_params, which I implemented in the case of the text box.

Now, the slider itself is composed of two actors - a line and a disk. The disk would be moving (sliding) on the line. For doing so, I had to change DIPY’s renderer class a bit. The change now allows for nested UI elements, much like the slider - which has a sliderDisk and a sliderLine.
The click is handled by the line (the callback belongs to the sliderLine) and the disk moves.

In addition, the slider has a text box that displays the position of the disk on the line as a percentage.
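The percentage shown in the text box can be derived from the click coordinate and the two end points of the line. This is only a sketch of that arithmetic under the assumption of a horizontal line; the function and parameter names are hypothetical and not part of DIPY.

```python
def slider_percentage(click_x, line_start_x, line_end_x):
    """Map the x coordinate of a click to a position on the line, in percent."""
    length = line_end_x - line_start_x
    if length <= 0:
        raise ValueError("line_end_x must be greater than line_start_x")
    # Clicks outside the line are clamped to its nearest end.
    clamped = min(max(click_x, line_start_x), line_end_x)
    return 100.0 * (clamped - line_start_x) / length
```

For a line running from x = 100 to x = 300, a click at x = 150 places the disk a quarter of the way along.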

This is how it looks right now.

[GIF]

It currently only works with a click.

### Enhancements and Improvements

These are things left to be done:

• Explore ways to make a circular slider
• Make it draggable
• Left key press should also increase the slider

And other enhancements to make it more futuristic. This is a line slider; there can be many more kinds, and they need to be explored too.

### This Week

I’ll be continuing my work on this slider and incorporating the above points.

### Sheikh Araf (coala)

#### Eclipse Plug-in: Dynamic menu with click listener

I’m writing an Eclipse plug-in for coala and recently I had to add a list of available bears so that the user can choose the bear to use for code analysis.

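I struggled for a few days, experimenting with different approaches. In this post I’ll share the method that I finally settled on.

First you have to add a dynamic menuContribution item to the org.eclipse.ui.menus extension point. So your plugin.xml should look like this:

    <extension point="org.eclipse.ui.menus">
      <!-- locationURI, class and id are placeholders -->
      <menuContribution locationURI="popup:org.eclipse.ui.popup.any">
        <menu label="Run coala with">
          <dynamic class="com.example.BearMenu" id="com.example.bearMenu">
          </dynamic>
        </menu>
      </menuContribution>
    </extension>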


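The class field points to the Java class that populates the dynamic menuContribution item. This class should extend org.eclipse.jface.action.ContributionItem. So let’s implement this class. The code looks like the following:

    import org.eclipse.jface.action.ContributionItem;
    import org.eclipse.swt.SWT;
    import org.eclipse.swt.events.SelectionAdapter;
    import org.eclipse.swt.events.SelectionEvent;
    import org.eclipse.swt.widgets.Menu;
    import org.eclipse.swt.widgets.MenuItem;

    public class BearMenu extends ContributionItem {

        public BearMenu() {
        }

        public BearMenu(String id) {
            super(id);
        }

        // Called by Eclipse to populate the menu.
        @Override
        public void fill(Menu menu, int index) {
            String[] bears = getBears();

            for (final String bear : bears) {
                MenuItem menuItem = new MenuItem(menu, SWT.PUSH, index);
                menuItem.setText(bear);
                menuItem.addSelectionListener(new SelectionAdapter() {
                    @Override
                    public void widgetSelected(SelectionEvent event) {
                        System.out.println("You clicked " + bear);
                        // Analyze with coala here.
                    }
                });
            }
        }

        private String[] getBears() {
            // Returns an array containing names of bears.
            // Placeholder value; the real list would come from coala.
            return new String[] { "PyLintBear" };
        }
    }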


This way you can dynamically add items to the menu and assign them click listeners.

There are other approaches as well, e.g. a dynamic menu with a corresponding handler, etc. but I find this approach to be the easiest.

### liscju (Mercurial)

#### Coding Period - IV Week

The most important goal of this week was to make HTTP redirection reuse Mercurial's existing code for communicating over HTTP. Last week the redirection used the httplib library directly, but there were a couple of reasons to reuse the existing code. First, Mercurial can communicate with either httplib or httplib2, depending on which library is available. Second, there is existing code for things like authentication, connection reuse, etc. In Mercurial, communication with an HTTP peer is done mainly in:

https://selenic.com/hg/file/tip/mercurial/httppeer.py

The main function for sending a request and getting the response is _callstream:

https://selenic.com/hg/file/tip/mercurial/httppeer.py#l92

The goal was accomplished and redirection reuses this communication.

The other goal of this week was to send and receive data in chunks when communicating with the HTTP peer. Because largefiles is designed to deal with really large files, sending or receiving the data all at once leads to errors over the network. I found this already implemented as httpsendfile, a file descriptor created for dividing data into chunks. It can be passed to the HTTP request builder, and that's all that is needed for sending chunked data. You can see how it looks here:

https://bitbucket.org/liscju/hg-largefiles-gsoc/src/f34101d68fcc5b5e8fc7cf3d4727a9e2e08e599d/hgext/largefiles/redirection.py?at=default&fileviewer=file-view-default#redirection.py-241

Splitting the file stream from the server into chunks was also already implemented; this functionality is located in util.filechunkiter. Its parameters are the stream, the chunk size and the length.
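The idea of a chunk iterator with those three parameters can be sketched in a few lines. This is not Mercurial's util.filechunkiter, only an illustration of the technique; the name iter_chunks and the default chunk size are arbitrary choices.

```python
import io


def iter_chunks(stream, size=65536, limit=None):
    """Yield pieces of at most `size` bytes read from `stream`.

    If `limit` is given, stop after that many bytes in total.
    """
    remaining = limit
    while True:
        wanted = size if remaining is None else min(size, remaining)
        if wanted <= 0:
            break
        data = stream.read(wanted)
        if not data:
            break
        if remaining is not None:
            remaining -= len(data)
        yield data


chunks = list(iter_chunks(io.BytesIO(b"abcdefghij"), size=4))
limited = list(iter_chunks(io.BytesIO(b"abcdefghij"), size=4, limit=6))
```

Because the data is produced lazily, the whole file never has to be held in memory at once.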

Another thing I did this week was to enable generating the redirection URL dynamically by a user-provided application. I decided to use hooks for this. A hook is an external application that is run when the repository performs certain actions. For example, you can send an email on commit with hooks. For a better description, take a look here:

https://www.mercurial-scm.org/wiki/Hook

For this project, we expect the hook (external application) to generate the redirection target and write it to the .hg/redirectiondst file, from which the feature will read it. To see how this works, take a look here:

https://bitbucket.org/liscju/hg-largefiles-gsoc/src/f34101d68fcc5b5e8fc7cf3d4727a9e2e08e599d/hgext/largefiles/redirection.py?at=default&fileviewer=file-view-default#redirection.py-90
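As a rough illustration of the hook side, an external script only needs to write the target into that file. The sketch below assumes the file holds the URL as plain text, which the post does not specify; the function names and the example URL are hypothetical, and a temporary directory stands in for a real repository.

```python
import os
import tempfile


def write_redirection_target(repo_root, url):
    """What a hook could do: store the generated target in .hg/redirectiondst."""
    target_file = os.path.join(repo_root, ".hg", "redirectiondst")
    os.makedirs(os.path.dirname(target_file), exist_ok=True)
    with open(target_file, "w") as handle:
        handle.write(url)
    return target_file


def read_redirection_target(repo_root):
    """What the feature could do afterwards: read the target back."""
    with open(os.path.join(repo_root, ".hg", "redirectiondst")) as handle:
        return handle.read().strip()


# Demonstration in a throwaway directory instead of a real repository.
with tempfile.TemporaryDirectory() as repo_root:
    write_redirection_target(repo_root, "https://files.example.com/largefiles")
    target = read_redirection_target(repo_root)
```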

Apart from the project, I sent a patch to pull a bookmark with 'pull -B .'; it is already merged here:

https://selenic.com/hg/rev/113d0b23321a

Until now, largefiles asked twice when cloning a repository with largefiles; my patch to deal with this was merged as well:

https://selenic.com/hg/rev/fc777c855d66

### Pulkit Goyal (Mercurial)

#### Absolute and Relative Imports

While switching from Python 2 to 3, there are a few more things you should care about. In this blog post we will be talking about absolute and relative imports.

### Prayash Mohapatra (Tryton)

The last two weeks have been great. I am finally enjoying both reading and writing code. I realise that most of the problems I had when I was stuck could be solved by taking a break and reading the code calmly. There were times when I just sat back, opened two panes, one on the left and one on the right, and read the code over and over again until I understood what I was doing wrong.

These two weeks I have been working on completing the views for the web client and made the action buttons functional, keeping the code similar to the GTK client. We discussed which CSV parsing library to use for sao and ended up choosing PapaParse. Later I extended PapaParse to support a custom quote character, which is a feature we support in the GTK client. And the contribution was merged upstream :D
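To show what a custom quote character means in practice, here is a small sketch using Python's standard csv module rather than PapaParse. The sample data is made up: fields are wrapped in single quotes instead of the usual double quotes, so the comma inside the first field must not split it.

```python
import csv
import io

# Made-up import file that quotes fields with ' instead of ".
raw = "'Doe, John',42\n'Roe, Jane',37\n"

# quotechar tells the parser which character wraps a field.
rows = list(csv.reader(io.StringIO(raw), quotechar="'"))
```

With the default quote character the same input would be split into three columns per row.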

I did some refactoring, which I previously felt wasn’t necessary, and re-implemented some parts of the views in a better manner. The best thing I learnt during all this was using Chrome Dev Tools to map the source files, so that instead of a refresh followed by 9 clicks every time I make a change, I need just 3 clicks! I am also using breakpoints to see the values floating around in a function.

This is what I have achieved so far. I am currently working on the auto-detection of multi-level field names from an import file.

### Preetwinder (ScrapingHub)

#### GSoC-2

Hello,
This post continues the updates on my work porting frontera to Python 2/3 dual support.
The first coding phase is about to end. The task I had to accomplish during this part was Python 3 support for single process mode. I have completed this task and will soon be making my pull requests. First I had to make the syntactic changes that don’t actually change the code’s operation, just the syntax, with the same effect. I did this using the modernize script, which uses the six library to make the code work in both versions. After that I made some more syntactic changes that the modernize script is unable to cover (things like changed class variable names, etc.). The next step was to define a precise data model for the type of strings (unicode or bytes) to be used in the API, and the necessary conversions to be performed in different parts of the code. For this I have mostly followed the approach of using native strings (unicode in Python 3 and bytes in Python 2) everywhere. After these changes I proceeded to make all the test cases work in both Python 2 and 3. I was mostly successful in this, apart from tests related to the distributed mode (which I am yet to work on) and a pending URL issue that hasn’t been addressed yet. Once I make the PRs I am sure I’ll have to address a few more issues, but apart from that this part of my work is mostly done.
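The native string approach usually comes down to one conversion helper applied at the API boundary. The following is a generic sketch of such a helper, not frontera's code; the name to_native_str and the default encoding are assumptions.

```python
import sys

PY2 = sys.version_info[0] == 2


def to_native_str(value, encoding="utf-8"):
    """Return `value` as the native str type of the running interpreter:
    bytes on Python 2, unicode text on Python 3."""
    if PY2:
        # The name `unicode` only exists on Python 2; this branch never runs on 3.
        if isinstance(value, unicode):  # noqa: F821
            return value.encode(encoding)
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


converted = to_native_str(b"caf\xc3\xa9")
```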

GSoC-2 was originally published by preetwinder at preetwinder on June 24, 2016.

### Aron Barreira Bordin (ScrapingHub)

#### Scrapy-Streaming [3] - Binary Encoding and Error Handling

Hi! In the third week of the project, I implemented the from_response_request message.

This allows external spiders to create a request using a response from another request.

## Binary Responses

To be able to serialize binary responses into json messages, such as images, videos, and files, I added the base64 parameter to the request message.

Now, external spiders are able to download and check binary data using scrapy streaming.
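JSON cannot carry raw bytes, which is why a base64 step is needed. The sketch below shows the general technique only; the message layout and every field name other than base64 are hypothetical and not taken from the Scrapy Streaming protocol.

```python
import base64
import json


def encode_response(url, body):
    """Wrap a binary body in a JSON message by base64-encoding it."""
    return json.dumps({
        "type": "response",   # hypothetical field
        "url": url,           # hypothetical field
        "base64": True,
        "body": base64.b64encode(body).decode("ascii"),
    })


def decode_body(message):
    """Recover the original bytes on the receiving side."""
    data = json.loads(message)
    if data.get("base64"):
        return base64.b64decode(data["body"])
    return data["body"].encode("utf-8")


png_header = b"\x89PNG\r\n\x1a\n"
message = encode_response("http://example.com/logo.png", png_header)
restored = decode_body(message)
```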

## Error Handling

I’ve implemented the exception message, which catches internal exceptions and sends them to the external spider.

There are two kinds of issues: errors and exceptions.

Errors are raised when there is a problem in the communication channel, such as an invalid request, an invalid field, and so on.

Exceptions represent problems in the Scrapy Streaming runtime, such as an invalid URL to request, invalid incoming data, etc.

## Docs and PRs

These modifications have been documented in the docs PR: https://github.com/scrapy-plugins/scrapy-streaming/pull/7

And the modifications to the communication channel can be found at: https://github.com/scrapy-plugins/scrapy-streaming/pull/5

## Examples

I’ve added new Scrapy Streaming examples here: https://github.com/scrapy-plugins/scrapy-streaming/pull/4

These examples may help new developers implement their own spiders in any programming language; each example shows a basic feature of Scrapy Streaming.

Scrapy-Streaming [3] - Binary Encoding and Error Handling was originally published by Aron Bordin at GSoC 2016 on June 23, 2016.

### Levi John Wolf (PySAL)

#### Partially Applied classes with __new__

Python’s got some pretty cool ways to enable unorthodox behavior. For my project, I’ve found myself writing a lot of closures around our existing class init functions, and have decided it might be easier & more consistent to express this as what it really is: partial application.

Partial application is pretty simple to enable for Python class constructors, since the separate __new__ method allows you to construct closures around initialization routines.

Since embedding math & code directly has been a pain on Tumblr recently, I’ll just link to the example notebook and (eventually) move this blog to gh-pages.
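Until the notebook is at hand, here is one way the idea can be sketched. It is an assumption about the general technique, not the code from the notebook: the Regression class and its arguments are hypothetical. When too few positional arguments are supplied, __new__ returns a functools.partial instead of an instance, and because that object is not an instance of the class, __init__ is skipped.

```python
import functools


class Regression:
    """Hypothetical model class that needs two data arguments, y and X."""

    def __new__(cls, *args, **kwargs):
        # Fewer than two positional arguments: return a partially applied
        # constructor. (Data passed by keyword is not counted in this sketch.)
        if len(args) < 2:
            return functools.partial(cls, *args, **kwargs)
        return super().__new__(cls)

    def __init__(self, y, X, method="ols"):
        self.y = y
        self.X = X
        self.method = method


# Fix the option now, supply the data later.
gls_later = Regression(method="gls")
model = gls_later([1.0, 2.0], [[1.0], [2.0]])
```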

# Current Results

I’ve made a great deal of headway in the last two weeks. The distributed estimation code is up and running correctly, as is the necessary testing code. I rearranged the format somewhat: fit_distributed is no longer called within fit_regularized; instead it is now part of an entirely separate module. The PR has a good summary of the current setup:

https://github.com/statsmodels/statsmodels/pull/3055

# To Do

There is still work to be done before I move on to the inference portion, but things are getting much closer. First, I need to fix the GLM implementation and implement the WLS/GLS version. Second, I need to work on putting together more examples. Currently I have an example using simulated data and OLS, but it would be good to expand these.

### jbm950 (PyDy)

#### GSoC Week 5

Well, I started this week off by getting sick, and as such productivity took a little bit of a hit. The majority of this week was spent reading Featherstone’s textbook. The example documentation showcasing the base class API still hasn’t been reviewed, so that part of the project will just have to be set aside until later. Overall the project will not suffer, however, because of my progress in learning Featherstone’s method.

I’ve done a couple of other things this week as well. In a discussion with Jason it was determined that the LagrangesMethod class must have access to the dynamic system’s bodies through the Lagrangian input. Upon research, the Lagrangian turned out to be not an object instance but rather a function that simply returns the Lagrangian for the input bodies. This meant that LagrangesMethod did not in fact have access to the dynamic system’s bodies. Because of this, I decided that an easy way to give LagrangesMethod body information would be to add an optional keyword argument for it. This way LagrangesMethod can have an API more similar to KanesMethod. This change can be found in PR #11263.

This week I reviewed PR #10856, which claimed to fix Issue #10855. Upon review it seemed that the “fix” was to just not run the tests that were failing. On further investigation, it looks like a whole module has not been updated for Python 3.X and is failing on its relative imports. When run in Python 2.X it still doesn’t work either, but instead throws many KeyErrors. I think this has not been caught sooner because the module deals directly with another project (pyglet), and thus the tests are not run by TravisCI.

Lastly there were some test errors in the example documentation for the base class on PyDy. I was not too worried about these because the PR is not currently awaiting merging and is simply a discussion PR. The failing tests, however, were not related to the changes in the PR and so a PyDy member submitted a PR that fixed the tests and asked me to review it. After I looked it over and determined that the fix addressed the issue correctly he merged the PR.

### Future Directions

Next week I plan to continue forward with reading Featherstone’s book and, if possible, begin implementing one of the methods outlined in the book. Also I plan on beginning work on mirroring Jason’s overhaul of KanesMethod on LagrangesMethod.

### PR’s and Issues

• (Open) Added support for a bodies attribute to LagrangesMethod PR #11263
• (Open) Added a dependency on older version of ipywidgets PR #100
• (Open) Blacklisted pygletplot from doctests when pyglet is installed PR #10856
• (Open) sympy.doctest(“plotting”) fails in python 3.5 Issue #10855
• (Merged) Fix multiarray import error on appveyor PR #354

## June 23, 2016

### What’s done

In the time that has passed since I last posted one of these, I managed to get a prototype of scrapy to work using the new Signals API. This introduced two very significant API changes into Scrapy.

• All Signals now need to be objects of the scrapy.dispatch.Signal class instead of the generic python object
• All signal handlers must now receive **kwargs
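To make the two rules concrete, here is a stripped-down dispatcher in the style of Django's signals. It is only a sketch: this Signal class is a toy written for illustration, not scrapy.dispatch.Signal, and the handler and argument names are hypothetical.

```python
class Signal:
    """Toy signal object: keeps a list of receivers and calls each of them
    with keyword arguments only."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **named):
        # Every receiver gets the signal, the sender and all named arguments.
        return [
            (receiver, receiver(signal=self, sender=sender, **named))
            for receiver in self._receivers
        ]


# Rule 1: the signal is an instance of the Signal class, not a bare object().
spider_opened = Signal()


# Rule 2: the handler accepts **kwargs, so arguments it does not name
# (here: signal) are absorbed without any argument inspection.
def on_spider_opened(sender, spider, **kwargs):
    return "opened %s" % spider


spider_opened.connect(on_spider_opened)
results = spider_opened.send(sender="crawler", spider="quotes")
```

Because every handler tolerates extra keyword arguments, the dispatcher can call it directly instead of first inspecting its signature.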

The first change would not affect the existing extensions/third-party plugins much, since declaring new signals is not something extensions do for the most part, and using PyDispatcher to call the signals instead of the SignalManager class has long been deprecated in Scrapy. To accommodate this, the Scrapy SignalManager has not yet been phased out and will remain functional, although it may be deprecated depending on how the performance benchmarks work out and whether avoiding the overhead of its method calls is required.

The second of these changes, however, affects the majority of these extensions and requires that we accommodate them in some way. The solution required bringing the robustApply method of PyDispatcher into Scrapy. This method, however, considerably affects the performance of the module, so in order to get the faster signals one is required to use handlers that accept keyword arguments.

The API was also modified to allow Twisted Deferred objects to be returned, and the error handling was changed to use the Failure class from twisted.python.failure.

The new module has also been unit tested for the most part, with some tests borrowed from Django since they’re the original authors of this signals API. I’m currently working on the benchmark suite, writing spiders that use non-standard signals and calls to test the performance. Earlier, signaling through send_catch_log used to be the biggest bottleneck, requiring 5X the time required for HTML parsing. Any improvement we can make on that is welcome, although ideally I would like signals to no longer be the bottleneck of crawling.

The following section is under construction.

### What needs to be done

Following the midterms, the highest priority is to complete the benchmark suite so we know the viability of the approach we have used thus far and where to proceed from here. If the results obtained are satisfactory, we shall continue making backward compatibility fixes and rewriting algorithms that are still not as efficient as they can be, looking to maximize performance. We can go on to provide full backward compatibility with object()-like signals, but that would come with the trade-off that their performance would be more or less the same as what was previously achievable with the API.

Another major requirement is for me to write good documentation for these parts, since they are essential to anybody writing an extension. We also need to be on the lookout for regressions, if any.

~ Avishkar

### Aakash Rajpal (italian mars society)

#### Midterms are here!

Hey all, midterms are coming this week and the first part of my proposal is about done. Some documentation work is all that remains.

Well, my first part involved integrating the Leap Python API and the Blender Python API. Initially it was tough to integrate them, as the Leap API is designed for Python 2.7 whereas Blender only supports Python 3.5. Since I couldn’t integrate them directly at first, I came up with another solution: sending data from the Python 2.7 script to the Blender Python API over a socket connection. This worked well, but it was somewhat inefficient and slow.
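The socket workaround amounts to serializing each tracking frame on one side and parsing it on the other. The sketch below shows that pattern inside a single process using socket.socketpair; in the setup described above the two ends would live in separate Python 2.7 and Python 3.5 processes. The frame fields and function names are hypothetical.

```python
import json
import socket


def send_frame(sock, frame):
    """Sender side: one JSON document per line."""
    sock.sendall(json.dumps(frame).encode("utf-8") + b"\n")


def recv_frame(reader):
    """Receiver side: read one line and parse it back into a dict."""
    return json.loads(reader.readline().decode("utf-8"))


# Two connected sockets stand in for the two scripts.
sender, receiver = socket.socketpair()
reader = receiver.makefile("rb")

send_frame(sender, {"hand": "right", "palm": [0.1, 0.2, 0.3]})
frame = recv_frame(reader)

sender.close()
reader.close()
receiver.close()
```

Every frame pays for serialization, a system call and parsing, which is one plausible source of the slowness mentioned above.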

Hence I thought I would try to generate a Python 3.5 wrapper for the Leap SDK. This sounded easy but was anything but. I found little support online; there were pages meant to help, but most of them were for Windows and very few for Ubuntu. However, after days of trying, I found a solution with a little help from the community and was able to generate a Python 3.5 wrapper for the Leap SDK through SWIG. The wrapper worked perfectly fine, and thus I successfully integrated the two APIs.

I talked to my mentor about the gesture support required for the project and added support for more gestures.

I am now documenting my code and preparing myself for the second part of my project.

### Ravi Jain (MyHDL)

#### Maintain A Clean History!

Finally I have completed and merged the management module. When I last posted, the things I still needed before merging were to add the docstrings, set up Coveralls, and resolve conflicts with the master branch (rebase).

Adding docstrings was the easiest task, but it still took time as it gets a little boring (duh!). I used this example provided by my mentor as a reference.

Next came the Coveralls setup, which I must say is a little more complex than the others. I got a lot of help from referencing an already set up repo, test_jpeg, on which a fellow GSoCer is currently working. It got a little tricky in between as I stumbled upon an end-of-line character problem. Before this I didn’t even know that the type of line ending can cause problems when running scripts. It consumed one whole day. It bit me when I was editing the file in Notepad on Windows. This post later helped me get over it. More on the Coveralls setup in my next post!
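The line-ending problem comes from Windows editors saving CRLF where scripts run on Linux expect LF. One way to repair such a file is to normalize the bytes, as in this sketch; the function name is hypothetical and this is not necessarily the fix used here.

```python
def to_unix_line_endings(data):
    """Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


windows_script = b"#!/bin/sh\r\necho hello\r\n"
fixed = to_unix_line_endings(windows_script)
```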

Next came rebasing and resolving the conflicts between my dev branch and the master branch. When I started, my master branch was a few commits ahead (setting up of badges) and thus had conflicts. A rebase was also required, as my mentors suggested maintaining a clean history in the main branch. It took me a lot of experiments to finally understand how to rebase my branches. The structure of my repo:

• origin/master
• origin/dev
• dev

So I have a local dev branch in which I develop my code and constantly push to the remote origin/dev branch for code reviews by my mentor. This leads to a lot of commits containing small changes and fixes for silly issues. But when I make a pull request and merge onto the origin/master branch, I wish to have a cleaner commit history.

An interactive rebase helps to modify that history using pick (keep the commit), squash (merge into the previous commit while editing the commit description), and fixup (merge into the previous commit, keeping the previous commit description intact). Understanding this required a lot of experiments with my branch, which is dangerous. So I made a copy of my dev branch, which I suggest you do right now before continuing.

To rebase your local branch onto the origin/master branch, use “git rebase -i <base branch>”. Warning: avoid moving commits up or down if they work on the same file. This may cause conflicts. Once it starts, resolving conflicts is a lot of pain because it triggers other conflicts as well if not done properly.

After rebasing comes the trickier part. Your local branch has brand new rebased commits and your remote has the old commits. You need to use “git push --force”. It will overwrite the commits on the remote branch, after which you can open a pull request onto origin/master. Don’t do it if there are other branches based on this branch. In that case, merge directly onto master; the downside is that you won’t be able to make a pull request, which is essential for code discussions.

After all this my code was ready to merge and I got the go-ahead (after a day of internet cut, hate that) from my mentor to merge it. So I had finally completed the second merge onto my main branch, implementing the management block and setting up Coveralls.

### Raffael_T (PyPy)

#### Progress summary of additional unpacking generalizations

Currently there's only so much to tell about my progress. I fixed a lot of errors and made quite a bit of progress on the unpacking task. The problems regarding the AST generator and handler are solved. There's still an error when trying to run PyPy, though. Debugging and looking for errors is quite a tricky undertaking because there are many ways to check what's wrong. What I have already done, though, is check and review the whole code for this task, and it is as good as ready as soon as that (hopefully) last error is fixed. I will probably track it down by comparing the bytecode instructions of CPython and PyPy, but I still need a bit more info.

As a short description of what I implemented: until now the order of arguments allowed in function calls was predefined. The order was: positional args, keyword args, * unpack, ** unpack. The reason for this was simplicity, because people unfamiliar with the concept might get confused otherwise. Now unpackings may appear multiple times and be mixed with other arguments (as described in PEP 448), setting aside the earlier concern about confusion. So what I had to do was check the arguments for unpackings manually, first going through the positional and then the keyword arguments. Of course some sort of priority has to stay intact, so it is defined that "positional arguments precede keyword arguments and * unpacking; * unpacking precedes ** unpacking" (PEP 448). Pretty much all changes needed for this task are implemented; there's only one more fix and a bytecode instruction (map unpack with call), which is less important than the others, left to be done.
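A call that was a syntax error before PEP 448 and is valid from Python 3.5 on illustrates the change. The helper function collect is made up for this example.

```python
def collect(*args, **kwargs):
    """Return the arguments exactly as the callee receives them."""
    return args, kwargs


# Several * and ** unpackings in one call, mixed with ordinary
# positional and keyword arguments.
args, kwargs = collect(*[1, 2], 3, *(4,), **{"a": 1}, b=2, **{"c": 3})
```

The positional values arrive in the order written, and the keyword arguments are merged into one dictionary.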

As soon as it works, I will write the next entry in this blog. Next in line are asyncio coroutines with the async and await syntax.

### aleks_ (Statsmodels)

#### Hello testing!

Hello everyone!

Today I am going to share with you a neat little trick I learned during the first weeks of GSoC. I will show it in the form of a small example that clarifies its use. You can also find a short description in the NumPy/SciPy Testing Guidelines.

Let's say you have a function calculating different things and returning them as a dictionary (e.g. the function is estimating different parameters, say alpha and beta, of a statistical model):

    def estimate(data, model_assumption):
        # do some calculations
        # alpha_est = ...
        # beta_est = ...
        return {'alpha': alpha_est, 'beta': beta_est}
Now we want to test this function for different data sets and different model assumptions. To do this we first create a separate test file called test_estimate.py. Inside this file we place a setup() function which takes care of loading the data and results:

    datasets = [d1, d2]
    model_assumptions = ['no deterministic terms', 'linear trend']
    results_ref = {}  # dict holding the results of the reference software
    results_sm = {}   # dict holding the results of our program
                      # (sm stands for statsmodels)

    def setup():
        for ds in datasets:
            load_data(ds)  # read in the data set
            results_ref[ds] = load_results_ref(ds)  # parse the reference's output
            results_sm[ds] = load_results_statsmodels(ds)  # calculate our results
Now that all results are accessible via the results_XXX dictionaries they only need to be compared. Note that the load_results_XXX(ds) functions return dictionaries such that results_XXX[ds] is a dictionary as well. It is of the form

    {'no deterministic terms': results_no_deterministic_terms,
     'linear trend': results_linear_trend, ...}
and results_model_assumption is again a dict looking like {'alpha': alpha_est, 'beta': beta_est}.

Phew, this probably sounds a little complicated. So, why all these nested dictionaries? Well, it makes the actual testing very easy. To check whether our result for alpha is the same as in the reference software, we just do (assuming alpha is a numpy array):

    def test_alpha():
        for ds in datasets:
            for ma in model_assumptions:
                err_msg = build_err_msg(ds, ma, "alpha")
                obtained = results_sm[ds][ma]["alpha"]
                desired = results_ref[ds][ma]["alpha"]
                yield assert_allclose, obtained, desired, rtol, atol, False, err_msg
This code will now produce tests for alpha for all combinations of data sets and model assumptions. So by adding data sets or model assumptions to the corresponding list, the generated tests multiply, resulting in a nice set of tests. Failing tests can easily be identified by the error message given to numpy's assert_allclose, which mentions the data set, the model assumption and the parameter that isn't calculated correctly. If you have questions regarding this method of testing, check out the NumPy/SciPy Testing Guidelines or leave a comment.

With that, thanks for reading! : )

## Markov switching autoregression

If you studied statistics and remember the basics of time series analysis, you should be familiar with the autoregressive model, usually denoted as AR(p):
Here y is an AR process, e is a white noise term, and nu is the mean of the process. Phi is a polynomial of order p:
L is the lag operator, which, applied to a time series element, gives the previous element. So (1) can actually be rewritten in the following explicit form:
Since the process definition (1) is essentially a linear equation relating lagged values of the process and the error, it can be put in state space form, as shown in [1], chapter 3.3.
Again, let's extend equation (1) by adding an underlying discrete Markov process St of changing regimes:
You can see that the mean, error variance, and lag polynomial become dependent on the switching regime value. This is the so-called Markov switching autoregressive model (MS AR). Where can it be used in practice? Let's look at the example from [2], chapter 4.4, which I also used for testing my code:
This is a series of U.S. real GDP. Looking at the data, two regimes are noticeable: expansion and recession. Using maximum likelihood estimation, we can fit a two-regime switching-mean AR model to this data to describe the dynamics of real GDP quantitatively. The authors use an AR(4) model, and so do we. The next picture displays the smoothed (that is, conditional on the whole dataset) probabilities of being in the recession regime:

Peaks of probability correspond accurately to recession periods, which shows that Markov switching AR is a capable tool for analyzing the underlying structure of a time series.
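For reference, the model described in this section can be written out as follows. This is the standard textbook form implied by the definitions above (y, e, nu, Phi, L, St), not a copy of the original formulas; the normality of the error term in (4) is an additional assumption.

$$\phi(L)\,(y_t - \nu) = e_t \qquad (1)$$

$$\phi(L) = 1 - \phi_1 L - \dots - \phi_p L^p \qquad (2)$$

$$y_t = \nu + \phi_1 (y_{t-1} - \nu) + \dots + \phi_p (y_{t-p} - \nu) + e_t \qquad (3)$$

$$y_t = \nu_{S_t} + \sum_{i=1}^{p} \phi_{i,S_t}\,\bigl(y_{t-i} - \nu_{S_{t-i}}\bigr) + e_t, \qquad e_t \sim N\bigl(0, \sigma^2_{S_t}\bigr) \qquad (4)$$

Equation (3) follows from (1) and (2) because the lag operator leaves the constant nu unchanged. In (4) each lagged term carries the mean of the regime that was active at that lag.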

## Implementation

Markov switching autoregression is implemented in the ms_ar.py file in my PR to Statsmodels. This file contains the MarkovAutoregression class, which extends RegimeSwitchingMLEModel. This class "translates" equation (4) into the state space "language".
It was quite entertaining to express the ideas explained in chapter 3.3 of [1] in Python code. One thing I had to be very careful about was that for an AR(p) model with k regimes, the state space representation has to carry k^(p+1) regimes, since the switching means occur in (4) with different regime indices. Thus, every state space regime represents p+1 lagged AR regimes.
Such a big number of regimes leads to longer computation time, which caused some problems. For example, Kim filtering of the real GDP model above took 25 seconds, which is too slow when we are doing a lot of BFGS iterations to find the likelihood maximum. Luckily I found a way to optimize the Kim filter, which was quite straightforward, in fact. If you remember a previous blog post, a Kim filter iteration consists of a heavy-weight Kalman filter step, where a Kalman filtering iteration is applied many (k^(2(p+1)) for MS AR!) times, followed by summing the results with weights equal to the joint probabilities of being in the current and previous regime. The thing is that with a sparse regime transition matrix, which is the case for the MS AR model, these joint probabilities are rarely non-zero, and we don't need to run Kalman filtering for the zero ones! This change reduced the Kim filter run time dramatically, to 2-3 seconds on my machine (which is not very powerful, by the way).
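The optimization can be sketched independently of the actual filter code. In the sketch below, which is not the Statsmodels implementation, kalman_step is a hypothetical stand-in for the expensive per-pair computation and transition[prev][curr] is taken to be the probability of moving from regime prev to regime curr.

```python
def filter_regime_pairs(transition, prev_probs, kalman_step):
    """Run the expensive step only for regime pairs with non-zero joint
    probability and return (prev, curr, joint, result) tuples."""
    results = []
    for prev, p_prev in enumerate(prev_probs):
        for curr, p_trans in enumerate(transition[prev]):
            joint = p_prev * p_trans
            if joint == 0.0:
                continue  # this pair contributes nothing to the weighted sum
            results.append((prev, curr, joint, kalman_step(prev, curr)))
    return results


calls = []


def fake_kalman_step(prev, curr):
    calls.append((prev, curr))
    return 0.0


sparse_transition = [
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.3, 0.0, 0.7],
]
pairs = filter_regime_pairs(sparse_transition, [0.5, 0.5, 0.0], fake_kalman_step)
```

In this toy case only 4 of the 9 regime pairs trigger the expensive step. For k = 2 and p = 4 the full count would be 2^(2·5) = 1024 pairs per iteration, so skipping the zero ones matters.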

## EM-algorithm

MarkovAutoregression class also has a feature of EM-algorithm. Markov switching autoregressive model, defined by (4), with some approximations, though, is a regression with switching parameters and lagged observations as regressors. Such model, as shown in chapter 4.3.5 of [2], has a simple close-form solution for EM iteration. EM-algorithm is a great device to reach a very fast convergence. For example, in the comments to my PR I copied a debug output with the following numbers:
#0 Loglike - -1941.85536159
#1 Loglike - -177.181731435
Here #0 indicates the likelihood of the random starting parameters, and #1 indicates the likelihood of the parameters after one iteration of the EM-algorithm. A very significant improvement, isn't it?
MarkovAutoregression has two public methods to run the EM-algorithm: fit_em and fit_em_with_random_starts. The first just performs a number of EM iterations from given starting parameters, while the second generates a set of random starting parameters, applies the EM-algorithm to each of them, and finally chooses the result with the best likelihood.
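The random-restart strategy can be sketched in a few lines. This is only an illustration of the idea, not the Statsmodels code: `fit_with_random_starts`, `em_step`, `loglike` and `draw_start` are hypothetical names, and the toy "EM step" below is a stand-in that merely moves the parameter towards the optimum.

```python
import random


def fit_with_random_starts(em_step, loglike, draw_start,
                           n_starts=50, n_iter=10, seed=0):
    """Run n_iter update steps from n_starts random points, keep the best."""
    rng = random.Random(seed)
    best_params, best_ll = None, float("-inf")
    for _ in range(n_starts):
        params = draw_start(rng)
        for _ in range(n_iter):
            params = em_step(params)
        ll = loglike(params)
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll


# Toy stand-ins: a one-parameter "model" whose likelihood peaks at 3.0.
def toy_loglike(x):
    return -(x - 3.0) ** 2


def toy_em_step(x):
    return x + 0.5 * (3.0 - x)  # each step halves the distance to the optimum


best, best_ll = fit_with_random_starts(
    toy_em_step, toy_loglike, lambda rng: rng.uniform(-10.0, 10.0))
```

Each session is cheap because it runs only a handful of iterations; the point of the many starts is to avoid settling in a poor local maximum.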

## Testing

Right now there are two test files for the MarkovAutoregression class, each based on one model: test_ms_ar_hamilton1989.py and test_ms_ar_garcia_perron1996.py. Besides formal functional tests, which check that filtering, smoothing and maximum likelihood estimation give correct values against Gauss code samples, these files test the EM-algorithm in its typical usage scenario, where the user knows nothing about the correct parameters but wants to estimate something close to the global likelihood maximum. This task is handled by the already mentioned fit_em_with_random_starts method, which by default runs 50 sessions of the EM-algorithm from random starts, each session consisting of 10 iterations.

## What's next?

I hope that the hardest part of the project, developing the Kim filter and Markov autoregression, is behind me. Two more models remain: the dynamic factor model and the time-varying parameters model with regime switching. There will also be a lot of refactoring of already written code, so some articles are going to be all about coding.

## Literature

[1] "Time Series Analysis by State Space Methods", Second Edition, by J. Durbin and S.J. Koopman.
[2] "State-space Models With Regime Switching" by Chang-Jin Kim and Charles R. Nelson.

## Week 3

The task of Week 3 was to create a neat little warning message if the user runs a wrong coala bear. What do I mean by wrong? Well, PyLintBear is a wrong bear for Ruby code, right? I wanted to add a warning message if such a thing happens. The main things I needed to do to achieve this were:

• Get a list of all bears.
• Figure out the language of the file (e.g. Python/Java/JavaScript/CSS …).
• Trigger a warning message if the language of the bear being executed doesn’t match the language of file.

The first task was relatively easy, as I had done the same thing earlier for the Week 2 task, coala-bears-create, and the third part is as simple as it can get. The whole difficulty lay in the second part. I explored different tools and libraries like python-magic and mimetype, but none of them were accurate enough.

Language detection is a very difficult task and these tools weren’t able to solve the problem. I also tried to use Linguist, but its Python port wasn’t compatible with Python 3. Seeing all other options fail, I decided to use hypothesist’s dictionary, which maps file extensions to language names. That is fair enough for some basic checking, but it is not going to help in cases like C header files, Python 2 vs. 3, etc.
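An extension-based check of this kind can be sketched as follows. The table is a tiny made-up excerpt, and `guess_language` and `language_warning` are hypothetical helpers for illustration, not functions from coala.

```python
import os

# Hypothetical excerpt of an extension-to-language table.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".rb": "Ruby",
    ".java": "Java",
    ".js": "JavaScript",
    ".css": "CSS",
}


def guess_language(filename):
    """Return the language suggested by the file extension, or None."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_LANGUAGE.get(extension)


def language_warning(bear_name, bear_languages, filename):
    """Return a warning string on a mismatch, or None when nothing is wrong."""
    language = guess_language(filename)
    if language is None or language in bear_languages:
        return None  # unknown extensions never trigger a warning
    return "{} targets {}, but {} looks like {} code.".format(
        bear_name, "/".join(sorted(bear_languages)), filename, language)


print(language_warning("PyLintBear", {"Python"}, "app.rb"))
```

Staying silent for unknown extensions keeps the check from producing false alarms, but it also shows the limitation mentioned above: a `.h` file says nothing about whether it holds C or C++.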

But I had to start somewhere, and out of all the options this was the most feasible, so I went ahead with it and began coding. However, as it turned out in the discussion on the issue, language detection is not that accurate, and it is more likely to fail than succeed. So the status of the issue has been changed to not happening/won'tfix.

• The PR for the same can be seen here https://github.com/coala-analyzer/coala/pull/2310
• You can still see it in action, if you want:

https://asciinema.org/a/48487.png

## Week 4

The task of Week 4 was to clean up everything I had done so far, as the MidSem evaluations were approaching. I hadn’t had any of my work reviewed until then and I was in a mess that week. With the deadline approaching fast, I had to clean up my stuff, do bug testing and get my work reviewed by my mentors and co-mentors. The review took 4 long days, as there were around 6 commits and most of the code had to be redone. Some of the changes made during the review process are:

• Dynamically generating the list of bears rather than using a static list. Also used @lru_cache() to cache the results for some performance improvements.
• Removed a few extra prompts, such as the result message and prerequisite fail message, which were just optional values. The template should be kept to the bare minimum needed, so going with that philosophy it was decided to remove them.
• Support for multiple languages added in the dropdown.
• Logging exceptions in a better way by using logging module.
• Add some doctests, wherever possible.
• Use coala API to include StringConverter for easy str to set conversions.
• Add a gitlab-ci for triggering automatic builds running on Gitlab.
• General code cleanup, Minor bug fixing, refactoring some variables and formatting changes.

With great support from my mentor and co-mentors, I have got my work accepted into the master branch. A sigh of relief, it was for me! Also, you can check it out with pip install coala-bears-create and let me know your feedback.

Watch it in action here: https://asciinema.org/a/49693.png

I have taken away a huge lesson from my mistake of not getting my work reviewed earlier; early review saves everyone’s time and avoids last-minute panic attacks over code not working. So that this doesn’t happen again, I plan to keep a devlog in which I’ll update my work daily, and I will also get a mentor or someone from the coala community to review every week’s work.

## Future plans

I have started working on some UI changes in the coala application using Python Prompt Toolkit, and I am currently researching that. I am also reading up on unit tests, as I have to add them to my coala-bears-create application.

Happy Coding!

### fiona (MDAnalysis)

#### Timeline update and a sweet treat

Hello again! Sorry for the long wait since last post – I’m falling behind my proposed timeline so I’ve been focusing more on trying to catch that up a bit.

I’ve finished work on ‘add_auxiliary’ for now - the general framework and specific case for reading xvg files are basically done, though we’ll see in the coming stages if there are any bugs still to iron out. I’ll probably include in a future post a demonstration of the various features!

Let’s take a look at a revised timeline:

You can see I’m not quite as far along as I was hoping back at the start - building all the add_auxiliary stuff took longer than I expected! I also did some things I originally planned to do later, but in retrospect made more sense to do now. This includes in particular getting the documentation and unit tests nicely done up for AuxReader, and now that I have a good idea how both work, documentation and testing for future parts will hopefully go a lot quicker and smoother! I’m hoping I can stay roughly on track from here - but various bits can be simplified/dropped if need be, and I should still have a nice foundation, which can be further built upon later, by the end of GSoC.

So what next? I’m starting working on part 2 - an ‘Umbrella class’, a framework for storing the trajectories and associated information of a set of Umbrella Sampling simulations. The next post will focus on my plan for this in more detail - but before I leave, something a bit different!

### And now for something completely different

All the way back in my first post I mentioned I’m a keen baker. To make consuming sugary snacks even more exciting, I’ve done in the past a couple of ‘edible voting competitions’ within my research group – SBCB, at the University of Oxford – to decide important matters like the fate of New Zealand’s flag design by eating cookies.

I thought this time I’d try something relevant to my GSoC project, so may I present:

Let’s meet our contestants (the selection of which was heavily biased towards the tools that I use and had ‘logos’ I could reasonably approximate with coloured fondant):

1. GROMACS: Software for performing and analysing Molecular Dynamics (MD) simulations.

2. VMD (Visual Molecular Dynamics): A program for visualising and analysing molecular structures and simulation trajectories.

3. Tcl (and Tk): A programming language; the command-line interface for VMD uses Tcl/Tk, so writing Tcl scripts lets us automate loading and analysing structures/trajectories in VMD.

4. Python: another programming language – as you’re hopefully aware by now, since it’s what I’m using for my GSoC project! In SBCB, it is often used when writing scripts to automate setup, running and/or analysis of simulations when direct interaction with VMD isn’t required.

5. MDAnalysis: A Python library for loading and analysing MD simulations – again, as you hopefully know by now, since it’s what I’m working on for GSoC!

6. Git: A version control system, allowing you to keep track of changes to a set of files in a ‘repository’ (so when you find everything is broken after you made a bunch of changes, you don’t have to spend ages tracking them all down to revert to the working version).

7. GitHub: A web-based hosting service for Git repositories, allowing sharing of code and collaboration on projects. MDAnalysis is there, I have a page too, and it’s where I’ve been pushing all my proposed changes/additions (see add_auxiliary here) so other people can check them over!

(There are alternatives for each of these that perform more or less the same functions – but the above are those largely used in SBCB).

We started off with three of each ‘logo’, and the rule was that each time you took a biscuit, you took the one you use the least, like the least, or have never even heard of – ideally leaving us with SBCB’s favourite MD tool!

So how did the vote go? (*drumroll*)

So congratulations Python – you’re the best tool for MD*, as voted by SBCB!

(*Disclaimer: unfortunately this otherwise highly rigorous ‘scientific study’ was somewhat biased by the participation of several non-MD or non-computational personnel and, on a couple of occasions, disregard for the rules in favour of the cookie that was closest or most aesthetically pleasing.)

And on that triumphant note, I’ll sign off here, and see you all next post! In the meantime, if you’re disappointed you didn’t get to eat an (unofficial) MDAnalysis cookie, why not go buy an (official, though inedible) MDAnalysis sticker to show off instead?

## June 22, 2016

### Aakash Rajpal (italian mars society)

#### Midterms coming and I am good to go

Hey all, midterms are coming this week and the first part of my proposal is about done. Some documentation work is all that remains.

My first part involved integrating the Leap Python API with the Blender Python API. Initially it was tough to integrate them, as the Leap API is designed for Python 2.7 whereas Blender only supports Python 3.5. After days of trying, and with a little help from the community, I found a solution: generating a Python 3.5 wrapper for the Leap SDK through SWIG. The wrapper worked perfectly, and thus I successfully integrated the two APIs.

I talked to my mentor about the gesture support required for the project and added more gesture support.

I am now documenting my code and preparing myself for the second part of my project.

### tushar-rishav (coala)

#### Beta release

So finally we released the coala-html beta version. At present, coala-html generates an interactive webpage using the results obtained from coala analysis. Users can search across results, browse the files, and see the particular code lines where errors were produced, similar to a coverage tool that displays the lines being missed. At present we support only the Linux platform and will add more cool features in coming releases.

We would love to hear from you. If you have any feature proposals or if you find any bugs, please let us know.

Now, with coala-html released I’ve started working on coala website. Further updates in next blog!
:)

### TaylorOshan (PySAL)

#### Testing...one...two...

In the last week or so I have refined existing unit tests and added new ones to extend coverage to all of the user classes (Gravity, Production, Attraction, and Doubly) as well as to the BaseGravity class. Instead of testing every user class for every possible parameterization, the BaseGravity class is tested over different parameterizations, so that the tests for the user classes primarily focus on the building of dummy variables for the respective model formulations. In contrast, the BaseGravity class tests different cost functions (power or exponential) and will also be used for different variations that occur across the user classes. Unit tests were also added for the CountModel class, which serves as a dispatcher between the gravity models and more specific linear models/estimation routines. Finally, unit tests were added for the GLM class, which is currently used for all estimation on all existing spatial interaction models within SpInt. This is expected to change when gradient-based optimization is used for estimation of zero-inflated models in an MLE framework instead of the IWLS estimation currently used in the GLM framework.

In addition to unit tests, code was also completed for handling overdispersion. First, several tests were added for testing the hypothesis of overdispersion within a Poisson model. Second, the QuasiPoisson family was added to the GLM framework. This is essentially the same as the Poisson family, except that a scale parameter, phi (also known as the dispersion parameter), is estimated using a chi-squared statistic approximation and used to correct the covariance, standard errors, etc., which would otherwise be too optimistic in the face of overdispersion. QuasiPoisson capabilities were then added to the gravity model user classes as a boolean flag that defaults to false, so one can easily adopt a quasi-MLE Poisson modeling approach if they use the dispersion tests and conclude there is significant overdispersion. It was decided to push the development of the zero-inflated Poisson model until the end of the summer of code schedule, which is also where gradient-based optimization now resides. This makes sense, since these go hand-in-hand.
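As a rough illustration of the dispersion correction, one common estimator divides the Pearson chi-squared statistic by the residual degrees of freedom and scales the Poisson standard errors by the square root of the result. Whether SpInt uses exactly this form is an assumption; `pearson_dispersion` and the numbers below are made up for the sketch.

```python
import math


def pearson_dispersion(y, mu, n_params):
    """Estimate phi as the Pearson chi-squared statistic over (n - p)."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)


# Made-up observed counts and fitted Poisson means from a 2-parameter model.
y = [2, 0, 7, 1, 10]
mu = [2.0, 1.0, 4.0, 2.0, 8.0]

phi = pearson_dispersion(y, mu, n_params=2)
poisson_se = 0.10                        # hypothetical Poisson standard error
quasi_se = poisson_se * math.sqrt(phi)   # standard error scaled by sqrt(phi)
```

A value of phi clearly above 1 signals overdispersion, in which case the scaled standard errors are wider than the plain Poisson ones.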

Next up on the agenda is an exploratory data analysis technique to test for spatial autocorrelation in vectors and a helper function to calibrate origin/destination-specific models so that the results can be mapped to explore possible non-stationarities.

## Recap

Last week I managed to merge the code I was working on for the last 4 weeks. It was meant to bring coala language independence support, as far as the bears are concerned. As I explained in the previous posts the developer will still need to write a bit of python because the functionality is implemented as a python decorator. A minimum of 3-4 lines of code is necessary to write the wrapper using the implemented decorator.

With all that being said, I am proud to announce that bears can now officially be written in languages other than Python. Is this a daring cover picture? Yes, yes it is. Have I tested coala with each language represented there? No, no I haven't. It is worth mentioning that there are a lot more features that can be added to the decorator and I will definitely try to add as many as possible.

## I don't want to write python

First of all, you should ask yourself why. Secondly, the next part of my project revolves around creating and packaging utilities. Let's explain better.

###### Creating

For users who don't feel comfortable with Python, there should be some kind of script that creates the wrapper automatically for them by asking some questions and then filling in a standard template. This is my goal for this week: building such a utility.

###### Packaging

Bears in coala come in a separate Python package called coala-bears, which includes all bears developed by the coala community so far. There is another GSoC project that aims to make bears decentralized so that users can download only the bears they need. The goal of my project is to develop a utility that will let you package your bear (presumably as a PyPI package) so that distribution becomes much easier.

## Conclusion

To sum it up, the language independence part is pretty much done. Now I am working on making it easier to grasp onto and actually use. I never actually know how to end these blogs and they get kind of awkward at the end so...

## June 21, 2016

### Pranjal Agrawal (MyHDL)

#### Midterm report and future plan

The mid-term evaluations are here. For this, I am required to submit a report of my work so far, and list the plan for the future weeks. So here goes.

### The work till now

In the month since GSoC coding period started, I have :

Week 1 - 2 : Created the tools simulator and assembler for the core, to better understand the design and architecture set.
Week 3 - 4 : Written tests for and coded the main modules of the processor

With respect to the timeline detailed in my GSoC proposal, I have met most of my deadlines. The core and tools have been coded. Tests have been written and are passing for the most part (some tests that are not yet passing are marked @pytest.mark.xfail, to be fixed next).

A PR has been opened from the main development branch, core, to the master branch of the repo, which can be seen at:

https://github.com/forumulator/pyLeros/pull/1

### Issues

Unfortunately, I had to take a couple of unplanned trips urgently, due to which my work, and more importantly the workflow, suffered in the first couple of weeks. But I worked extra during the next two weeks to make up for the slow start, and now I am almost at my midterm goals.

Work-wise, the one major thing I planned that has been shifted to post-midterm is setting up the hardware and testing the core on the Atlys and Basys FPGA boards, both of which I own. Unfortunately, this is not the simplest task. Subtle issues in the code manifest themselves in actual hardware execution that do not appear during simulation. For example, the delta delays that occur between simulation steps are not present in the hardware, which can lead to subtle differences in behaviour. Further, setting up I/O properly for the boards is a significant task. This makes building for hardware different from building for simulation.

### Plan for the coming weeks

In the next couple of weeks, I plan to have a completely working processor, including on the hardware. Further, the code will be refactored to take advantage of some of the advanced features of myHDL, including interfaces. That leaves me with enough time to devote to working on the SoC design and to a comparison of the VHDL and myHDL versions of the core.

Week 5: Clean up the code and add documentation wherever missing. Make sure that all the tests pass and the simulation of the processor is working
Week 6: Add I/O, reusing uart from rhea if possible. Refactor the code to use interfaces. Write small examples for the instruction set.
Week 7: Setup the Atlys and Basys boards. Make sure that the processor works on FPGAs, along with all the examples. Add I/O for the hardware. Write a script to build for the two boards.

In conclusion, I worked, had issues, completed almost all goals for the midterm evals, and hope to resolve the issues in the coming weeks. I'm really enjoying this experience.

#### Week 1-2 Summary : Assembler and Simulator

This post is a little late in coming, I know. As I mentioned in the earlier post, I was completely cut off from the internet for the first couple of weeks, and the communication part of the project has been a little weak.

Anyway, this is about the work done in the first 2 weeks. The first 2 weeks were dedicated to studying the design of the processor and creating the tools, including the simulator and the assembler. Creating the assembler and linker helped me get thoroughly familiar with the instruction set, while the simulator helps in understanding the data paths that need to be built in the actual processor. Plus, these tools are useful for quickly writing examples to test on the actual core.

What follows is a description of both the tools.

### Simulator:

An instruction set simulator, or simply a simulator, for those who don't know, is a piece of software that does what the processor would do, given the same input. We are 'simulating' the behaviour of hardware on a piece of software. The mechanism, of course, is completely different.

An ISS is usually built for a processor to model how it would behave in different situations. Compared to describing the entire datapath of the processor, a simulator is much simpler to code.

Consider, for example, an instruction that adds register r1. Since the design of Leros is accumulator based, one of the operands (the accumulator) is implicit, so this instruction adds the content of memory location r1 to the content of the accumulator and stores the result back in the accumulator. In its encoding, 0x08 is the opcode and 0x12 is the address of the register identified by r1. The actual processor would decode this with a hardware decoder.

On a simulator, this can easily be modelled by a decoder function containing if-else statements that do the same job, for example,

if (instr >> 8) == 0x08:  # opcode 0x08: add to the accumulator
    acc += val

The storage units, for example the accumulator, register file, or the data/instruction memory, are modelled by variables. And that's pretty much all there is to a simulator.
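Putting those pieces together, a single simulator step might look like the sketch below. Only the add opcode 0x08 and the "lowest opcode bit means immediate" rule come from this post; the LOAD and STORE opcode values, the 16-bit accumulator width and the `step` function itself are assumptions made for illustration.

```python
ADD = 0x08      # from the post: adds a register to the accumulator
LOAD = 0x20     # assumed opcode value, for illustration only
STORE = 0x30    # assumed opcode value, for illustration only
IMM_BIT = 0x01  # lowest opcode bit marks an immediate operand


def step(instr, acc, regs):
    """Execute one 16-bit instruction and return the new accumulator."""
    opcode = (instr >> 8) & 0xFF   # high 8 bits
    operand = instr & 0xFF         # low 8 bits: register address or constant
    base = opcode & 0xFE           # opcode with the immediate bit cleared
    value = operand if opcode & IMM_BIT else regs[operand]
    if base == ADD:
        acc = (acc + value) & 0xFFFF   # assumed 16-bit accumulator
    elif base == LOAD:
        acc = value
    elif base == STORE:
        regs[operand] = acc
    else:
        raise ValueError("unknown opcode 0x{:02x}".format(opcode))
    return acc


regs = [0] * 256
regs[0x12] = 5
acc = step(0x2103, 0, regs)    # load immediate 3
acc = step(0x0812, acc, regs)  # add the register at address 0x12
acc = step(0x3013, acc, regs)  # store the accumulator at address 0x13
print(acc, regs[0x13])         # 8 8
```

The register file and accumulator are plain Python values here, which is exactly the "storage modelled by variables" idea described above.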

### Assembler:

Before assembly code can be simulated, it needs to be assembled into binary for a particular instruction set, and that is the job of the assembler. The major difference between an assembler and a compiler is that most assembly code is just a human-readable version of the binary that the processor executes. The major jobs of an assembler are:

1. Handle assembler directives for data declaration, like a_: DA 2, which assigns an array of two bytes to a.
2. Convert identifiers to actual memory locations.
3. Convert instructions fully into binary.

When a program is split into multiple files, there are often external references, which are resolved by the linker. The linker's job is to take two assembled files, resolve the external references, and combine them into a single memory image for loading.
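The three assembler jobs above map naturally onto a two-pass design, sketched below. The source syntax (`name: DA n` for data, `MNEMONIC operand` for instructions), the `assemble` function and every opcode value except ADD = 0x08 are assumptions for illustration, not the project's actual assembler.

```python
OPCODES = {"ADD": 0x08, "LOAD": 0x20, "STORE": 0x30}  # only ADD is from the post


def assemble(source):
    """Return (instruction words, symbol table) for a tiny assembly dialect."""
    symbols, instructions, next_addr = {}, [], 0

    # Pass 1: handle data directives and record where each identifier lives.
    for raw in source.splitlines():
        text = raw.strip()
        if not text:
            continue
        if ":" in text:
            name, directive = text.split(":", 1)
            kind, size = directive.split()
            if kind != "DA":
                raise ValueError("unknown directive " + kind)
            symbols[name.strip()] = next_addr
            next_addr += int(size)
        else:
            instructions.append(text.split())

    # Pass 2: replace identifiers by addresses and emit 16-bit words.
    words = []
    for mnemonic, operand in instructions:
        addr = symbols[operand] if operand in symbols else int(operand, 0)
        words.append((OPCODES[mnemonic] << 8) | (addr & 0xFF))
    return words, symbols


program = """
a: DA 2
b: DA 1
LOAD a
ADD b
STORE a
"""
words, symbols = assemble(program)
print([hex(w) for w in words], symbols)
```

Collecting all identifiers before encoding anything is what allows an instruction to refer to a name that is declared later in the file.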

### Leros instruction set and tools

Since the Leros instruction set is of constant length (16 bits) and uses only one operand (the other, the accumulator, being implicit), the job was greatly simplified. The first pass, as described above, has to maintain a list of all the identifiers. There are no complex instructions like in the 8085 instruction set, or a complex encoding like the MIPS instruction set.

The high 8 bits represent the opcode, with the lowest opcode bit indicating whether the instruction is immediate. The next two bits describe the ALU operation, which can be arithmetic, or logical like
OR, AND, XOR, SHR
Data is read from and written to memory using load and store instructions.
The addressing can be either direct or immediate, with the first 256 (2^8) words of memory directly accessible through the address given in the lower 8 bits of the instruction, instr(7 downto 0). Higher addresses can be accessed using indirect loads and stores, in which an 8-bit offset is added to an address that is itself retrieved from memory using a load.

Finally, branching is done by using the
BRANCH, BRZ, BRNZ, BRP, BRN
instructions, which respectively mean the unconditional branch, branch if zero, branch if non zero, branch if positive, branch if negative.

I/O can be specified by the
IN, OUT
instructions along with the I/O address given as the lower 8-bits of the instruction.

That's the end of that. Stay tuned for more!

### TaylorOshan (PySAL)

#### Sparse Categorical Variables Bottleneck

This post is a note about the function I wrote to create a sparse matrix of categorical variables to add fixed effects to constrained variants of spatial interaction models. The current function is quite fast, but may be able to be improved upon. This link contains a gist to a notebook that explains the current function code, along with some ideas about how it might become faster.

## Converting patches to GitHub pull request

In the last blog post, I told you about this feature where the patch submitted by the developer should be converted to a GitHub pull request. This feature consists of three tasks:

### 1. Create a branch

First, we need to get the Python version that is affected by the bug. Then I use Python’s subprocess module to check out a new branch from that Python version. The contents of the patch are saved as text in the Postgres ‘file’ table. I create a file from this content and apply the patch to the new branch. Finally, I commit and push the new branch to GitHub.
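The branch-creation step could be sketched like this. The helper names, branch names and paths are hypothetical; the git subcommands themselves (`checkout -b`, `apply`, `add`, `commit`, `push`) are standard.

```python
import subprocess


def branch_commands(base_branch, new_branch, patch_path, message):
    """Return the git commands that turn a patch file into a pushed branch."""
    return [
        ["git", "checkout", "-b", new_branch, base_branch],
        ["git", "apply", patch_path],
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        ["git", "push", "origin", new_branch],
    ]


def create_branch(repo_dir, commands):
    """Run the commands inside the repository, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)


# Hypothetical values: the patch file is assumed to have been written
# from the text stored in the database beforehand.
commands = branch_commands("3.5", "issue-1234", "/tmp/issue-1234.patch",
                           "Apply patch from issue 1234")
```

Passing each command as a list avoids shell quoting problems, and `check=True` makes a failed `git apply` stop the sequence before anything is committed or pushed.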

### 2. Create a pull request

GitHub API provides an endpoint to create a pull request. I first set up an access token to authorize the requests to GitHub. Then, I just need to provide base and head branch and a title for the pull request as data.
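A request of that shape can be built with the standard library alone, as in the sketch below. The owner, repository, token and branch names are placeholders, and `build_pull_request` is a hypothetical helper; the endpoint path and the `title`, `head` and `base` fields follow GitHub's REST API for creating a pull request.

```python
import json
import urllib.request


def build_pull_request(owner, repo, token, title, head, base):
    """Build (but do not send) the POST request that creates a pull request."""
    url = "https://api.github.com/repos/{}/{}/pulls".format(owner, repo)
    payload = json.dumps({"title": title, "head": head, "base": base})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": "token " + token,
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


request = build_pull_request("example-owner", "example-repo", "PLACEHOLDER",
                             "Fix issue 1234", "issue-1234", "3.5")
# Sending it would be: response = urllib.request.urlopen(request)
```

The JSON in a successful response describes the new pull request, including fields such as `html_url` and `state`.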

### 3. List Pull Request on Issue’s page

The response from a successful API request contains all the information about the new GitHub PR. I save the URL and state of the PR in the database and link it with the issue.

That is it for this blog post, thank you for reading. Let me know if you have any questions.

## June 20, 2016

### Upendra Kumar (Core Python)

#### Page Screenshots in tkinter GUI

1. Install from PyPI :
2. Update Page
3. Install from Requirement Page
4. Install from Local Archive
5. Install from PyPI

These are some of the screenshots of the GUI application developed till now. Now, in a few days, I need to fix and give a final touch to these functionalities and write test modules for this GUI application.

Further work for next week involves making this application multithreaded, implementing advanced features and improving the GUI experience. It is essential for me to complete the basic GUI application, so that I can get user feedback when this application is released with Python 3.6 in the first week of July.

### Pranjal Agrawal (MyHDL)

#### GSoC development progress and the first blog

June 20, 2016

I got selected for GSoC 2016 for the Leros Microprocessor project under the myHDL organization, which is a sub-org of Python. The project consists of me porting and refactoring code for the Leros microprocessor from VHDL, in which it was originally developed by Martin Schoeberl (https://github.com/schoeberl), to Python and myHDL. This will then be used to build small SoC designs and test the performance on real hardware on the Atlys and Basys development boards. The other advantage of Leros is that it is optimized for minimal hardware usage on low-cost FPGA boards. The architecture, instruction set and pipelines have been constructed with this as the primary aim.

The original Github for the VHDL version is available at: https://github.com/schoeberl/leros , and the documentation with the details at: https://github.com/schoeberl/leros/blob/master/doc/leros.pdf

### The situation so far

The GSoC coding period began on 22 May 2016 and ends on 27 August 2016. Today's date is 20 June 2016. It has been almost a month since the start of the coding period, and due to unfortunate circumstances, the work, I'm sorry to say, was a little slow in the first couple of weeks. On top of that, I have not really blogged about my progress all that frequently, and thus the situation looked quite bleak a couple of weeks ago. However, week 3 saw a dramatic rise in the amount of work being done, and thanks to the extra week I reserved before the midterm, I have almost completely caught up to all my goals for the midterm. The blogging is still lagging a little, but I will be making up for that with posts describing my weekly progress for the first 4 weeks.

### Summary of weekly work

The summary of my weekly work is as follows:

Community bonding period: Wrote code samples and got familiar with the myHDL design process.
Week 1: Studied the design of Leros thoroughly and made the major design decisions for the Python version. Started with the instruction set simulator.

Week 2:  Finished with the instruction set simulator.

Week 3: Wrote a crude assembler and linker to complement the simulator which has a high level version of the processor. Started on the actual core with the tests.

Week 4: Integration and continued work on the actual core. The core is more or less where it should be according to my timeline.

As mentioned earlier, I will be following up with blog posts detailing the work of each of the weeks described above.

### Further work and midterms eval

TO DO: The major thing that I have not been able to do is setup the processor on actual hardware( the atlys and basys boards), as planned before the midterm. That has been shifted to the week after the midterms.

The work for this week, before the midterm evaluation is to clean up the code in the development branches and make sure the tests pass, then give a PR to the master which I will be showing for the midterm evaluations.

I will also be writing a midterm blog post detailing the complete work and report for the evaluation.

I am immensely enjoying my work so far.

### Yashu Seth (pgmpy)

#### The Continuous Factors

We are reaching the mid semester evaluations soon. Since I started my work a couple of weeks early I am almost through the half way mark of my project. The past few weeks have been great. I also got the first part of my project pushed to the main repository. Yeah!!

First of all, some clarifications related to the confusion surrounding ContinuousNode and ContinuousFactor. Which one does what? After long discussions in my community we have come to the conclusion that we will have two separate classes, ContinuousNode and ContinuousFactor. ContinuousNode is a subclass of scipy.stats.rv_continuous and inherits all its methods along with a special method, discretize. I have discussed the details of this in my post, Support for Continuous Nodes in pgmpy. On the other hand, the ContinuousFactor class will act as a base class for all the continuous factor representations of multivariate distributions in pgmpy. It will also have a discretize method that supports any discretization algorithm for multivariate distributions.

The past two weeks were almost entirely dedicated to the continuous factor classes, ContinuousFactor and JointGaussianDistribution. Although I had not planned for a separate base class in my timeline, it turned out to be a necessity. Despite its inclusion I have managed to stay on schedule and I am looking forward to the mentor reviews for this PR.

Now, I will discuss the basic features of the two classes in this post.

## ContinuousFactor

As already mentioned this class will behave as a base class for the continuous factor representations. We need to specify the variable names and a pdf function to initialize this class.

>>> import numpy as np
>>> from scipy.special import beta
# Two-variable Dirichlet distribution with alpha = (1, 2)
>>> def drichlet_pdf(x, y):
...     return (np.power(x, 1)*np.power(y, 2))/beta(x, y)
>>> from pgmpy.factors import ContinuousFactor
>>> drichlet_factor = ContinuousFactor(['x', 'y'], drichlet_pdf)
>>> drichlet_factor.scope()
['x', 'y']
>>> drichlet_factor.assignment(5,6)
226800.0


The class supports methods like marginalize and reduce, just like the discrete classes.

>>> import numpy as np
>>> from scipy.special import beta
>>> def custom_pdf(x, y, z):
...     return z*(np.power(x, 1)*np.power(y, 2))/beta(x, y)
>>> from pgmpy.factors import ContinuousFactor
>>> custom_factor = ContinuousFactor(['x', 'y', 'z'], custom_pdf)
>>> custom_factor.variables
['x', 'y', 'z']
>>> custom_factor.assignment(1, 2, 3)
24.0

>>> custom_factor.reduce([('y', 2)])
>>> custom_factor.variables
['x', 'z']
>>> custom_factor.assignment(1, 3)
24.0
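To see why the reduced factor returns the same value, it helps to think of reduce as fixing one argument of the pdf. The sketch below reproduces the numbers from the example with a plain closure; it is not pgmpy's implementation, `reduce_pdf` is a hypothetical helper, and `beta` is rebuilt from `math.gamma` so that the snippet needs only the standard library.

```python
import math


def beta(a, b):
    """Beta function expressed through the gamma function."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def custom_pdf(x, y, z):
    return z * (x ** 1) * (y ** 2) / beta(x, y)


def reduce_pdf(pdf, variables, fixed):
    """Return (remaining variables, pdf with the `fixed` values held constant)."""
    remaining = [v for v in variables if v not in fixed]

    def reduced(*args):
        values = dict(fixed)
        values.update(zip(remaining, args))
        return pdf(*[values[v] for v in variables])

    return remaining, reduced


variables, pdf = reduce_pdf(custom_pdf, ["x", "y", "z"], {"y": 2})
print(variables, pdf(1, 3))  # ['x', 'z'] 24.0
```

The reduced function takes only x and z, and evaluating it at (1, 3) is the same as evaluating the original pdf at (1, 2, 3).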


Just like the ContinuousNode class, the ContinuousFactor class also has a discretize method that takes a Discretizer class as input. It outputs a list of discrete probability masses, or a Factor or TabularCPD object, depending on the discretization method used. We do not have built-in discretization algorithms for multivariate distributions for now, but users can always define their own Discretizer class by subclassing the BaseDiscretizer class. I will soon write a post describing how this can be done.

## JointGaussianDistribution

In its most common representation, a multivariate Gaussian distribution over X1, ..., Xn is characterized by an n-dimensional mean vector μ and a symmetric n x n covariance matrix Σ. The JointGaussianDistribution class provides this representation. It is derived from ContinuousFactor. We need to specify the variable names, a mean vector and a covariance matrix for its initialization. It will automatically compute the pdf function given these parameters.

>>> import numpy as np
>>> from pgmpy.factors import JointGaussianDistribution as JGD
>>> dis = JGD(['x1', 'x2', 'x3'], np.array([[1], [-3], [4]]),
...             np.array([[4, 2, -2], [2, 5, -5], [-2, -5, 8]]))
>>> dis.variables
['x1', 'x2', 'x3']
>>> dis.mean
array([[ 1],
[-3],
[ 4]])
>>> dis.covariance
array([[4, 2, -2],
[2, 5, -5],
[-2, -5, 8]])
>>> dis.pdf([0,0,0])
0.0014805631279234139


It inherits methods like marginalize and reduce, but they have been re-implemented here since both of them form a special case here.

>>> import numpy as np
>>> from pgmpy.factors import JointGaussianDistribution as JGD
>>> dis = JGD(['x1', 'x2', 'x3'], np.array([[1], [-3], [4]]),
...             np.array([[4, 2, -2], [2, 5, -5], [-2, -5, 8]]))
>>> dis.variables
['x1', 'x2', 'x3']
>>> dis.mean
array([[ 1],
[-3],
[ 4]])
>>> dis.covariance
array([[ 4,  2, -2],
[ 2,  5, -5],
[-2, -5,  8]])

>>> dis.marginalize(['x3'])
>>> dis.variables
['x1', 'x2']
>>> dis.mean
array([[ 1],
[-3]])
>>> dis.covariance
array([[4, 2],
[2, 5]])

>>> dis = JGD(['x1', 'x2', 'x3'], np.array([[1], [-3], [4]]),
...             np.array([[4, 2, -2], [2, 5, -5], [-2, -5, 8]]))
>>> dis.variables
['x1', 'x2', 'x3']
>>> dis.mean
array([[ 1.],
[-3.],
[ 4.]])
>>> dis.covariance
array([[ 4.,  2., -2.],
[ 2.,  5., -5.],
[-2., -5.,  8.]])

>>> dis.reduce([('x1', 7)])
>>> dis.variables
['x2', 'x3']
>>> dis.mean
array([[ 0.],
[ 1.]])
>>> dis.covariance
array([[ 4., -4.],
[-4.,  7.]])
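
The numbers above can be checked with the standard closed-form results for Gaussians. The following is a minimal NumPy sketch, not pgmpy's code; the function names `marginalize_gaussian` and `reduce_gaussian` are hypothetical. Marginalizing just drops the rows and columns of the removed variables, while reducing (conditioning) uses the usual conditional mean and covariance formulas.

```python
import numpy as np


def marginalize_gaussian(mean, cov, keep):
    """Keep only the variables at the index positions in `keep`."""
    keep = np.asarray(keep)
    return mean[keep], cov[np.ix_(keep, keep)]


def reduce_gaussian(mean, cov, keep, given, values):
    """Condition on the variables at `given` taking the given `values`."""
    keep, given = np.asarray(keep), np.asarray(given)
    values = np.asarray(values, dtype=float).reshape(-1, 1)
    s_kk = cov[np.ix_(keep, keep)]
    s_kg = cov[np.ix_(keep, given)]
    s_gg = cov[np.ix_(given, given)]
    # gain = S_kg * S_gg^-1, computed with solve instead of an explicit inverse
    gain = np.linalg.solve(s_gg, s_kg.T).T
    new_mean = mean[keep] + gain @ (values - mean[given])
    new_cov = s_kk - gain @ s_kg.T
    return new_mean, new_cov


mean = np.array([[1.0], [-3.0], [4.0]])
cov = np.array([[4.0, 2.0, -2.0], [2.0, 5.0, -5.0], [-2.0, -5.0, 8.0]])

# Marginalize out x3 (index 2), keeping x1 and x2.
marg_mean, marg_cov = marginalize_gaussian(mean, cov, keep=[0, 1])

# Reduce on x1 = 7 (index 0), keeping x2 and x3.
red_mean, red_cov = reduce_gaussian(mean, cov, keep=[1, 2], given=[0], values=[7])
print(red_mean)
print(red_cov)
```

With these inputs the sketch reproduces the mean and covariance shown in the sessions above for both operations.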


This class has a method to_canonical_factor that converts a JointGaussianDistribution object into a CanonicalFactor object. The CanonicalFactor class forms the latter part of my project.

## The Future

My current PR dealing with JointGaussianDistribution and ContinuousFactor is almost in its last stages. I will soon begin my work on LinearGaussianCPD, followed by the CanonicalFactor class.

Hope you enjoyed this post and are looking forward to my future posts. Thanks again. I will be back soon. Have a nice time meanwhile. :-)

### Utkarsh (pgmpy)

#### Google Summer of Code week 3 and 4

In terms of progress, week 3 turned out to be dull. I had been having doubts about the representation since the start of the coding period, and somehow I always forgot to have them cleared up in the meetings. I wasted a lot of time reading theory to resolve this doubt of mine, and it remained unclear until the middle of week 4.

During week 3, I restructured my code into different parts. I removed BaseHamiltonianMC and created a HamiltonianMC class which returns samples using simple Hamiltonian Monte Carlo. This class is then inherited by HamiltonianMCda, which returns samples using Hamiltonian Monte Carlo with dual averaging. I wrote functions for some sections of overlapping code and renamed some parameters to specify their context better. Apart from that, I was experimenting a bit with the API and how samples should be returned.

As discussed in my last post, the parameterization of the model was still unclear to me, but after discussing it with my mentor and other members I found that we already had a representation finalized for continuous factors and joint distributions. I wasted a lot of time on this matter, and I laughed at my silly mistake. If I had cleared up my doubts at the start, I would already have finished this work. No frets now; it gave me a good learning experience. So I rewrote certain parts of my code to take this parameterization into account.

In the week 4 meeting, upon my suggestion, we decided to use numpy.recarray objects instead of pandas.DataFrame, as pandas.DataFrame was adding a dependency and was also slower than numpy.recarray objects. I also improved the documentation of my code during week 4, which earlier wasn’t consistent with my examples. I was allowing the user to pass any n-dimensional array instead of the 1d array mentioned in the documentation; I thought it would provide more flexibility, but it was actually making things ambiguous.

At the end of week 4 the code looks really different from what it was at the start. I wrote a _sample method which runs a single iteration of sampling using Hamiltonian Monte Carlo. The code now returns samples in two different types: if the user has an installation of pandas, it returns a pandas.DataFrame; otherwise it returns a numpy.recarray object. This is how the output looks now:

• If the user doesn’t have an installation of pandas in the environment
>>> from pgmpy.inference.continuous import HamiltonianMC as HMC, LeapFrog
>>> from pgmpy.models import JointGaussianDistribution as JGD
>>> import numpy as np
>>> mean = np.array([-3, 4])
>>> covariance = np.array([[3, 0.7], [0.7, 5]])
>>> model = JGD(['x', 'y'], mean, covariance)
>>> sampler = HMC(model=model, grad_log_pdf=None, simulate_dynamics=LeapFrog)
>>> samples = sampler.sample(initial_pos=np.array([1, 1]), num_samples = 10000,
...                          trajectory_length=2, stepsize=0.4)
>>> samples
array([(5e-324, 5e-324), (-2.2348941964735225, 4.43066330647519),
(-2.316454719617516, 7.430291195678112), ...,
(-1.1443831048872348, 3.573135519428842),
(-0.2325915892988598, 4.155961788010201),
(-0.7582492446601238, 3.5416519297297056)],
dtype=[('x', '<f8'), ('y', '<f8')])

>>> samples = np.array([samples[var_name] for var_name in model.variables])
>>> np.cov(samples)
array([[ 3.0352818 ,  0.71379304],
[ 0.71379304,  4.91776713]])
>>> sampler.accepted_proposals
9932.0
>>> sampler.acceptance_rate
0.9932

• If the user has a pandas installation
>>> from pgmpy.inference.continuous import HamiltonianMC as HMC, GradLogPDFGaussian, ModifiedEuler
>>> from pgmpy.models import JointGaussianDistribution as JGD
>>> import numpy as np
>>> mean = np.array([1, -1])
>>> covariance = np.array([[1, 0.2], [0.2, 1]])
>>> model = JGD(['x', 'y'], mean, covariance)
>>> sampler = HMC(model=model)
>>> samples = sampler.sample(np.array([1, 1]), num_samples = 5,
...                          trajectory_length=6, stepsize=0.25)
>>> samples
x              y
0  4.940656e-324  4.940656e-324
1   1.592133e+00   1.152911e+00
2   1.608700e+00   1.315349e+00
3   1.608700e+00   1.315349e+00
4   6.843856e-01   6.237043e-01


This is in contrast to the earlier output, which was just a list of numpy.array objects:

>>> from pgmpy.inference.continuous import HamiltonianMC as HMC, GradLogPDFGaussian, ModifiedEuler
>>> from pgmpy.models import JointGaussianDistribution as JGD
>>> import numpy as np
>>> mean = np.array([1, -1])
>>> covariance = np.array([[1, 0.2], [0.2, 1]])
>>> model = JGD(['x', 'y'], mean, covariance)
>>> samples = sampler.sample(np.array([1, 1]), num_samples = 5,
...                          trajectory_length=6, stepsize=0.25)
>>> samples
[array([[1],
[1]]),
array([[1],
[1]]),
array([[ 0.62270104],
[ 1.04366093]]),
array([[ 0.97897949],
[ 1.41753311]]),
array([[ 1.48938348],
[ 1.32887231]])]
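
The "pandas.DataFrame if pandas is installed, numpy.recarray otherwise" behaviour described above can be sketched as follows. This is only an illustration under my own assumptions, not the code in the PR: the helper name `format_samples` is hypothetical, and it assumes the samples are available as a 2-D array with one row per sample and one column per variable.

```python
import numpy as np

try:
    import pandas as pd
except ImportError:  # pandas is an optional dependency in this sketch
    pd = None


def format_samples(samples, variables):
    """Return samples as a DataFrame if pandas is available, else a recarray."""
    samples = np.asarray(samples, dtype=float)
    if pd is not None:
        return pd.DataFrame(samples, columns=list(variables))
    # One array per variable; the field names are the variable names.
    return np.rec.fromarrays(samples.T, names=list(variables))


result = format_samples([[1.0, 2.0], [3.0, 4.0]], ['x', 'y'])
print(result['x'])
```

Both return types support access by variable name (result['x']), which is what makes it possible to write code such as samples[var_name] in the first example without caring which type came back.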


Next week I’ll try to make the changes suggested by my mentor on my PR. I’ll also write more test cases to test each function individually instead of testing the overall implementation. After my PR gets merged I’ll try to write introductory blog posts on Markov Chain Monte Carlo and Hamiltonian Monte Carlo, and will work on No U-Turn Sampling.

## June 19, 2016

### Upendra Kumar (Core Python)

#### Design Patterns: How to write reusable and tidy software?

Referred from : Python Unlocked 2015

Hello everyone. In this blog post I am going to talk about an important concept in software development that I came across, called “Design Patterns”.

In software engineering, problems requiring similar solutions are very common. Therefore people generally tend to come up with a repeatable design specification to deal with such common problems. Studying design patterns helps one to have a basic idea of existing solutions to such problems.

A few advantages of design patterns are :

1. They speed up the development process by providing tested and robust paradigms for solving a problem.
2. They improve code readability for programmers.
3. Documenting the code also becomes easier, as a lot of solutions are based on common design patterns. Therefore, less effort is required to document the code.

Let’s come to the different design patterns used in software engineering. They are mostly classified as follows :

1. Observer pattern
2. Strategy pattern
3. Singleton pattern
4. Template pattern
7. Flyweight pattern
8. Command pattern
9. Abstract factory
10. Registry pattern
11. State pattern

Let’s have a brief overview of the above-mentioned design patterns :

1. Observer Pattern :  The key to the observer pattern is “spreading information to all listeners”. In other words, when we need to deal with a lot of listeners (which are always waiting for a particular event to be invoked), we need to keep track of them and inform them about the occurrence of an event (for example, a change of state of a variable). The code snippet below will help make the situation clearer :
class Notifier():
    """
    Provider of notifications to other objects
    """
    def __init__(self, name):
        self.name = name
        self._observers = set()

    def register_observer(self, observer):
        """
        Function to attach other observers to this notifier
        """
        # Remember the observer so that notify_observers can reach it
        self._observers.add(observer)
        print("observer {0} now listening on {1}".format(observer.name, self.name))

    def notify_observers(self, msg):
        """
        transmit event to all interested observers
        """
        for observer in self._observers:
            observer.notify(self, msg)

class Observer():

    def __init__(self, name):
        self.name = name

    def start_observing(self, subject):
        """
        register for getting event for a subject
        """
        subject.register_observer(self)

    def notify(self, subject, msg):
        """
        receive a message from the subject being observed
        """
        print("{0} got msg from {1} that {2}".format(self.name, subject.name, msg))

The above code snippet provides a very simple implementation of the Observer pattern. There is a notifier object which provides a method to register the listeners, and the listeners (the Observer objects) have a start_observing function to register with the notifier.

### mkatsimpris (MyHDL)

#### Midterm Evaluation

Midterm evaluation comes this week, so, I will write a little summary of the things that I have done the previous weeks. As I described in my proposal, the first 3 weeks I implemented the color space conversion module in myhdl. I wrote all the convertible test units (VHDL, MyHDL, and verilog) in order to prove the correct behavior of the module. Moreover, I familiarized with the travis-ci,

### Riddhish Bhalodia (dipy)

#### Speeding Up!

Working on speeding up the local PCA algorithm by turning it into Cython code.

## Currently..

Let me describe the time division of the current implementation. This is for data of size (176,176,21,22).

| Code Section | Time Taken |
| --- | --- |
| Noise Estimate MUBE/SIBE | 39.23 seconds |
| Local PCA | 140.98 seconds |

The localPCA function has two main bottlenecks,

1] Computing the local covariance matrix
2] Projecting the data on PCA basis
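
To show what these two steps involve for a single patch, here is a minimal NumPy sketch. It is only an illustration under my own assumptions, not the dipy implementation: the function name `local_pca_patch` is hypothetical, the data is assumed to be a 4D array (x, y, z, directions), and the patch is assumed to lie fully inside the volume.

```python
import numpy as np


def local_pca_patch(data, center, radius):
    """Covariance and PCA projection for one cubic patch of a 4D array."""
    i, j, k = center
    patch = data[i - radius:i + radius + 1,
                 j - radius:j + radius + 1,
                 k - radius:k + radius + 1, :]
    # Rows are voxels in the patch, columns are the last-axis measurements.
    X = patch.reshape(-1, data.shape[-1])
    Xc = X - X.mean(axis=0)

    # Step 1: local covariance matrix of the patch.
    cov = Xc.T @ Xc / X.shape[0]

    # Step 2: project the centered data onto the PCA basis (eigenvectors).
    eigvals, eigvecs = np.linalg.eigh(cov)
    projected = Xc @ eigvecs
    return cov, eigvals, projected


data = np.random.RandomState(0).rand(5, 5, 5, 4)
cov, eigvals, projected = local_pca_patch(data, center=(2, 2, 2), radius=1)
print(cov.shape, projected.shape)
```

Both steps are repeated for every voxel in the volume, which is why they dominate the running time and are natural candidates for Cython.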

## What is Cython?

It is a static compiler which makes writing C extensions for python really easy. These C extensions help us get a huge time improvement. Cython allows us to call C/C++ functions back and forth natively from the python code. Because of this, we are choosing it to improve the performance of the current localPCA implementation in python.

## New method for covariance computation!

Omar and Eleftherios have recently written a paper [1] which gives us a new improved method to compute the integrals in rectangles, without worrying much about memory considerations. Using this method for local covariance computation we expect the performance to improve significantly.

## New Code, New Results

We cythonized the localPCA code, also incorporating the new covariance computation [1]. The improvement:

LocalPCA time reduced from 140.98 seconds to 82.72 seconds, about a 40% improvement,

obviously without affecting the accuracy of the code!

## Update, nlmeans_block optimization

Improved the cython implementation of nlmeans_block, and the improvements are drastic. Tested on data of size (128,128,60), with patch radius = 1 and block radius = 1.

Previous time = 7.56 seconds
Time after optimization = 0.64 seconds

😀

## Next Up…

1. Little more optimization of the local PCA cython code
2. Documentations and tutorials for local PCA and adaptive denoising
3. Code formatting and validation via phantom
4. Optimization of Adaptive Denoising code
5. Cythonize the noise estimation process
6. Incorporate suggestions from mentors

## References

[1] On the computation of integrals over fixed-size rectangles of arbitrary dimension
Omar Ocegueda, Oscar Dalmau, Eleftherios Garyfallidis, Maxime Descoteaux, Mariano Rivera. Pattern Recognition Letters 2016

### mkatsimpris (MyHDL)

#### 2D-DCT Part 2

Now, I think it's time to present in detail the implementation of the 2d-dct.  In the previous post, I described that I used the row-column decomposition approach which uses two 1d-dcts and utilizes the following equation: Z = A * X * A^T. The first 1d-dct takes each input serially and outputs the 8 signal vector in parallel and implements the following equation: Y = A * X^T. X is the 8x8 block and A

## June 18, 2016

### Ranveer Aggarwal (dipy)

#### Finishing Up on the Text Box

In my previous blog post, I had begun working on a text box, and I had discussed a lot of issues. I used the text box myself and ended up hating it. So I started from scratch.

Again, it wasn’t easy and I had to go through multiple iterations.

### Method 0: Why reinvent the wheel?

Google. Stackoverflow? Someone? Help! Ooh a random forum.

Something similar happened.

### Method 1: Cmd+C, Cmd+V from the previous blog post

For making a dynamic multi line text box, I stored a number which describes the number of current lines. If this number is exceeded by

(length of the text) / (number of characters in a line)


I simply add a newline character to the text variable and increase the current number of lines.
For hiding overflows, I kept a variable for archive text and dynamically updated it.

This was complicated, the code couldn’t be clean with this method. Debugging corner cases would have been difficult. So, Cmd+A, Delete.

### Method 2: Hello Windows

While I stuck to the same OS, I had this idea of using a window.

An intermediate text looks like (the |’s are the window boundaries. The input field size is 1x5)

|abc|


|abcd|


When a character is removed

|abc|


When a character is further added

|abcde|


a|bcdef|


Yep, the window moves. Cool, right? So now the visibility of the text is controlled by this so called window. The caret plays nicely into this window. I simply use a caret position as 0 initially and keep adding or removing one as I add or remove characters. The 0 is always the left window position and the caret moves relative to it. Similarly change the caret position when left/right keys are pressed.

Say this is now the intermediate text

this is som|e text|


My caret is currently before e (the 0 for the window) and I press the left key. Here’s what happens.

this is so|me tex|t


Similarly, I shift the window right when required.
Next, I wrote down all the possible cases (corner cases at the boundaries, for example) and coded them all up.

I thought my work was done, until — bugs!

And I proceeded to rewrite everything once again.

### Method 3: The Final Method

The method didn’t change much from above, except that the caret position was now absolute - the 0 is the 0 of the text always and the window moves relatively.
I grouped similar code together into functions and ended up with a clean (and hopefully bug free) implementation of a text box. I’ll write an independent blog post on it soon.
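
The final method can be sketched roughly as follows. This is not the code from my PR, just a minimal illustration under simplifying assumptions: the class name `TextWindow` and its method names are hypothetical, the caret is an absolute index into the text, and the window's left edge is moved only when the caret would leave the visible range.

```python
class TextWindow:
    """Single-line text box that shows at most `width` characters."""

    def __init__(self, width):
        self.width = width
        self.text = ""
        self.caret = 0        # absolute position in the text
        self.window_left = 0  # absolute index of the first visible character

    def _scroll(self):
        # Move the window so that the caret stays inside it.
        if self.caret < self.window_left:
            self.window_left = self.caret
        elif self.caret > self.window_left + self.width:
            self.window_left = self.caret - self.width
        # If the text got shorter, pull the window back to avoid empty space.
        max_left = max(0, len(self.text) - self.width)
        self.window_left = min(self.window_left, max_left)

    def add_character(self, ch):
        self.text = self.text[:self.caret] + ch + self.text[self.caret:]
        self.caret += 1
        self._scroll()

    def remove_character(self):
        if self.caret == 0:
            return
        self.text = self.text[:self.caret - 1] + self.text[self.caret:]
        self.caret -= 1
        self._scroll()

    def move_left(self):
        self.caret = max(0, self.caret - 1)
        self._scroll()

    def move_right(self):
        self.caret = min(len(self.text), self.caret + 1)
        self._scroll()

    def visible(self):
        return self.text[self.window_left:self.window_left + self.width]


box = TextWindow(5)
for ch in "abcdef":
    box.add_character(ch)
print(box.visible())  # bcdef
```

Typing a sixth character into a field of width 5 moves the window right by one, as in the a|bcdef| example above, and removing it again brings the window back to |abcde|.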

### Results

Here’s what it looks like currently:

A better text box.

### Conclusion

Building a UI can be tougher than it sounds. When we use something like HTML and simply include a text box using a simple tag, we never think what went behind its making. The method I currently use seems efficient (every operation is O(1)), but I am sure there must be several implementations (maybe even better and cleaner) out there. I shall incorporate a better method if I find any.

### Next Steps

Next, I’ll get started on a slider element. I don’t know how to do it right now, so this week would probably be spent on exploring ways and means to do it.

### udiboy1209 (kivy)

#### Cython Needs A Lint Tool

My love for cython has kept increasing constantly for the past few weeks. It feels like by the end of my GSoC I might switch to cython entirely, if that’s possible. Cython lacks a few key development tools though - a code testing tool and a lint tool. I felt a great need for them while cleaning up unused variables in the latest PR. Manually doing stuff just doesn’t cut it for an automater :D.

This week was mainly focused on getting animated tiles implemented and working. Implementing animation was easy. I just had to store an extra pointer in the Tile object for the animation FrameList, and add a python api to access it. Animations can now be specified for a tile during map creation by specifying the animation name to load in the dict. The MapManager will take care of fetching the FrameList pointer and map_utils will take care of initialising the animation system.

## Debugging is inevitable

Have I mentioned before that no code works on the first go? Not even something simple like making a tile animated, which very frankly just involved copying and modifying code from the last example I had built for testing the animation. After breaking my head over debugging this for almost a day, I asked Kovak about it. It turned out to be a bug in the animation system! KivEnt’s renderer does a very neat trick to improve efficiency. It batches all the entities which have to be rendered from the same image to be processed together so that the same image doesn’t need to be loaded repeatedly. In AnimationSystem, we constantly keep updating the texture of an entity with time, hence it is important that we add those entities to the corresponding batch according to the new texture to be set. Sometimes entities don’t need to be “rebatched” because they already are in the final batch they will end up in after the texture change. There was a bug in the test condition for this which made it return True always. In the very basic sense this was what was happening: old_batch == old_batch :P, while it should have been old_batch == new_batch. Because of the always-true condition, the case where the animation was created from different images was never rendered correctly, because the entity was never rebatched on a texture change. A few code fixes and updates to the twinkling stars example so that animations use different image files led to this:

Isn’t this even more beautiful than the previous example :D. Because I had to remove a lot of code for this fix, there were a lot of unused variables lying around. That is why I so desperately wanted a good linting tool I could import into vim to automate the boring process of manually finding unused variables. Also, because cython code is mostly python-like, there needs to be a pep8-like standard for cython, along with a checker tool.

I then later found out the cython compiler can be configured to issue warnings for unused variables. Someone needs to make a good tool out of it, probably just a vim plugin for that matter. I might do it myself if I get enough motivation.

Anyway, that fix led to one more thing, animated tiles started working! Here, have a look:

## Next in line

Now that a basic map creating pipeline is in place for KivEnt, I can move on to actually trying to parse a Tiled Map file i.e. a TMX file and create a map from it. Parsing TMX should be easy as there are a lot of feature-rich TMX parsers existing for python. PyTMX is one such module I have in mind to use for this job. The rest of the task is just loading textures, models and animations from the tmx file and then assigning those values to individual tiles to create entities. I have been trying to create my own maps on tiled to get familiarised with it, and the first thing I did was create a Pokemon map because why not :P. Obviously, I won’t be able to use this in KivEnt examples because I don’t think the tileset is open source :P. So I’ll have to find an open source tileset and make another Tiled map for testing. Have a look at the Pokemon map:

## June 17, 2016

### jbm950 (PyDy)

#### GSoC Week 4

I started off this week writing the example code for a pendulum defined by x and y coordinates instead of an angle, theta. This was to show how the eombase.EOM class would handle a differential algebraic system. I also altered the simple pendulum example I made early on in the project to show how it would look as an eombase.EOM example. Of the examples I have made for the base class this one stands out as currently being the only one making use of the equations of motion generators (the other two have the equations of motion entered by hand). While addressing comments on the PR, it was mentioned that a more traditional documentation approach would allow greater visibility of the desired results of the code as the output could be explicitly shown. I agreed and moved all three examples to a single .rst document in PyDy and changed the code to represent the documentation format over example code format. At this point I made a list of all of the attributes and methods I thought the base class should represent and made sure they were represented in the example documentation. In addition I included multiple ways I thought error messages should be brought up for incorrect uses of the base class. This information is currently awaiting review.

In addition to the work on the base class I had to fix the kane benchmark I made early on in the project. At some point in the last few months the input order for kane.kanes_equations() was flipped and this caused the benchmark to not be able to run on previous versions of Sympy. My fix was to use a try/except clause to catch the error produced by the older versions of Sympy and alter the input order based on whether or not the error was produced. This code sits at PR #29 and it too is awaiting review/approval.

While I have been waiting for review of the base class PR, I have begun reading through Roy Featherstone’s book, “Rigid Body Dynamics Algorithms”. I have spent time going through Jason’s overhaul of KanesMethod as well and trying to provide as much useful feedback as I can.

Lastly I reviewed PR #11209 this week. The PR correctly alters code that tests for the presence of a key in a dictionary. It also altered the indentation of the code that immediately followed. I eventually came to the conclusion that this was a correct alteration because the variable eq_no is set in the dictionary key test and is used in the code that follows the test. I commented that the PR looks good to me and another member of SymPy merged it. This makes me slightly worried that too much value may be attached to my opinion as I still feel like a beginner.

### Future Directions

I will continue reading through Featherstone’s book until I receive feedback on the proposed base class API, at which time I will address the reviewer’s comments and hopefully begin work on the base class itself.

### PR’s and Issues

• (Open) Improved the explanation of the 5 equations in the Kane’s Method docs PR #11183
• (Open) Created a basis on which to discuss EOM class PR #353
• (Open) Minor fix in KanesMethod’s docstring PR #11186
• (Open) Fixed kane benchmark for different input order PR #29
• (Merged) Fix issue #8193 PR #11209

## June 16, 2016

### Ravi Jain (MyHDL)

#### GSoC Update: Stuck with conversion

In the past week I implemented a new feature for the management block – address table read/write, which will be used for address filtering purposes – and updated the test suite accordingly without much trouble.

Then I started looking into cosimulating the management subblock that had been implemented. It took me a while to understand the concept. After talking with my mentors, I chose to leave cosimulation for verification of converted code for top-level constructs and use simple convertible testbenches to verify the generated V* code for subblocks.

While pursuing that I faced a lot of issues and uncovered some issues with the conversion, about which I have posted in detail in discourse (Verilog VHDL).

So next, I should develop tests for the myhdl core to cover some of the issues mentioned in the discourse and make a pull request. After that I should get done with this module within this week, completing it with good documentation as well.

### Upendra Kumar (Core Python)

#### Menace of Global Variables

Hello everyone. This week I learnt a few things about writing beautiful and organized code. At the end of the first week I started to write code. Initially my code base was very small, so I went on writing more and more code and repeating the same design patterns.

Therefore, for these problems what I learnt was code refactoring and design patterns. The first one is somewhat obvious: we need to maintain the code’s modularity and keep related and similar pieces of code together in order to ensure easier debugging and extension with other features. Therefore, I refactored my code into a number of files, namely :

1. __init__.py
2. __main__.py
3. And other files like install_page.py, manage_installed_package.py, pip_extensions.py, utils.py and many others to be added depending on further features to be added.

In this process of code refactoring, I realised the menace of using global variables. One of the global variables was a Python dictionary, which was accessed by multiple functions. Therefore, when I grouped those functions in different files, my whole application failed. Then, I had to remove that global variable and make it a member variable of a class. In this case, correcting the code was easy, but if I had realised it a week or two later, I think using global variables would have wasted my whole day at that time.

Recently, I also went through a dozen design patterns in relation to my GUI application. Therefore, my next post will be on Design Patterns.
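
The change from a module-level dictionary to a member variable can be sketched like this. All names here (`PackageStore`, `install_page`, `manage_installed_page`) are hypothetical and only illustrate the idea; they are not the actual code of my application.

```python
class PackageStore:
    """Holds the state that used to live in a global dictionary."""

    def __init__(self):
        self._packages = {}

    def add(self, name, version):
        self._packages[name] = version

    def remove(self, name):
        self._packages.pop(name, None)

    def names(self):
        return sorted(self._packages)


# Functions that may live in different files receive the store explicitly
# instead of reaching for a global variable.
def install_page(store, name, version):
    store.add(name, version)


def manage_installed_page(store):
    return store.names()


store = PackageStore()
install_page(store, "requests", "2.10.0")
install_page(store, "numpy", "1.11.0")
print(manage_installed_page(store))  # ['numpy', 'requests']
```

Because every function gets the object it works on as an argument, moving a function to another file no longer breaks its access to the shared data.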

## June 15, 2016

### Sheikh Araf (coala)

#### [GSoC16] Week 3 update

Another week has passed by in this journey and so far it is going great. This week I’ve been busy adding important functionality to the Eclipse plug-in.

The most important feature in the works is the ability to select the bear to use for code analysis. I also had to make some design decisions, and one thing I’ve learned is that designing is more difficult than programming (of course, take this with a grain of salt).

Mid-term evaluations are coming and I expect to have a usable plug-in by then.

The major task now is to introduce some basic user-interface elements that make using the plug-in intuitive and easy. I have begun planning out the GUI with help from my mentor and will be adding some parts of it in the coming week.

The plug-in currently uses the common Problems View for the marker elements. This will change and the plug-in will now have a separate view for coala issues. Another new element will be the Annotations that will help better visualize the analysis results.

Cheers!

### liscju (Mercurial)

#### Coding Period - III Week

This week I planned to do redirection to a simple http server. The idea is to make a GET/POST request to the url with path /file/REVHASH, where REVHASH is the hash of the file to put. To test it I created a simple http server in python to handle such requests. You can take a look here:

https://bitbucket.org/liscju/hg-largefiles-gsoc/src/e8afcf299ea4bf5714859bf231d62aea0c663d3b/contrib/lfredirection-http-server.py?at=dev&fileviewer=file-view-default

So far I haven't managed to integrate this server into the mercurial test framework, but I have started working on it. The thing worth noticing is how easy it is to make an http server with python: just a couple of lines and that's it.

The second thing I did was to refactor the solution a bit to distinguish between different types of redirection. At this moment there are only two types, local file server and http server, but in the future there will be other options, so this was a good moment to make the distinction flexible. The current solution looks like this:

_redirectiondstpeer_provider = {
    'file': _localfsredirectiondstpeer,
    'http': _httpredirectiondstpeer,
}

def openredirectiondstpeer(location, hash):
    match = _scheme_re.match(location)
    if not match:               # regular filesystem path
        scheme = 'file'
    else:
        scheme = match.group(1)
    try:
        redirectiondstpeer = _redirectiondstpeer_provider[scheme]
    except KeyError:
        raise error.Abort(_('unsupported URL scheme for redirection peer %r')
                          % scheme)
    return redirectiondstpeer(location, hash)

The location keeps the protocol information at the beginning; this is extracted and compared with the supported types. If it's not supported, an error is raised.

To connect with the redirection server and send the http request I used httplib (https://docs.python.org/2/library/httplib.html), but I'm working on reusing the current code in mercurial to open/close the connection. Another thing I'm still working on is sending/getting files from the http server in chunks rather than all at once. This is especially important when we consider that this solution will get/send files of big size.

Apart from fixing http connection issues, this week I'm going to work on generating the redirection location on the fly. The idea is that the user specifies a script/hook that generates the redirection location and saves it in a file in the .hg directory, and the feature reads the location from this file.
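
The chunked transfer mentioned above can be sketched with plain file-like objects. This is only an illustration and not the mercurial code: the helper names `iter_chunks` and `copy_in_chunks` and the 64 KiB default chunk size are assumptions made for this example.

```python
import io


def iter_chunks(fileobj, chunk_size=64 * 1024):
    """Yield successive chunks read from a file-like object."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy src to dst chunk by chunk and return the number of bytes copied."""
    total = 0
    for chunk in iter_chunks(src, chunk_size):
        dst.write(chunk)
        total += len(chunk)
    return total


source = io.BytesIO(b"x" * 150000)
target = io.BytesIO()
copied = copy_in_chunks(source, target)
print(copied)  # 150000
```

Only one chunk is held in memory at a time, so the same loop works for a large file whether the destination is a local file or a connection to the redirection server.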

https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-June/085244.html

Another thing I was working on was adding the ability to pull the active bookmark with "hg pull -B .". The beginning of the patch series is here:

https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-June/085232.html

While working on the solution I found that some of the abort messages are not translated. I sent a patch for this; you can browse it here:

https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-June/085251.html

#### Python, indentation and white-space

At the time of the last update I was able to do basic indentation whenever a start and an end indent specifier were provided. This time around I’m working on the case where the end-indentation specifier is not provided, for example in languages like python:

def func(x):
    indent-level1
indent-level2

Here we can see that there is no specifier that an unindent is going to occur, so how do I figure out which lines are part of one block?

Well, the answer is very simple actually: I look for the start indent specifier, which in the case of python is the very famous ‘:’. After I find the start-of-indent specifier, the next step is to find an unindent. In the previous example the line containing ‘indent-level2’ unindents, and voila, we have our block, from the indent specifier to the first unindent. Easy, right? The answer to that is NO, nothing’s that easy.

## python doesn’t care about white-space:

well as we all know this isn’t true, python does care about white-space, but not as much as we thought. Python only cares about white-space to figure out indentation, anything else is pretty much useless to it,  for example:

def func(x):
a = [1, 2,
3, 4, 5]
if x in a:
print(x)

this is a pretty valid python code, which prints x if x is an integer between 1 to 5.  What is odd about this examples is, that as we know in python everything has to be indented right? and this breaks that rule! go ahead try this on your own, it works! So no even in python not everything has to be indented, a simpler example could have been:

def func(x):
    a = 1
# This comment is not indented
    print(a)

Does it matter if this comment is not indented? Absolutely not! This is perfectly valid Python code as well.

## The Problem:

So how is all this related to my algorithm? As you can see in the second example, the line ‘# This comment is not indented’ unindents, and my algorithm is searching for unindents. This breaks my algorithm, as it would think that the block starts at ‘def func(x):’ and ends at ‘# This comment is not indented’. In the first example it would likewise find that the line ‘3, 4, 5]’ unindents, which would again break the algorithm.

## The Solution:

The solution is quite simple in theory: just be aware of these cases. But that changes the algorithm completely. It goes from:

• check first unindent
• Report block as line containing specifier to the line which unindents

To

• check for a case of unindent
• check if this line is a comment
• check if this line is inside a multiline-comment
• check if this line is inside parentheses () or square-brackets []
• if any of these is true, repeat from the first step
• else report block.

So the final algorithm is my working solution, at least until we find some problem in it as well. You can follow all the code related to this algorithm in my PR.
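To make the steps concrete, here is a minimal sketch of the idea in plain Python. It is not the code from the PR: the helper names (`indent_of`, `bracket_delta`, `find_block_end`) are made up for illustration, the bracket counting naively ignores brackets and `#` characters inside string literals, and the multiline-comment check is left out to keep it short.

```python
def indent_of(line):
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def bracket_delta(line):
    """Opened minus closed brackets on a line (comments stripped naively)."""
    code = line.split("#", 1)[0]
    opened = sum(code.count(ch) for ch in "([{")
    closed = sum(code.count(ch) for ch in ")]}")
    return opened - closed


def find_block_end(lines, start):
    """Index of the first line after `start` that genuinely unindents."""
    base = indent_of(lines[start])
    depth = bracket_delta(lines[start])
    for index in range(start + 1, len(lines)):
        line = lines[index]
        stripped = line.strip()
        inside_brackets = depth > 0
        depth += bracket_delta(line)
        # Blank lines, comments and bracket continuations never end a block.
        if not stripped or stripped.startswith("#") or inside_brackets:
            continue
        if indent_of(line) <= base:
            return index
    return len(lines)


source = [
    "def func(x):",
    "    a = [1, 2,",
    "3, 4, 5]",
    "# This comment is not indented",
    "    if x in a:",
    "        print(x)",
    "print('done')",
]
block_end = find_block_end(source, 0)
```

With this input the block is reported as ending at the final `print('done')` line, whereas a plain "first unindent" search would have stopped at `3, 4, 5]`.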

## Next steps:

Next steps are absolute indentation, hanging indents, keyword indents and an all new bear in the form of the LineLengthBear.

All of this looks really exciting as I see my once-planned project come to life. I really hope all of this is useful someday and people actually use my code to solve their indentation problems.

### mkatsimpris (MyHDL)

#### 2D-DCT Part 1

The forward 2D-DCT is computed from the following equation: In the previous equation N is the block size, in our situation our block size is 8x8 so N=8, x(i, j) is the input sample and X(m,n) is the dct transformed matrix. A straightforward implementation of the previous equation requires $N^4$ multiplications. However, the DCT is a separable transform and it can be expressed in
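To make the separability argument concrete, here is a small pure-Python sketch of the row-column approach: a 1D-DCT is applied to every row and then to every column of the result. The orthonormal scaling used here is an assumption (the post's equation did not survive), and this is a software illustration rather than the MyHDL hardware design. For N=8 the direct form needs $N^4 = 4096$ multiplications, while the row-column form needs $2N \cdot N^2 = 1024$.

```python
import math


def dct_1d(vector):
    """Orthonormal 1D DCT-II of a list of numbers (scaling is an assumption)."""
    n = len(vector)
    out = []
    for m in range(n):
        scale = math.sqrt(1.0 / n) if m == 0 else math.sqrt(2.0 / n)
        total = sum(
            vector[i] * math.cos((2 * i + 1) * m * math.pi / (2 * n))
            for i in range(n)
        )
        out.append(scale * total)
    return out


def dct_2d(block):
    """2D DCT by row-column decomposition: rows first, then columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # cols[j][m] holds coefficient (m, j); transpose back to row-major order.
    return [list(row) for row in zip(*cols)]


flat_block = [[10] * 8 for _ in range(8)]
coefficients = dct_2d(flat_block)
```

A constant block is a handy sanity check: all of its energy should land in the DC coefficient at position (0, 0), and every other coefficient should be zero up to rounding.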

#### Diffusion Maps in Molecular Dynamics Analysis

It occurs to me that in my previous post I didn’t thoroughly explain the motivation for dimension reduction in general. When we have this data matrix $X$ with $n$ samples and each sample having $m$ features, this number $m$ can be very large. This data contains information that we want to extract; in the case of molecular dynamics simulations these are parameters describing how the dynamics are occurring. But this data can also be features that distinguish faces from others in the dataset, handwritten letters and numbers from other numbers, etc. As it is so eloquently put by Porte and Herbst at Arizona:

The breakdown of common similarity measures hampers the efficient organisation of data, which, in turn, has serious implications in the field of pattern recognition. For example, consider a collection of n × m images, each encoding a digit between 0 and 9. Furthermore, the images differ in their orientation, as shown in Fig.1. A human, faced with the task of organising such images, would likely first notice the different digits, and thereafter that they are oriented. The observer intuitively attaches greater value to parameters that encode larger variances in the observations, and therefore clusters the data in 10 groups, one for each digit

Here we’ve been introduced to the idea of pattern recognition and ‘clustering’, the latter will be discussed in some detail later. Continuing on…

On the other hand, a computer sees each image as a data point in $R^{nm}$, an nm-dimensional coordinate space. The data points are, by nature, organised according to their position in the coordinate space, where the most common similarity measure is the Euclidean distance.

The idea of the data being in an $nm$-dimensional space is introduced by the authors. The important part is that a computer has no knowledge of the patterns inside this data. The human brain is excellent at plenty of algorithms, but dimension reduction is one it is especially good at.

## Start talking about some chemistry John!

Fine! Back to the matter at hand: dimension reduction is an invaluable tool in modern computational chemistry because of the massive dimensionality of molecular dynamics simulations. To my knowledge, the biggest things being studied by MD currently are on the scale of the HIV-1 Capsid at 64 million atoms! Of course, these studies are being done on supercomputers, and for the most part studies are running on a much smaller number of atoms. For a thorough explanation of how MD simulations work, my Summer of Code colleague Fiona Naughton has an excellent and cat-filled post explaining MD and Umbrella Sampling. Why do we care about dynamics? As Dr. Cecilia Clementi mentions in her slides, ‘Crystallography gives structures, but function requires dynamics!’

A molecular dynamics simulation can be thought of as a diffusion process subject to drag (from the interactions of molecules) and random forces (Brownian motion). This means that the time evolution of the probability density of a molecule occupying a point in the configuration space, $P(x,t)$, satisfies the Fokker-Planck equation (this is some complex math from statistical mechanics). The important thing to note is that the Fokker-Planck equation has a discrete eigenspectrum, and that there usually exists a spectral gap reflecting the ‘intrinsic dimensionality’ of the system it is modeling. A diffusion process is by definition Markovian, in this case a continuous Markov process, which means the state at time t is solely dependent on the instantaneous step before it. This is easier when transferred over to the actual discrete problems in MD simulation: the state at time $t$ is only determined by the state at time $t-1$.

Diffusion maps in MD try to find a discrete approximation of the eigenspectrum of the Fokker-Planck equation by taking the following steps. First, we can think of changes in configuration as random walks on an infinite graph defined by the configuration space. From Porte again:

The connectivity between two data points, x and y, is defined as the probability of jumping from x to y in one step of the random walk, and is

$$\text{connectivity}(x,y) = p(x,y)$$

It is useful to express this connectivity in terms of a non-normalised likelihood function, k, known as the diffusion kernel:

$$\text{connectivity}(x,y) \propto k(x,y)$$

The kernel defines a local measure of similarity within a certain neighbourhood. Outside the neighbourhood, the function quickly goes to zero. For example, consider the popular Gaussian kernel:

$$k(x,y) = \exp(-\frac{|x-y|^{2}}{\epsilon})$$
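As a quick illustration of the kernel above, here is a minimal pure-Python sketch that builds a Gaussian kernel matrix for a handful of points and turns each row into one-step jump probabilities by dividing by the row sum. The function names and the three sample points are made up for the example, and dividing by the row sum is the usual way of making the proportionality above into a probability.

```python
import math


def gaussian_kernel(x, y, epsilon):
    """k(x, y) = exp(-|x - y|^2 / epsilon) for two coordinate tuples."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / epsilon)


def transition_matrix(points, epsilon):
    """Row-normalise the kernel so each row is a probability distribution."""
    kernel = [[gaussian_kernel(x, y, epsilon) for y in points] for x in points]
    return [[value / sum(row) for value in row] for row in kernel]


# Two neighbouring points and one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
P = transition_matrix(points, epsilon=1.0)
```

Jumping between the two neighbouring points is likely, while the probability of reaching the far-away point in one step is vanishingly small, which is exactly the "quickly goes to zero outside the neighbourhood" behaviour described in the quote.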

Coifman and Lafon provide a dense but extremely thorough explanation of diffusion maps in their seminal paper. This quote screams molecular dynamics:

Now, since the sampling of the data is generally not related to the geometry of the manifold, one would like to recover the manifold structure regardless of the distribution of the data points. In the case when the data points are sampled from the equilibrium distribution of a stochastic dynamical system, the situation is quite different as the density of the points is a quantity of interest, and therefore, cannot be gotten rid of. Indeed, for some dynamical physical systems, regions of high density correspond to minima of the free energy of the system. Consequently, the long-time behavior of the dynamics of this system results in a subtle interaction between the statistics (density) and the geometry of the data set.

In this paper, the authors acknowledge that oftentimes an isotropic kernel is not sufficient to understand the relationships in the data. They pose the question:

In particular, what is the influence of the density of the points and of the geometry of the possible underlying data set over the eigenfunctions and spectrum of the diffusion? To address this type of question, we now introduce a family of anisotropic diffusion processes that are all obtained as small-scale limits of a graph Laplacian jump process. This family is parameterized by a number $\alpha$ which can be tuned up to specify the amount of influence of the density in the infinitesimal transitions of the diffusion. The crucial point is that the graph Laplacian normalization is not applied on a graph with isotropic weights, but rather on a renormalized graph.

The derivation from here requires a few more steps:

• Form a new kernel from the anisotropic diffusion term: Let $$q_{\epsilon}(x) = \int k_{\epsilon}(x,y)q(y) \,dy$$
Where $$k_{\epsilon}^{(\alpha)}(x,y) = \frac{k_{\epsilon}(x,y)}{q_{\epsilon}^{\alpha}(x) q_{\epsilon}^{\alpha}(y)}$$
• Apply weighted graph Laplacian normalization: $$d_{\epsilon}^{(\alpha)}(x) = \int k_{\epsilon}^{(\alpha)}(x,y)q(y) \,dy$$
• Define the anisotropic transition kernel from this term: $$p_{\epsilon,\alpha}(x, y) = \frac{k_{\epsilon}^{(\alpha)}(x,y)}{d_{\epsilon}^{(\alpha)}(x)}$$
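On a finite set of samples these three steps reduce to a few row sums. Below is a minimal pure-Python sketch, assuming the integrals are approximated by sums over the samples and that the density estimate is raised to the power $\alpha$ in the renormalisation, as in Coifman and Lafon's construction. The function name and the toy kernel matrix are invented for the example.

```python
def anisotropic_transition(kernel, alpha):
    """Discrete version of the three steps for a symmetric kernel matrix."""
    n = len(kernel)
    # Step 1: density estimate q(x_i) as the row sum of the kernel,
    # then renormalise the kernel by q(x_i)^alpha * q(x_j)^alpha.
    q = [sum(row) for row in kernel]
    k_alpha = [
        [kernel[i][j] / (q[i] ** alpha * q[j] ** alpha) for j in range(n)]
        for i in range(n)
    ]
    # Step 2: graph Laplacian normalisation term d(x_i).
    d = [sum(row) for row in k_alpha]
    # Step 3: transition kernel p(x_i, x_j); every row sums to one.
    return [[k_alpha[i][j] / d[i] for j in range(n)] for i in range(n)]


toy_kernel = [
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.5],
    [0.1, 0.5, 1.0],
]
P_half = anisotropic_transition(toy_kernel, alpha=0.5)
```

Even though `toy_kernel` is symmetric, the resulting `P_half` is not: the row normalisation in the last step divides each row by a different `d`.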

This was all kinds of painful, but what this means for diffusion maps in MD is that a meaningful diffusion map will have an anisotropic (and therefore asymmetric) kernel. Coifman and Lafon go on to prove that for $\alpha$ equal to $\frac{1}{2}$ this anisotropic kernel is an effective approximation for the Fokker-Planck equation! This is a really cool result that is in no way obvious.

Originally, when I studied diffusion maps while applying for the Summer of Code I was completely unaware of Fokker-Planck and the anisotropic kernel. Of course, learning these topics takes time, but I was under the impression that diffusion kernels were symmetric across the board, which is just dead wrong. This of course changes how eigenvalue decomposition can be performed on a matrix and requires a routine like Singular Value Decomposition instead of Symmetric Eigenvalue Decomposition. If I had spent more time researching literature on my own I think I could have figured this out. With that being said, there are 100+ dense pages given in the citations below.

So where are we at? Quick recap about diffusion maps:

• Start taking random walks on a graph
• There are different costs for different walks based on likelihood of walk happening
• We established a kernel based on all these different walks
• For MD we manipulate this kernel so it is anisotropic!

Okay, so what do we have left to talk about…

• How is epsilon determined?
• What if we want to take a random walk of more than one jump?
• Hey John, we’re not actually taking random walks!
• What do we do once we get an eigenspectrum?
• What do we use this for?

## Epsilon Determination

Epsilon determination is kind of funky. First off, Dr. Andrew L. Ferguson notes that division by epsilon retains ‘only short pairwise distances on the order of $\sqrt{2\epsilon}$’. In addition, Dr. Clementi in her slides on diffusion maps notes that the neighborhood determined by epsilon should be locally flat. For a free-energy surface, this means that it is potentially advantageous to define a unique epsilon for every single element of a kernel based on the nearest neighbors to that point in terms of value. This can get painful. Most researchers seem to use a constant epsilon determined from some sort of guess-and-check method based on clustering.

For my GSoC pull request that is up right now, the plan is to have an API for an Epsilon class that must return a matrix whose $ij$-th entry is $\frac{d(i,j)^2}{\epsilon_{ij}}$. From here, given weights for the anisotropy of the kernel, we can form the anisotropic kernel to be eigenvalue-decomposed. Any researcher who cares to do some complex choice of epsilon based on nearest-neighbors is probably a good enough hacker to handle implementation of this API in a quick script.

## Length $t$ Walks

Nowhere in the construction of our diffusion kernel are we actually taking random walks. What we are doing is taking all possible walks, where two vertices on the graph are close if $d(x,y)$ is small and far apart if $d(x,y)$ is large. This accounts for all possible one-step walks across our data. In order to get a good idea of transitions that occur over larger timesteps, we take multiple steps. To construct this set of walks, we must multiply our transition matrix $P$ by itself $t$ times, where $t$ is the number of steps in the walk across the graph. From Porte again (stealing is the best form of flattery, no?):

With increased values of t (i.e. as the diffusion process “runs forward”), the probability of following a path along the underlying geometric structure of the data set increases. This happens because, along the geometric structure, points are dense and therefore highly connected (the connectivity is a function of the Euclidean distance between two points, as discussed in Section 2). Pathways form along short, high probability jumps. On the other hand, paths that do not follow this structure include one or more long, low probability jumps, which lowers the path’s overall probability.
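Here is a tiny pure-Python sketch of what "running the diffusion forward" means for a transition matrix. The three-state matrix is invented for the example: state 0 cannot reach state 2 in one jump, but it can after a few steps through state 1.

```python
def mat_mul(a, b):
    """Plain matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def walk(transition, steps):
    """Transition probabilities after `steps` jumps (steps >= 1)."""
    result = transition
    for _ in range(steps - 1):
        result = mat_mul(result, transition)
    return result


# A made-up chain: 0 <-> 1 <-> 2, with no direct jump between 0 and 2.
P = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.1, 0.9],
]
P3 = walk(P, 3)
```

Each row of `P3` is still a probability distribution, and the entry for going from state 0 to state 2 is now positive because the three-step walk can pass through state 1.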

I said something blatantly wrong in my last post. I’m a fool, but still, things do get a little complicated when analyzing time series data with diffusion maps. We want both to investigate walks at different timescales from the diffusion maps and to be able to project our snapshot from a trajectory at a timestep onto the corresponding set of eigenvectors describing the lower dimensional order-parameters.

From Ferguson:

The diffusion map embedding is defined as the mapping of the ith snapshot into the ith components of each of the top k non-trivial eigenvectors of the $M$ matrix.

Here the $M$ matrix is our anisotropic kernel. So from a spectral decomposition of our kernel (remember that it is generated by a particular timescale walk), we get a set of eigenvectors onto which we project our snapshot (what we have been calling both a trajectory frame and a sample, sorry) that exists at a particular timestep in our MD trajectory. This can create some overly similar notation, so I’m just going to avoid it and hope that it makes more sense without notation.

## Using Diffusion Maps in MDAnalysis

Alright, this has been a lot to digest, but hopefully you are still with me. Why are we doing this? There are plenty of reasons, and I am going to list a few:

• Dr. Ferguson used diffusion maps to investigate the assembly of polymer subunits in this paper
• Also for the order parameters in alkane chain dynamics
• Also for umbrella sampling
• Dr. Clementi used this for protein folding order parameters here
• Also, Dr. Clementi used this for polymerization reactions here
• Dr. Clementi also created a variant that treats epsilon determination very carefully with LSD
• There are more listed in my works cited

The first item in that list is especially cool; instead of using a standard RMSD metric, they abstracted a cluster-matching problem into a graph matching problem, using an algorithm called Isorank to find an approximate ‘greedy’ solution.

There are some solid ‘greedy’ vs. ‘dynamic’ explanations here. The example I remember getting is to imagine you are a programmer for a GPS direction provider. We can consider two ways of deciding an optimal route, one with a greedy algorithm and the other with a dynamic algorithm. At each gridpoint on a map, a greedy algorithm will take the fastest route at that point. A dynamic algorithm will branch ahead, look into the future, and possibly avoid short-term gain for long term drive-time savings. The greedy algorithm might have a better best-case performance, but a much poorer worst-case performance.
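To see the difference on a toy version of the GPS example, here is a small Python sketch on a grid where each cell holds a travel cost and you may only move right or down. The grid values and function names are made up. The greedy route always steps onto the cheaper neighbouring cell, while the dynamic programming route works out the cheapest total cost to every cell.

```python
def greedy_cost(grid):
    """Total cost when always stepping onto the cheaper neighbour."""
    rows, cols = len(grid), len(grid[0])
    r = c = 0
    total = grid[0][0]
    while (r, c) != (rows - 1, cols - 1):
        options = []
        if c + 1 < cols:
            options.append((grid[r][c + 1], r, c + 1))
        if r + 1 < rows:
            options.append((grid[r + 1][c], r + 1, c))
        cost, r, c = min(options)
        total += cost
    return total


def dynamic_cost(grid):
    """Cheapest total cost, built up cell by cell from the start."""
    rows, cols = len(grid), len(grid[0])
    best = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if r == 0 and c == 0:
                best[r][c] = grid[0][0]
                continue
            candidates = []
            if r > 0:
                candidates.append(best[r - 1][c])
            if c > 0:
                candidates.append(best[r][c - 1])
            best[r][c] = grid[r][c] + min(candidates)
    return best[-1][-1]


grid = [
    [1, 1, 8],
    [2, 9, 9],
    [1, 1, 1],
]
greedy_total = greedy_cost(grid)
dynamic_total = dynamic_cost(grid)
```

On this grid the greedy route is lured along the cheap first step on the top row and ends up with a total of 20, while the dynamic programming route accepts the slightly pricier first step down and finishes with a total of 6.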

In any case, we want to allow for the execution of a diffusion map algorithm where a user can provide their own metric, tune the choice of epsilon, the choice of timescale, and project the original trajectory timesteps onto the new dominant eigenvector, eigenvalue pairs.

## Let’s talk API/ Actual Coding (HOORAY!)

DistMatrix

• Does frame by frame analysis on the trajectory, implements the _prepare and _single_frame methods of the BaseAnalysis class
• User selects a subset of atoms in the trajectory here
• This is where user provides their own metric, cutoff for when metric is equal, weights for weighted metric calculation, and a start, stop, step for frame analysis

Epsilon

• We will have some premade classes inheriting from epsilon, but all the API will require is to return the manipulated DistMatrix, where each term has now been divided by some scale parameter epsilon
• These operations should be done in place on the original DistMatrix, under no circumstances should we have two possibly large matrices sitting in memory

DiffusionMap

• Accepts DistMatrix (initialized), Epsilon (uninitialized) with default a premade EpsilonConstant class, timescale t with default = 1, weights of anisotropic kernel as parameters
• Performs the BaseAnalysis conclude method, wherein it exponentiates the negative of each term given by Epsilon.scaledMatrix, performs the procedure for the creation of the anisotropic kernel above, and raises the anisotropic kernel to the matrix power given by the timescale t.
• Finally, eigenvalue decomposes the anisotropic kernel and holds onto the eigenvectors and eigenvalues as attributes.
• Should contain a method DiffusionMap.embedding(timestep), that projects a timestep to its diffusion embedding at the given timescale t.
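To give a feel for the Epsilon part of this plan, here is a hypothetical sketch in plain Python. None of it is the actual MDAnalysis code: the class below only borrows the EpsilonConstant name from the plan, the method and function names are invented, and nested lists stand in for the real distance matrix. The point it illustrates is the "in place" requirement, where the same matrix object is first scaled and then exponentiated without making a copy.

```python
import math


class EpsilonConstant:
    """Hypothetical stand-in: one constant scale parameter for every entry."""

    def __init__(self, epsilon):
        self.epsilon = epsilon

    def scale_in_place(self, dist_matrix):
        """Overwrite each distance d with d**2 / epsilon, without copying."""
        for row in dist_matrix:
            for j, distance in enumerate(row):
                row[j] = distance * distance / self.epsilon
        return dist_matrix


def kernel_in_place(scaled_matrix):
    """Overwrite each scaled entry s with exp(-s), again without copying."""
    for row in scaled_matrix:
        for j, value in enumerate(row):
            row[j] = math.exp(-value)
    return scaled_matrix


distances = [[0.0, 1.0], [1.0, 0.0]]
scaled = EpsilonConstant(2.0).scale_in_place(distances)
kernel = kernel_in_place(scaled)
```

A class that picks a different epsilon for every pair would only need to replace `scale_in_place`, which is the kind of quick script the post has in mind for researchers with their own epsilon scheme.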

Works Cited:

## June 14, 2016

#### GSoC '16: Weeks 4 updates: Tests

Passing tests and 100% coverage are what help us sleep comfortably at night - there is a sense of comfort knowing that every situation gives the expected results. And that is why the past week was spent writing tests!

I'll be honest here - I like writing the modules more than the tests. And that is probably why I put off writing tests for so long - instead of finishing the tests right after each module, I got too excited and kept jumping to the next feature. Oh well.

Anyway, at coala, we use codecov.io for coverage. Appveyor provides us with Windows build tests and CircleCI is for Linux build tests. These are totally awesome tools and you should definitely use them for your projects - you do write tests, don't you?

There was one challenge I faced - since my project is heavily user-interactive, every test is slightly more complex - we need to suppress the output (sys.stdout) and simulate the input (sys.stdin). Fortunately there is already a powerful library in coalib that does just this: coalib.misc.ContextManagers:

• With suppress_stdout, all writes to sys.stdout are redirected to /dev/null.

• With retrieve_stdout, an alternative pipe is created to which all output is redirected (the original stdout will be untouched).

• My favorite is simulate_console_inputs. Just like the above two, this is a context manager - so its effect is easily limited. This takes in a variable number of inputs. Want the answers to three consecutive questions to be "Yes", "**", and "42"? Simply use with simulate_console_inputs("Yes", "**", "42"): and write your logic inside. That is as simple as it can get!

coalib.misc.ContextManagers has a lot of other awesome stuff too. You should definitely check them out.
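If you don't have coala installed, the same idea can be sketched with the standard library alone. The snippet below is not coalib's implementation: `simulate_inputs` is a made-up helper that temporarily swaps out the built-in `input`, and `contextlib.redirect_stdout` plays the role of capturing the output.

```python
import builtins
import io
from contextlib import contextmanager, redirect_stdout


@contextmanager
def simulate_inputs(*answers):
    """Temporarily make input() return the given answers, in order."""
    pending = iter(answers)
    original_input = builtins.input
    builtins.input = lambda prompt="": next(pending)
    try:
        yield
    finally:
        # Always restore the real input(), even if the body raises.
        builtins.input = original_input


def ask_three_questions():
    """Toy interactive function standing in for the code under test."""
    answers = [input("Continue? "), input("Ignore glob? "), input("Answer? ")]
    print("Got", len(answers), "answers")
    return answers


with simulate_inputs("Yes", "**", "42"), redirect_stdout(io.StringIO()) as captured:
    collected = ask_three_questions()

printed = captured.getvalue()
```

Because both helpers are context managers, the real `input` and `sys.stdout` are back in place as soon as the `with` block ends.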

I still have to write an integration suite, but I've put that off for later - once I finish the core of my project - settings guessing. And that will be the next two weeks.

So until then,

#### GSoC '16: Weeks 2-3 updates

Lots of activity the past few days!

I finally managed to get caching merged! And the performance improvements have been terrific. On my machine, which has an HDD, running coala on coala took around 10 seconds to complete all sections. But with caching enabled, it takes just 4.5 seconds - a 2x speed improvement!

To enable caching, just run coala with --changed-files. Of course, this is currently only on the dev version, you'll have to wait till the 0.7 release to get it on the stable version. And from 0.8, we hope to get caching enabled by default. Really exciting stuff!

Another thing I worked on was automatic RST generation for bears. You can find it in the new bear-docs repo. With this, you can easily navigate the list of bears, categorized by each language. And for each bear, there is a description, the languages it supports, and a table with the settings taken by the bear. For example, you can take a look at the PyImportSortBear. Of course, this is a temporary solution till a more complete and comprehensive website is in place.

And to the main topic: coala-quickstart. Big steps forward:

• Remember how I asked the user for a glob expression to ignore files in a project? It's much simpler now. After Adrian's feature request #13, I've implemented automatic ignore glob generation from the user's .gitignore file. But there was a slight issue: git uses a glob syntax that is different from what we use at coala. So I needed to translate between the two before directly implementing it. After going through the gitignore documentation and testing some cases, I arrived at a solution that works pretty well.

• Another big development is in the generation of glob expressions. A project may have several languages in it; for example, Python for the code, RST for the docs, .yml for the configuration. So we need to categorize each language into its own section and add the corresponding bears to that section. But the problem lies in file globs - we only know the list of files to lint, not a concise glob expression that covers exactly that list. I wrote a neat routine that does exactly that.

• .coafile generation - .coafile is now generated automatically. To give you an insight into how the coafile looks, take a look at the generated .coafile when I ran coala-quickstart on itself: gist.

While I still have to write tests and get code reviews, I'm fairly happy with my progress. And for the past 3-4 days, I've been experimenting with the implementation of the core of my project - settings guessing. It's still in the works, but I hope to have something by this weekend.

Till then,

### tushar-rishav (coala)

#### TDD and BDD

The past few days have been hectic because of the fairly complicated visa application process for Spain (I need it for EuroPython’16). Meanwhile, I have been reading about TDD (Test Driven Development) and BDD (Behaviour Driven Development) to write tests for the controllers using Mocha - a testing framework, Chai - a BDD assertion library, and Karma - a test runner. A test runner can run tests based on different test frameworks - in this case, Mocha. Their documentation might seem bloated at first sight. I felt the same when I started reading it. But it’s actually simple once you’ve understood the concept behind TDD. In this post I shall try to explain why we should follow the TDD approach and also write some basic tests using Mocha and Chai together. Let’s drop Karma for a while.

### Why TDD?

Originally, TDD meant writing tests before the actual implementation. (You may write tests afterwards too, though that is not a good approach.) Now the question comes: how can one write tests for something that hasn’t been implemented yet? Makes sense, doesn’t it? Well, usually when you test modules/functions, you already know your expectations. For instance, if I am going to write a function that checks if a given real number is a power of 2, I already know the outcome. In this case it’s a boolean value - True for a power of 2 and False otherwise. So you may very well write tests using this information.
There are many advantages to adopting TDD. The first is avoiding regression bugs. A regression bug is a bug that had been fixed in the past and then occurred again. For instance, we change a seemingly unrelated piece of code and therefore do not check some old problem, because we do not expect that problem to occur again. Once we have an automated test for this bug, it will not slip back in unnoticed, because we can easily run all tests instead of manually trying only the parts that are obviously related to the change we made.
Another reason is refactoring. The code architecture may need to change as a project’s requirements change. Tests prove whether the code still works, even after a major refactoring.
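The power-of-2 example can be sketched in a few lines. The version below uses Python's unittest rather than the Mocha and Chai setup this post goes on to describe, and the function name and chosen test values are just for illustration. In the TDD spirit, read the test class first: it states the expectations, and the function above it is one possible implementation that satisfies them.

```python
import math
import unittest


def is_power_of_two(x):
    """True if x equals 2**k for some integer k (negative k included)."""
    if x <= 0:
        return False
    # frexp splits x into mantissa * 2**exponent with 0.5 <= mantissa < 1,
    # so exact powers of two always have a mantissa of exactly 0.5.
    mantissa, _ = math.frexp(x)
    return mantissa == 0.5


class PowerOfTwoTest(unittest.TestCase):
    def test_powers_of_two(self):
        for value in (1, 2, 8, 1024, 0.5, 0.25):
            self.assertTrue(is_power_of_two(value))

    def test_everything_else(self):
        for value in (0, -4, 3, 6, 0.3):
            self.assertFalse(is_power_of_two(value))


# Run the suite without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PowerOfTwoTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

If a later refactoring breaks one of these expectations, rerunning the suite reports it immediately, which is the regression safety net described above.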

### Get started

#### Installation

You may install Karma using npm, the Node package manager. We also need the Karma plugins for Mocha and Chai, and it’d be nice to use PhantomJS for headless testing. Let’s create a package.json file and add the dependencies to it. It might look like:

Now, run npm install to fetch and install the dependencies.
Also, to make it easier to run karma from the command line you can install karma-cli globally, which will run the local version without having to specify the path to karma (node node_modules/karma/bin/karma):

Karma needs a configuration file. Create one by running karma init and answering the simple questions. Make sure you specify Mocha as your testing framework. Also, specify the locations of your source and test files accordingly. In my case it’s ./app/**/*.js for source and ./tests/**/*Spec.js for tests.
Let’s create two empty files app/powerOfTwo.js and tests/powerOfTwoSpec.js. It’d look like

Once you’re done, you have karma-conf.js. To get Chai included in the test pipeline, edit karma-conf.js and add it to the frameworks setting. Also to be able to use PhantomJS mention it in browsers.

Running karma start will execute the default karma-conf.js. You can have multiple configuration files which can be run by specifying the name of the configuration file. karma start <conf-file-name>.

Enough installation. Let’s get down to writing tests.

#### Writing tests

By default, you can use Mocha’s assertion module (which is in fact Node’s regular assertion module) to run your tests. However, it can be quite limiting. This is where assertion libraries like Chai enter the frame. Writing a test is like constructing a perfect sentence in English. Don’t believe me? See for yourself. We describe an umbrella of tests, and state some expected outputs for various tests under that umbrella.

Write this in tests/powerOfTwoSpec.js.

Note that we haven’t created the powerOfTwo function yet. But looking at the tests, we can say how our function is expected to behave. That’s TDD and BDD for you in the simplest form.

Now let’s write our powerOfTwo function in powerOfTwo.js

Finally, we may run the tests by running npm test in the project root directory.

I hope that was simple and introduced you to the basics of TDD/BDD. Go ahead and try it yourself.
Explore more at

Cheers! :)

## June 13, 2016

### mkatsimpris (MyHDL)

#### 2D-DCT Implementation

The third week passed, and the color space conversion module with its unit tests was merged into the original repository. These days, I figured out how to implement the 2D-DCT in a simple and straightforward way. The implementation follows the row-column decomposition method. First I created and tested the 1D-DCT module and then I created and tested the final 2D-DCT module. In the following

#### Dimension Reduction, a review of a review

Hello! This is my first post moving over to a new site built by wintersmith. Originally I was going to use jekyll pages, but there was an issue with the latest Ruby version not being available for Linux (maybe macs are better…). I spent way too much time figuring out how to install a markdown plugin that allowed for the inclusion of LaTeX. I did this all without realizing I could simply include:

<script type="text/javascript" async
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>


below my article title and LaTeX would easily render. Now that this roadblock is cleared, I have no excuses preventing me from writing a post about my work.

This post is meant to discuss various dimension reduction methods as a preface to a more in-depth post about diffusion maps performed on molecular dynamics simulation trajectories. It assumes college-level math skills, but will try to briefly explain high-level concepts from Math and Stats. Towards the end I will provide a segue into the next post.

Dimension reduction is performed on a data matrix $X$ consisting of $n$ ‘samples’ wherein each sample has a set of $m$ features associated with it. The data in the matrix is considered to have dimension $m$, but oftentimes the actual ‘intrinsic dimensionality’ is much lower. As Laurens van der Maaten defines it, ‘intrinsic dimensionality’ is ‘the minimum number of parameters needed to account for the observed properties of the data’.

(So far, the most helpful explanation of this fact was presented in a paper on diffusion maps by Porte et al. In the paper, a dataset of m-by-n pixel pictures of a simple image randomly rotated originally has dimension $mn$ but after dimension reduction, the dataset can be organized two dimensionally based on angle of rotation.)

At the most abstract level, dimension reduction methods usually are posed as an optimization problem that often requires the solution to an eigenvalue problem. What is an optimization problem, you ask? That wikipedia article should help some. The optimization being done in dimension reduction is finding some linear or non-linear relation $M$ that minimizes (or maximizes) a cost function $\phi (x)$ on some manipulation of the data matrix, call it $X_{manipulated}$. Examples of various functions will be given in detail later.

In most cases this can be turned into an eigenproblem posed as: $$X_{manipulated} M = \lambda M$$

Solving this equation using some algorithm like Singular Value Decomposition or Symmetric Eigenvalue Decomposition will provide a set of m linearly-independent eigenvectors that act as a basis for a lower dimensional space. (Linear independence means no vector in the set can be expressed as a linear combination of the others; a basis set has the property that any vector in a space can be written as a linear combination of vectors in the set.) The set of eigenvalues given by an eigenvalue decomposition is the ‘spectrum’ of the matrix $X_{manipulated}$. This spectrum will have what’s referred to as a ‘spectral gap’ after a certain number of eigenvalues, where the magnitude of the eigenvalues falls dramatically compared to the previous ones. The number of significant eigenvalues before this gap reflects the intrinsic dimension of a space.
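As a rough illustration of reading an intrinsic dimension off a spectrum, here is a tiny Python sketch. Picking the largest drop between consecutive eigenvalues is just one simple heuristic chosen for this example, and the eigenvalues themselves are made up.

```python
def intrinsic_dimension(eigenvalues):
    """Count the eigenvalues that sit before the largest drop in the spectrum."""
    ordered = sorted(eigenvalues, reverse=True)
    gaps = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    return gaps.index(max(gaps)) + 1


# Two large eigenvalues followed by three small ones.
spectrum = [0.2, 5.0, 0.1, 4.2, 0.3]
estimated_dimension = intrinsic_dimension(spectrum)
```

Here the big drop happens between 4.2 and 0.3, so two eigenvalues count as significant and the estimate is a two-dimensional space.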

In some cases, the manipulation is somewhat more complicated, and creates what is called a generalized eigenvalue problem. In these situations the problem posed is $$X_a M = \lambda X_b M$$ Where $X_a$ and $X_b$ are distinct but both still generated from some manipulation on the original data matrix X.

The methods discussed so far necessitate the use of convex cost functions for an optimization. From my professor Dr. Erin Pearse (thanks!):

The term convexity only makes sense when discussing vector spaces, and in that case a subset U of a vector space is convex iff any convex combination of vectors in U is again in U. A convex combination is a linear combination where the coefficients are nonnegative and sum to 1.

Convex functions are similar but not entirely related. A convex function does not have any local optima that aren’t also global optima which means that if you’re at a maximum or minimum, you know it is global.

(I think there is a reason why people in optimization refer to surfaces as landscapes. An interesting surface may have many hills and valleys, and finding an optimal path is like a hiker trying to cross a mountain path blind — potentially problematic.)

Convex functions will always achieve the same solution given some input parameters, but non-convex functions may get stuck in some local optimum. This is why a method like t-SNE will converge to different results on different runs.

Methods for dimension reduction will be either linear or non-linear mappings. In both cases, the original data matrix $X$ is embeddable in some manifold. A manifold is any space that is locally homeomorphic to $R^{d}$ for some dimension $d$ (a surface is the case $d = 2$). We want these mappings to preserve the local structure of the manifold, while also possibly preserving the global structure. This depends on the task meant to be done with the reduced data. I think the notion of structure is left specifically vague in literature because it is just so damn weird (it is really hard to think about things in greater than 3 dimensions…)

A great example of data embeddable in a weird, albeit three-dimensional, manifold is the Swiss roll (image borrowed from dinoj). The many different dimension reduction methods available will have disparate results when performed on this data. When restricted to paths along the manifold, red data will be far apart from black, but if a simple Euclidean distance is measured, the points might be considered close. A dimension map that uses simple Euclidean distance between points to resolve structure will fail miserably to eke out the Swiss roll embedding.

When looking to investigate the lower dimensional space created by a dimension reduction, linear mappings have an explicit projection provided by the matrix formed by the eigenvectors. Non-linear methods do not have such an explicit relationship. Finding physical meaning from the order parameters given by a non-linear technique is an active area of research.

It might be too little detail for some, but the rest of this post will be focused on providing a quick explanation of various dimension reduction techniques. The general format will be:

• optimization problem posed
• formal eigenvalue problem given
• interesting insights and relations
• pictures that I like from other work

## Multidimensional Scaling (MDS), Classical Scaling, PCA

• PCA cost function: Maximizes $Trace(M^{T}cov(X)M)$
• PCA eigenvalue problem: $cov(X) M = \lambda M$, where the columns of the linear mapping M are the eigenvectors with the largest eigenvalues, i.e. the directions that maximize the variance
• Quote from a Cecilia Clementi paper on diffusion maps where she mentions PCA: ‘Essentially, PCA computes a hyperplane that passes through the data points as best as possible in a least-squares sense. The principal components are the tangent vectors that describe this hyperplane’

• Classical scaling relies on the number of datapoints not the dimensionality.

• Classical scaling cost function: Minimizes $$\phi(Y) = \sum_{ij} \left( d_{ij} - \| y_{i} - y_{j} \|^{2} \right)$$ this is referred to as a strain cost function.
• Other MDS methods can use stress or squared stress cost functions
• Classical scaling gives the exact same solution as PCA
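
The PCA bullets above can be illustrated with a tiny two-dimensional sketch. For a single direction, maximizing the trace expression amounts to picking the eigenvector of the covariance matrix with the largest eigenvalue. The data points are made up, and the closed-form 2×2 eigen-solution is used only to keep the example dependency-free; it assumes a non-zero off-diagonal covariance.

```python
import math


def top_principal_direction(points):
    """Largest eigenvalue and unit eigenvector of the 2x2 sample covariance."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the symmetric covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form largest eigenvalue of a symmetric 2x2 matrix.
    lam = (a + c) / 2.0 + math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    # (b, lam - a) solves (cov - lam * I) v = 0 when b != 0.
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return lam, (vx / norm, vy / norm)


# Made-up points scattered tightly around the line y = 2x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
noise = [0.1, -0.1, 0.1, -0.1, 0.1]
points = [(x, 2.0 * x + e) for x, e in zip(xs, noise)]

eigenvalue, direction = top_principal_direction(points)
print(eigenvalue, direction)
```

The recovered direction is very close to the unit vector along y = 2x, which is the line that passes through the points "as best as possible" in the least-squares sense quoted above.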

## Isomap

• Geodesic distances are computed by constructing a nearest-neighbor graph and using Dijkstra’s algorithm to find the shortest paths. Erroneous connections can be made by improperly connecting neighbors.
• Can fail if manifold has holes.
• Demonstration of failure of PCA versus success of Isomap
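
The geodesic-distance step can be sketched in a few lines. This is only the first stage of Isomap (the final embedding step is left out), run on a made-up two-dimensional spiral that stands in for a cross-section of the Swiss roll. The number of points and k = 2 are arbitrary choices.

```python
import heapq
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph with Euclidean edge weights."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted(
            (math.hypot(p[0] - q[0], p[1] - q[1]), j)
            for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def dijkstra(graph, source):
    """Shortest-path distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# One full turn of an Archimedean spiral, sampled at 200 points.
ts = [2.0 * math.pi * (1.0 + i / 199.0) for i in range(200)]
points = [(t * math.cos(t), t * math.sin(t)) for t in ts]

first, last = points[0], points[-1]
euclidean = math.hypot(first[0] - last[0], first[1] - last[1])
geodesic = dijkstra(knn_graph(points, k=2), 0)[len(points) - 1]
print(euclidean, geodesic)
```

The two end points are only about 6.3 apart in a straight line, but more than nine times that far apart along the curve, which is exactly the structure a purely Euclidean method misses.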

## Kernel PCA

• Does PCA on a kernel function, retains large pairwise distances even though they are measured in the feature space

## Diffusion Maps

• The key idea behind the diffusion distance is that it is based on integrating over all paths through the graph.
• Isomap can short-circuit, but the averaging over paths in diffusion maps prevents this from happening: the distance is based not on one shortest path but on a collection of paths.
• Pairs of datapoints with a high forward transition probability have a small diffusion distance
• Eigenvalue problem: $P^{(t)} v = \lambda v$, where $P$ is a diffusion matrix reflecting all possible pairwise diffusion distances between two samples
• Diagonalization means that we can solve the equation for t=1 and then exponentiate eigenvalues to find time solutions for longer diffusion distances
• Because the graph is fully connected, the largest eigenvalue is trivial
• The same revelation also stems from the fact that the process is Markovian: the step at time t depends only on the step at time t-1, so it forms a Markov chain.
• Molecular dynamics processes are certainly Markovian; protein folding can be modeled as a diffusion process with RMSD as a metric
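
A minimal sketch of the Markov-chain view, assuming a Gaussian kernel with an arbitrary bandwidth and a made-up one-dimensional dataset (real diffusion map implementations add further normalization that is omitted here). It shows that the row-normalized kernel is a transition matrix, that the constant vector is its trivial eigenvector with eigenvalue 1, and that longer diffusion times are just matrix powers.

```python
import math


def diffusion_matrix(points, epsilon=1.0):
    """Row-normalize a Gaussian kernel into a Markov transition matrix."""
    kernel = [
        [math.exp(-((a - b) ** 2) / epsilon) for b in points] for a in points
    ]
    return [[k / sum(row) for k in row] for row in kernel]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matrix_power(p, t):
    """P multiplied by itself t times (t >= 1), i.e. t diffusion steps."""
    result = p
    for _ in range(t - 1):
        result = matmul(result, p)
    return result


points = [0.0, 0.5, 1.0, 3.0, 3.5]   # two loose clusters
P = diffusion_matrix(points)
P3 = matrix_power(P, 3)

# Applying P to the all-ones vector returns the all-ones vector again.
ones = [[1.0] for _ in points]
P_times_ones = matmul(P, ones)
print(P_times_ones)
```

Nearby points get a high forward transition probability (compare `P[0][1]` with `P[0][3]`), which is what makes their diffusion distance small.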

## Locally Linear Embedding:

• LLE describes the local properties of the manifold around a datapoint $x_i$ by writing the datapoint as a linear combination $w_i$ (the so-called reconstruction weights) of its k nearest neighbors $x_{i_j}$.
• It solves a generalized eigenvalue problem, preserves local structure.
• Invariant to local scale, rotation, translations
• Cool picture demonstrating the power of LLE:

• Fails when the manifold has holes
• In addition, LLE tends to collapse large portions of the data very close together in the low-dimensional space, because the covariance constraint on the solution is too simple

## Laplacian Eigenmaps:

• Laplacian Eigenmaps compute a low-dimensional representation of the data in which the distances between a datapoint and its k nearest neighbors are minimized.
• The ideas studied here are a part of spectral graph theory
• The computation of the degree matrix M and the graph Laplacian L of the graph W allows for formulating the minimization problem defined above as an eigenproblem.
• Generalized Eigenproblem: $Lv = \lambda Mv$
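
As a small sketch of the quantities named above, here are the degree matrix M and the graph Laplacian L = M - W for a made-up four-node graph with unit edge weights. The check at the end shows that the constant vector always solves the generalized eigenproblem with eigenvalue 0, which is why that trivial solution is discarded and the next-smallest eigenvectors are used for the embedding.

```python
def degree_and_laplacian(W):
    """Degree matrix M and graph Laplacian L = M - W of a weighted graph."""
    n = len(W)
    M = [[sum(W[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]
    L = [[M[i][j] - W[i][j] for j in range(n)] for i in range(n)]
    return M, L


# Made-up symmetric adjacency matrix of a small neighborhood graph.
W = [
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
M, L = degree_and_laplacian(W)

# L applied to the all-ones vector is just the vector of row sums of L.
L_times_ones = [sum(row) for row in L]
degrees = [M[i][i] for i in range(len(M))]
print(degrees, L_times_ones)
```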

## Hessian LLE:

• Minimizes curviness of the high-dimensional manifold when embedding it into a low dimensional data representation that is locally isometric
• What is a Hessian? Hessian LLE uses a local Hessian at every point to describe curviness.
• Hessian LLE shares many characteristics with Laplacian Eigenmaps: It replaces the manifold Laplacian by the manifold Hessian. ‘As a result, Hessian LLE suffers from many of the same weaknesses as Laplacian Eigenmaps and LLE.’

## Local Tangent Space Analysis:

• LTSA simultaneously searches for the coordinates of the low-dimensional data representations, and for the linear mappings of the low-dimensional datapoints to the local tangent space of the high-dimensional data.
• Involves applying PCA on k neighbors of x before finding local tangent space

## Sammon mapping

• Adapts classical scaling by weighting the contribution of each pair $(i, j)$ to the cost function by the inverse of their pairwise distance $d_{ij}$ in the high-dimensional space
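
A minimal sketch of the resulting cost, written as Sammon's stress with the usual normalization by the sum of all high-dimensional distances. The three points and both candidate embeddings are made up, and only the cost is evaluated; no optimization is performed.

```python
import math


def euclid(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def sammon_stress(high, low):
    """Sammon's stress: squared distance errors weighted by 1 / d_ij."""
    total = 0.0
    weighted_error = 0.0
    for i in range(len(high)):
        for j in range(i + 1, len(high)):
            d = euclid(high[i], high[j])        # high-dimensional distance
            delta = euclid(low[i], low[j])      # low-dimensional distance
            total += d
            weighted_error += (d - delta) ** 2 / d
    return weighted_error / total


high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
perfect = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]     # preserves all distances
distorted = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # squashes one axis

perfect_stress = sammon_stress(high, perfect)
distorted_stress = sammon_stress(high, distorted)
print(perfect_stress, distorted_stress)
```

Because each error is divided by $d_{ij}$, a given error on a pair of nearby points costs more than the same error on a pair of distant points.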

## Multilayer Autoencoder

• Uses a feed forward neural network that has a hidden layer with a small number of neurons such that the neural network is forced to learn a lower dimensional structure
• This is identical to PCA if using a linear activation function! What undiscovered algorithms will be replicated by neural nets? Will neural nets actually hurt scientific discovery?

Alright, so that’s all the gas that is in my tank for this post. Hopefully you’ve come away understanding something a little bit better than before. In my next post, I am going to focus on diffusion maps as they pertain to molecular dynamics simulations. Diffusion maps are really cool in that they really are an analogue of the physical nature of complex molecular systems.

## June 12, 2016

### Ramana.S (Theano)

#### GSoC Fortnight Update

Week2 Update:
Apologies for the delayed update on my progress in weeks 1 and 2. It is two weeks into the coding phase of GSoC, and my experience has been greatly overwhelming. I have been learning a lot of new things every day, and I am facing challenging tasks that keep my motivation high to work and learn more.
In the first two weeks, I have built a new global optimizer in Theano. This new global optimizer builds a graph in parallel to the existing graph and transfers the Ops to the GPU in one single pass. The new optimizer has been performing pretty well so far; currently I am analysing the time it takes to optimize each node and working on speeding up the slower ones.
This optimizer is giving some amazing results. It halves the time taken for optimization. I am currently profiling the optimizer on a Sparse Bayesian Recurrent Neural Network model. The optimization takes 32.56 seconds to reduce 10599 nodes to 9404 nodes with the new optimizer, whereas it used to take 67.7 seconds to reduce 10599 nodes to 9143 nodes with the old optimizer. This has been a pleasing result.
Also, along the way, a few more tasks were done to speed up the existing optimizers. One was reducing the number of times a new instance of an Op is created, by pre-instantiating the Ops. This gave a speedup of ~2 sec. Another was caching the GPUAlloc class (the class that is used to create initialized memory on the GPU). This has reduced the optimizer timing by ~3 sec.
I hit a few roadblocks while building this new optimizer. Thanks to my mentor (Frèdèric), who helped me get past a few of them. It has been an amazing time working with such a highly knowledgeable and experienced person.
I am profiling the results this new optimizer gives on a few deep learning models to evaluate its overall performance. In the next few days, I will write another blog post elaborating on the profiling results of this optimizer, and I will make it work with models that take an unusually long time to compile with the current optimizers when the model parameters are humongous.
That's it for now folks, stay tuned!

The next important step in my project is creating an upload tool.

This tool will take every single bear one by one and upload it.

### How will I do this?

Simple. Say we are in coala-bears/bears/. Now here’s where the script is. I basically want to create an upload/ folder, where I will have all the bears ready to upload. The tool takes every bear, one by one, creates a subdirectory in the upload/ folder, that subdirectory having the bear’s name, where it will place the setup.py file and the requirements for that bear, and then uploads it.

### The trick

The setup.py and the requirements.txt files will be automatically generated based on each bear’s requirements and name, fitting it perfectly.
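
The directory layout described above could be prepared roughly as follows. This is only a sketch under my own assumptions: the bear name, the version string and the `setup.py` template are hypothetical placeholders, and the actual upload step is left out entirely.

```python
import tempfile
from pathlib import Path

# Hypothetical minimal template; a real setup.py would carry more metadata.
SETUP_TEMPLATE = """from setuptools import setup

setup(
    name={name!r},
    version='0.1.0',
    install_requires={requirements!r},
)
"""


def prepare_upload_dirs(bears, upload_root):
    """Create upload/<BearName>/ holding a generated setup.py and requirements.txt."""
    upload_root = Path(upload_root)
    created = []
    for name, requirements in sorted(bears.items()):
        bear_dir = upload_root / name
        bear_dir.mkdir(parents=True, exist_ok=True)
        (bear_dir / "requirements.txt").write_text("\n".join(requirements) + "\n")
        (bear_dir / "setup.py").write_text(
            SETUP_TEMPLATE.format(name=name, requirements=list(requirements))
        )
        created.append(bear_dir)
    return created


# Try it out in a throwaway directory with one made-up bear.
with tempfile.TemporaryDirectory() as tmp:
    dirs = prepare_upload_dirs({"ExampleBear": ["coala"]}, Path(tmp) / "upload")
    generated = sorted(
        p.relative_to(tmp).as_posix() for d in dirs for p in d.iterdir()
    )
    setup_text = (dirs[0] / "setup.py").read_text()

print(generated)
```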

### meetshah1995 (MyHDL)

#### Developer's Den

This week was a fun-filled week. I had to set up continuous integration, documentation, test coverage and code quality tags for my repository.

I always wanted to have those tags for my repository like the professional OS repositories on GitHub. Turns out this was my chance.

We chose travis for CI, coveralls for test coverage, landscape.io for code quality and readthedocs for documentation. It was a bit difficult to get everything in place at first, but once I got the hang of it, it became very easy to set up.

Another important piece of progress this week was that I got my first PR, with the Python decoder, merged into the main repo :). I am looking forward to getting the HDL decoder working as soon as possible so that I can move on to the real RISC-V processor design.

This week was mostly about getting the PRs and repo workflow in place, with some interaction with my mentor and invaluable tips from his end.

With all things cleared up, I think I am good to go for the coming weeks!

See you next week,
MS

### Shridhar Mishra (italian mars society)

#### Update! @12th June 2016

Here's the progress till now.
• Successfully imported and ran all the test programs and Kinect works fine.
• Successfully ran Vito's code and transferred skeleton joint movements to a Linux virtual machine.

For plotting the skeleton using joint data from the JSON file, I'll be using a Unity plugin which is available for free in the Unity app store.

Further work would be integration of the plugin with the existing code.

Cheers!
Shridhar

### Riddhish Bhalodia (dipy)

#### Summary Blog

This describes all the work that I have done in the coding period from 23rd May till 10th June. It’s going to be a short one, as I have described almost every step in the previous posts.

1. Finished coding the localPCA denoising algorithm proposed in the GSOC abstract in Python and started optimising it by converting it to Cython.
2. Also successfully implemented the MUBE and SIBE noise estimation required for local-PCA denoising
3. Finished the adaptive denoising PR, which is waiting for review and merge.
5. Wrote simple tests for the local-PCA and adaptive denoising
6. Compared the local PCA implementation with the MATLAB implementation provided by the authors (the code was encrypted, so I could not view it and just compared the outputs); the results match very closely
8. Blogging😀

## Immediate next steps

1. Phantom for the diffusion imaging and work on some new ideas
2. Conversion to cython
3. Code optimisation by enhancing theoretical complexity using different computation techniques
4. Rician Correction

## June 11, 2016

### Pulkit Goyal (Mercurial)

#### cPickle, StringIO and Queue

When switching to Python 3, or even when switching from one version to another of other languages, one basic task we generally come across is changing the library imports. With the newer versions of Python, there come changes in the standard library. Some module names are changed, whereas some are merged to form one. Some are gone and some are introduced.
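
For the three modules in the title, a common way to handle the renames is a guarded import that tries the Python 2 name first and falls back to the Python 3 one. This is a generic sketch of that pattern, not necessarily the approach taken in Mercurial.

```python
try:
    import cPickle as pickle           # Python 2 name
except ImportError:
    import pickle                      # Python 3: cPickle was folded into pickle

try:
    from StringIO import StringIO      # Python 2 name
except ImportError:
    from io import StringIO            # Python 3: lives in the io module

try:
    import Queue as queue              # Python 2 name
except ImportError:
    import queue                       # Python 3: renamed to lowercase

# The rest of the code can use the same names on both versions.
buffer = StringIO()
buffer.write(u"hello")

jobs = queue.Queue()
jobs.put(pickle.dumps({"answer": 42}))
restored = pickle.loads(jobs.get())
print(buffer.getvalue(), restored)
```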

## June 10, 2016

### TaylorOshan (PySAL)

#### Further tests with sparse data structure for spatial interaction models

This week the effort to build sparse compatibility for spatial interaction models was continued. Rather than write tests, and then change the code and need to re-write tests, it was decided to finalize the framework and then follow up with the tests. One of the main bottlenecks for spatial interaction models is the huge number of dummy variables used by the constrained variants. Therefore, after conferring with my mentors, more effort was put into exploring functions to generate sparse dummy variables, some of which can be seen here. Thus far it has been a success: producing the design matrix with the appropriate dummy variables for a model using ~3000 locations (like the US counties), which implies 9 million OD flows, now takes minutes rather than an hour+.
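
To show why sparse storage helps here, the sketch below builds origin dummy variables as plain coordinate lists: every flow has exactly one 1 in its row, so only that position needs to be stored. The function name and the three-location example are hypothetical, and this is not the PySAL code; a real implementation would more likely hand these coordinates to a sparse matrix library.

```python
def sparse_dummies(labels):
    """Coordinates of the 1s in a dummy-variable matrix, one per observation."""
    categories = sorted(set(labels))
    column_of = {cat: j for j, cat in enumerate(categories)}
    rows = list(range(len(labels)))
    cols = [column_of[label] for label in labels]
    return rows, cols, len(labels), len(categories)


# Every origin-destination pair between three made-up locations.
locations = ["A", "B", "C"]
origins = [origin for origin in locations for _ in locations]

rows, cols, n_rows, n_cols = sparse_dummies(origins)
dense_cells = n_rows * n_cols    # cells a dense matrix would allocate
stored_cells = len(rows)         # non-zero entries actually needed
print(n_rows, n_cols, dense_cells, stored_cells)
```

Scaled up to 3000 locations, the dense origin dummy matrix would have 9,000,000 × 3000 = 27 billion cells, of which only 9 million are non-zero.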

Once the sparse implementation was complete, some tests were run to compare the sparse and dense framework speeds. Obviously, in the constrained variants, sparse always wins hands down, but for the unconstrained model, dense matrices are actually faster. Therefore, it was decided that the codebase will adapt according to which model is being estimated. Finally, the branches containing the dense and sparse implementations were merged into the glm_spint branch, which is the main working feature branch for this GSOC project. Unfortunately, the end of the week was spent resolving some bugs that resulted from the merging of the sparse and dense frameworks, and I was unable to finish developing unittests for the general spint framework. This will be the first task next week before adding features to extend the framework to handle over-dispersed and zero-inflated data.

### tsirif (Theano)

#### Multi gpu and Platoon

Theano has begun developing two relatively new frameworks to extend her functionality for her users, as well as for her contributors. The first one is a new GPU frontend (and backend from Theano’s perspective) which aims to make GPU programming easier in general and backend-agnostic. The second one is a framework targeted at those who are using Theano to train their deep learning models.

## libgpuarray/pygpu

libgpuarray is a library which provides a GPU programmer with a two-level facade. This facade is an efficient wrapper of both the CUDA and OpenCL APIs for GPU programming. The user selects the backend to use at runtime, and the library provides a basic gpucontext, which abstractly represents the usual executable GPU context, to use for gpudata or gpukernel handling. The lower “buffer” level of the facade provides the user with basic structures and functions to handle data, kernels and contexts in the way that the regular GPU APIs do (e.g. the CUDA Driver API). The higher “array” level provides a full array interface with structures and functions which are similar to NumPy’s ndarrays. Partial BLAS (e.g. GEMM) support also exists at both levels.

In addition to libgpuarray’s C API, this framework also delivers Python bindings and a module that provides numpy-like GPU ndarrays! This is the most interesting part for Python users, as it extends their already powerful numpy framework to include GPU calculations. As of recently, Theano uses pygpu as its backend. :D

My role, as explained in a previous post, is to extend Theano’s GPU support and work towards including multi-GPU/node functionality. Working on the second part right now, I am including a multi-GPU collectives API in libgpuarray in the same original spirit. Considering that the only MPI-like multi-GPU interface out there (and, to a lesser extent as of now, multi-node; the guys have put much effort into optimizing things) is NVIDIA’s recently developed nccl, libgpuarray will at first support collectives only for the CUDA backend. Finally, Python bindings and a module for easy multi-GPU programming will be provided through pygpu. In total: general support for collective multi-GPU programming in Python.

Follow and participate in the discussions in this thread of the theano-dev Google group. Also, you can watch the pull request discussions on GitHub in order to follow its progress more closely.

## Platoon

The second framework I referred to in the introduction is Platoon. As of now, Platoon is described as:

Experimental multi-GPU mini-framework for Theano

It supports data-parallelism inside one compute node, not model-parallelism.[…]

It provides a controller/worker template architecture for creating distributed training procedures for deep learning models. It extends Theano so that users who need to train larger Theano models faster can use it to improve training performance. Note that it covers only data-parallelism. This means that each worker process (which is responsible for managing a single GPU) uses every parameter of the model being trained. And Platoon will remain this way, as model-parallelism is handled elsewhere. Performance improvements can be found in Theano’s latest (2016/05/09) technical report (search for “Platoon”).
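
Data-parallelism can be sketched in a few lines: every worker keeps the full parameter set and only the data is split. The example below simulates two workers inside one process on a one-parameter model and uses a simple synchronous gradient average. This is my own simplification for illustration; it is neither Platoon's API nor one of the algorithms it implements.

```python
def gradient(w, shard):
    """Gradient of the mean squared error of the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, lr):
    # Every worker holds the same full parameter w and sees only its own shard.
    local_grads = [gradient(w, shard) for shard in shards]
    # Synchronous rule: average the workers' gradients, then update once.
    return w - lr * sum(local_grads) / len(local_grads)


# Made-up data generated by y = 3x, split between two simulated workers.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.02)
print(w)
```

In model-parallelism, by contrast, the parameters themselves would be split across workers, which is the case Platoon deliberately leaves out.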

Nevertheless, distributed training algorithms are kind of a new area to explore, especially for deep learning problems. But with the ever-evolving capabilities of GPUs, there is for sure interest in the field. Currently, there are two algorithms implemented in Platoon: an asynchronous Stochastic Gradient Descent [SGD from now on and ever :P] and Elastic Averaging SGD.

I have two goals considering the development of Platoon. The first is to extend its worker/controller API for multi-gpu and multi-node programming. Of course for this purpose pygpu will be used in conjunction with a Python MPI framework to work out a worker interface for multi-gpu/node collectives in Python. This interface is intended to be easy to use and intuitive in order to be used in constructing readable and efficient distributed training algorithms. For starters at least, there won’t be any effort in making an optimized implementation (nccl tries to make multi-node topology-optimized/aware framework - libgpuarray/pygpu will follow this course and platoon worker/controller will provide a working multi-gpu/node interface throughout). You can follow and participate in the designing discussions in this thread of theano-dev (see proposed design discussions there).

The second goal is to make Platoon an experimentation and development framework for distributed training dynamics, as well as a gallery of reusable (and easily configurable) implemented parts of training algorithms. The user will be able to easily combine existing code parts in order to create a specific variant of an algorithm. These parts can be combined through a provided GenericTraining interface or used standalone in the user’s code. The user will also be able to create his/her own parts easily through the Theano + Worker interface (or actually his/her own defined functions). Generic validation and testing tools will also be provided. This will be implemented by recognizing that a training algorithm consists of the following parts, which are independent yet each influence the training procedure:

1. Sampling strategy
2. Local (per worker/particle) minimization dynamics
3. A condition which dictates when local information should be combined
4. Global minimization dynamics (sync rule)
5. A condition which dictates when training is considered to have ended
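
The five parts listed above can be pictured as interchangeable callables plugged into one generic loop. The sketch below is purely illustrative: the `train` function and every argument name are hypothetical, it runs in a single process, and it does not reflect the actual GenericTraining interface.

```python
import random


def train(workers, sample, local_step, should_sync, sync, should_stop):
    """Generic loop assembled from the five interchangeable parts."""
    step = 0
    while not should_stop(step, workers):                  # part 5
        # Parts 1 and 2: each worker samples data and takes a local step.
        workers = [local_step(w, sample()) for w in workers]
        if should_sync(step):                              # part 3
            workers = sync(workers)                        # part 4
        step += 1
    return workers


rng = random.Random(0)

final = train(
    workers=[0.0, 10.0],                                   # one parameter each
    sample=lambda: 5.0 + rng.uniform(-0.1, 0.1),           # noisy target near 5
    local_step=lambda w, target: w - 0.1 * 2.0 * (w - target),
    should_sync=lambda step: step % 10 == 9,               # every 10th step
    sync=lambda ws: [sum(ws) / len(ws)] * len(ws),         # plain averaging
    should_stop=lambda step, ws: step >= 200,
)
print(final)
```

Swapping any single callable (for example a different sync rule) gives a different variant of the algorithm without touching the loop.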

You can see more of this discussion in the same Platoon thread as above, as well as in the documentation in my fork of Platoon.

Keep coding!
Tsirif, 10/06/2016

### kaichogami (mne-python)

#### 2nd and 3rd Week of GSoC

Hello everyone!
GSoC is almost reaching the end of its first phase (by almost I mean 10 more days). I have learnt a lot while working with MNE. Fortunately my undergraduate degree is in electronics, so I can figure out some of the technicalities related to signals (that is the reason I gained interest in MNE), although I always have trouble when my mentors or people in the community talk in detail😛
These two weeks were spent designing the API for the new decoding module (if you have no idea what I am talking about, please refer to my last post). We are close to finalizing the Xdawn Transformer refactoring, which ensures scikit-learn pipeline compatibility. It is extremely light compared to the original version, as it does not work on epochs objects. However, the drawback is its inability to handle the overlap case. An overlap case occurs when the event times are at the same instant.
The second thing we worked on was the new classes required to fulfill Jean's idea. As mentioned in the last post, the idea involves taking 2D input at every stage of the pipeline (following the sklearn API), converting it back to the shape required by the MNE functions inside the transformer, and returning the original value. This pipeline format will ensure that no external data is required when there is some reduction in the data (such as resampling, where the frequency of the signal changes, so that change has to be passed on to the next step). Now, as the pipeline finishes in a scikit-learn estimator, we just need to convert it back to 2D. To have a rough idea, please refer to this PR.
I hope I managed to convey some idea about my progress.

Thank you
Asish Panda

### Karan_Saxena (italian mars society)

The past two weeks were pretty eventful. My project saw tremendous build-up during these days.

After my last update, my Kinect sensor was working with my Thinkpad on i5 processor. I was able to access the feed using Kinect Studio v2.0 and SDK Browser v2.0.
Kinect Studio provides an interface to the KinectService.exe, which in turn controls the Kinect Sensor. In other words, Kinect Studio provides a highly abstracted GUI for Kinect Sensor control.
SDK Browser contains SDK documentation and examples on how to control the Sensor via code.
Since the Kinect SDK provides native support only for C# and C++ (why Microsoft, why?), I was left with no other option but to use a third party wrapper - PyKinect2

So first things first, PyKinect2 is weird, like really really weird. Not only does it have almost ZERO documentation, but since it is comparatively new, there are almost no third party tutorials or blogs out there on the internet for explanation. I was almost ready to search for some alternative wrapper.
Well, after spending a lot of time Googling (even searching on page 2 of the results :P), I finally found a blog which gave me a ray of hope that PyKinect2 actually works. Check it out here.

With my newly found motivation, I set up my project on Visual Studio 2013.

Here's how I proceeded:
1) Installed Anaconda 32 bit, since PyKinect2 only supports 32 bit architecture (similar to PyKinect for XBox 360).
2) Numpy is included in Anaconda distribution.
3) Installed comtypes.
5) Installed PyGame. This can get messy because of Wheels. Thankfully, Vito has explained this beautifully here.

I installed all these in my Virtualenv so as not to mess up my global configuration. Virtualenv will also help me collect dependencies at a single place while checking out the project.

After adding the code to the project, the skeleton tracking was working perfectly.

Next steps:
1) To dump the skeleton coordinates in a JSON file for Shridhar to continue with his part of the project.
2) To read more about OpenCV from here and try to convert the PyKinect code to its PyKinect2 equivalent.
3) We moved to GitHub Enterprise from BitBucket. I am currently having SSL issues with my repo (link). I am figuring it out with my mentor. Once done, I will push the current code to my branch.
4) I will be having my exams for the next 2 weeks, and hence my work pace might be slow. After my exams, I will be completely free till the end of GSoC. Then I will make up for whatever I miss.

Onwards and Upwards!!

## Recap

After week 1 my result was a bare bones communication between the wrapper bear and the given executable, as I mentioned in the previous post. The whole gimmick/trick of this project is to use JSON strings in order to communicate between the 2 entities; this way we can ensure that any language which is able to parse a JSON string will be a valid language to write a bear for coala. Last week I worked on making the wrapper send JSON strings containing the necessary information to the executable and parsing its output into result objects. My goal for this week was to write tests and set up the CI to build the project properly.

## Problems

When I wrote my timeline for this summer I was pretty sure I will be encountering difficulties so I reserved some weeks especially for solving them. In that way I was sure that I would not delay my project. While developing the JSON communication I had a minor (maybe not minor) setback. It is a bit technical but I will try to explain it without getting into many details. Basically, coala can take settings for its bears (ex: max_line_length) which are specific to each bear. A potential developer of a foreign language bear must be able to have access to such settings, obviously via the JSON string passed to his executable. The class returned by the decorator should include this functionality, but the settings have to be passed in the user's wrapper somehow. I managed to find a nifty way to do it by creating a dummy function. Without getting into any more details the settings parsing works well but there is a problem when optional settings are used. If an optional setting is not specified in the .coafile or in the command line, then the default value should be used. Well in my case, the default value is not used. In order to solve this problem I tried many simple fixes and workarounds but I think I have to dig a bit more into the coala core. I decided to postpone this for my buffer/problem-solving week, which is week 4. Fortunately this problem is not anything game-breaking and the week 3 goal can be touched without any setbacks.
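
To make the JSON exchange and the default-value behaviour concrete, here is a small self-contained sketch. The JSON field names, both function names and the fake "executable" are hypothetical illustrations, not the actual coala protocol; in the real project the executable would be a separate program in another language.

```python
import json


def build_request(filename, lines, settings, defaults):
    """Wrapper side: merge user settings over the defaults and serialize."""
    merged = dict(defaults)
    merged.update(settings)   # values from the .coafile or CLI win over defaults
    return json.dumps({"filename": filename, "file": lines, "settings": merged})


def fake_executable(request_json):
    """Stand-in for a bear written in another language: JSON in, JSON out."""
    request = json.loads(request_json)
    limit = request["settings"]["max_line_length"]
    results = [
        {"line": number, "message": "Line is longer than %d characters" % limit}
        for number, line in enumerate(request["file"], start=1)
        if len(line.rstrip("\n")) > limit
    ]
    return json.dumps({"results": results})


# The optional setting is not given by the user, so the default must apply.
request = build_request(
    "example.txt",
    ["short line\n", "x" * 100 + "\n"],
    settings={},
    defaults={"max_line_length": 80},
)
response = json.loads(fake_executable(request))
print(response["results"])
```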

## Testing

By testing your software you make sure that everything works as expected, but more importantly you ensure that any following development will not break the other modules. In coala, continuous integration (CI) is set up to use coverage such that all of the pushed code is relevant (without unused parts). To make sure that every line of code you write is executed, you also have to write appropriate test cases. This means that a pull request (PR) is never accepted in coala if it does not have proper testing (aka coverage drops below 100%). In my case, I had to study how testing is done in coalib (the coala library), since I had only written tests for bears so far.

## Wrap Up

Since I started, I haven't managed to merge any of my code because it didn't have appropriate testing. By the end of this week, I expect to be able to merge my code and bring coala (a little more than just bare bones) foreign language bear support. The following parts of the project will include utilities to make the foreign language bear developers write as little python code as possible, ideally none at all (though I don't see why, python is awesome :D).

### shrox (Tryton)

#### FODT to ODT with LXML

I wanted to be able to make an ODT file given an FODT file that LibreOffice can open and I have pretty much succeeded.

I faced a few challenges along the way due to my lack of proficiency in lxml, though they seem to be very easy in hindsight. For example, I spent an inordinate amount of time linking the <draw:image> tag to images in the Pictures directory.

I began the project by taking content from the FODT and dumping it into the files according to its tags. My mentor Cedric’s already existing work helped me to a great extent with this. I was taking an entirely different path before I had a look at what he had done. It took me a while to understand his code as well, but it did make things simpler for me.

After that I concentrated specifically on the content.xml file of the to-be ODT. I needed to remove the images. This was simple enough, as all I needed to do was remove the binary data.

I made use of XPath to locate the various tags in the XML and it’s impressive just how simple to use and intuitive it is.
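
As a rough illustration of that step, the sketch below strips the embedded binary data from each `<draw:image>` and links the image to a file under Pictures/ instead. It uses the standard library's ElementTree rather than lxml so that it runs without extra dependencies, and the file naming scheme and the sample document are my own placeholders. A complete converter would also have to write the image files, the manifest and the zip package, none of which is shown.

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
    "xlink": "http://www.w3.org/1999/xlink",
}


def link_images(root):
    """Swap embedded binary data for links into Pictures/; return the data."""
    extracted = {}
    for index, image in enumerate(root.findall(".//draw:image", NS)):
        binary = image.find("office:binary-data", NS)
        if binary is None:
            continue
        path = "Pictures/image%d.png" % index      # hypothetical naming scheme
        extracted[path] = binary.text              # still base64-encoded
        image.remove(binary)
        image.set("{%s}href" % NS["xlink"], path)
    return extracted


# Tiny stand-in for an FODT document; the payload is placeholder base64 text.
FODT = (
    '<office:document'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"'
    ' xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<draw:frame><draw:image>'
    '<office:binary-data>aGVsbG8=</office:binary-data>'
    '</draw:image></draw:frame>'
    '</office:document>'
)

root = ET.fromstring(FODT)
extracted = link_images(root)
image = root.find(".//draw:image", NS)
href = image.get("{http://www.w3.org/1999/xlink}href")
print(href, extracted)
```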

I plan on now really cleaning up my code. It is a mess as of now. I also need to make a couple of minor changes and additions. Another problem that occurs right now is that LibreOffice is having to “repair” the ODT file that my converter makes. It will have to be fixed as well.

### srivatsan_r (MyHDL)

The HDMI receiver core is almost done. I have been reading the Xilinx application notes for the HDMI receiver IP core. I will complete the HDMI receiver IP core soon and then start debugging it.

The receiver core is more complicated than the transmitter core. In the receiver there are modules for word boundary detection, channel deskew, etc., which were not present and are not needed in the transmitter.

I learnt how to account for Xilinx primitives in the code using MyHDL. Xilinx primitives are modules which have dedicated space in the FPGA, so we need not waste logic gate space on them. We have to use the user-defined code feature for this case: we can implement the logic using MyHDL (so that simulations can be done) and use the Xilinx primitive module only when the code is converted to Verilog.

### Shubham_Singh (italian mars society)

#### GSOC 2016 Project Update

So the next 2 weeks will be very crucial for me to speed up the development process for my project. In the next couple of days I will be defining a very detailed schedule for the project with my mentor.

I am aiming to complete the GUI part of the project using Python Qt by designing and defining all the necessary labels, input fields, buttons and some basic output screens to show the results in the next 2 weeks. Further, I will also try to design basic tables to store the relevant data in the database using MongoDB and then link the database backend (MongoDB) with the front end (GUI using Python Qt).
Also, I spent my time this week learning about plotting graphs using pygraph, which will be required at a later stage to show the change in the astronaut's HI over a specified time interval.

For all the above work to be done on time and before deadlines, I will be defining a detailed plan for the next 2 weeks (a day-by-day schedule if possible) with my mentor.
"Let the Force be with me".

### SanketDG (coala)

#### Summer of Code Week 2

Hello,

This is Week 2 of my Google Summer of Code project. This week, I will talk about why documentation is important and show you a simple bear that capitalizes every uncapitalized sentence. (I know, right?)

When we talk about documentation in a codebase, we generally mean documentation of objects, i.e. of classes and functions. This helps future hackers to get started easily with the codebase. The general convention is to give a brief explanation of the object, describe the parameters it takes and what it returns. Does this work?

Generally, yes. Most projects do this, and it should work. But the right question to ask is, “Is it for humans?”

This is why hand-written documentation is important. I personally think the right mix of generated API documentation and hand-crafted documentation works well. This is what we do at coala!

The bear I wanted to talk about doesn’t do much right now, but the key takeaway was that I was able to successfully parse the documentation, analyze and manipulate each section, and assemble back the documentation. I have recorded a small asciinema to demonstrate this:

### ghoshbishakh (dipy)

#### Google Summer of Code Progress June 10

So it has been about 20 days since the coding period began. I have made some decent progress with the backend of the Dipy website.

The target set according to the timeline of my proposal was setting up an authentication system and login with GitHub in Django, along with custom admin panel views for content management.

For now the new Dipy website is hosted temporarily at http://dipy.herokuapp.com/ for testing purposes. The login system and the content management system are almost complete. I have already started designing the frontend. The corresponding code can be found in this pull request.

### Details of Backend Developed So Far

For login with GitHub and Google Plus I have used python-social-auth. After a user logs in, his/her content editing permission is determined by checking if he/she has ‘push’ permission in the dipy_web repository in GitHub.

This is done by fetching repository information from GitHub API with the user’s access token:

The response contains permission information like:

So if a user has push:true permission then he/she has push access to the dipy_web repository and that user is granted permission to edit the content of the website.
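The check described above could look roughly like the following sketch. The helper names `fetch_repo_info` and `can_edit_content` are hypothetical, and the repository owner is left as a parameter because the post does not name it. The sketch relies on the `permissions` object (`admin`, `push`, `pull`) that the GitHub repository endpoint returns for authenticated requests.

```python
import json
import urllib.request

API_URL = "https://api.github.com/repos/{owner}/{repo}"


def fetch_repo_info(owner, repo, access_token):
    """Fetch repository metadata on behalf of the logged-in user."""
    request = urllib.request.Request(
        API_URL.format(owner=owner, repo=repo),
        headers={"Authorization": "token " + access_token},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def can_edit_content(repo_info):
    """Return True only if the response grants the user push access."""
    return bool(repo_info.get("permissions", {}).get("push", False))


# Offline example with a hand-written response fragment (made-up values).
sample = {
    "name": "dipy_web",
    "permissions": {"admin": False, "push": True, "pull": True},
}
print(can_edit_content(sample))  # prints: True
```

Treating a missing `permissions` key as "no access" keeps the check on the safe side when the request was not authenticated.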

Now there are several types of content and each type has its own model:

1. Website Sections: The static website sections that are positioned in different pages.

2. News Posts

3. Publications

#### Website Sections:

The website sections contain some identifiers, such as which page a section belongs to and in which position it should be placed. The content body of a website section is written in markdown. To convert the markdown to HTML, the markdown library is used. The model’s save() method is overridden so that each time a section is edited, new HTML is generated from the markdown. The HTML is filtered using the bleach library.

There is a requirement to embed YouTube videos in the markdown content of website sections. But allowing iframes in markdown would allow embedding any kind of content from any arbitrary source. So I wrote a custom template filter for converting YouTube links into embed codes. This also makes it simple to embed videos, as pasting the URL is all that the user needs to do.
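A minimal sketch of such a link-to-embed conversion is shown below. It is not the filter from the pull request: the function name `youtube_embed`, the exact iframe markup and the assumption that video IDs are 11 characters long are all illustrative.

```python
import re

# Matches the two common YouTube link forms and captures the video ID.
# The 11-character ID length is an assumption of this sketch.
YOUTUBE_LINK = re.compile(
    r"https?://(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)([\w-]{11})"
)

EMBED = (
    '<iframe src="https://www.youtube.com/embed/{0}" '
    'frameborder="0" allowfullscreen></iframe>'
)


def youtube_embed(html):
    """Replace every YouTube link in `html` with an embed iframe."""
    return YOUTUBE_LINK.sub(lambda match: EMBED.format(match.group(1)), html)


print(youtube_embed("Watch https://youtu.be/abcdefghijk today"))
```

Because only the video ID is copied into a fixed template, editors cannot smuggle in an iframe pointing at any other source. In a Django project a function like this would typically be registered as a template filter.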

#### News Posts:

News posts are simple models for storing news with post dates. The most recent news is displayed on the home page.

#### Publications:

Publications are Journal/Conference/Book Chapter publications or any other literature about Dipy that people may want to know about or cite.

One important piece of information about publications that people often look for is the bibtex. Also, the publication information can be extracted from the bibtex alone. So instead of entering all details of a publication in a form, the content editor can enter just the bibtex and the model is populated automatically by parsing it. For parsing bibtex, instead of writing a new parser from scratch, I have used the bibtexparser library.

Another property of the publications is that they can be marked as highlighted and the highlighted publications can be displayed separately in the home page or anywhere else.

### Some more thoughts on the GSOC journey so far

I just love it :) Especially because of my mentors. We have weekly video call meetings where a very specific set of targets is set for the coming week. This helps me maintain my focus in a particular direction. There is also a regular exchange of mails throughout the week. They also go through my code every week and give feedback, so I am always trying to be very careful and make fewer mistakes :P

They inspire me to work harder.

### Prayash Mohapatra (Tryton)

#### Doing CSV Import

So this is the view I am trying to replicate on Sao.

As per the plan, this week’s goals were to get familiar with the source code of Sao and Tryton in order to figure out how to make the views and plug in the code for import/export functionality.

This would be the only way to understand how Sao works, as there isn’t much documentation on it. However, doc.tryton.org provides some basic insight into what some of the things are, e.g. Window and Form views.

I will be using http://github.com/prayashm/sao to get my work reviewed before submitting it to rietveld codereview.

About making the views, I read through the code for the same views in the tryton client and added the corresponding elements into sao, which uses bootstrap. I am a bit stuck at making the tree views work for the All Fields section and CSV Parameters section. Looking into it.

While writing this I realise I might not be the best choice for a blogger :)  Have to try harder next time. If you happen to drop by this post, do comment on what you wanna read. Till then, happy coding.

### Upendra Kumar (Core Python)

#### Page Layouts and Application Flow of Logic

We have prepared some of the design layouts and application flow of logic of the ‘pip’ GUI. Some of the page layouts designed for the ‘pip’ GUI are :

1. Welcome Page

Page Layout : Welcome Page

2. Configuration Page

Page Layout : Configure Environment

3. Search and Install Page

Page Layout : ‘pip’ Search and Install

4. Install from requirements file

Page Layout : Install From Requirement File

5. Install From Local Archive

Page Layout : Install From Local Archive

6. List, Update and Uninstall Package

Page Layout : ‘pip’ List, Update and Uninstall

Now, workflow for application :

Application Flow of Logic

#### 1st Design Iteration of ‘pip’ GUI

This is my first iteration of ‘pip’ GUI project. It is based on the design process explained in my last blog post. I am re-writing the steps needed for the design of an application :

• Define the product
• Define the target users
• Define the user goals
• Define the top user tasks : Focus on top six tasks
• Define the user’s context : Know user’s context
• Explain each top task to a friend
• Map your natural explanation into a page flow
• Design each page
• Simplify and optimize the task flow and pages
• Review the communication
• Verify that purpose of each page and control is clear
• Review the results

In this design iteration I have tried to follow those steps which are relevant to my application :

# Product:

The product developed will be a GUI for pip, making the various functionalities of pip (a command line tool for installing Python packages) easily accessible to users. The main motivations behind a GUI version of the Python package manager are:

• Make the various functionalities provided by pip easily accessible to Windows/Linux/Mac users
• Help people focus only on the task of installing Python packages rather than getting into the trouble of configuring various paths, versions and configurations

# Target Users:

1. Windows users (who have difficulty using the command prompt)
2. On Linux or Mac OS, people who have little or no experience with Python packages
3. Users who maintain multiple versions of Python on their system
4. Users who need to manage Python packages in virtualenv (last priority)

# User Goals:

1. Avoid the command line to manage (i.e. install/update/uninstall) Python packages
2. Manage packages for different versions of Python installed on the system

# Top User Tasks:

1. Search, select version and install package (including dependencies: first tell the user the total download size including dependencies)
2. Check for new updates and install them.
3. Uninstall a package
4. Support different installation methods:
   1. Requirements.txt files
   2. .whl files
   3. From local archives
5. Manage packages in virtual environment
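One way a GUI could serve the goal of managing packages for several Python versions is to run pip as a module under the interpreter the user selected. The sketch below only builds the command line and is an assumption about a possible design, not code from the project. The helper name `pip_command` and the interpreter path are hypothetical.

```python
import sys


def pip_command(action, package=None, python=sys.executable):
    """Build the argv list for running pip under a chosen interpreter."""
    if action not in ("install", "uninstall", "list"):
        raise ValueError("unsupported action: " + action)
    command = [python, "-m", "pip", action]
    if action == "uninstall":
        command.append("--yes")  # skip pip's interactive confirmation
    if package is not None:
        command.append(package)
    return command


# The interpreter path below is a made-up example.
print(pip_command("install", "requests", python="/usr/bin/python3"))
# prints: ['/usr/bin/python3', '-m', 'pip', 'install', 'requests']
```

The resulting list can be handed to `subprocess.check_output`, so the GUI never needs the user to open a terminal.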

My next blog post provides a peek of GUI layout and application flow of logic.

### Aakash Rajpal (italian mars society)

#### Leap Motion Working

After wasting days trying to generate a Python 3.5 wrapper for the Leap API, I decided to make a socket connection between the Python 2.7 script that gets tracking data from the Leap Motion and the Blender Game Engine. So I started to work on the script to recognize the gestures and everything, and was able to successfully send the tracking data of the Leap Motion Controller to the BGE through the socket connection.

Next, I had to make sure that my Habitat Monitoring Server works as well, and thus I configured my Blender client to get the data from the Leap Motion and use that to get data from the HMC server.

Woohoo it worked!! I made a small demo video on YouTube and pushed my code to GitHub!!
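The idea of bridging the two interpreters with a socket can be sketched as follows. This is not the project's code: the function names, the JSON-lines message format and the frame fields are assumptions made for illustration, and both ends run in one Python 3 process here so the example is self-contained. On the Python 2.7 side the connection would have to be closed explicitly rather than with a `with` block.

```python
import json
import socket
import threading

HOST = "127.0.0.1"  # assumed: both processes run on the same machine


def send_frame(port, frame):
    """Sender side (the Leap script): push one frame as a JSON line."""
    with socket.create_connection((HOST, port)) as conn:
        conn.sendall(json.dumps(frame).encode("utf-8") + b"\n")


def receive_frame(server):
    """Receiver side (the Blender script): read one JSON line."""
    conn, _ = server.accept()
    with conn:
        data = b""
        while not data.endswith(b"\n"):
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    return json.loads(data.decode("utf-8"))


# Demo: let the OS pick a free port, then send one made-up frame.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, 0))
server.listen(1)
port = server.getsockname()[1]

frame = {"gesture": "swipe", "palm": [12.5, 180.0, -3.2]}
sender = threading.Thread(target=send_frame, args=(port, frame))
sender.start()
received = receive_frame(server)
sender.join()
server.close()
print(received == frame)  # prints: True
```

Sending plain JSON text keeps the two sides independent of each other's Python version, which is the whole point of the workaround.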

### aleks_ (Statsmodels)

#### Almost 3 weeks over!

It's almost 3 weeks since the coding phase of the GSoC has started.
In this time I continued with the work of the pre-coding phase, e.g.

• getting more familiar with the statsmodels code base as well as reviewing necessary Python modules like numpy and pandas,
• learning more about git in order to prevent issues linked to the rebase process,
• or studying the theory of VECMs.

Of course I have also started with actual coding. I am currently implementing the estimation of VECMs - more precisely the estimation of the parameter matrices (see e.g. chapter 7 in Lütkepohl (2005) for more details).

One highlight not related to the GSoC has been joining a meetup of PyGRAZ, where a Django-centered coding dojo took place. Though not statistics-related, I found the meetup useful (e.g. because of hints regarding coding style) - and fun :)

### Aakash Rajpal (italian mars society)

#### The Leap Motion Conundrum

Well, the first part of my proposal involved integrating the Leap Motion Controller and the Blender Game Engine (BGE). I thought this wouldn’t be so hard, as Leap Motion has a Python API and so does Blender. So my task was simple: integrate the Blender Python API (bpy) with the Leap Python API. Right?

No way. As I started to work on my integration I ran into a small problem (large as it seems now). It turns out the official Leap Python API supports only Python 2.7, while bpy runs on Python 3.5. Well, that was a concern. I was able to obtain the Leap data from a Python 2.7 script quite easily. All gestures were detected fine and I thought all I had to do was port the 2.7 script to 3.5 in Blender. But no, it wasn’t that easy. After porting the script and running it in Blender I came across all sorts of errors related to the Leap API.

It just wasn’t working.

I searched the Leap community for answers and thankfully I found one about making a Python 3.3 wrapper for the Leap API using SWIG. So I looked up the solution, and it turns out they have no exact steps for generating a wrapper on Ubuntu. Well, technically they had a link to another page for Ubuntu, but that wasn’t opening at all. For Windows and Mac OS X it was all there. Unfortunately not for Ubuntu 😦

So I kept on searching for a solution, and whenever I thought I had found one I ran into the same problem: they all linked to the same web page, which wouldn’t open at all.

So I decided to try to generate the wrapper myself, learning about SWIG along the way. Well, I tried and tried and tried, yet couldn’t fix the problem. One evening, I even thought that my project was dead.

But then another solution came to my mind: “A socket connection between Python 2.7 and bpy”. I was already able to get all the Leap Motion tracking data through my 2.7 script, so I would send it to the BGE through a socket connection. I consulted my mentors, they said it sounded good, and now I am up and running again.

# Current Results

This week I moved beyond my simple preliminary code to something that actually works. I started testing the code, cleaned it up and added comments. The results are promising and the debiasing procedure is currently beating out a naive averaging procedure. There is still one bug, though, that I am hoping to get figured out this coming week.

# To Do

In terms of what I want to get done this week, the first goal is to get the testing code off the ground. Beyond that there are several modifications that still need to be made:

• Currently fit_distributed takes a generator to partition the model for the estimation. If the generator is None then fit_distributed will use one that is provided internally. The idea behind this is to allow for more complicated partition schemes that may incorporate out-of-memory data or possibly even parallel computing. The issue is how best to handle a generator that also maintains the elements of the model object. Currently, I’ve implemented a very naive approach where the generator produces a model for each partition that essentially copies all the attributes of the main model. This has issues, though, because there are a number of attributes that need to be partitioned, for instance with the GLMs.

• Also, currently, everything is built on the elastic_net code; however, to perform the actual regularized model fitting, fit_regularized is still called within fit_distributed. The way it works is this: fit_regularized takes an argument, distributed, which, if True, will perform the distributed estimation by making a call to fit_distributed. Within fit_distributed, fit_regularized, with the distributed estimation turned off, is called on each data partition, as well as to generate the \hat{\gamma} values used to approximate the inverse covariance matrix. However, since this requires that a model is first formed for each p, it is probably not the most computationally efficient approach and I’d like to move to calling the elastic_net code directly.

• Finally, I’ve done essentially nothing to integrate my code into the duration code. I think this should be fairly doable but I wanted to have everything else working first.
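To make the partitioning idea in the first item concrete, here is a small sketch of a generator that splits the data into contiguous row blocks, together with the naive averaging baseline mentioned under Current Results. The names `partition_rows` and `average_estimates` are hypothetical and plain lists stand in for arrays; this is not the interface of fit_distributed.

```python
def partition_rows(endog, exog, n_partitions):
    """Yield (endog_part, exog_part) for each contiguous block of rows."""
    n_rows = len(endog)
    # Spread the remainder over the first partitions so that block
    # sizes differ by at most one row.
    base, extra = divmod(n_rows, n_partitions)
    start = 0
    for i in range(n_partitions):
        stop = start + base + (1 if i < extra else 0)
        yield endog[start:stop], exog[start:stop]
        start = stop


def average_estimates(estimates):
    """Naive combination: average the per-partition parameter vectors."""
    n = len(estimates)
    return [sum(column) / n for column in zip(*estimates)]


endog = [1.0, 2.0, 3.0, 4.0, 5.0]
exog = [[1, 0], [1, 1], [1, 2], [1, 3], [1, 4]]
parts = list(partition_rows(endog, exog, 2))
print([len(y) for y, _ in parts])                    # prints: [3, 2]
print(average_estimates([[1.0, 2.0], [3.0, 4.0]]))   # prints: [2.0, 3.0]
```

Because the generator only yields slices, a different generator could just as well read each block from disk, which is the kind of out-of-memory scheme the first item has in mind.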

### jbm950 (PyDy)

#### GSoC Week 3

Today is Friday the 10th and thus marks the end of the third week of Google Summer of Code. This week started off with continuing work on making test code and a base class for KanesMethod and LagrangesMethod. The work took a turn early in the week when I started working on an example problem that would use the base class instead of working on the base class and test code itself. This resulted in more reading and studying of code examples. This week I also had the opportunity to review multiple shorter PR’s in addition to a longer one that dealt directly with code in KanesMethod.

At the very beginning of this week I migrated all property attributes from KanesMethod and LagrangesMethod into EOM as a base class. This work shows up in PR #11182 which was originally meant to just be a discussion on the API for the base class. It was suggested to me to stop working on the actual implementation at this point and work on planning out a more complete API.

In order to come up with a more complete plan for the API for EOM I first had to get a better understanding of what was done with the equations of motion after the formation step. To do this I looked over pydy-tutorial-human-standing, Planar Pendulum Example, Double Pendulum Example and PyDy Mass Spring Damper Example. After the equations of motion are formed, the most common use for them was time simulation using ODE integration (or DAE integration), so the next thing I did was look into the documentation for the integrators (scipy.integrate.odeint and scikit.odes.dae). With this information I was able to begin work on an example problem for the PyDy repository that would make use of this new class.

I found that PyDy’s System class performed the work of rearranging the equations of motion into a form accepted by the integrators, so the main focus of the base class is to have the attributes that System expects. After analyzing System’s internals I went ahead and created a basic example of a base class and submitted the code in PR #353 in the PyDy repository. The PR shows manual entry of the equations of motion for a system with two masses, springs and dampers and will be used for further discussion of the API.

This week I reviewed two shorter PRs and one longer PR. The shorter PRs were PR #10698 and PR #10693 and covered sympy documentation. The first requested removing a docstring of one of the modules because the module had detailed documentation online. I suggested that this would be a negative change to sympy overall; others in sympy came to the same conclusion and it was promptly closed. The other documentation PR had a couple of spelling fixes but also had some negative or non-useful changes, and is currently waiting for the author to remedy the latter. The longer PR this week was a continuation of the Kanes Method PR #11183 Jason started last week, and is an effort to improve the readability and overall cleanliness of KanesMethod.
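For orientation, "a form accepted by the integrators" means a function that returns the time derivative of the state vector, which is the calling convention `func(y, t, ...)` used by scipy.integrate.odeint. The sketch below writes such a function by hand for two masses with springs and dampers. It is not the code from PR #353: the chain layout (wall, then mass 1, then mass 2) and all parameter values are assumptions.

```python
def rhs(state, t, m1=1.0, m2=1.0, k1=10.0, k2=10.0, c1=0.5, c2=0.5):
    """Time derivative of [x1, x2, v1, v2] for a two-mass chain.

    Assumed layout: wall -(k1, c1)- mass 1 -(k2, c2)- mass 2.
    """
    x1, x2, v1, v2 = state
    # Force in the second spring/damper pair, which couples the masses.
    coupling = k2 * (x2 - x1) + c2 * (v2 - v1)
    a1 = (-k1 * x1 - c1 * v1 + coupling) / m1
    a2 = -coupling / m2
    return [v1, v2, a1, a2]


# At rest in the unstretched position nothing moves.
print(rhs([0.0, 0.0, 0.0, 0.0], 0.0))
# Displacing mass 2 pulls mass 1 forward and mass 2 back.
print(rhs([0.0, 0.1, 0.0, 0.0], 0.0))
```

A function with this signature could be passed directly to scipy.integrate.odeint together with an initial state and an array of time points.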

### Future Directions

The current plan for the next steps of the project is to come up with a few more examples of the base class in use. Once the example code has laid out an API I will begin implementing the base class itself.

### PR’s and Issues Referenced in Post

• (Open) Improved the explanation of the 5 equations in the Kane’s Method docs PR #11183
• (Open) kane.old_linearizer Issue #11199
• (Open) EOMBase class migration of property attributes PR #11182
• (Open) Created a basis on which to discuss EOM class PR #353
• (Closed) Remove documented redundant comments PR #10698
• (Open) Docmentation comments corrections PR #10693

## June 09, 2016

### Avishkar Gupta (ScrapingHub)

#### Rewriting Scrapy Signals, Part I

In this report, I shall present a brief summary of the work done up to this point. To reiterate, the end goal of the project is to improve the efficiency of Scrapy’s signaling API. For this purpose the idea was to use Django’s signaling mechanism, which claims to make the signals 90% faster. Consequently, according to plan, the first step was to port django.dispatch, modify it to our needs and create scrapy.dispatch. About a week and a half was spent on that front, first understanding the way that library is written and what changes would be needed to make it work with Scrapy, and then actually making those changes. django.dispatch refactored much of the code of PyDispatcher into a Signal class. It also introduced a caching mechanism for receivers and introduced the weakref module into the code.

The send_robust method was used as a starting point for the send_robust_deferred call, to incorporate methods returning deferred calls. This was required for rewriting the scrapy.utils.signal module, which has the send_catch_log and send_catch_log_deferred methods. These methods are relatively inefficient and tend to bottleneck the scraping process, and were one of the reasons behind the rewrite of signals. The next step was to redefine the core signals at the heart of Scrapy as instances of scrapy.Signal instead of the generic object class. The signalManager class also needed to be changed; however, its API was kept consistent for backward compatibility.
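The two ideas mentioned above, holding receivers through weak references and a send_robust-style call that collects exceptions instead of propagating them, can be illustrated with a deliberately tiny class. `MiniSignal` is a hypothetical stand-in written for this sketch. It is neither django.dispatch.Signal nor the scrapy.dispatch code, and it leaves out caching, sender filtering and deferreds.

```python
import weakref


class MiniSignal(object):
    """A tiny illustrative signal; not the Django or Scrapy API."""

    def __init__(self):
        self._receivers = []  # weak references to connected callables

    def connect(self, receiver):
        # Bound methods need WeakMethod; plain functions can use ref.
        if hasattr(receiver, "__self__"):
            ref = weakref.WeakMethod(receiver)
        else:
            ref = weakref.ref(receiver)
        self._receivers.append(ref)

    def _live_receivers(self):
        # Resolve each reference once and forget the dead ones.
        resolved = [(ref, ref()) for ref in self._receivers]
        self._receivers = [ref for ref, target in resolved if target is not None]
        return [target for _, target in resolved if target is not None]

    def send_robust(self, sender, **named):
        """Call every receiver; return (receiver, result-or-exception)."""
        responses = []
        for receiver in self._live_receivers():
            try:
                response = receiver(sender=sender, **named)
            except Exception as err:
                response = err
            responses.append((receiver, response))
        return responses


def on_item(sender, **kwargs):
    return kwargs["item"].upper()


def broken(sender, **kwargs):
    raise ValueError("boom")


item_scraped = MiniSignal()
item_scraped.connect(on_item)
item_scraped.connect(broken)
results = item_scraped.send_robust(sender=None, item="page")
print([type(response).__name__ for _, response in results])
# prints: ['str', 'ValueError']
```

The failing receiver does not stop the others from running, which is the behaviour a crawler wants when one extension misbehaves.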

You can track the progress and check out how the project is coming along here and here although I must admit I do not push frequently and most commits are either local or small test pieces written outside the mainline code. I shall however, be pushing a working prototype by the end of Monday so that’s there to look out for.

### chrisittner (pgmpy)

#### Bayesian Parameter Estimation for BNs

Now that ML Parameter Estimation works well, I’ve turned to Bayesian Parameter Estimation (all for discrete variables).

The Bayesian approach is, in practice, very similar to the ML case. Both involve counting how often each state of the variable occurs in the data, conditional on the parents’ state. I thus factored out a state_count method and put it in the estimators.BaseEstimator class, from which all estimators inherit. MLE is basically done after taking and normalizing the state counts. For Bayesian Estimation (with Dirichlet priors), one additionally specifies so-called pseudo_counts for each variable state that encode prior beliefs. The state counts are then added to these virtual counts before normalization. My next PR will implement estimators.BayesianEstimator to compute the CPD parameters for a BayesianModel, given data.
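The difference between the two estimates comes down to one addition before normalizing, as the following sketch shows for a single variable without parents. The function names here are illustrative and are not the pgmpy API.

```python
from collections import Counter


def state_counts(samples, states):
    """Count how often each state occurs in the data."""
    counts = Counter(samples)
    return [counts[state] for state in states]


def mle(counts):
    """Maximum likelihood: normalize the raw state counts."""
    total = sum(counts)
    return [count / total for count in counts]


def bayesian_estimate(counts, pseudo_counts):
    """Dirichlet prior: add pseudo counts to the data, then normalize."""
    combined = [count + pseudo for count, pseudo in zip(counts, pseudo_counts)]
    total = sum(combined)
    return [value / total for value in combined]


states = ["yes", "no"]
data = ["yes", "yes", "yes", "no"]
counts = state_counts(data, states)

print(mle(counts))                        # prints: [0.75, 0.25]
print(bayesian_estimate(counts, [1, 1]))  # roughly [0.667, 0.333]
```

With one pseudo count per state the estimate is pulled from 3/4 towards 1/2, which is exactly the "prior belief" effect described above. For a variable with parents, the same computation is repeated for each configuration of the parents' states.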

Update: #696.

#### Things I wished I had known when I got serious about programming

This is my first post about actual programming! When I started doing computational research about three years ago, I was lazy and borderline incompetent. Today, the tools I have learned allow me to be equally lazy while being somewhat more competent. These range from simple lifestyle decisions to basic tech skills.

## Tip 1: Install Linux

Do yourself a favor and install Linux. My first install of Linux was done on my laptop, and it turned out that the wifi was broken until I installed new drivers. Figuring out why things broke was a bit of a slog, but I eventually stumbled upon a great demonstration of the beauty of open source software. Incredibly, this Realtek employee wrote drivers for Linux on his own time! Installing these drivers was a bit of a rabbit hole, but I firmly believe that system administration builds character. Also, for what it’s worth, the install on my desktop was painless.

## Tip 2: After installing linux, learn your shell commands

### Navigating the command line


pwd # shows where you are in the filesystem
cd # changes directory to specified path
ls # shows existing files in path
cp # copy specified file to new filename
grep # a tool to search for regular expressions or patterns in files/directories
mkdir # creates a directory with a specified name in current path
chmod # changes permissions for a file
sudo # if given as a prefix to a shell command, runs it with administrator
# privileges, which allows installation into system directories
rm # deletes stuff
which (command here) # shows the path to the binary executable of a command


Internalizing the numerous tools at one’s disposal takes some time. Part of me thinks that I have learned command line tools simply because it makes me feel like Deckard.

The workflow speed-up can be tremendously useful. When it comes to tools like git, the GitHub Graphical User Interfaces (GUIs) available on Windows are awful in comparison, and the command line becomes a necessity.


# change to website directory
# for me this is:
cd github/jdetle.github.io
pwd # press enter
# i am in /home/jdetlefs/github/jdetle.github.io
# trying to copy to /home/jdetlefs/github/jdetle.github.io/images
# when typing a path ‘~’ is a placeholder for ‘/home/username’
# YU is a unique identifier for this file, pressing tab twice will list
# all files with these characters as an identifier (DONT PRESS
# ENTER YET)
(continued) images/‘filename’.gif # press enter
# check that it exists in the appropriate path
ls images
# ‘filename.gif’ should exist in this path!


Initially this process may take longer than dragging and dropping files, but it quickly becomes far faster than using a GUI.

### How I add a newly installed program to my $PATH

echo $PATH
export PATH=$PATH:/path/to/my/program
echo $PATH


### How I work in virtualenvs (kind of complicated)



From here, things can get complicated if you have an installation of Anaconda. Both pip and anaconda are package managers, but when installing itself, anaconda installs pip in its own directory and handles virtual environments on its own. Things can break when using both conda and pip because managing dependencies is ugly and awful and left for people with higher pain tolerances than me. Of course, one can use conda virtualenvs; the differences probably aren’t too significant. This is actually still a problem on my laptop, so I am going to spend this time installing pip without conda and getting virtualenvs to work on my laptop! (6 and a half hours later.) … Okay, so this isn’t pretty. If anaconda3 is installed, it looks like virtualenvwrapper won’t work because virtualenvwrapper only works using python2.7? (Don’t hold me to this.) My solution was to delete anaconda3 altogether. I’ve learned that the brute force solution often works pretty well. (Somewhere in the distance Peter Wang feels a disturbance in the force.)

rm -rf anaconda3 # CAREFUL

Be careful with this!!! It recursively deletes this directory and all files in it and can ruin your OS install.

sudo apt-get install python-pip python-dev python-cython python-numpy g++
sudo pip install virtualenvwrapper
vi ~/.bashrc


Crap! Another thing I have to explain! vi opens Vim, a text editor that keeps things fast and simple. A fresh installation of Ubuntu 14.04 comes with Vi Improved, or vim, which is a superset of vi (a relic of days of yore) but has no preinstalled functionality. To install a working version of Vim that allows for syntax highlighting and easy workflow tools, do the following.


sudo apt-get update # cant hurt
sudo apt-get install vim
# ctrl-shift-n to open a new terminal window
# vi ~/.vimrc
# (press shift+colon) i (press enter) will allow you to start inserting
# copy and paste this  with a mouse
set expandtab
set tabstop=4
set softtabstop=4
set shiftwidth=3
set autoindent
set textwidth=80
set nocompatible
set backspace=2
set smartindent
set number
set cindent
colo torte # preinstalled color schemes in /usr/share/vim/vim74/colors
syntax on
# (press shift+colon) wq (press enter) to write and quit
# x does wq simultaneously


Here is some more info on Vim. Let’s get back to editing our bash shell configuration file (~/.bashrc)


vi ~/.bashrc
# before doing anything add the path
# (shift+colon) i (enter) and line 1 should be
PATH=~/bin:$PATH
# line 2
export PATH
# line 3
source /usr/local/bin/virtualenvwrapper.sh
# (shift+colon) x (enter) and (ctrl+shift+n) to open a new terminal


And there you go! You should have a working installation of the virtualenvwrapper such that you are ready to use virtual environments when making your first pull request on your new Linux system!

### Using pip virtualenvs to work on github projects

Let’s make a pull request for MDAnalysis using the tools we’ve learned!


# I like having a github/ folder for my various repositories
# First, let’s clone into the repo
cd (press enter) # takes us to home user directory
mkdir github
cd github


Before moving further, you should create a github account if you haven’t already and fork MDAnalysis. This will create a clone of the repo that will function as your ‘origin’ repository. MDAnalysis will be the ‘upstream’ repository that we set up later.


git clone https://github.com/YOURUSERNAME/mdanalysis
# this takes a little bit (289 megabytes)
mkvirtualenv MDA
workon MDA
pip install numpy cython seaborn # installs dependencies
pip install -e . # installs MDAnalysis such that changing files
# changes how packages behave when loaded for a script


From here we can start working on establishing a git workflow using branches.


git remote add upstream http://github.com/MDAnalysis/mdanalysis
git branch NEW_PULL_REQUEST
git fetch upstream #checks for updates
git checkout upstream/develop -B develop # creates develop
# branch to rebase against later and switches to it
# there might be a way to do this without checking the branch out
# but I dont know how
git checkout NEW_PULL_REQUEST
# do work on this branch


Any time you want to save the work you’ve done, you can see the files you’ve changed with


git status


Then add them to be staged for a commit that will be merged into the upstream develop branch if the pull request is accepted.


git add file_name_here
# once you’ve added everything you want to include in a PR
git commit -m 'Insert a descriptive commit message here'


If you want to make a tiny commit and blend it into a previous commit:


git rebase -i HEAD~(# of commits back you want to go)


Use vim style interactiveness to rebase commits. Changing ‘pick’ to ‘fixup’ ‘squashes’ a commit into the first pick commit above it without using its commit statement. Using squash will combine the commit statements. When happy, (shift+colon) (ctrl+x) and pressing Y and enter will combine commits. If still unsatisfied you can amend the commit manually.


git commit --amend # edit the commit


When you’re ready to save your work to the origin directory:


git fetch upstream
git checkout develop
# if prompted:
git pull # updates changes made
# if your command prompt makes a recursive merge, you’ve done something wrong
git checkout NEW_PULL_REQUEST
git rebase develop # rebase against develop to avoid merge conflicts
git push origin NEW_PULL_REQUEST


Before actually making a pull request on github, make sure you didn’t break any tests, and that you’ve written new tests for the new code you’ve written.


cd ~/github/mdanalysis/testsuite/
pip install -e .
cd MDAnalysisTests/
./mda_nosetests (press enter)


Hopefully that helps! There is a bevy of more rigorous work that’s been done on understanding git branching. A successful Git branching model is very helpful, reading the github helps too. Atom is a very nice editor with its github integration and hackability. I like to use Jupyter as a script playground for MDAnalysis.

## Tip 3: Get good at googling

This tip from freeCodeCamp is applicable to any problem. Read-search-ask is a strategy that will help you learn independently and boost confidence. Adding on to this advice, I have found that if you find the email of someone knowledgeable in the area you are struggling in, simply by writing an email explaining your problem you can often find the solution on your own.
If you don’t figure it out, then you might just impress that person with your detailed investigation. Even if they aren’t impressed, they’ll likely help you out. People in open source are generally receptive to people who demonstrate that they are working hard at becoming self-reliant. Always err on the side of not sending that email though; nobody likes being harassed with trivial questions.

## Tip 4: When working, avoid distractions, double check, triple check, quadruple check…

When working on projects involving non-commercial software it is especially important to think of all the possible ways you could have screwed something up. Check your code for glaring logic errors and, before running an intensive calculation, run a baseline to ensure that things work. In quantum chemistry, an example of this would be running a Hartree-Fock calculation with the STO-3G basis set before doing something that scales much slower. Develop scripts to ensure you are getting expected results, and become skilled at using grep and simple regex. (Regexr is a great playground to learn regex.) Assume that you’ve written bad code and that bugs will be caused by small changes to input parameters. Expect things to break easily. Inspect all work exhaustively.

When reading academic papers, print them out and read them away from a PC. Usually academic papers use wildly esoteric jargon. This paper on diffusion maps (the subject of my next blog post) actually features a ‘jargon box’ which is just great. Academic papers usually also assume a high level of familiarity with the subject material and are written for those who are skilled at reading papers. It is easier to dedicate the intense concentration required for most papers when unplugging from tech and using some ear plugs.

Finally, when communicating over email you can embrace one of two strategies.
Either you can add a ‘sent from my iPhone’ tag to everything, or before adding recipients, take a second to go get a drink of water, come back and reread the message for errors. Unfortunately, people will judge you for poor grammar even if they don’t mean to. (Shoot, I just ended a sentence in a preposition…)

## Tip 5: Tackle what intimidates you

I seriously believe that this is the number one part of becoming an adult and it is something I have only really internalized in the last year. Problems will not go away by avoiding them. Oftentimes I find myself building up things in my head as if they will be a bigger deal than they actually are. Figuring out how to use virtualenvs was one example of such a barrier that occurred recently. This occurs in my personal life as well and invariably the outcome is better than how I imagined it would be.

Having trouble getting started on a project? Unfortunately Shia isn’t much help here. Segment your work into discrete chunks. If you have a pull request you want to make, think of all the possible minutiae you have to work through in order to get things done. I like to use Google Inbox’s reminder feature to constantly remind myself of the things I need to get done. When I finish a task, I can swipe it off my todo list and enjoy that feeling of catharsis.

If you are a budding programmer, take an algorithms class for free here. If you still aren’t busy enough, read the MDAnalysis Guide for Developers and start learning with help from a tight-knit community of open source contributors.

### mkatsimpris (MyHDL)

#### Created separate tests and improved code quality

As Chris pointed out, I split the original test unit into two separate tests. The first test checks the color conversion module with the myhdl simulator, while the second test checks the outputs of the converted testbench in Verilog and VHDL against the outputs of the myhdl simulator.
Moreover, I improved the code quality of rgb2ycbcr.py in order to increase the health of the module as reported by landscape.io.

### Ravi Jain (MyHDL)

#### GSoC: Management Block – Half Work done!

It’s been seven days since my last blog update. Time flies! One thing I wish I hadn’t done these days was procrastinate on blogging (my org requires me to blog 3-4 times a week). Other than that, this week has been the best one yet. It certainly gets much easier once you start off.

In my last blog I said I was still figuring out the proper way to use FIFOs with the GEMAC core and the interfaces. While going through the Xilinx User Guide 144 document, I found that they had documented the interfaces of the FIFO too. So the problem of figuring out the interfaces was solved easily.

Next, I mentioned I would be developing some tests for the GEMAC core as a whole. Well, I couldn’t do that because I had never written tests before, and I didn’t really have the experience to identify the transactors (structure) of the package before actually implementing it. So I decided to move on, pick a module or block, and implement it. (‘Block’ is the term used in MyHDL, as the term ‘module’ clashes with the Python term of a different meaning; I will use ‘block’ for the rest of my blog series to refer to a hardware module.) I chose the management block to start with, as opposed to the Transmit Engine mentioned in my project proposal, because I thought it would be better to know the configuration registers on which the other blocks base their behaviour.

Again, I couldn’t start with the tests (even though my org suggests a test-driven approach) for the same reason as above. So instead I started off by implementing the features of the block directly. The first feature I added was reading/writing the configuration registers (basically a list of signals) present in the block.
Then I quickly moved on to another feature, converting host transactions into MDIO MII transactions, which is, roughly speaking, converting parallel data into serial data (write operation) and vice versa (read operation). Once I had completed the first version, it was pretty clumsy and nowhere close to what the people at the org desired. In my defence, it was my first time developing a real block rather than just writing some silly examples here and there. At this point I made a pull request to the main repository, knowing that the block wasn’t complete yet, desperately in need of reviews from my mentors. Beware: until this point I had done no simulation or testing to check whether my code was correct.

Well, the reviews came in, and I started adding tests, one by one, that simulated the features using the new API presented in MEP-114. One small change from the API described there concerns ‘inst.config_sim(trace=True)’: note that there is no ‘backend’ parameter for config_sim, unlike what the MEP (MyHDL Enhancement Proposal) mentions. I used the Impulse plugin in Eclipse to view the traced signals produced by the simulation. I added tests one by one and followed something like a test-first approach after that, i.e., adding a test and tweaking the code until the test passes. The tests added for MyHDL simulation so far cover reading/writing the configuration registers, MDIO clock generation (generated using host_clock), and the MDIO write operation.

This led to what I would call the second version of the implementation, which was still far from the current version: quite complicated, but passing the tests. I learned how to use TristateSignal and decided to leave implementing it until the top-level block. Next came the test that checked the convertibility of the code to other HDLs, which taught me quite a few things that helped me relate the code written to the actual hardware implementation.
It led me to optimise my code a lot and trim down unnecessary stuff, producing the current version of the code, which is convertible.

Things I learnt while making the code convertible:

• You cannot use yield statements in the processes.
• You can have a class with Signals as attributes for local Signals.
• You cannot use class functions in the procedural code.
• In local functions, all the Signals used should be passed as arguments, regardless of scope. The classes with Signal attributes mentioned above cannot be passed as an argument.
• Signals can be driven from only one process, i.e., you cannot perform ‘sig.next = <some value>’ in two different processes.

Now I shall refine and add some more tests over the next two days, and after that work on cosimulation of the block.

### Raffael_T (PyPy)

#### The first two weeks working on PyPy

I could easily summarize my first two official coding weeks by saying: some things worked out while others didn’t! But let’s get into more detail.

The first visible progress!

I am happy to say that matrix multiplication works! That was pretty much the warm-up task, as it really is only an operator which no built-in Python type implements yet. It is connected to a magic method, __matmul__(), though, which allows it to be implemented. For those who don't know, magic methods get invoked in the background while executing specific statements. In my case, using the @ operator will invoke __matmul__() (or __rmatmul__() for the reverse invocation), and @= will invoke __imatmul__(). If “0 @ x” gets called, for example, PyPy will now try to invoke “0.__matmul__(x)”. If that doesn't exist for int, it automatically tries to invoke “x.__rmatmul__(0)”. As an example, the following code already works:

```python
>>> class A:
...     def __init__(self, val):
...         self.value = val
...     def __matmul__(self, other):
...         return self.value + other
...
>>> x = A(5)
>>> x @ 2
7
```

The next thing that would be really cool is a working numpy implementation for PyPy to make good use of this operator. But that is not part of my proposal, so it's probably better to focus on that later.

The extended unpacking feature is what I am currently working on. Several things have to be changed in order for this to work. Parameters of a function are divided into formal (or positional) arguments, as well as non-keyworded (*args) and keyworded (**kwargs) variable-length arguments. I have to make sure that the last two can be used any number of times in function calls. Until now, they were only allowed once per call. PyPy processes arguments as args, keywords, starargs (*) and kwargs (**) individually. The solution is to compute all arguments only as args and keywords, checking all types inside of them manually instead of relying on a fixed order. I'm currently in the middle of implementing that part while fixing a lot of dependencies I overlooked. I broke a lot of test cases with my changes to an important method, but that should work again soon. The task requires the bytecode instruction BUILD_SET_UNPACK to be implemented. That should already work, though it still needs a few tests. I also need to get the AST generator working with the new syntax feature. I updated the grammar, but I still get errors when trying to create the AST; I already have some clues on that, though.

Some thoughts on my progress so far

The cool thing is, whenever I get stuck on errors or the giant structure of PyPy, the community helps out. That is very considerate as well as convenient, meaning the only thing I have to fight at the moment is time. Since I have to prepare for the final commitments and tests at university, I lose some time I really wanted to spend on developing for my GSoC proposal. That means that I will probably fall a week behind what I had planned to accomplish by now.
But I am still really calm regarding my progress, because I also planned for something like that to happen. That is why I will spend far more time on my proposal right after that stressful phase, to compensate for the time I lost. With all that being said, I hope that the extended unpacking feature will be ready to use in the coming one or two weeks.

### Ravi Jain (MyHDL)

#### Get your own badge! (Part-1)

Recently, Week 2 of GSoC 2016 concluded with my mentor asking me to set up the main repo with integration tools: travis-ci, landscape.io (linting), coveralls (code test coverage) and readthedocs, and to add badges for each of them. My first response was, what are badges? They are markups provided by the integration tools showing the corresponding score or stats, meant to be added to the Readme.md file of your repository. The rest of this blog is about how to get one for your own repo.

Travis-CI: Travis-CI is an integration tool which tests and deploys your code every time you make a commit, for all your branches and pull requests as well. The tests are run according to the script you provide in .travis.yml under your branch. Setting it up is quite easy. Head to the website travis-ci.org. Sign up using your GitHub account and follow the instructions to link your repo. Remember to switch on your main repo in your Travis account. You can modify the switches for your different repos anytime by accessing your profile at travis-ci.org. Straightforward guides for writing the script file .travis.yml in almost all languages are provided by travis-ci.org here.

After the Travis-CI integration is done, sign in to your travis-ci account. Select your main repo from the left panel. You can see the badge next to your repo’s name. Click on the badge. Select your main branch and ‘Markdown’. Copy the link provided into your Readme.md file to get the badge on your main account.
Landscape (Linting): After every code push, Landscape runs checks against your code to look for errors, code smells and deviations from stylistic conventions. It finds potential problems before they’re problems, to help you decide what and when to refactor. To help you set up, everything is documented here. Add the repository you want to perform checks on. After setting up, go to your dashboard by logging into your landscape.io account and click on your repository. The badge will show up in the top right corner. Click on it and copy the Markdown link into the Readme.md file of your repo to get the badge on GitHub.

Part – 2, covering the setup of coveralls and readthedocs, will be posted soon!

### mike1808 (ScrapingHub)

#### GSOC 2016 #1: with_timeout

# Preamble

The end of May and the beginning of June is a tough time for me. I’ve passed my final exam for my Master’s degree and, finally, I am an M.S. in Computer Science. Now I’m preparing for the Ph.D. exams. Meanwhile, I’m working on Splash during this summer. I’ve got something to tell you.

# with_timeout

The first thing that I should do is the splash:with_timeout API. It allows you to wrap any function and let it run only for a specified amount of time in Splash Lua scripting. Originally, I was going to add timeout functionality to only one API, splash:go, but after a discussion with my mentor, we decided to create a more general API.

## Implementation

There are two possible ways to implement this API:

1. Using Lua.
2. Using Python.

Both have their own advantages and disadvantages. The implementation with Lua is simpler. It requires the existing splash:call_later and splash:wait APIs, and also some kind of polling (an infinite loop) to check whether the running callback has finished its work or not. Polling isn’t the best solution, because it requires the CPU to do a lot of unnecessary work. On the other hand, the Python implementation can know when the callback has finished its execution and notify the main event loop.
Also, it’s more agile and configurable. So, we decided to write this API using Python.

## Callbacks

The first thing that you should think of is the callback execution. In the current Splash version there are some API functions that take a callback as an argument. These callbacks are executed as coroutines. They are created using the Splash#get_coroutine_run_func method. Earlier, there wasn’t any need to stop the execution of a running coroutine. However, the main idea of splash:with_timeout is the ability to run a function only for a specified amount of time.

The first solution was to just ignore the success and error callbacks from the running function. The idea is simple but not correct. Consider the following example of a Lua script:

```lua
function main(splash)
  local ok, result = splash:with_timeout(1, function()
    splash:wait(2)
    assert(splash:go("https://google.com"))
  end)

  splash:go("https://www.python.org")
  splash:wait(3)

  return splash:url()
end
```

The first argument of splash:with_timeout is the number of seconds you want to wait, and the second one is your callback. As you can see, we set the timeout to 1 second, and in the callback we wait for 2 seconds and then try to go to https://google.com. After that we navigate to https://www.python.org, then wait for 3 seconds and return the current URL. The resulting URL, obviously, should be https://www.python.org, because the callback of splash:with_timeout would exceed its timeout. However, the resulting URL will be https://google.com. The reason is that we didn’t stop the callback when 1 second had elapsed.

So, I implemented the coroutine stop functionality. I added a new method to BaseScriptRunner, BaseScriptRunner#stop. It sets the flag self._is_stopped to True, and during the coroutine execution that flag is checked: if it’s True, a StopIteration exception is raised and the coroutine stops its execution.
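A minimal sketch of the stop-flag idea in plain Python, assuming a generator stands in for the coroutine. StoppableRunner, step and callback are hypothetical names for illustration, not Splash's actual classes, and this sketch reports a stop by returning False where the post describes raising StopIteration:

```python
class StoppableRunner:
    """Hypothetical stand-in for a script runner; not Splash's real class."""

    def __init__(self, coroutine):
        self._coroutine = coroutine  # a plain Python generator
        self._is_stopped = False
        self.steps = []              # records the steps that actually ran

    def stop(self):
        """Ask the runner to abandon the coroutine (e.g. on timeout)."""
        self._is_stopped = True

    def step(self):
        """Advance one step; return False once finished or stopped."""
        if self._is_stopped:
            self._coroutine.close()  # drop the rest of the callback
            return False
        try:
            self.steps.append(next(self._coroutine))
        except StopIteration:
            return False
        return True


def callback():
    # Each yield marks a point where the event loop regains control.
    yield "wait"
    yield "go https://google.com"


runner = StoppableRunner(callback())
runner.step()                   # the "wait" step runs
runner.stop()                   # the timeout fires here
still_running = runner.step()   # the "go" step never runs
```

Because the flag is checked before every step, the navigation to https://google.com in the earlier Lua example would never start once the timeout has fired.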
## Error handling

There was an interesting conversation with my mentor about how I’d handle errors from the callback of splash:with_timeout. There are two ways to handle errors and exceptions in Splash Lua scripts:

1. Return a flag ok, which tells whether the operation was successful or not, and result, which contains the result of the operation or the reason for the failure.
2. Return result, which contains the result of the operation, and raise an exception using error(...) if the operation failed.

In Lua, exceptions are thrown only if a user did something wrong (e.g. passed wrong arguments), so we’ve chosen the first solution, because the timeout of a callback is related to the API implementation rather than to the user.

## Results

You can see my work in PR#465.

# Further plans

This week I’m finishing all my exams, and then I can start spending more time working on GSoC.

Thank you for reading. See you next time :wink:

### Anish Shah (Core Python)

#### GSoC'16: Week 2

This blog post is about my second week in Google Summer of Code 2016. :)

## Converting patches to GitHub pull request

If a contributor wants to fix an issue, then (s)he has to submit a patch for it. This patch is a file generated using a Mercurial command: hg diff > issue-fix.patch. The patch is downloaded by core developers, who then review it and run tests to check that everything is working properly. There’s a lot of manual work involved here. Now my next task is to convert this patch to a GitHub pull request to automate everything. There are a lot of problems that I came across for this task that we still need to work on.

### 1. Architecture to solve the problem

In my GSoC proposal, I had mentioned how I would solve this particular problem: using a Git binary on the b.p.o server and calling Git commands from Python using subprocess. But my mentor wanted to know if this can be done purely using GitHub APIs. GitHub provides a contents API to add and update files.
But it doesn’t provide any API to make changes using a diff. I read through some parts of hub (a CLI to open a PR), but its workflow is the same as git’s: after the new branch is pushed to the remote, hub uses the GitHub API to create a new PR. Since it was not possible to do this purely using the GitHub API, we decided to stick to the idea mentioned in my proposal.

### 2. Which Python version is affected by the issue

Our core developers maintain multiple Python versions (2.7 and 3.x). To create a PR, we need to know the branch it needs to be merged into. The issue page has a Versions field which tells us which Python versions are affected by the issue. It can be wrong sometimes, but at the moment it is the best possible way to know the Python version. b.p.o has an issue table and a version table in its DB which store all this information. I just retrieve the information using the issue id.

### 3. What should we do when an issue affects multiple branches

If a single issue affects multiple Python versions, then we might want to open multiple PRs, one for each Python version. This has not been finalised yet.

### 4. Different patches for different Python versions

Since Python 2.7 and 3.x are very different, there can be different patch files for different versions. Currently, there’s no way to identify which patch file is for which version, so there’s no way of determining the Python version.

The last two problems are still in progress, but they belong more to a later part of this feature. You can find the email thread about the discussion on this here.

Thanks for reading this post. Let me know if you have any questions or how you would solve the above problems. :)

## June 08, 2016

### Yen (scikit-learn)

#### Difference between np.float64 & np.float64_t

If you are a newcomer to Cython just like me, it is likely that you will be confused about when to use np.float64_t and when to use np.float64.
Below, I’ll briefly introduce how these two types are fundamentally different, and generalize this concept to other datatypes as well. Before that, we need to know a bit of Cython.

## cimport in Cython

In Python, we use the import statement to access functions, objects, and classes inside other modules and packages. Cython also fully supports the import statement, so it allows us to access Python objects defined in external Python modules. However, note that if that were the end of the story, Cython modules would not be able to access other Cython modules’ cdef functions, ctypedefs, or structs, and there would be no C-level access to other extension types.

To remedy this, Cython has a cimport statement that provides compile-time access to C-level constructs. It looks for these constructs’ declarations in separate Cython files called definition files, which have a .pxd extension and need to be created by us. In a .pxd file, we can place the declarations of C-level constructs that we wish to share, and only the declarations placed here are cimportable. Also, since the some_file_name.pxd we create has the same base name as the original file some_file_name.pyx, they are treated as one namespace by Cython. Therefore, we need to modify some_file_name.pyx to remove the repeated declarations in it. After we have created the .pxd file and cleaned up the .pyx file, an external implementation file can access the C-level constructs of the .pyx that are declared in the .pxd via the cimport statement.

Let’s take a real-world example to see how cimport works!

## cimport numpy as np

In some files of the well-known machine learning library scikit-learn (such as this one), you can find the following code snippet:

```cython
cimport numpy as np
import numpy as np
```

I remember I was stunned when I saw these lines for the first time: WHAT IS IT? Well, the good news is that we now know the basics of the cimport statement, so we can figure it out step by step.
First, since only the declarations in a .pxd file are cimportable, we have to identify which .pxd file Cython is looking at when executing cimport numpy as np. After a bit of research, we find that a file called __init__.pxd lies in the numpy folder under our Cython installation. An __init__.pxd file makes Cython treat the directory as a package, just like __init__.py does for Python (see here). Therefore, in this case Cython will treat the numpy folder as a package and give us access to NumPy’s C API, declared in the __init__.pxd file, at compile time.

In contrast, import numpy as np only gives us access to NumPy’s Python-level API, and it occurs at runtime. Note that here we use the same alias (i.e., np) for both of the imported packages, but since Cython handles this ambiguity internally, we do not need to use different names.

## np.float64 vs. np.float64_t

So here comes our main topic: what is the difference between np.float64 and np.float64_t, and which should I use?

#### np.float64

np.float64 is a Python type object defined at the Python level to represent 64-bit float data, and it has common attributes such as __name__ that most other Python objects have too. You can use the following code to verify it:

```python
import numpy as np

print(type(np.float64))
print(np.float64)
print(np.float64.__name__)
```

#### np.float64_t

In __init__.pxd, you can find the following lines:

```cython
ctypedef double npy_float64
ctypedef npy_float64 float64_t
```

So it is clear that np.float64_t represents the C type double, and it is nothing like a Python object. Therefore, if you try to print np.float64_t in a .pyx file, the compiler will report the following error at compile time:

'float64_t' is not a constant, variable or function identifier

#### Which to use?
Let’s take another simple example to illustrate when to use each of these two types:

```cython
import numpy as np
cimport numpy as np

def test():
    # 1
    cdef np.ndarray[np.float64_t, ndim=1] array
    # 2
    array = np.empty(10, dtype=np.float64)
    print(array)
```

1. We use np.ndarray to declare the type of the object exposing the buffer interface, and place the C data type for the array elements inside the brackets. So we should make sure we use np.float64_t here to specify the elements’ data type.
2. To initialize the NumPy buffer we just declared, we create an array object at the Python level and assign it to the buffer. In this case we should use np.float64, since we are not declaring a C-typed variable.

Of course, the same concept can be generalized to other data types (e.g., np.int32 vs. np.int32_t, np.int64 vs. np.int64_t, etc.).

## Summary

After working with Cython for a month, I found debugging in Cython both hard and frustrating because the documentation is not really thorough. Consequently, I hope this blog post can save you some effort by clarifying the difference between data types defined in Cython and in Python. In the future, I will also document more of my findings about Cython during my GSoC.

### sahmed95 (dipy)

#### Fitting models using scipy and unit testing

Hi everyone, this is my first blog after the start of the coding period, and the past two weeks have been quite busy and eventful. We now have working code to get model parameters, which has been tested with simulated data. This is my first time with software testing and it has been quite a learning experience. So, let me describe the work so far.

Although we are fitting the IVIM (Intravoxel incoherent motion) model to dMRI signals here, the techniques and code developed so far can be used for any kind of model fitting. The equation we want to fit:

S = S0 [f e^{-b*D_star} + (1 - f) e^{-b*D}]

This is a bi-exponential curve with the independent variable "b" and parameters f, D_star, D and S0.
The parameters are the perfusion fraction (f), the pseudo-diffusion constant (D_star), the tissue diffusion constant (D) and the non-gradient signal value (S0), respectively. We intend to extract these parameters given some data and the bvalues (b) associated with the data. We follow test-driven development, and hence the first task was to simulate some data, write a fitting function and see if we are getting the correct results. An IPython notebook tests a basic implementation of the fitting routine by generating a signal and plotting the results. We have the following plots for two different signals generated with Dipy's multi_tensor function and the fit by our code.

https://github.com/sahmed95/dipy/blob/ipythonNB/dipy/reconst/ivim_dev.ipynb

Once a basic fitting routine was in place, we moved on to incorporate unit tests for the model and fitting functions. Writing a test is pretty simple. Create a file such as test_ivim.py and define the test functions as "test_ivim():". Inside the test, generate a signal by taking a set of bvalues and passing the IVIM parameters to the multi_tensor function. Then instantiate an IvimModel and call its fit method to get model_parameters. The parameters obtained from the fit should be nearly the same as the parameters used to generate the signal. This can be checked with numpy's testing functions assert_array_equal() and assert_array_almost_equal(). We use nosetests for testing, which makes running a test as simple as "nosetests test_ivim.py". It is necessary to build the package first by running the setup.py file with the arguments build_ext --inplace.

The next step is to implement a two-stage fitting, where we will consider a bvalue threshold and carry out a fit of D while assuming that the perfusion fraction is negligible. This simplifies the model to S = S0 e^{-b*D}. The D value thus obtained will be used as a guess for the complete fit.
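As a rough sketch of that first stage: if the simplified model holds above the threshold, taking logarithms gives ln S = ln S0 - b*D, so D is minus the slope of a straight-line fit. The code below is an assumption-laden illustration, not the Dipy implementation. It uses plain Python and a closed-form linear fit instead of scipy, and the function names, parameter values and the threshold of 400 are all made up for the example.

```python
import math


def ivim_signal(b, S0, f, D_star, D):
    """Bi-exponential IVIM signal, following the equation in the post."""
    return S0 * (f * math.exp(-b * D_star) + (1 - f) * math.exp(-b * D))


def estimate_D(bvalues, signal, b_threshold):
    """Estimate D from a straight-line fit of ln(S) against b.

    Only points with b >= b_threshold are used, where the perfusion
    term is assumed to be negligible.
    """
    points = [(b, math.log(s)) for b, s in zip(bvalues, signal)
              if b >= b_threshold]
    n = len(points)
    mean_b = sum(b for b, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((b - mean_b) * (y - mean_y) for b, y in points)
             / sum((b - mean_b) ** 2 for b, _ in points))
    return -slope  # ln S = const - b*D, so D is minus the slope


# Illustrative values only, chosen so that D_star is much larger than D.
bvalues = [0, 10, 20, 40, 80, 200, 400, 600, 800, 1000]
true_params = dict(S0=1.0, f=0.1, D_star=0.01, D=0.001)
signal = [ivim_signal(b, **true_params) for b in bvalues]

D_guess = estimate_D(bvalues, signal, b_threshold=400)
```

On noise-free simulated data like this, D_guess lands close to the D used to generate the signal, which is what makes it a reasonable starting point for the full bi-exponential fit.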
You can find the code so far here: https://github.com/nipy/dipy/pull/1058/files

A little note about the two fitting routines we explored, scipy's leastsq and minimize: minimize is the more flexible method, allowing bounds to be passed while fitting and letting the user choose a particular fitting method, while leastsq doesn't have that flexibility and uses MINPACK's lmdif, written in FORTRAN. However, leastsq performed better on our tests than minimize. A possible reason might be that leastsq calculates the Jacobian for fitting and may be better suited to fitting bi-exponential functions. It might also turn out that supplying a Jacobian improves the performance of minimize. All these questions will be explored later; for now we have included both fitting routines in our code. However, as Ariel pointed out, minimize is not available in older versions of scipy, and to keep backward compatibility we might go ahead with leastsq.

Next up, we will implement the two-stage fitting as discussed in Le Bihan's original 1988 paper on IVIM [1]. Meanwhile, feel free to fork the development version of the code and play around with the IVIM data provided by Eric Peterson here: https://figshare.com/s/4733b6fc92d4977c2ee1

1. Le Bihan, Denis, et al. "Separation of diffusion and perfusion in intravoxel incoherent motion MR imaging." Radiology 168.2 (1988): 497-505.

### Utkarsh (pgmpy)

#### Google Summer of Code week 2

During this week, with a good amount of effort, I was able to get ahead of my proposed timeline by writing the sample and generate sample methods. This week’s time was spent writing the sample method and reading the Handbook of Markov Chain Monte Carlo [1]. Reading the book proved rather depressing, as I wasn’t able to even partially grasp the content about Monte Carlo given in the introductory chapter, but seeing the HamiltonianMCda class return samples did boost my morale.
Earlier, I wasn’t able to test my implementation of the find reasonable epsilon method. There was a bug in it, and I had a hard time finding it (I treated a single-valued 2D numpy.matrix like a floating-point value, hence the bug). I also found that numpy.array is more flexible than numpy.matrix, and even scipy recommends numpy.array over numpy.matrix [2] (this is conditional; check the post for details). During this week’s meeting we (the community) also decided to use numpy.array instead of numpy.matrix, in accordance with the post [2].

As the sampling method was functional for the first time, I was able to actually see the performance of the HMC sampling algorithm. I knew theoretically that the step size and the number of steps affect the performance of the algorithm, and on an actual run the difference was clearly visible. Sometimes, for untuned values of the step size (epsilon) and number of steps (calculated using Lambda, see Algorithm 5 [3]), the algorithm took ages to return a mere 5-6 samples. During the adaptation of epsilon in the dual averaging algorithm, the epsilon value was sometimes decreased by a huge exponent, which in turn increased the number of steps by the same exponent, causing the algorithm to run for a great deal of time. The difference due to these parameters was not the only visible one; the sample quality was also affected by the algorithm we chose for discretization. Modified Euler’s performance was really awful compared to leapfrog. The results generated by the leapfrog method were really good.

I also wrote tests for the HMC algorithm. I was thinking of using mock, as it was already being used in tests for other sampling methods in pgmpy, but my mentor recommended generating samples and applying inference on them. This also took a great deal of time, as I had to choose a model and hand-tune the parameters so that the tests don’t become too slow.
This week, too, my mentor and I weren’t able to settle on the parameterization of the model discussed in the previous post, and we don’t have any leads on that matter as of now. Next week I’ll clean up the code and try to think harder about it.

### fiona (MDAnalysis)

#### How to train your Python

While working with trajectories these past weeks, I’ve been reading up on and making use of a neat set of things related to the iterator protocol in Python. Iterators are what let us loop over each frame in a trajectory in turn, analysing as we go, and so are a key part of what I’m doing with my ‘add_auxiliary’ subproject and of MDAnalysis in general! Let’s take a look.

### Iterators and Iterables

Let’s imagine our cat has been a busy boy, and we’ve ended up with a bundle of little kittens to deal with. Now it’s time for vaccinations, and while we could ourselves pull each kitten out one by one, vaccinating each in turn, we know what a potential health hazard that is. Instead, we’re going to get our good friend Python to do it for us!

You might already know about for loops in Python. Basically, if we have:

```python
for variable in sequence:
    do_something(variable)
```

what will happen is that, first, variable is set to the ‘first item’ in sequence, and do_something is run with that value. variable is then set to the ‘second item’ in sequence and do_something is run again, and so on until we’ve run do_something for every item in sequence.

We want sequence to represent the litter, do_something to be ‘vaccinate kitten’, and variable to be set to each kitten in turn. If we represent the litter as simply a list of kittens, we don’t need to do anything further – the process of determining ‘first item’, ‘second item’ for lists (as for several other data types) is pre-built into Python, and follows each item in the list as you’d expect.
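For instance, with the litter as a plain list, the loop could look like the sketch below. The kitten names and the vaccinate function are made up for illustration.

```python
vaccinated = []  # keeps track of who has had their shot


def vaccinate(kitten):
    """Stand-in for the real (and more dangerous) procedure."""
    vaccinated.append(kitten)


litter = ["Mittens", "Socks", "Smudge"]  # hypothetical kittens

# Python's built-in list iteration hands us each kitten in order.
for kitten in litter:
    vaccinate(kitten)
```

No special methods are needed here, because list already knows how to hand out its items one at a time.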
However, it would be nice if we had a Litter class (of which this first_litter is an instance) so we could store both our kittens and things like a date of birth and the total number of kittens, and have a nice framework to use if the event of any future litters. We need to tell Python how to determine each of the items (kittens) that belong to first_litter. We can do this by defining some special methods in the Litter class to make it iterable or (more specifically) an iterator: • an iterable class has an __iter__() method • an iterator needs an __iter__() method and a next() (or __next__() in Python 3) method __iter__() needs to return an iterator object, so when the class is itself an iterator, __iter__() can just return self. The next() method needs to return the ‘next’ item in the class each time it is called, and raise a StopIteration error if we’re trying to move past the last item. When we run a for loop, next() is used to get the value of each item in turn, and the loop exits when it gets StopIteration instead. The XVGReader class I’ve included below, which forms part of my ‘add_auxiliary’ subproject, is an example of a custom class that is an iterator. For now, we’re going to make our Litter class iterable instead. To do this, we need to make __iter__() return an iterator; we can do this by making it a generator function. ### Generators A generator function returns a generator, a type of iterator. We make a generator function by adding a yield statement to it, where we might otherwise use return. What happens is that, when first called, the function will run until the first yield statement is reached, at which point the appropriate value is returned (this will be out ‘first item’) and the function ‘pauses’. When, at some point later, we’re ready for the next value, the generator is resumed (e.g. 
by calling its next() method), and progress continues until yield is encountered again; the new value (the ‘second item’) is returned, the generator pauses, and so on. After we’ve gone through the final yield and reach the end of the function, it’ll raise the StopIteration error.

So how does this look for our cats?

And now we’ve successfully vaccinated all our kittens, without endangering our personal safety!

### A more practical example: XVGReader

Unfortunately, I’m not being paid to play with kittens all summer, so before I go I’ll leave you with some excerpts (for reading auxiliary data stored in the .xvg file format) of what I’ve been working on to show how I’ve actually been using Python’s iteration protocol.

Until next time!

## June 07, 2016

### liscju (Mercurial)

#### Coding Period - II Week

This week I have been working on turning the current solution into a concrete base to build on later.

I tried to make the redirection feature change how things are stored in largefiles: the new redirection location would be the main store, and the repository store and user cache would be additional places. But this idea leads to content synchronization problems that would be hard to overcome. After trying to overcome them and talking with developers on chat, I decided to simplify the solution. The new redirection target will be an additional place to store the files, and the feature will work mainly on the wire protocol layer: it will redirect clients to the other location if the file is already there. This will not lead to any synchronization problems and is a better fit for the current solution.

The other problem I had was how to fit the new feature into the wire protocol. Last week I tried to make statlfile redirect the client to the other location, but this solution had problems with backward compatibility. I decided to add a separate procedure on the server, redirectlfile, which the client invokes to find out where it can be redirected to get the file.
This solution is much simpler and it keeps backward compability. Solution with using statlfile to redirect files can be found here: https://bitbucket.org/liscju/hg-largefiles-gsoc/commits/57a6ba3f62b13a861ac3ea211ae0ecd60d7f069f?at=devOld New solution using redirectlfile can be found here: https://bitbucket.org/liscju/hg-largefiles-gsoc/commits/5b18a0303b04c9c304fb86b719b6431c75634327?at=dev Apart from this i was trying to better "fit" solution in the code. Because this feature adds code to few different places it makes their code mangling functionalities: old standard functionality of largefiles and added functionality of new feature. Because this feature is not mandatory(it is enabled in hgrc file) it makes code harder to read as all of that places had to check if feature is turned on repository and do feature related tasks. I am proposing solution in which all of the feature related functionalities are kept in single module - it extends function in other modules by using extensions.wrapfunction. This looks pretty similiar how mercurial extensions extends mercurial core. Im in the middle of waiting for opinion of mercurial developers to get to know what they think about it. https://bitbucket.org/liscju/hg-largefiles-gsoc/src/11908305bf918900550b78d68ee627f9f2143218/hgext/largefiles/redirection.py?at=default&fileviewer=file-view-default Apart from this i am planning to do some exception handling of the errors, but as solution is only local for now there is no way i can test it. Probably something like new exception RedirectionError would be enough for now, because solution now only works locally. Additionally my patches for making largefiles compatible with python3 got merged, the series can be seen here: https://www.mercurial-scm.org/repo/hg-committed/graph/016a90152e9c ### Aron Barreira Bordin (ScrapingHub) #### Scrapy-Streaming [2] - Communication Channel and Communication Protocol Hello everyone ! 
In the second week of the project, I continued implementing the scrapy-streaming communication channel and started to work in the communication protocol. ## Communication Channel The communication channel is responsible for connecting scrapy and external spiders, receiving / sending messages, validate incoming data, handle the process, etc. I’ve implemented the communication validators, that are responsible for checking the the incoming messages have all required fields, valid data type, setting its default values. If any problem in the incoming message is found, a verbose information is sent to the external spider. Also, a buffered line receiver was implemented, that receives and buffer data from the process stdout, and parse messages line by line. ## Communication Protocol I made some advance in the communication protocol as well. I’ve implemented the following messages: • spider: generates a new scrapy spider • request: opens a new request • close: close the current spider • log: print something in the scrapy-streaming logger • response: send request’s responses • error: send errors in the communication channel ## Examples I’ll be adding scrapy-streaming examples while developing the project. The first example is a github spider that tracks project informations, and it’s already possible to run it using the current communication protocol. Scrapy-Streaming [2] - Communication Channel and Communication Protocol was originally published by Aron Bordin at GSoC 2016 on June 07, 2016. #### Scrapy-Streaming [1] - Project Structure and Initial Communication Channel Hi ! In the first week, I implemented some initial work in the Scrapy-Streaming project’s structure and documentation. ## Project Structure I started defining the new project package, with tox, travis, and codecov. I added two new commands to scrapy: streaming and crawl. 
The streaming command lets you run standalone external spiders, and the crawl command allows you to integrate external spiders into your Scrapy projects.

To run a standalone spider, you can use:

    scrapy streaming my_executable -a arg1 -a arg2 -a arg3,arg4

and if you want to integrate it with a Scrapy project, you must create a file named external.json in the project root, similar to this example:

and then, you can run these spiders using:

    scrapy crawl <spider name>

These implementations were tested and pushed at: https://github.com/scrapy-plugins/scrapy-streaming/pull/2

## Documentation

I’ve written a big part of the documentation to better define the API under development, and it can be read at: http://gsoc2016.readthedocs.io. The documentation covers the whole communication channel, a simple tutorial, and descriptions of the project behavior.

## Communication Channel

I started to work on the communication channel, defining the classes responsible for connecting external processes with Scrapy, and implemented the basic message wrappers.

Scrapy-Streaming [1] - Project Structure and Initial Communication Channel was originally published by Aron Bordin at GSoC 2016 on June 06, 2016.

## June 06, 2016

### Levi John Wolf (PySAL)

#### GSOC Call Notes, June 6 2016

I’ve had to take a break from the spatial hierarchical linear modeling kick I’ve been on recently to get back to some GSOC work. Today, I had my weekly call with my mentors, and we discussed a few things.

## Testing & Merging of project code

Since many of the improvements I’m making to the library are module-by-module, I was advised to submit PRs when a logical unit of contribution is ready to ship. As I’ve already been trying to use Test-Driven Development principles for my project, writing a test of what I want the API changes to look like and then writing to that specification, this is relatively simple: the module is ready to submit when the spec tests pass.
So, now that much of the initial foray into the labelled array API is done, I can beging to connect the tests & submit PRs where possible. ## Appropriate Targets for Labelled Array interfaces We also discussed how to extend the labelled array interface to other submodules. For example, I have a good idea how a consistent labelled array interface could look for Map Classifiers in the exploratory spatial data analysis module. Other elements of that module should also be relatively straightfoward to implement on labelled arrays, since all that’s generally needed is input interception: dataframe+column name needs to get correctly parsed into numpy vectors. This is quite simple. The spatial regression module also seems like a relatively straightforward place to add a labelled array interface, using a similar strategy to what I’ve already been doing in pysal.weights. Defining a from_formula classmethod for spatial regression modules would allow for specification of regressions using patsy & pandas dataframes. But, in other parts of the library, like region or spatial_dynamics, it’s less clear as to what the labelled array interface should look like, so I’ll have to gain some perspective there. ## Remaining Confusion in weights construction I’m running into some minor confusion because I’m trying to make a call like from_dataframe(df, idVariable='FIPS') equivalent to from_shapefile(path, idVariable='FIPS'), and can’t figure out when PySAL considers things ordered vs. unordered. For background, a spatial weights object in PySAL encodes the spatial information in a geographic dataset, allowing estimation routines for various spatial statistics or spatial models. In doing this, it relates each observation to every other observation, using information about the spatial relationship between observations. In our library, these are used all over the place. But, in building a new, abstract interface to the weights constructors, I got quite confused. 
Particularly, I was expecting to be able to write a pair of classmethods, say, Rook.from_shapefile() and Rook.from_dataframe(), that have similar signatures and generate similar results. Something like from_dataframe(df, idVariable='FIPS') being equivalent to from_shapefile(path, idVariable='FIPS').

Unfortunately, it’s somewhat confusing to figure out how to make this work correctly without making the API inconsistent. This is because PySAL handles ids in weights objects in different ways across its various weights construction functions and classes. I think, overall, we expose four different variables or flags at different points in the API that deal with how observations are indexed in a spatial weights object:

1. ids - ostensibly, a list of the ids to use corresponding to the input data, considered in almost every weighting function.
2. idVariable - a column name to get ids from when constructing weights from file, used in existing from_shapefile functions to generate ids.
3. id_order - a list of indices used to re-index the names contained in ids in an arbitrary order, impossible to set from from_shapefile functions but used in the weights class’s __init__.
4. id_order_set - a boolean property of the weights object denoting whether id_order has been explicitly set.

To me, this is rather confusing, despite some conversation trying to flesh this out.

First, all lists in Python are ordered. So, when a user passes a list of ids in as ids, it’s confusing that the order of this list is silently ignored.

Second, when we construct weights from shapefiles using an idVariable, the resulting weights object has some peculiar properties: the id_order is set to the file read order, but the id_order_set flag is always False. This is confusing for a few reasons. First, shapefiles & dbf files are implicitly ordered, so a column in the dbf should correspond exactly to the order in which shapes are read, barring data corruption.
So, if I use a column of the dataframe to index the shapefile, this should be considered ordered. Second, our docstring below seems to imply that either id_order_set is False and id_order defaults to lexicographic ordering, or id_order_set is True and id_order has special structure:

    id_order : list
               An ordered list of ids, defines the order of
               observations when iterating over W if not set,
               lexicographical ordering is used to iterate and the
               id_order_set property will return False. This can be
               set after creation by setting the 'id_order' property.

But, one can easily generate an example where id_order is not lex ordered and id_order_set is False:

    >>> import pysal as ps
    >>> Qref = ps.queen_from_shapefile(ps.examples.get_path('south.shp'), idVariable='FIPS')
    >>> Qref.id_order_set
    False
    >>> Qref.id_order
    [u'54029', u'54009', u'54069', u'54051', u'10003', ...]
    >>> Qref.id_order_set
    False

This is important because, when we construct weights from dataframes, we need to make a decision about what gets picked as an index and how to treat that index.

Right now, I’ve made the executive decision to choose consistency in behavior, so that from_dataframe(df, idVariable='POLYGON_ID') will consider that column to be ordered from the start. This means that the resulting weights will have the same iteration order as the weights from from_shapefile(filepath, idVariable='POLYGON_ID'), but the dataframe call will set the id_order_set flag, while the shapefile classmethod does not.

### Ranveer Aggarwal (dipy)

#### Making a Text Input Field

In my last two blog posts, #0 and #1, I had successfully set up a UI framework and begun building primitive UI objects. A 2D button overlay was built, and the next task was to make a text field.

Well, how about this?

A text box, this is. It works, but is incomplete. Yes, it is a text box, but there are a few things you expect from a text field that it doesn’t do.
### To Do on the Text Box • Two types of text boxes - single line and multi line • Limit size of text box, store the overflow somewhere else and display only if backspace is pressed repeatedly enough • Capture Ctrl/Alt/Space as separate keys and render them as they should • Maybe draw boundaries around the text box to make it look more like an input field Again, doing all these in VTK isn’t easy and would involve doing stuff from first principles. (Wow! Word Processors must be really hard to make!) ### Implementation The interactor is the one that captures the key stroke. The callback function for the text box is outside the interactor class. There’s no way a direct transfer of key stroke can take place. But, what the interactor does have is the renderer. And the renderer has the text box. Therefore, if we somehow store a list of UI elements in the renderer and query that list for a text box once the picker picks one up and compare, we can get the right text box to work upon. For doing so, a parent class for all UI elements was created and the renderer was made to store a list of all such elements. Thanks to this idea by my mentors we could now pass keystrokes to the text box class and add and remove characters at will. For making a dynamic multi line text box, I stored a number which describes the number of current lines. If this number is exceeded by the length of the text divided by the number of characters in a line, I simply add a newline character and increase the current number of lines. For hiding overflows, I’ll probably keep a variable for archive text and dynamically update it. ### Next Steps • Fix the text box - make it more user friendly • Make a draggable slider. This is going to take some effort ### Riddhish Bhalodia (dipy) #### Validation of Adaptive Denoising This is a small blog post with many figures about the validation and comparison of results using adaptive denoising method described in blogpost 2 and blogpost 5 ## Match or Mismatch? 
The best way to validate the adaptive denoising method [1] , (PR here) is compare it’s results with the MATLAB implementation provided by the authors themselves [2]. So I ran their code and my code on the same data. The biggest problem was converting the .rawb file format data (the example data used in ) from the brainweb database into nifty (.nii) file format to be used from python, so I have a small MATLAB script for getting nifty file format from any 3d data which can be found here. Now for the validation, there are some type conversion differences from going from MATLAB to python, but all in all looks pretty neat! ## Example Update! Well we have three methods for denoising 1. NL-Means with voxelwise averaging (link) 2. NL-Means with blockwise averaging (link) 3. Adaptive soft coefficient matching (link) Currently there exists two examples, denoise_nlmeans compares 1. and 2. for diffusion data and denoise_ascm deals with 3. for the diffusion data. A new example will be created called denoise_methods_compare, which will compare all the three methods (and others which may be added in future), on different datasets like T1 data, diffusion data and spinal cord images, that is in the next to come ## References [3] Multiresolution Non-Local Means Filter for 3D MR Image Denoising Pierrick Coupe, Jose Manjon, Montserrat Robles, Louis Collins. Adaptive . IET Image Processing, Institution of Engineering and Technology, 2011. <hal-00645538> ### Nelson Liu (scikit-learn) #### (GSoC Week 2) Intro to decision trees Apologies for the late post, I had this sitting in my drafts and forgot to publish it! Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression popular for their ease of use and interpretability. ## Overarching Goals In building a decision tree from data, the goal is to create a model that predicts the value of a target variable by learning simple binary "decision rules". 
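To make "decision rule" concrete: each rule is just a threshold test on a single feature, and a tree chains such tests together. Here is a hand-written sketch of that idea. The function name, feature names, and thresholds are picked for illustration only; nothing here is learned from data.

```python
def classify_iris(petal_length_cm, petal_width_cm):
    """Hand-written stand-in for a learned tree; thresholds are illustrative."""
    # Root rule: a single binary test on one feature.
    if petal_length_cm <= 2.45:
        return "setosa"
    # Second rule, only reached when the first test fails.
    if petal_width_cm <= 1.75:
        return "versicolor"
    return "virginica"


print(classify_iris(1.4, 0.2))
print(classify_iris(4.5, 1.4))
print(classify_iris(5.8, 2.2))
```

Building a tree from data amounts to choosing these features and thresholds automatically instead of writing them by hand.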
For example, here is a decision tree used to classify the species of a variety of iris given data.

In this image, the rules for splitting are in the non-terminal nodes of the tree. For example, starting at the top of the tree: if a given sample has a petal length (in cm) ≤ 2.45, it is classified as being of species setosa. If the sample has a petal length of more than 2.45, we look at more rules to classify it (the right child of the root node). The tree is simple to interpret, and works surprisingly well.

## Building the Tree

The process of constructing a tree is a bit more involved, but quite intuitive. The gist of the problem is deciding which sample attribute to split on, and what value of the attribute to split on.

Imagine that you have a set of (x, y) coordinates, and each of these coordinates has an associated label that you wish to predict (say, red or blue). Your task is to draw straight lines (parallel to either the x or y-axis) to divide or "partition" the data space.

A reasonable first cut would be to draw the line x = 0.5 (the black line in the image above), dividing the space in half. Of these halves, the right side is composed entirely of blue points. As a result, this part of the space is "pure" (composed of data points of only one class).

Moving forwards, we turn our attention to the left half of the space. A next reasonable split would be the line y = 0.3 (approximately). This split, visualized by the yellow line, once again produces a set of pure points in 0 ≤ x ≤ 0.5 and 0 ≤ y ≤ 0.3.

We can also split on the purple line (x = 0.1), thus dividing all of the points of various classes into their separate partitions. By doing so, we have constructed the decision rules necessary to learn our tree. By iteratively splitting, we can narrow down the space on the basis of x and y coordinates until we determine the proper class. The decision tree constructed is shown below.

This example is simple enough when using data with only two attributes (e.g. 
x and y), but things quickly get tricky as the number of dimensions increases. To calculate the best splits, decision trees use various criterion of error that are essentially different metrics of "good" splits. The first part of my GSoC project will be implementing the mean absolute error (MAE) criterion, which judges potential splits by their absolute distance away from the true value. I hope this blog post was informative and clear in describing what decision trees are, as well as the iterative partitioning process of growing trees. If you want to learn more, Pedro Domingos has an informative set of lecture videos on decision trees in his Coursera course (see week 2). ## June 05, 2016 ### Sheikh Araf (coala) #### [GSoC16] Week 1 update It’s been a week since the coding period of Google Summer of Code started and so far it’s been great. After some discussion with my mentor during the community bonding period I started coding my project which is an Eclipse plugin for coala. In the first few days I setup maven to build my project and Travis-Ci for continuous integration. Many developers underestimate the importance of test-driven development. Although initially it seems as if writing tests would slow you down, but instead I found the opposite to be true. This is especially the case if you’re dealing with GUI interactions. After writing code in lieu of launching a new Eclipse workspace to test the plug-in manually, you can just write tests and run them in headless mode. This in my opinion is substantially quicker. So, after the foundation was laid further development was a breeze. I did some clean-up of the existing prototype of the plug-in and re-organized the project structure. The next task was to run the code analysis on a separate thread to avoid freezing the main UI thread. Although this was possible using built-in Java libraries, I used the Apache commons’ exec library to run the coala-json command. 
This was necessary because whilst dealing with multiple processes and supporting many operating systems a lot of things can go wrong. In the coming weeks I plan on adding options to select the bear to use for running the analysis and handle bears that require user interaction. If everything goes according to the plan I’ll have a usable static code analysis plug-in by mid-term evaluation. Cheers! ### Valera Likhosherstov (Statsmodels) #### GSoC 2016 #1 ## Kim filter (This is a picture I get, searching for "Kim filter" in Google) If you look at the code in my Pull Request, you can see the class KimFilter (kim_filter.py file), which is the main part of my project. So what does it exactly do? There is such thing called a linear state space representation: It describes some observed vector process yt in time. Ht, At, Ft, Gt are some matrices, zt is a vector of exogenous data and beta_t is an unobserved process vector, which generates yt. et and vt are white noise vectors and nu is the intercept term. This representation is usually used, when we have some real life process yt, and we want to understand its internal structure, to find a law describing it. Exact values of matrices in the model definition depend on unknown vector of parameters in a model-specific way. In the general case it's just a vector of all matrices' elements, written in a row (but I never witnessed anything like this before). To estimate the unknown parameters we use the principle of maximum likelihood, widely used in Statistics and Machine Learning, which says that the most likely parameter values are the most probable. To apply maximum likelihood estimation (MLE), we need to calculate probability density of the observed process values. This is an issue a well known Kalman Filter deals with, calculating filtered values of unobserved process in addition. Then we apply some numerical maximization towards Kalman filter output as a function of parameters and get the solution. 
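To make the likelihood computation concrete, here is a minimal sketch of a Kalman filter for a scalar, time-invariant model with no exogenous term. This is not the Statsmodels implementation. The function name, the argument names, and the exact form of the model written in the docstring are my assumptions for illustration.

```python
import math


def kalman_loglike(ys, H, F, G, nu, R, Q, beta0, P0):
    """Scalar Kalman filter.

    Assumed model (scalar, time-invariant, no exogenous term):
        y_t    = H * beta_t + e_t,               e_t ~ N(0, R)
        beta_t = nu + F * beta_{t-1} + G * v_t,  v_t ~ N(0, Q)

    Returns the log-likelihood of ys and the list of filtered states.
    """
    beta, P = beta0, P0
    loglike = 0.0
    filtered = []
    for y in ys:
        # Prediction step: propagate the state and its variance.
        beta_pred = nu + F * beta
        P_pred = F * P * F + G * Q * G
        # Forecast error and its variance.
        err = y - H * beta_pred
        S = H * P_pred * H + R
        # Gaussian log-density of y given the past observations.
        loglike += -0.5 * (math.log(2.0 * math.pi * S) + err * err / S)
        # Update step: correct the prediction with the new observation.
        K = P_pred * H / S
        beta = beta_pred + K * err
        P = (1.0 - K * H) * P_pred
        filtered.append(beta)
    return loglike, filtered


# Local-level example: beta_t is a random walk observed with noise.
ll, states = kalman_loglike([1.0, 1.2, 0.9], H=1.0, F=1.0, G=1.0, nu=0.0,
                            R=1.0, Q=0.1, beta0=0.0, P0=1.0)
print(ll, states)
```

Treating the returned log-likelihood as a function of the parameters (here H, F, G, nu, R, Q) and handing it to a numerical optimizer is the MLE procedure described above.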
Let's modify the linear state space representation a little: You can see that instead of time indexes under the representation matrices we now have St, a discrete Markov process with P states and P x P transition matrix, which represents a changing regime of the process. When the process St is known a priori, we have a usual linear model. But when it is unknown and random, the model is very complex. To estimate it, we use Kim Filter, described in [1], which has been implemented by me recently in formerly mentioned PR. It iterates over the process and during every iteration runs through three steps: 1. Kalman filter step (_kalman_filter_step method in KalmanFilter class). During this phase we use Kalman filter for every value of current and previous regimes and thus get a battery of P x P filtered states and their covariances, conditional on every previous and current regime combination. If you look in the code, you can see, that I delegate this calculations to P KalmanFilter instances, which is very handy. 2. Hamilton filter step (_hamilton_filter_step method). During this phase we calculate the likelihood of current observation, conditional on previous observations, and append it to the overall likelihood. Also we calculate some helpful weight probabilities. 3. Approximation step (_approximation_step method). Here we sum up filtered states and covariances, obtained during the first step, with weights, obtained during the second step, to achieve filtered states, conditional on only current regime value. For the detailed mathematical description you can refer to [1] or look around my code. ## Maximum likelihood estimation We now can estimate likelihood of the parameters. The next goal is to optimize parameters according to the likelihood. This responsibility is handled by RegimeSwitchingMLEModel class, located in rs_mlemodel.py file and derived from MLEModel class. 
It was very handy to extend MLEModel and substitute Kalman filter by Kim filter instance in a corresponding property. These filters have a very similar interface, and MLEModel, being more about optimizing, delegates all model-specific actions to filter instance. All noticeable changes were about parameters transforming, since we have another group of parameters, representing transition matrix. Also, RegimeSwitchingMLEModel has got a feature, discussed in the next item. ## Starting parameters Just passing the likelihood function to optimizer doesn't guarantee its convergence to global maximum. Optimizer can get stuck in local maxima, so the goal is to start from somewhere very close to the best likelihood value. My proposal suggested two approaches to this problem: 1. Just to pretend like there is no problem and let user provide the starting parameters. Probably he will enter something very close to global maximum. This scenario is also implemented, because it is used for testing, based on Kim's code samples, where author passes some a priori known starting values to optimizer. 2. To fit a one-regime analog of the switching model, using provided data, first (to run this feature, set fit_nonswitching_first=True in fit method of your RegimeSwitchingMLEModel extension). Then to initialize a switching model, where transition matrix is set to identity (like if there is no switching) and all regimes are equivalent to that obtained by non-switching fit. This approach seems to be intuitive, because we are choosing the model, complicating it gradually, according to the likelihood improvement. ## Tests To test the model I used this Gauss code from Kim's site, implementing Lam's Generalized Hamilton model. There is a basic test test_kim.py of Kim filter, which just tests likelihood and filtered states, obtained from my code, against Gauss code output. I am lucky that it is passed! Also, there is a file test_rs_mlemodel.py, which tests MLE against the same code. 
It passes as well! You can see almost the same code, but commented and more readable, in the notebook presenting regime_switching module usage. Looking at the output of the notebook, you can see that the fit-non-switching-first start works fairly well: the likelihood does not differ much from the best one, and neither do the obtained parameters.

## What's next?

I hope to start implementing models soon, to make my module more ready-to-use. The first model in line will be the Markov-switching Autoregressive model, and it will probably be the hardest one. There is going to be some fun, like implementing the EM-algorithm for starting parameters estimation. I can't wait to start! Feel free to leave comments if you don't understand something or have any thoughts about my project. Thanks!

## Literature

[1] "State-space Models With Regime Switching" by Chang-Jin Kim and Charles R. Nelson.

### mr-karan (coala)

#### GSoC Week 2 Updates

## Week 2 Updates

The task of Week 2 was to get the prototype into a usable state. I used Python Prompt Toolkit to revamp the whole code.

• Now the application won’t accept wrong input for Boolean values

Here’s the sample code for writing a Validator to be passed to the prompt function as an argument:

    from prompt_toolkit.validation import Validator, ValidationError


    class BooleanValidator(Validator):
        # Accept only the loose yes/no spellings listed below.
        def validate(self, document):
            text = document.text.lower()
            allowed_val = ("yeah", "yes", "yep", "yo", "nope", "no", "nah")
            if text not in allowed_val:
                raise ValidationError(message="Not valid")

This is one of the sweet features I liked about this library, as it prevents the user from entering wrong input and also gives an error message, presented in a bottom status bar.

• Added a dropdown list to show all the languages coala supports so far. 
Since coala supports more than 35 languages, I thought it would be nice to have a drop-down list so that the user can clearly see and pick the language to avoid entering a name which might have some typo mistake or any other inconsistency with the naming • Added a status bar to show accepted input for yes/no type of questions. Since the application can accept loose input terms like “yeah” ,”yep” etc for “Y” or “YES”, I thought a good way to present this information to the user will be by utilizing Python Prompt Toolkit’s status bar feature. • Refactored my original function to take user input, now using prompt() method that Python Prompt Toolkit provides. I have to use Python Prompt Toolkit again to improve the coala User Interface as a part of my GSoC project, so I thought this would be a good way to get acquainted with the library. I packaged the whole application in a Python Package and uploaded to PyPi for all you to simply install it using pip. While uploading this package I got to learn from the stupid mistake I did. I used the name coala-bears-create in bin/coala-bears-create which is definitely not the right way to import package. Then I got to know the workaround for this, that is to use _rather than - as PyPi automatically converts the underscores to dash in package names. ### Here’s a video of the tool in action: All the code is uploaded on Gitlab at coala / coala-bear-management and I am waiting for my Pull Request to be accepted soon :) ### udiboy1209 (kivy) #### A Man Has Duties Game of Thrones fans might identify the title as a (not so) famous dialogue from the series’ mysterious character Jaqen H’ghar. So the GSoC coding period began two weeks back on 23rd May while I was obsessed with this epicly famous TV series. I was busy binge-watching episodes from season 1 through 6 (which is currently running) when I realized that I needed to get on with the planned work of these two weeks. A man has duties. 
## A man’s plans The planned work for these two weeks involved writing boilerplate code for the new Map module in KivEnt. Boilerplate for a new module in KivEnt essentially involves making a python package with the name kivent_<module_name> and creating a setup.py file to compile cython inside it. And then there’s KivEnt specific classes to be added i.e. GameSystems and data objects. The map module requires a manager to make creating and modifying maps easy enough and bundled in a class. So my plan was to first learn about KivEnt’s structure and then create all these classes and decide how to pipeline the flow of creating and rendering a map for the user would be. Possibly also make an example app to demonstrate this. ## Give a man a task and a man will do it In all fairness, nobody gave me this task. I planned for it myself. But you’ll understand my obsession if you have seen GoT :P. I had planned for reading and researching on how to make KivEnt gamesystems and managers, how to render entities on the screen inside a system and how to allocate and use contiguous memory. But I didn’t need to do that this week because the AnimationSystem I had coded in the community bonding period gave me a very good introduction to a lot of KivEnt’s internals. Making the classes for the map module was fairly easy. I had to just reuse and modify code I had written for the animation module. I had thought memory management might be a problem but my mentor Jacob Kovak has done a really great job with wrapping up the ugliness of cython’s memory management and malloc into a really easy-to-use API. I realized this while figuring out how to store the animations for the animation module. I wanted to read more about it and dive into the code, but I’ve been obsessed with something else this week :P. The only problematic or challenging thing was dealing with the build script setup.py. 
### setup.py is dark and full of terrors

Quite frankly it’s not that horrible, but it was my first time dealing with Cython build scripts. The Cython docs for building are vast and quite complex. There are a lot of options and features that can be used, and it would take a lot of time to understand them all. So I was fiddling around with a lot of options that didn’t work too well. In the end, I just copied the setup.py of another KivEnt module, which was easy to understand and modify, and it worked out pretty well :D.

## Current status

I still need to test the code I have written. I am waiting for Kovak to add a feature to the core module for registering managers to be initialized in the GameWorld. This is easy to do if you are adding a module to core, because it just involves adding a few lines to GameWorld’s init. But I can’t add a manager I create in the map module to GameWorld, because that would make the map module a dependency of the core module, which is not desirable. In most cases people would use the core module without needing the map module. Once that is done, the GameWorld instance will have a MapManager instance which can be accessed in your app. Then I can test the example I have written.

## And my watch ends

The next task is integrating animated tiles into the map module, which will use the animation module I added to core. I hope things run well in the boilerplate, or else I will end up a little behind schedule.

I don’t have a demo to show currently, but I can show you a screenshot of my code.

Okay, okay, don’t laugh at me! I usually code on tty1 in Linux (it’s very distraction-free) and I didn’t know how to take a screenshot of that :P.

In other news, we got logo stickers printed for the new Electronics Club logo! And I also received the GSoC welcome package, which contained a sticker. So that makes two sticker additions on my laptop! I also learned a bit about 3D printing and used a 3D printer to print a Batman logo, which I’ll paint and stick somewhere :D!
## June 04, 2016

### Kuldeep Singh (kivy)

#### First and Second Week of GSoC

The second week of GSoC has come to an end and I am really excited. This week I worked on these features:

• sys-info: provides system information like device name, internal storage, screen resolution etc.
• sms[1][2]: provides an API for sending and receiving SMS on iOS and Android.
• sharing (text and image): sharing of images and text on Android.
• Bluetooth: provides discovery of new devices, getting the paired devices, turning visibility on and off, and other methods.
• brightness: changes the brightness of the screen (only for Linux and iOS).

The calling feature was completed and merged before the community bonding period. My mentors and other Kivy developers helped me whenever I got stuck. They make things look easy 😀 and we have a lot of discussions.

Next week, I will be working on implementing features for Windows and iOS. Stay tuned for more updates.

Kuldeep Singh Grewal (kiok46)

#### Community Bonding Period, Google Summer of Code’16. (Kivy)

This summer I am doing a GSoC project under the Python Software Foundation and its sub-organisation Kivy. (Awesome people here!!!!) My project is Plyer (a platform-independent compatibility layer), which provides platform-independent APIs for accessing features of users’ devices, mainly on 5 platforms (Windows, Linux, Android, iOS, OSX). My mentors for this project are Akshay Arora (qua-non), who is a freelancer in India, and Ryan Pessa (Kived), who is a software developer in Kansas City.

This period was fun and full of learning. I got to know the members of my organisation, who are really awesome people and have mind-blowing skills. I worked on some features, like sysinfo and calling, so that I could reduce my work stress in the first week. I had to learn Cython and C++ from scratch and created wrappers to be used in Python, as we discussed that we might need them for development on Windows.
[See Wrapping-Cpp-using-Cython on my GitHub account.] That was cool 😀. I had very little experience with development on iOS and OSX, so I tried to familiarise myself with both. I bought a new Mac and an iPhone, as this was a requirement for the project.

I will be posting progress updates in my other blog posts.

Kuldeep Singh Grewal (kiok46)

### Levi John Wolf (PySAL)

#### The Beginnings of a new API

NOTE: A demo of the code I’m referring to in this update, for the new labelled array API in pysal.weights, is available in this notebook, and the actual code lives in a weights2 module in my gsoc feature branch.

I’ve decided to target our weights module to prototype the labelled array interface. In general, we’ll need extensions built into at least our exploratory spatial data analysis module, esda, our spatial regression module, spreg, and our spatial dynamics module, spatial_dynamics.

So far, I’ve focused on weights because it’s so central to everything else that the library does. It also poses unique challenges to design around, and I’ve already done a bit of work before GSOC on making a labelled array interface for it.

I’ve been building constructors that let us build spatial weights objects from the primitives defined in various other computational geometry packages. Fortunately, what’s required is nothing more than building efficient type conversions and, where possible, relying on duck typing. I’m somewhat concerned that relying on duck typing alone may make this API more fragile than we’d like, so I’m trying to be eager about conversions to our native geometric objects, as long as the conversion is computationally cheap.

Altogether, this means that I’ve done quite a bit of redesigning of the weights module. In general, it still supports the same basic interaction style, but it can now build weights from arbitrary iterables of shapes or PostGIS-style dataframes.
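To make the mix of duck typing and eager conversion described above a little more concrete, here is a minimal sketch in plain Python. It is not the code from the weights2 module: the function names `as_vertices` and `queen_neighbors` are hypothetical, the `vertices` attribute is only assumed as one possible native polygon interface, and only simple polygons without holes are handled. The `__geo_interface__` attribute is the GeoJSON-like protocol that several Python geometry packages expose.

```python
def as_vertices(shape):
    """Eagerly convert a polygon-like object to a list of (x, y) tuples.

    Accepted inputs (an illustrative choice, not PySAL's actual rules):
    objects exposing __geo_interface__, objects with a ``vertices``
    attribute, or any iterable of coordinate pairs.
    """
    geo = getattr(shape, "__geo_interface__", None)
    if geo is not None:
        if geo["type"] != "Polygon":
            raise TypeError("only Polygon geometries are handled in this sketch")
        ring = geo["coordinates"][0]  # exterior ring
    elif hasattr(shape, "vertices"):
        ring = shape.vertices
    else:
        ring = shape  # duck typing: assume an iterable of (x, y) pairs
    return [(float(x), float(y)) for x, y in ring]


def queen_neighbors(shapes, ids=None):
    """Map each id to the sorted ids of shapes sharing at least one vertex."""
    shapes = list(shapes)
    ids = list(ids) if ids is not None else list(range(len(shapes)))
    by_vertex = {}
    for key, shape in zip(ids, shapes):
        for vertex in set(as_vertices(shape)):
            by_vertex.setdefault(vertex, set()).add(key)
    neighbors = {key: set() for key in ids}
    for sharing in by_vertex.values():
        for key in sharing:
            neighbors[key] |= sharing - {key}
    return {key: sorted(found) for key, found in neighbors.items()}


class GeoSquare:
    """Toy shape that only speaks the __geo_interface__ protocol."""

    def __init__(self, x0, y0):
        self.x0, self.y0 = x0, y0

    @property
    def __geo_interface__(self):
        x0, y0 = self.x0, self.y0
        ring = [(x0, y0), (x0 + 1, y0), (x0 + 1, y0 + 1), (x0, y0 + 1), (x0, y0)]
        return {"type": "Polygon", "coordinates": [ring]}


class VertexSquare:
    """Toy shape that only exposes a ``vertices`` attribute."""

    def __init__(self, x0, y0):
        self.vertices = [(x0, y0), (x0 + 1, y0), (x0 + 1, y0 + 1), (x0, y0 + 1)]


shapes = [
    [(0, 0), (1, 0), (1, 1), (0, 1)],  # plain list of coordinate pairs
    GeoSquare(1, 0),                   # touches the first square
    VertexSquare(3, 0),                # isolated
]
result = queen_neighbors(shapes, ids=["a", "b", "c"])
```

Converting everything to one internal representation up front keeps the neighbour search itself free of type checks, which is the trade-off between eager conversion and pure duck typing that the paragraph above describes.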
Trying to balance this work and my own independent work on my dissertation has been challenging so far, but fortunately, the GSOC work has been more forthcoming than I expected. Hopefully, as the project matures, balancing this will be simpler.

### Leland Bybee (Statsmodels)

#### GSoC Initial Work

# Project Background

There are two primary goals that I’m attempting to accomplish for this project. The first is to implement a framework for handling distributed estimation in statsmodels. The second is to implement some methods for statistical inference for regularized models. In terms of the distributed estimation, the first thing I’m working on is implementing the methods discussed in [1] for handling distributed regularized regression. Once I’ve completed that, the goal is to work on a more generalized distributed estimation framework. Once that is done, I’m planning to move on to the statistical inference, particularly following the method discussed in [2].

# Current Work

So far I’m happy with how things have been going. I’ve implemented a simple version of the procedure discussed in [1], built on top of the elastic net code already in statsmodels. In terms of structuring the build, the current approach is to add an argument to fit_regularized that determines whether to use the distributed code. In this case a generator is expected as input, to allow possible extensions to more intricate data structures like those provided by dask. I’m working on testing the current code this coming week and am then planning to extend what I have now to a more general framework.

[1] Communication Efficient Sparse Regression: A One Shot Approach. Jason D. Lee, Qiang Liu, Yuekai Sun, and Jonathan E. Taylor. 2015.

[2] A Significance Test for the Lasso. Richard Lockhart, Jonathan Taylor, Ryan J. Tibshirani, Robert Tibshirani. 2013.

### Utkarsh (pgmpy)

#### Google Summer of Code week 1

This week of work proved to be quite productive.
I kept to my proposed timeline. For this week I had proposed to write the base class structure for Hamiltonian Monte Carlo (HMC) and to implement methods for leapfrog, modified Euler, and the algorithm for finding a reasonable starting value of epsilon. At the start I wrote leapfrog and modified Euler as methods of the HMC class, but my mentor told me to write a base class and, using that base class, write leapfrog and modified Euler as separate classes. The earlier structure looked something like this:

```python
class HamiltonianMCda(object):
    def __init__(self, discretize_time='leapfrog', *args):
        # Some arguments and parameters
        # chooses discretization algorithm depending upon
        # the string passed to discretize_time
        pass

    def leapfrog(self):
        pass

    def modifiedEuler(self):
        pass
```

But why did we settle upon a base class implementation? With the earlier structure, things were not flexible from the user’s point of view. What if the user wants to plug in their own implementation? After the changes I created a base class called DiscretizeTime; a class inheriting from it can then be passed as an argument for discretization of time. The advantage of having a base class is that it provides a basic structure and adds extensibility. Now things look something like this:

```python
class DiscretizeTime(object):
    def __init__(self, *args):
        pass

    def discretize_time(self):
        # returns the initialized values
        pass


class LeapFrog(DiscretizeTime):
    def __init__(self, *args):
        pass

    def _discretize_time(self):
        # computes the values and initializes the parameters
        pass


class HamiltonianMCda(object):
    def __init__(self, discretize_time=LeapFrog, *args):
        # discretize_time is a subclass of DiscretizeTime
        pass
```

Now, using this base class, the user can pass their own implementation as an argument. I also wrote other base classes for finding the log of the probability distribution and the gradient of the log.
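As an illustration of why the pluggable base class is convenient, here is a small self-contained sketch for a one-dimensional problem. It is not pgmpy’s implementation: the method signature `discretize_time(position, momentum, stepsize)`, the `grad_log_pdf` callable, and the `simulate` helper are all assumptions made for this example. The update rules themselves are the standard leapfrog and modified Euler steps for unit mass.

```python
class DiscretizeTime:
    """Base class: advance (position, momentum) by one step of size stepsize."""

    def __init__(self, grad_log_pdf):
        # grad_log_pdf(q) returns d/dq log p(q) for the target density p
        self.grad_log_pdf = grad_log_pdf

    def discretize_time(self, position, momentum, stepsize):
        raise NotImplementedError


class LeapFrog(DiscretizeTime):
    def discretize_time(self, position, momentum, stepsize):
        # half step in momentum, full step in position, half step in momentum
        momentum_half = momentum + 0.5 * stepsize * self.grad_log_pdf(position)
        new_position = position + stepsize * momentum_half
        new_momentum = momentum_half + 0.5 * stepsize * self.grad_log_pdf(new_position)
        return new_position, new_momentum


class ModifiedEuler(DiscretizeTime):
    def discretize_time(self, position, momentum, stepsize):
        # full step in momentum first, then position uses the updated momentum
        new_momentum = momentum + stepsize * self.grad_log_pdf(position)
        new_position = position + stepsize * new_momentum
        return new_position, new_momentum


def simulate(position, momentum, stepsize, n_steps, grad_log_pdf,
             discretize_time=LeapFrog):
    """Run n_steps of the chosen discretization (any DiscretizeTime subclass)."""
    stepper = discretize_time(grad_log_pdf)
    for _ in range(n_steps):
        position, momentum = stepper.discretize_time(position, momentum, stepsize)
    return position, momentum


def standard_normal_grad(q):
    """Gradient of log N(0, 1) up to a constant: d/dq (-q**2 / 2) = -q."""
    return -q


def energy(q, p):
    """Hamiltonian for a standard normal target with unit mass."""
    return 0.5 * q * q + 0.5 * p * p


q1, p1 = simulate(1.0, 0.0, 0.1, 1, standard_normal_grad)
q100, p100 = simulate(1.0, 0.0, 0.1, 100, standard_normal_grad)
qe, pe = simulate(1.0, 0.0, 0.1, 1, standard_normal_grad,
                  discretize_time=ModifiedEuler)
```

A user-defined scheme only has to subclass `DiscretizeTime` and can then be passed through the `discretize_time` argument without touching the sampler code, which is the flexibility the string-based design lacked.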
While writing my GSoC proposal, my mentor and I couldn’t decide how these methods for finding the log and its gradient should be implemented. In this week’s meeting with my mentor, I proposed that I write a method in each model class we will be implementing, and use that if the user doesn’t provide a class inheriting from the base class for gradients, and we settled upon it.

Though this week’s work went great, there are still things which remain unclear from a theoretical point of view. How to parameterize a continuous model still remains in doubt. Currently I have assumed that the model parameterization will be a matrix or array type structure. This assumption is good enough for most of the common models I have come across, but it cannot be stated with certainty that it will generalize to all kinds of continuous models. My mentor and I are looking into things more deeply and hopefully we will find a solution soon.

For the next week I’ll try to finish the Hamiltonian Monte Carlo sampler.

### fiona (MDAnalysis)

#### Auxiliary power, then full steam ahead!

Nearly two weeks of coding down, about eleven to go! It’s been a bit of a slow start, but now I’m starting to get into the swing of things.

This post was originally going to be longer, but I’ve decided to split it, so unfortunately you don’t get as many cats this week. On the plus side, the intended second half – the inaugural What I Learnt about Python This Week (witty alternative name pending) – should be up in the next couple of days, rather than the usual 1+ week! I’ll even give you a sneak preview:

What I will be doing in this post is expanding more on my plans for the first part of my project – adding auxiliary files to a trajectory. I’ll start with a very brief overview of some bits of Python, intended only to try and help the following make a bit more sense if you’ve never used it or programmed before. If you want to actually learn how to use Python, go check out one of the numerous tutorials!
### Some brief notes about Python

In Python, we store data in variables. Depending on what it is, the value we store will have a particular data type: common built-in examples are numbers, strings (a sequence of characters), and lists (a sequence of values, each of which can have a different data type). Functions are things that (usually) take some input and do something with it.

A class is essentially a framework/container used to store a bunch of related variables (attributes) and functions (also attributes, more specifically methods) which can use or alter the data attributes. We can create different instances of a class, each with a particular set of attribute values; this allows us to keep track of different related sets of values, and to more easily work with them using the accompanying methods.

### The Way Things Are

In MDAnalysis, working with a simulation trajectory goes something like this: we make u, an instance of the Universe class, to contain information about our system, including the attribute trajectory. trajectory is itself an instance of a Reader class; there are several different Readers, each able to read trajectory data from a particular file format. trajectory stores general information – like the name of the file where the trajectory data is written, and the time between each recorded timestep (dt) – as well as frame and ts, respectively a number identifying a particular timestep (numbering sequentially from 0) and an instance of a Timestep class which stores the position (and velocity and forces, if available) data from that frame.

The trajectory reader has a read_next_timestep() method that (as per the name) can be used to read the next timestep recorded in the trajectory file, and update the frame number and ts with the appropriate values. This allows us to perform particular analyses along the whole trajectory by looping through each frame in turn, updating the data in ts as we go - which is very useful!
What ‘add_auxiliary’ is trying to achieve is to allow us to perform analysis with both the position data and other sets of data not stored in the trajectory file (i.e. ‘auxiliary’ data), while running through this loop.

### The Way Things Will Be

Basically, we want ts to store, in addition to the position data of the current frame from the trajectory file, auxiliary data recorded elsewhere.

First, we’ll need some AuxReader classes that will work in a similar way to the trajectory Readers: each instance will store some general stuff like the name of an auxiliary file, a step number, and the value of the auxiliary data for that step (step_data). A read_next_step() method will allow us to update step and step_data as appropriate.

These steps may not, however, match up with the trajectory timesteps (there are several reasons why we might want to record the auxiliary data, say, twice as often). First, we can assign each auxiliary step to its closest trajectory timestep:

We can now add another method, read_next_ts(), that’ll use this condition and read_next_step() to read through all the auxiliary steps belonging to the next trajectory timestep, keeping a record of the auxiliary values from each of these in a ts_data list. We’ll also have a get_representative() method that will, based on this list, pick a single value to represent the timestep – we’ll let the user choose if this is, say, an average, or the value from the closest step – and store this representative value, ts_rep, in the AuxReader. For convenience, we’ll also copy this value to ts, to be stored in ts.aux under a custom name.

To the trajectory Reader, we’ll add an auxs list (initially empty), and an add_auxiliary method that will, given a set of auxiliary data, make the appropriate AuxReader instance and add this to the auxs list.
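The plan above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the MDAnalysis code: it keeps the auxiliary data in memory instead of reading a file, the constructor arguments (`times`, `values`, `dt`, `represent`) and the `ts_times` attribute are invented for the example, and read_next_ts() is given the frame number explicitly to keep things short. A step lying exactly halfway between two frames is assigned to the later one here, which is an arbitrary choice.

```python
import math


class AuxReader:
    """Toy auxiliary reader over in-memory (time, value) steps."""

    def __init__(self, times, values, dt, represent="closest"):
        self.times = list(times)    # time of each auxiliary step
        self.values = list(values)  # auxiliary value at each step
        self.dt = dt                # time between trajectory frames
        self.represent = represent  # "closest" or "average"
        self.step = -1              # index of the last step read
        self.step_data = None
        self.ts_data = []           # values assigned to the current frame
        self.ts_times = []          # times of those values
        self.ts_rep = None

    def _frame_of(self, time):
        # closest trajectory frame (frame f sits at time f * dt), ties round up
        return int(math.floor(time / self.dt + 0.5))

    def read_next_step(self):
        self.step += 1
        self.step_data = self.values[self.step]
        return self.step_data

    def read_next_ts(self, frame):
        """Read all auxiliary steps assigned to `frame`, then pick ts_rep."""
        self.ts_data, self.ts_times = [], []
        while (self.step + 1 < len(self.times)
               and self._frame_of(self.times[self.step + 1]) <= frame):
            value = self.read_next_step()
            if self._frame_of(self.times[self.step]) == frame:
                self.ts_data.append(value)
                self.ts_times.append(self.times[self.step])
        self.ts_rep = self.get_representative(frame)
        return self.ts_rep

    def get_representative(self, frame):
        if not self.ts_data:
            return None
        if self.represent == "average":
            return sum(self.ts_data) / len(self.ts_data)
        # "closest": value from the step nearest in time to the frame itself
        target = frame * self.dt
        best = min(range(len(self.ts_data)),
                   key=lambda i: abs(self.ts_times[i] - target))
        return self.ts_data[best]


# Auxiliary data recorded twice as often as the trajectory (dt = 2.0).
aux_times = [0.0, 1.0, 2.0, 3.0, 4.0]
aux_values = [10.0, 11.0, 12.0, 13.0, 14.0]

averaged = AuxReader(aux_times, aux_values, dt=2.0, represent="average")
average_per_frame = [averaged.read_next_ts(frame) for frame in range(3)]

closest = AuxReader(aux_times, aux_values, dt=2.0, represent="closest")
closest_per_frame = [closest.read_next_ts(frame) for frame in range(3)]
```

In the loop over frames, each call to read_next_ts() plays the role of the extra line that would run alongside read_next_timestep(), and the returned value is what would be copied into ts.aux.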
Our data structure now looks something like this:

Lastly, we’ll also add a couple of lines so that when we want to read the next timestep, in addition to running read_next_timestep() on the trajectory, we’ll run read_next_ts() on each auxiliary reader in auxs. This all means that when we move to the next timestep, both the position and representative auxiliary data in ts will be updated, and we should now be able to analyse them alongside each other!

There’s a bunch more stuff I’ve skipped over that gives us more options and neatens up the process (I’ll be talking about some of this in the next post, so stay tuned!), but hopefully for now you have a clearer idea of what I’m currently doing.

As it stands, I have the framework I’ve discussed here up and (mostly) working, for the test case of auxiliary data stored in the ‘.xvg’ file format; so I’m mostly double checking everything works as I expect, looking at expanding to more general auxiliary readers and seeing what additional user-control can be added.

I now have a work-in-progress pull request, so head over there to see the nitty-gritty code details and/or pitch in with any ideas, feedback or advice; and I’ll see you again soon with Post 3.5!

## June 03, 2016

### TaylorOshan (PySAL)

#### Generalized Linear Models and Sparse Design Matrices

This week was primarily focused on exploring different options for estimating generalized linear models (GLMs). First, I built an option into the gravity model class that uses existing GLM code from the statsmodels project. Next, I put together my own code to carry out iteratively weighted least squares estimation for GLMs. Of immediate interest was a comparison of the speed of the two code-bases for estimating a GLM. Each technique was used to calibrate each model variety (unconstrained, production-constrained, attraction-constrained, and doubly constrained) when the number of origin-destination pairs (sample size) was N = 50, 100, 150, and 200.
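For readers unfamiliar with iteratively weighted least squares, the following is a minimal pure-Python sketch of the procedure for a Poisson GLM with a log link. The Poisson/log-link choice is an assumption made for this example (it is the usual count-data form), and the helper names `solve` and `poisson_iwls` are hypothetical. It is neither the custom GLM code discussed in this post nor the statsmodels implementation, and it uses dense lists purely for clarity.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def poisson_iwls(X, y, n_iter=25, tol=1e-10):
    """Estimate beta in log(mu) = X beta by repeated weighted least squares."""
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    mu = [yi + 0.5 for yi in y]         # starting means, kept away from zero
    eta = [math.log(m) for m in mu]     # linear predictor
    for _ in range(n_iter):
        # For the log link: working response z and weights w = mu
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        xtwx = [[sum(w[i] * X[i][a] * X[i][b] for i in range(n))
                 for b in range(k)] for a in range(k)]
        xtwz = [sum(w[i] * X[i][a] * z[i] for i in range(n)) for a in range(k)]
        new_beta = solve(xtwx, xtwz)    # weighted least squares step
        eta = [sum(xij * bj for xij, bj in zip(row, new_beta)) for row in X]
        mu = [math.exp(e) for e in eta]
        converged = max(abs(nb - b) for nb, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


# Intercept-only model: the fitted mean must equal the sample mean of y.
beta_mean = poisson_iwls([[1.0], [1.0], [1.0]], [1.0, 2.0, 3.0])

# Responses lying exactly on exp(0.5 + 0.3 x) are recovered exactly.
xs = [0.0, 1.0, 2.0, 3.0]
beta_line = poisson_iwls([[1.0, x] for x in xs],
                         [math.exp(0.5 + 0.3 * x) for x in xs])
```

Each pass solves one weighted least squares problem, so the cost is dominated by forming and solving the X'WX system, which is why the size of the design matrix matters so much for the timings.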
Some quick results can be seen here in this gist. As expected, the results show that for either technique the unconstrained models are the fastest to calibrate (no fixed effects), followed by the singly-constrained models (N fixed effects), and finally the doubly-constrained models, which take the longest (N*2 fixed effects). The more fixed effects in the model, the larger the design matrix, X, and the longer the estimation routines will take.

More surprisingly, the custom GLM (from now on just GLM) was way faster than the statsmodels GLM (from now on SMGLM). Statsmodels uses either the pseudoinverse or a QR decomposition to compute the least squares estimator, which I would have expected to be way quicker than the direct computations used in my code. For now, I have only tested the pseudoinverse SMGLM, as the flag to switch to QR decomposition is not actually available in the SMGLM API (only in the weighted least squares class at a lower level of abstraction). Perhaps the SMGLM takes longer to compute because it has a fuller suite of diagnostics, or perhaps the pseudoinverse is not as quick when there is a sparse design matrix (i.e., many fixed effects).

After some additional exploring, I found that there was an upper limit to the number of locations (N = locations**2) that could be considered for the singly-constrained or doubly-constrained models, given the high number of fixed effects. Somewhere between 1000 and 2000 locations, my notebook would run out of memory. To circumvent this, I next developed a version of the custom GLM code that is compatible with the sparse data structures from the SciPy library, which can be found in this branch. This required a custom version of the categorical() function from statsmodels (example here) to create the dummy variables needed for constrained spatial interaction models, and then altering the least squares computations to make sure that no large dense arrays were being created throughout the routine.
Specifically, it was necessary to change the order in which some of the operations were carried out to avoid the creation of large dense arrays. Now it is possible to calibrate constrained spatial interaction models using GLMs for larger N. The most demanding model estimation I have tested was a doubly-constrained model with 5000 locations, which implies N=25,000,000 and a design matrix with dimensions (25,000,000, 10001). Looking forward, I will further test this sparse GLM framework to see if there are any losses associated with it when N is small to moderate.

This week I also explored gradient optimization packages that could be used in lieu of a GLM framework. So far the options seem to be autograd/scipy or Theano. I was able to create a working example in autograd/scipy, though this has not been developed any further yet and will likely be pushed off until later in the project. For now, the focus will remain on using GLMs for estimation.

Next week I will begin to look at diagnostics for count models, and alternatives for when we have overdispersed/zero-inflated dependent variables.

### Shubham_Singh (italian mars society)

#### GSOC 2016 Community Bonding Period -IMS

Hello everyone. I have started working on my new project for the Italian Mars Society, which is going great with the help of very supportive mentors. This is the progress of the project so far.

• First of all, I started by reviewing the available code in the organisation’s repository, starting with the health monitor, then the heart rate monitor and finally the habitat monitor.
• As my project will be implemented with the Habitat Monitor GUI, I imported all the relevant Habitat Monitor files to better understand its GUI and functionality.
• Successfully installed MongoDB for the database, and Python Qt4 and Qt Designer for GUI development using Python.
• Further, I will start keeping parallel documentation of the project, covering everything from scratch, i.e. installation of the required libraries and dependencies of the project on my Ubuntu 14.04.

As of now, project development is a little slow, as I currently have my exams from 30th May to 11th June. I am still taking some time out from my studies to work on the project. I can start with the proper coding after 11th June.

### srivatsan_r (MyHDL)

#### Working on HDMI cores

I have completed the HDMI transmitter core; it can only be tested along with the HDMI receiver core. The HDMI receiver core is halfway through: I have completed its decoder module. I am referring to the application notes by Xilinx (XAPP495) for creating the IP cores. XAPP495 was the latest application note released by Xilinx and it contains some reference code in Verilog. There was a small complication in using the XAPP495 reference code, as it contains only a DVI encoder and decoder. I needed HDMI encoder and decoder modules, so I referred to the XAPP460 application notes (the older notes) for the HDMI modules and to the newer notes for the other helper modules.
#### Proposed Timeline for GSoC

Here’s my tentative timeline for GSoC:

May 23
• Become familiar with the HDMI stream specification
• Design interface and Transactors
• Design block diagrams for HDMI cores
• Create Models for each component

June 5
• Generate TX core tests based on the test framework
• Create a pull-request for the new tests
• Complete HDMI TX core implementation
• Create a pull-request for the core implementation
• Corresponding documentation

June 19
• Midterm evaluation
• Generate RX core tests based on the test framework
• Create a pull-request for the new tests
• Complete HDMI RX core
• Create a pull-request for the core implementation
• Corresponding documentation

July 3
• Generate tests for EDID information sharing
• Create a pull-request for the new tests
• Create interface for EDID information sharing
• Create a pull-request for the interface

July 17
• Generate tests for integrating EDID with HDMI cores
• Create a pull-request for the new tests
• Integrate the EDID I2C interface with the HDMI RX and TX cores
• Create a pull-request for the EDID integration

August 1
• Have both HDMI TX and RX cores complete and functional
• Create a pull-request for the core implementation
• Create HDMI core functional tests
• Create a pull-request for the tests
• Create a pull-request for any changes to the core implementation
• Submit documentation for review

August 23
• Re-submit documentation for review as needed

August 24
• Final Evaluation

### Vikram Raigur (MyHDL)

#### GSOC : Second week

During the second week, I made the Run Length Encoder core, which is here. I have kept it as a PR and will merge once I finish a bit more work on it.

I made a modification to the Entropy Coder. I believe it will now use gates instead of the big comparators used in the earlier design. The link for the new Entropy Coder is here.

I also created setup.py and requirement.txt files for my repository.

I discussed the directory for the JPEG Encoder with Chris.
Also, there was a typo in the Verilog version used for testing, which got fixed this week.

I used pylint and flake8 to make my code look more beautiful.

I split the interfaces into separate classes for inputs, outputs and others.

Overall, this was a very productive week.

### meetshah1995 (MyHDL)

#### The Code has been broken !

RISC-V is one of the most structured ISAs I have come across so far, hence it was easy for me to design a structural module for the decoder. Like any other RISC instruction set, RISC-V has families of instructions with the same family-code and argument list. The instructions which belong to a particular family are distinguished by their own instruction opcodes.

Since the number of families in the ISA is not very large, I chose to segregate the instructions by their family first and then to decode the exact instruction by extracting its individual opcode.

This sounded all fine when I planned it, but the implementation was a bit more difficult than I had thought. I had to come up with a data structure that can efficiently store multiple hierarchies of keys and values, as the bitwise difference between instructions is quite varied, especially in the non-logical and non-arithmetic instructions.

I decided to go with the multiple-key dictionary available directly in Python. I made a hierarchy of keys, each differentiating one instruction from the others, the first key being the family-code. The maximum number of keys that I had to use per instruction was 4, in very large families ([0x14, 0x1C] etc.).

Sample Decode Logic :

addw rd rs1 rs2 31..25=0 14..12=0 6..2=0x0E 1..0=3

My decoder will extract the 5-bit family-code from bits 6 to 2, and the 7-bit slice and the 3-bit slice from bits 31 to 25 and 14 to 12 respectively. Thus, as you can probably guess, the order of the keys will be family-code -> 3-bit slice -> 7-bit slice. When these keys are fed to the decoder table, it spits out the instruction name.
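A small sketch of this nested-dictionary lookup is given below. It only covers the register-register instructions of the 0x0E family (addw, subw, sllw, srlw, sraw), and the names `DECODE_TABLE`, `bits`, `decode` and `encode_r` are made up for the example rather than taken from the project’s decoder. The bit positions follow the addw line above and the standard R-type layout.

```python
def bits(word, hi, lo):
    """Return the bit field word[hi:lo], both ends inclusive."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


# family-code (bits 6..2) -> 3-bit slice (14..12) -> 7-bit slice (31..25)
DECODE_TABLE = {
    0x0E: {
        0b000: {0x00: "addw", 0x20: "subw"},
        0b001: {0x00: "sllw"},
        0b101: {0x00: "srlw", 0x20: "sraw"},
    },
}


def decode(word):
    """Return (name, args) for a 32-bit R-type instruction in the table."""
    if bits(word, 1, 0) != 0b11:
        raise ValueError("not a 32-bit instruction encoding")
    family = bits(word, 6, 2)
    slice3 = bits(word, 14, 12)
    slice7 = bits(word, 31, 25)
    try:
        name = DECODE_TABLE[family][slice3][slice7]
    except KeyError:
        raise ValueError("instruction not in decoder table: %#010x" % word)
    args = {
        "rd": bits(word, 11, 7),
        "rs1": bits(word, 19, 15),
        "rs2": bits(word, 24, 20),
    }
    return name, args


def encode_r(slice7, rs2, rs1, slice3, rd, family):
    """Assemble an R-type word; only used here to build test inputs."""
    return ((slice7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (slice3 << 12) | (rd << 7) | (family << 2) | 0b11)


addw_word = encode_r(0x00, rs2=3, rs1=2, slice3=0b000, rd=1, family=0x0E)
subw_word = encode_r(0x20, rs2=3, rs1=2, slice3=0b000, rd=1, family=0x0E)
decoded_addw = decode(addw_word)
decoded_subw = decode(subw_word)
```

Families that need a fourth key would simply add one more level of nesting under the 7-bit slice.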
Once we have the instruction name, we can easily extract the arguments it takes from the 32-bit instruction.

I got a lot of help from the RISC-V opcode repository, as it lists the args, opcodes and instruction names in a very lucid, readable manner.

Then came the testing part. I created a test for each instruction and a test method for each family. Thankfully they all passed :).

So I guess the code has indeed been broken. Now I have to move on with the remaining RISC-V implementation.

See you next week,
MS

### Vikram Raigur (MyHDL)

#### GSOC : First week

I was working with Python 2 earlier. I created my account on http://discourse.myhdl.org/ and came to know that we should use Python 3. Also, we should work with MyHDL 1.0 dev.

Commands to install MyHDL 1.0 dev and make it work with Python 3:

```
sudo -E git clone https://github.com/jandecaluwe/myhdl   # clone MyHDL 1.0 dev
cd myhdl
python3 setup.py install                                  # run the setup script in the myhdl folder
```

I started running tests for the already made JPEG Encoder using the Python 3 testbench.

I was told to use already made modules from the rhea repository when needed.

I was investigating the FIFO in MyHDL Rhea. I came across a bug and made a PR for it, though Chris came up with the correct solution. As for the bug itself, it was a conversion error.

I made the RLE double FIFO along with proper tests during my first week.

I am making a branch and creating a PR for every module I design. The PR for the RLE double FIFO is here.

It was a great week of understanding the existing code and thinking about modifications.

### jbm950 (PyDy)

#### GSoC Week 2

This week involved more research and learning than coding, and as such the results are less visible. I made some changes to the test code for models.py under mechanics, I spent time looking through the structure and workings of KanesMethod and LagrangesMethod, and I created a PR to hold the EOM class development discussion.
The tests in test_models.py were timing out, so it was suggested that I change the tests from using rhs() to testing mass matrices and force vectors. I put these changes together and on my machine this halved the run time of the tests. After this change was added to PR #11168 and the tests passed, models.py and its test code were merged.

To learn more about how KanesMethod and LagrangesMethod work, I began by reading through much of the documentation for the mechanics module. Included were the pages on the two methods themselves, information on how linearization is performed in the code, and three different examples in which the two methods are used to solve dynamics problems. As I read through the documents I occasionally found places that could use some minor adjustments, and I presented these changes in PR #11117, which has since been merged.

In addition to reading the documentation, I went through the code for each method itself. I made note of all of the attributes, properties and methods that each of the classes contains and compared these between the classes. In order to compare how information is stored in each of the classes, I needed to return to the Kane’s method and Lagrange’s method documentation pages. At this point Jason was able to clarify much of my confusion regarding the different equations that the pages contain. Jason decided this clarification would be useful to add to the pages themselves and introduced PR #11183. I reviewed this PR for him, though my contribution was more along the lines of pointing out places that might need clarification and changes that might improve the flow of the page, rather than changes to the content itself.

The last major line of work this week was beginning the equations of motion class itself. It was suggested by Jason that it might be useful to have the *Method classes return a class containing the equations of motion information.
I have decided that for the moment this would be a better pursuit than creating a parent class from which the *Methods would inherit. The *Method classes tend to differ in their approach to forming the equations of motion, and thus I do not currently believe that a parent class would be overly useful for multiple *Methods. What new methods could benefit from, however, is a unified output format, as the rest of the code could be written to accept the new output rather than specific Method classes. This, I feel, would promote the addition of new Methods more than an inherited base class would. Along this line I have created a branch that holds the code for the new equations of motion class and its test code. Currently the work is to turn the pseudo-code I have into real code.

I have finished reading A Beginners Guide to 6-D Vectors (Part 2) but have made no other progress towards Featherstone’s method this week, other than locating a copy of his book.

### Future Directions

Next week I plan to continue working on filling out the equations of motion class along with its test code. The next step would be to integrate this into KanesMethod and LagrangesMethod to see what tests fail and where additional work would be needed. I also plan to look into Featherstone’s book and the Python implementation of spatial vector algebra to see what I might need to work on in order to implement his method in SymPy.

### PR’s and Issues Referenced in Post

• (Merged) Pydy models migration PR #11168
• (Merged) Physics documentation PR #11117
• (Open) Improved the explanation of the 5 equations in the Kane’s Method docs PR #11183
• (Open) [WIP] EOMBase class development PR #11182

### Articles/Books Referenced in Post

### fiona (MDAnalysis)

#### Auxiliary power, then full steam ahead!

Nearly two weeks of coding down, about eleven to go! It’s been a bit of a slow start, but now I’m starting to get into the swing of things.
This post was originally going to be longer, but I’ve decided to split it, so unfortunately you don’t get as many cats this week. On the plus side, the intended second half – the inaugural What I Learnt about Python This Week (witty alternative name pending) – should be up in the next couple of days, rather than the usual 1+ week! I’ll even give you a sneak preview:

What I will be doing in this post is expanding more on my plans for the first part of my project – adding auxiliary files to a trajectory. I’ll start with a very brief overview of some bits of Python, intended only to try and help the following make a bit more sense if you’ve never used it or programmed before. If you want to actually learn how to use Python, go check out one of the numerous tutorials!

### Some brief notes about Python

In Python, we store data in variables. Depending on what it is, the value we store will have a particular data type: common built-in examples are numbers, strings (a sequence of characters), and lists (a sequence of values, each of which can have a different data type). Functions are things that (usually) take some input and do something with it. A class is essentially a framework/container used to store a bunch of related variables (attributes) and functions (also attributes, more specifically methods) which can use or alter the data attributes. We can create different instances of a class, each with a particular set of attribute values; this allows us to keep track of different related sets of values, and to more easily work with them using the accompanying methods.

### The Way Things Are

In MDAnalysis, working with a simulation trajectory goes something like this: we make u, an instance of the Universe class, to contain information about our system, including the attribute trajectory. trajectory is itself an instance of a Reader class; there are several different Readers, each able to read trajectory data from a particular file format.
trajectory stores general information – like the name of the file where the trajectory data is written, and the time between each recorded timestep (dt) – as well as frame and ts, respectively a number identifying a particular timestep (numbering sequentially from 0) and an instance of a Timestep class which stores the position (and velocity and forces, if available) data from that frame. The trajectory reader has a read_next_timestep() method that (as per the name) can be used to read the next timestep recorded in the trajectory file, and update the frame number and ts with the appropriate values. This allows us to perform particular analyses along the whole trajectory by looping through each frame in turn, updating the data in ts as we go - which is very useful!

What ‘add_auxiliary’ is trying to achieve is to allow us to perform analysis with both the position data and other sets of data not stored in the trajectory file (i.e. ‘auxiliary’ data), while running through this loop.

### The Way Things Will Be

Basically, we want ts to store, in addition to the position data of the current frame from the trajectory file, auxiliary data recorded elsewhere. First, we’ll need some AuxReader classes that will work in a similar way to the trajectory Readers: each instance will store some general stuff like the name of an auxiliary file, a step number, and the value of the auxiliary data for that step (step_data). A read_next_step() method will allow us to update step and step_data as appropriate. These steps may not, however, match up with the trajectory time steps (there are several reasons why we might want to record the auxiliary data, say, twice as often).
First, we can assign each auxiliary step to its closest trajectory timestep:

We can now add another method, read_next_ts(), that’ll use this condition and read_next_step() to read through all the auxiliary steps belonging to the next trajectory timestep, keeping a record of the auxiliary values from each of these in a ts_data list. We’ll also have a get_representative() method that will, based on this list, pick a single value to represent the timestep – we’ll let the user choose if this is, say, an average, or the value from the closest step – and store this representative value, ts_rep, in the AuxReader. For convenience, we’ll also copy this value to ts, to be stored in ts.aux under a custom name.

To the trajectory Reader, we’ll add an auxs list (initially empty), and an add_auxiliary method that will, given a set of auxiliary data, make the appropriate AuxReader instance and add this to the auxs list. Our data structure now looks something like this:

Lastly, we’ll also add a couple of lines so that when we want to read the next timestep, in addition to running read_next_timestep() on the trajectory, we’ll run read_next_ts() on each auxiliary reader in auxs. This all means that when we move to the next timestep, both the position and representative auxiliary data in ts will be updated, and we should now be able to analyse them alongside each other!

There’s a bunch more stuff I’ve skipped over that gives us more options and neatens up the process (I’ll be talking about some of this in the next post, so stay tuned!), but hopefully for now you have a clearer idea of what I’m currently doing. As it stands, I have the framework I’ve discussed here up and (mostly) working, for the test case of auxiliary data stored in the ‘.xvg’ file format; so I’m mostly double checking everything works as I expect, looking at expanding to more general auxiliary readers and seeing what additional user-control can be added.
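The auxiliary-reader scheme described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code in the pull request: the class layout, the `times`/`values` lists standing in for a parsed auxiliary file, and the `represent` option names are all hypothetical, and the sketch assumes auxiliary steps are sorted in time and that frames are read in order starting from 0.

```python
import math
from statistics import mean


class AuxReader:
    """Hypothetical auxiliary reader; not the actual MDAnalysis implementation."""

    def __init__(self, times, values, dt, represent="average"):
        self.times = times          # time of each auxiliary step (sorted)
        self.values = values        # auxiliary value recorded at each step
        self.dt = dt                # time between trajectory timesteps
        self.represent = represent  # "average" or "closest"
        self.step = -1              # index of the last auxiliary step read
        self.frame = -1             # trajectory frame the reader is positioned at
        self.ts_data = []           # (time, value) pairs assigned to that frame
        self.ts_rep = None          # single value representing that frame

    def closest_frame(self, time):
        """Assign an auxiliary time to its closest trajectory timestep."""
        return math.floor(time / self.dt + 0.5)

    def read_next_step(self):
        """Advance to the next auxiliary step; return False when exhausted."""
        if self.step + 1 >= len(self.times):
            return False
        self.step += 1
        return True

    def read_next_ts(self):
        """Read every auxiliary step belonging to the next trajectory frame."""
        self.frame += 1
        self.ts_data = []
        # Peek at the next step and consume it while it maps to this frame.
        while (self.step + 1 < len(self.times)
               and self.closest_frame(self.times[self.step + 1]) == self.frame):
            self.read_next_step()
            self.ts_data.append((self.times[self.step], self.values[self.step]))
        self.ts_rep = self.get_representative()
        return self.ts_rep

    def get_representative(self):
        """Pick one value for the frame: the average or the closest step."""
        if not self.ts_data:
            return None
        if self.represent == "average":
            return mean(value for _, value in self.ts_data)
        frame_time = self.frame * self.dt
        return min(self.ts_data, key=lambda tv: abs(tv[0] - frame_time))[1]


# Auxiliary data recorded three times as often as the trajectory (dt = 3).
times = list(range(9))
values = [10, 20, 30, 40, 50, 60, 70, 80, 90]

averaged = AuxReader(times, values, dt=3, represent="average")
average_reps = [averaged.read_next_ts() for _ in range(3)]

nearest = AuxReader(times, values, dt=3, represent="closest")
closest_reps = [nearest.read_next_ts() for _ in range(3)]
```

In the design described above, the trajectory Reader would call `read_next_ts()` on each reader in its `auxs` list whenever it moves to the next timestep, and copy the returned value into `ts.aux`.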
I now have a work-in-progress pull request, so head over there to see the nitty-gritty code details and/or pitch in with any ideas, feedback or advice; and I’ll see you again soon with Post 3.5!

## June 02, 2016

### Adhityaa Chandrasekar (coala)

#### GSoC '16: Week 1 updates

The past week has been terrific! There has been a slight shuffle in my GSoC timeline - I planned to start with a bearlib library to enumerate the list of required and optional options for a bear and all the possible values that they can take. This would have then been used further on in the core of my project. But I decided to start with coala-quickstart, which is designed to be super-user-friendly and assist the user in the creation of their first .coafile.

In my experience, the beginning is the toughest part of any change or new practice. For a user starting with coala this would be the coafile generation, and this utility is designed to make it easy. My whole project pretty much revolves around coala-quickstart, so this is definitely a big step in the forward direction.

But before I started with coala-quickstart, I needed to complete my left over issues - file caching and casing bear. So I worked in parallel on both GSoC and the completion of these two. I'll take a quick minute to explain both:

• Caching is basically a core performance improvement. Currently, when you run coala on a project, all files are collected and linted. But this is a waste of computation power - why not just run coala on those files that have changed since the last time? That's exactly what caching is supposed to do. And the speed improvements have been amazing - coala is now 3x faster (benchmarked on coala source code itself).
• CasingBear is the other one. While most of my contribution in coala has been in core, I thought this would be an awesome bear to create.
To give you an overview, the bear makes it incredibly easy to enforce/toggle the casing technique in your functions, variables, classes, objects and so on between snake_casing, camelCasing and PascalCasing.

Both are set to feature in the next coala release 0.7/0.3 (caching is initially going to be an experimental feature that needs to be enabled manually, but we hope to make it fully stable and default in 0.8). Will keep y'all updated about it.

Anyway, coming back to GSoC, coala-quickstart was previously in a very early stage. I've added a few cool features in the past week. Here's an overview:

• ASCII bear logo art! Pretty cool huh?
• There are now proper libraries to prompt a question to the user and give info. Both are colorful too, thanks to PyPrint, a system independent Python module for colors in a terminal.
• Project based bear suggestions! Is your project mostly in Python? Just tell us your project directory and a glob pattern to match the files you want to lint (and ones you want to ignore), and coala-quickstart will automatically list the most relevant bears you might be interested in (such as PEP8Bear, PyLintBear, and so on along with language independent bears such as LineLengthBear).
• A neat interface for the user to know about bears.
• An introduction to sections followed by prompting the user for the sections they want in the project. This is then followed by simple questions for each section.
• Writing all the obtained settings to a .coafile in the project directory.

Here's a neat asciinema showing off all the features:

Anyway, that was week 1. Looking forward to week 2 and the rest!

-- Adhityaa

### Valera Likhosherstov (Statsmodels)

#### GSoC 2016 #0

## About me

Welcome to my blog! Feel free to leave comments or anything. I am a student from Russia, right now I am in my 3rd year at Ural Federal University studying Computer Science.
Also, I am studying at the Yandex School of Data Analysis, which is a free Master's level program by Yandex, the Russian search engine and the biggest IT company in the country. My interests lie in the areas of Statistics and Machine Learning, and also a little in Computer Vision and Natural Language Processing. I really love Mathematics, as well as its application, expressed within code. My hobbies are listening to hip-hop music, working out in the gym, visiting art galleries and reading novels by Dostoyevsky.

This year I am happy to take part in Google Summer of Code in Statsmodels under Python Foundation. It's a great opportunity, and I hope I will pass all deadlines successfully :)

## About my GSoC project

The goal is to implement a Python module to do inference for linear state space models with regime switching, i.e. with underlying parameters changing in time by a Markovian law. A device for that, called the Kim Filter, is described in the "State-space Models With Regime Switching" book by Chang-Jin Kim and Charles R. Nelson. The Kim Filter includes the Kalman Filter as a phase, which makes my work much easier and motivates a pure Python approach, because I can delegate all the heavy lifting to statsmodels' Cython module performing the Kalman Filter routine.

The next step of my project is to implement well-tested econometric models with regime switching, including Markov switching autoregression, a Dynamic Factor model with regime switching and a Time varying parameter model with Markov-switching heteroscedasticity. You can find details and the exact specification of the models in my proposal. To perform testing, I am going to use Gauss code examples, published on Professor Chang-Jin Kim's website.

## Setting up a development environment

To set up the environment, I followed the advice of my mentor Chad Fulton, who helps me a lot with technical, coding, and field-specific issues. This will probably be helpful to anyone who wants to contribute to statsmodels or any other Python library.
I am using Mac OS, so I performed the following steps:

1. Deleted all versions of Statsmodels in my site-packages directory.
2. Cloned the Statsmodels master repository to ~/projects/statsmodels (or anywhere else in your case).
3. Added ~/projects/statsmodels to my PYTHONPATH environment variable (included line export PYTHONPATH=~/projects/statsmodels:$PYTHONPATH at the end of ~/.bash_profile).

Now, any changes made to Python files are available when I restart the Python instance or use reload(module) command in the Python shell.
If I pull any changes to Cython files, I recompile them with python setup.py build_ext -i in statsmodels folder.

## Running Gauss code

Code examples, provided by the "State-space Models With Regime Switching" authors, require a Gauss language interpreter, which is not free software. But there is an open-source Ox console, which can run Gauss code.
However, OxGauss doesn't support some Gauss functions by default, and you have to load analogous Ox language wrappers. In my case that was the function optmum, widely used in the Kim-Nelson code samples; the Ox Console developers provide the M@ximize package for it. Another problem I spent some time figuring out is that M@ximize is incompatible with Ox Console 7, so I have to use version 6, which works just fine.

## What's next?

I will post regular reports about my work during the whole summer. I have already implemented a part of my project, so the next report is coming soon. But if interested, you already can see the code. Next time I will talk about design of Kim Filter and details of its implementation and testing. I'm sure you'll enjoy that!

### Riddhish Bhalodia (dipy)

I have been working on two things this week: tidying up the adaptive denoise PR, which needs work as we intend to have it merged within the next 2 weeks, and debugging and improving the local PCA based denoising. The branch where I am adding my code for adaptive denoising is here

Well this was a good week😀

This is based on the methods described in blogpost #2. I will provide an overview of what has been done and how it will be incorporated into DIPY’s framework. I will also give a few results and outline the things which are left to be done.

1. The nlmeans implementation which followed a voxelwise averaging approach [1] now has a keyword type. When type = 'blockwise' it performs a more robust but slower blockwise averaging [2, 3]. The example (doc/examples/denoise_nlmeans) has been updated to show both results, as shown below. Proposed nlmeans and the associated example

2. Adaptive soft coefficient matching (ASCM) [3], again described in blogpost #2, is added to DIPY. The code is given here, along with its associated example.

3. Added functional tests for ASCM and blockwise averaging of nlmeans
4. Added wavelet.py in dipy/core/

Now the things left to be done before merging.

1. Optimize the blockwise averaging approach, see if any more speedup can be achieved in the cython code.
2. Documentation

## Local PCA

This week I have identified a few problems with the code and I believe we are very nearly there with respect to the MATLAB output [6]. The following are the changes and titbits which led me to the results

[A] The D must be converted to a binary matrix

The diagonal matrix of the eigenvalues is thresholded during the local PCA and then a modified matrix D_hat is constructed which is a binary diagonal matrix with ones corresponding to the retained eigenvalues and zeros elsewhere. This was a big catch! (but we don’t use D anymore😀 )

[B] Plot of the retained eigenvalues

This is shown in figure below, the highest number of eigenvalues which can be retained is 22 here.

[C] Local noise variance, overcomplete or not ?

Should the computation of the local variance of the lowest eigenvalue image for MUBE and SIBE be done in an overcomplete manner or in a standard manner?

[D] Corrected sigma, direction independent or not?

This is one of the sources of confusion. The sigma estimated by taking the local variance from the eigenvector corresponding to the lowest eigenvalue in MUBE and SIBE is computed for every voxel (a 3D array). When it is corrected, it is divided by a function of SNR (which is a gradient direction dependent 4D array), and hence the corrected sigma becomes gradient direction dependent! So what we did was compute SNR by taking the local mean as a whole over all the gradient directions.

[E] Median based noise estimation

This method to estimate noise is mentioned in [5] and we thought to try it out in our framework, replacing MUBE and SIBE. But in the end I didn’t use it, as it gave less detailed results than we get from MUBE and SIBE.

[F] Correction in PCA (This step made the difference I guess)

I replaced the following lines of code

D_hat = np.diag(d)
Y = X.dot(W)
X_est = Y.dot(np.transpose(W))
X_est = X_est.dot(D_hat)


with the following, just like the standard PCA

D_hat = np.diag(d)
W_hat = W * 0
W_hat[:, d > 0] = W[:, d > 0]
Y = X.dot(W_hat)
X_est = Y.dot(np.transpose(W_hat))
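To see why this change matters, here is a small self-contained sketch on random data. It assumes that W is the full orthonormal eigenvector matrix of the covariance and that d is a binary mask over the eigenvalues; the data, the median threshold `tau` and the variable names `X_old`/`X_new` are made up for illustration and are not the values used in the DIPY code. Under these assumptions the replaced version only zeroes columns of X, while the corrected version projects X onto the retained eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))       # rows are samples, columns are features
X = X - X.mean(axis=0)             # centre the data

C = X.T.dot(X) / X.shape[0]        # covariance matrix
eigvals, W = np.linalg.eigh(C)     # columns of W are orthonormal eigenvectors

tau = np.median(eigvals)           # hypothetical threshold, for illustration only
d = (eigvals > tau).astype(float)  # 1 for retained eigenvalues, 0 otherwise

# Replaced version: W.dot(W.T) is the identity, so this is just X.dot(D_hat),
# which zeroes columns of X in the original feature space.
D_hat = np.diag(d)
X_old = X.dot(W).dot(np.transpose(W)).dot(D_hat)

# Corrected version: project onto the retained eigenvectors only.
W_hat = W * 0
W_hat[:, d > 0] = W[:, d > 0]
X_new = X.dot(W_hat).dot(np.transpose(W_hat))
```

Here `X_new` equals `X.dot(W).dot(np.diag(d)).dot(W.T)`, i.e. the thresholding is applied in the eigenvector basis, which is what standard PCA truncation does.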


## Results

This is the good news! We have nearly got to the denoising level we want to achieve with local PCA [4], below are the results shown for slice 10 of a diffusion data, the plots are for raw data, the output of the MATLAB [6] implementation and finally the current python implementation. (Sorry for the plots being so small!)

## Next Up …

I would target following things next week in Local PCA

1. Clean up the code and add more robustness
2. Debug it just a little more
3. Try different changes in noise estimation and compare RMSE with the matlab output so that we can get an idea for the best implementation
4. Rician Correction is still left to be done
5. Optimization of the code

1. Incorporate suggestions that may come in
2. Add few more tests if required
3. Documentation generation for dipy

## References

[1] Impact of Rician Adapted Non-Local Means Filtering on HARDI
Descoteaux, Maxim and Wiest-Daessle, Nicolas and Prima, Sylvain and Barillot, Christian and Deriche, Rachid
MICCAI – 2008

[2] An optimized blockwise nonlocal means denoising filter for 3-D magnetic resonance images
Coupé P, Yger P, Prima S, Hellier P, Kervrann C, Barillot C. Ieee Transactions on Medical Imaging. 2008;27(4):425-441. doi:10.1109/TMI.2007.906087.

[3] Adaptive Multiresolution Non-Local Means Filter for 3D MR Image Denoising
Pierrick Coupe, Jose Manjon, Montserrat Robles, Louis Collins.
IET Image Processing, Institution of Engineering and Technology, 2011. <hal-00645538>

[4] Diffusion Weighted Image Denoising Using Overcomplete Local PCA
Manjón JV, Coupé P, Concha L, Buades A, Collins DL, et al. (2013) PLoS (Pub Library of Science) ONE 8(9): e73021.

[5] MRI noise estimation and denoising using non-local PCA
Jose Manjon, Pierrick Coupe, Antonio Buades
Medical Image Analysis Journal (MedIA), 2015

### Ravi Jain (MyHDL)

#### The first merge to main branch!

I spent the previous two days making changes in the Interfaces for ports and trying to resolve all the errors in being able to instantiate the GEMAC core remotely using the interfaces. And finally I merged my first pull request from the dev branch into the main branch yesterday, after getting review from Tom Dillon and Christopher Felton. It feels good to get started!

To test the instantiation I started to look into PyTest, and was able to instantiate the core using interfaces after solving a lot of module import problems.
One thing I learnt which may be useful for others: when you use ‘py.test’ on the command line to run all the tests, the directory you are calling from is not added to the Python path unless it is explicitly specified. This can be a problem if you are still developing the core and haven’t installed the package yet, or if the changes are so frequent that you would have to reinstall the package every time. A simple solution is to call the tests using ‘python -m pytest’, which adds the current directory to the Python path so that modules in it and its sub-packages can be imported.
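The effect of the `-m` switch can be checked without pytest at all. The sketch below is only an illustration: the module name `uninstalled_probe` and the helper `run_module_from` are made up, and the probe stands in for a package that lives in the working directory but has not been installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_module_from(directory, module_name):
    """Run `python -m module_name` with `directory` as the working directory."""
    result = subprocess.run(
        [sys.executable, "-m", module_name],
        cwd=directory,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    # A module that exists only in the working directory, i.e. not installed.
    Path(tmp, "uninstalled_probe.py").write_text("print('found in cwd')\n")
    returncode, output = run_module_from(tmp, "uninstalled_probe")
```

The module is found because `python -m` puts the current working directory on `sys.path`; this is the same mechanism that lets `python -m pytest` import a package that is still under development.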

Now, I am working on including the FIFOs somewhere in the host-client interface, which will play an important part in providing features for Flow Control, and then on developing tests for the top-level modules.

## My Project so far:

The coding phase has begun and unfortunately I wasn’t able to start right away, but I had already done some part in the community bonding period and was able to work on another aspect of my project before the week got over. So now, finally, let me introduce you to my project, which is *drumrolls* Generic Spacing Correction. I know it doesn’t sound very exciting, right? But believe me, the “Generic” makes it pretty exciting.

Demystifying the name, my project aims to do what you already see in most of your editor programs, be it natively or through plugins: automatically indenting your code. It doesn’t stop there though: it will indent your code regardless of the language you use! So you basically indent your C files and Python files with the same algorithm! Bye-bye ctrl + I? Maybe, but there’s a lot of work to do.

See, every language style guide follows some basic rules when it comes to indentation, one of them being: there are levels of indentation and each level “means” being part of the same context.

So I just have to identify these levels. Easy, right? Well, for a particular language it’s not so difficult, but when it comes to managing the same for all languages out there, the task becomes a little daunting.

## Here’s how I plan to tackle these problems:

First of all, my algorithm won’t support all languages from the start; it will support basic languages like C, C++, Java and maybe even Ruby. This is because, unlike languages like Python, these languages have markers to specify when to start and end indentation.

So far I’ve been able to come up with a very basic implementation, though it still lacks indentation based on keywords and also absolute indentation, like:

if(condition)
    indents


and

int some_function(param1,
                  param2,
                  param3)


Other than that, it works for basic indentation of blocks specified by those indent markers. My PR has all the code for this basic indentation algorithm and is still under review/WIP.

Later on I’d like to make this algorithm configurable to the extreme.

Apart from basic functionality like hanging indents and whatnot, it’d be nice to have an algorithm which is configurable to all styles of indentation.

## What is Style of indentation?

Apparently there are many ways of correctly indenting your code, and it’s up to
the community what type of indentation they want to follow.
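As a rough idea of what marker-based block indentation looks like, here is a minimal sketch. It is not the code from the PR: the function name and parameters are hypothetical, and it deliberately shares the limitations mentioned above (no keyword-based indents, no hanging indents), and it also does not account for markers inside strings or comments.

```python
def indent_by_markers(lines, open_marker="{", close_marker="}", indent="    "):
    """Re-indent lines according to the nesting depth of block markers."""
    depth = 0
    result = []
    for raw in lines:
        stripped = raw.strip()
        # A line that starts by closing a block belongs to the outer level.
        leading_close = 1 if stripped.startswith(close_marker) else 0
        depth = max(depth - leading_close, 0)
        result.append(indent * depth + stripped if stripped else "")
        # Markers on this line change the depth of the lines that follow.
        depth += stripped.count(open_marker)
        depth -= stripped.count(close_marker) - leading_close
        depth = max(depth, 0)
    return result


source = [
    "int main() {",
    "if (x) {",
    "y = 1;",
    "}",
    "return 0;",
    "}",
]
formatted = indent_by_markers(source)
```

Because the markers and the indent string are parameters, the same loop can be pointed at other marker pairs, which is the sense in which the approach is generic.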

For example:

if(something){
    // code
}

if(something)
{
    // code
}

if(something)
    {
    // code
    }


All of these are ways of indenting an if block. None of these is wrong and it’s
entirely up to you which style you prefer; a generic and versatile algorithm would support all three via configuration.
Sadly my basic implementation doesn’t support the third kind yet.

All in all it’s just the beginning, and I’m really excited to see how the project develops and what type of algorithms I’m able to deliver. Hopefully I’ll be back next time with a more functional and useful algorithm.

### Aakash Rajpal (italian mars society)

#### Finally Coding and Getting results!

After a bad (expected) start, I was able to achieve some part of my project. I studied the existing Habitat Monitoring Client (HMC) server, as I had to modify this server for my project. I played around with it and tried to get the necessary data. The HMC server is a GUI based server and I had to use the server without the GUI. Basically, my task was to remove the GUI from the server and modify it further so that the data can be sent to the Blender client.

I began coding the server and had some success initially. I was in touch with my mentor and after 3 days was able to get the server working completely without the GUI. It was working fine, as required for my project, and I contacted my mentor to tell him about it. He asked me to push my code to my repository on Bitbucket.

Finally some coding and pushing! I felt really happy as I pushed my first piece of code (technically not mine, but still I pushed it). It was there on the repo, and now it was time to start working on the Blender client and make a socket connection between the client and server.

I studied socket programming in C at my college, but never really tried it with Python. I studied a bit about it and understood it was the same except for some syntax. But my task was a bit harder, as my client wasn’t a normal TCP Python client; it was a Blender client (bpy). So I had to study a bit of Blender scripting through its Python API called bpy. I was very pleased as I got to learn two new things, and I guess this is what GSoC is about.

I learned, studied, coded and finally was able to establish a TCP socket connection between the Blender client and the HMC server. I pushed the code to my repo and felt satisfied.

### Pulkit Goyal (Mercurial)

#### Project, Organisation and Mentor

The community bonding period came to an end about a week ago. I have learnt more about how to contribute to open source and how to interact with other developers in the community. In this post, I will be writing about my GSoC project, the organisation I am working for and the awesome mentor who is guiding me.

### Anish Shah (Core Python)

#### GSoC'16: Week 1

This blog post is about my progress in the first week of GSoC 2016.

## Adding GitHub PR to a b.p.o issue

Last week, I told you all about linking GitHub Pull Requests to an issue on b.p.o. I had finished working on this feature during the last week of the community bonding period itself. However, I still had to add some tests for this feature. So this week, I completed writing the tests for this feature and submitted the complete patch for it. You can find the patch here

## Show Pull Request state on b.p.o

Now the next task was to show the state of pull requests linked to an issue. The state can be open, closed or merged, as it is shown on GitHub. This is also done in real time using GitHub webhooks. GitHub sends a payload for an event occurring on a Pull Request (like opened, closed, reopened, sync). The payload has information about the state of the PR. On the UI side, GitHub generally shows three states for a PR (open, closed or merged). But on the API side, it has just two states (open/closed) and an extra boolean field “merged”. For b.p.o, we decided to have just three states for each PR (open, closed or merged). You can find the complete patch for this feature here
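The mapping from the API's two states plus the boolean to the three states shown on b.p.o can be sketched as below. This is an illustration rather than the code from the patch: the function name is made up, and the dictionaries are trimmed-down stand-ins for a webhook payload that keep only the `pull_request.state` and `pull_request.merged` fields.

```python
def pull_request_state(payload):
    """Collapse GitHub's state + merged flag into open / closed / merged."""
    pr = payload["pull_request"]
    if pr["state"] == "open":
        return "open"
    # A merged PR is reported as state "closed" with merged == True.
    return "merged" if pr.get("merged") else "closed"


opened = {"action": "opened", "pull_request": {"state": "open", "merged": False}}
rejected = {"action": "closed", "pull_request": {"state": "closed", "merged": False}}
accepted = {"action": "closed", "pull_request": {"state": "closed", "merged": True}}

states = [pull_request_state(p) for p in (opened, rejected, accepted)]
```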

Once a contributor opens a pull request on GitHub, that person is automatically subscribed to it. But developers who are in the nosy list might not get notified about the pull request. So, it would be good to add GitHub comments on b.p.o. However, contributors who are subscribed to both would get duplicate mails about the same thing. So we decided on adding GitHub comments on b.p.o only if no other GitHub comments were added in the last X minutes. This was also done using GitHub webhooks. I have submitted a patch here
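The "only if no other GitHub comments in the last X minutes" rule boils down to a single time comparison. The sketch below assumes the time of the last GitHub comment copied to the issue is stored somewhere and passed in; the function name and arguments are hypothetical, and the post does not say what value X takes.

```python
from datetime import datetime, timedelta


def should_post_comment(now, last_comment_time, quiet_minutes):
    """Return True if no GitHub comment was posted in the last quiet_minutes."""
    if last_comment_time is None:
        return True  # nothing has been posted on this issue yet
    return now - last_comment_time >= timedelta(minutes=quiet_minutes)


now = datetime(2016, 6, 2, 12, 0)
recent = should_post_comment(now, datetime(2016, 6, 2, 11, 55), quiet_minutes=30)
stale = should_post_comment(now, datetime(2016, 6, 2, 11, 0), quiet_minutes=30)
first = should_post_comment(now, None, quiet_minutes=30)
```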

Thanks for reading. I’m very excited for the next 12 weeks. Let me know if you have any questions about my project. :)

## May 31, 2016

### Ranveer Aggarwal (dipy)

#### Clickity Click

In my last blog post I had started on building the UI with a button element. The problem with the one I had initially built was that it was in 3D. While 3D is cool, it’s not very useful when it comes to buttons that are 2D, i.e. you don’t want your button to reduce to a line when you rotate the scene by 90 degrees.

And so, I learnt how to build 2D overlays. Now an actor in VTK can be of two types - a 3D actor and a 2D actor. For an overlay button, we want a 2D actor, since it stays stuck on the screen and doesn’t move as you move the screen. Next was to associate a PNG with the button and here’s how you can do so:

import vtk

# Read the icon; icon_fname is the path to the PNG file
png = vtk.vtkPNGReader()
png.SetFileName(icon_fname)
png.Update()

# Convert the image to a polydata
imageDataGeometryFilter = vtk.vtkImageDataGeometryFilter()
imageDataGeometryFilter.SetInputConnection(png.GetOutputPort())
imageDataGeometryFilter.Update()

# Map the polydata in 2D (screen) coordinates
mapper = vtk.vtkPolyDataMapper2D()
mapper.SetInputConnection(imageDataGeometryFilter.GetOutputPort())

button = vtk.vtkTexturedActor2D()
button.SetMapper(mapper)


The button is the 2D actor you need. Next, we need to capture the click for the button. For this, VTK has a prop picker which can be used in the interactorStyle to get which prop was clicked. Here’s how you do it:

# Define the picker
picker = vtk.vtkPropPicker()
# Pick whatever prop lies at the position of the click
picker.Pick(click_pos[0], click_pos[1], 0, self.renderer)
# GetViewProp and GetActor2D both can be used alternatively to get the actor2d at the position of the click.
actor_2d = picker.GetViewProp()  # actor = picker.GetActor2D()
if actor_2d is not None:
    actor_2d.InvokeEvent(evt)
# If there is no actor2d at the click position, we see if there is a 3D actor.
# We give more preference to the overlay buttons
else:
    actor_3d = picker.GetProp3D()
    if actor_3d is not None:
        actor_3d.InvokeEvent(evt)
    else:
        print("No actor at this position")


And so, the button clicks have been captured using vanilla VTK, without using a button widget which gives us slightly less control.

The result, this.

A button overlay

I have built a button class with loads of features - button change on event, ability to add callbacks on events etc. These have been partially documented, the complete documentation will be done later.

### Next Steps

The next UI element I’ll be building is a text box. Let’s see how it goes.

### Karan_Saxena (italian mars society)

#### Google Summer of Code 2016 Payment Options

Case 1

Bank: SBI
Currency: USD, i.e. conversion done by SBI

From [1]
Remittance Inward: No charge
Foreign Currency Conversion charges: 250 /-

Service Tax is calculated as below:
1) 0.145% of the gross amount of currency exchanged for an amount up to 1,00,000, subject to minimum of 35/-
2) 145 and 0.0725% of the gross amount of currency exchanged for an amount of rupees exceeding 1,00,000 and up to 10,00,000
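
The two slabs above can be written as a small function, which makes it easy to check the figures for any amount. This is only a sketch of the rule as quoted here: the function name is made up, the second slab is read as 145 plus 0.0725% of the part exceeding 1,00,000, and amounts above 10,00,000 are not handled because the rule for them is not quoted.

```python
def service_tax(amount_inr):
    """Service tax on a currency conversion, following the two slabs above."""
    if amount_inr <= 100_000:
        # 0.145% of the gross amount, subject to a minimum of 35
        return max(0.00145 * amount_inr, 35.0)
    if amount_inr <= 1_000_000:
        # 145 plus 0.0725% of the part exceeding 1,00,000
        return 145.0 + 0.000725 * (amount_inr - 100_000)
    raise ValueError("slab above 10,00,000 is not covered here")


for amount in (33_300, 148_500, 181_500):
    print(amount, round(service_tax(amount), 2))
```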

Our transactions will be $505 + $2250 + $2750. Assuming a Base Rate of 66 (as of today, given by SBI. See [2]), the amounts are 33300, 148500 and 181500.

The service tax would therefore be 48.28, 180.16 and 204.08.

Total money deducted on the overall transaction: 250*3 + 48.28 + 180.16 + 204.08 = 1182.52

Note that the banks do not provide live rates. See SBI rates for the day on [2].

Total money received: 363330 - 1183 = 362147

All the other fees are covered by Google for the 2016 term.

Case 2

Bank: SBI
Currency: INR, i.e. currency converted by Payoneer

Note that Payoneer provides live rates as given on [3].

Payoneer fees
Current rate: 66.60
Amount deducted: 7332.66

Total money received: 363330 - 7333 = 355997

Important Links

Disclaimer: All these details are true as on 4th May, 2016 and are applicable only for GSoC 2016. Things might change.

## May 29, 2016

### Karan_Saxena (italian mars society)

#### Community Bonding Period Updates

So today marks the day we begin our coding period. (Well, I must admit I got a bit late in publishing this blog post.)

I will be describing the setup that I have put together in the past month.

As a recap, my work during the summer will be with Italian Mars Society (IMS), on a project called ERAS - European MaRs Analog Station. As the name suggests, the aim of ERAS is to build an analog base station which will be used to train astronauts to be capable of visiting the Red Planet. The primary stage of this project is to build a virtual setup of the same, called V-ERAS or Virtual-ERAS.

## Virtual ERAS station “on Mars”

V-ERAS uses multiple 'Microsoft Kinect for Xbox One' devices to recognize and track user body and movements. Data from Kinect is also placed on a Tango bus, to be available for any other V-ERAS module. Kinect feed data is used to animate an avatar with the Blender Game Engine.
The latter is also responsible for drawing the whole virtual Martian environment, managing interactions among multiple users and allowing them to interface with tools, objects and all terrain vehicles (ATVs). V-ERAS also uses Motivity - a static omnidirectional treadmill, on which users can move to walk on the emulated Martian environment.

Microsoft Kinect for Xbox One

IMS-ERAS is a project under Python Software Foundation (PSF) for GSoC'16. The description of all the selected projects for IMS, under PSF for GSoC'16, can be found on this link.

The title of my project is "Improving the step recognition algorithm used in ERAS virtual environment simulation". In brief, the aim of my project is to

1) improve the feet recognition in the body tracker module written by Vito Gentile during GSoC'15.
2) port existing code to use Kinect v2, using PyKinect2

## A user walking on Motivity

## My RGB and depth data from Kinect Sensor

My first step was to get my hands on Kinect and to make sure it works on my laptop. My laptop configuration is Intel i5 3.3GHz with 12GB RAM and Nvidia NVS 5400M. One important point to be noted here is that Kinect requires USB 3.0.

Initially, I began with Kinect Studio v2.0 and SDK Browser v2.0 (Kinect for Windows) to test the data being received and sent by the Kinect sensor. Kinect v2 sends an overwhelming amount of data. At runtime, it was sending 5GBps of data.

For the project, I will be going forward with PyKinect2, based on Microsoft Kinect SDK 2.0 (compliant with Kinect for Xbox One). More details about PyKinect can be found on Github.

During these days, I also set up Tango, both on Windows and Ubuntu. The instructions for the same can be found in the ERAS Documentation.

Time to start coding now :)

Onwards and Upwards!!

### Redridge (coala)

#### Week 1 - Creating the decorator

## Recap

I think I will be starting each week's post with a "Recap" section. In that section I will explain the motivation and aim behind each week's coding.
So my project is about making it possible to develop coala bears in literally any programming language. This is useful because an open source community's survival depends on contributors. The easier you make it for contributors to contribute, the higher the chances that you will have an everlasting and successful open source project.

The way I chose to implement this functionality isn't that complicated: basically we have a normal Python bear (the wrapper bear) whose sole purpose (in this world) is to start a process for a given executable (binary, script, etc.), send it one big line containing a JSON string and then wait for one big line containing a JSON string in return. All the relevant information (file name, lines and optionally settings) will be passed through that JSON, and similarly Result¹ and Diff² objects will be passed back. Yes, you might legitimately argue that we add some overhead, which is totally true, but this is the trade-off we are willing to make for such a feature.

A modular and extendable way to build such a feature would be to have a class that contains all the basic functionality of these so-called wrapper bears; every time we want to create a new one, we just inherit from that class and make small changes (say, the executable file name). In coala there is already a class that does something similar: it integrates all kinds of bears that use linters³. So I had a starting point.

Back to the recap: my goal for this week was to start that "class", which I named ExternalBearWrap (with the help of the community :D). I chose to implement it similarly to the already existing Linter class, which makes use of Python's decorators.

1. Object that creates the results for coala to display
2. Object that creates the results that also offer a suggested fix
3. Static analysis tools, usually built for each language. coala offers a way to integrate those in bears.

## Python decorators

I spent a substantial amount of time this week learning about decorators and their use cases.
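The wrapper idea from the recap, written as a decorator, can be sketched roughly as follows. This is only an illustration under assumptions: the decorator signature, the JSON field names (`filename`, `file`, `results`) and the `LongLineBear` class are hypothetical and are not coala's actual API. The "external bear" is simulated by a tiny Python child process so that the example runs on its own.

```python
import json
import subprocess
import sys


def external_bear_wrap(executable, *args):
    """Class decorator (hypothetical signature) that adds a ``run`` method
    which talks to an external process using one JSON line each way."""
    def decorate(cls):
        def run(self, filename, lines):
            # One JSON document on a single line goes to the child's stdin ...
            request = json.dumps({"filename": filename, "file": lines})
            proc = subprocess.run(
                [executable, *args],
                input=request + "\n",
                stdout=subprocess.PIPE,
                universal_newlines=True,
                check=True,
            )
            # ... and one JSON document on a single line is expected back.
            return json.loads(proc.stdout.splitlines()[0])

        # `executable` and `args` are remembered by the closure.
        cls.run = run
        return cls
    return decorate


# Stand-in for a bear written in another language: it reads the request
# and reports every line longer than 10 characters.
CHILD = (
    "import json, sys\n"
    "req = json.loads(sys.stdin.readline())\n"
    "res = [{'line': i + 1, 'message': 'line too long'}\n"
    "       for i, l in enumerate(req['file']) if len(l) > 10]\n"
    "print(json.dumps({'results': res}))\n"
)


@external_bear_wrap(sys.executable, "-c", CHILD)
class LongLineBear:
    pass


reply = LongLineBear().run("demo.txt", ["short", "a much longer line"])
print(reply)
```

The decorated class contains no logic of its own, which is the point of the decorator approach: a wrapper only has to name its executable.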
They make use of a functional programming aspect called function closure. I will not be detailing function closures and decorators here; instead I will point you to this link if you want to learn for yourself.

Instead of having a class from which the other wrappers will inherit, we make a decorator for the other wrappers to use. The decorator approach has several advantages, but the most important one is that it minimizes the code written in the wrappers. We want this because developers who choose to write bears in other languages obviously want to write close to zero Python. These wrappers will be auto-generated later on in the project.

## Wrap Up

To sum it up, I managed to write the external_bear_wrap decorator. So far the wrapper bears can send input to, and receive output from, a given executable passed as a decorator argument. Next week the functionality should be completed by sending and receiving the appropriate JSON and parsing it into coala Result and Diff objects.

### liscju (Mercurial)

#### Coding Period - I Week

In the first week I started to work on the solution (really amazing, isn't it?) :P

The first thing was setting up the weekly meeting time with my mentor to discuss the goals to accomplish each week.

The second thing was discussing some design issues about the solution - how to fit the feature into the working code. The decision for now is to reuse the proto.statlfile implementation for that. Previously, statlfile was a remote store procedure that checked whether a given file is accessible in the remote store. Now it will also return the location of the file.

And the most important thing - I started working on the simplest possible implementation of the feature: client and server on the same machine, with the separate location for storing large files also on the local machine. One could ask what the point is of working on such a simple subset of the solution, but in my view this is a really important phase.
In this phase you must design tests for the solution and make design decisions about how things fit together. Working on a subset of the solution and getting feedback as quickly as possible also gives space to discuss it and learn from it. I'm in the middle of making changes, so implementation details are constantly changing, and Mercurial's features for changing history are really helpful when working like this. If you haven't encountered the evolve extension, I really suggest checking it out:

https://www.mercurial-scm.org/wiki/EvolveExtension

Apart from the project, my patch "send stat calls only for files that are not available locally", which I described in the previous post, was merged into the main repository. You can have a look here:

https://selenic.com/hg/rev/fd288d118074

My patches about migrating largefiles to Python 3 haven't been merged so far, but while working on the solution I found a bug in import-checker.py (an internal tool for checking whether imports in source files conform to the style guide) - it didn't properly recognize relative imports of the form

from ..F import G

I fixed this bug and sent the patch to the mailing list, and it's already merged. You can check it here:

https://selenic.com/hg/rev/660d8d4ec7aa

I also got some feedback about the excessive password prompt I described in the previous post and on the mailing list:

https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-May/084490.html

As soon as I accomplish all the goals of this week and have some free time, I'm going to send a new proposal for the solution.

### SanketDG (coala)

#### Summer of Code Week 1

So as I said in my previous blog post, I am working with coala on language independent documentation extraction for this year’s Google Summer of Code.

It has been one week since the coding period started, and there has been some work done! I would like to explain some stuff before we get started on the real work.

So, my project deals with language independent documentation extraction. Turns out, documentation isn’t that independent of the language.
Most programming languages don’t have an official documentation specification. But it could be said that documentation is independent of the documentation standard (hereby referred to as docstyle) it uses.

I have to extract parts/metadata from the documentation, such as descriptions, parameters and their descriptions, and return descriptions, and perform various analyzing routines on this parsed metadata.

Most of my work is with the DocumentationComment class, where I have to implement routines for each language/docstyle. I started out with Python first for two reasons:

• It's my favourite programming language (Duh!)
• coala is written in Python! (Duh again!)

So Python has its own docstyle, known as “docstrings”, which are clearly defined in PEP 257. Note that PEP 257 is just a general style guide on how to write docstrings. The PEP contains conventions, not laws or syntax. It is not a specification.

Several documentation tools, such as Sphinx and Doxygen, support compiling these docstrings into API documentation. I aim to support both of them.

So, I have come up with the following signature for DocumentationComment:

DocumentationComment(documentation, language, docstyle, indent, marker, range)

Now let’s say doc is an instance of DocumentationComment. doc would have a function named parse_documentation() that would do the parsing and get the metadata.

So if I have a function with a docstring:

And I load this into the DocumentationComment class and then apply the parsing:

Note: Not all parameters are required for instantiation.

Now printing repr(docdata) would print:

You may ask about the strange formatting. That is because it retains the exact formatting, as displayed in the docstring. This is important because, whatever analyzing routines I run, I should always be able to “assemble” the metadata back into the original docstring.

That’s it! This was my milestone for week 1, to parse and extract metadata out of Python docstrings!
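To make the idea of formatting-preserving extraction concrete, here is a minimal standalone sketch. It is not the DocumentationComment implementation: the function name `parse_docstring`, the tuple layout and the choice of Sphinx-style `:param name:` / `:return:` markers are all assumptions made for illustration.

```python
def parse_docstring(doc):
    """Split a Sphinx-style docstring into (kind, name, text) tuples.

    The text of every part is kept verbatim, including line breaks and the
    indentation of continuation lines.
    """
    parts = [["description", "", ""]]
    for line in doc.splitlines(keepends=True):
        stripped = line.lstrip()
        if stripped.startswith(":param "):
            # ":param a: some text" -> name "a", text " some text"
            name, _, text = stripped[len(":param "):].partition(":")
            parts.append(["param", name, text])
        elif stripped.startswith(":return:"):
            parts.append(["return", "", stripped[len(":return:"):]])
        else:
            # Continuation line: it belongs to whichever part is open.
            parts[-1][2] += line
    return [tuple(part) for part in parts]


doc = (
    "\n"
    "Adds two numbers.\n"
    "\n"
    ":param a: first number,\n"
    "          may be negative\n"
    ":param b: second number\n"
    ":return:  the sum\n"
)

parsed = parse_docstring(doc)
for part in parsed:
    print(part)
```

Because whitespace is never normalized, the parameter text for `a` still carries its original line break and indentation, which is what makes reassembly possible.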
I have already started developing a simple Bear, which I will talk about later this week.

PS: I would really like to thank my mentor Mischa Krüger for his thoughts on the API design and for doing reviews on my ugly code. :P

### Adrianzatreanu (coala)

#### Week 1

In week 1 I finally tackled part of the hardest thing in this project: a way of maintaining the requirements for the bears easily. To help myself with that, I created a PackageRequirement class. Now each bear can have a REQUIREMENTS tuple made of instances of the PackageRequirement class.

To automate the work, I worked on two separate things:

• creating a “multiple” method, which helps you create several instances of that class at once instead of calling the same class over and over again
• creating subclasses, such as PythonRequirement and NPMRequirement, which have the manager already set to “pip” and “npm” respectively.

These classes receive 3 arguments: manager, package name and version. However, there’s more: you can also avoid specifying the version, in which case the latest version will automatically be used.

On the other hand, I am working on a bearUploader tool. This will upload a MockBear (which I chose to be cpplintbear for the basic functionality it provides) to provide continuous integration. This is still a work in progress, as the tool only creates a setup.py file for the bear right now, but it’s going to be done next week.

So for the next week: More on the bearUploader tool!

### kaichogami (mne-python)

#### Decoding API

Hello!
It's been a week since the coding period started. A lot of discussion about the approach to refactoring the decoding module took place. However, every idea had some sort of shortcoming.
The refactoring is being done so as to comply with the scikit-learn pipeline. The pipeline chains various transformer steps and has to end with a classifier.
This week I worked on changing the EpochsVectorizer class.
Earlier this class changed the dimensionality of the Epochs data matrix to a two-dimensional matrix (Epochs is an mne data structure containing events of brain signals sampled above the Nyquist frequency). After the change, EpochsVectorizer accepts an Epochs object and returns a 2D matrix containing information about the epochs, with original shape (n_epochs, n_channels, n_times), and a vector of shape (n_samples) with the event labels. With all the changes in view, a pipeline object with mne would look like:

    X, y = EpochsVectorizer().fit_transform(epochs)
    clf = make_pipeline(Xdawn(epochs.info), MinMax(), LogisticRegression())
    # CV with clf
    cv = cross_val_score(clf, X, y, ...)

This technique has two uses. Firstly, and most importantly, it makes the code compatible with sklearn. Using cross_val_score becomes possible, which saves a lot of effort in chaining transformers as well as in cross-validating. Secondly, Epochs is an extremely heavy object, which is bound to make processes slow.
This was the gist of what we did this week. Another change is being proposed by Jean for a new API. Passing info between the various steps has shortcomings: firstly, it is a problem to pass it along at all; secondly, info may not be enough to reconstruct the original data, or in some cases it provides a lot more information than needed.

As for my PR, you can see it here.
Thank you for reading!

### mr-karan (coala)

#### GSoC Week 1 Updates

## Week 1 Updates

The task of Week 1 was to complete the prototype of the coala-bears --create application. As explained in my previous blog post, this tool gives new bear creators an easy way to generate most of the code that is reused for every bear, and allows the user to simply plug in the values specific to the bear she’s creating.

A scaffolding template for both the bear file and the test file is present in the scaffold-templates directory.
While figuring out how to replace the values present in these files with the ones the user provides, I got to know about an interesting method, safe_substitute, present in the Python standard library itself (on string.Template). This function is better than substitute in my case because the user can choose not to provide values at run time, and there won’t be a KeyError in that case.

The default delimiter is $, which marks a placeholder: the identifier after $ is replaced by the value stored under that key, if the key is present.

Example:

    >>> from string import Template
    >>> s = Template('I am working from $start to $end')
    >>> s.substitute(start='Monday', end='Friday')
    'I am working from Monday to Friday'
    >>> s.safe_substitute(start='Monday')
    'I am working from Monday to $end'

I am done with implementing the basic functions of the CLI, which basically asks the user for some values, fills them in and generates the file in the directory chosen by the user.

### Further Work for Week 2 involves

• Look into the feasibility of adding Python Prompt Toolkit features
• Add more options if required
• Make a Python package so that users can install it from pip
• General code cleanup if required

### Here’s a video of the tool in action:

All the code is uploaded on Gitlab at coala / coala-bear-management

### Yashu Seth (pgmpy)

#### ContinuousFactor.discretize

Hello everyone, this post is a continuation of my last post, where I discussed the ContinuousFactor (previously known as ContinuousNode) class. Here, I will be describing the discretize method of this class.

We know that, for now, our pgmpy models can only deal with discrete variables. In order to perform tasks like inference on purely continuous or hybrid settings, we need appropriate models for their representation. The latter part of my project involves the creation of such models. But this post is not about that. Here, we will see how we can convert continuous random variables into discrete ones so that we can use the already existing Bayesian models in pgmpy.
The discretize method is used to construct a set of discrete probability masses from the given continuous distribution.

In Bayesian models, there can be different cases involving continuous and discrete random variables -

• Discrete parents - Discrete child
• Discrete parents - Continuous child
• Continuous parents - Discrete child
• Continuous parents - Continuous child

We will discuss each case one by one, and also see how we can form Tabular CPDs in such cases. In cases where I wish to distinguish discrete and continuous variables, I will use the convention that discrete variables are named with letters near the beginning of the alphabet (A, B, C), whereas continuous ones are named with letters near the end (X, Y, Z).

Discrete parents - Discrete child

This is the most trivial case, and we already have well structured models to handle it. In case you are not familiar with such settings, you can refer to Probabilistic Graphical Models: Principles and Techniques to get a general idea of probabilistic graphical models.

Discrete parents - Continuous child

Let us say we have the following structure, A -> X, where A is a discrete random variable with states a0, a1, and a2 and X is a continuous normal random variable whose mean and variance depend upon the particular state of A.
We have,

P(X | A=a0) = N(0 ; 1),
P(X | A=a1) = N(1 ; 0.5),
P(X | A=a2) = N(2 ; 0.5)

Let's see how we can form a tabular CPD for this setting:

    import numpy as np
    from pgmpy.models import BayesianModel
    from pgmpy.factors.CPD import TabularCPD

    model = BayesianModel([('A', 'X')])
    cpd_a = TabularCPD('A', 3, [[0.3, 0.2, 0.5]])

    from pgmpy.factors import ContinuousFactor
    from scipy.stats import norm

    # One continuous distribution of X for each state of A.
    x_a0 = ContinuousFactor(norm(0, 1).pdf)
    x_a1 = ContinuousFactor(norm(1, 0.5).pdf)
    x_a2 = ContinuousFactor(norm(2, 0.5).pdf)

    from pgmpy.continuous.discretize import RoundingDiscretizer

    # Discretize each distribution on [-5, 5] with a step of 1.
    cpd_x_values = [x_a0.discretize(RoundingDiscretizer, -5, 5, 1),
                    x_a1.discretize(RoundingDiscretizer, -5, 5, 1),
                    x_a2.discretize(RoundingDiscretizer, -5, 5, 1)]
    cpd_x_values = np.round(cpd_x_values, 3)
    cpd_x_a = TabularCPD('X', len(cpd_x_values[0]), cpd_x_values,
                         evidence=['A'], evidence_card=[3])
    model.add_cpds(cpd_a, cpd_x_a)

    print(cpd_x_a)

    ╒═════╤════════╤════════╤════════╕
    │ A   │ A_0    │ A_1    │ A_2    │
    ├─────┼────────┼────────┼────────┤
    │ X_0 │ 0.0    │ 0.0002 │ 0.006  │
    ├─────┼────────┼────────┼────────┤
    │ X_1 │ 0.0606 │ 0.2417 │ 0.3829 │
    ├─────┼────────┼────────┼────────┤
    │ X_2 │ 0.2417 │ 0.0606 │ 0.006  │
    ├─────┼────────┼────────┼────────┤
    │ X_3 │ 0.0002 │ 0.0    │ 0.0    │
    ├─────┼────────┼────────┼────────┤
    │ X_4 │ 0.0002 │ 0.006  │ 0.0606 │
    ├─────┼────────┼────────┼────────┤
    │ X_5 │ 0.2417 │ 0.3829 │ 0.2417 │
    ├─────┼────────┼────────┼────────┤
    │ X_6 │ 0.0606 │ 0.006  │ 0.0    │
    ├─────┼────────┼────────┼────────┤
    │ X_7 │ 0.0    │ 0.0    │ 0.0    │
    ├─────┼────────┼────────┼────────┤
    │ X_8 │ 0.0002 │ 0.006  │ 0.0606 │
    ├─────┼────────┼────────┼────────┤
    │ X_9 │ 0.2417 │ 0.3829 │ 0.2417 │
    ╘═════╧════════╧════════╧════════╛

We can also extend this example to multiple discrete parents of a continuous child.
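For readers who want to see where such probability masses come from, the following is a minimal standard-library sketch of a rounding-style discretization of a normal distribution. The helper names `normal_cdf` and `rounding_discretize` are made up for this illustration, and whether pgmpy's RoundingDiscretizer treats the boundary points in exactly this way is an assumption.

```python
import math


def normal_cdf(x, mean, std):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def rounding_discretize(cdf, low, high, step):
    """Probability mass around each grid point low, low + step, ... below high.

    Interior points receive the mass of [x - step/2, x + step/2]; the first
    point only receives the mass of [low, low + step/2].
    """
    n_points = int(round((high - low) / step))
    masses = []
    for i in range(n_points):
        x = low + i * step
        left = x if i == 0 else x - step / 2.0
        masses.append(cdf(x + step / 2.0) - cdf(left))
    return masses


# Standard normal on [-5, 5) with step 1: the grid points are -5, -4, ..., 4.
column = rounding_discretize(lambda x: normal_cdf(x, 0.0, 1.0), -5, 5, 1)
print([round(p, 4) for p in column])
```

For the standard normal, the mass around 0 is about 0.3829 and the mass around 1 is about 0.2417, the same magnitudes that show up in the printed CPD.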
Continuous parents - Continuous child

Let us say we have the following structure, X -> Y, where Y and X are continuous normal random variables such that the mean of X is constant and that of Y is a function of probability values taken by X. The two variables have constant variance.

We have,

- X ~ N(0 ; 1)
- Y ~ N(-0.5x+1 ; 1)

Let's see how we can form a tabular CPD for this setting:

    import numpy as np
    from pgmpy.models import BayesianModel
    from pgmpy.factors.CPD import TabularCPD

    model = BayesianModel([('X', 'Y')])

    from pgmpy.factors import ContinuousFactor
    from scipy.stats import norm
    from pgmpy.continuous.discretize import RoundingDiscretizer

    # Discretize X on [-5, 5] with a step of 2.
    x = ContinuousFactor(norm(0, 1).pdf)
    x_values = [x.discretize(RoundingDiscretizer, -5, 5, 2)]
    x_values = np.round(x_values, 3)

    cpd_x = TabularCPD('X', len(x_values[0]), x_values)

    print(cpd_x)

    ╒═════╤═══════╕
    │ X_0 │ 0     │
    ├─────┼───────┤
    │ X_1 │ 0.023 │
    ├─────┼───────┤
    │ X_2 │ 0.477 │
    ├─────┼───────┤
    │ X_3 │ 0.477 │
    ├─────┼───────┤
    │ X_4 │ 0.023 │
    ╘═════╧═══════╛

    # One discretized distribution of Y for each discrete state of X.
    y_values = []
    for x in x_values[0]:
        y = ContinuousFactor(norm(-0.5*x+1, 1).pdf)
        y_values.append(y.discretize(RoundingDiscretizer, -5, 5, 2))

    y_values = np.round(y_values, 3)

    # evidence_card is given as a list, as in the previous example.
    cpd_y = TabularCPD('Y', len(y_values[0]), y_values,
                       evidence=['X'], evidence_card=[len(x_values[0])])

    print(cpd_y)

    ╒═════╤═════╤═══════╤═══════╤═══════╤═══════╕
    │ X   │ X_0 │ X_1   │ X_2   │ X_3   │ X_4   │
    ├─────┼─────┼───────┼───────┼───────┼───────┤
    │ Y_0 │ 0.0 │ 0.001 │ 0.157 │ 0.683 │ 0.157 │
    ├─────┼─────┼───────┼───────┼───────┼───────┤
    │ Y_1 │ 0.0 │ 0.001 │ 0.16  │ 0.683 │ 0.155 │
    ├─────┼─────┼───────┼───────┼───────┼───────┤
    │ Y_2 │ 0.0 │ 0.003 │ 0.22  │ 0.669 │ 0.107 │
    ├─────┼─────┼───────┼───────┼───────┼───────┤
    │ Y_3 │ 0.0 │ 0.003 │ 0.22  │ 0.669 │ 0.107 │
    ├─────┼─────┼───────┼───────┼───────┼───────┤
    │ Y_4 │ 0.0 │ 0.001 │ 0.16  │ 0.683 │ 0.155 │
    ╘═════╧═════╧═══════╧═══════╧═══════╧═══════╛

Now we can clearly extend this example to multiple parents as well.
Continuous parents - Discrete child

These settings can be modelled using various functions, such as softmax or logistic functions. Once we have a well defined relation, the Tabular CPDs can be formed in a similar fashion to the two settings above.

With this I conclude the description of the method ContinuousFactor.discretize. In my next post I will describe how we can exploit the class based architecture of the various discretization methods to create and insert our own custom algorithm for discretizing the distributions.

I hope you enjoyed this post and will be looking forward to my future posts as well. Thank you.

## May 27, 2016

### Yen (scikit-learn)

#### scikit-learn Sparse Utils Now Support Fused Types

Dealing with sparse data is fairly common when we are analyzing large datasets. However, the sparse function utilities in scikit-learn currently only support float64 and will therefore implicitly convert other input data types, e.g., float32, into float64, which may cause an unexpected memory error. Since Cython fused types allow us to have one type definition that can refer to multiple types, we can solve this potential memory wasting issue by substituting float64 with Cython fused types.

Below, I’ll briefly introduce the sparse function utilities in scikit-learn and describe the work I’ve done to enhance them during GSoC.

## Sparse Function Utilities

In scikit-learn, sparse data is often represented as scipy.sparse.csr_matrix or scipy.sparse.csc_matrix. However, these two matrix types do not provide built-in methods to calculate important statistics such as the L2 norm and variance, which are useful when we are playing with data. Therefore, scikit-learn relies on sparsefuncs_fast.pyx, which defines some helper methods for sparse matrices, to handle sparse data more conveniently throughout the project.

## Memory Wasting Issue

However, the original implementation of the sparse function utilities in scikit-learn is not memory-efficient.
Let’s take as an example a simple function which does not use Cython fused types and calculates the L2 norm of each row of a CSR matrix X:

    def csr_row_norms(X):
        """L2 norm of each row in CSR matrix X."""

        # 1
        cdef:
            unsigned int n_samples = X.shape[0]
            unsigned int n_features = X.shape[1]
            np.ndarray[np.float64_t, ndim=1, mode="c"] norms
            np.ndarray[np.float64_t, ndim=1, mode="c"] data
            np.ndarray[int, ndim=1, mode="c"] indices = X.indices
            np.ndarray[int, ndim=1, mode="c"] indptr = X.indptr

            np.npy_intp i, j
            double sum_

        # 2
        norms = np.zeros(n_samples, dtype=np.float64)

        # 3
        data = np.asarray(X.data, dtype=np.float64)  # Warning: might copy!

        # 4
        for i in range(n_samples):
            sum_ = 0.0
            for j in range(indptr[i], indptr[i + 1]):
                sum_ += data[j] * data[j]
            norms[i] = sum_

        # Take the square root of the squared sums to get the L2 norms.
        return np.sqrt(norms)

1. Declare Cython’s statically typed (in contrast to Python’s dynamically typed) variables to store attributes of the input CSR matrix X, since statically typed variables can accelerate the computation a lot.

2. Initialize norms with 0s.

3. Since we’ve already used cdef to declare data as an np.ndarray which contains np.float64_t elements in step 1, the data of X needs to be converted into type np.float64 if it has a different data type.

4. Calculate the squared sum of each row and then take the square root of it to get the L2 norm.

As illustrated above, STEP 3 IS DANGEROUS because converting the type of data may implicitly copy the data and then cause a memory error unexpectedly. To see how it will affect the memory space, we can use memory_profiler to monitor memory usage.

Here is the result of memory profiling if we pass a scipy.sparse.csr_matrix with np.float32 elements into our example function:

It is obvious that memory usage increases dramatically because step 3 copies the data so as to convert it from np.float32 to np.float64.

To solve this problem, we can introduce Cython fused types to avoid data copying. But first, let’s take a brief look at Cython fused types.
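As an aside on step 3, the reason a dtype conversion cannot be free is easy to see even without NumPy. The sketch below uses the standard library's `array` module purely as a stand-in for NumPy buffers (an assumption made so the example is self-contained): widening 32-bit floats to 64-bit needs a new buffer of twice the size, whereas a view on the original buffer needs none.

```python
from array import array

# 32-bit floats standing in for the `data` buffer of a float32 CSR matrix.
data32 = array("f", [1.0, 2.0, 3.0, 4.0])

# Widening to 64-bit floats cannot reuse the buffer: a new one, twice as
# large, is allocated and every element is converted.
data64 = array("d", data32.tolist())

# A memoryview, in contrast, exposes the original buffer without copying.
view32 = memoryview(data32)

bytes32 = data32.itemsize * len(data32)
bytes64 = data64.itemsize * len(data64)
print(bytes32, bytes64)

# Writing through the view changes the original; the converted copy is
# unaffected, which confirms that it lives in separate memory.
view32[0] = 10.0
print(data32[0], data64[0])
```

For a large matrix, the converted copy exists alongside the original, so peak memory is roughly the original plus twice its size.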
## Cython Fused Types

Here is the official page’s clear introduction to fused types:

"Fused types allow you to have one type definition that can refer to multiple types. This allows you to write a single static-typed cython algorithm that can operate on values of multiple types. Thus fused types allow generic programming and are akin to templates in C++ or generics in languages like Java / C#. Note that Cython fused types are specialized at compile time, and are constant for a particular function invocation."

By adopting Cython fused types, our function can accept multiple types and therefore doesn’t need to do datatype conversion.

## Common Pitfalls

Intuitively, in order to integrate Cython fused types to solve the memory issue described above, we would delete step 3 and change step 1 in our function as follows:

    # 1
    cdef:
        unsigned int n_samples = X.shape[0]
        unsigned int n_features = X.shape[1]

        # Change type from np.float64_t to floating
        np.ndarray[floating, ndim=1, mode="c"] norms
        np.ndarray[floating, ndim=1, mode="c"] data = X.data
        np.ndarray[int, ndim=1, mode="c"] indices = X.indices
        np.ndarray[int, ndim=1, mode="c"] indptr = X.indptr

        np.npy_intp i, j
        double sum_

However, the above changes will cause the Cython compile error "Invalid use of fused types, type cannot be specialized".

It seems that Cython doesn’t allow us to declare a fused-type variable and assign a value to it within a function unless that function accepts an argument whose type involves the same fused type. Hence, we need an implementation trick here.

## Enhanced Implementation

The trick I used here is to define a wrapper function and make its underlying implementation function accept fused-type arguments. The reason behind this is mentioned above:

If a function accepts an argument that has a particular fused type, it can use cdef to declare and initialize variables with that fused type within its scope.
The code of the enhanced implementation is shown below:

    # Wrapper function
    def csr_row_norms(X):
        """L2 norm of each row in CSR matrix X."""
        return _csr_row_norms(X.data, X.shape, X.indices, X.indptr)

    # Underlying implementation function
    def _csr_row_norms(np.ndarray[floating, ndim=1, mode="c"] X_data,
                       shape,
                       np.ndarray[int, ndim=1, mode="c"] X_indices,
                       np.ndarray[int, ndim=1, mode="c"] X_indptr):
        cdef:
            unsigned int n_samples = shape[0]
            unsigned int n_features = shape[1]
            np.ndarray[DOUBLE, ndim=1, mode="c"] norms

            np.npy_intp i, j
            double sum_

        norms = np.zeros(n_samples, dtype=np.float64)

        for i in range(n_samples):
            sum_ = 0.0
            for j in range(X_indptr[i], X_indptr[i + 1]):
                sum_ += X_data[j] * X_data[j]
            norms[i] = sum_

        # Take the square root of the squared sums to get the L2 norms.
        return np.sqrt(norms)

Finally, to verify our enhancement, here is the result of memory profiling if we pass a scipy.sparse.csr_matrix with np.float32 elements into our enhanced function:

Cool! As the figure shows, our function no longer copies the data.

## Summary

All of the functions in sparsefuncs_fast.pyx now support Cython fused types! Great thanks to all of the reviewers for their useful opinions.

In the next few weeks, my goal is to work on clustering algorithms such as KMeans in scikit-learn so as to make them also support Cython fused types.

### TaylorOshan (PySAL)

#### SpInt Framework

Given the API design discussed in the previous post, the initial few days of coding were used to build the general framework and core classes that will be used in SpInt. Since the gravity-type models make up the majority of existing model specifications, the initial focus on developing the general framework essentially means focusing on developing classes for gravity models. It was decided to split the four primary model specifications (unconstrained or traditional gravity model, production-constrained, attraction-constrained, doubly-constrained) into four separate "user" style classes (see the pysal.spreg module) that inherit from one base class (BaseGravity).
These user classes serve to accept structured array inputs and configurations from the user and then carry out model-specific checks of the data. Then they pass the inputs into an __init__ method for BaseGravity. Depending on which inputs are passed and the type of user class that __init__ is called from, BaseGravity then further checks the data and prepares it into an endogenous dependent variable, y, and a set of exogenous explanatory variables, X. Finally, BaseGravity calls the __init__ method of the class that it inherits from, CountModel, with these newly formatted inputs and any estimation options, and then a fit() method. The __init__ method carries out a simple check to make sure the dependent variable, y, is indeed a count-based variable, and the fit method carries out the selected estimation routine. Currently the default estimation framework/method is MLE using the generalized linear model (GLM)/iteratively re-weighted least squares (IWLS) in statsmodels, though others can be added, such as GLM/gradient search or non-linear formulations/gradient search.

The design was carried out in this manner (CountModel --> BaseGravity --> Gravity/Production/Attraction/Doubly) in order to promote flexibility and modularity. For example, CountModel can be expanded by adding tests for overdispersion, additional estimation routines and count-based probability models other than Poisson (e.g., negative binomial), which can be useful throughout pysal for other count-based modeling tasks. This also means that SpInt is not a priori limited to any single type of estimation technique or probability model in the future. Then the BaseGravity class does the heavy lifting in terms of data munging and common data integrity checks. And finally, the user classes aim to restrict input for specific types of gravity models, carry out model-specific checks, and then organize and prepare the model results for the user.
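The layering described above can be pictured with a small, pure-Python skeleton. This is only a sketch of the described structure, not the SpInt code: the constructor signatures, the argument names and the placeholder `fit` result are all assumptions, and no real estimation is performed.

```python
class CountModel:
    """Bottom layer: checks that y holds counts and runs the estimation."""

    def __init__(self, y, X):
        if any(v < 0 or v != int(v) for v in y):
            raise ValueError("y must contain non-negative integer counts")
        self.y = list(y)
        self.X = [list(row) for row in X]

    def fit(self, framework="GLM"):
        # Placeholder: a real implementation would call an estimator here.
        return {"framework": framework, "n_obs": len(self.y)}


class BaseGravity(CountModel):
    """Middle layer: common integrity checks and assembly of y and X."""

    def __init__(self, flows, cost, origins=None, destinations=None,
                 framework="GLM"):
        n = len(flows)
        columns = [c for c in (origins, destinations, cost) if c is not None]
        if any(len(c) != n for c in columns):
            raise ValueError("all inputs need one entry per flow")
        X = [[col[i] for col in columns] for i in range(n)]
        super().__init__(flows, X)
        self.results = self.fit(framework)


class Gravity(BaseGravity):
    """User class for the unconstrained model (hypothetical signature)."""

    def __init__(self, flows, o_vars, d_vars, cost):
        super().__init__(flows, cost, origins=o_vars, destinations=d_vars)


class Production(BaseGravity):
    """User class for a production-constrained model (hypothetical signature);
    in this sketch only destination variables are passed on."""

    def __init__(self, flows, d_vars, cost):
        super().__init__(flows, cost, destinations=d_vars)


model = Gravity(flows=[3, 0, 7], o_vars=[10, 10, 20],
                d_vars=[5, 8, 5], cost=[1.0, 2.5, 4.0])
print(model.X)
print(model.results)
```

Because the count check lives in the bottom layer, every user class inherits it for free, which is the modularity argument made in the text. The sketch is illustrative only and should not be confused with the project's own code.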
This code and an example of its use in an ipython notebook can be found here. At the moment estimation can only be carried out using GLM/IWLS in statsmodels and only basic results are available, though a more user-friendly presentation of the results will be created, similar to the results.summary() method in statsmodels.

Looking forward, this framework will be filled out with the additions already described, as well as potentially adding a mechanism that allows users to flexibly use different input formats. Currently, input consists entirely of arrays, but it would be helpful to allow users to pass in a pandas DataFrame and the names of the columns that refer to the different arrays. I will also begin to explore sparse data structures/algebra and when they will be the most beneficial to employ.

### tushar-rishav (coala)

#### coala-html Beta

So I have been working on the coala-html beta version for a few weeks. The PR was certainly too huge to be reviewed at once, and it soon became cumbersome to keep the changes updated. But thanks to my mentor Attila and the constructive feedback from Abdeali, I could get it done the right way, making appropriate and meaningful commits with better code.

##### What is coala-html?

coala-html is a console application that runs coala analysis and generates an interactive webpage for the user.

##### How coala-html works?

coala-html creates a webpage (an Angular app) based on certain JSON files that are generated the first time coala-html is run on a given repository, and updated when coala-html is run again. By default, the generated webpage is served by launching a server at localhost, and the JSON files are updated. The user can change this behaviour by providing the nolaunch and noupdate arguments respectively while running coala-html. The user can also provide an optional dir (directory path) argument naming the directory that will store the code for the webpage.
You may see a brief demo below:

Now that the basic functionality is done, I am going to work on improving the UI and writing more tests to get maximum coverage in the coming weeks.

Stay tuned! :)

### meetshah1995 (MyHDL)

#### Invitation Game ? no please !

This week I had to get ahead with the pure Python decoder implementation. I surely went ahead, but it was more like one step ahead, two steps back, recursion more or less.

I had to tread through the never-ending RISC-V ISA specification manual to understand each instruction's encoding and write tests to make sure my decoder is working as it should - correctly!

Murphy said it rightly: if there were x mistakes one could make, I made them all while implementing the decoder. But nonetheless a very basic version of the decoder is ready. I still have to polish it and rigorously test it for correctness.

I expect to do that by tomorrow. So see you next week. Decoding isn't a very inviting game though, folks :P.

Best,
MS.

### jbm950 (PyDy)

#### GSoC Week 1

The first week of the Google Summer of Code is now coming to an end and I feel like I’ve hit the ground running and made a great head start. Most of the week revolved around creating a way to benchmark the KanesMethod and LagrangesMethod classes so that activities aimed at enhancing the speed performance of these classes can be tracked. I also worked on moving some code from the pydy repository to the sympy repository and made my first attempt at reviewing a pull request. Lastly, I continued researching Featherstone's method of equations of motion generation and started digging into the structure of KanesMethod and LagrangesMethod as I work towards making a base equations of motion class.

The week started off with finishing the tkinter GUI and benchmarking code that I had started making from scratch during the community bonding period. I added the ability to filter the graphed results by test, Python version and platform.
This code was submitted to the SymPy repository in PR #11154. This PR has since been closed, as Jason Moore pointed out that SymPy already has a benchmarking repository that can do basically what I was achieving with my code, and a better solution would be to simply move my tests there. First I had to learn the airspeed velocity (ASV) package, which is what the benchmarking repository uses to run its tests. After reading through the documentation on ASV’s homepage I altered my KanesMethod and LagrangesMethod tests to fit ASV’s formatting. This code was submitted to the sympy_benchmarks repository in PR #27. It has since been merged, though during the submission process Jason brought up that it would be a good idea to broaden the scope of testing for the equations of motion generators and mentioned a few example scripts to look through. My summary of reviewing those scripts can be found on the PR, but basically some of the examples did not use additional methods and simply provided different inputs for testing equations of motion formation, which is still useful.

Among the scripts to review was pydy.models.py, which Jason pointed out would be useful if added to the SymPy repository as it would give additional code to benchmark and test. The tasks needed to achieve this migration were to remove all dependence of the code on pydy and to come up with some test code, which I worked on in the back half of this week. I also changed the location of the theta coordinate of models.py’s second function at Jason’s request. The submission of this code to the SymPy repository is in PR #11168, which at the time of this writing is awaiting the completion of the Travis CI tests.

The last thing I did related to my project this week was continue to learn the math behind Roy Featherstone’s equations of motion algorithm. I finished reading through his short course on spatial vector algebra slides and their accompanying notes.
Also I continued reading through A Beginners Guide to 6-D Vectors (Part 2). Lastly I began taking notes on the APIs of KanesMethod and LagrangesMethod as I begin working towards creating an equations of motion generation base class.

I also made my first attempt at doing a PR review this week on PR #10650. This PR had very little code to look over and I made some suggestions on basic syntax choices. After the author addressed the suggestions, however, I pinged members who deal with that code as I am not confident in my ability to assess whether the code is ok for merging or if the fix is necessary.

### Future Directions

Next week my plan is to jump more into figuring out the internals of the KanesMethod and LagrangesMethod classes, which will most likely involve dynamics topics I am less familiar with. In addition I will keep making progress on learning Featherstone’s method of equations of motion generation. Thus it seems that next week will be focused more on theoretical learning and less on coding than this week was.

### PR’s and Issues Referenced in Post

• (Closed) Mechanics Benchmarking PR #11154
• (Merged) Added a test for KanesMethod and LagrangesMethod PR #27
• (Open) Fix matrix rank with complicated elements PR #10650
• (Open) Pydy models migration PR #11168

### Articles/Books Referenced in Post

## May 26, 2016

### Shridhar Mishra (italian mars society)

#### GSoc 2016 IMS- Community Bonding

I shall be using the same blog that I created for GSoC 2015. Huh! So it’s good to be back with a new project! Kudos to Italian Mars Society.

I did some work before the actual coding began and got to know the new members of my team as well. Here's the progress till now.

• Successfully imported and ran all the test programs and Kinect works fine.
• Stored all the readings for the joint movements in JSON format on my Windows machine. Next I have to send them to the Linux virtual machine using vito's code.

On the Unity end I have been using a plugin that tracks joint movements for trial.
There is also a paid plugin that uses Kinect out of the box. Whether to use the paid plugin has to be decided after consulting my mentor.

Things are now going to be a bit slow since my exams run from 30th May to 12th June. Karan doesn't have exams during this time, so I can pick up from his work on Kinect if at all our work coincides. I can work at full capacity after the 12th.

Cheers!
Shridhar

## May 25, 2016

### Riddhish Bhalodia (dipy)

#### What am I doing wrong?

This week was full of ups and downs. I have gone through the code and the LocalPCA paper about 100 times and tried all the different things there are to try, to no avail! I am still missing something. So I thought I would do a walkthrough of the algorithm and code simultaneously, one step at a time.

Update: This walkthrough helped. I picked up on a few mistakes and corrected them, and I have put the corrected results after the walkthrough.

So here we go (this one may get pretty long!)

### [1] Read Data

Read the dMRI data and store the necessary parameters. The following are the lines of code:

    from dipy.data import fetch_sherbrooke_3shell, read_sherbrooke_3shell

    fetch_sherbrooke_3shell()
    img, gtab = read_sherbrooke_3shell()
    data = img.get_data()
    affine = img.get_affine()

### [2] Noise Estimation

This is a separate section, and I will divide it into further parts.

#### (2.a) Get Number of b=0 Images

This is pretty easy once we use the gtab object. It was described in the previous blog post as well, hence I am not repeating it here.

#### (2.b) Create matrix X for PCA decomposition

Once we have decided to go for SIBE or MUBE we need to create a mean normalised matrix X of size NxK, where N = number of voxels in one image and K = number of images. We will be using the PCA trick for faster computation.

    import numpy as np

    data0 = data[..., gtab.b0s_mask]  # MUBE noise estimate
    # Form the matrix X = NxK
    X = data0.reshape(data0.shape[0] * data0.shape[1] * data0.shape[2], data0.shape[3])
    # Now subtract the mean
    M = np.mean(X, axis=0)
    X = X - M

#### (2.c) Create covariance matrix $C = X^T X$ and perform its EVD

C is of size KxK, and we will use it for the PCA trick.

    C = np.transpose(X).dot(X)
    # d = eigenvalue vector in increasing order, W = normalised eigenvectors
    [d, W] = np.linalg.eigh(C)

#### (2.d) Get the eigenvectors of $X X^T$

We actually want the eigenvectors of $X X^T$, but for computational efficiency we computed those of $X^T X$. Now we need to get the actual ones (this is the PCA trick):

    V = X.dot(W)

#### (2.e) Find the eigenvector corresponding to the smallest positive eigenvalue

The noise component is in the eigenvector corresponding to the lowest positive eigenvalue, so we choose that and reshape it to the original image dimensions.

    d[d < 0] = 0
    d_new = d[d != 0]
    # As the eigenvalues are sorted in increasing order
    I = V[:, d.shape[0] - d_new.shape[0]].reshape(data.shape[0], data.shape[1], data.shape[2])

#### (2.f) Computation of local mean of data and local noise variance

We compute the noise field by computing the local variance of 3x3x3 patches of the lowest positive eigenvector image. The local mean of the data is needed for the sigma correction in the next step. I is the image obtained in the previous step.

    # sigma, mean and count are assumed to be zero-initialised arrays
    # (sigma is 3D; mean and count have the same 4D shape as data)
    for i in range(1, I.shape[0] - 1):
        for j in range(1, I.shape[1] - 1):
            for k in range(1, I.shape[2] - 1):
                temp = I[i-1:i+2, j-1:j+2, k-1:k+2]
                temp = (temp - np.mean(temp)) * (temp - np.mean(temp))
                sigma[i-1:i+2, j-1:j+2, k-1:k+2] += temp
                temp = data[i-1:i+2, j-1:j+2, k-1:k+2, :]
                mean[i-1:i+2, j-1:j+2, k-1:k+2, :] += np.mean(
                    np.mean(np.mean(temp, axis=0), axis=0), axis=0)
                count[i-1:i+2, j-1:j+2, k-1:k+2, :] += 1

    sigma = sigma / count[..., 0]
    mean = mean / count

#### (2.g) SNR based sigma correction

This was also described in the previous blog post; however, the issue of not getting a satisfactory noise field after rescaling is now solved. (See the corrected figure below)

    import scipy as sp
    import scipy.special

    # find the SNR and make the correction
    for l in range(data.shape[3]):
        snr = mean[..., l] / np.sqrt(sigma)
        eta = (2 + snr**2 - (np.pi / 8) * np.exp(-0.5 * (snr**2)) *
               ((2 + snr**2) * sp.special.iv(0, 0.25 * (snr**2)) +
                (snr**2) * sp.special.iv(1, 0.25 * (snr**2)))**2)
        sigma_corr[..., l] = sigma / eta

#### (2.h) Regularise using an LPF

    from scipy import ndimage

    sigma_corrr = ndimage.gaussian_filter(sigma_corr, 3)

### [3] Local PCA

Now that we have obtained the sigma, we proceed to the local PCA part.

#### (3.a) For each voxel choose a patch around it

    for k in range(patch_radius, arr.shape[2] - patch_radius, 1):
        for j in range(patch_radius, arr.shape[1] - patch_radius, 1):
            for i in range(patch_radius, arr.shape[0] - patch_radius, 1):
                # X = PCA matrix, M = mean container
                X = np.zeros((arr.shape[3], patch_size * patch_size * patch_size))
                M = np.zeros(arr.shape[3])
                temp = arr[i - patch_radius : i + patch_radius + 1,
                           j - patch_radius : j + patch_radius + 1,
                           k - patch_radius : k + patch_radius + 1, :]
                temp = temp.reshape(patch_size * patch_size * patch_size, arr.shape[3])

#### (3.b) Construct matrix X for PCA

We need to construct a matrix X of size NxK, where N = number of voxels in the patch and K = number of directions.

    X = temp.reshape(patch_size * patch_size * patch_size, arr.shape[3])
    # compute the mean and normalise
    M = np.mean(X, axis=1)
    X = X - np.array([M,] * X.shape[1], dtype=np.float64).transpose()

#### (3.c) Construct covariance matrix and perform its EVD

The covariance matrix will be $C = X^T X$.

    C = np.transpose(X).dot(X)
    C = C / arr.shape[3]
    # compute the EVD of the covariance matrix of X to get the matrices W and D
    [d, W] = np.linalg.eigh(C)

#### (3.d) Threshold the eigenvalue matrix and get estimated X

    # tou = 2.3 * 2.3 * sigma
    d[d < tou[i][j][k][:]] = 0
    D_hat = np.diag(d)
    X_est = X.dot(D_hat)

#### (3.e) Recover patch and update matrices for over-complete averaging

We have to recover the patch from the estimated X, then compute theta of the patch and update the matrices for overcomplete averaging.

    temp = X_est + np.array([M,] * X_est.shape[1], dtype=np.float64).transpose()
    temp = temp.reshape(patch_size, patch_size, patch_size, arr.shape[3])
    # the patch region, written once instead of repeating the slices
    patch = (slice(i - patch_radius, i + patch_radius + 1),
             slice(j - patch_radius, j + patch_radius + 1),
             slice(k - patch_radius, k + patch_radius + 1))
    # weight of this patch: 1 / (1 + number of non-zero eigenvalues)
    weight = 1 / (1 + np.linalg.norm(d, ord=0))
    # update the theta matrix
    theta[patch] = theta[patch] + weight
    # update the estimate matrix (thetax), which is X_est * theta
    thetax[patch] = thetax[patch] + temp * weight

#### (3.f) Get the denoised output

    denoised_arr = thetax / theta

### [4] Rician Adaptation

This is done to correct the bias introduced by the Rician noise in dMR images. It works by creating a look up table (between y and phi) of the expression given below, where phi = (bias corrected value / sigma) and y = (uncorrected denoised value / sigma).
However, using the look up table for such a large dataset is computationally very expensive. This is the current implementation:

    # eta_phi = LUT. The lookup table was generated by keeping the phi values as
    # phi = linspace(0, 15, 1000), and y is the output of the expression.
    # We need to find the index of the closest value of arr/sigma from the dataset.
    corrected_arr = np.zeros_like(denoised_arr)
    y = denoised_arr / np.sqrt(sigma)
    opt_diff = np.abs(y - eta_phi[0])
    for i in range(1, eta_phi.size):
        new_diff = np.abs(y - eta_phi[i])
        corrected_arr[new_diff < opt_diff] = i
        opt_diff[new_diff < opt_diff] = new_diff[new_diff < opt_diff]
    # we can recover the true value from the index
    corrected_arr = np.sqrt(sigma) * corrected_arr * 15.0 / 1000.0

This concludes the paper and the implementation I have done. All the code can be found in the following PR.

• LocalPCA function
• Noise Estimation Function
• LocalPCA example (currently not using function for easier debugging)
• Noise estimation example (currently not using function for easier debugging)

## Rician adaptation

This week I understood the Rician adaptation described in [1] and implemented it. I generate the LUT in the localPCA function itself. The problem is the time, as we need to compare every element of the denoised array, which is very large (189726720 elements for the Sherbrooke data), with the lookup table and find the nearest value to it. This is a computationally very expensive task, and hence I need to find a better way to use the LUT.

## Updated Results

The code walkthrough helped me get a few things corrected, so I am posting a few of the updated results.

1. The Noise Estimation

This matches reasonably (visually) with that in the paper.

2. Local PCA Denoising

The images are still overly smooth and the time to denoise one dataset is huge (~1.5 hrs for a (128,128,60,193) dataset). Computing the covariance matrix takes the most time in local PCA and I will have to work on speeding that up.
Also, I will probably have to look closely at the implementation to figure out why it is overly smooth.

## Things left to be done …

1. Debug
2. Improve for computational efficiency
3. HPC dataset test
4. Validation

## References

[1] Manjón JV, Coupé P, Concha L, Buades A, Collins DL, et al. (2013) Diffusion Weighted Image Denoising Using Overcomplete Local PCA. PLoS ONE 8(9): e73021.

### Levi John Wolf (PySAL)

#### Call notes about my Request for Comment

The following were comments I received on my Request for Comment submitted a bit ago.

• Questions about Request for Comment:
• What should I prioritize? NOGR or Labeled Array Interface?
• Labeled Array. This is critical to get correct, and will make NOGR need and scope clearer.
• How deep into PySAL should the Labeled Array interface go?
• Design it as if the library were getting built now.
• Do not fail on import. Instead, use soft dependencies/optional import patterns
• If necessary, write Python3-only components safely, so that new features can be leveraged.
• What should get deprecated?
• Anything that looks less smooth in the unlabeled array interface should get flagged with a depwarning.
• If the tabular IO is smooth and works parallel to the older interface, then throw a deprecation warning on the FileIO components.
• Deliverables in the medium term (targeting midterm eval for GSOC):
• Two Contrib Modules:
• GeoTable: interfaces between PySAL labeled arrays & Geopandas arrays
• Pdio: extend and improve tabular interface already in PySAL
• Some work in core:
• Polymorphic weights constructors
• i.e.
work on any arbitrary iterable of shapes
• return correct weights object from the iterable
• possibly indexed by a second collection of indices
• Revamp & scaffold new IO system revolving around multiple alternative packages & their drivers:
• expose all pandas read_ functions
• ensure pysal objects get serialized correctly into wkb/wkt by to_ on dataframes
• wrap Fiona & geopandas constructors to provide identical output to pdio.read_files
• Plan to connect with new geopandas contributors at SciPy
• Investigate possibility of serializing with Libfeather (remote, if time remaining)

### fiona (MDAnalysis)

#### What is this MD thing anyway?

With the GSoC coding period now started, here’s the background for my project that I promised last post! I’ve tried to start from the (relative) basics so the context of what I’m trying to achieve with my project is clearer; if you already know how Molecular Dynamics and Umbrella Sampling work (or you’re just not that interested), feel free to skip ahead to the overview of my GSoC project or projected timeline.

### So what is MD?

Experimental approaches to biochemistry can tell us a lot about the function and interactions of biologically relevant molecules on larger scales, but sometimes we want to know the exact details – the atomic scale interactions – that drive these processes. You can imagine it’s quite difficult experimentally to see and follow what’s going on at this scale!

This is where Molecular Dynamics, or MD, comes in. What we do is use a computer to model our system – say, a particular protein and its surrounding environment (usually water and ions) – on this scale and simulate how it behaves. To make our simulations feasible, we make some approximations when considering how the atoms interact: in general, they are simply modelled as spheres connected by springs to represent molecular bonds, with additional forces to represent e.g. charge interactions.
The set of parameters that describe these interactions form a ‘force field’ and are chosen to best replicate various experimental and other results.

Using this force field, we can add up the total force on each atom due to the relative location of the remaining atoms. We assume that this force is approximately constant over the next small length of time dt, which allows us (given also each particle’s velocity) to calculate how far and in which direction they’ll move over that time period. We move all our particles appropriately, and repeat the process.

The force isn’t really constant, so for our simulations to be accurate this dt needs to be very small, typically on the order of a femtosecond (a millionth of a billionth of a second!), and we need to iterate many, many times before we approach the nanosecond or greater timescales relevant for the events we’re interested in.

It would not be feasible to keep a record of every single position we calculate for every single timestep. Even when our output record of the simulation – the trajectory – is updated at a much lower frequency, a typical simulation may still end up with coordinates (and possibly velocities and forces) for each of tens of thousands of atoms in each of thousands of timesteps – it’s quite easy to get buried in all that data!

This is where tools like MDAnalysis help out, by making it easier to read through and manipulate these trajectories, isolating atom selections of interest and performing analysis over the recorded timesteps (known as frames).

If you want to know more about Molecular Dynamics, particularly the more technical aspects, you could head over to the Wikipedia page or read one of the many review articles, such as this one by Adcock and McCammon.

### But what do umbrellas have to do with anything?
Sometimes we want to know more about our system than ‘molecule A and molecule B spend a lot of time together’ – we want to quantify how much molecule A likes molecule B, by calculating a free energy for the interaction. The lower the free energy of an interaction or particular conformation of our system, the more favoured it is. We can also talk about free energy landscapes, which show how the free energy of our system changes as we vary part of it, for example bringing two molecules together.

Say you want to figure out which toys your cat likes the best. You can wait and see where he spends his time in relation to them, and if you watch long enough you can make a probability distribution of how likely you are to find your cat at any place at a point in time. This probability will correlate with how much he favours each toy, and so we can calculate from the probability distribution the free energy of cat-toy interactions.

The problem is, you might be watching for a long time. The same is true in biochemistry. If a particular interaction is very favourable, we’re not likely to see the molecules separate within the timeframe of our MD simulation: without some sort of reference state, we can’t calculate our free energy, and we never see any less-favourable states that might exist.

Instead, what we do is put our cat on a lead to force him to stay near various locations. There’s still a bit of give in the lead, so at each location where we restrain him, he’ll still shift towards the things he likes, and away from the things he doesn’t. Combining the probability distributions from each of these, we get a much more complete (though biased) probability distribution than before. These can be unbiased to get our free energy landscape, and from this we can determine the most favoured toy: it’ll be the one corresponding to the lowest free energy. (Naturally, it’s the random box.)

This is more or less the Umbrella Sampling (US) method.
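The step from "how often is the cat found somewhere" to "how favourable is that place" can be sketched in a few lines. This is only an illustration: it assumes the standard Boltzmann relation F = -kT ln P, which the post does not spell out, and the visit counts, the `kT` value and the function name `free_energy_from_counts` are all made up for the example.

```python
import math

def free_energy_from_counts(counts, kT=1.0):
    """Turn visit counts per location into relative free energies.

    Assumes the Boltzmann relation F = -kT * ln(P). The energies are
    shifted so that the most visited location sits at zero.
    """
    total = sum(counts.values())
    energies = {}
    for place, n in counts.items():
        if n == 0:
            # Never observed, so plain watching gives no estimate here.
            energies[place] = float("inf")
        else:
            energies[place] = -kT * math.log(n / total)
    lowest = min(energies.values())
    return {place: e - lowest for place, e in energies.items()}

# Hypothetical observations: how often the cat was seen at each spot.
visits = {"feather wand": 120, "laser pointer": 60, "random box": 820, "bath": 0}
landscape = free_energy_from_counts(visits)
favourite = min(landscape, key=landscape.get)
print(favourite, landscape)
```

The "bath" entry, with no visits at all, is the situation the lead is meant to fix: without a restraint that forces the cat (or the molecules) into unfavourable places, those states are never sampled and no free energy can be assigned to them.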
To be a little more technical, we determine the free energy between two states (such as two molecules interacting or separate; a ‘binding’ free energy) by first picking a reaction coordinate that will allow us to move from one state to the other (e.g. distance between the two molecules). We then perform a series of MD simulations (or windows) with the system restrained at various points along the reaction coordinate by adding an additional term to the force field, usually a harmonic potential, the shape of which gives the name ‘umbrella’ sampling. The probability distributions are calculated by recording the value of the reaction coordinate (or, equivalently, the force due to the harmonic potential) during each window, and are then unbiased and combined using an algorithm such as WHAM (I’ll be talking about this in more detail in the future!). The free energy landscape along the reaction coordinate that we get out is called a Potential of Mean Force (PMF) profile.

Again, if you’re interested in reading more, you could try the Wikipedia page, the original US article by Torrie and Valleau, or read up on WHAM.

### So what am I doing for GSoC?

Currently, MDAnalysis doesn’t have tools specifically for dealing with US simulations, both in terms of specific analysis (including WHAM) and handling the trajectories and associated extra data from each window. I’m hoping to make this a reality.

My project is currently divided into three main parts. Below is a brief overview for now; I’ll talk more about each as I encounter them and form a clearer picture of how they’ll play out. Each is associated with an issue on the MDAnalysis GitHub page, linked below, so you can pop over there for more discussion.

• Adding auxiliary files to trajectories. Usually when performing US simulations, the value of the reaction coordinate/force is recorded at a higher frequency than the full set of coordinates stored in the trajectory.
Being able to add this or other auxiliary data to a trajectory, and dealing with aligning the data when they’re saved at different frequencies, will be generally useful to MDAnalysis (not just for US). See the issue on GitHub here.
• An implementation of WHAM, perhaps a wrapper for an existing implementation, so WHAM analysis can be performed from within MDAnalysis. See the issue here.
• An umbrella class, to deal with the trajectories for each window, loading the auxiliary reaction coordinate/force data, and storing other relevant information about the simulations such as the restraining potentials used; passing the relevant information on to run WHAM and allowing other useful analysis, such as calculating sampling or averages of other properties of the system in each window or as a function of the reaction coordinate. Read the issue here.

### Timeline

Here’s my current projected timeline for dealing with these three tasks:

This is likely to change as I encounter new ideas and issues along the way, but hopefully by mid-August I’ll have something nice for dealing with US simulations in MDAnalysis to show off!

Phew, well that ended up being a monster of a post. Worry not, I don’t expect any posts in the immediate future to be this long! I’ll be back soon, with more on the adding-auxiliary-data part of this project and how my first week has gone. I’m currently experimenting with the design of this blog, and writing an ‘About’ page and a proper ‘Home’ page, so watch out for those, and see you again soon!

## May 23, 2016

### Levi John Wolf (PySAL)

#### GSOC Introduction

Hey all! While I may have put the cart before the horse in doing an RFC before an introduction, it’s never too late :)

My name is Levi John Wolf. I am a PhD candidate at Arizona State University. I study spatial statistics, econometrics, & polimetrics, focusing on campaigns, elections, & inequality.
I am a Google Summer of Code student for the Python Spatial Analysis Library (PySAL), under the Python Software Foundation (PSF). My project revolves around building a better core data model for PySAL. My aim is to make our statistical routines more generalizable, and to make it clearer how to work between PySAL and other common packages in the PyData ecosystem, like Pandas or Patsy. At its core, this will target abstracting specific implementation choices that tightly couple our statistical or computational geometry algorithms to specific interfaces, objects, or file formats.

I’ve had past contributions to this library and know quite a few of the contributors well. So, during the community bonding period, I did what I usually do, participating on the project gitter channel and engaging in issue tracker work. I’m an enthusiastic gitter user, having been one of the first developers in the channel. I’m also very familiar with the project contribution guidelines, having contributed before. In addition, I co-taught a workshop on PySAL this morning with one of my project mentors (& PhD advisor), Serge Rey, at the University Consortium for Geographic Information Science meeting.

Overall, I’m quite excited for the start of the coding period, and can’t wait for my deliverables to start coming together!

### Ranveer Aggarwal (dipy)

#### #0 Dot Init

This past month, I have been working hard to get familiar with (and my mentors harder to get me familiar with) one of the largest visualisation engines available on the web today. VTK, while being very powerful, is a tad low on documentation, and that has resulted in a slower start than I expected.

This week I worked on a small UI element.
A button (grey) and a cube (red)

### VTK

Here’s what VTK’s visualisation pipeline looks like:

Image Courtesy: Visualization Toolkit (VTK) Tutorial

Here are the resources I am using:

### Creating a button

My first task is to create the most basic visual interface element one can think of: a button. Since we need more control over how the UI should look, we’re going to have to do something more than using the built-in button widget. Currently, I’m using an ImageActor. But since I need to create an overlay, I’ll need a 2D actor instead of a 3D one. That’s what I’ll be doing next. Here’s how it looks currently:

A button click moves a cube

I’m maintaining the code here.

### Next Steps

Next up, I’ll be getting this button to work as a 2D overlay and also working on its positioning. Furthermore, I’ll be working on 3D sprites that always face the camera.

### tushar-rishav (coala)

#### PyAutoGUI

Recently, I came across PyAutoGUI, a cross-platform Graphical User Interface automation Python module. The module allows us to programmatically control the mouse and keyboard. That means we can write scripts to automate tasks that involve mouse movements/clicks or keyboard input. To understand it better, let’s write a simple script that draws a Symbol of Peace for us. If you don’t have a paint tool, you may try SumoPaint online for free.

Before our script executes, we will have the Brush tool selected. We could handle the selection of the brush tool in the script, but that depends on the position of the brush tool in the paint program, which differs between programs. So let’s get started.

Importing required modules. Nothing cool here.

Ideally, we would want to keep control over our automation script even when things go wrong. We could ask the script to wait after every function call, giving us a short window to take control of the mouse and keyboard if something goes wrong.
This pause after each function call can be implemented by setting the PAUSE constant in the pyautogui module to a numeric value (in seconds).

We may also want to add an initial delay to let the user select an appropriate paint tool.

The screen size can be obtained using the size method.

If you observe carefully, the Symbol of Peace is a big circle enclosing an inverted Y. The circular path can be traced using the parametric equation of a circle. Let us take the screen center as the circle center.

Mouse clicks can be implemented using the pyautogui.click method. A mouse click is a combination of two events:

• Pressing the button.
• Releasing the button.

Both combined make one click. pyautogui.click takes the x and y coordinates of the region to click upon. If these params are not passed, a click is performed at the current mouse position.

Let’s implement a mouse click to focus on the paint region.

Apart from a click, we can also drag the mouse cursor. PyAutoGUI provides the pyautogui.dragTo() and pyautogui.dragRel() functions to drag the mouse cursor to a new location or to a location relative to its current one. dragTo takes the x and y coordinates of the final position, while dragRel takes x and y coordinates and interprets them relative to the current position.

The origin lies at the top-left corner of the screen; the x-coordinates increase going to the right, and the y-coordinates increase going down. All coordinates are positive integers; there are no negative coordinates.

The next few lines create a circular path with an enclosed inverted Y. The idea is to use the parametric equation of a circle and keep incrementing the angle until one complete revolution, or an angle of 2*PI, has been swept.

Combining it all together

You may try and run this script after selecting a brush tool. Here is a Demo

Please note that this was only a brief introduction to GUI automation using Python. PyAutoGUI also provides a bunch of other functions, for example for performing hotkey operations (Ctrl + c for copy).
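The steps described above (the pause, the initial delay, the screen size, the click and the drag along a parametric circle) could be combined roughly as follows. This is a sketch and not the original script from the post: the function names `circle_points` and `draw_peace_symbol`, the radius, the number of steps and the stroke angles are assumptions made for illustration. Only `pyautogui.PAUSE`, `size`, `click`, `moveTo` and `dragTo` are taken from PyAutoGUI itself.

```python
import math

def circle_points(cx, cy, radius, steps=72):
    """Points on a circle from its parametric equation.

    x = cx + r*cos(t), y = cy + r*sin(t), with t swept from 0 to 2*pi.
    The first point is repeated at the end so that the path closes.
    """
    points = []
    for n in range(steps + 1):
        t = 2 * math.pi * n / steps
        points.append((int(round(cx + radius * math.cos(t))),
                       int(round(cy + radius * math.sin(t)))))
    return points

def draw_peace_symbol(pause=0.1, start_delay=5, radius=200):
    """Trace a circle with an inverted Y using the mouse (needs PyAutoGUI)."""
    import time
    import pyautogui  # imported here so circle_points works without it

    pyautogui.PAUSE = pause      # wait after every PyAutoGUI call
    time.sleep(start_delay)      # time to select the brush tool
    width, height = pyautogui.size()
    cx, cy = width // 2, height // 2

    path = circle_points(cx, cy, radius)
    pyautogui.click(*path[0])    # focus the paint region
    for x, y in path[1:]:
        pyautogui.dragTo(x, y)   # drag along the circle

    # Inverted Y: the vertical diameter plus two strokes from the centre
    # towards the lower half of the circle (screen y grows downwards).
    pyautogui.moveTo(cx, cy - radius)
    pyautogui.dragTo(cx, cy + radius)
    for angle in (math.pi / 4, 3 * math.pi / 4):
        pyautogui.moveTo(cx, cy)
        pyautogui.dragTo(int(cx + radius * math.cos(angle)),
                         int(cy + radius * math.sin(angle)))
```

Calling `draw_peace_symbol()` starts the drawing after the initial delay. Keeping the geometry in `circle_points` separate from the mouse control means the path can be checked without moving the cursor at all.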
If you find this module interesting you should check out its documentation.

Cheers!

### sahmed95 (dipy)

#### GSoC 2016 begins with Dipy under Python Software Foundation !

Hi, I am excited to announce that the proposal to implement IVIM techniques in Dipy as part of Google Summer of Code 2016 has been accepted and I will be working on it over the summer under the guidance of Ariel Rokem, Eric Peterson and Rafael NH (who was a GSoC 2015 student working on DKI implementation in Dipy).

Dipy is a Python library for the analysis of diffusion-weighted MRI (dMRI). Diffusion patterns can reveal microscopic details about tissue architecture and are used in clinical as well as neuroscience research. The intra-voxel incoherent motion (IVIM) model describes diffusion and perfusion in the signal acquired with diffusion MRI. Recently the interest has expanded and applications have emerged throughout the body including kidneys, liver, and even the heart. Many more applications are now under investigation such as imaging for cancer (prostate, liver, kidney, pancreas, etc.) and human placenta. One of its largest uses is in brain mapping and neuroscience research such as Parkinson’s disease where it is used to study aging and structural degeneration of fibre pathways in the brain.

In the presence of the magnetic field gradient pulses of a diffusion MRI sequence, the MRI signal gets attenuated due to motion, through the effects of both diffusion and perfusion. The IVIM model uniquely describes the diffusion and perfusion from data with multiple diffusion encoding sensitivities (b-values). The IVIM model represents the attenuation in the signal acquired with diffusion MRI using the following equation given by Le Bihan [1988]:

$\frac{S}{S_0} = f_\mathrm{IVIM} F_\text{perf} + (1 - f_\mathrm{IVIM}) F_\text{diff}$

where $f_\mathrm{IVIM}$ is the volume fraction of incoherently flowing blood in the tissue, $F_\text{perf}$ is the signal attenuation from the IVIM effect and $F_\text{diff}$ is the signal attenuation from molecular diffusion in the tissue.
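To make the equation concrete, here is a minimal numerical sketch. It is not Dipy code. It assumes the commonly used mono-exponential forms $F_\text{perf} = e^{-b D^*}$ and $F_\text{diff} = e^{-b D}$, which are not given in the text above, and the function name `ivim_signal` and all parameter values are hypothetical.

```python
import math

def ivim_signal(b, s0, f_ivim, d_star, d):
    """Signal predicted by the two-compartment IVIM equation.

    S / S0 = f_IVIM * F_perf + (1 - f_IVIM) * F_diff
    Assumption: F_perf = exp(-b * D*) and F_diff = exp(-b * D).
    """
    f_perf = math.exp(-b * d_star)   # attenuation from the IVIM (perfusion) effect
    f_diff = math.exp(-b * d)        # attenuation from molecular diffusion
    return s0 * (f_ivim * f_perf + (1.0 - f_ivim) * f_diff)

# Hypothetical parameter values, chosen only for illustration.
b_values = [0, 50, 200, 800]         # s/mm^2
signals = [ivim_signal(b, s0=1000.0, f_ivim=0.1, d_star=0.01, d=0.001)
           for b in b_values]
print(signals)
```

Under these assumptions the perfusion term dies out quickly as b grows, which is why data at several b-values are needed to separate the two contributions.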
I will be writing a new module in Dipy to calculate the various parameters of this model using this equation and hence obtain separate images for diffusion and perfusion. You can find the complete proposal and details here - IVIM in Dipy.

Le Bihan's original paper: http://www.ncbi.nlm.nih.gov/pubmed/3393671

### meetshah1995 (MyHDL)

#### Community.bond()

The community bonding period is over and now comes the real fun part - coding :D. To summarize my community bonding period, I completed the following exercises:

• Get familiar with the inherent decorators and generator structure of myHDL.
• Learn to model sequential and combinational logic hardware components in myHDL.
• Learn to read Chisel and interpret a model from code, which may be needed in further parts of the RISC-V CPU design.
• Read and understand RISC-V and its instruction set model.
• Search, understand and assimilate existing RISC-V implementations and choose one to focus on.
• Get started with the decoder implementation.

Other than these exercises, I also got to know more about my mentors and had fruitful discussions on the road ahead.

See you next week!
MS

## May 22, 2016

### Redridge (coala)

#### Europython

## Summer trip

So as mentioned in my previous post, I will be meeting some of my mentors and fellow GSoC interns at Europython. For those who haven't heard about Europython, it is a conference revolving around the Python programming language (wow, you did not expect that, did you?). To be honest I don't know much myself since this will be the first conference of this kind that I am attending.

## Europython = <3

This summer trip is going to take place between 17 - 25 July. From a student's perspective, covering the expenses for such a trip is not trivial. In my case these expenses include:

• Plane tickets
• Accommodation
• Conference ticket
• On site expenses

Here I would like to thank the people at Europython for sponsoring us by giving us free tickets.
It means a lot (of money)1 to us. With that all being said, I have already booked plane tickets and planned accommodation with my fellow coalanians23. ## Coding starts My post marks the end of the community bonding period. Stuff is about to get real now. It is 23 May 2 a.m. so let's get coding. 1. This is a joke. But it is true 2. People the are part of the coala community 3. I just discovered these footnotes and I am very happy ### Prayash Mohapatra (Tryton) #### Hello Fellow Developers around the World I spent the time fixing another easy issues. By now I am pretty confident about navigating through the codebase for fixing the bugs (grep -rl to the rescue!). As I have already posted, my project is about Porting the CSV Import/Export module from GTK Client (Tryton) to the Web Client (SAO). I am trying to make sense of problems faced by people when they post on mailing lists and mentors always there replying. Then again it feels good that real people whom I may have never known are using a project where I will be contributing. Honestly, I could have spent more time in community bonding period. Just waiting for my semester exams to end. ### udiboy1209 (kivy) #### Python Strings and the Headaches of Encoding ## Life is never simple For the last week of the community bonding period, I was finishing off the animation module for KivEnt. I added a JSON parser inside AnimationManager to read and store animation lists stored on disk in JSON format. Making this module was fairly simple. I had to use python’s json library to get a dict from the JSON file for reading, and convert a dict to a JSON string to store to a file. You must be thinking this would have gotten finished pretty easy, adding two functions in the AnimationManager and making a simple test JSON file in the twinkling stars example accompanying this module. But alas, things are never so simple in life. I mean, NEVER! 
Every python user, if they have worked with python 2, will have faced this infamous error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

What this essentially means is that the python string we have has a byte of value 0x80 somewhere, and our poor old ASCII encoding, in which python 2 strings are encoded, has no character for that value. The JSON parser module very smartly returns Unicode strings, which use a character encoding different from ASCII. But the animation names in my module were being stored as python strings, which are encoded in ASCII, and hence the error. The JSON parser has to do this because JSON files are almost always in UTF-8 format, which follows the Unicode encoding. Unicode supports a lot more characters than ASCII (the Euro symbol, for example), and hence is more widely used. Let me explain in detail what encodings are.

## Encodings - Why do they exist?

Computers, being the stupid chips of silicon they are, do not understand text as we do. They love numbers, binary numbers, and hence store, interpret and communicate all kinds of data in that format. So humans need to devise a system to convert that binary data to an understandable format, which is the desired symbol of the alphabet in the case of text data. A very simple example is ASCII, which is basically a table mapping a 7-bit binary number to an English letter or symbol. This is still quite popular and in use (Python 2 uses this encoding for the str data type, as we saw above).

ASCII originally was and still is only a 128-symbol encoding (7 bits), but there are a lot of extended versions which make use of the remaining 128 numbers left unassigned in a byte. So you have accented characters, currency signs and what not in those extended encodings (ISO Latin-1 is one such example). The problem is that there is no global standard which is followed for these extended encodings.
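To make the "no global standard" problem concrete, here is a small sketch, assuming a Python 3 interpreter (where `bytes.decode` takes a codec name). It decodes the same byte 0x80 with plain ASCII and with two extended code pages from the standard library, `cp1252` and `cp437`. The choice of code pages is mine, purely for illustration.

```python
raw = b"\x80"  # the single byte from the error message above

# Plain ASCII only defines the values 0-127, so decoding this byte fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Two "extended ASCII" code pages disagree about what byte 0x80 means.
as_cp1252 = raw.decode("cp1252")  # Windows-1252: the Euro sign
as_cp437 = raw.decode("cp437")    # original IBM PC code page: C with cedilla

print(ascii_ok)
print(as_cp1252, as_cp437)
```

Same byte, two different characters, depending on which table the reader happens to use.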
So a file containing accented characters written in one standard might be rendered as something else entirely on some other computer which uses some other standard.

That’s the problem python had with the number 128 (or 0x80) as one of the characters in the string. ASCII, the encoding used by python 2, has no symbol mapped to that number. Or to any number above 127. The number 128 stands for the Euro sign (€) in the Windows-1252 encoding (in ISO Latin-1 itself, 128 is an unprintable control code). What if some European programmer wanted to print out the Euro sign in one of his very useful programs which calculates your monthly expenditure? Or what if a Japanese programmer wanted to output logs in Japanese and not English? ASCII doesn’t have characters for that! Japanese people came up with their own standards to map Japanese symbols, ended up with 4 different such standards with no way to interchange between them, and made a mess of the ‘Information Interchange’ problem ASCII had tried to solve.

We need a standard which:

• Has support for all possible symbols and languages in use
• Is followed world-wide
• Can represent any symbol as 8-bit bytes (called octets in the Unicode specification)

## Enter Unicode

Unicode is like a mega mega version of ASCII. In the simplest sense, what it does is assign a number to every possible symbol that you would ever want to display on a screen. These numbers are decided by the Unicode Consortium, a group of large tech companies who are concerned about text encodings.

From Unicode’s Wikipedia entry: The Unicode Standard, the latest version of Unicode contains a repertoire of more than 120,000 characters covering 129 modern and historic scripts, as well as multiple symbol sets.

Okay, so our first 2 criteria are satisfied. We just need to convert these numbers into an octet format. We could just convert each number to binary, and split it into however many octets are required. Unicode is able to assign symbols to numbers up to 0x10FFFF, which fits in 3 octets.
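A minimal sketch of that naive fixed-width scheme, assuming Python 3 (where `ord` gives a character's code point and `int.to_bytes` is available). The helper name `encode_fixed_width` is made up for illustration.

```python
def encode_fixed_width(text):
    """Naive scheme: write every code point as exactly three big-endian octets."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)


encoded = encode_fixed_width("Hi")
print(encoded)       # two leading zero octets in front of each ASCII letter
print(len(encoded))  # three octets per character
```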
This is a very wasteful way to convert Unicode to octets, because most English text files would use 24 bits for each character when one only requires 7 bits. With each character we have 17 redundant bits. This is bad for storage and data-transfer bandwidths.

Most of the internet today uses a standard called ‘UTF-8’ for this conversion. It is an ingenious way which enables one to represent all of Unicode’s code points (numbers mappable to a symbol). It can be defined using these very simple rules:

• A byte starting with 0 is a single byte character (0XXXXXXX).
• A byte starting with 110 denotes the start of a two byte character (110XXXXX 10XXXXXX).
• A byte starting with 10 is a continuation byte to a starting byte for a character.
• 1110XXXX and 11110XXX denote starting bytes for 3 and 4 byte long characters. They will be followed by the required continuation bytes.
• The character’s code point is determined by concatenating all the Xes and reading them as a single binary number.

It becomes immediately obvious from the single byte characters that such an encoding supports ASCII and represents it by default as a single byte. Any valid ASCII file is a valid UTF-8 file.

For an example, let’s try decoding an emoji character:

    0xF0 0x9F 0x98 0x81

In binary:

    11110000 10011111 10011000 10000001

Look how the first byte tells us it’s a four byte character, so three continuation bytes are to be expected. Removing the UTF-8 header bits (11110 from the first byte and 10 from each continuation byte) and concatenating the remaining bits:

    000 011111 011000 000001 = 0x1f601

Looking up U+1F601 in the Unicode table we find that it is the grin emoji :D!

Most of the web has moved to using UTF-8. The problem with the python 2 data type str is that we need to explicitly declare a string to be of the unicode data type if we want support for special characters. And you need to be extra careful using one in place of the other, in which case you need to convert them.
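The decoding rules listed above can be sketched in a few lines. This is only an illustration, assuming Python 3 (indexing a `bytes` object gives integers); the function name is made up. It follows just those rules and does not reject overlong or otherwise invalid sequences the way a real decoder must.

```python
def decode_utf8_char(data):
    """Decode the first character of `data` (a bytes object).

    Returns (code_point, number_of_bytes_used).
    """
    first = data[0]
    if first >> 7 == 0b0:                # 0XXXXXXX: single byte (plain ASCII)
        return first, 1
    if first >> 5 == 0b110:              # 110XXXXX: two-byte character
        length, code_point = 2, first & 0b00011111
    elif first >> 4 == 0b1110:           # 1110XXXX: three-byte character
        length, code_point = 3, first & 0b00001111
    elif first >> 3 == 0b11110:          # 11110XXX: four-byte character
        length, code_point = 4, first & 0b00000111
    else:
        raise ValueError("byte %#04x cannot start a character" % first)
    if len(data) < length:
        raise ValueError("sequence is cut short")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:            # continuation bytes look like 10XXXXXX
            raise ValueError("expected a continuation byte")
        # Append the six payload bits of this continuation byte.
        code_point = (code_point << 6) | (byte & 0b00111111)
    return code_point, length


emoji = bytes([0xF0, 0x9F, 0x98, 0x81])
code_point, length = decode_utf8_char(emoji)
print(hex(code_point), length)
```

Python's built-in decoder agrees: `emoji.decode("utf-8")` gives the same character as `chr(code_point)`.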
Mix them up without converting, and python throws you that headache of an error because it can’t decode a few characters in ASCII. Python 3 avoids most of this: its str type is Unicode from the start, and because UTF-8 is backwards compatible with ASCII, it is the default encoding for python 3 source files. I wonder why we haven’t all dumped python 2 in some history archive and moved to python 3 yet :P! I have had too much of UnicodeDecode errors!

## Credits

This awesome video by Computerphile <3: https://www.youtube.com/watch?v=MijmeoH9LT4

#### Other News

I’m anxious for the start of the coding period for GSoC! And I also just launched my club’s new website: https://stab-iitb.org/electronics-club/ which is getting awesome reviews from my friends, so I’m kinda happy and excited too!

I also recently read ‘The Martian’ by Andy Weir, after having watched the really great movie. The book is at least 10 times more awesome than the movie! It’s a must read for any hacker, tinkerer and automater!

### Anish Shah (Core Python)

#### GSoC'16: End of Community Bonding

Today is the last day of the community bonding period. I start working full-time on GitHub integration from tomorrow. :) I had planned to blog at the end of every week during the community bonding period, but I couldn’t. I will definitely blog every weekend during the coding period. This blog post is about all the work that I have done in the last three weeks.

## Linking GitHub Pull Requests in issue comments

On GitHub, #<number> automatically links to an issue or a pull request. Likewise, on the Python issue tracker, strings like "#<number>" and "PEP <number>" get linked to an issue and a PEP respectively. Since we are integrating GitHub into our issue tracker, we would want to automatically link strings like "PR <number>" to a GitHub Pull Request. This was a fairly easy task to do as I got familiar with the code while doing the first task. You can find the patch that I submitted here. It was an easy task as I just had to add a regex to match strings like "PR <number>" and some tests for the same.
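A rough sketch of what such a regex could look like. This is not the submitted patch: the pattern, the function name and the URL template below are assumptions made for illustration, and the tracker's real code and link target may well differ.

```python
import re

# Hypothetical URL template; the tracker's real link target may differ.
PR_URL = "https://github.com/python/cpython/pull/{number}"

# "PR" as a whole word, optional whitespace, then the pull request number.
PR_PATTERN = re.compile(r"\bPR\s*(?P<number>\d+)\b")


def link_pull_requests(text):
    """Wrap every "PR <number>" in `text` in an HTML link to that pull request."""
    def make_link(match):
        url = PR_URL.format(number=match.group("number"))
        return '<a href="%s">%s</a>' % (url, match.group(0))
    return PR_PATTERN.sub(make_link, text)


print(link_pull_requests("Fixed in PR 42, see also PR 7."))
```

The word boundaries keep strings such as "PREFIX 12" or "SPR 5" from being linked by accident.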
## Adding GitHub PR to a b.p.o issue GitHub links a pull request to an issue if the title or comment has a string like "Fixes #123". Likewise, we want to automatically link a GitHub PR to an issue on Python Issue Tracker. For example, if a title/comment has a string like "Fixes bpo123", then the PR should be linked to issue 123 on Python Issue Tracker. This can be easily done using GitHub Webhooks. Webhooks allows us to subscribe to certain events on Github.com and when those events are triggered, Github sends a HTTP POST to a configured URL. GitHub Webhooks are pretty easy to setup and use. However, since I was not much familiar with the roundup codebase, I was not sure about the correct way to create an endpoint for the webhooks. Thanks to my mentor Maciej, I got a clear about the roundup codebase. At first, the code that I had written was very poor. But, I tried refactoring and I think the current code quality is much better than the previous one. But, it can still be improved a lot. I still need to add tests for this. You can find the current work here. After this is done, I also need to show and update PR status on the issue tracker. Thanks for reading. I’m very excited for the next 12 weeks. Let me know if you have any questions about my project. :) ## May 21, 2016 ### Avishkar Gupta (ScrapingHub) #### Community Bonding Hey there, this is the first in the series of posts aimed at documenting my experience as a GSoC 2016 student for regular reporting to the org, and for my own reference to have something to look back on in retrospect. For the first post, let me start with spouting off a little about myself, and then I’ll talk about how the experience has been to date. I’m a(or rather was) a final year student of Computer Engineering at a college in New Delhi, and I’ve been a part of the GSoC program back in 2014, where I worked for MATE desktop. 
For the last year or so, I’ve been dabbling in data science and machine learning, and came to know of Scrapy when I used it for one such project early on to scrape opinions from an e-commerce site. Fast forward a year, and while casually browsing through PSF’s list of GSoC organizations I got to know that these guys were also participating and decided to give it a shot.

Given how hectic the schedule was for me back in February, I would be lying if I said that the reason for my selection was me being some sort of a big-shot programmer who came in all guns blazing. I approached this as passively as one possibly could. It was due to the efforts of the ScrapingHub suborg admin Paul, who showed interest in my proposal, gave me an interview and put me to work on a bug which also made me familiar with the actual inner workings of Scrapy, that I was able to gain footing on the project.

My project for this summer is going to deal with refactoring the Scrapy signaling API, in an effort to move away from the PyDispatcher library, which would greatly enhance the performance of signals. Django moved away from PyDispatcher in 2008, and they reportedly observed an increase of up to 90% in efficiency. I intend to build off their work and assume we would see similar results in Scrapy.

The Scrapy community are a really active lot, and my “community bonding” started just a couple of days into the announcement of me being selected for the project, which is good because I’ve had exams for the past couple of weeks or so and was AFK for a major part of them (of course informing my mentors about the same first). I had a video chat with my mentor Jakob where we figured out how reporting etc. would work for the summer. Our next chat is scheduled for the 24th/25th, where we shall discuss how the actual implementation of the project will be and what plans I have for the same. I also finalised the work on a bug I was working on and had submitted in the form of a patch as a part of my proposal.
The Scrapy community is really responsive, and I’m honored to be a part of it and to be working with all the people here. I hope my work is up to their standards at the end of this. Since there is not much Technical content to write about at this point, with the coding period having not yet started so we’ll keep this one short, I’ll bore you with the technical details in the next one. Thanks for reading, the next post will go up on Sunday the 28th. Signing off. ### tushar-rishav (coala) #### EuroPython 2016 I am really pumped up for the forthcoming EuroPython conference being held in Bilbao, Spain from July 17-24. ### EuroPython in brief The EuroPython conference series was initiated by the European Python community in 2002. It started in Charleroi, Belgium, which attracted over 200 attendees and have surpassed the 1000 attendee mark in 2014. It’s the second largest Python conference world-wide and the largest in Europe. If you are interested, you should buy the ticket as long as they last. :) ### Purpose of visit I shall be conducting a session on a guide to make a real contribution to an open source project for novice and also be accompanied by the awesome coalains who will be attending this conference. We have a sprint scheduled too. :D I am grateful to the Python Software Foundation and EuroPython for sponsoring the accommodation and ticket for the conference. Without such aid it wouldn’t be possible to meet the awesome community out there. I must also pay my gratitude to Lasse Schuirmann for encouraging and helping to make this participation actually happen. :) Stay tuned! ### Adrianzatreanu (coala) #### Europython & GSoC start So this weekend has 2 huge things. First of all, as most may know, this weekend announces the end of the Community Bonding period. This means that the coding session begins on Monday. However, this is not the only thing that happens. Two days ago I just found out that I’m going to be sponsored by Europython with a ticket there! 
## What is Europython? EuroPython was the first major Python programming language community conference ever organized by volunteers. It started 2002 in Charleroi, Belgium, which attracted over 200 attendees. It now is the largest European Python conference (1200+ participants in 2014 and in 2015), the second largest Python conference world-wide and a meeting reference for all European programmers, students and companies interested in the Python programming language. ## When? I have already bought tickets and I’m planning to go there this year. It will be held from 17 to 24 July and it’s probably gonna be one of the most amazing experiences in my life. I’m going there with the amazing coalaians that I’ve been working with on the coala project (see https://github.com/coala-analyzer/coala for more information about what coala is) including (hopefully) my mentor. I’m so hyped about the conferences, the talks, the workshops and especially the coding & drinking sessions I will be having alongside my co-stayers. ### mike1808 (ScrapingHub) #### GSOC 2016: Hi # Who am I? Hi. My name is Mikael Manukyan (you can call me Michael or just Mike). I am a student from Armenia. Right now, I am in the last year of Russian Armenian University doing my master’s degree in Computer Science. Also, I have almost 3 years of experience in Web Development using Node.js and I am CTO of some local software development company. My fields of interest are Coding, Machine Learning, Neural Networks, Computer Vision and Natural Language Processing. I am a part of small team of highly motivated students who are passionate about Neural Networks. We are trying to find our place in the enormously fast growing field of Neural Networks. Our team is called YerevaNN. And our latest work is an implementation of Dynamic Memory Networks. I am happy to take part of this wonderful program which Google organize each year. A company for which I will work during the summer is scrapinghub and the project is Splash. 
# Why Splash? So, let’s explain to you why I chose this particular project. First for all, what is Splash? According to its documentation, Splash is a lightweight, scriptable headless browser with an HTTP API. It is used to: • properly render web pages that use JavaScript • interact with them • get detailed information about requests/responses initiated by a web page • apply Adblock Plus filters • take screenshots of the crawled websites as they are seen in a browser But it isn’t the reason why I chose it, the main reason is that Splash consists from: • Qt - for web page rendering • Lua - for interaction with web pages • Python - to glue everything together The variety of different programming languages and their interrelation is the main aspect why I thought: “This is a project on which I will love to work!”. Splash has a feature to interact with web pages using Lua scripts. Therefore, scripting is an experimental feature for now and it has a lack of necessary functions. And making Splash scripts more practical and useful is my main work for the summer. # What I will do for Splash? There are three main features/modules that I am going to add to Splash: 1. In Lua scripts ability to control the flow of commands execution (particularly, splash#go) using new API splash#wait_for_event 2. In Lua scripts ability to interact with DOM elements using new API splash#select 3. User plugins support ### splash#wait_for_event The current implementation of the splash#go method returns the control to the main program only when the page, which is currently loading, returns loadFinished signal. Signals are a part of Qt which are used for communication between various modules. For this particular case, signals are used to notify that page, e.g., has finished loading or some error has occurred during its load. The current behavior doesn’t allow to do something when the page hasn’t been fully loaded (e.g. there are some resources on the page that took very long time to load). 
I am going to add a new method which will allow to catch various type of signals and along with that a new parameter for splash#go event which will return the control back to the Lua translator right after its execution, without waiting for the page load. This method will allow to control the flow not only for splash#go but for the all other methods which depends on signals. ### splash#select Currently, in the Splash scripting there is no convenient way to click on the element, fill the input, scroll to the element, etc. This method will find the specified DOM element and return the instance of Element class. This class will manipulate with DOM elements in Lua. Adding new utility functions is the main part of my summer work. I decided that adding utility functions on the splash object (like splash#click) is not as good as adding utility functions to some class (like element = splash:select() and then element:click()). ### User plugins During my project exploration, I noticed on TODO comment. TODO: users should be able to expose their own plugins And I thought: “why not to add user plugins support?”. It is going to work the following way. If the user wants to add her own plugins, she specifies --plugins /path/to/plugins argument when starts Splash server. Plugins folder should contain two subfolders: lua and python. For Lua and Python files respectively. Lua folder is used to load custom Lua modules. Python folder contains a list of python classes. This classes should be inherited from SplashScriptingPluginObject class which will allow to user to load Lua modules. # What I did during Community Bonding Period? As I mentioned in the begging of the post I am graduating this year, so during Community Bonding Period I was working on my final exams preparation and my final project. Hence, I got very little time for GSOC. However, I did quite a big work exploring the inner structure for Splash before my GSOC acceptance. 
During my first PR I understood how the different parts of Splash are communicating with each other, how the tests are implemented and how the docs are written. Alongside, I tried to fix some active issues. Unfortunately, I didn’t manage to do it, however, when I tried to find the cause of bugs I dig very deep into Splash implementation. Also, I really thankful to my mentor who understood my current situation and allowed me to focus on my education. # What now? I think, this summer will be the productive one and I don’t miss any my deadlines. Wish to all GSOC students the same. ### Have fun and code :wink: ## May 20, 2016 ### liscju (Mercurial) #### Community Bonding - Part II(and the last one) In this week i have been working on the few issues, as well as preparing development environment. I prepared special repository on bitbucket to publish my commits there: https://bitbucket.org/liscju/hg-largefiles-gsoc It should make it easier for mentors/reviewers to see my current work and review it before being send to mercurial-devel mailing list. Second thing i did was to play a bit with amazon aws, later it will be used to test my work. Another thing i did was dealing with verify command for largefiles extension. So far it sent stat calls to remote server to get information if given file in store was valid. I changed it to send stat calls only for files that are not available locally, it reduces the round trip between server/client and also makes using verify without network connection possible. Patch for this is in review phase: https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-May/084071.html Next thing i continue to work on was to make largefiles compatible with python3. This issue deals with removing cycle import between localstore and basestore. 
So far we(me and developers on mailing list) did not find clear solution how this should be divided, probably the best we can do now is to move basestore._openstore(which is cause of cycle) to new module, i dont know if this module will have any other functionalities beside keeping this method. My previous tries to resolve this issue can be found here: https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-May/084224.html Another thing i started to work on was the problem with excessive password prompt when cloning repository with largefiles. User fills password information once and after downloading "normal" repository when largefiles want to download large files from server it asks for password once again. It is because the httppeer object connecting to the server is created separately for hg core and for largefiles extension. Each of those object has own version of passwordmanager remembering password. So far the best solution i found is to make passwordmanager singleton object which reuses password information for given url and user. I have no idea if this is good enough, so far i sent WorkInProgress patch to mailing list: https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-May/084490.html From the next week the "proper" part of gsoc begins, i hope it will be great :P ## May 19, 2016 ### Yashu Seth (pgmpy) #### Support for Continuous Nodes in pgmpy We are almost done with our community bonding period now. So, here I present an overview on the first phase of my GSoC project that would keep me busy for the first three weeks. This will be the first step in extending support for the continuos random variables. Here, I will be describing the chief features of the class ContinuousNode and how it can be used. Keeping in mind the interest level of the readers, I have dealt with examples only and have avoided the mathematical details on purpose. 
In case you are interested in the mathematical jargon, you can have a look at the API documentation of this class. Hope you will enjoy going through the post :-)

Continuous node Representation

The class ContinuousNode is derived from scipy.stats.rv_continuous. It can represent all univariate continuous distributions with a valid probability density function (pdf). To create an instance of this class, all you have to do is give a probability density function as input. You don’t have to worry about the other methods of the distribution. They are automatically created. So methods to compute the cumulative distribution function, the nth order moment etc. are automatically available. This is a property derived from its parent class. You can have a look at the scipy.stats.rv_continuous documentation to get more details on the various other methods that are supported.

Let us see an example. Here, I will represent a standard normal distribution. As explained earlier, we need a function to compute its pdf first. So let us define this,

    >>> import numpy as np
    >>> std_normal_pdf = lambda x: np.exp(-x*x/2)/np.sqrt(2*np.pi)

Let us now create an instance of this class and give this pdf as an input parameter.

    >>> from pgmpy.factors import ContinuousNode
    >>> normal_node = ContinuousNode(std_normal_pdf)

As I already mentioned, this is a subclass of scipy.stats.rv_continuous, hence we can use all the useful methods of this continuous distribution. For example,

    >>> normal_node.cdf(0)
    0.49999999
    >>> normal_node.moment(1)
    0.0

Now you must be thinking, why not use the scipy.stats.rv_continuous class directly? Or how can we use these nodes to create CPDs and incorporate them in our pgmpy models? The ContinuousNode.discretize method answers all these questions. It is a specific method that is used to convert these continuous distributions into discrete probability masses. My next post will be dedicated solely to this method and its examples.
It will also explain how we can create CPDs when there are continuous as well as discrete random variables involved in the models. I will try and finish my next post as soon as possible and will be back soon . Thanks, for taking your time out. I hope I will continue to interest you in my future posts as well. Thank You. ## May 18, 2016 ### Ramana.S (Theano) #### GSoC: Week 0 A little late post, but better late than never. So, Yay! My proposal got accepted for Google Summer of Code 2016 under Theano, a sub organisation of Python Software Foundation under the mentorship of Frédéric Bastein and Pascal Lamblin!! 😄 For those who are unaware of Theano, it is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation. My work for this summer will be focussed on improvising large graph’s traversal, serialization of objects, moving computations to GPU, creating a new optimizer_excluding flag, speeding up the slow optimizing phase during compilation and faster cyclic detection in graphs. The entire proposal with timeline of deliverables could be viewed here[1]. As the community bonding period is nearing it's end, I was finally done with my end semester exams last week and it was a pretty hectic couple of weeks. I had started my work in the reverse order with respect to my proposal, with "Faster cyclic detection in graphs". The work on new algorithm for detecting cycles in graph has been drafted by Fred during last November. I have resumed over that and carried it from there until we hit a road block, where the graphs do not pass the consistency checks with the new algorithm. So, I have moved on to the next task, and will come back to this once Fred’s schedule eases and when he could help me with this more rigorously as the code complexity is a little high for my understanding level. 
The next task (current) that I am working on is the optimization that moves the computation to GPU. Stay tuned for more updates!

[1]https://goo.gl/RBBoQl

Cheers.🍻

### Raffael_T (PyPy)

#### GSoC 2016, let the project begin!

First of all I am really excited to be a part of this year’s Google Summer of Code! From the first moment I heard of this event, I gave it my best to get accepted. I am happy it all worked out :)

About me

I am a 21-year-old student at the Technical University of Vienna (TU Wien) and currently working on my BSc degree in software and information engineering. I learned about GSoC through a presentation of PyPy, explaining the project of a former participant. Since I currently attend compiler construction lectures, I thought this project would greatly increase my knowledge in developing compilers and interpreters. I was also looking for an interesting project for my bachelor thesis, and those are the things that pretty much led me here.

My Proposal

The project I am (and will be) working on is PyPy (which is an alternative implementation of the Python language [2]). You can check out a short description of my work here.

Here comes the long but interesting part! So basically I work on implementing Python 3.5 features in PyPy. I already started and nearly completed matrix multiplication with the @ operator. It would have been cool to implement the matmul method just like numpy does (a Python package for N-dimensional arrays and matrices, adding matrix multiplication support), but sadly the core of numpy is not yet functional in PyPy. The main advantage of @ is that you can now write:

    S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

instead of:

    S = dot((dot(H, beta) - r).T, dot(inv(dot(dot(H, V), H.T)), dot(H, beta) - r))

making code much more readable. [1]

I will continue with the additional unpacking generalizations. The cool thing about this extension is that it allows multiple unpackings in a single function call.
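A small sketch of what these unpacking generalizations look like in practice, assuming a Python 3.5+ interpreter. The helper `collect` is made up for illustration and only echoes its arguments back.

```python
def collect(*values):
    """Return the positional arguments as a list, to make the unpacking visible."""
    return list(values)


args = (1, 2)

# Several * unpackings mixed with plain arguments in one call.
call_result = collect(*args, 3, 4, *args, 5)

# The same generalization works when building a list or a dict in an assignment.
merged_list = [*args, *range(3)]
merged_dict = {**{"a": 1}, "b": 2, **{"c": 3}}

print(call_result)
print(merged_list)
print(merged_dict)
```

Before Python 3.5, a call could contain only one `*` unpacking, and it had to come after the plain positional arguments.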
Calls like function(*args,3,4,*args,5) are possible now. That feature can also be used in variable assignments.

The third part of my proposal is also the main part, and that is asyncio and coroutines with async and await syntax. To keep this short and understandable: coroutines are functions that can finish (or 'return' to be precise) and still remember the state they are in at that moment. When the coroutine is called again at a later moment, it continues where it stopped, including the values of the local variables and the next instruction it has to execute. Asyncio is a Python module that implements those coroutines. Because it is not yet working in PyPy, I will implement this module to make coroutines compatible with PyPy. Python 3.5 also allows those coroutines to be controlled with “async” and “await” syntax. That is also a part of my proposal.

I will explain further details as soon as it becomes necessary in order to understand my progress.

Bonding Period

The Bonding Period has been a great experience until now. I have to admit, it was a bit quiet at first, because I had to finish lots of homework for my studies. But I already got to learn a bit about the community and my mentors before the acceptance date of the projects. So I got the green light to focus on the development of my tasks already, which is great! That is really important for me, because it is not easy to understand the complete structure of PyPy. Luckily there is documentation available (here http://pypy.readthedocs.io/en/latest/) and my mentors help me quite a bit.

My timeline has changed a little, but with that comes a huge advantage, because I will finish my main part earlier than anticipated, allowing me to focus more on the implementation of further features. Until the official start of the coding period I will polish the code of my first task, the matrix multiplication, and read all about the parts of PyPy that I am still a bit uneasy with.
My next blog post will cover the work of my first official coding weeks. Expect good progress!

## May 17, 2016

### chrisittner (pgmpy)

#### MLE Parameter Estimation for BNs

At the moment pgmpy supports Maximum Likelihood Estimation (MLE) to estimate the conditional probability tables (CPTs) for the variables of a Bayesian Network, given some data set. In my first PR, I’ll refactor the current MLE parameter estimation code to make it a bit nicer to use. This includes properly using pgmpy’s state name feature, removing the current limitation to int-data, and allowing the states that each variable might take to be specified in advance, rather than read from the data. The latter will be necessary for Bayesian Parameter estimation, where non-occurring states get nonzero probabilities.

Update: #694 has been merged, and MaximumLikelihoodEstimator now supports the above features, including non-numeric variables:

    import pandas as pd
    from pgmpy.models import BayesianModel
    from pgmpy.estimators import MaximumLikelihoodEstimator

    model = BayesianModel([('Light?', 'Color'), ('Fruit', 'Color')])
    data = pd.DataFrame(data={'Fruit': ['Apple', 'Apple', 'Apple', 'Banana', 'Banana'],
                              'Light?': [True, True, False, False, True],
                              'Color': ['red', 'green', 'black', 'black', 'yellow']})
    mle = MaximumLikelihoodEstimator(model, data)
    print(str(mle._estimate_cpd('Color')))

Output:

    ╒═══════════════╤═══════════════╤══════════════╤═══════════════╤═══════════════╕
    │ Fruit         │ Fruit(Apple)  │ Fruit(Apple) │ Fruit(Banana) │ Fruit(Banana) │
    ├───────────────┼───────────────┼──────────────┼───────────────┼───────────────┤
    │ Light?        │ Light?(False) │ Light?(True) │ Light?(False) │ Light?(True)  │
    ├───────────────┼───────────────┼──────────────┼───────────────┼───────────────┤
    │ Color(black)  │ 1.0           │ 0.0          │ 1.0           │ 0.0           │
    ├───────────────┼───────────────┼──────────────┼───────────────┼───────────────┤
    │ Color(green)  │ 0.0           │ 0.5          │ 0.0           │ 0.0           │
    ├───────────────┼───────────────┼──────────────┼───────────────┼───────────────┤
    │ Color(red)    │ 0.0           │ 0.5          │ 0.0           │ 0.0           │
    ├───────────────┼───────────────┼──────────────┼───────────────┼───────────────┤
    │ Color(yellow) │ 0.0           │ 0.0          │ 0.0           │ 1.0           │
    ╘═══════════════╧═══════════════╧══════════════╧═══════════════╧═══════════════╛

### Karan_Saxena (italian mars society)

#### Starting to apply for Google Summer of Code 2016 !!!

Just a heads up that I'm applying for Google Summer of Code 2016 with the Italian Mars Society, under the Python Software Foundation umbrella.

I hope they select my proposal and allow me to contribute during the summer :)

I will keep you updated.

Onwards and upwards,
Karan

### mr-karan (coala)

#### Community Bonding Period

The purpose of the Community Bonding Period is to discuss your project with your mentor, take feedback and get familiar with the code base, so that you can get started on your project from Day 1. With that in mind, I have written a small overview of what I’ll be doing in my GSoC work period. I have listed a brief description of all the tasks, as it will help me and my mentor plan things accordingly.

## Phase 1

I’ll be starting my work in the coala-bears repo. This will be a command line tool, where the user will be required to specify the bear name, the language that it will lint, etc. The bare minimum required for a bear to run will be set up for the user as boilerplate code, so that the user can work with the test files and then modify the bear according to the linter executable. coala-bears --create will be a CLI which will ask the user certain questions about the bear the user is going to create.
A sample file for the bear will be created based on the values entered by the user. Questions will be like:

    - is it an autocorrect bear or regex based?
    - bear name
    - language it will be used to lint

It will be packaged in a Python module and uploaded to PyPI.

## Phase 2

Will be to work on Lint, Output and Results.

• Navigation of Results: This change will be done in the Console Interaction class, where currently the user can only go forward through the errors reported by a bear. It would be nice to have a “go back to the previous” option as well.

• Embedded Source Code Checking: The code can be divided into different sections and then the appropriate bear can run on each section. The aim is that if a file mixes several syntaxes, like PHP with embedded HTML code, the linter for PHP should only run on the section which is PHP.

• Multiline Regex: Some linters have multiline error messages, but if the regex only parses a single line, the remaining part is lost. I will add a variable to enable multiline matching, which can be used in such cases.

• Multiple Lint Bear: The idea is that if I have a project with multiple files in different programming languages, then each bear specified by the user in a comma separated list should only lint the files it’s meant to and not other files.

## Phase 3

Will be to modify the UI of coala-bears.

In this phase, I plan to add features like Autocompletion, Syntax Highlighting and a Status Bar to the CLI. I’m going to use an excellent library, Python Prompt Toolkit, to achieve this task.

• For Autocompletion, I’m going to specify a way in which all bear names are added. Fuzzy string matching can also be implemented.

• For the Status Bar, I’m planning to show details of the filename; the bottom toolbar can have specific messages from the linter. I am still working on what more details can be added in these sections.

• I’m going to use Pygments lexers for Syntax Highlighting. The languages variable will be mapped to different Pygments lexers, and for a particular file extension the appropriate lexer would run.

## Phase 4

This phase is optional, depending on whether I am able to complete the other phases as scheduled and have some time remaining in my GSoC period. In this phase, I will be working on a website for coala-bears where the user can see all the bears listed in one place, categorized according to languages. There will be information about every bear, extracted from docstrings. A neat table will also be present for every bear, where we could have all the statistics, more info about the linter, the author of the bear, and an asciinema URL.

I hope to discuss these points with my mentor, and based on that I will be starting my coding on or before May 23. Stay tuned for the next updates from my side :)

### Preetwinder (ScrapingHub)

#### Google Summer of Code

Hello, my name is Preetwinder, and I am a 2nd year IT student in India. I usually program in Python, but with some practice can find my way around in Java, C++ and Haskell. I have a great deal of interest in Information Retrieval, Programming Languages and Distributed Systems. I have been selected to work on Python 3 support for frontera under the Google Summer of Code program, in which Scrapinghub is participating as a mentoring organization. My mentors will be Paul Tremberth (main mentor), Alexander Sibiryakov and Mikhail Korobov. I am very excited to be a part of this program and the frontera community. I hope to make a useful contribution to frontera.

Frontera is a distributed web crawling framework which, when coupled with a Fetcher (such as scrapy), allows us to store and prioritize the URL ordering in a scalable manner. You can read about frontera in greater detail here.

The past few weeks have been the community bonding phase of the program. During this time the candidates are supposed to get familiar with their mentors and the codebase of their organizations.
During this time I have prepared a better timeline, discussed the changes to be made with my mentors, and improved my understanding of how frontera works. I have split my task into two phases: in the first phase I will focus on improving tests and bringing Python 3 support to the single process mode. In the second phase (post mid-term evaluation) I will focus on improving tests and extending Python 3 support to the distributed mode. The major challenges in this project will be testing some components which are a bit tricky to test, and getting unicode/bytes to work correctly.

I hope to successfully port frontera, and have a productive summer.

Google Summer of Code was originally published by preetwinder at preetwinder on May 17, 2016.

### udiboy1209 (kivy)

#### SegFaults At 3 In The Morning

Segmentation Faults are a nasty sucker. Any moderately experienced C/C++ programmer would know that. But well, I didn’t. I had started coding in Java initially, which has an exception throwing tendency which is the inverse of Perl’s :P! And it throws very clear and precise exceptions with a stack trace to point out exactly where you messed up. After Java I moved to Python, which is no different in terms of how it behaves with run-time errors. Modern programmers love exception-handling and stack traces it seems!

Most of the contributions I have made to Kivy so far have involved only Python code, but Kivy heavily relies on Cython.

## Cython - The backbone of computational python!

Cython is a C-Extension for python, which means it compiles python code to C to optimize and improve speed and computational power. Cython also lets you write C-like code the “python way” and integrate it with existing python code. Naturally Kivy relies heavily on it – any Native UI application requires speed. Something like KivEnt would have been out of the question if Cython didn’t exist. You just cannot get the speed you need for a game engine with pure python.
A lot of scientific computation libraries like numpy and sympy use it too. These past two weeks I have been working on a new animation system for KivEnt, which let me get my hands dirty in Cython for the first time. I really liked the experience, and I’ll also need it for most of my GSoC! I have always found python insufficient or lacking when I’m using it for computationally intensive tasks. Cython takes all those worries away! ## KivEnt’s new AnimationSystem KivEnt essentially works using modular, inter dependant systems. These systems define how to process specific types of data components for each entity. Take the PositionSystem and Renderer (system which draws objects on the screen) – The PositionSystem will update a PositionComponent in the entity’s data to change physical position of the entity and Renderer will use the values stored inside this PositionComponent to draw that entity on the screen. Thats a very basic example, now try to think how a VelocitySystem would interact with PositionSystem and use VelocityComponent to modify PositionComponent and you’ll get the general idea of how KivEnt systems work :P. The AnimationSystem is something which can render animations. Animations are extremely simple. You display an image for some duration, and then switch it with another image. Each image displayed for a certain duration is called a frame. So for making an animation, you need to specify a bunch of frames which is essentially a list of {image, duration} values. Rendering images is handled by the Renderer so all the AnimationSystem has to do is wait for a frame to complete its duration and then change the render images for the entity, then again wait for its next frame. It’s job is so simple it could have been directly coded in python. In fact here is an example displaying just that: Simple Animation But we need a faster and more powerful alternate. Something which can handle thousands of entities in each update. Cython! 
Plus we also need to make an AnimationManager to load/handle all animations, auto load animations from files, etc. So it needed to be done in Cython!

## The mother of all SegFaults: Bad Pointers

Wait, why do we need to deal with pointers in python? Is there even such a thing? There is in Cython.

I mentioned before that Cython is a C-Extension. That means you have to pre-declare all your variables. With type. Which in turn leads to having to declare whether your variable is some type or a pointer to that type. You don’t have Python here to automatically assume (or work around) them for you during assignment.

The segfault I was encountering was because one of the functions was getting a null value instead of a pointer parameter. I had found this out by adding print statements to every function and checking where my program got stuck. This is a pretty stupid thing to do with segfaults. I wasted one whole day looking in the function which was apparently throwing the segfault, never realizing that the problem was in some other function passing the wrong parameter.

Well, I relayed this to my mentors and they suggested using this awesome tool for debugging: GNU Debugger. It can do a lot of uber-cool ninja-level stuff which I still have to learn, but the one thing that it surely does is give me a stack trace of the error which led to the segfault. But again, gdb stack traces for Cythonized C code are nasty as hell. Here’s an example:

    #0  0x000000000000000a in ?? ()
    #1  0x00007fffe7ea8630 in __pyx_f_11kivent_core_15memory_handlers_5block_11MemoryBlock_allocate_memory_with_buffer (__pyx_v_self=0x7fffd9fdccd0, __pyx_v_master_buffer=0x8fb4d0 <_Py_NoneStruct>) at kivent_core/memory_handlers/block.c:1162
    #2  0x00007fffe22d0e61 in __pyx_pf_11kivent_core_9rendering_9animation_9FrameList___cinit__ (__pyx_v_name=<optimized out>, __pyx_v_model_manager=<optimized out>, __pyx_v_frame_buffer=<optimized out>, __pyx_v_frame_count=<optimized out>, __pyx_v_self=0x7fffe0fa5e60) at ./kivent_core/rendering/animation.c:2022
    #3  __pyx_pw_11kivent_core_9rendering_9animation_9FrameList_1__cinit__ (__pyx_kwds=<optimized out>, __pyx_args=<optimized out>, __pyx_v_self=0x7fffe0fa5e60) at ./kivent_core/rendering/animation.c:1888
    #4  __pyx_tp_new_11kivent_core_9rendering_9animation_FrameList (t=<optimized out>, a=<optimized out>, k=<optimized out>) at ./kivent_core/rendering/animation.c:2900
    #5  0x00000000004b6db3 in ?? ()
    #6  0x00007fffe24d9280 in __Pyx_PyObject_Call (kw=0x0, arg=0x7fffe544e1b0, func=0x7fffe24d5900 <__pyx_type_11kivent_core_9rendering_9animation_FrameList>) at ./kivent_core/managers/animation_manager.c:2124
    #7  __pyx_pf_11kivent_core_8managers_17animation_manager_16AnimationManager_4load_animation (__pyx_v_loop=<optimized out>, __pyx_v_frames=<optimized out>, __pyx_v_frame_count=<optimized out>, __pyx_v_name=<optimized out>, __pyx_v_self=0x7fffe56d3410) at ./kivent_core/managers/animation_manager.c:1427
    #8  __pyx_pw_11kivent_core_8managers_17animation_manager_16AnimationManager_5load_animation (__pyx_v_self=0x7fffe56d3410, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at ./kivent_core/managers/animation_manager.c:1387
    #9  0x00000000004c4d5a in PyEval_EvalFrameEx ()
    #10 0x00000000004ca39f in PyEval_EvalFrameEx ()
    #11 0x00000000004c2e05 in PyEval_EvalCodeEx ()
    #12 0x00000000004ded4e in ?? ()
    #13 0x00007fffe28ef5a9 in __Pyx_PyObject_Call (kw=0x0, arg=0x7fffe0fac9d0, func=0x7fffe54510c8) at ./kivent_core/gameworld.c:13802
    #14 __Pyx__PyObject_CallOneArg (func=func@entry=0x7fffe54510c8, arg=arg@entry=0x7fffe80ca670) at ./kivent_core/gameworld.c:13839
    #15 0x00007fffe2901f2d in __Pyx_PyObject_CallOneArg (arg=0x7fffe80ca670, func=0x7fffe54510c8) at ./kivent_core/gameworld.c:13853
    #16 __pyx_pf_11kivent_core_9gameworld_9GameWorld_6init_gameworld (__pyx_self=<optimized out>, __pyx_v_callback=<optimized out>, __pyx_v_list_of_systems=<optimized out>, __pyx_v_self=<optimized out>) at ./kivent_core/gameworld.c:5916
    #17 __pyx_pw_11kivent_core_9gameworld_9GameWorld_7init_gameworld (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at ./kivent_core/gameworld.c:5630
    #18 0x00000000004b1153 in PyObject_Call ()

Google Summer of Code is a really great platform for students to learn, because everybody is assigned one or more mentors to help them out. I have mentors too. So why debug yourself :P. Just kidding! I had no clue how to interpret this because I was kinda new to Cython. Also it was 3 AM at that point and I was just too sleepy to look at any more of this stack trace!

My mentor told me to show him the stack trace and he helped me find the culprit. It was this:

    __pyx_v_master_buffer=0x8fb4d0 <_Py_NoneStruct>

The parameter master_buffer is being passed a None value! It was an easy debug after this. I wish I had known about this earlier. But quoting Kovak, one of my mentors: “Some of the most valuable experience is knowing what not to do.”

After this I encountered another segfault, and debugging that was a breeze. I had made a pointer assignment inside an if and used it somewhere outside.

## Twinkling stars!

So twinkling stars is an example I was using to debug my new code. It loads 50 animations with three frames, each having three successive images of a twinkling star animation.
The difference between each of the 50 is the duration of each frame, which is randomly assigned. I thought it would look beautiful. The results are pretty great:

This was a pretty satisfying result for me :D! I still have to add and test a few features before I can do a performance test, but this has 3000 stars with 1 of 50 different animations, and it runs pretty smooth on my machine!

## May 16, 2016

### shrox (Tryton)

#### Simplified Proposal

In this post, I will explain my GSoC proposal for absolute noobs who have no idea of anything software or open source.

Just as Microsoft Word has docx as its default format for saving documents, the default container for Open Document files, such as those used by Libre Office or Open Office, is odt (which stands for open document text). The way ODT works is that it stores a number of files in a compressed form, specifically in a zip container. These files are nothing but XML files.

XML files, as you must have often heard in the context of the web, are simply files that contain data, sorted with the help of tags. This data could be anything, and so could the tags. The following is an example of simple XML from w3schools:

    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don’t forget me this weekend!</body>
    </note>

As can be seen from the above XML, an XML file is only human readable text, inside human readable tags. This makes XML files amazing for version control.

Okay, here’s another software management term that I feel I should explain. Version control is keeping track of versions of your code base so that you can revert to any version later. What’s important is that all the versions should be human readable so that they can be compared with each other.

Coming back to ODT, we have seen that an odt file is simply a lot of XML files in a zip container. In simpler words, an odt is a zip file. Yes. You can extract it using any zip-unzip utility. Go on, give it a try.
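The same experiment can be sketched with Python's standard zipfile module. Since no real document is at hand here, the sketch builds a tiny stand-in archive in memory; the helper names are made up for this illustration, and the two entry names (mimetype and content.xml) are the ones an OpenDocument package normally contains, though a real odt holds several more files.

```python
import io
import zipfile


def build_stand_in():
    """Build a tiny zip archive in memory that mimics the layout of an odt."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("content.xml", "<note><to>Tove</to></note>")
    return buffer.getvalue()


def list_entries(data):
    """Return the names of the files stored inside the archive bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


def read_entry(data, name):
    """Return one stored file decoded as text."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.read(name).decode("utf-8")


fake_odt = build_stand_in()
names = list_entries(fake_odt)
content = read_entry(fake_odt, "content.xml")
```

With a real document, passing its path to `zipfile.ZipFile` should list the XML files it is made of in the same way.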
You might need to rename it to .zip if you’re using Windows, though (go Linux!).

All in all, ODT files are zip files that contain a number of human readable XML files that are good for version control.

But herein lies the problem - ODT files are not good at all for version control, since zip files need to be extracted and are not human readable!

But what if, what if we could take all those XML files and put the content of all those files into a single XML file? Why compress the files at all, right? That is, after all, where the problem lies.

So my primary work in this project will be to convert an FODT file to ODT. The FODT will be used where compression plays spoilsport - the part where there will be code and version control may be required. But the output that will be read by the end user should ideally still be in ODT, as that is the document format of choice.

FODT, or flat ODT files, are simply XML files that will open just the same in a word processor like Libre Office. They are not frequently used, as they are bigger in size than ODT files since they are not compressed, but that makes them very useful if code in a document needs to be compared. Of course, document files are not meant to store code, and hence FODT files may only have a niche use.

Have a look at the title and abstract of my proposal on the GSoC website here.

### aleks_ (Statsmodels)

#### Hello GSoC!

I am very happy that the statsmodels community has accepted my proposal and I am proud to be part of this year's GSoC. Now the last week of the bonding period (which is meant for getting in touch with the community/mentors) is approaching. This also means one week is left for final preparations before the coding starts. My feeling regarding the upcoming challenge is good - also because of my helpful mentors. They have given me advice on the best setup of the development environment and have supplied me with useful papers. Thank you Kevin, thank you Josef!
For all of you curious about the goals of my time-series-related project, take a look at its description. With that, thanks for reading! ### ghoshbishakh (dipy) #### Google Summer of Code with Dipy I know I am a bit late with this blogpost, but as you probably guessed from the title I made it into Google Summer of Code 2016!! Through out this summer I will be working with DIPY under the Python Software Foundation. ### So how did I make it? To be frank although I dreamt of getting into GSOC from 10th standard I never tried it whole heartedly before. And it was partly because I did not know how and where to start. But this time I was determined and more familiar with different open source projects and I started early getting involved with the community. After trying many organisations I finally found one where I could contribute something, be it tiny code cleanups or small enhancements. And trust me it feels just amazing when your first patch (pull request) gets merged into the master branch! Then I selected a project in this organisation, prepared an application and in the whole process my mentors helped me a lot with their valuable suggestions. And after that here I am! :) ### Project Overview The aim of my project is to develop a new website for Dipy from scratch with a custom content management system and admin functionality for maintenance. Another key feature of the website will be continuous generation of documentation from the dipy repository and linking with the website. This means that whenever a new version of dipy will be released the website will be automatically updated with the new documentation. Some other features include a visualization of web analytics and github data to showcase the fact that the dipy project is spreading worldwide and a tool to generate documentation of command line utilities. The backend of the website will be built using django, and some other python libraries like markdown and python-social-auth. 
For visualization I plan to use D3js library. For me the most challenging and interesting part of the project will be continuous generation of documentation. There can be many ways this can be achieved. For now we have thought of a process in which for every commit or release a build server will be triggered which will build the documentation using sphinx and this documentation will then be uploaded to the website. In this process the documentation of the command line utilities will also have to be generated and that is a challenge of its own. ### Community Bonding Period This part of the Google Summer of Code (April 23, 2016 - May 22, 2016) is called Community Bonding Period and I am discussing and refining the ideas with my mentors. We have weekly meetings and frequent communication through email and gitter. I have also set up my development environment and getting ready to start work. Although I have developed several small projects using django for my college and clubs I have never tried anything of this scale. So I am learning about the different challenges of deployment, security and scalability. I am trying to get familiar with the best practices and design patterns of django and learning how to test my code. Hope to have an amazing summer! :) ## May 15, 2016 ### Utkarsh (pgmpy) #### Google Summer of Code 2016 with Python Software Foundation (pgmpy) This all started around a year back, when I got introduced to open source (Free and Open Source, Free as in speech) world. Feeling of being a part of something big was itself amazing and to add someone will be using my work in something great this proved to be more than driving force needed to get me going. The more I worked the more addicted I got. In around October 2015 through my brother and some answers on Quora I came to know about pgmpy(A python library for Probabilistic Graphical Models), and since then I have been contributing continuously. 
Working with pgmpy has been a great learning experience. I learned lots of new things about Python which I didn’t know earlier and, of course, about Probabilistic Graphical Models. I also came to know about PEPs (Python Enhancement Proposals), and especially PEP8, which made Python code more beautiful to read.

## The Proposal

My proposal deals with adding two new sampling algorithms to pgmpy, namely:

• Hamiltonian Monte Carlo or Hybrid Monte Carlo (HMC)

• No U Turn Sampler (NUTS)

If these names don’t ring a bell, there is no need to worry; even I wasn’t familiar with them before February 2016. Some more blog posts from my side and hopefully you will feel at home with these terms. These two algorithms have become quite popular in recent times due to their accuracy and speed.

Hamiltonian Monte Carlo (HMC) and No U Turn Sampler (NUTS) are Markov chain Monte Carlo (MCMC) algorithms / methods. A Markov chain is a transition model with the property that the probability distribution of the next state in the chain depends only on the transition function associated with the current state, not on the other preceding states in the process. A random walk in a Markov chain gives a sample of that distribution. Markov chain Monte Carlo sampling is a process that mirrors this behavior of a Markov chain.

Currently pgmpy provides two sampling classes: one offers a range of algorithms, namely Forward sampling, Rejection sampling and Likelihood weighted sampling, which are specific to Bayesian Models; the other is Gibbs sampling, an MCMC algorithm that generates samples from both Bayesian Networks and Markov models.

Hamiltonian/Hybrid Monte Carlo (HMC) is an MCMC algorithm that adopts physical system dynamics rather than a probability distribution to propose future states in the Markov chain.

No U Turn Sampler (NUTS) is an extension of Hamiltonian Monte Carlo that does not require the number of steps L (a parameter that is crucial for good performance in the case of HMC) to be specified.
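To give a feel for what “physical system dynamics” means here, below is a minimal HMC sketch in plain Python. It is not pgmpy code and does not reflect the API planned in the proposal. The target (a one-dimensional standard normal), the function names and the default step size are all assumptions chosen for illustration. The argument `n_steps` plays the role of the number of steps L mentioned above.

```python
import math
import random


def potential(x):
    """Potential energy U(x) = -log p(x) of a standard normal, up to a constant."""
    return 0.5 * x * x


def grad_potential(x):
    """Derivative dU/dx of the potential above."""
    return x


def leapfrog(x, p, step_size, n_steps):
    """Simulate Hamiltonian dynamics for n_steps with the leapfrog integrator."""
    p = p - 0.5 * step_size * grad_potential(x)        # half step for momentum
    for step in range(n_steps):
        x = x + step_size * p                          # full step for position
        if step != n_steps - 1:
            p = p - step_size * grad_potential(x)      # full step for momentum
    p = p - 0.5 * step_size * grad_potential(x)        # closing half step
    return x, p


def hmc_sample(n_samples, step_size=0.2, n_steps=10, seed=0):
    """Draw samples; each proposal is the end point of a simulated trajectory."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)                        # fresh random momentum
        current_h = potential(x) + 0.5 * p * p
        x_new, p_new = leapfrog(x, p, step_size, n_steps)
        proposed_h = potential(x_new) + 0.5 * p_new * p_new
        # Metropolis test corrects for the integrator's discretisation error.
        if rng.random() < math.exp(min(0.0, current_h - proposed_h)):
            x = x_new
        samples.append(x)
    return samples


samples = hmc_sample(2000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Choosing `step_size` and `n_steps` by hand is exactly the tuning burden that NUTS is meant to reduce: it decides how long each trajectory runs instead of taking a fixed L.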
Post Script: When I’ll be finished with my first half of the project I’ll write a series of posts which will serve as an introduction to probabilistic sampling and Markov Chain Monte Carlo, specifically with introduction to Hoeffding’s inequality, Markov Chains, MCMC techniques such as Metropolis-Hastings, Gibbs sampler, and HMC. ## May 11, 2016 #### Hello World Hello world! I recently was given the amazing opportunity to contribute to MDAnalysis, an open source Molecular Dynamics simulation Analysis project through the Google Summer of Code initiative. I’ve been encouraged to maintain a blog by those giving me this opportunity so I’ll start things off by explaining how I got this great summer job. To summarize it quickly, Google sponsors a program in which college students apply to work on projects for open source software organizations. I was very lucky to have my research advisor, Dr. Ashley Ringer McDonald, encourage me to apply. I satisfied the first application requirement by learning how to use git and closing an issue on the MDAnalysis Github page. After that, I spent about 40 hours of concentrated effort over spring break studying dimensionality reduction and molecular dynamics in order to write a coherent application. By turning in a rough draft early I ensured the process was iterative; the contributors to MDAnalysis were very helpful with their critiques of my application. After turning in my final application, the process didn’t really stop. I made sure to keep making pull requests and to learn more about development workflow. I have learned so much about workflow and how to get over the dread of starting a pull request in the past few months. And then I got the news! I had been accepted to the Google Summer of Code! I was and still am extremely excited. With that being said, success made me lazy and somewhat complacent. Recently, I have been doing the bare minimum in terms of work and that is about to change. 
Even if no one is reading this, consider this blog as the first step in accountability for the rest of the summer. I will be using this to keep a record of everything I am working on day to day. With the exception of this introduction post, every post will attempt to keep a focus on a particular issue. I might write a post about a topic related to my Summer of Code work, or something related to my many other interests. I endeavor to remain positive and thoughtful, I will work on my clear overuse of commas, and I will try to make my readers laugh. I look forward to keeping this up! JD out. ## May 10, 2016 ### Adrianzatreanu (coala) #### Community bonding So for this part of the project, called “Community bonding”, I have to prepare the things for my project. What is a better way of doing this besides talking to my mentor? Probably none. Today I had my first project-only related online conversation on Skype, and it was amazing. I got to talk to this man that is going to help me finish my project throughout this summer. ### What do I have to do? I have to get my stuff done in this pre-coding part. And what does it involve? Well, as written in my proposal, I will have to settle two things: • how will I make the installation as great as possible, so basically the user experience must be amazing • decide how to make the package managing as convenient as it can be ### How will I do it? This is where I go into what was being discussed today. The answer for the first one is quite debatable. We both had many ideas, but after brainstorming a little and deciding better, we ended up with an amazing solution. Old installation wizard experience. The installation script will have at first 3 options: 1. Install all the bears 2. Install the recommended bears (which will probably include the general ones and a few for, let’s say, Python, C maybe? this is an easy aspect and not so important) 3. 
Custom installation So basically the custom installation will throw a numbered list with all the bears in the terminal interface, and the user will have to input the numbers of bears that he wants. This will probably be the one that will take the most time, but I still think it’s cool, giving the user the liberty of having anything he wants. For the second question, there’s no optimal solution reached yet. But we still have a lot of time, right? 13 more days. However, we didn’t want to call it a day without having at least a solution. And there it is: Package managing will be done by simply updating the requirements for each bear upon the time the requirements are updated. Sure, there may be some bears in worst case scenario that use many. But that is rare. Well, this is a solution after all, and whether it may not be so efficient, there’s still time, and it works, which may be enough for what we need for this project. ### Redridge (coala) #### Becoming a google summer of code student ## Acceptance First of all, I want to follow my first ever blog post, with good news. As you probably guessed from the title I got accepted for the Google summer of code program!!! Yoohoo!!! I will be working with coala under the Python Software Foundation. ### Community While the coding hasn't yet started, Gsoc is already under way with the first phase called community bonding. Basically you have to get to know the community and the project's codebase better. This proved to be quite an interesting experience for me since I have to schedule calls with my mentor who is from India and the time gap is something I am not really used with. Since I wrote a proposal for OWASP also, and I had the chance to interact with another community, I think it is safe to say at this point that coala admittedly has one of the friendliest and well structured communities. Gitter channels, google groups, mailing lists, all set up just to make sure everyone stays in touch. 
And not everything is about coala. For example, a couple of days ago some of us met online to do a bit of bonding via gaming (very good choice btw). I have to admit that I could not enjoy the game to its fullest because of the latency, but it was still fun. I hope we do more of these gaming/anything meetups.

### Getting ready to work

Yes, it is the community bonding period and you are not really supposed to code much yet, but there are a couple of things that have to be done. In my proposal I left some issues to be discussed later on. My mentor Udayan suggested that this would be a good time to talk about these and "list down the finer details".

### Europython

One of the best parts of being a GSoC student for coala is that we will meet at Europython this summer. I really look forward to this trip since I haven't attended a conference like that before. Also, I have never been to Spain :D

### Wrap up

All in all, I am eager to start doing the stuff that really matters, or maybe I am just eager for the summer to come faster? I don't really know, but anyway, there goes my first ever blog post after my first ever blog post.

## May 09, 2016

### Vikram Raigur (MyHDL)

#### A play with MyHDL

It has been a long time since an update. Sorry, everyone, for the delay. This week was full of surprises for me. I started using MyHDL and got to know how things actually work. I will explain my experience with different attributes and modules one by one. I will refer to Verilog a lot in this blog because I am comfortable with it.

1. The MyHDL signal is similar to a VHDL signal. I see an analogy between the MyHDL signal’s next attribute and non-blocking Verilog assignments:

    always @(posedge clk) begin
        a <= b;  // non-blocking assignment
        c <= d;  // non-blocking assignment
    end

The MyHDL counterpart, which behaves like a VHDL signal assignment:

    @always_seq(clk.posedge, reset=reset)
    def logic():
        a.next = b  # non-blocking assignment
        c.next = d  # non-blocking assignment

where a, b, c and d are Signals in MyHDL.

2.
I was practising with MyHDL and needed the equivalent of continuous assignments such as:

    assign a = b;
    assign c = d;

I tried a plain Python assignment and it did not work. I contacted Chris Felton with my problem and he showed me a nice way to write such assignments:

    @always_comb
    def assign():
        a.next = b
        c.next = d

    return assign

3. Continuing the journey, I checked whether MyHDL accepts a 2-D array as a module input. Unfortunately I was unable to convert my code, because Verilog does not accept 2-D array inputs; it was my mistake to expect such a feature. This may become a new feature in MyHDL soon. Then I tried passing a list of signals as an input, and it failed during conversion. Verilog does not accept a list of signals as an input unless the input is declared as a wire, i.e.

    input wire [4:0] inputlist [0:63]  // not a 2-D array: a list of 64 entries, each 5 bits wide

To solve this issue my mentor Josyb suggested using a wrapper that takes N Signals as inputs, wraps them into an array (not an input array), processes them and then unwraps them. I also tried a different method, shown below:

    from myhdl import Signal, intbv, ResetSignal, toVerilog
    # INACTIVE_HIGH, INACTIVE_LOW, ACTIVE_LOW and huffman are defined elsewhere (not shown in this post).

    def test():
        iPixelBlock = [Signal(intbv(0, -1 << 11, -(-1 << 11) + 1)) for _ in range(64)]
        clk = Signal(INACTIVE_HIGH)
        enable_in, enable_out = [Signal(INACTIVE_LOW) for _ in range(2)]
        # `async` became a reserved word in Python 3.7; newer MyHDL releases name this argument `isasync`.
        reset = ResetSignal(1, active=ACTIVE_LOW, async=True)
        inst = huffman(huffman, enable_out, iPixelBlock, enable_in, clk, reset)
        return inst

    toVerilog(test)

It works well, and the reason is simple: the list of signals is not an input to the block being converted.

4. We have two reference designs to work from: the VHDL version by Michal Krepa and the Verilog version by David Klun. Josyb, Mkatsimpris and I decided to focus on the VHDL version because its cores are more modular and scalable. They are also well suited to independent testing.

The next post will contain my GitHub link and some modules which I designed for practice.
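Coming back to point 1, the "non-blocking" behaviour of the next attribute can be imitated in plain Python. The sketch below is only a toy model: `ToySignal`, `commit` and `clock_edge` are made-up names for illustration and are not part of MyHDL. It assumes the usual two-phase idea, where every process first reads current values and writes to next, and only afterwards all signals are updated together.

```python
class ToySignal:
    """Toy stand-in for a hardware signal; NOT the MyHDL Signal class."""

    def __init__(self, val):
        self.val = val    # value visible to readers during this clock cycle
        self.next = val   # value scheduled for the following cycle

    def commit(self):
        # Make the scheduled value the current one.
        self.val = self.next


def clock_edge(signals, process):
    """Run one process, then update every signal at the same time."""
    process()             # phase 1: read .val, write .next
    for sig in signals:   # phase 2: all updates happen together
        sig.commit()


a = ToySignal(1)
b = ToySignal(2)


def swap():
    # Both right-hand sides see the values from before the clock edge,
    # so the two signals exchange values without a temporary variable.
    a.next = b.val
    b.next = a.val


clock_edge([a, b], swap)
print(a.val, b.val)  # 2 1
```

With ordinary (blocking) assignments the second statement would see the already updated `a`, and both signals would end up with the same value. That difference is what the Verilog `<=` operator and the MyHDL next attribute have in common.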
Thanks for going through the post. Have a nice day . ## May 07, 2016 ### mr-karan (coala) #### Participating in GSoC 2016 ## Introduction: So my project proposal got accepted at Google Summer of Code’16. Hurray! I will be working with coala under Python Software Foundation. ## How it all began I had discovered coala from Indiahacks Open source track and began contributing since the last week of February. In almost a week, I had read the documentation and learnt some advanced git workflow like squashing commits. I had also solved some newcomer issues to get a feel of how to contribute to open source software. The experience of contributing to coala has been great, mainly because the community is very friendly and they guide you just enough so that you stand on your feet for yourself. Also, getting your code accepted upstream is a not an easy task. The community is very strict about top notch commit style, which greatly improved my skills not only as a good developer but also how to communicate effectively with other teammates. This is one thing that I really liked about coala and I stuck with them because they appreciate the inner details and will accept your bug fixes only if you meet their high standards. The result of this was, I learnt how to write meaningful commit messages and not just ambiguous one-liners. ## The journey so far I am really excited about my project Extend Linter Integration and I will be implementing the following changes in the coming summer. • A new coala-bears --create command line tool to ease the process of creating new Bears. • Work on extending Lint class. • Provide command line interface improvements using Python Prompt Toolkit. Contributing to open source is an amazing thing that every self-respecting developer should experience. It helps you grow at a much faster rate and you interact with some of the best minds, sitting in the other half of the world. Hope to have a great time ahead in the summers. Cheers! 
## May 06, 2016 ### Abhay Raizada (coala) #### GSoC: My first blog ever! My proposal for GSoC ’16 has been selected and so my journey towards an awesome summer has begun. When i started contributing to open source i had only the bookish knowledge learned from school and college and only a little bit of practical experience, i had no idea how a real life software was run and maintained, then i found syncplay . It was a software I used daily and got to know after a little while that it was Open Source, by that time i knew what Open Source meant but I hadn’t dove deep into it yet, so i began making some changes to it’s code-base(all hail open-source!). A week or two later i had learned a lot, i learned about decorators, socket-programming, the twisted framework, utf-8 encoding and also some of the nitty-gritties of coding. I had known about the GSoC Program(one of my friends had participated earlier), and got to know(from the contributors) that syncplay wouldn’t be participating, so i started searching for organisations that’d be looking to participate and stumbled upon coala which is a static-code analyzer(though saying just this is doing it injustice). the coala-community in one word is awesome! i have never seen a community in my little experience that is so helpful to newcomers! it took a little time getting used to learning how to operate the software at first but once i got used to it, it was and is still an amazing experience. Working beside sils, AbdealiJK, Makman2 and all of the coala community has been an awesome learning/entertaining experience till now and i can’t wait to imagine what it would be like once the summer gets started!. As far as my Project goes it deals with creating Indentation algorithms that would be language independent. It would automatically correct indentation(thanks to awesome diff management by coala) and would also look for cases when lines get too long and would even break the lines once it gets completed. 
Open source has been a really fascinating experience so far. Apart from GSoC i’m looking to contribute to a lot of open source in my spare time i’d like to finish my PR for syncplay, and find some other projects to contribute, i have a few ideas myself, let’s hope they see the light of day😉. All in all I’ve learned a lot from these few months contributing. I’ve developed habits like watching talks, reading blogs, all of them being so informative! I’ve learnt a lot about programming writing not only code, but writing efficient, well formatted code. I’ve had glimpses of frameworks, learnt new types of software. Being an IT student was never as exciting as it is now. ## May 05, 2016 ### shrox (Tryton) #### The Beginning This is where my GSoC journey begins, for all practical purposes. Right now in the ongoing community bonding period I hope to do the following - 1. Get done with issue5258 that I had assigned to myself a while back. 2. Get better acquainted with lxml, a Python XML processing library. 3. Refer to Relatorio’s codebase in order to understand how the final converter will be used. 4. Use Relatorio to generate reports with the existing codebase. ## May 03, 2016 ### Sheikh Araf (coala) #### Google Summer of Code '16 with coala So, my proposal for Google Summer of Code 2016 got accepted. Yay! And now that my exams are over, I finally have some time to blog about it. So here we go. Earlier this year I decided to dip my toe in the water of contributing to open source projects. I came across coala, a language independent static code analysis framework. I started off by fixing some of the simple newcomer issues. This helped me understand the code base of the project. The best part was that the coala community was extremely friendly and always helpful. I came with no long-term plans, but I had a really fun time learning new stuff, so I had to stay. Later Google Summer of Code was announced and coala was participating under Python Software Foundation. 
I submitted a proposal and it got accepted. So this summer I will be building a plugin to integrate the awesome coala code analysis framework with the Eclipse IDE. I have a coarse idea of the project and I look forward to discuss it with my mentor Harsh Dattani in the next few weeks of the community bonding period. ### kaichogami (mne-python) #### GSoC 2016 with MNE Hello everyone. Its been a while since I lasted posted here. I am glad and excited to say that my project was accepted under GSoC 2016. I will be working under MNE-python, a library to process brain signals. I am grateful to Denis and Jean for helping me out at every point of time. The proposal looks so good only because of them. I will begin working on project as soon as my exams finish. An update listing out the new changes should be out every 2 weeks ideally. I have created a checklist to easy manage the deadline completion. Here is a link to proposal. In short my project involves changing various transformers to follow “2-D X,Y” input/output and creating a Pipeline class to chain various transformers. Thank you for reading! ## May 02, 2016 ### Yen (scikit-learn) #### Hello Google Summer of Code! In this summer, I will participate in Google Summer of Code (GSoC for short), a program offers student developers stipends to write code for various open source projects. My proposal for GSoC, Adding fused types to Cython files, which aims to enhance the popular machine learning library scikit-learn has been accepted and will be supervised by two mentors from scikit-learn community: Jnothman and Mechcoder. Below, I’ll briefly describe the work I’d like to achieve during GSoC. ## Proposal Abstract The current implementation of many algorithms in scikit-learn, such as Stochastic Gradient Descent, Coordinate Descent, etc. only allow input with np.float64 and np.int64 dtypes due to the adoption of Cython fused types may result in explosion of the generated C code. 
However, since scikit-learn has removed Cython files from the repo and re-generate them from every build, it provides a good chance to refactor some of the “.pyx” files by introducing Cython fused types. This will allow those algorithms to support np.float32 and np.int32 dtypes data, which is currently casted into np.float64 and np.int64 respectively, and therefore reduce the waste of memory space. You can find the detailed version of my proposal here! ## Example Here, I’ll use an example to illustrate how Cython fused types can benefit the whole project. mean_variance function in scikit-learn, like some algorithms I mentioned in my proposal abstract above, will explicitly cast np.float32 data into np.float64 before this pull request, which yields waste of memory. However, after we introduce Cython fused types into this function’s implementation, it can now accept np.float32 data directly. Results of this enhancement can be visualized via memory profiling figures showed below: • Memory usage before using fused types • Memory usage after using fused types As one can see, memory usage surrounded by the bracket drastically decrease. ## Summary I believe that scikit-learn’s memory efficiency can be hugely improved after I add fused types into existing Cython files in the project. On the other hand, great thanks to scikit-learn community for giving me this golden opportunity to work on an open source projects I use every day. Really Looking forward to this productive summer! ## May 01, 2016 ### Anish Shah (Core Python) #### GSoC'16: Community Bonding (1st Week) It is already one week into the community bonding period and I have already done a lot of new things. I talked to my mentor Maciej Szulik over email and he gave me some tasks - to setup the b.p.o environment locally and then to add a GitHub Pull Request URL field on issues page. You can find all the details about what I learnt this week below. :) ## Docker I have been hearing about Docker for many months now. 
But, I have never got an opportunity to use it, as I generally use pip and virtualenv to quickly setup most of my Python projects. But, what’s awesome about Docker is that it is not just limited to Python projects. It allows you to package any project and its dependencies into a single unit. Generally, people think of Docker as a VM. VMs generally run Guest OS on top of hypervisors. However, Docker creates containers that include applications and its dependencies. They run as an isolated process on the host OS. Virtual Machine and Docker Architecture Picture courtsey: Docker.com This allows the developers to quickly setup any application on any computer. It eliminates environment inconsistencies. I setup the Python Issue Tracker on my local machine using Docker. If you want to set it up locally, you can find the repository here. You can easily build the docker image using the following command  docker build -t <image-name> <path-to-Dockerfile>


To run the Docker container, you can run the command below. You can read more about Docker commands here.

$docker run [OPTIONS] IMAGE[:TAG|@DIGEST] [COMMAND] [ARG...]  ## Template Attribute Language (TAL) I have been using Django and Flask to create some web apps. They have template engine to create dynamic pages. Flask uses Jinja template engine and Django has its own template engine. Python Issue Tracker uses a templating language called as Template Attribute Language (TAL) to generate dynamic HTML pages. TAL statements are embedded inside HTML tags. It uses tal as namespace prefix. You can read more about TAL here. ## First GSoC’16 Task To get familiar with the b.p.o codebase, I was given a small task by my mentor. I had to add a new field on the issue page, so that Developers can submit GitHub pull requests URL. The issue page should show a table of GitHub PRs related to the issue. I completed the task and submitted a patch for it. You can follow the progress here. That’s it for this week. Thank you for reading. Do comment down below about what do you think about this post or any questions for me. See you guys next week. ## April 29, 2016 ### Avishkar Gupta (ScrapingHub) #### The First Post This blog will contain weekly reports I write as a GSoC student for Scrapinghub. ### Nelson Liu (scikit-learn) #### (GSoC Week 0) How fast is fast, how slow is slow? A look into Cython and Python The scikit-learn tree module relies heavily on Cython to perform fast operations on NumPy arrays, so I've been learning the language (if you can even call it that) in order to effectively contribute. At first, I was a bit skeptical about the purported benefits of Cython -- it's widely said that Python is "slow", but how slow is "slow"? Similarily, C code is known to be "fast", but its hard to get a grasp on the performance difference between Cython and Python without directly comparing them. 
This post summarizes a quick (and extremely unscientific) experiment I did comparing the performance of raw Python, Python code running in Cython, and Python code with static typing running in Cython. These results may not generalize to whatever application you have in mind for Cython, but they're suitable for seeing the existence of performance differences on a CPU-heavy task. ## Why would I want to use Cython? Cython combines Python's ease of use with C performance to help developers optimize their Python code or create a fast Python interface to their C code. To understand how Cython improves the performance of Python code, it is useful to have some knowledge of how code in Python and C is run. Python is a dynamically typed -- this means that variables do not have to be fixed at compile time, and a variable that starts as an int can be set to a list or even a custom Python object at any time. On the other hand, C is statically typed -- variable types must be defined at compile time, and they are generally that type and only that type. Also, Python is an interpreted language; this indicates that there is no compile step necessary to run the code. C is a compiled language, and files thus must be compiled before they are runnable. Given Python's nature as a dynamically typed, interpreted language, the interpreter must spend time to figure out what type each variable is at runtime, extract the data from these variables, run the low-level machine instructions, and then place the result into a (possibly new) Python object that is returned. In C, the compiler can figure out at compile time all the details of low-level functions / data to use; a compiled C program spends almost all its runtime calling fast low-level functions, making it much faster than Python. Cython attempts to improve the performance of Python programs by bringing the static typing of C to Python, a dynamic language. With a few exceptions, valid Python code is also valid Cython. 
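To make the dynamic typing point concrete, here is a minimal sketch in plain Python (the helper name `add_and_describe` is made up for illustration). It shows a name being rebound to a different type, and the same `+` doing different work depending on the runtime types of its operands, which is the bookkeeping the interpreter has to do on every operation.

```python
def add_and_describe(x, y):
    """Add two values and report the runtime type of the result."""
    result = x + y  # which "+" runs is decided at runtime from the operand types
    return type(result).__name__, result


value = 3                        # starts life as an int
seen_types = [type(value).__name__]
value = [3, 4]                   # rebinding the same name to a list is legal
seen_types.append(type(value).__name__)

print(seen_types)                      # ['int', 'list']
print(add_and_describe(1, 2))          # ('int', 3)
print(add_and_describe([1], [2]))      # ('list', [1, 2])
print(add_and_describe("py", "thon"))  # ('str', 'python')
```

A C compiler, by contrast, would know from the declarations which addition to emit, so none of this lookup happens while the program runs.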
To demonstrate what sort of speed gains are possible with Cython, we turn to the classic example of calculating Fibonacci numbers.

### Python vs Cython

Below is a simple iterative function to calculate the nth Fibonacci number in Python:

    def fibonacci_py(n):
        a, b = 0, 1
        for _ in range(1, n):
            a, b = b, a + b
        return b

Let's see how long the Python function takes to calculate several values of fibonacci:

    %timeit fibonacci_py(0)
    1000000 loops, best of 3: 436 ns per loop

    %timeit fibonacci_py(70)
    100000 loops, best of 3: 4.89 µs per loop

Now, let's turn the above function into a Cython function without changing anything (remember that most valid Python code is valid Cython) and evaluate performance again.

    %load_ext Cython

    %%cython
    def fibonacci_cy_naive(n):
        a, b = 0, 1
        for _ in range(1, n):
            a, b = b, a + b
        return b

    %timeit fibonacci_cy_naive(0)
    1000000 loops, best of 3: 227 ns per loop

    %timeit fibonacci_cy_naive(70)
    100000 loops, best of 3: 2.1 µs per loop

Now let's add static typing to the naive Cython code.

    %%cython
    def fibonacci_cy_static(n):
        cdef int _
        cdef int a=0, b=1
        for _ in range(1, n):
            a, b = b, a + b
        return b

    %timeit fibonacci_cy_static(0)
    10000000 loops, best of 3: 59.3 ns per loop

    %timeit fibonacci_cy_static(70)
    10000000 loops, best of 3: 126 ns per loop

As you can see, it took Python 436 ns per loop to calculate fibonacci(0) and 4.89 µs per loop to calculate fibonacci(70). Simply using Cython without any changes to the Python code roughly doubled the performance, with 227 ns per loop to calculate fibonacci(0) and 2.1 µs per loop to calculate fibonacci(70). However, the most dramatic performance increase came from using statically typed C variables (defined with cdef). Using statically typed variables resulted in 59.3 ns per loop when calculating fibonacci(0) and 126 ns per loop when calculating fibonacci(70)!
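The speedups implied by these timings are easy to check with a few lines of plain Python. The sketch below only re-uses the numbers quoted above (converted to nanoseconds); the dictionary layout and the `speedup` helper are made up for illustration. It also checks one thing that is easy to miss: assuming a C int is 32 bits wide on the platform in question, the value of fibonacci(70) is far too large for the `cdef int` variables, so the typed version may be fast but return a wrapped-around result for that input.

```python
# Timings quoted in the post, all converted to nanoseconds per loop.
timings_ns = {
    "python":        {"fib(0)": 436.0, "fib(70)": 4890.0},
    "cython_naive":  {"fib(0)": 227.0, "fib(70)": 2100.0},
    "cython_static": {"fib(0)": 59.3,  "fib(70)": 126.0},
}


def speedup(slow, fast, case):
    """How many times faster `fast` is than `slow` for one test case."""
    return timings_ns[slow][case] / timings_ns[fast][case]


def fibonacci_py(n):
    """Same iterative function as in the post."""
    a, b = 0, 1
    for _ in range(1, n):
        a, b = b, a + b
    return b


for case in ("fib(0)", "fib(70)"):
    print(case,
          "static vs naive: %.1fx" % speedup("cython_naive", "cython_static", case),
          "static vs python: %.1fx" % speedup("python", "cython_static", case))

# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1
print("fibonacci(70) fits in 32 bits:", fibonacci_py(70) <= INT32_MAX)
```

If exact results matter for large n, a wider C type or a plain Python object would be the safer choice, at some cost in speed.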
For fibonacci(0), the statically typed version is roughly 3.8x faster than the naive Cython function (227 ns vs 59.3 ns) and about 7x faster than the Python function. The speedup is even more pronounced when calculating fibonacci(70); using statically typed variables gave a speedup of almost 17x over the naive Cython version and approximately 39x over the normal Python version! Cython gives massive performance gains on this simple fibonacci example, but it's worth noting that this example is completely CPU bound. The performance difference between Python and Cython on a memory bound program would likely still be noticeable, but definitely not as dramatic as in this toy example.

## Conclusion

While learning Cython, I wrote a short iPython notebook tutorial on Cython pointers and how they work, geared toward developers relatively fluent in Python but unfamiliar with C -- it's mainly intended to be practice / quick reference material, but you might find it handy if you want to learn more. Additionally, the majority of this post's contents are in an iPython notebook here. For next week, I'll be providing a brief introduction to regression trees and some basic splitting criteria such as mean squared error (MSE) and mean absolute error (MAE). If you have any questions, comments, or suggestions, you're welcome to leave a comment below :) Thanks to my mentors Raghav RV and Jacob Schreiber for their constant support, and to the larger scikit-learn community for being a great place to contribute. You're awesome for reading this! Feel free to follow me on GitHub or check out Trello if you want to track the progress of my Summer of Code project.

### Prayash Mohapatra (Tryton)

#### Accepted into GSoC!

I have been accepted into Google Summer of Code 2016. I will be working on Tryton (under the Python Software Foundation). I will be developing the CSV Import/Export feature for their Web Client codenamed SAO.
I am very enthusiastic about this as I could finally write a proposal for GSoC after failing to do so for two years in a row. Would be working mostly in JavaScript and the many tools that come with it. Would be working from home this summer. Really wanted to visit another city. Just waiting for the semester examinations to end. ~Try Miracle ## April 27, 2016 ### tsirif (Theano) #### Google Summer of Code adventure begins Finally, about a month after proposal submissions, Google Summer of Code announced which projects will participate in this year’s coding summer adventure. My proposal to Python Software Foundation was accepted together with other 1205 proposals to 178 - in total - open-source organizations. Check the official blog announcement for more statistical references. So, this way begins my involvement with Theano, an open-source project initiated by people in the MILA lab at the University of Montreal. Theano is a mathematical Python library which allows to define, optimize, and evaluate symbolic expressions, in a way that resulting computations are using the most out of the available computational resources. It uses underlying Python, C and CUDA implementations of generic mathematical operations and combines them according to a user-defined symbolic operation graph in order to achieve an optimized computation on the available software and hardware per platform. I am going to contribute in extending GPU support with more implementations of operations and with more functionality for multi-gpu and multi-node/gpu infrastructures. See here the abstract of my proposal! If you are interested on this project’s progress follow my fork on github. More information on the project details at the next post! ### sahmed95 (dipy) Hi, I will be writing about my experiences over the summer working with Dipy in this blog. 
I am a Physics and Electronics double major student from BITS Pilani, Goa Campus, India with an active interest in Mathematical modeling, statistics and coding using Python. I will be contributing to Ddipy by integrating models for diffusion imaging such as IVIM and Rohde over the summer. ## April 26, 2016 ### Aron Barreira Bordin (ScrapingHub) #### GSoC - Support for Spiders in Other Programming Languages Hello Everyone ! My name is Aron Bordin and I’m Brazilian Computer Science Student and AI Researcher. I’m studying Computer Science at São Paulo State University, and always coding something fun on my free-time :) I’m very happy to announce that my Google Summer of Code proposal has been accepted :tada: ## About this Project Scrapy is one of the most popular web crawling and web scraping framework. It’s written in Python and known by its good performance, simplicity, and powerful API. However, it’s only possible to write scrapy’s Spiders using the Python Language. The goal of this project is to provide an interface that allows developers to write spiders using any programming language, using json objects to make requests, parse web contents, get data, and more. Also, a helper library will be available for Java, JS, and R. Read more here ## This Blog I’ll use this blog to post updates about the project progress. GSoC - Support for Spiders in Other Programming Languages was originally published by Aron Bordin at GSoC 2016 on April 26, 2016. ### udiboy1209 (kivy) #### I Got Selected For GSoC 2016! I’m really excited to say that my project was selected for Google Summer of Code 2016! This is a really great opportunity for me to get to code throughout the summer, and coding is something I dearly love :D ! ## GSoC? What’s that? So this is how GSoC works. There are open source organizations who have a list of projects they want to see implemented, and they are willing to mentor enthusiastic people for it. 
GSoC provides a way for students all over the world to take up these projects, along with a stipend. As a student you are supposed to submit a proposal for whichever project you want to do to the respective org. Then you await the org mentors to review your proposal, compare it with countless other submissions and finally deem you worthy/unworthy of their mentorship :) . ## Yay! I submitted my proposal to Kivy, a python framework to create UI apps for various platforms like Android, iOS, Windows, Raspberry Pi ( :O ! I was surprised too!) and obviously linux. You couldn’t have imagined my excitement when I saw my name (actually my nick) and my project show up on Python Software Foundation’s projects list. My project is to implement a module for Tiled maps in Kivy’s game engine KivEnt. Jacob Kovak, Mathieu Vibrel and Akshay Arora will be mentoring me. I’ve always wanted to work on game engines which makes this project all the more fascinating for me! ## How these past months have been A bunch of seniors in my college have done GSoC in the past years, and they all had the same advice to give: Start contributing to the org you like, it gives you a much better chance of getting selected. So I started contributing to Kivy sometime in the winter last year. It was pretty tough at first, dealing with such a huge codebase. But the people who maintain kivy are really helpful with the tiniest of things. And they are extremely appreciative of the contributions you make too :D! I remember one of them commenting “Beautiful!” on one of my PRs before merging it, which left me wondering what was so great in this teensy contribution. But it did have a positive impact on me. For a beginner, positive feedback never hurts ;)! I have come quite ahead from that beginning stage. I even earned a bounty on one of the bugs I fixed for kivy :D! The experience has been awesome. 
I’m getting to know this wonderful community of people who work towards kivy’s development and I feel glad I am starting to be a part of that community! And I believe there are many more great times ahead!

## What’s more?

Well, GSoC requires me to blog about the developments of my project. It will help my mentors review my progress. So I will be using this blog to post updates and developments! Stay tuned :D!

## April 25, 2016

### chrisittner (pgmpy)

#### GSoC proposal accepted!

My proposal for Google Summer of Code 2016 has been accepted :). This means that I will spend part of my summer working on the pgmpy library. I will implement some techniques for Bayesian Network structure learning. You can have a look at my proposal here. As a first step, I set up this blog to document my progress. It is built with the Pelican static-site generator and hosted on GitHub pages. Updates will follow!

### SanketDG (coala)

#### GSoC 2016, here I come!

So, I am participating in GSoC 2016 with coala! My project is based on language independent documentation extraction, which will involve parsing embedded documentation within code and separating it into description, parameters and return values (even doctests!). After this is done for several languages (Python, C, C++, Java, PHP), I will implement the following as the functionality of a bear:

• Documentation style checking as specified by the user.
• Providing aesthetic and grammar fixes.
• Re-formatting the documentation (indentation and spacing)

## April 24, 2016

### Pulkit Goyal (Mercurial)

#### Getting into GSoC

Hello everyone, I recently made it into GSoC 2016, so I will be writing about how to get into GSoC. I have proposed a project under the Python Software Foundation for an organisation named Mercurial. **Mercurial** is a cross-platform, distributed revision control tool for software developers. I will describe my organisation in more detail in upcoming posts. For now I will be talking about GSoC.
### Nelson Liu (scikit-learn) #### An Intro to Google Summer of Code I'm participating in the Google Summer of Code, a program in which students work with an open source organization on a 3 month programming project over summer; I'll be working with the scikit-learn project to add several features to the tree module. You can read my proposal here. The program also requires that I publish a blog about my project and work; as a result, this series of posts will recount progress, what I've been up to, and what I've learned. I'll be prepending all of the posts in this series with GSoC Week #, and they will be tagged under gsoc. ## April 23, 2016 ### Adhityaa Chandrasekar (coala) #### GSoC '16! Great news! I've been selected for this year's GSoC (Google Summer of Code) under coala, a powerful static-code analysis tool that is completely modularized. You should definitely use it in your projects if you want a tool that will completely automate huge segments of code review, thereby rapidly fast-forwarding the production cycle. I've been contributing for a couple of months and the the experience has been nothing short of being phenomenal! I was recently given contributor status too :) Over the course of this summer, I'll be working on a project called Settings Guessing. Currently coala needs the user to specify the choice for each setting - whether to use spaces or tabs, whether to use snake_casing or camelCasing, whether to use K&R style or Allman style. But with this project, this would guessed automatically! Totally awesome, right? Stay tuned for more, I'll try to post updates weekly. ### Adrianzatreanu (coala) #### Accepted! The day I have waited for so long has finally come. Yesterday noon the results were shown and my project with PSF (Python Software Foundation) & coala was accepted. This is probably my biggest life achievement so far, and as I am inside so happy, I feel such a huge responsability on my personal progress and work. 
I have committed my summer to working hard on finishing this project. The project I got accepted on is called Decentralizing Bears. This implies making bears, which are basically plugins for coala, independent packages. This allows for easier management and improvements upon them.

Right now the period is called Community Bonding, and lasts until the 23rd of May, when the actual work starts. This is when students and mentors get to know each other and talk about implementing the project.

### It is going to be hard.

I am a student in roughly my first year of Computer Science, having started to learn programming in October last year. This project will be a real challenge for me, one which I am planning to finish, one which I am planning to work hard on.

### What’s next?

Over the next month, it will be a challenge for me to manage all the homework from my college and also get prepared for this project. What is worse, my exam session takes up the first 3 weeks after the GSoC work period starts. Yes, it’s going to be hard. But nothing comes easy, does it?

### Thanks!

This blog post intends to thank everyone from coala who has helped me achieve this. With the help of that amazing community, I was able to achieve something I never thought I would.

## April 22, 2016

### Yashu Seth (pgmpy)

#### Results Announced

Hurray!! My proposal got accepted. Feeling really excited and looking forward to the project. I will keep updating my progress here.

Here is my proposal abstract -

Currently, pgmpy deals with only discrete random variables. In many situations, some variables are best modeled as taking values in some continuous space. Examples include variables such as position, velocity etc. The first part of the project creates a module to represent nodes having a continuous domain representation. These nodes would be used in hybrid networks comprising both continuous as well as discrete random variables.
The two important features in this part would be -

• Representation of User Defined Continuous Random Variables
• Methods to convert continuous distributions into discrete factors.

The second part of the project will deal with Gaussian distributions. Gaussians are a particularly simple subclass of distributions that make very strong assumptions, such as the exponential decay of the distribution away from its mean, and the linearity of interactions between variables. Gaussians are a surprisingly good approximation for many real world distributions. There will be support for variables comprising the most popular forms of representation in Gaussian distributions -

• Linear Gaussian Distribution
• Joint Gaussian Distribution
• Canonical Forms

You can have a look at my entire proposal here - pgmpy: Support for Continuous Random Variables

## March 25, 2016

### tsirif (Theano)

#### Welcome to COSA!

Friday, 25 March 2016, 05:05 AM, Thessaloniki, Greece

I am beginning a blog in which I am going to narrate my coding adventures. The purpose of this is initially to describe publicly my “yet probable” activities in Google Summer of Code. For more information, check about!

### Yashu Seth (pgmpy)

#### Applied for GSoC 2016

Applied for GSoC 2016 in pgmpy under the Python Software Foundation. Results will be announced on April 22, 2016.

## March 23, 2016

### Pranjal Agrawal (MyHDL)

#### First Post

This is the first post of the Leros Development blog for Google Summer of Code 2016. Application ready, fingers crossed. Hopefully lots more to come soon!

## March 20, 2016

### mike1808 (ScrapingHub)

#### Welcome to My Blog!

Welcome to my blog!

## March 19, 2016

### aleks_ (Statsmodels)

#### Hello world!

Hello everyone, this is the blog I set up for this year’s Google Summer of Code (GSoC). During this summer posts will show up here describing my coding experiences, but only in case of a successful application, so let’s hope for the best!
: ) Aleks

## March 18, 2016

### Adhityaa Chandrasekar (coala)

#### Performance benchmark: C and Python

Hey everybody! Today I'll be doing a simple performance benchmark between Python and C. I knew before starting that Python would be slower than C. And it has every reason to be so: it's an interpreted language after all. But when I actually saw the results, I was blown away. I found C to be over 22 times faster!

A good way to test the speed of two languages is to make them compute all the prime numbers below some bound N. And for this, I used the Sieve of Eratosthenes. The reason? It's a simple, yet powerful algorithm that is very popular and is used frequently. It is, in a nutshell, a powerful benchmarking technique.

Let's dive into the code. [Github repository]

Here is the python code:

main.py

    import sys

    MAX_N = 10000000

    # prime[n] is set to True once n is known to be composite
    prime = [False] * MAX_N
    i = 2
    while i < MAX_N:
        sys.stdout.write(str(i) + " ")
        j = i * 2
        while j < MAX_N:
            prime[j] = True
            j += i
        i += 1
        # skip ahead to the next number that has not been crossed out
        while i < MAX_N and prime[i]:
            i += 1

And here is the C code:

main.c

    #include <stdio.h>

    #define MAX_N 10000000

    /* prime[n] is set to 1 once n is known to be composite */
    int prime[MAX_N];

    int main() {
        int i = 2, j;
        while (i < MAX_N) {
            printf("%d ", i);
            j = i * 2;
            while (j < MAX_N) {
                prime[j] = 1;
                j += i;
            }
            i++;
            /* skip ahead to the next number that has not been crossed out */
            while (i < MAX_N && prime[i])
                i++;
        }
        return 0;
    }

As you may see, the two are almost identical in the steps used. But it's worthwhile to discuss the differences too:

• In Python, due to the lack of something analogous to C's #define, we have to resort to using a normal variable MAX_N. This might lead to slightly slower performance compared to the preprocessor directive.
• In Python, we use i += 1 instead of i++ like we do in C. I'm not too sure about the performance impact of using either, but intuitively, I feel i++ is faster since processors may have dedicated instructions for it. Again, I'm unsure about this, but felt it was necessary to point out this difference.
• In Python, you may see prime = [False] * MAX_N compared to the C equivalent of int prime[MAX_N]. I concede that this makes it slightly slower, but on further testing, I found the impact is really negligible.

And with that out of the way, let's look at the performance!

    $ gcc main.c
    $ time ./a.out > output_c
    ./a.out > output_c  0.43s user 0.02s system 99% cpu 0.450 total
    $ time python main.py > output_python
    python main.py > output_python  9.54s user 0.08s system 100% cpu 9.611 total
    $ diff output_c output_python
    $


There you go! The Python code takes over 9 seconds to complete the task while C takes just 0.43 seconds! That's blazing fast when you consider that it just found all the primes under 10 million.

So there it is: while I absolutely love Python, it's simply not designed for high performance tasks. (I'm not saying I'm the first one to discover this, but I had to find it out for myself.)
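
For readers who want to try the comparison on a smaller scale, here is a minimal sketch that wraps the same sieve in a function and times it with `time.perf_counter`. It is written for Python 3; the function name `primes_below` and the bound of 100000 are illustrative choices that do not appear in the original benchmark, and timings will differ from machine to machine.

```python
import time


def primes_below(limit):
    """Return all primes below `limit` using the Sieve of Eratosthenes."""
    composite = [False] * limit          # composite[n] is True once n is crossed out
    primes = []
    for i in range(2, limit):
        if composite[i]:
            continue
        primes.append(i)
        # Cross out every multiple of i, starting at 2 * i as in main.py.
        for j in range(i * 2, limit, i):
            composite[j] = True
    return primes


if __name__ == "__main__":
    start = time.perf_counter()
    found = primes_below(100000)
    elapsed = time.perf_counter() - start
    print(len(found), "primes found in", round(elapsed, 3), "seconds")
```

Collecting the primes in a list instead of writing them to stdout keeps I/O out of the measurement, which is a design choice of this sketch rather than of the original programs.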

Until next time,

### Shubham_Singh (italian mars society)

#### EUROPA INSTALLATION

europa-pso is a Platform for AI Planning, Scheduling, Constraint Programming and Optimization

EUROPA INSTALLATION :

Not much documentation is available about installing europa-pso.
Using what I could find, I successfully installed EUROPA on my system.
Operating system: Ubuntu 14.04
Following these steps should make the installation easier:

• JDK -- sudo apt-get install openjdk-7-jdk
• ANT -- sudo apt-get install ant
• Python -- sudo apt-get install python (if you are using Ubuntu 14.04 or above, Python is preinstalled, so you can skip this step)
• Subversion -- sudo apt-get install subversion
• SWIG -- sudo apt-get install swig
• To install libantlr3c, follow these steps in this sequence:
 
1. type this in terminal

•  This will create the plasma.ThirdParty directory

2. cd plasma.ThirdParty

3. Here you will find the libantlr3c-3.1.3.tar.bz2 archive. Extract it in the current directory, i.e. inside plasma.ThirdParty.

4. After extracting: cd libantlr3c-3.1.3
5. Type: ./configure ; make
(if you are using a 64-bit system, type this instead:
./configure --enable-64bit ; make)

The output will not be the same as the image, but the last two lines should be the same.
6. Type: sudo make install
• The output should be like this
7. Now unzip the europa zip file outside the plasma.ThirdParty directory.

• After downloading the appropriate EUROPA distribution for your system (available here), just unzip it and set the EUROPA_HOME environment variable. For example, assuming that you have the EUROPA distribution in your ~/tmp directory and want to install EUROPA in your ~/europa directory, using bash you would do (modify appropriately for your os+shell):

    mkdir ~/europa
    cd ~/europa
    unzip ~/tmp/europa-2.1.2-linux.zip
    export EUROPA_HOME=~/europa
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$EUROPA_HOME/lib   # DYLD_LIBRARY_PATH on a Mac
    export DYLD_BIND_AT_LAUNCH=YES    # Only needed on Mac OS X

8.

    $EUROPA_HOME/bin/makeproject Light ~
    cp $EUROPA_HOME/examples/Light/*.nddl ~/Light
    cp $EUROPA_HOME/examples/Light/*.bsh ~/Light

9. cd ~/Light
10. ant

The EUROPA GUI should appear.

The installation may sometimes produce some Java errors, but if you follow the above steps in order you should get through the installation easily.
If you still get an error, have a doubt about the process, or find something wrong in the above steps, please comment below.

## March 16, 2016

### Kuldeep Singh (kivy)

#### Kivy: Let’s get started

Kivy is an Open Source, cross-platform Python framework for the development of applications that make use of innovative, multi-touch user interfaces.

## What’s so special about kivy?

With the same codebase, you can target Windows, OSX, Linux, Android and iOS (almost all the platforms). Also, Kivy is written in Python and Cython, is based on OpenGL ES 2, supports various input devices and has an extensive widget library. All Kivy widgets are built with multitouch support, making it awesome for game development as well.

## Who’s developing it?

There’s a sweet, growing community around Kivy. It’s still small but really efficient; the core developers are these guys and there are many more.

There are about 10 sister projects going on under Kivy Organisation, take a look and get your hands dirty by diving into their codebase.

## Getting Started

Let’s try to create a hello world example. (Example taken from the kivy website)

    from kivy.app import App
    from kivy.uix.button import Button


    class TestApp(App):
        def build(self):
            # The widget returned here becomes the root of the app's window.
            return Button(text='Hello World')


    TestApp().run()

Save this program in your .py file and run it, you should see something like this.

This is your first application in Kivy, which will work on all the platforms.

Now try something fun, try to package your application for either of the platforms.

You can find everything in Kivy Documentation. (pdf)

## Found some bug?

Report it via their issue tracker.

## March 05, 2016

### ghoshbishakh (dipy)

#### Making rtl8723be Wireless Adapter Work In Linux

Till last year whenever I encountered a laptop with WiFi not working in linux it was a Broadcom Wireless Adapter.

But this year things are different. Nearly all new HP laptops are having problems with WiFi in linux (ubuntu, arch, manjaro). And surprisingly the problem is not that the WiFi driver is not working at all. But it is worse, the signal strength received is so weak that it is absolutely unusable.

A quick lspci | grep Wireless shows the Wireless Adapter in your system. In my case the device causing problem was a Realtek:

RTL8723be

After scanning through numerous threads I finally found the solution in this github issue:

https://github.com/lwfinger/rtlwifi_new/issues/88

## So here is the step by step procedure to solve the issue:

First make sure the dependencies for building the driver are installed:

In Ubuntu:

In Arch:

Now clone the rtlwifi_new repository:

Checkout the branch rock.new_btcoex

Now build and install the driver

Reboot the system.

Now disable and enable the driver with proper parameters.

NOTE: If this does not work try: sudo modprobe -v rtl8723be ant_sel=1

#### Building VTK with python bindings in linux (arch)

I came across VTK while building the docs for DIPY and what I needed was the python bindings.

I use arch linux so installing from pacman is simple:

But this fails to install the python bindings properly and when you try:

it throws the error:

That leaves no other way except to build VTK from source including the python wrapper, for the python version you want to use vtk in.

## So here is the step by step procedure:

From vtk website download the latest source tarball. For me it is VTK-7.0.0.tar.gz then extract it:

Now configure cmake properly for building python wrappers:

This will give you an interactive configuration interface. Now use your arrow keys to select the option you want to change and press Enter to change its value.

Toggle VTK_WRAP_PYTHON on.

Toggle VTK_WRAP_TCL on.

Change CMAKE_INSTALL_PREFIX to /usr

Change VTK_PYTHON_VERSION to 2.7 (or the version of python you want to use vtk in)

Now press [c] to configure

Then press [g] to generate and exit

Note: Sometimes you need to press c and g again.

Now run:

This will create a directory: Wrapping/Python

Now install the python bindings:

Hopefully that should install vtk properly.

To check, in python run:

This should give something like:

'/usr/lib/python2.7/site-packages/vtk/__init__.pyc'

## January 28, 2016

### Pulkit Goyal (Mercurial)

#### Introduction

Hello readers, as this is an introductory post, I must introduce myself a bit. I am Pulkit Goyal, a sophomore at an engineering college in India, pursuing a bachelor's degree in Computer Science. I am a good competitive programmer and a data science enthusiast. I love teaching, solving problems and optimizing algorithms, and I have a keen interest in the field of data science, especially machine learning. I believe in a quote by Margaret Fuller that

## October 17, 2015

### kaichogami (mne-python)

#### Huffman loss-less compression

Hello everyone!
This time I wrote scripts to compress text files based on the Huffman compression algorithm. And yes, it really compresses text files, and we all know how many applications compression has in the real world. As always, all the code that I write here will be in Python, written in a very simple way. I hope it's clear enough for you all to understand!

First, let's understand the intuition behind the algorithm. Every data type that we use in a language, for example int, has a certain size associated with it. For example, an int generally has a size of 4, a double a size of 8, a character a size of 1 and so on. These values indicate the number of bytes the type takes up in memory. Each byte is 8 bits, so an int takes 32 bits of memory. As a reminder, a bit is just an on or off signal for the computer. Everything that we use in computers is just a combination of 1s (on) and 0s (off).

The text we deal with here is stored in the ASCII encoding format. For more details you can refer to the Wikipedia page. In the Huffman algorithm we exploit two facts:

• ASCII format contains 128 different characters.
• Frequency of characters in a given text varies. Sometimes a lot.

Let's say the text is “zippy hippy deny meni”. Clearly this text contains a lot of ‘p’. The ASCII code for ‘p’ is 112, whose 8-bit binary representation is ‘01110000’. So each occurrence of ‘p’ in the text uses one byte of memory. This uses a lot of space unnecessarily. Instead, we count the frequency of every character in the text and assign our own binary representation to each character: the more often a character occurs in the text, the shorter its binary representation. Pretty cool, isn’t it?

How is the idea implemented? That is the main question. In my implementation of the above idea, I have split it into two modules: one for encoding the data, the other for decoding. It uses the binary tree data structure, so you might want to brush up on it in case you have forgotten.

Let's begin with encoding.

Algorithm

1. For every character, we create a single-node binary tree. The key of the node is the character and its value is the frequency of that character in the text.
2. Store these trees in a list, sorted in descending order of value (frequency).
3. Repeatedly pop the last two trees (the two with the lowest frequencies) and combine them into one, whose root has the sum of their frequencies as its value and the two original trees as its left and right children.
4. Insert the new tree back into the list so that the list stays sorted.
5. Repeat steps 3 and 4 till only one tree is left.

This will create a so-called encoding tree. If we think of left as 0 and right as 1 and begin traversing from the root, the path to each leaf node gives a new binary value for a character. Note that every leaf is a tree with the character as its key. Characters with a larger frequency have a shorter path. This forms the basis of encoding our data into its new representation.
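
The steps above can be sketched as follows. This is a minimal Python 3 illustration, not the code from the repository: it swaps the sorted list for the standard library's `heapq` priority queue, which hands back the two least frequent trees without re-sorting, and the names `Node`, `build_tree` and `codes` are hypothetical.

```python
import heapq
from collections import Counter


class Node:
    """Tree node: leaves carry a character in `key`, internal nodes carry None."""

    def __init__(self, key, value, left=None, right=None):
        self.key = key
        self.value = value      # frequency of the character or of the whole subtree
        self.left = left
        self.right = right


def build_tree(text):
    """Combine the two least frequent trees until one encoding tree remains."""
    counts = Counter(text)
    # The running counter breaks frequency ties so Node objects are never compared.
    heap = [(freq, n, Node(ch, freq)) for n, (ch, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    order = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, order, Node(None, f1 + f2, a, b)))
        order += 1
    return heap[0][2]


def codes(tree, path=""):
    """Map each character to its path from the root (left = 0, right = 1)."""
    if tree.key is not None:
        return {tree.key: path or "0"}   # a one-character text still needs one bit
    table = codes(tree.left, path + "0")
    table.update(codes(tree.right, path + "1"))
    return table


table = codes(build_tree("zippy hippy deny meni"))
```

In the resulting `table`, the frequent ‘p’ never gets a longer code than the rare ‘z’, and no code is a prefix of another, which is what makes decoding unambiguous.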

The next part of the challenge is to find the path to each leaf node and save it in a dictionary (hash map). If you think about it a bit, you will notice that the binary tree is a full binary tree, meaning every node is either a leaf or has exactly two children. Also, as noted earlier, every leaf node is a character. To reach a particular leaf node from the root, we follow a path, which is always unique. We represent paths with 0s and 1s as mentioned above. The reason for using a tree is that, since every character's encoding can be of arbitrary length, we cannot split the encoded string into pieces of a fixed length and parse them. Using a binary tree, we know a code has ended when we reach a leaf node.
To find the paths of the leaf nodes, we write a recursive function.

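    def _find_path(self, tree, path=''):
        # A leaf stores its character as the key: return it with its path.
        if isinstance(tree.key, str):
            return [tree.key, path]

        left = self._find_path(tree.left, path + '0')
        right = self._find_path(tree.right, path + '1')

        # Flat list of the form [char, path, char, path, ...]
        ans = []
        ans.extend(left)
        ans.extend(right)

        return ans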


Take a minute to look and understand it. These are best explained when you think about it!😀

Now after finding the paths of each character and storing it in a list, we can easily convert it into a dictionary.

    def _create_dict(self, ans):
        temp_dict = {}
        # ans alternates character, path, character, path, ...
        for x in xrange(0, len(ans), 2):
            temp_dict[ans[x]] = ans[x+1]

        return temp_dict



Following the above recursive function, we get a list with a character at each even index and its path (the new binary code) at the following odd index. We simply save these pairs in a dictionary.

The next part is to convert the original string that we wish to compress into a string of 0s and 1s using the dictionary that we just created. These 0s and 1s then become the bits of bytes, eight at a time, and the bytes are written to a file.
To write bits into bytes, we use the bitarray library for Python. Since the implementation will be different for different users, I will not go into details and leave it as a task for you. If anyone gets stuck anywhere, please feel free to contact me!
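
As one possible starting point for that task, here is a minimal Python 3 sketch that packs the bit string into bytes with the standard library only, as a stand-in for bitarray. The helper names `pack_bits` and `encode` are hypothetical. Bits are stored least significant bit first, an assumption chosen to match the `(x >> i) & 1` reader shown in the decoding section.

```python
def pack_bits(bit_string):
    """Pack a string of '0'/'1' characters into bytes, least significant bit first."""
    padding = (-len(bit_string)) % 8
    bit_string += "0" * padding          # pad the last byte with zero bits
    packed = bytearray()
    for start in range(0, len(bit_string), 8):
        byte = 0
        for offset, bit in enumerate(bit_string[start:start + 8]):
            if bit == "1":
                byte |= 1 << offset      # set bit number `offset` of this byte
        packed.append(byte)
    return bytes(packed), padding


def encode(text, table):
    """Replace each character by its code and pack the result into bytes."""
    return pack_bits("".join(table[ch] for ch in text))
```

The second return value is the number of padding bits, which a decoder would need in order to ignore the filler at the end of the last byte.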

Decoding

Decoding involves retrieving the original text from the compressed file using a meta file (describing the tree we created above) as a source. We could store the created tree object directly and use it in our decode program, or we could save the dictionary values in a file and re-create the tree. The former method will occupy a lot of space, since the tree is a custom data type. The latter involves re-creating the tree, but since creating the tree is relatively fast for a small input (note that the number of distinct characters to be encoded is small, since most of them are letters and digits), we will use the second method.

The challenge is to read individual bits from a byte. A file's read() method returns whole bytes, never single bits. Here we take the help of bit operators, particularly the arithmetic right shift operator (>>) and bitwise AND (&).

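    def _get_bits(self, f):
        # Python 2: f.read() returns a str, so ord() gives each byte's value.
        byte = (ord(x) for x in f.read())
        for x in byte:
            # Yield the 8 bits of each byte, least significant bit first.
            for i in xrange(8):
                yield (x >> i) & 1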



The arithmetic right shift operator shifts a number right by i positions, padding on the left with the most significant (sign) bit, which is 0 for the non-negative byte values we deal with here. Say the binary number is 101 (5). 5 >> 1 gives 10 (2). Performing the & operation with 1, 10 & 01 gives 0. Therefore we have successfully extracted the second bit, counting from the right, of the binary string. We do this for 8 iterations to extract the 8 bits of every byte, starting from the least significant bit. Here yield is a Python keyword which is similar to return, but differs in the sense that after handing back a value the function resumes from that point when the next value is requested, until the loop ends.
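
The same idea can be tried out in isolation. The sketch below is a Python 3 variant (iterating over a bytes object already yields integers, so ord() is not needed); the free function `get_bits` is a hypothetical stand-in for the method above.

```python
def get_bits(data):
    """Yield the bits of each byte in `data`, least significant bit first."""
    for byte in data:            # iterating over bytes yields integers in Python 3
        for i in range(8):
            yield (byte >> i) & 1


bits_of_five = list(get_bits(bytes([5])))   # 5 is 00000101 in binary
```

Here `bits_of_five` is `[1, 0, 1, 0, 0, 0, 0, 0]`: the rightmost bit of 5 comes out first, followed by the 0 extracted by `(5 >> 1) & 1`.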

After extracting the bits, we can construct the tree from the meta file, and traverse the tree till we reach a leaf node and write the value in a string to give the final output. Here is the code snippet to traverse the tree.

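    original = ''
    temp = self.meta    # root of the encoding tree

    for bit in self.binary_bits:
        # Follow the right child for '1' and the left child for '0'.
        if bit == '1':
            temp = temp.right
        else:
            temp = temp.left

        # A node with a key is a leaf: emit its character, restart at the root.
        if temp.key is not None:
            original += temp.key
            temp = self.meta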
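
The traversal above can be exercised on its own with a small hand-built tree. This is a minimal Python 3 sketch under stated assumptions: the `Node` class and the `decode` function are hypothetical, internal nodes are assumed to have `key` set to `None`, and the bits are assumed to arrive as a string of '0' and '1' characters.

```python
class Node:
    """Leaves hold a character in `key`; internal nodes hold None."""

    def __init__(self, key=None, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def decode(bits, root):
    """Walk from the root for each bit and emit a character at every leaf."""
    original = []
    node = root
    for bit in bits:
        node = node.right if bit == "1" else node.left
        if node.key is not None:     # reached a leaf: one character is complete
            original.append(node.key)
            node = root
    return "".join(original)


# A hand-built tree for illustration: a = 0, b = 10, c = 11.
root = Node(left=Node("a"), right=Node(left=Node("b"), right=Node("c")))
text = decode("010110", root)
```

Checking for a leaf right after each move means the final character is emitted as soon as its last bit has been consumed.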


That's about it. The rest involves making it more functional by wrapping it in a class. Here I have written down the main algorithm and design of the Huffman implementation. If you want to see the whole working code, you can check my github repo and use it.

I hope you learnt something new, at least 10% of what I said. If you found it interesting then go ahead and try it for yourself. You can share your implementation here. Thank you for reading and as always I am open to questions and feedback!

## September 18, 2015

### Preetwinder (ScrapingHub)

#### First Post

Hello this is my first post on my blog. The blog is hosted on Github pages and was generated using Jekyll. The theme used is So Simple.

First Post was originally published by preetwinder at preetwinder on September 18, 2015.

## September 12, 2015

### kaichogami (mne-python)

Hello everyone!😀
I hope you are all doing well. Whenever I write, I feel like I wrote yesterday! Time passes so quickly!

Just now I finished working on a manga downloader. It lets you download any number of manga chapters and save them in a folder. It's especially useful if you like reading manga in one go, have limited internet connectivity, or happen to be a collector of manga!😛
Its usage is fairly simple. First download this zip file and extract the contents. Then run the “main.py” file. You have to provide arguments on the command line to download. For example, “python main.py fairy_tail 2 3” will download chapters 2 and 3 of the Fairy Tail manga into the download folder. You have to replace spaces with “_” to make it work. Like it?

This also comes with a spell corrector, which is much needed in this case. Most of us are not Japanese, nor do we know Japanese, so it's highly likely that you will make a mistake while typing manga names such as “Karate Shoukoushi Kohinata Kairyuu”. The spell checker included in the software loads a manga list with the names of a huge collection of manga. If the given name is not in the list, it uses the edit distance algorithm to find the string nearest to the user's string. The running time of each comparison is O(m*n), where m and n are the lengths of the two strings, which is slow, but not too slow considering our string lengths will almost never exceed 70.
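
The lookup described above can be sketched as follows. This is a minimal Python 3 illustration of the standard Levenshtein edit distance, not the downloader's actual code; the function names and the three-entry `manga_list` are made up for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, in O(len(a) * len(b)) time."""
    previous = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def closest_name(query, names):
    """Return the known name with the smallest edit distance to the query."""
    return min(names, key=lambda name: edit_distance(query, name))


manga_list = ["fairy_tail", "naruto", "one_piece"]   # made-up sample list
suggestion = closest_name("fairy_tale", manga_list)
```

Keeping only the previous row of the table keeps memory at O(n) while the running time stays O(m*n) per comparison.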

After correcting the spelling, urllib2 is used to download the images. First the HTML response is captured. Then the image link is extracted from each HTML page. I have used a very naive logic to do that: I simply look for the first “.jpg” in the string and find the link associated with it. There may be several “.jpg” images, but it chooses the first one. I have noticed a bug with this logic: it downloads starting from the 2nd page. This is because the first “.jpg” link probably leads to the second page. I have yet to confirm this. Also, I have used User Agents, as the normal programmatic way of opening the website and saving the image was not working. It always saved a corrupted file. I could not find a solution to that. Most likely the error resides somewhere in my use of the “urllib2.retreive(” ..”)” method.
Downloading is done until the end of the chapter is reached. We know we have reached the end of a chapter when a 404 error is returned. Unfortunately, I do not have an elegant way of knowing whether a certain chapter is the last chapter.
Everything is stored in a folder named “download” in the main directory. The program first creates a directory named after the manga and saves each chapter in a separate directory.

Cheers!

## August 19, 2015

### Shridhar Mishra (italian mars society)

#### Finals

The final model of the project is in place and the Europa planner is working the way it's supposed to.
The code in the repository is in working condition and has the default NDDL plan in it, which moves the rover from the base to a rock and collects the sample.
Integration with the Husky rover is underway and the code is being wrapped up for the final submission.

Shridhar

## July 02, 2015

### Shridhar Mishra (italian mars society)

#### Mid - Term Post.

Now that my exams are over, I can work on the project with full efficiency.
The current status of my project looks something like this.

Things done:

• Planner in place.
• Basic documentation update of europa internal working.
• Scrapped the pygame simulation of europa.

Things I am working on right now:
• Integrating Siddhant's battery level indicator from the Husky rover diagnostics with the planner for a more realistic model.
• Fetching things from and posting things to the PyTango server. (Yet to bring it to a satisfactory level of working.)
Things planned for the future:
• Integrate more devices.
• Improve docs.

## June 20, 2015

### Shridhar Mishra (italian mars society)

#### Update! @20/06/2015

Things done:

• Basic code structure of the battery.nddl has been set up.
• PlannerConfig.xml is in place.
• PyEUROPA working on the docker image.

Things to do:
• test the current code with pyEUROPA.
• Document working and other functions of pyEUROPA(priority).
• Remove Arrow server code from the existing model.
• Remove Pygame simulation and place the model for real life testing with Husky rover.
• Plan and integrate more devices for planning.

## February 20, 2015

### Leland Bybee (Statsmodels)

#### Topic Coherence and News Events

One important issue that has to be dealt with when you get output from a topic model is: do the topics make sense to a reader? An intuitive approach here is to look at the top X words sorted by how common it is for a word to appear with a topic. This is the beta parameter in LDA. However, this approach isn’t very rigorous. In order to formalize the approach beyond just eyeballing a word list, a number of coherence measures have been proposed in the literature. I focus on a variant of the UCI measure proposed by Newman et al. [1]

The UCI measure relies on the pointwise mutual information (PMI) to calculate the cohesion of a topic. The PMI for a pair of words indexed i and j is

    PMI(w_i, w_j) = log(p(w_i, w_j)/(p(w_i)p(w_j))) = log((D(w_i, w_j)N)/(D(w_i)D(w_j)))


where D(w_i, w_j) is the number of documents where both words appear, D(w_i) is the number of documents where word i appears, and N is the total number of documents. The way that the UCI measure works is that for each pair of words in the top X terms for a given topic, the PMI score is calculated (in the original article using an external corpus, like Wikipedia) and the median PMI score is used as the measure of coherence for a topic. Newman et al. find that this measure performs roughly as well as manually determining the coherence.
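
To make the formula concrete, here is a minimal Python sketch that computes the PMI from document counts and takes the median over all pairs of top words. The function names and the toy `docs` corpus are hypothetical, and skipping pairs that never co-occur (where the PMI is minus infinity) is an assumption of this sketch, not something taken from the measure's definition.

```python
import math
import statistics
from itertools import combinations


def pmi(word_i, word_j, documents):
    """PMI of two words from document counts: log(D(wi, wj) * N / (D(wi) * D(wj)))."""
    n = len(documents)
    d_i = sum(1 for doc in documents if word_i in doc)
    d_j = sum(1 for doc in documents if word_j in doc)
    d_ij = sum(1 for doc in documents if word_i in doc and word_j in doc)
    if d_ij == 0:
        return float("-inf")   # the words never co-occur; smoothing is one alternative
    return math.log(d_ij * n / (d_i * d_j))


def coherence(top_words, documents):
    """Median PMI over all pairs of top words, skipping pairs that never co-occur."""
    scores = [pmi(a, b, documents) for a, b in combinations(top_words, 2)]
    finite = [s for s in scores if math.isfinite(s)]
    return statistics.median(finite) if finite else float("-inf")


# Each document is represented as the set of words it contains (toy data).
docs = [{"oil", "price"}, {"oil", "price", "opec"}, {"election", "vote"}, {"vote", "oil"}]
score = pmi("oil", "price", docs)
```

With these four toy documents, "oil" appears in three, "price" in two and both together in two, so `score` equals log(2 * 4 / (3 * 2)) = log(4/3).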

I want to use the coherence score to compare a number of methods for estimating topics, as well as to compare a number of different data sets and a number of sorting for the word proportions. However, I also want to have some sort of test to get a sense for what is a coherent topic in general. To do this, I decided to not only compare the coherence scores of the different approaches, but also calculate the probability of observing my coherence scores assuming that the word pairs were drawn at random.

I should note here that I differ from the UCI measure to some extent in that I just use the source corpus instead of the external corpus. I certainly don’t think this would be a problem for the WSJ abstracts corpus or the NYT corpus that I have given their size, though it may cause some problems for the WSJ articles corpus. Down the road I’d like to compare the performance to the Wiki corpus but given that other coherence measures have been developed that work similarly to the UCI measure and use the source corpus [2], I’m not too worried.

To build my null model, I do two different samplings of word pairs from my source corpus. The first is to do uniform sampling of word pairs, the second is to weight the sampling by the Tf-Idf score of each word. For this second method, what I should get is a sample with more word pairs that contain words with high Tf-Idf scores than with the uniform sampling. The reason for doing both forms of sampling is to test whether high Tf-Idf terms are more coherent with the text corpus as a whole. If this were true and the terms that appear in the top word lists for my topics have higher Tf-Idf scores in general as well this could cause trouble for my test. The histogram below shows the distribution of the PMI scores for the uniformly sampled pairs (blue) and the TF-Idf weighted sampled pairs (red)

So it looks like the Tf-Idf sampling doesn’t have much of an effect on the distribution of PMI scores and that most of the PMI scores are grouped around 0.

So moving on to the actual data. I wanted to compare two methods for detecting the topics, as well as my three data sets and three different word sortings. The two methods that I am currently playing around with are exploratory factor analysis (EFA) and latent dirichlet allocation (LDA). EFA isn’t a topic model but the loadings can be thought of as the word proportions for LDA so I’m going to calculate the coherence in the same way. The three sortings that I want to look at are no sorting, sorting by the proportion scaled by the corpus proportion

    p(w_i | \theta_j) / p(w_i)


and a variation on Tf-Idf designed for the top word lists. This Tf-Idf sorting takes the number of top word lists that a word appears in as the document frequency and then uses the raw proportion as the term frequency. Looking at each model, data set, sorting I get the following 2 tables. The first table shows the mean coherence score for each group while the second table shows the proportion of the topics that are significant at the 95% level. This significance is calculated based on the null model with uniform sampling.

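Mean coherence score:

| Model | Word sorting | NYT | WSJ Abs. | WSJ Art. |
|-------|--------------|-----|----------|----------|
| Topic number | | 25 | 40 | 30 |
| EFA | Raw | -0.12 | 0.56 | 0.88 |
| EFA | Post | 0.16 | 0.63 | 0.98 |
| EFA | Tf-Idf | -0.12 | 0.56 | 0.88 |
| LDA | Raw | -0.54 | 0.01 | 0.04 |
| LDA | Post | 0.42 | 1.10 | 0.90 |
| LDA | Tf-Idf | 0.16 | 0.75 | 1.03 |

Proportion of topics significant at the 95% level:

| Model | Word sorting | NYT | WSJ Abs. | WSJ Art. |
|-------|--------------|-----|----------|----------|
| Topic number | | 25 | 40 | 30 |
| EFA | Raw | 0.52 | 0.43 | 0.83 |
| EFA | Post | 0.72 | 0.50 | 0.90 |
| EFA | Tf-Idf | 0.48 | 0.43 | 0.83 |
| LDA | Raw | 0.00 | 0.00 | 0.00 |
| LDA | Post | 1.00 | 0.95 | 0.77 |
| LDA | Tf-Idf | 0.56 | 0.55 | 1.00 |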

The post sorting refers to sorting by

    p(w_i | \theta_j) / p(w_i)


Additionally, it is worth noting that the currently used topics were selected rather arbitrarily for each data set. I’m still cleaning up some of the results so I haven’t pinned the optimal number of topics down yet. These are all ok approximations for now. Having compared these results to the topic coherence results for other numbers of topics I don’t think it is going to have a major effect.

The posterior sorting seems to perform the best. In all cases but one (LDA Tf-IDF WSJ Art.), it performs better than the other two sortings. Tf-Idf and unsorted appear to perform comparably for EFA but there is a difference when you use LDA.

[1] Newman, Bonilla and Buntine. Improving Topic Coherence with Regularized Topic Models. 2011.

[2] Mimno, Wallach, Talley, Leenders, and McCallum. Optimizing Semantic Coherence in Topic Models. 2011.

## February 06, 2015

### Leland Bybee (Statsmodels)

#### Clustering News Events

I’ve been working on a project for some time now, with Bryan Kelly, to detect “news events” in a text corpus of Wall Street Journal abstracts that we scraped back in July of 2014. I’ve written some on this in the past and the project has gone through a number of iterations since then. We are now working with more data than just the WSJ abstracts. We have also been doing work with a set of WSJ articles for a smaller period of time, along with the first paragraph of New York Times articles going back to the 19th Century. Right now, my focus has primarily been on making a convincing argument for the existence of these “news events” and their usefulness for explaining other response variables.

One way that I’ve been approaching this problem is to develop a classification system for the topics that we extract. In general, it appears that news events will pop up for some subset of the observations and then drop off again as time progresses. What I want to do is start getting a sense of the patterns within the subsets of observations where there is signal. What I develop here is a clustering system that I use to begin giving some structure to the topics.

One way to think of the topics is as a distribution over time. When you think about them this way, you can imagine that each observation for a topic represents the density of that topic in that period with some noise. My goal is to isolate the periods where a topic has some discernible signal. What I want to do is first remove the noise from the density and then drop periods where there doesn’t appear to be any signal.

To do this, what I do is first perform local linear regression and use the fitted curves as my new topics. This largely removes the noise and gives me something cleaner to work with. I set the bandwidth using leave-one-out cross validation. With the resulting fitted curves, I then do thresholding using

    sd_i sqrt(2 * log(N) / N)

as my threshold. This comes out to about 0.01 for most topics. Any observations with a fitted topic proportion below the threshold are then dropped. This leaves me with a subset of observations for each topic where the topic appears to be relevant. With these subsets I can then produce four groups of topic variants. The first is the raw observations for the full period, the second is the fitted observations for the full period, the third is the raw observations for each topic’s corresponding subset of relevant observations, and the fourth is the fitted observations for each topic’s corresponding subset of relevant observations.
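
The smoothing and thresholding steps can be sketched roughly as follows. This is a pure-Python illustration, not the code used for the project: the Gaussian kernel, the candidate bandwidths, the function names, and the toy series are all assumptions, and `sd_i` is taken here to be the standard deviation of the topic's fitted series with `N` the number of periods.

```python
import math
import statistics


def local_linear(x, y, x0, h, skip=None):
    """Local linear fit at x0 with a Gaussian kernel of bandwidth h."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        if i == skip:  # leave this observation out (used for cross validation)
            continue
        d = xi - x0
        w = math.exp(-0.5 * (d / h) ** 2)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yi
        t1 += w * d * yi
    det = s0 * s2 - s1 * s1
    if abs(det) < 1e-12:
        return t0 / s0 if s0 > 0 else 0.0  # fall back to a local average
    return (s2 * t0 - s1 * t1) / det  # intercept of the weighted line


def loocv_bandwidth(x, y, candidates):
    """Pick the candidate bandwidth with the smallest leave-one-out error."""
    def cv_error(h):
        return sum(
            (yi - local_linear(x, y, xi, h, skip=i)) ** 2
            for i, (xi, yi) in enumerate(zip(x, y))
        )
    return min(candidates, key=cv_error)


def relevant_periods(fitted):
    """Indices whose fitted value is at or above sd * sqrt(2 * log(N) / N)."""
    n = len(fitted)
    threshold = statistics.pstdev(fitted) * math.sqrt(2 * math.log(n) / n)
    return [t for t, v in enumerate(fitted) if v >= threshold]


# Toy topic proportions over 12 periods (made-up numbers).
periods = list(range(12))
raw = [0.0, 0.01, 0.0, 0.02, 0.10, 0.22, 0.30, 0.21, 0.09, 0.01, 0.0, 0.01]

h = loocv_bandwidth(periods, raw, candidates=[0.5, 1.0, 2.0, 4.0])
fitted = [local_linear(periods, raw, t, h) for t in periods]
keep = relevant_periods(fitted)
print(h, keep)
```

The four groups of variants then correspond to `raw`, `fitted`, `[raw[t] for t in keep]`, and `[fitted[t] for t in keep]`.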

What I want to do then is build clusters for each of the groups to get a sense of the different shapes we see in the topics. The standard approach for calculating the distance between different time series, in this case the topics, is dynamic time warping (DTW). DTW works by calculating the distance between every pair of points in the two time series and using a function of these distances to get the distance between the two series. It is a very flexible approach, and the two time series can be of different lengths. What I do here is estimate a matrix of DTW distances between each pair of topics that I get out of LDA (or any other latent variable approach) and do k-means clustering based on the distance matrix. The following plot shows the explained variation for each group of topic variants over the number of clusters. It seems that for all variants the sweet spot is around 5-6 clusters.
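
For readers unfamiliar with DTW, here is a minimal sketch of the textbook dynamic-programming version and of the pairwise distance matrix. The absolute-difference point cost, the function names, and the toy series are assumptions made for illustration, and the k-means step itself is not shown.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two sequences of possibly different length."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # pointwise distance (assumed)
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] also matched to an earlier b point
                cost[i][j - 1],      # b[j-1] also matched to an earlier a point
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


def dtw_matrix(series):
    """Symmetric matrix of DTW distances between every pair of series."""
    k = len(series)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(series[i], series[j])
    return dist


# Three toy "topics": the second is a delayed copy of the first,
# the third has a different shape.
topics = [
    [0.0, 0.1, 0.3, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.3, 0.1, 0.0],
    [0.3, 0.2, 0.1, 0.0, 0.0],
]
dist = dtw_matrix(topics)
print(dist[0][1], dist[0][2])
```

Because the warping path can repeat points, the delayed copy ends up at distance zero from the original, which is the property that makes DTW attractive for comparing topic shapes that peak at different times.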

What I have below are four plots, one for each of the four groups of topic variants, of the topic proportions for each of the 30 topics. The topics are color coded by their corresponding cluster.

The raw topic proportions over the full period

The fitted topic proportions over the full period

The raw topic proportions over the corresponding subset

The fitted topic proportions over the corresponding subset

I find that the fitted topic proportions over the corresponding subset give the best sense of the actual grouping of the data. One thing that the full period plots do help with, though, is giving a sense of the edge cases. What we see is that two of the clusters represent topics where we don’t get to see the full support because some of it lies outside of our observed time series. The clustering devotes one cluster to topics where the right side of the support is cut off and another to topics where the left side of the support is cut off.

The clusters give some sense of the data, though it isn’t immediately clear what the difference is between the orange, blue and green clusters for the fitted subsets. However, looking at each of the curves for the fitted subsets, side by side, you can begin to get a sense of the shapes that appear in the topic distributions. An approach that might work well would be to classify the topics based on their skewness and some measure of multimodality. I think it would be best to throw out the cases where the full support doesn’t lie within the observed periods since we can’t get a