Python's Summer of Code 2017 Updates

August 16, 2016

GSoC week 12 roundup

@cfelton wrote:

This is the last roundup. As posted in previous GSoC roundups, the GSoC program requires that all students have their final code committed by 20-Aug. If you have not committed your final code, make sure to do so in the next couple of days and prepare the final blog post that will be used in your evaluation submission.

IMPORTANT NOTE TO MENTORS
All mentors need to provide a summary of the student's final evaluation to me (@cfelton) via email by 22-Aug. The assigned mentors were never corrected in the GSoC system, so I will need to complete all the final evaluations again. Because of schedule conflicts and PSF requirements I will be completing all the evaluations by the 23rd, so please provide the final review as soon as possible.

** Student final project submission **
Students, make sure to have your final blog post ready. The final post should review what you completed and what is outstanding. I should be able to easily understand what is working in the projects, what is missing, and what doesn't work. This detailed final post is required for a passing evaluation.

The GSoC work product submission guidelines outline what the final post should have. Take the time required to write the final blog post that you will link in your submission. It should have:

1. Description of the work completed.
2. Any outstanding work if not completed.
3. Link to the main repository.
4. Links to the PRs created during the project.

Review the submission guidelines page in detail.

The idea of GSoC isn't that students churn out code -- it's important that the code be potentially useful to the hosting Open Source project!

Also make sure the README on the project repositories is complete. It should give an overview of the project and instructions for a user to get started: install, run tests, the core interfaces, and basic functional description.

Student week12 summary (last blog, commits, PR):

jpegenc:
health 87%, coverage 97%
@mkatsimpris: 12-Aug, >5, Y
@Vikram9866: 07-Aug, >5, Y

riscv:
health 96%, coverage 51%
@meetsha1995: 11-Aug, >5, Y
@srivatsan: 14-Aug, >5, N

gemac:
health 93%, coverage 92%
@ravijain056, 02-Aug, >5, N

Links to the student blogs and repositories:

Merkourious, @mkatsimpris: gsoc blog, github repo
Vikram, @Vikram9866: gsoc blog, github repo
Meet, @meetshah1995: gsoc blog, github repo
Srivatsan, @srivatsan-ramesh: gsoc blog, github repo
Ravi, @ravijain056: gsoc blog, github repo
Pranjal, @forumulator: gsoc blog, github repo

Posts: 10

Participants: 6

July 31, 2016

GSoC week 10 roundup

@cfelton wrote:

The slump for most students has continued. All students need to
be making daily commits, weekly PRs, and weekly blogs. If you
(student) are not following through with these you are in jeopardy
of failing.

Student week 10 summary (last blog, commits, PR):

jpegenc:
health 88%, coverage 95%
@mkatsimpris: 28-Jul, >5, Y
@Vikram9866: 25-Jun, >5, Y

riscv:
health 96%, coverage 91%
@meetsha1995: 14-Jul, >5, N
@srivatsan: 21-Jul, >5, Y

gemac:
health 87%, coverage 89%
@ravijain056, 22-Jul, 0, N

pyleros:
health missing, coverage 70%
@formulator, 26-Jun, 0, N

@meetsha1995 and @sriramesh4: there has been low activity on

@Ravi_Jain: there has been no publicly available progress,

if you agree or disagree with the fail evaluation.

Links to the student blogs and repositories:

Merkourious, @mkatsimpris: gsoc blog, github repo
Vikram, @Vikram9866: gsoc blog, github repo
Meet, @meetshah1995: gsoc blog, github repo
Srivatsan, @srivatsan-ramesh: gsoc blog, github repo
Ravi, @ravijain056: gsoc blog, github repo
Pranjal, @forumulator: gsoc blog, github repo

Posts: 5

Participants: 4

July 20, 2016

Scipy 2016!

Last week I went to Austin, TX for Scipy2016. I wasn’t sure what to expect. How would people communicate? Would I fit in? What talks would interest me? Fortunately the conference was a huge success. I came away a far more confident and motivated programmer than when I went in.

So what were the highlights of my experience at Scipy?

On a personal level, I got to meet some of my coworkers, the members of the Beckstein Lab. Dr. Oliver Beckstein, David Dotson, and Sean Seyler are brilliant physicists and programmers who I have been working with on MDAnalysis and datreant. It was surreal to meet the people you have been working with over the internet for 3 months and get an idea of how they communicate and what they enjoy outside of work. It was the modern day equivalent of meeting penpals for the first time. I especially appreciated that David Dotson and Sean Seyler, both approximately four years my senior, provided invaluable advice to a recent graduate. If you’re reading this, thanks guys.

The most valuable moments were the conversations I had in informal settings. There is a huge diversity in career trajectories among those attending Scipy, and everyone has career advice and technical knowledge to impart upon a young graduate as long as you are willing to ask. I had excellent conversations with people from Clover Health, Apple data scientists, Andreas Klockner (Keynote Speaker), Brian Van de Ven (Bokeh Dev), Ana Ruvalcaba at Jupyter, the list goes on…

Fascinating, Troubling, and Unexpected Insights

• Scipy doubled in size in the last year!
• So many free shirts (and stickers); don’t even bother coming with more than one shirt. Also, nobody wears professional attire.
• Overheard some troubling comments made by men at Scipy, e.g. “Well, all the women are getting the jobs I’m applying for…” (said in a hallway group, this is not appropriate even if it was a joke)
• The amount of beer involved in social events is kind of nuts; this probably comes with the territory of professional programming.
• There are a lot of apologists for rude people. Someone can be extremely nonverbally dismissive, and when you bring it up to other people they will defend him (yes, always him), saying something to the effect of ‘he has been really busy recently’. Oliver Beckstein is a shining example of someone who is very busy and makes a conscious effort to always be thoughtful and kind.
• Open source does not always imply open contribution; some companies represented at Scipy maintain open source projects while making the barriers to contribution prohibitively high.
• A lot of people at Scipy apologize for their job (half-seriously) if they aren’t someone super-special like a matplotlib core developer or the inventor of Python. Your jobs are awesome, people!
• It is really hot in Austin.
• git pull is just git fetch + git merge.
• A lot of women in computing have joined and left male dominated organizations not because people are necessarily mean, but because they’ve been asked out too much or harassed in a similar fashion. Stay professional folks.
• Cows turn inedible corn into edible steak.
• As a young professional you have to work harder and take every moment more seriously than those older than you in order to get ahead.
• Breakfast tacos are delicious.
• Being able to get out of your comfort zone is a professional asset.
• Slow down, take a breath, read things over, don’t make simple mistakes.

July 19, 2016

GSoC week 8 roundup

@cfelton wrote:

There has been a little bit of a slump after the midterms;
hopefully this will not continue throughout the rest of
the program.

The end will approach quickly.

• August 15th: coding period ends.
• August 15-20th: students submit final code and evaluations.
• August 23-25th: mentors submit final evaluations.

Overall the progress being made is satisfactory. I am looking
forward to the next stage of the projects now that the majority of the
implementation is complete: analysis of the designs, clean-up,
and documentation.

One topic I want to stress: a program like GSoC is very different
from much of the work that is completed by an undergrad student.
This effort is the student's exposition for this period of time
(which isn't insignificant - (doh double negative)). Meaning, the
goal isn't to simply show us you can get something working; you
are publishing your work to the public. Users should easily be able
to use the cores developed and the subblocks within the cores.
Developers, reviewers, contributors should feel comfortable reading
the code. The code should feel clean [1]. You (the students) are
publishing something into the public domain that carries your name;
take great pride in your work: design, code, documentation, etc.

As well as the readability stated above the code should be
analyzed for performance, efficiency, resource usage, etc.
This information should be summarized in the blogs and final
documentation.

Student week8 summary (last blog, commits, PR):

jpegenc:
health 88%, coverage 97%
@mkatsimpris: 10-Jul, >5, Y
@Vikram9866: 25-Jun, >5, Y

riscv:
health 96%, coverage 91%
@meetsha1995: 14-Jul, 1, N

hdmi:
health 94%, coverage 90%
@srivatsan: 02-Jul, 0, N

gemac:
health 93%, coverage 92%
@ravijain056, 04-Jul, 2, N

pyleros:
health missing, coverage 70%
@formulator, 26-Jun, 0, N

Links to the student blogs and repositories:

Merkourious, @mkatsimpris: gsoc blog, github repo
Vikram, @Vikram9866: gsoc blog, github repo
Meet, @meetshah1995: gsoc blog, github repo
Srivatsan, @srivatsan-ramesh: gsoc blog, github repo
Ravi, @ravijain056: gsoc blog, github repo
Pranjal, @forumulator: gsoc blog, github repo

Posts: 2

Participants: 1

July 11, 2016

GSoC week 7 roundup

@cfelton wrote:

Student week7 summary (last blog, commits, PR):

jpegenc:
health 88%, coverage 95%
@mkatsimpris: 03-Jul, >5, Y
@Vikram9866: 25-Jun, >5, Y

riscv:
health 96%, coverage 91%
@meetsha1995: 24-Jun, 0, N

hdmi:
health 94%, coverage 90%
@srivatsan: 02-Jul, 0, N

gemac:
health 87%, coverage 89%
@ravijain056, 04-Jul, >5, Y

pyleros:
health missing, coverage 70%
@formulator, 26-Jun, 0, N

Links to the student blogs and repositories:

Merkourious, @mkatsimpris: gsoc blog, github repo
Vikram, @Vikram9866: gsoc blog, github repo
Meet, @meetshah1995: gsoc blog, github repo
Srivatsan, @srivatsan-ramesh: gsoc blog, github repo
Ravi, @ravijain056: gsoc blog, github repo
Pranjal, @forumulator: gsoc blog, github repo

Posts: 1

Participants: 1

June 29, 2016

Principal Component Analysis

My next subject for bloggery is Principal Component Analysis (PCA) (its sibling Multidimensional scaling has been left out for a future post, but it is just as special, don’t worry). If I were to give a talk on PCA, the slides would be roughly ordered as follows:

• A very short recap of dimension reduction
• PCA, what it stands for, rough background, history
• Eigenvectors (what are those?!)
• Covariance (Because variance matrix didn’t sound cool enough)
• The very fancy sounding method of Lagrange Multipliers (why they aren’t that hard)
• Explain the PCA Algorithm
• Random Walks: What are they, how are they taken on a configuration space
• Interpreting the results after applying PCA on MD simulation data

In reality I am not going to follow these bullet points. If you want to get information pertaining to the first two points, please read some of my previous posts. The last two points are going to be a subject for a post next week.

Here are some good sources for those seeking to acquaint themselves with Linear Algebra and Statistics: Multivariate Statistics and PCA (Lessons two through 10) and the Feynman Lecture on Physics: Probability. The Feynman lectures on physics are so good and so accessible. Richard Feynman certainly had his flaws but teaching was not one of them. If you’re too busy to read those, here’s a quick summary of some important ideas I will be using.

What is a Linear Transformation, John?

Glad you asked, friend! Let’s just stick to linearity. For a function to be linear, it must satisfy $f(a+b) = f(a) + f(b)$ (and, for any scalar $c$, $f(ca) = c f(a)$). As an example of a non-linear function, consider $y = x^{3}$. Plugging some numbers in, we can see this is non-linear: $(1+1)^{3} = 2^{3} \neq 1^{3} + 1^{3}$.
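The additivity check can be tried numerically. This is only a sketch: the helper `is_additive` and the two sample functions are names made up for illustration, and passing the check for one pair of inputs does not prove a function is linear (failing it does prove it is not).

```python
def is_additive(f, a, b, tol=1e-9):
    """Check f(a + b) == f(a) + f(b) for one pair of inputs (hypothetical helper)."""
    return abs(f(a + b) - (f(a) + f(b))) < tol


def scale(x):
    # A linear function: multiply by a constant.
    return 3.0 * x


def cube(x):
    # The non-linear example from the text: y = x^3.
    return x ** 3


scale_ok = is_additive(scale, 1.0, 1.0)  # 3*(1+1) equals 3*1 + 3*1
cube_ok = is_additive(cube, 1.0, 1.0)    # (1+1)^3 = 8, but 1^3 + 1^3 = 2
print(scale_ok, cube_ok)
```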

A transformation at its most abstract is the description of an algorithm that gets an object A to become an object B. In linear algebra the transformation is being done on vectors belonging to a domain (where vectors exist before the transformation) space $V$ and a range (where vectors exist after the transformation) space $W$. For the purposes of our work, these are both $R^{n}$, the $n$-fold Cartesian product of the real line. (The standard (x,y) coordinate system is the Cartesian product of the real line twice, or $R^{2}$.)

When dealing with vector spaces, our linear transformation can be represented by an $m$-by-$n$ matrix, where $m$ is the dimension of the vector space in which our original vector (or set of vectors) exists, and $n$ is the dimension of the space we are sending it into (for dimension reduction, $n$ is always less than or equal to $m$). So if we have some set of $k$ vectors being transformed, the matrices will have row-by-column sizes: $$[k \times m][m \times n] = [k \times n]$$ These maps can be scalings, rotations, shearings and more.
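To make the shape bookkeeping concrete, here is a small NumPy sketch. The sizes and the particular matrix are arbitrary choices for illustration: the map simply keeps the first two coordinates of each 3-dimensional vector.

```python
import numpy as np

k, m, n = 4, 3, 2  # k vectors, original dimension m, target dimension n

# k vectors stored one per row: shape (k, m)
X = np.arange(k * m, dtype=float).reshape(k, m)

# An m-by-n linear map (illustrative choice: drop the third coordinate)
M = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

Y = X @ M  # [k x m][m x n] = [k x n]
print(X.shape, M.shape, Y.shape)
```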

What is a vector, what is an eigenvector?

Good question! A vector has a magnitude (which is just some non-negative number for anything that we are doing) and a direction (a property that is drawn from the vector space to which the vector belongs). Being told to walk 10 paces due north is to follow a vector with magnitude 10 and direction north. Vectors are presented as if their tail is at the origin, and their head reflects their magnitude and direction. This allows some consistency when discussing things: when we are given the vector $(1,1)$ in $R^2$ we know it starts at the origin, and thus has magnitude (from the distance formula) of $\sqrt{2}$ and a direction 45 degrees from the horizontal axis. An eigenvector sounds a lot scarier than it is; the purpose of an eigenvector is to answer the question, ‘what vectors don’t change direction under a given linear transformation?’

This picture is stolen from wikipedia, but it should be clear that the blue vector is an eigenvector of this transformation, while the red vector is not.

The standard equation given when an eigenvalue problem is posed is: $$Mv = \lambda v$$

$M$ is some linear transformation, $v$ is an eigenvector we are trying to find, and $\lambda$ is the corresponding eigenvalue.

From this equation, we can see that eigenvectors are not unique; a direction is shared by many vectors. If we find a vector $v$ that satisfies this equation for our linear transformation $M$, scaling the vector by some nonzero constant $\alpha$ gives another solution with the same eigenvalue, since $M(\alpha v) = \alpha Mv = \lambda (\alpha v)$. The vector $(1,1)$ in $R^2$ has the same direction as $(2,2)$ in $R^2$. If one of these vectors isn’t subject to a direction change (and is therefore an eigenvector), then the other must be an eigenvector as well, because the eigenvector-ness (yes, I just coined this phrase) applies to all vectors with the same direction.
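A quick numerical check of how scaling affects an eigenvector, using the $(1,1)$ and $(2,2)$ vectors from the paragraph above. The matrix is an arbitrary symmetric example chosen for illustration so that $(1,1)$ happens to be an eigenvector; `numpy.linalg.eigh` is used because the matrix is symmetric.

```python
import numpy as np

# Illustrative symmetric matrix for which (1, 1) is an eigenvector
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

v = np.array([1.0, 1.0])
w = 2.0 * v  # same direction, twice the length: (2, 2)

Mv = M @ v   # comes out as 3 * v
Mw = M @ w   # comes out as 3 * w, so the eigenvalue is unchanged by scaling

# eigh returns eigenvalues in ascending order and unit-length eigenvectors,
# which is one conventional way of picking a representative per direction.
eigenvalues, eigenvectors = np.linalg.eigh(M)
print(Mv, Mw, eigenvalues)
```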

For those of you more familiar with Linear Algebra, this should not be confused with the fact that a linear transformation can have degenerate eigenvalues. Degenerate eigenvalues come up when the same eigenvalue is shared by more than one linearly independent eigenvector, but for the purposes of this post we will assume the eigenvalues of our transformation are distinct, so we can ignore this.

Statistics in ~300 words

Sticking to one dimension, plenty of data seems to be random while also favoring a central point. Consider the usual example of height across a population of people. Height can be thought of as a random variable. But this isn’t random in the way that people might think about randomness without some knowledge of statistics. There is a central point where the heights of people tend to cluster, and the likelihood of someone being taller or shorter than this central point decreases on a ‘bell-curve’.

This is called a normal distribution. Many datasets can be thought of as describing a set of outcomes for some random variable in nature. These outcomes are distributed in some fashion. In our example, the mean of the data is the average height over the entire population. The variance of our data is how far the heights are spread out from the mean. When the distribution of outcomes follows a bell-curve such as it does in our example, the distribution is referred to as normal. (There are some more technical details; for instance, as for any probability distribution, the total area under the bell-curve defined by the normal distribution is equal to one.)

When the data we want to describe reflects more than one random variable, this is a multivariate distribution. Statistics introduces the concept of covariance to describe the relationship that random variables have with one another. The magnitude of a covariance indicates in some fashion the relationship between random variables; it is not easily interpretable without some set of constraints on the covariances, which will come up in Principal Component Analysis. The sign of the covariance between two random variables (X,Y) indicates if the two variables are inversely related or directly related. A negative covariance between X and Y means that increasing X tends to decrease Y, while a positive covariance means that increasing X tends to increase Y. The normalized version of covariance is called the correlation coefficient and might be more familiar to those previously acquainted with statistics.
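The sign of the covariance and its normalized cousin, the correlation coefficient, can be seen on a tiny made-up dataset. The numbers below are invented for illustration (perfectly linear relationships, so the correlations land at the extremes); `numpy.cov` and `numpy.corrcoef` each return a 2-by-2 matrix whose off-diagonal entry is the value of interest.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_direct = 2.0 * x + 1.0     # rises as x rises
y_inverse = -3.0 * x + 10.0  # falls as x rises

# Off-diagonal entry [0, 1] is cov(x, y)
cov_direct = np.cov(x, y_direct)[0, 1]
cov_inverse = np.cov(x, y_inverse)[0, 1]

# Correlation rescales covariance into the range [-1, 1]
corr_direct = np.corrcoef(x, y_direct)[0, 1]
corr_inverse = np.corrcoef(x, y_inverse)[0, 1]

print(cov_direct, cov_inverse, corr_direct, corr_inverse)
```

The covariances differ in magnitude because they depend on the units and slopes involved, which is why the magnitude alone is hard to interpret; the correlation coefficient removes that dependence.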

Constrained Optimization

Again, please do look at the Feynman Lectures. If you’re unfamiliar with statistics, look at the Penn State material for the ideas I just went over to better understand them. The last subject I want to broach before getting into the details of PCA is optimization with Lagrange Multipliers.

Lagrange Multipliers are one method of solving a constrained optimization problem. Such a problem requires an objective function to be optimized and constraints to optimize against. An objective function is any function that we wish to maximize or minimize to reflect some target quantity achieving an optimum value. In short, the method of Lagrange Multipliers creates a system of equations such that we can solve for a term $\lambda$ that shows when the objective function achieves a maximum subject to constraints. In the case of PCA, the function we want to maximize is $M^{T} cov(X) M$, describing the covariance of our data matrix $X$.

PCA

Although it is not quite a fortuitous circumstance, the principal components of the covariance matrix are precisely its eigenvectors. For a multivariate dataset generated by $n$ random variables, PCA will return a sequence of $n$ eigenvectors, each describing more of the variance in the dataset than the next. The picture above presents an example of the eigenvectors reflecting the covariance of a 2-dimensional multivariate dataset.

To be perfectly honest, I don’t know a satisfying way to explain why the eigenvectors are the principal components. The best explanation I can come up with is that the algorithm for PCA is in correspondence with the algorithm for eigenvalue decomposition. It’s one of those things where I should be able to provide a proof for why the two problems are the same, but I cannot at the moment. (Commence hand-waving…)

Let’s look at the algorithm for Principal Component Analysis to better understand things. PCA is an iterative method that seeks to create a sequence of vectors that describe the covariance in a collection of data, each vector describing more than those that will follow. This introduces an optimization problem, an issue that I referenced earlier. In order to guarantee uniqueness of our solution, this optimization is subject to constraints using Lagrange Multipliers. Remember, the video provides an example of a constrained optimization problem from calculus. Principal Component Analysis finds this set of vectors by creating a linear transformation $M$ that maximizes the covariance objective function given below.

PCA’s objective function is $trace(M^{T} cov(X) M)$, and we are looking to maximize this. From the Penn State lectures:

Earlier in the course we defined the total variation of X as the trace of the variance-covariance matrix, or if you like, the sum of the variances of the individual variables. This is also equal to the sum of the eigenvalues.

In the first step of PCA, we can think of our objective function as: $$a^T cov(X) a$$

$a$ in this case is a single vector

We seek to choose $a$ to maximize this term such that the sum of the squares of the coefficients of $a$ is equal to one. (In math terms this is saying that the $L^2$ norm is one.) This constraint is introduced to ensure that a unique answer is obtained (remember eigenvectors are not unique; this is the same process one would undertake to get a unique sequence of eigenvectors from a decomposition!). In the second step, the objective function is: $$B^T cov(X) B$$

$B$ consists of $a$ and a new vector $b$

We look to choose $b$ such that it explains covariance not previously explained by $a$. For this reason we introduce an optimization problem with the following constraints:

• $a$ remains the same
• The sum of the squares of the coefficients of $b$ equals one
• None of the covariance explained by vector $a$ is explained by $b$, (the two vectors are orthogonal, another property of eigenvectors!)

This process is repeated for the entire transformation $M$. This gives us a sequence of eigenvalues that, once divided by their total, each reflect some fraction of the total variance and sum to one. For our previously mentioned n-dimensional multivariate dataset, generally some number $k \lt n$ of eigenvectors explain the total variance well enough to ignore the $n - k$ vectors remaining. This gives a set of principal components to investigate.
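The outcome of this procedure can be sketched with NumPy by going straight to the eigendecomposition of the covariance matrix rather than iterating the constrained optimization step by step. Everything about the data here is an assumption for illustration: a toy dataset with three features, two of which follow a shared hidden signal, and a 95% threshold picked arbitrarily for choosing $k$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 samples, 3 features, two of them strongly related
latent = rng.normal(size=(500, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(500, 1)),
    2.0 * latent + 0.1 * rng.normal(size=(500, 1)),
    0.5 * rng.normal(size=(500, 1)),
])

cov = np.cov(X, rowvar=False)  # 3x3 covariance matrix (columns are variables)

# eigh is meant for symmetric matrices; it returns ascending eigenvalues
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # largest variance first
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order]    # columns are the principal components

explained = eigenvalues / eigenvalues.sum()  # fractions of the total variance
cumulative = np.cumsum(explained)

# Smallest k whose components reach the (arbitrary) 95% threshold
k = int(np.searchsorted(cumulative, 0.95) + 1)
print(explained, k)
```

The eigenvalues sum to the trace of the covariance matrix, which is the total variation mentioned in the Penn State quote, so dividing by that sum is what turns them into fractions.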

It might follow that this corresponds to an eigenvalue problem: $$cov(X) M = M \Lambda$$ where $\Lambda$ is the diagonal matrix holding the eigenvalues $\lambda$. If it doesn’t appear to be clear, let’s step back and look again at eigenvectors. As I said earlier, eigenvectors provide insight into what vectors don’t change direction under a linear map. We are trying to find a linearly independent set of vectors that provide insight into the structure of our covariance from our multivariate distribution. For a symmetric matrix such as a covariance matrix, eigenvectors with different eigenvalues are orthogonal. The algorithm we described is precisely an algorithm one could use to find the eigenvectors of any symmetric linear transformation, not just the linear transformation done in Principal Component Analysis.
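For the first step, the correspondence can be sketched with a Lagrange multiplier. Writing $\Sigma = cov(X)$ (a shorthand introduced only for this sketch) and encoding the unit-norm constraint as $a^T a = 1$, the function to make stationary is:

$$\mathcal{L}(a, \lambda) = a^T \Sigma a - \lambda (a^T a - 1)$$

Setting the derivative with respect to $a$ to zero, and using the fact that $\Sigma$ is symmetric:

$$\frac{\partial \mathcal{L}}{\partial a} = 2 \Sigma a - 2 \lambda a = 0 \quad \Rightarrow \quad \Sigma a = \lambda a$$

Plugging this back into the objective under the constraint gives:

$$a^T \Sigma a = \lambda a^T a = \lambda$$

So any stationary point is an eigenvector of the covariance matrix, and the value of the objective there is its eigenvalue. The maximum is therefore reached at the eigenvector with the largest eigenvalue.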

Chemistry application and interpretation:

Before PCA is done on an MD simulation, we have to consider what goals we have for the analysis of results. We are looking to arrange the data output by PCA such that it gives us intuition into some physical behavior of our system. This is usually done on a single structure, and in order to focus insight on relative structural change rather than some sort of translational motion of the entire structure, an alignment of the entire trajectory to some structure of interest by minimization of the Root Mean Square Distance must be done prior to analysis. After RMS alignment, any variance in the data should be variance due to changes in structure.
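As a rough illustration of what RMS alignment does to a single frame, here is a sketch using the Kabsch algorithm, one standard way of finding the rotation that minimizes the RMS distance between two sets of points. This is not necessarily how any particular MD package implements it; the function names and the five made-up "atom" positions are assumptions for the example.

```python
import numpy as np


def kabsch_align(mobile, reference):
    """Rotate and translate `mobile` (N x 3) onto `reference` (N x 3).

    Hypothetical helper: a plain Kabsch fit with equal weights.
    """
    mobile_center = mobile.mean(axis=0)
    reference_center = reference.mean(axis=0)
    P = mobile - mobile_center        # remove translation
    Q = reference - reference_center

    H = P.T @ Q                       # 3x3 cross-covariance of the coordinates
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (a reflection)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Rows are points, so apply R from the right as its transpose
    return P @ R.T + reference_center


def rmsd(a, b):
    """Root mean square distance between matched points."""
    return np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1)))


# Made-up reference structure with five points
reference = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0],
                      [0.0, 0.0, 3.0],
                      [1.0, 1.0, 1.0]])

# The same structure rotated 90 degrees about z and shifted
rotation_z = np.array([[0.0, -1.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0]])
mobile = reference @ rotation_z.T + np.array([5.0, -3.0, 2.0])

aligned = kabsch_align(mobile, reference)
print(rmsd(mobile, reference), rmsd(aligned, reference))
```

Because the second structure differs from the first only by a rigid motion, the alignment removes essentially all of the distance; for a real trajectory, whatever remains after alignment is the structural change that PCA is meant to pick up.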

Cecilia Clementi has this quote in her Free Energy Landscapes paper:

Essentially, PCA computes a hyperplane that passes through the data points as best as possible in a least-squares sense. The principal components are the tangent vectors that describe this hyperplane

How do we interpret n-dimensional physical data from the tangent vectors of a k-dimensional hyperplane embedded in this higher dimensional space? This is all very mathematical and abstract. What we do is reduce the analysis to some visually interpretable subset of components, and see if there is any indication of clustering that occurs.

Remember, we have an explicit linear map relating the higher dimensional space to the lower-dimensional space. By taking our trajectory and projecting it onto one of the eigenvector components of our analysis, we can extract embeddings in different ways. From Clementi again:
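The projection itself is just a matrix product. In this sketch the "trajectory" is random numbers standing in for an already aligned simulation (200 frames of 5 atoms, flattened to 15 coordinates per frame), so the resulting embedding has no physical meaning; it only shows the mechanics of projecting onto the first two components.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for an aligned trajectory: 200 frames x 15 coordinates
trajectory = rng.normal(size=(200, 15))

centered = trajectory - trajectory.mean(axis=0)
cov = np.cov(centered, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]    # largest variance first

top_two = eigenvectors[:, order[:2]]     # explicit 15 x 2 linear map
projection = centered @ top_two          # 200 x 2 embedding, one row per frame

print(projection.shape)
```

Each row of `projection` is one frame expressed in the coordinates of the first two principal components, which is the kind of two-dimensional picture one would then inspect for clustering.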

So, the first principal component corresponds to the best possible projection onto a line, the first two correspond to the best possible projection onto a plane, and so on. Clearly, if the manifold of interest is inherently non-linear the low-dimensional embedding obtained by means of PCA is severely distorted… The fact that empirical reaction coordinates routinely used in protein folding studies can not be reduced to a linear combination of the Cartesian coordinates underscores the inadequacy of linear dimensionality reduction techniques to characterize a folding landscape.

Again, this is all esoteric for the lay reader. What is a manifold, a reaction coordinate, a linear combination of cartesian coordinates? All we should know is that PCA is a limited investigational tool for complex systems; the variance the principal components explain should not necessarily be interpreted as physical parameters governing the behavior of a system.

My mentor Max has a great Jupyter notebook up demonstrating PCA done on MD simulations here. All of these topics are covered in the notebook and should be relatively accessible if you understand what I’ve said so far. In my next post I will write about how I will be implementing PCA as a module in MDAnalysis.

-John

June 28, 2016

A Note from the Author

In the last blog post I wrote, the most common critique I received was that I alienated myself from most of my potential audience. In an email I expressed to my Summer of Code mentor Max Linke my problem:

My number one worry in all of these matters is coming off as unrigorous or pseudo-scientific, and I think I probably overcompensate by being borderline inaccessible. I think this stems from some time spent enjoying a lack of rigor and the fun that is being pseudo-scientific.

For a while, after perusing blogs and social media, I thought, “Hey, I strongly identify with this ‘impostor syndrome’ thing.” Now I realize that’s a pretty ignorant and borderline insulting view to have. From what I can tell, I may have insecurities, but the difference between my anxieties and those of people who struggle with true ‘impostor syndrome’ is that someone has to have experienced tangible evidence that they are an outsider. As a straight white male, I don’t have these problems. So I guess in the future I will refrain from letting these anxieties be confused with something more serious; I have it pretty easy.

Going further, as a student and a tutor I noticed that far too often when people were in over their heads, they would get quiet and close off to the outside world. Especially in my math classes; a professor could be explaining Jordan Normal Forms, reciting proofs and corollaries and lemmas as if they were gospel, and although everyone was baffled, they would stay quiet. Nobody likes it when someone dominates a lecture with their own questions and at the same time a lot of people have missed fundamentals out of fear of sounding stupid. If I’m ever asked in a job interview to give a personal strength, it would be that I ask questions that might seem stupid with reckless abandon.

These posts are intended for people working to teach themselves some difficult topics. I apologize for being obtuse and abstract and abstruse earlier. I will do my best to teach things from an intuition-first standpoint from here on and provide resources for refreshing on math and statistics topics. Please get in touch with me if something I say is unclear or wrong; this blog is as much for my own education as it is for others.

-John

June 27, 2016

GSoC week 5 roundup

@cfelton wrote:

Last week we had the mid-term reviews. Unfortunately there was
a communication error and many of our mentors are not marked as
mentors in the GSoC system. @mentors in the future, we need to get
our reviews in 96 hours before the GSoC deadline. PSF requires
48 hours before (for review) and I require 48 hours for review.
Please be respectful of everyone's time involved and don't wait until
the last minute to do the reviews.

Consistent progress was made on the projects this week by all
students.

Student week 5 summary (last blog, commits, PR):

jpegenc:
health 88%, coverage 95%
@mkatsimpris: 26-Jun, >5, N
@Vikram9866: 25-Jun, >5, N

riscv:
health 96%, coverage 91%
@meetsha1995: 24-Jun, >5, Y

hdmi:
health 94%, coverage 90%
@srivatsan: 11-Jun, >5, Y

gemac:
health 87%, coverage 89%
@ravijain056, 17-Jun, 3, Y

pyleros:
health missing, coverage 70%
@forumulator, 26-Jun, >5, Y

Links to the student blogs and repositories:

Merkourious, @mkatsimpris: gsoc blog, github repo
Vikram, @Vikram9866: gsoc blog, github repo
Meet, @meetshah1995: gsoc blog, github repo
Srivatsan, @srivatsan-ramesh: gsoc blog, github repo
Ravi, @ravijain056: gsoc blog, github repo
Pranjal, @forumulator: gsoc blog, github repo

Posts: 3

Participants: 2

June 23, 2016

@cfelton wrote:

This category is used to post weekly GSoC student project summaries.

Posts: 1

Participants: 1

June 18, 2016

GSoC week 4 roundup

@cfelton wrote:

All the midterm reviews are due by the 25th, the reviews open up
next week and primary mentors are encouraged to complete the
reviews as soon as possible.

Students, try and complete your milestones by Monday :), write
a longer midterm summary blog, and in the blog if you are not on
schedule include a modified proposal with new milestones.

Student week4 summary (last blog, commits, PR):

jpegenc:
health 88%, coverage 95%
@mkatsimpris: 15-Jun, >5, Y
@Vikram9866: 03-Jun, 5, Y

riscv:
health 96%, coverage unknown
@meetsha1995: 12-Jun, 3, N

hdmi:
health 94%, coverage 90%
@srivatsan: 11-Jun, >5, N

gemac:
health 57%, coverage 0%
@ravijain056, 17-Jun, >5, N

pyleros:
health missing, coverage missing
@formulator, 23-Mar, 0, N

Links to the student blogs and repositories:

Merkourious, @mkatsimpris: gsoc blog, github repo
Vikram, @Vikram9866: gsoc blog, github repo
Meet, @meetshah1995: gsoc blog, github repo
Srivatsan, @srivatsan-ramesh: gsoc blog, github repo
Ravi, @ravijain056: gsoc blog, github repo
Pranjal, @forumulator: gsoc blog, github repo

Posts: 1

Participants: 1

June 15, 2016

Diffusion Maps in Molecular Dynamics Analysis

It occurs to me that in my previous post I didn’t thoroughly explain the motivation for dimension reduction in general. When we have this data matrix $X$ with $n$ samples and each sample having $m$ features, this number $m$ can be very large. This data contains information that we want to extract; in the case of molecular dynamics simulations these are parameters describing how the dynamics are occurring. But this data can also be features that distinguish faces from others in the dataset, handwritten letters and numbers from other numbers, etc. As it is so eloquently put by Porte and Herbst at Arizona:

The breakdown of common similarity measures hampers the efficient organisation of data, which, in turn, has serious implications in the field of pattern recognition. For example, consider a collection of n × m images, each encoding a digit between 0 and 9. Furthermore, the images differ in their orientation, as shown in Fig.1. A human, faced with the task of organising such images, would likely first notice the different digits, and thereafter that they are oriented. The observer intuitively attaches greater value to parameters that encode larger variances in the observations, and therefore clusters the data in 10 groups, one for each digit

Here we’ve been introduced to the idea of pattern recognition and ‘clustering’, the latter will be discussed in some detail later. Continuing on…

On the other hand, a computer sees each image as a data point in $R^{nm}$, an nm-dimensional coordinate space. The data points are, by nature, organised according to their position in the coordinate space, where the most common similarity measure is the Euclidean distance.

The idea of the data being in an $nm$-dimensional space is introduced by the authors. The important part is that a computer has no knowledge of the patterns inside this data. The human brain is excellent at plenty of algorithms, but dimension reduction is one it is especially good at.

Start talking about some chemistry John!

Fine! Back to the matter at hand, dimension reduction is an invaluable tool in modern computational chemistry because of the massive dimensionality of molecular dynamics simulations. To my knowledge, the biggest things being studied by MD currently are on the scale of the HIV-1 Capsid at 64 million atoms! Of course, these studies are being done on supercomputers, and for the most part studies are running on a much smaller number of atoms. For a thorough explanation of how MD simulations work, my Summer of Code colleague Fiona Naughton has an excellent and cat-filled post explaining MD and Umbrella Sampling. Why do we care about dynamics? As Dr. Cecilia Clementi mentions in her slides, ‘Crystallography gives structures, but function requires dynamics!’

A molecular dynamics simulation can be thought of as a diffusion process subject to drag (from the interactions of molecules) and random forces (Brownian motion). This means that the time evolution of the probability density of a molecule occupying a point in the configuration space, $P(x,t)$, satisfies the Fokker-Planck equation (this is some complex math from statistical mechanics). The important thing to note is that the Fokker-Planck equation has a discrete eigenspectrum, and that there usually exists a spectral gap reflecting the ‘intrinsic dimensionality’ of the system it is modeling. A diffusion process is by definition Markovian, in this case a continuous Markov process, which means the state at time $t$ depends solely on the state at the instant before it. This is easier to picture in the discrete setting of an actual MD simulation: the state at time $t$ is only determined by the state at time $t-1$.

Diffusion maps in MD try to find a discrete approximation of the eigenspectrum of the Fokker-Planck equation by taking the following steps. First, we can think of changes in configuration as random walks on an infinite graph defined by the configuration space. From Porte again:

The connectivity between two data points, x and y, is defined as the probability of jumping from x to y in one step of the random walk, and is

$$connectivity(x,y) = p(x,y)$$

It is useful to express this connectivity in terms of a non-normalised likelihood function, k, known as the diffusion kernel:

$$connectivity \propto k(x,y)$$

The kernel defines a local measure of similarity within a certain neighbourhood. Outside the neighbourhood, the function quickly goes to zero. For example, consider the popular Gaussian kernel:

$$k(x,y) = \exp(-\frac{|x-y|^{2}}{\epsilon})$$
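To make the kernel concrete, here is a minimal NumPy sketch that builds the Gaussian kernel matrix for a handful of points and row-normalises it into one-step jump probabilities. The function name `gaussian_kernel`, the toy points and the value of epsilon are my own assumptions for illustration, not anything from Porte's paper or from MDAnalysis.

```python
import numpy as np

def gaussian_kernel(X, epsilon):
    """Return the matrix k(x_i, x_j) = exp(-|x_i - x_j|^2 / epsilon)."""
    # Pairwise squared Euclidean distances via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dists / epsilon)

# Three toy points on a line: two neighbours and one far away.
X = np.array([[0.0], [0.1], [5.0]])
K = gaussian_kernel(X, epsilon=1.0)

# Row-normalising the kernel gives one-step jump probabilities p(x, y).
P = K / K.sum(axis=1, keepdims=True)
```

The nearby pair gets a kernel value close to one, while the far-away point is effectively disconnected, which is the ‘local neighbourhood’ behaviour described in the quote.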

Coifman and Lafon provide a dense but extremely thorough explanation of diffusion maps in their seminal paper. This quote screams molecular dynamics:

Now, since the sampling of the data is generally not related to the geometry of the manifold, one would like to recover the manifold structure regardless of the distribution of the data points. In the case when the data points are sampled from the equilibrium distribution of a stochastic dynamical system, the situation is quite different as the density of the points is a quantity of interest, and therefore, cannot be gotten rid of. Indeed, for some dynamical physical systems, regions of high density correspond to minima of the free energy of the system. Consequently, the long-time behavior of the dynamics of this system results in a subtle interaction between the statistics (density) and the geometry of the data set.

In this paper, the authors acknowledge that oftentimes an isotropic kernel is not sufficient to understand the relationships in the data. They pose the question:

In particular, what is the influence of the density of the points and of the geometry of the possible underlying data set over the eigenfunctions and spectrum of the diffusion? To address this type of question, we now introduce a family of anisotropic diffusion processes that are all obtained as small-scale limits of a graph Laplacian jump process. This family is parameterized by a number $\alpha$ which can be tuned up to specify the amount of influence of the density in the infinitesimal transitions of the diffusion. The crucial point is that the graph Laplacian normalization is not applied on a graph with isotropic weights, but rather on a renormalized graph.

The derivation from here requires a few more steps:

• Form a new kernel from the anisotropic diffusion term: let $$q_{\epsilon}(x) = \int k_{\epsilon}(x,y)q(y) \,dy$$
and define $$k_{\epsilon}^{(\alpha)}(x,y) = \frac{k_{\epsilon}(x,y)}{q_{\epsilon}^{\alpha}(x) q_{\epsilon}^{\alpha}(y) }$$
• Apply the weighted graph Laplacian normalization: $$d_{\epsilon}^{(\alpha)}(x) = \int k_{\epsilon}^{(\alpha)}(x,y)q(y) \,dy$$
• Define the anisotropic transition kernel from this term: $$p_{\epsilon,\alpha}(x, y) = \frac{k_{\epsilon}^{(\alpha)}(x,y)}{d_{\epsilon}^{(\alpha)}(x)}$$
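The steps above can be sketched on a finite set of samples, with the integrals against $q(y)\,dy$ replaced by plain sums over the samples (my assumption here is that the samples are drawn from the density $q$). This follows Coifman and Lafon's $\alpha$-normalization, in which the density estimate is raised to the power $\alpha$; the function name and the random toy data are hypothetical.

```python
import numpy as np

def anisotropic_transition_matrix(K, alpha=0.5):
    """Alpha-normalise an isotropic kernel matrix K, then row-normalise it."""
    # q_eps(x): discrete density estimate, the row sums of the kernel.
    q = K.sum(axis=1)
    # k^(alpha)(x, y) = k(x, y) / (q(x)^alpha * q(y)^alpha)
    K_alpha = K / np.outer(q ** alpha, q ** alpha)
    # d^(alpha)(x): row sums of the renormalised kernel.
    d = K_alpha.sum(axis=1)
    # p(x, y) = k^(alpha)(x, y) / d^(alpha)(x), so every row sums to one.
    return K_alpha / d[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 1.0)          # isotropic Gaussian kernel, epsilon = 1
P = anisotropic_transition_matrix(K, alpha=0.5)
```

Note that `K` is symmetric but the resulting `P` generally is not, because each row is divided by its own normalisation term.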

This was all kinds of painful, but what this means for diffusion maps in MD is that a meaningful diffusion map will have an anisotropic (and therefore asymmetric) kernel. Coifman and Lafon go on to prove that for $\alpha$ equal to $\frac{1}{2}$ this anisotropic kernel is an effective approximation for the Fokker-Planck equation! This is a really cool result that is in no way obvious.

Originally, when I studied diffusion maps while applying for the Summer of Code I was completely unaware of Fokker-Planck and the anisotropic kernel. Of course, learning these topics takes time, but I was under the impression that diffusion kernels were symmetric across the board, which is just dead wrong. This of course changes how eigenvalue decomposition can be performed on a matrix and requires a routine like Singular Value Decomposition instead of Symmetric Eigenvalue Decomposition. If I had spent more time researching literature on my own I think I could have figured this out. With that being said, there are 100+ dense pages given in the citations below.

So where are we at? Quick recap about diffusion maps:

• Start taking random walks on a graph
• There are different costs for different walks based on likelihood of walk happening
• We established a kernel based on all these different walks
• For MD we manipulate this kernel so it is anisotropic!

Okay, so what do we have left to talk about…

• How is epsilon determined?
• What if we want to take a random walk of more than one jump?
• Hey John, we’re not actually taking random walks!
• What do we do once we get an eigenspectrum?
• What do we use this for?

Epsilon Determination

Epsilon determination is kind of funky. First off, Dr. Andrew L. Ferguson notes that division by epsilon retains ‘only short pairwise distances on the order of $\sqrt{2\epsilon}$’. In addition, Dr. Clementi in her slides on diffusion maps notes that the neighborhood determined by epsilon should be locally flat. For a free-energy surface, this means that it is potentially advantageous to define a unique epsilon for every single element of a kernel based on the nearest neighbors to that point in terms of value. This can get painful. Most researchers seem to use a constant epsilon determined from some sort of guess and check method based on clustering.

For my GSoC pull request that is up right now, the plan is to have an API for an Epsilon class that must return a matrix whose $ij$-th entry is $\frac{d(i,j)^2}{\epsilon_{ij}}$. From here, given weights for the anisotropy of the kernel, we can form the anisotropic kernel to be eigenvalue-decomposed. Any researcher who cares to do some complex choice of epsilon based on nearest-neighbors is probably a good enough hacker to handle implementation of this API in a quick script.
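As a rough illustration of what such an epsilon step could return, here is a sketch with two hypothetical helpers: one using a single constant epsilon, and one using a local scale built from each point's $k$-th nearest neighbour, with $\epsilon_{ij} = \epsilon_i \epsilon_j$. Both function names and the local-scale rule are assumptions of mine for the example; they are not the API in the pull request.

```python
import numpy as np

def scale_constant(dist, epsilon):
    """Return d(i, j)^2 / epsilon for one global epsilon."""
    return dist ** 2 / epsilon

def scale_local(dist, k):
    """Return d(i, j)^2 / (eps_i * eps_j), eps_i = distance to k-th neighbour."""
    # Column 0 of each sorted row is the zero self-distance, so column k
    # holds the distance to the k-th nearest neighbour.
    eps = np.sort(dist, axis=1)[:, k]
    return dist ** 2 / np.outer(eps, eps)

# Distance matrix for three points on a line at positions 0, 1 and 3.
pos = np.array([0.0, 1.0, 3.0])
dist = np.abs(pos[:, None] - pos[None, :])

scaled_const = scale_constant(dist, epsilon=2.0)
scaled_local = scale_local(dist, k=1)
```

Either result can then be negated and exponentiated to give the kernel. For a large trajectory one would want to do the division in place rather than allocate a second matrix.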

Length $t$ Walks

Nowhere in the construction of our diffusion kernel are we actually taking random walks. What we are doing is taking all possible walks, where two vertices on the graph are close if $d(x,y)$ is small and far apart if $d(x,y)$ is large. This accounts for all possible one-step walks across our data. In order to get a good idea of transitions that occur over larger timesteps, we take multiple steps. To construct this set of walks, we must multiply our transition matrix $P$ by itself $t$ times (that is, raise it to the power $t$), where $t$ is the number of steps in the walk across the graph. From Porte again (stealing is the best form of flattery, no?):

With increased values of t (i.e. as the diffusion process “runs forward”), the probability of following a path along the underlying geometric structure of the data set increases. This happens because, along the geometric structure, points are dense and therefore highly connected (the connectivity is a function of the Euclidean distance between two points, as discussed in Section 2). Pathways form along short, high probability jumps. On the other hand, paths that do not follow this structure include one or more long, low probability jumps, which lowers the path’s overall probability.
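A tiny numerical sketch of ‘running the diffusion forward’: the $t$-step walk probabilities are the entries of $P^t$, and the same matrix can be rebuilt from the eigenvalues of $P$ raised to the power $t$. The 3-state matrix below is a made-up example, not MD data.

```python
import numpy as np

# A hypothetical 3-state transition matrix: state 2 can only be reached
# from state 0 by passing through state 1.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

t = 4
P_t = np.linalg.matrix_power(P, t)   # probabilities of t-step walks

# Same result from the spectrum: P^t = V diag(lambda^t) V^{-1}.
eigvals, V = np.linalg.eig(P)
P_t_spectral = (V * eigvals ** t) @ np.linalg.inv(V)
```

This is why the eigenproblem only has to be solved once for $t = 1$: longer timescales just rescale each eigenvector by a power of its eigenvalue.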

I said something blatantly wrong in my last post. I’m a fool, but still, things do get a little complicated when analyzing time series data with diffusion maps. We want both to investigate walks at different timescales from the diffusion maps and to be able to project a snapshot from a trajectory at a given timestep onto the corresponding set of eigenvectors describing the lower dimensional order-parameters.

From Ferguson:

The diffusion map embedding is defined as the mapping of the ith snapshot into the ith components of each of the top k non-trivial eigenvectors of the $M$ matrix.

Here the $M$ matrix is our anisotropic kernel. So from a spectral decomposition of our kernel (remember that it is generated by a particular timescale walk), we get a set of eigenvectors onto which we project our snapshot (what we have been calling both a trajectory frame and a sample, sorry) that exists at a particular timestep in our MD trajectory. This can create some overly similar notation, so I’m just going to avoid it and hope that it makes more sense without notation.
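Since I am skipping the notation, a short sketch may help. The hypothetical function below maps snapshot $i$ to the $i$-th components of the top $k$ non-trivial eigenvectors, each scaled by its eigenvalue raised to the power $t$. The scaling is the usual diffusion-map convention and is my addition to the Ferguson quote; the toy data (two groups of points on a line) and the plain row-normalised kernel are assumptions for the example.

```python
import numpy as np

def diffusion_embedding(P, k, t=1):
    """Map snapshot i to (lambda_1^t psi_1[i], ..., lambda_k^t psi_k[i])."""
    eigvals, eigvecs = np.linalg.eig(P)
    # Sort by decreasing eigenvalue. Index 0 is the trivial pair
    # (eigenvalue 1, constant eigenvector) and is skipped.
    order = np.argsort(-eigvals.real)
    eigvals = eigvals.real[order]
    eigvecs = eigvecs.real[:, order]
    return eigvecs[:, 1:k + 1] * eigvals[1:k + 1] ** t

# Six 'snapshots' on a line, forming two groups of three.
X = np.array([0.0, 0.5, 1.0, 3.0, 3.5, 4.0])
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / 1.0)
P = K / K.sum(axis=1, keepdims=True)

emb = diffusion_embedding(P, k=2, t=1)   # row i is the embedding of snapshot i
```

In this toy case the first embedding coordinate has one sign for the first group and the opposite sign for the second, so it acts as a crude order parameter separating the two.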

Using Diffusion Maps in MDAnalysis

Alright, this has been a lot to digest, but hopefully you are still with me. Why are we doing this? There are plenty of reasons, and I am going to list a few:

• Dr. Ferguson used diffusion maps to investigate the assembly of polymer subunits in this paper
• Also for the order parameters in alkane chain dynamics
• Also for umbrella sampling
• Dr. Clementi used this for protein folding order parameters here
• Also, Dr. Clementi used this for polymerization reactions here
• Dr. Clementi also created a variant that treats epsilon determination very carefully with LSD
• There are more listed in my works cited

The first item in that list is especially cool; instead of using a standard RMSD metric, they abstracted a cluster-matching problem into a graph matching problem, using an algorithm called Isorank to find an approximate ‘greedy’ solution.

There are some solid ‘greedy’ vs. ‘dynamic’ explanations here. The example I remember getting is to imagine you are a programmer for a GPS direction provider. We can consider two ways of deciding an optimal route, one with a greedy algorithm and the other with a dynamic algorithm. At each gridpoint on a map, a greedy algorithm will take the fastest route at that point. A dynamic algorithm will branch ahead, look into the future, and possibly avoid short-term gain for long term drive-time savings. The greedy algorithm might have a better best-case performance, but a much poorer worst-case performance.
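Here is a toy version of the GPS example, assuming a grid of travel costs where the driver may only move right or down. The greedy routine always steps onto the cheaper neighbouring cell, while the dynamic programming routine considers every route. The grid and both function names are invented for illustration.

```python
def greedy_cost(grid):
    """Walk from top-left to bottom-right, always taking the cheaper next cell."""
    rows, cols = len(grid), len(grid[0])
    r, c = 0, 0
    total = grid[0][0]
    while (r, c) != (rows - 1, cols - 1):
        down = grid[r + 1][c] if r + 1 < rows else float("inf")
        right = grid[r][c + 1] if c + 1 < cols else float("inf")
        if down < right:
            r += 1
        else:
            c += 1
        total += grid[r][c]
    return total

def dynamic_cost(grid):
    """Minimum total cost over all right/down routes (dynamic programming)."""
    rows, cols = len(grid), len(grid[0])
    best = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if r == 0 and c == 0:
                prev = 0
            elif r == 0:
                prev = best[r][c - 1]
            elif c == 0:
                prev = best[r - 1][c]
            else:
                prev = min(best[r - 1][c], best[r][c - 1])
            best[r][c] = prev + grid[r][c]
    return best[-1][-1]

grid = [[1, 1, 9],
        [2, 9, 9],
        [1, 1, 1]]
```

On this grid the greedy walk is lured right by the cheap first step and ends up paying 21, while the dynamic programming route goes down first and pays 6.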

In any case, we want to allow for the execution of a diffusion map algorithm where a user can provide their own metric, tune the choice of epsilon, the choice of timescale, and project the original trajectory timesteps onto the new dominant eigenvector, eigenvalue pairs.

Let’s talk API/ Actual Coding (HOORAY!)

DistMatrix

• Does frame by frame analysis on the trajectory, implements the _prepare and _single_frame methods of the BaseAnalysis class
• User selects a subset of atoms in the trajectory here
• This is where user provides their own metric, cutoff for when metric is equal, weights for weighted metric calculation, and a start, stop, step for frame analysis

Epsilon

• We will have some premade classes inheriting from epsilon, but all the API will require is to return the manipulated DistMatrix, where each term has now been divided by some scale parameter epsilon
• These operations should be done in place on the original DistMatrix, under no circumstances should we have two possibly large matrices sitting in memory

DiffusionMap

• Accepts DistMatrix (initialized), Epsilon (uninitialized) with default a premade EpsilonConstant class, timescale t with default = 1, weights of anisotropic kernel as parameters
• Performs the BaseAnalysis conclude method, wherein it exponentiates the negative of each term given by Epsilon.scaledMatrix, performs the procedure for the creation of the anisotropic kernel above, and raises the anisotropic kernel to the power of the timescale t by repeated matrix multiplication.
• Finally, eigenvalue decomposes the anisotropic kernel and holds onto the eigenvectors and eigenvalues as attributes.
• Should contain a method DiffusionMap.embedding(timestep), that projects a timestep to its diffusion embedding at the given timescale t.

Works Cited:

June 13, 2016

GSoC week 3 roundup

@cfelton wrote:

For some students things slowed down this week for others the pace was maintained. Students, for the most part, are becoming more familiar with the design process and the MyHDL design patterns. It is fun watching the projects evolve.

The first week it was mentioned two tests were generated that exposed possible bugs. One of the tests has been submitted as a PR to the MyHDL repository the other still needs to be PR'd to the MyHDL repository. As mentioned in the first round-up this is great exposure to the students and open-source development.

Student week3 summary (last blog, commits, PR):

jpegenc:
health 88%, coverage 95%
@mkatsimpris: 09-Jun, >5, Y,
@Vikram9866: 03-Jun, 2, N

The students are moving on to their next blocks, with continued PRs pushed to the main repo and expansion of the tests.

riscv:
health 96%, coverage unknown
@meetshah1995: 12-Jun, >5, Y

hdmi:
health missing, coverage missing
@srivatsan: 11-Jun, >5, N

gemac:
health 57%, coverage 0%
@ravijain056: 09-Jun, >5, Y

pyleros:
health missing, coverage missing
@forumulator: 23-Mar, 0, N

Links to the student blogs and repositories:

Merkourious, @mkatsimpris: gsoc blog, github repo
Vikram, @Vikram9866: gsoc blog, github repo
Meet, @meetshah1995: gsoc blog, github repo
Srivatsan, @srivatsan-ramesh: gsoc blog, github repo
Ravi, @ravijain056: gsoc blog, github repo
Pranjal, @forumulator: gsoc blog, github repo

Dimension Reduction, a review of a review

Hello! This is my first post moving over to a new site built by wintersmith. Originally I was going to use jekyll pages, but there was an issue with the latest Ruby version not being available for Linux (maybe macs are better…). I spent way too much time figuring out how to install a markdown plugin that allowed for the inclusion of LaTeX. I did this all without realizing I could simply include:

<script type="text/javascript" async
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>


below my article title and LaTeX would easily render. Now that this roadblock is cleared, I have no excuses preventing me from writing a post about my work.

This post is meant to discuss various dimension reduction methods as a preface to a more in-depth post about diffusion maps performed on molecular dynamics simulation trajectories. It assumes college-level math skills, but will try to briefly explain high-level concepts from Math and Stats. Towards the end I will provide a segue into the next post.

Dimension reduction is performed on a data matrix $X$ consisting of $n$ ‘samples’ wherein each sample has a set of $m$ features associated with it. The data in the matrix is considered to have dimension $m$, but oftentimes the actual ‘intrinsic dimensionality’ is much lower. As Laurens van der Maaten defines it, ‘intrinsic dimensionality’ is ‘the minimum number of parameters needed to account for the observed properties of the data’.

(So far, the most helpful explanation of this fact was presented in a paper on diffusion maps by Porte et al. In the paper, a dataset of m-by-n pixel pictures of a simple image randomly rotated originally has dimension $mn$ but after dimension reduction, the dataset can be organized two dimensionally based on angle of rotation.)

At the most abstract level, dimension reduction methods usually are posed as an optimization problem that often requires the solution to an eigenvalue problem. What is an optimization problem you ask? That wikipedia article should help some, the optimization being done in dimension reduction is finding some linear or non-linear relation $M$ that minimizes (or maximizes) a cost function $\phi (x)$ on some manipulation of the data matrix, call it $X_{manipulated}$. Examples of various functions will be given in detail later.

In most cases this can be turned into an eigenproblem posed as: $$X_{manipulated} M = \lambda M$$

Solving this equation using some algorithm like Singular Value Decomposition or Symmetric Eigenvalue Decomposition will provide a set of $m$ linearly independent eigenvectors that act as a basis for a lower dimensional space. (Linear independence means no vector in the set can be expressed as a linear combination of the others; a basis set has the property that any vector in a space can be written as a linear combination of vectors in the set.) The set of eigenvalues given by an eigenvalue decomposition is the ‘spectrum’ of the matrix. This spectrum will have what’s referred to as a ‘spectral gap’ after a certain number of eigenvalues, where the magnitude of the eigenvalues falls dramatically compared to the previous ones. The number of significant eigenvalues before this gap reflects the intrinsic dimension of a space.

In some cases, the manipulation is somewhat more complicated, and creates what is called a generalized eigenvalue problem. In these situations the problem posed is $$X_a M = \lambda X_b M$$ Where $X_a$ and $X_b$ are distinct but both still generated from some manipulation on the original data matrix X.

The methods discussed so far necessitate the use of convex cost functions for an optimization. From my professor Dr. Erin Pearse (thanks!):

The term convexity only make sense when discussing vector spaces, and in that case a subset U of a vector space is convex iff any convex combination of vectors in U is again in U. A convex combination is a linear combination where the coefficients are nonnegative and sum to 1.

Convex functions are similar but not entirely related. A convex function does not have any local optima that aren’t also global optima which means that if you’re at a maximum or minimum, you know it is global.

(I think there is a reason why people in optimization refer to surfaces as landscapes. An interesting surface may have many hills and valleys, and finding an optimal path is like a hiker trying to cross a mountain path blind — potentially problematic.)

Optimizing a convex function will always reach the same solution given some input parameters, but with a non-convex function the optimization may get stuck in a local optimum. This is why a method like t-SNE will converge to different results on different runs.
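A quick way to see this is plain gradient descent on two one-dimensional functions of my own choosing: a convex parabola and a non-convex double well. The step size and iteration count are arbitrary assumptions that happen to work for these functions.

```python
def gradient_descent(grad, x0, step=0.01, iters=5000):
    """Follow the negative gradient from x0 for a fixed number of steps."""
    x = x0
    for _ in range(iters):
        x -= step * grad(x)
    return x

def convex_grad(x):
    # Gradient of f(x) = (x - 2)^2, which has a single minimum at x = 2.
    return 2.0 * (x - 2.0)

def nonconvex_grad(x):
    # Gradient of f(x) = x^4 - 2x^2, which has minima at x = -1 and x = +1.
    return 4.0 * x ** 3 - 4.0 * x

convex_results = [gradient_descent(convex_grad, x0) for x0 in (-3.0, 3.0)]
nonconvex_results = [gradient_descent(nonconvex_grad, x0) for x0 in (-3.0, 3.0)]
```

Both convex runs end at the same point, while the non-convex runs end in different wells depending on where they started.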

Methods for dimension reduction will be either linear or non-linear mappings. In both cases, the original data matrix $X$ is embeddable in some manifold. A manifold is any space that is locally homeomorphic to $R^{d}$ for some dimension $d$; a surface is the case $d = 2$. We want these mappings to preserve the local structure of the manifold, while also possibly preserving the global structure. This depends on the task meant to be done with the reduced data. I think the notion of structure is left specifically vague in literature because it is just so damn weird (it is really hard to think about things in greater than 3 dimensions…)

A great example of data embeddable in a weird, albeit three dimensional, manifold is the Swiss roll (image borrowed from dinoj). The many different dimension reduction methods available will have disparate results when performed on this data. When restricted to paths along the manifold, red data will be far apart from black, but if a simple Euclidean distance is measured, the points might be considered close. A dimension reduction that uses simple Euclidean distance between points to resolve structure will fail miserably to eke out the Swiss roll embedding.

When looking to investigate the lower dimensional space created by a dimension reduction, linear mappings have an explicit projection provided by the matrix formed by the eigenvectors. Non-linear methods do not have such an explicit relationship. Finding physical meaning from the order parameters given by a non-linear technique is an active area of research.

It might be too little detail for some, but the rest of this post will be focused on providing a quick explanation of various dimension reduction techniques. The general format will be:

• optimization problem posed
• formal eigenvalue problem given
• interesting insights and relations
• pictures that I like from other work

Multidimensional Scaling (MDS), Classical Scaling, PCA

• PCA cost function: Maximizes $Trace(M^{T}cov(X)M)$
• PCA eigenvalue problem: $cov(X) M = \lambda M$, where $M$ is the linear mapping that maximizes the variance retained in the projected data
• Quote from a Cecilia Clementi paper on diffusion maps where she mentions PCA: ‘Essentially, PCA computes a hyperplane that passes through the data points as best as possible in a least-squares sense. The principal components are the tangent vectors that describe this hyperplane’

• Classical scaling relies on the number of datapoints not the dimensionality.

• Classical scaling cost function: Minimizes $$\phi ( Y ) = \sum_{ij} \left( d_{ij}^{2} - || y_{i} - y_{j} ||^{2} \right)$$ this is referred to as a strain cost function.
• Other MDS methods can use stress or squared stress cost functions
• Classical scaling gives the exact same solution as PCA
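The last bullet can be checked numerically. The sketch below, on random data of my own making, does PCA through an eigendecomposition of the covariance matrix and classical scaling through an eigendecomposition of the double-centred matrix of squared Euclidean distances. The two embeddings should then agree up to the sign of each axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 samples with 3 features; most of the variance lies along one direction.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])
Xc = X - X.mean(axis=0)

# PCA: eigenvectors of the covariance matrix, largest eigenvalues first.
cov = np.cov(Xc, rowvar=False)
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1]
M = evecs[:, order[:2]]              # linear mapping onto 2 components
Y_pca = Xc @ M

# Classical scaling: eigenvectors of the double-centred Gram matrix
# built from squared Euclidean distances.
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J
b_evals, b_evecs = np.linalg.eigh(B)
top = np.argsort(b_evals)[::-1][:2]
Y_mds = b_evecs[:, top] * np.sqrt(b_evals[top])
```

Note the size difference: PCA diagonalises a 3-by-3 matrix (features), whereas classical scaling diagonalises a 200-by-200 one (datapoints), which is the point of the bullet about the number of datapoints above.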

Isomap

• Geodesic distances are computed by constructing a nearest-neighbor graph and using Dijkstra’s algorithm to find the shortest paths. Erroneous connections can be made by improperly connecting neighbors.
• Can fail if manifold has holes.
• Demonstration of failure of PCA versus success of Isomap
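The geodesic step can be sketched in pure Python (3.8+ for `math.dist`): build a k-nearest-neighbour graph, then run Dijkstra's algorithm over it. The data here are points along three quarters of a circle, a stand-in I chose for the Swiss roll, and the helper names are hypothetical. This is only the distance computation, not a full Isomap.

```python
import heapq
import math

def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph as {node: {neighbour: distance}}."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph

def dijkstra(graph, source):
    """Shortest-path distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# The two ends of the arc are close in Euclidean distance but far apart
# when travelling along the curve.
angles = [1.5 * math.pi * i / 30 for i in range(31)]
points = [(math.cos(a), math.sin(a)) for a in angles]

geodesic = dijkstra(knn_graph(points, k=2), 0)[30]
euclidean = math.dist(points[0], points[30])
```

With a larger `k`, the two ends of the arc would eventually become graph neighbours and the geodesic estimate would collapse to the Euclidean one, which is the ‘erroneous connection’ failure mentioned above.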

Kernel PCA

• Does PCA on a kernel function, retains large pairwise distances even though they are measured in the feature space

Diffusion Maps

• The key idea behind the diffusion distance is that it is based on integrating over all paths through the graph.
• Isomap will possibly short circuit, but the averaging of paths in diffusion maps will prevent this from happening, it is not one shortest distance but a collective of shortest distances.
• Pairs of datapoints with a high forward transition probability have a small diffusion distance
• Eigenvalue problem: $P^{(t)} v = \lambda v$, where $P$ is a diffusion matrix reflecting all possible pairwise diffusion distances between two samples
• Diagonalization means that we can solve the equation for t=1 and then exponentiate eigenvalues to find time solutions for longer diffusion distances
• Because the graph is fully connected, the largest eigenvalue is trivial
• The same revelation also stems from the fact that the process is markovian, that is the step at time t only depends on the step at time t-1, it forms a markov chain.
• Molecular dynamics processes are certainly markovian, protein folding can be modeled as a diffusion process with RMSD as a metric

Locally Linear Embedding:

• LLE describes the local properties of the manifold around a datapoint $x_i$ by writing the datapoint as a linear combination $w_i$ (the so-called reconstruction weights) of its $k$ nearest neighbors $x_{i_j}$.
• It solves a generalized eigenvalue problem, preserves local structure.
• Invariant to local scale, rotation, translations
• Cool picture demonstrating power of LLE:

• Fails when the manifold has holes
• In addition, LLE tends to collapse large portions of the data very close together in the low-dimensional space, because the covariance constraint on the solution is too simple

Laplacian Eigenmaps:

• Laplacian Eigenmaps compute a low-dimensional representation of the data in which the distances between a datapoint and its k nearest neighbors are minimized.
• The ideas studied here are a part of spectral graph theory
• The computation of the degree matrix M and the graph Laplacian L of the graph W allows for formulating the minimization problem defined above as an eigenproblem.
• Generalized Eigenproblem: $Lv = \lambda Mv$
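For a small worked case, here is a sketch that builds the degree matrix and graph Laplacian for a made-up 4-node path graph and solves the generalized eigenproblem. To stay within NumPy I symmetrise the problem by substituting $u = M^{1/2} v$; this is one standard route, not necessarily what any particular library does.

```python
import numpy as np

# Weighted adjacency matrix W of a hypothetical 4-node path graph.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
deg = W.sum(axis=1)
M = np.diag(deg)          # degree matrix
L = M - W                 # graph Laplacian

# Solve L v = lambda M v: with u = M^{1/2} v this becomes the symmetric
# problem (M^{-1/2} L M^{-1/2}) u = lambda u.
M_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
lam, U = np.linalg.eigh(M_inv_sqrt @ L @ M_inv_sqrt)
V = M_inv_sqrt @ U        # generalized eigenvectors, ascending eigenvalue

# Skip the trivial constant eigenvector (lambda = 0); the next one gives
# a one-dimensional embedding of the nodes.
embedding = V[:, 1]
```

The embedding coordinate runs monotonically from one end of the path to the other, so neighbouring nodes stay close together in the low-dimensional representation.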

Hessian LLE:

• Minimizes curviness of the high-dimensional manifold when embedding it into a low dimensional data representation that is locally isometric
• What is a Hessian? Hessian LLE uses a local Hessian at every point to describe curviness.
• Hessian LLE shares many characteristics with Laplacian Eigenmaps: It replaces the manifold Laplacian by the manifold Hessian. ‘As a result, Hessian LLE suffers from many of the same weaknesses as Laplacian Eigenmaps and LLE.’

Local Tangent Space Analysis:

• LTSA simultaneously searches for the coordinates of the low-dimensional data representations, and for the linear mappings of the low-dimensional datapoints to the local tangent space of the high-dimensional data.
• Involves applying PCA on k neighbors of x before finding local tangent space

Sammon mapping

• Adapts classical scaling by weighting the contribution of each pair $(i, j)$ to the cost function by the inverse of their pairwise distance in the high-dimensional space $d_{ij}$

Multilayer Autoencoder

• Uses a feed forward neural network that has a hidden layer with a small number of neurons such that the neural network is forced to learn a lower dimensional structure
• This is identical to PCA if using a linear activation function! What undiscovered algorithms will be replicated by neural nets? Will neural nets actually hurt scientific discovery?

Alright, so that’s all the gas that is in my tank for this post. Hopefully you’ve come away understanding something a little bit better than before. In my next post, I am going to focus on diffusion maps as they pertain to molecular dynamics simulations. Diffusion maps are really cool in that they really are an analogue of the physical nature of complex molecular systems.

June 09, 2016

Things I wished I had known when I got serious about programming

This is my first post about actual programming! When I started doing computational research about three years ago, I was lazy and borderline incompetent. Today, the tools I have learned allow me to be equally lazy while being somewhat more competent. These range from simple lifestyle decisions to basic tech skills.

Tip 1: Install Linux

Do yourself a favor and install Linux. My first install of Linux was done on my laptop, and it turns out that the wifi was broken until I installed the new drivers. Figuring out why things broke was a bit of a slog, but I eventually stumbled upon a great demonstration of the beauty of open source software. Incredibly, this Realtek employee wrote drivers for Linux on his own time! Installing these drivers was a bit of a rabbit hole, but I firmly believe that system administration builds character. Also, for what it’s worth, the install on my desktop was painless.

Tip 2: After installing linux, learn your shell commands

Navigating the command line


pwd # shows where you are in the filesystem
cd # changes directory to specified path
ls # shows existing files in path
cp # copy specified file to new filename
grep # a tool to search for regular expressions or patterns in files/directories
mkdir # creates a directory with a specified name in current path
chmod # changes permissions for a file
sudo # given as a prefix to a shell command, runs it with administrator
# privileges (if you have them); this allows installation into system directories
rm # deletes stuff
which (command here) # shows the path to the binary executable of a command


Internalizing the numerous tools at one’s disposal takes some time. Part of me thinks that I have learned command line tools simply because it makes me feel like Deckard.

The workflow speed up can be tremendously useful. When it comes to tools like git, the Github Graphic User Interfaces (GUIs) available to Windows are awful in comparison and it becomes a necessity.


# change to website directory
# for me this is:
cd github/jdetle.github.io
pwd # press enter
# i am in /home/jdetlefs/github/jdetle.github.io
# trying to copy to /home/jdetlefs/github/jdetle.github.io/images
# when typing a path ‘~’ is a placeholder for ‘/home/username’
# YU is a unique identifier for this file, pressing tab twice will list
# all files with these characters as an identifier (DONT PRESS
# ENTER YET)
(continued) images/‘filename’.gif # press enter
# check that it exists in the appropriate path
ls images
# ‘filename.gif’ should exist in this path!


Initially this process may take longer than dragging and dropping files, but it quickly becomes far faster than using a GUI.

How I add a newly installed program to my $PATH


echo $PATH
export PATH=$PATH:/path/to/my/program
echo $PATH


How I work in virtualenvs (kind of complicated)



From here, things can get complicated if you have an installation of Anaconda. Both pip and anaconda are package managers, but when installing itself, anaconda installs pip in its own directory and handles virtual environments on its own. Things can break when using both conda and pip because managing dependencies is ugly and awful and left for people with higher pain tolerances than me. Of course, one can use conda virtualenvs; the differences probably aren’t too significant. This is actually still a problem on my laptop, so I am going to spend this time installing pip without conda and getting virtualenvs to work on my laptop! (6 and a half hours later.) … Okay so this isn’t pretty. If anaconda3 is installed, it looks like virtualenvwrapper won’t work because virtualenvwrapper only works using python2.7? (Don’t hold me to this). My solution was to delete anaconda3 altogether. Oftentimes I’ve learned that the brute force solution works pretty well. (Somewhere in the distance Peter Wang feels a disturbance in the force.)

rm -rf anaconda3 # CAREFUL

Be careful with this!!! It recursively deletes this directory and all files in it and can ruin your OS install.

sudo apt-get install python-pip python-dev python-cython python-numpy g++
sudo pip install virtualenvwrapper
vi ~/.bashrc


Crap! Another thing I have to explain! vi opens Vim, a text editor that keeps things fast and simple. A fresh installation of Ubuntu 14.04 will come with Vi Improved, or vim, which is a superset of vi (a relic of days of yore) but has no preinstalled functionality. To install a working version of Vim that allows for syntax highlighting and easy workflow tools, do the following.


sudo apt-get update # cant hurt
sudo apt-get install vim
# ctrl-shift-n to open a new terminal window
# vi ~/.vimrc
# press i to enter insert mode and start typing
# copy and paste this with a mouse
set expandtab
set tabstop=4
set softtabstop=4
set shiftwidth=3
set autoindent
set textwidth=80
set nocompatible
set backspace=2
set smartindent
set number
set cindent
" preinstalled color schemes live in /usr/share/vim/vim74/colors
colo torte
syntax on
# press Esc, then type :wq and press enter to write and quit
# :x does the same as :wq


Here is some more info on Vim. Let’s get back to editing our bash shell configuration file (~/.bashrc)


vi ~/.bashrc
# before doing anything add the path
# press i to enter insert mode; line 1 should be
PATH=~/bin:$PATH
# line 2
export PATH
# line 3
source /usr/local/bin/virtualenvwrapper.sh
# press Esc, type :x and press enter, then (ctrl+shift+n) to open a new terminal


And there you go! You should have a working installation of the virtualenvwrapper such that you are ready to use virtual environments when making your first pull request on your new Linux system!

Using pip virtualenvs to work on github projects

Let’s make a pull request for MDAnalysis using the tools we’ve learned!


# I like having a github/ folder for my various repositories
# First, let’s clone into the repo
cd (press enter) # takes us to home user directory
mkdir github
cd github


Before moving further, you should create a github account if you haven’t already and fork MDAnalysis. This will create a clone of the repo that will function as your ‘origin’ repository. MDAnalysis will be the ‘upstream’ repository that we set up later.


git clone https://github.com/YOURUSERNAME/mdanalysis
# this takes a little bit (289 megabytes)
mkvirtualenv MDA
workon MDA
pip install numpy cython seaborn # installs dependencies
pip install -e . # installs MDAnalysis such that changing files
# changes how packages behave when loaded for a script


From here we can start working on establishing a git workflow using branches.


git remote add upstream http://github.com/MDAnalysis/mdanalysis
git branch NEW_PULL_REQUEST
git fetch upstream # checks for updates
git checkout upstream/develop -B develop # creates develop
# branch to rebase against later and switches to it
# there might be a way to do this without checking the branch out
# but I dont know how
git checkout NEW_PULL_REQUEST
# do work on this branch


Any time you want to save the work you’ve done, you can see the files you’ve changed with


git status


Then add them to be staged for a commit that will be merged into the upstream develop branch if the pull request is accepted.


    git add file_name_here
    # once you’ve added everything you want to include in a PR
    git commit -m 'Insert a descriptive commit message here'

If you want to make a tiny commit and blend it into a previous commit:

    git rebase -i HEAD~(# of commits back you want to go)

Use vim style interactiveness to rebase commits. Changing ‘pick’ to ‘fixup’ ‘squashes’ a commit into the first pick commit above it without keeping its commit message. Using squash will combine the commit messages. When happy, (shift+colon) (ctrl+x) and pressing Y and enter will combine commits. If still unsatisfied you can amend the commit manually.

    git commit --amend # edit the commit

When you’re ready to save your work to the origin repository:

    git fetch upstream
    git checkout develop
    # if prompted:
    git pull # updates changes made
    # if your command prompt makes a recursive merge, you’ve done something wrong
    git checkout NEW_PULL_REQUEST
    git rebase develop # rebase against develop to avoid merge conflicts
    git push origin NEW_PULL_REQUEST

Before actually making a pull request on github, make sure you didn’t break any tests and that you’ve written new tests for the new code you’ve written.

    cd ~/github/mdanalysis/testsuite/
    pip install -e .
    cd MDAnalysisTests/
    ./mda_nosetests # (press enter)

Hopefully that helps! There is a bevy of more rigorous work that’s been done on understanding git branching. A successful Git branching model is very helpful, and reading the github help is useful too. Atom is a very nice editor with its github integration and hackability. I like to use Jupyter as a script playground for MDAnalysis.

Tip 3: Get good at googling

This tip from freeCodeCamp is applicable to any problem. Read-search-ask is a strategy that will help you learn independently and boost confidence. Adding on to this advice, I have found that if you find the email of someone knowledgeable in the area you are struggling in, simply by writing an email explaining your problem you can often find the solution on your own.
If you don’t figure it out, then you might just impress that person with your detailed investigation. Even if they aren’t impressed, they’ll likely help you out. People in open source are generally receptive to people who demonstrate that they are working hard at becoming self-reliant. Always err on the side of not sending that email though; nobody likes being harassed with trivial questions.

Tip 4: When working, avoid distractions, double check, triple check, quadruple check…

When working on projects involving non-commercial software it is especially important to think of all the possible ways you could have screwed something up. Check your code for glaring logic errors and, before running an intensive calculation, run a baseline to ensure that things work. In quantum chemistry, an example of this would be running a Hartree-Fock calculation with the STO-3G basis set before doing something that scales much worse. Develop scripts to ensure you are getting expected results, and become skilled at using grep and simple regex. (Regexr is a great playground to learn regex.) Assume that you’ve written bad code and that bugs will be caused by small changes to input parameters. Expect things to break easily. Inspect all work exhaustively.

When reading academic papers, print them out and read them away from a PC. Academic papers usually use wildly esoteric jargon. This paper on diffusion maps (the subject of my next blog post) actually features a ‘jargon box’, which is just great. Academic papers usually also assume a high level of familiarity with the subject material and are written for those who are skilled at reading papers. It is easier to dedicate the intense concentration required for most papers when unplugging from tech and using some ear plugs.

Finally, when communicating over email you can embrace one of two strategies.
Either you can add a ‘sent from my iPhone’ tag to everything, or, before adding recipients, take a second to go get a drink of water, come back, and reread the message for errors. Unfortunately, people will judge you for poor grammar even if they don’t mean to. (Shoot, I just ended a sentence in a preposition…)

Tip 5: Tackle what intimidates you

I seriously believe that this is the number one part of becoming an adult and it is something I have only really internalized in the last year. Problems will not go away by avoiding them. Oftentimes I find myself building up things in my head as if they will be a bigger deal than they actually are. Figuring out how to use virtualenvs was one example of such a barrier that occurred recently. This occurs in my personal life as well, and the outcome is invariably better than I imagined it would be.

Having trouble getting started on a project? Unfortunately Shia isn’t much help here. Segment your work into discrete chunks. If you have a pull request you want to make, think of all the possible minutiae you have to work through in order to get things done. I like to use Google Inbox’s reminder feature to constantly remind myself of the things I need to get done. When I finish a task, I can swipe it off my todo list and enjoy that feeling of catharsis.

If you are a budding programmer, take an algorithms class for free here. If you still aren’t busy enough, read the MDAnalysis Guide for Developers and start learning with help from a tight-knit community of open source contributors.

June 05, 2016

GSoC week 2 roundup

@cfelton wrote:

This week was another impressive week (for the most part). There has been lots of activity and code/HDL writing. For each package and repository being developed, each student should set up their repos with the various Python integration tools, specifically: travis-ci, landscape.io (linting), coveralls (code test coverage), and readthedocs. Each of these badges should be added to your repository.
I will update the best practices with some additional information. The goals will be to have user documentation, health > 90%, and converge on 100% coverage.

Student week1 summary:

* @mkatsimpris, last blog update 04-Jun, commits >5, PRs Y
* @Vikram9866, last blog update 03-Jun, commits >5, PRs Y
* @meetsha1995, last blog update 03-Jun, commits >5, PRs N
* @srivatsan-ramesh, last blog update 03-Jun, commits >5, PRs Y
* @ravijain056, last blog update 02-Jun, commits >5, PRs N
* @formulator, last blog update 23-Mar, commits 0, PRs N

Student and mentors tags: @mkatsimpris, @meetshah1995, @Ravi_Jain, @sriramesh4, @vikram, @jck, @josyb, @hgomersall, @martin, @guy.eschemann, @eldon.nelson, @nikolaos.kavvadias, @tdillon

Links to the student blogs and repositories:

* Merkourious, @mkatsimpris: gsoc blog, github repo
* Vikram, @Vikram9866: gsoc blog, github repo
* Meet, @meetshah1995: gsoc blog, github repo
* Srivatsan, @srivatsan-ramesh: gsoc blog, github repo
* Ravi, @ravijain056: gsoc blog, github repo
* Pranjal, @forumulator: gsoc blog, github repo

May 30, 2016

GSoC week 1 roundup

@cfelton wrote:

The students were thrown fast and heavy into the world of open-source development in their first week. Recently MyHDL had a significant enhancement, MEP114, added to the 1.0 development branch. It was decided that the students would build their projects using the latest, including the @myhdl.block decorator and the accompanying methods, instead of the previous factory functions. It makes sense that the GSoC projects be based on the latest. We want a collection of high-quality examples that demonstrate the MyHDL capabilities. We don’t want the examples out of date shortly after they are finished. Major changes like MEP114 don’t occur often but this change coincided with GSoC this year.
Utilizing the latest is important: the students will be testing out the latest and greatest, providing feedback and improvements, while working with features that are the future of MyHDL.

There were a couple of opportunities to provide feedback to the MyHDL base. @srivatsan-ramesh uncovered a possible name collision bug in the latest MEP114 implementation and submitted a test that exposes the problem. In addition, @mkatsimpris discovered a bug in the latest initial value changes; he also created a test (needs to be submitted to the myhdl repo). This test (currently in his local repo) should be added to the MyHDL test suite. It needs to be decided where these tests should be added: to the existing myhdl/test/conversion/general tests or to myhdl/test/bugs.

@mkatsimpris, @Vikram, @sriramesh4, @meetshah1995 have all been using the development flow outlined in the best practices: pushing code to their development branches, creating PRs to their masters (for code review) and/or PRs to the main repo that they are working from, and interacting in the community. @Ravi_Jain has made inquiries on the communications channels. I haven't heard from the others or seen any activity in the community or in their github repos or blogs this week. Those who haven't updated their repos or blogs should leave a brief summary of the work completed this last week.

This is a good start to the program, let's keep it going. If you did not have reasonable effort this week, make sure to provide an explanation to your mentors and me promptly.

Student week1 summary:

* @mkatsimpris, last blog update 28-May, commits >5
* @Vikram9866, last blog update 9-May, commits >5
* @meetsha1995, last blog update 27-May, commits >5
* @srivatsan-ramesh, last blog update 24-May, commits 3
* @formulator, last blog update 23-Mar, commits 0
* @ravijain056, last blog update 22-May, commits 0

Student and mentors tags: @mkatsimpris, @meetshah1995, @Ravi_Jain, @sriramesh4, @vikram, @jck, @josyb, @hgomersall, @martin, @guy.eschemann, @eldon.nelson, @nikolaos.kavvadias, @tdillon

Links to the student blogs and repositories:

* @mkatsimpris, gsoc blog, github repo
* @Vikram9866, gsoc blog, github repo
* @meetshah1995, gsoc blog, github repo
* @srivatsan-ramesh, gsoc blog, github repo
* @forumulator, gsoc blog, github repo
* @ravijain056, gsoc blog, github repo

May 11, 2016

Hello World

Hello world! I recently was given the amazing opportunity to contribute to MDAnalysis, an open source Molecular Dynamics simulation Analysis project through the Google Summer of Code initiative. I’ve been encouraged to maintain a blog by those giving me this opportunity so I’ll start things off by explaining how I got this great summer job.

To summarize it quickly, Google sponsors a program in which college students apply to work on projects for open source software organizations. I was very lucky to have my research advisor, Dr. Ashley Ringer McDonald, encourage me to apply. I satisfied the first application requirement by learning how to use git and closing an issue on the MDAnalysis Github page. After that, I spent about 40 hours of concentrated effort over spring break studying dimensionality reduction and molecular dynamics in order to write a coherent application. By turning in a rough draft early I ensured the process was iterative; the contributors to MDAnalysis were very helpful with their critiques of my application. After turning in my final application, the process didn’t really stop.
I made sure to keep making pull requests and to learn more about development workflow. I have learned so much about workflow and how to get over the dread of starting a pull request in the past few months.

And then I got the news! I had been accepted to the Google Summer of Code! I was and still am extremely excited. With that being said, success made me lazy and somewhat complacent. Recently, I have been doing the bare minimum in terms of work and that is about to change. Even if no one is reading this, consider this blog as the first step in accountability for the rest of the summer. I will be using this to keep a record of everything I am working on day to day. With the exception of this introduction post, every post will attempt to keep a focus on a particular issue. I might write a post about a topic related to my Summer of Code work, or something related to my many other interests. I endeavor to remain positive and thoughtful, I will work on my clear overuse of commas, and I will try to make my readers laugh. I look forward to keeping this up!

JD out.

May 31, 2014

Terri Oda (PSF Org admin)

You can leave academia, but you can't get the academic spam out of your inbox

When I used to do research on spam, I wound up spending a lot of time listening to people's little pet theories. One that came up plenty was "oh, I just never post my email address on the internet" which is fine enough as a strategy depending on what you do, but is rather infeasible for academics who want to publish, as custom says we've got to put our email addresses on the paper. This leads to a lot of really awesome contacts with other researchers around the world, but sometimes it leads to stuff like the email I got today:

Dear Terri,

As stated by the Carleton University's electronic repository, you authored the work entitled "Simple Security Policy for the Web" in the framework of your postgraduate degree.
We are currently planning publications in this subject field, and we would be glad to know whether you would be interested in publishing the above mentioned work with us.

LAP LAMBERT Academic Publishing is a member of an international publishing group, which has almost 10 years of experience in the publication of high-quality research works from well-known institutions across the globe.

Besides producing printed scientific books, we also market them actively through more than 80,000 booksellers.

Kindly confirm your interest in receiving more detailed information in this respect.

I am looking forward to hearing from you.

Best regards,
Sarah Lynch
Acquisition Editor

LAP LAMBERT Academic Publishing is a trademark of OmniScriptum GmbH & Co. KG
Heinrich-Böcking-Str. 6-8, 66121, Saarbrücken, Germany
s.lynch(at)lap-publishing.com / www. lap-publishing .com
Handelsregister Amtsgericht Saarbrücken HRA 10356
Identification Number (Verkehrsnummer): 13955
Partner with unlimited liability: VDM Management GmbH
Handelsregister Amtsgericht Saarbrücken HRB 18918
Managing director: Thorsten Ohm (CEO)

Well, I guess it's better than the many misspelled emails I get offering to let me buy a degree (I am *so* not the target audience for that, thanks), and at least it's not incredibly crappy conference spam. In fact, I'd never heard of this before, so I did a bit of searching.

Let's just post a few of the summaries from that search:

From wikipedia:

The Australian Higher Education Research Data Collection (HERDC) explicitly excludes the books by VDM Verlag and Lambert Academic Publishing from ...

From the well-titled Lambert Academic Publishing (or How Not to Publish Your Thesis):

Lambert Academic Publishing (LAP) is an imprint of Verlag Dr Muller (VDM), a publisher infamous for selling cobbled-together "books" made ...

And most amusingly, the reason I've included the phrase "academic spam" in the title:

I was contacted today by a representative of Lambert Academic Publishing requesting that I change the title of my blog post "Academic Spam", ...

So yeah, no. My thesis is already published, thanks, and Simple Security Policy for the Web is freely available on the web for probably obvious reasons. I never did convert the darned thing to html, though, which is mildly unfortunate in context!

PlanetPlanet vs iPython Notebook [RESOLVED: see below]

Short version: I'd like some help figuring out why RSS feeds that include iPython notebook contents (or more specifically, the CSS from iPython notebooks) are showing up as really messed up in the PlanetPlanet blog aggregator. See the Python summer of code aggregator and search for a MNE-Python post to see an example of what's going wrong.

Bigger context:

One of the things we ask of Python's Google Summer of Code students is regular blog posts. This is a way of encouraging them to be public about their discoveries and share their process and thoughts with the wider Python community. It's also very helpful to me as an org admin, since it makes it easier for me to share and promote the students' work. It also helps me keep track of everyone's projects without burning myself out trying to keep up with a huge number of mailing lists for each "sub-org" under the Python umbrella. Python sponsors students not only to work on the language itself, but also to work on projects that make heavy use of Python. In 2014, we have around 20 sub-orgs, so that's a lot of mailing lists!

One of the tools I use is PlanetPlanet, software often used for making free software "planets" or blog aggregators. It's easy to use and run, and while it's old, it doesn't require me to install and run an entire larger framework which I would then have to keep up to date. It's basically making a static page using a shell script run by a cron job.
From a security perspective, all I have to worry about is that my students will post something terrible that then gets aggregated, but I'd have to worry about that no matter what blogroll software I used.

But for some reason, this year we've had some problems with some feeds, and it *looks* like the problem is specifically that PlanetPlanet can't handle iPython notebook formatted stuff in a blog post. This is pretty awkward, as iPython notebook is an awesome tool that I think we should be encouraging students to use for experimenting in Python, and it really irks me that it's not working. It looks like Chrome and Firefox parse the feed reasonably, which makes me think that somehow PlanetPlanet is the thing that's losing a <style> tag somewhere. The blogs in question seem to be on blogger, so it's also possible that it's google that's munging the stylesheet in a way that planetplanet doesn't parse.

I don't suppose this bug sounds familiar to anyone? I did some quick googling, but unfortunately the terms are all sufficiently popular when used together that I didn't find any reference to this bug. I was hoping for a quick fix from someone else, but I don't mind hacking PlanetPlanet myself if that's what it takes.

Anyone got a suggestion of where to start on a fix?

Edit: Just because I saw someone linking this on twitter, I'll update in the main post: tried Mary's suggestion of Planet Venus (see comments below) out on Monday and it seems to have done the trick, so hurrah!

April 26, 2014

Terri Oda (PSF Org admin)

Mailman 3.0 Suite Beta!

I'm happy to say that...

Mailman 3.0 suite is now in beta!

As many of you know, Mailman's been my open source project of choice for a good many years. It's the most popular open source mailing list manager with millions of users worldwide, and it's been quietly undergoing a complete re-write and re-working for version 3.0 over the past few years.
I'm super excited to have it at the point where more people can really start trying it out. We've divided it into several pieces: the core, which sends the mails, the web interface that handles web-based subscriptions and settings, and the new web archiver, plus there's a set of scripts to bundle them all together. (Announcement post with all the links.)

While I've done more work on the web interface and a little on the core, I'm most excited for the world to see the archiver, which is a really huge and beautiful change from the older pipermail. The new archiver is called Hyperkitty, and it's a huge change for Mailman.

You can take a look at hyperkitty live on the fedora mailing list archives if you're curious! I'll bet it'll make you want your other open source lists to convert to Mailman 3 sooner rather than later. Plus, on top of being already cool, it's much easier to work with and extend than the old pipermail, so if you've always wanted to view your lists in some new and cool way, you can dust off your django skills and join the team!

Do remember that the suite is in beta, so there are still some bugs to fix and probably a few features to add, but we do know that people are running Mailman 3 live on some lists, so it's reasonably safe to use if you want to try it out on some smaller lists. In theory, it can co-exist with Mailman 2, but I admit I haven't tried that out yet. I will be trying it, though: I'm hoping to switch some of my own lists over soon, but probably not for a couple of weeks due to other life commitments.

So yeah, that's what I did at the PyCon sprints this year. Pretty cool, eh?

March 29, 2014

Terri Oda (PSF Org admin)

Sparkfun's Arduino Day Sale: looking for inspiration!

Sparkfun has a bunch of Arduinos on crazy sale today, and they're allowing backorders. It's a one day sale, ending just before midnight US mountain time, so you've still got time to buy your own! Those $3 minis are amazing.

I wound up buying the maximum amount I could, since I figure if I don't use them myself, they'll make nice presents. I have plans for two of the mini ones already, as part of one of my rainy day projects that's only a little past drawing board and into "let's practice arduino coding and reading sensor data" stage. But the rest are waiting for new plans!

I feel a teensy bit guilty about buying so many arduinos when I haven't even found a good use for the Raspberry Pi I got at PyCon last year. I did buy it a pretty rainbow case and a cable, but my original plan to use it as the brains for a homemade cnc machine got scuttled when John went and bought a nice handybot cnc router.

A pretty picture of the pibow rainbow raspberry pi case from this most excellent post about it. They're on sale today too if you order through pimoroni.

I've got a few arty projects with light that might be fun, but I kind of wanted to do something a bit more useful with it. Besides, I've got some arty blinky-light etextile projects that are going to happen first and by the time I'm done those I think I'll want something different.

And then there's the Galileo, which obviously is a big deal at work right now. One of the unexpected perks of my job is the maker community -- I've been hearing all about the cool things people have tried with their dev boards and seeing cool projects, and for a while we even had a biweekly meet-up going to chat with some of the local Hillsboro makers. I joined too late to get a chance at a board from the internal program, but I'll likely be picking one up on my own dime once I've figured out how I'm going to use it! (John already has one and the case he made for it came off the 3d printer this morning and I'm jealous!)

So... I'm looking for inspiration: what's the neatest arduino/raspberry pi/galileo/etc. project you've seen lately?

March 02, 2014

Google Summer of Code: What do I do next?

Python's in as a mentoring organization again this year, and I'm running the show again. Exciting and exhausting!

In an attempt to cut down on the student questions that go directly to me, I made a flow chart of "what to do next":

(there's also a more accessible version posted at the bottom of our ideas page)

I am amused to tell you all that it's already cut down significantly on the number of "what do I do next?" emails I've gotten as an org admin compared to this time last year. I'm not sure if it's because it's more eye-catching or better placed or what makes it more effective, since those instructions could be found in the section for students before. We'll see if its magical powers hold once the student application period opens, though!