Python's Summer of Code 2016 Updates

December 06, 2016

udiboy1209 (kivy)

How to select an organisation for GSoC?

Hello! It's been a long time since I posted something, because it has been a really long and hectic semester of college! It's over now, phew! My friend Chinmay, who has been asking me GSoC-related questions for a while now, asked me to describe how one goes about selecting an organisation to apply to for GSoC. So here I am. I hope you find this post useful.

What is a GSoC organisation?

GSoC is basically a platform for open source projects to attract students to contribute over the summer in return for some great work experience (and a good stipend :D). It encourages open source organisations to propose such projects and provide mentors. It is in the interest of organisations to propose and mentor projects so that they get enthusiastic people to contribute a summer of work, and it is in the interest of students to take up such projects for the awesomely fun experience of working on actively developed software and getting to interact with the open source community.

Examples of such organisations are Mozilla, the Python Software Foundation, KDE and many more. The organisations I mentioned are quite huge, with very large communities backing them and multiple projects in active development. There are a lot more organisations and communities which participate in GSoC every year (some big and some small, but just as great :D). You can see the entire list of organisations and the projects they mentored on the GSoC website. Previous years' data is just a Google search away (get used to googling and get better at it :P, it's really necessary for GSoC).

What should I look for in an organisation?

People think GSoC is all about coding and sitting in front of your laptop working through the summer, but that's not true. GSoC, like any other open source contribution, is about interacting with a community of like-minded people and contributing along with them to a common cause. The motivation to contribute can be anything from personally benefiting from it to wanting to thank the community because it helped you before. Or maybe you just want to explore something new!

Contributions can be anything, not just actual code. A lot of communities appreciate help with their documentation and with testing various issues and features. Maybe you found a small typo in a documentation page you were reading and felt like fixing it. Maybe you were that guy/girl and found some grammatical errors :P. You can go over to the community, tell them you found this mistake, and ask them to fix it. Or ask them to show you a way to fix it yourself. Yes, you could definitely fix such mistakes yourself! All open source communities have an easy-to-use way to make changes to their codebase/documentation. Most use git, which is really easy once you start using it. So what if it's just a small typo? It will be fixed because of you, your tiny little enthusiasm to right a wrong. The community will surely appreciate that effort, no matter how tiny it is. It may be scrutinized and checked so that it doesn't cause any trouble, but it will be appreciated.

It's like volunteering to decorate the neighbourhood for Diwali. So what if you just hang one light bulb, or just test whether all the lights work? Your neighbours will surely appreciate that effort, no matter how tiny. And in the end you will enjoy the festival and the lights along with the rest of the community, knowing that one light of the many is there because you hung it!

More than the project, more than the work you might want to do, you should focus on the community of the organisation you choose. I have seen people pick organisations because they already know one or two people in the community. Or they have interacted with the community before for various reasons (almost always personal).

I popped my cherry in open source with a python-based music library manager called beets. I was using that software to organise my music and found a problem in it which I thought could be fixed. So after a bit of googling I found their GitHub repository and posted an issue there, and soon got a reply from the creator saying that I could fix it myself if I was willing! And I did. It was a pretty small change in the code, but it is still there, and it is there because of me.

Let's get specific to GSoC

All that banter was to give you a feel for the open source mindset. That, I feel, is important for GSoC, but you need some more info to proceed. Now let me tell you a bit about how to choose an organisation for GSoC.

It helps if you have in mind what software/languages/fields you want to work with. And this shouldn't be restricted to what you currently know. The timeline of your project should (and will) include time for learning all the new things required for it. Learning and exploring something new is part of the fun! And remember…

Well, you may not be a wizard at Hogwarts but this is true. You will have mentors in GSoC who are proficient in the area of your project and you can ask them to teach you all the new stuff that is required. All you need is enthusiasm and the will to learn.

When I decided I wanted to do GSoC, I was doing a lot of Android projects and apps, so it might have been easier for me to work on some Android-based project. But I was interested in learning python, a language I had only tried out a bit, so I opted for a python org instead. I started contributing to Kivy and interacting with the community. Kivy has a game engine called KivEnt, and there was a project idea proposed by the Kivy community to work on KivEnt over the summer under GSoC. Coincidentally, I am very interested in game development, and this project idea really appealed to me. So that is what I decided to take up as my project.

The problem, though, was that KivEnt isn't coded in pure python: for performance reasons it uses Cython, a Python-like language that compiles to C extensions. Cython is something really different to someone used to python, and especially to someone with very little C experience (me :P). I told a few people about this, and they assured me that they would help me learn Cython while I worked on the project, and that it would be quite possible to keep up with the project's deadlines. And I did successfully learn Cython through the summer while also finishing my project on time! I learned two things from that: one, nothing is too tough to learn; and two, people in open source will help you with anything if you ask nicely and show the will to learn.

Note: while some things like languages and frameworks are easy to learn, some require a little more effort, especially topics involving CS or maths theory like computer vision, machine learning, algorithms, etc. It helps to have some previous overview of such topics (like a college or online course) if you want to pick such a project/organisation.

So this is how I picked my organisation and project:

• Language I wanted to work in: Python. I chose Kivy, a NUI framework in Python.
• Field I wanted to work in: game development. I chose to work on KivEnt, Kivy's game engine.

It will really help if you define this for yourself. Then, when you go looking for orgs in previous org lists or ask people about what org to pick, you will know exactly what you are looking for.

I don't know what my interests are :P

That is a problem faced by a lot of people. There is always the option of trying out random stuff till you find something you like, but that is very time- and energy-consuming. What you could do instead is ask people you know who have contributed to open source before, previous GSoCers for example. You could ask them about their experience of the community and the stuff they worked on, and take suggestions from them.

For example, there are some python-based organisations which do medical image processing projects. These projects may not be development-intensive and may involve implementing research papers in those fields and testing multiple cases. You could find out more about such projects by contacting students from last year.

There are some things to consider other than just the project and the field you are working in. The contribution process, for example, is something you need to be comfortable with. Big organisations like Mozilla have a very active community of contributors who will help you, but they have strict rules when it comes to contribution. Your contribution isn't accepted until it's perfect according to the rules. The community very readily helps people clean up their code and patches so that they are acceptable under the rules, but some people would find such strictness pointless or even frustrating. Smaller organisations, on the other hand, are more lax. It helps to know about such experiences beforehand.

All you need to remember is that whatever these people tell you will be their own experience. It is possible that you experience something entirely different. You will only find out what your interests are when you try out something on your own. All you will get from asking people is some suggestions on what to explore. In the end, it is you who defines your interests.

How do I find out more about the community?

You start interacting with the people of the community. This isn't someone you're trying to woo; you don't need an icebreaker. Nobody in open source forums likes small talk, especially from newbies. If you want to start contributing, say that point blank. People will point you to easy bugs and issues to fix so you can get set up easily and have a good start. If you are having trouble with something, don't try being overly polite, don't ask to ask, don't beat around the bush trying to be discreet. Just ask, and you will be helped :D.

Once you start on some easy bugs, you will face problems getting the development environment set up, finding where the bug originated from, what file serves what purpose and what not. The key is seeking help whenever you are stuck. Nobody is grading you or keeping score about how well you have performed.

Will the organisation get selected in GSoC?

It is important to know whether the organisation has a chance of being selected in the final list of organisations for GSoC. You may like the community, and what they work on might align with your interests, but not all open source orgs apply for GSoC, and not all that apply get selected. There are some criteria Google follows to select orgs, too.

1. Only those organisations which are planning to apply for GSoC can get selected. Obviously :P. This is the first thing you should check. You can simply ask the community directly whether they are planning to apply; there's no reason for them to keep this a secret. (For example, beets doesn't apply for GSoC.)
2. Organisations which were selected last year, or for the past two or three years, are very likely to get selected again. You can look up the list of selected organisations on previous years' GSoC websites.
3. If the organisation is applying for the first time, it will very likely get selected if it has a good community and a reasonable number of long-time contributors. GSoC wants its students to have experience with well-established communities, not ones which are in the initial phases of their life.

TL;DR

I’ll outline the steps you should follow to pick an organisation:

1. Define what you are interested in learning and working on over the summer:

   1. What language: C++, Java, Python, Rust, JS, etc.
   2. What field: Android, iOS, game dev, web dev, computer vision, machine learning, etc.

   Ask previous GSoCers about their experiences and for suggestions on things to explore.

2. Ask/search for organisations which have a good probability of getting selected for GSoC (look for those which were previously selected) and which align with your interests.
3. Start interacting with the community by solving easy bugs and making minor contributions. You will get a feel for whether you like the community of that org or not.

Thanks to Kalpesh for the review and suggestions

October 29, 2016

srivatsan_r (MyHDL)

Good Sets

The ACM ICPC India Regionals Online Preliminary round ended last week. Our team, ‘Dashwood’ (named after the password used by Harold Finch in the TV series Person of Interest), was able to solve 5 of the 7 questions.

The 5 questions which we could solve were all relatively easy compared to the other two.

Out of the 5 questions we solved, one was very interesting, so here’s a blog post explaining how I did it.

It was basically a Dynamic Programming problem.

Here is the code in C++

The array inp[] contains 1 at the indices corresponding to the input numbers.

Now we iterate through the array inp, and whenever an index i contains a 1, we first update dp[i]:

dp[i] = (dp[i] + 1) % MOD;

Then we update the dp array at all the indices which are multiples of i and are present in the input (dp[i] here already includes the +1 from the previous step):

dp[j] = (dp[j] + dp[i]) % MOD;

where j loops over all the multiples of i.

and with each update of the dp array we increment the variable sum by the same amount.

Now after all the updates sum will contain the final answer.
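The update rule above can be turned into a short, runnable sketch. Here it is in Python rather than the original C++; the function name and the usual 10^9 + 7 modulus are my own assumptions, since the original listing and problem statement aren't reproduced here:

```python
# Python sketch of the DP described above. inp[] marks which numbers are
# present; dp[i] counts the good arrays ending in i.
MOD = 10**9 + 7

def count_good_sets(nums):
    n = max(nums)
    inp = [0] * (n + 1)
    for x in nums:
        inp[x] = 1
    dp = [0] * (n + 1)
    total = 0  # the running "sum" variable from the post
    for i in range(1, n + 1):
        if not inp[i]:
            continue
        dp[i] = (dp[i] + 1) % MOD         # count the single-element array {i}
        total = (total + 1) % MOD
        for j in range(2 * i, n + 1, i):  # multiples of i present in the input
            if inp[j]:
                dp[j] = (dp[j] + dp[i]) % MOD  # extend every array ending in i with j
                total = (total + dp[i]) % MOD
    return total

print(count_good_sets([2, 3, 6, 12]))  # 11
```

For {2, 3, 6, 12} this gives dp[2] = dp[3] = 1, dp[6] = 3 and dp[12] = 6, for a total of 11, consistent with the example worked through below.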

Why does this work?

For example let the given array be {2, 3, 6, 12}.

Now we take {12}, and we have to find the number of ways we can append new arrays to it such that the element next to 12 is a factor of 12.

So we can append 6, 3 or 2 to {12}, or not append anything. We can append not just the individual numbers but also the arrays which can be formed from those numbers.

So for {6} we can append 2 or 3 or not append anything.

{2, 6}, {3, 6}, {6}

So dp[6] contains 3

For 2 and 3 we cannot append any numbers.

{2}, {3}

So dp[2] = dp[3] = 1

So we can append {2, 6}, {3, 6}, {6}, {2}, {3} to {12}

So the number of arrays that can be appended to {12} is

dp[6] + dp[3] + dp[2]

which is 5 (so dp[12] = 6, counting {12} itself).

The final answer is the sum of the number of ways you can make arrays ending with 2, 3, 6 and 12.

The variable sum has the total sum of the dp[] array and hence the final answer.

October 28, 2016

srivatsan_r (MyHDL)

Analyze This!

American Express Campus Analyze This is a data analysis competition organized by American Express.

I had a great time participating in the contest.

This was my first attempt at any such data analytics competition. It was quite challenging, and I learnt a lot about machine learning and scikit-learn over the past few days.

We were given a data set which contained the details of many citizens who were going to vote in an election.

It was a supervised learning problem: given a citizen’s age, the party they voted for previously, the number of rallies attended, the number of donations made to each party, etc., along with the party they voted for in the current election, we had to predict which party a new set of citizens would vote for.

What we did

Since it was the first time I was trying something like this, I started by one-hot encoding all the string-valued features and scaling everything to the range (0, 1) using a MinMaxScaler. This increased the feature size very much, so I thought I could use Principal Component Analysis to extract the 200 best components, those with the most spread across the feature dimensions.

I further reduced the feature size by using LogisticRegression and picking features based on their importance weights.

The feature size was thus reduced to 73. I then used these features to train a Multi-Layer Perceptron model with 4 layers.

The Pipeline used for the whole process –

The whole model was developed in Python using scikit-learn module.
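Since the pipeline figure isn't shown here, below is an illustrative scikit-learn sketch of the steps described (scaling, PCA, model-based feature selection, MLP). The dataset, component counts and layer sizes are stand-ins, not the contest's actual values:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Stand-in data: 200 "citizens" with 20 already one-hot-encoded features
# and one of 3 "parties" as the label.
rng = np.random.RandomState(0)
X = rng.rand(200, 20)
y = rng.randint(0, 3, size=200)

pipe = Pipeline([
    ("scale", MinMaxScaler()),                          # scale features to (0, 1)
    ("pca", PCA(n_components=10)),                      # the post used 200 components
    ("select", SelectFromModel(LogisticRegression())),  # pick features by importance weight
    ("mlp", MLPClassifier(hidden_layer_sizes=(32, 16, 8),
                          max_iter=200, random_state=0)),
])
pipe.fit(X, y)
print(pipe.score(X, y))  # training accuracy on the stand-in data
```

The real model would be fit on the contest's voter features and evaluated on held-out citizens rather than the training set.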

The overall accuracy of the model came out to be 75.4% on the given dataset. We managed to get a rank of 80 on the leaderboard (though this was not the final result) out of 1500+ teams.

I guess it was a pretty decent score for a first timer.

The Github Repo containing the code and the trained model – here

I like movies.

My Top 10

1. Children of Men
2. Shawshank Redemption
3. Eternal Sunshine of the Spotless Mind
4. Good Will Hunting
5. Michael Clayton
6. An entire Pixar movie binge
7. Se7en (Couldn’t resist)
8. Young Frankenstein
9. Collateral
10. Training Day

noirs

1. Michael Clayton
2. The Usual Suspects
3. Collateral
4. The Hustler
5. The Color of Money
6. LA Confidential

westerns

1. No Country for Old Men
2. Django Unchained
3. The Last Samurai
4. 3:10 to Yuma (new version)
5. Unforgiven
6. Gran Torino

1. Shooter
2. John Wick

historical drama

1. Braveheart
3. Platoon
4. Apocalypse Now Redux
5. Apollo 13
6. The Social Network
7. A Beautiful Mind
8. John Grisham’s “The Rainmaker” (Fiction)
10. October Sky

thrillers/psychological/horror

1. Michael Clayton
2. The Machinist
3. The Exorcist
4. Manchurian Candidate (both)
5. The Silence of the Lambs
6. A lot of movies with Jake Gyllenhaal of all people, Prisoners, Zodiac, Donnie Darko, End of Watch, Nightcrawler, Rendition
7. The Departed
8. Runaway Jury

sci-fi

1. Interstellar
2. Looper
3. Primer
4. (Only the original) The Matrix

action

1. Everything Harrison Ford
2. Everything Tom Cruise
3. Everything Keanu Reeves (yes Bill and Ted are action flicks)
4. Everything Bruce Willis
5. Crouching Tiger Hidden Dragon

comedies

1. Young Frankenstein
3. Monty Python and the Holy Grail
5. Anchorman
6. Tropic Thunder
7. Ferris Bueller’s Day Off

October 02, 2016

John Detlefs (MDAnalysis)

My Five Year Plan

As a 23 year old college graduate I am often asked where I see myself in 5-10 years. I hate this question.

I’d say two thirds of my frustration when asked this type of question comes from the question being pointless, while the other third comes from some fear of not having a satisfactory answer.

If you don’t know my background, I recently graduated with both a Math and Chemistry major from the 3rd best public STEM college in the country. (This piece of bragging will come up later.) In July of 2015, in the midst of spending a summer working 3 different but equally shitty jobs I decided I wanted to become a programmer. After spending some time recovering from next-level burnout, I spent the next year working hard at teaching myself fundamentals of programming. Since then, I’ve been a Summer of Code Student, and have been slowly developing into someone with tangible experience in Open Source Software. (Shameless plug for my Github)

I have been experiencing a tremendous amount of stress in the past month which has in no way been mitigated by my credentials and achievements. I have been wrestling with the desire to become skilled without putting in the work, to get paid without having the experience, to talk about things I don’t necessarily know enough about. I want to be the programmers that I work with RIGHT NOW! Although programming and mathematical thinking align in many ways, there is a tremendous breadth to what I do not know.

But it turns out I’ve learned some things as I’ve struggled through hard Mathematics and Chemistry coursework. I’ve learned how to learn, and better yet, how to fail. Much like Andy Dufresne in “The Shawshank Redemption”, I’ve crawled through a river of shit and came out clean on the other side. I’ve dry-heaved over a toilet after failing an exam, I’ve spent countless weeks red-eyed. I’ve figured out that a healthy diet and exercise is a much better stress management tool than beer and pizza. I’ve learned just how much I can grasp in the course of five years of hard work.

A good friend of mine, James, once asked me the question, “John, what is the difference between coal and a diamond?”

The degrees I’ve earned have taught me how to survive the pressure of intense work. They have given me the ability to ask questions without fear of judgment from my peers. They have given me confidence that in just a few years time I can do some pretty great things!

So I don’t know where I’m going to be in a few years. It really isn’t on my list of concerns. What concerns me is what I’m going to achieve tonight, tomorrow, and next week. How I’m going to reconcile that 40 minutes of cardio, healthy eating and self-teaching with delivering on work-goals. To some degree, I consider answering the “five-year-plan” questions day-dreaming with a different name.

When I worry, I keep coming back to this speech.

I have to keep reminding myself of all the banal platitudes I’ve accrued over the years. I’m going to keep working on thinking about the right things. Talk to me in five years, if I stick to this routine, I’m certain that there are some great things in store.

If there is one thing I want to impart upon someone else, it is that you’re capable of great things. Just keep at it! -john

September 29, 2016

John Detlefs (MDAnalysis)

Organizing Collaboration Data as Streams

I had an idea for something I’d like to make/ see made.

Problem it tries to solve: Too many communication / data-provision platforms.

How it solves it: Allow easy curation and inspection of the communication that drives a project forward.

There are too many forms of communication, and not enough effective communication aggregators. People who create things communicate on online hosting services like GitHub, microblogging services like Twitter, email, chat services like Gitter and Slack, video chat; the list goes on. Aggregating this data in a digestible manner is valuable to anyone interested in the history of how decisions were made and projects were created.

Parties that this could be useful to:

• developers looking to document the history of a company’s various projects
• lawyers looking to accumulate the data associated with discovery
• academics keeping track of research
• law enforcement tracking evidence

My idea is that a web app integrates with the browser such that one can easily aggregate the data on these various sites by associating these events with a tag in time and a stream it should be associated with.

If you can’t read the picture, the idea is that all of these events are listed on the stream a user or set of users decided to associate them with. We could have templates for what users should use streams for; maybe a startup would have a main stream each for mobile, web, and desktop app development, but the idea is extensible elsewhere.

Clicking on one of these points shows a component that curates the communication for easy inspection. If a user wanted to add a Slack conversation, hopefully we could get Slack to support an API for creating a tag for that user’s personal interaction with Slack, and be able to clip and save it to that point in the stream. The same interaction would happen with Twitter, academic papers, saved emails, or any other form of viewable data.

Of course, maybe this already exists, at least in some version, with Evernote, but the difference is in the accessibility of the experience, and in tools to prevent cluttering of data at a zoomed-out inspection of the streams.

September 27, 2016

John Detlefs (MDAnalysis)

Here’s some random nonsense about my thinking that I want to share/refine while I’m still learning about Observables.

Consider the term Behavior to be an output from a program that is expected 1..N times. This output can have many different side effects but should occur in a sequential, predictable manner. Observables allow for the predictable creation of behavior, and it seems that in the creation of an observable each side effect should come from the result of a pure function.

The act of subscribing to an observable triggers behavior. If an observable consists entirely of pure functions, this preserves the benefits of pure functional programming.
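A toy sketch of that idea (in Python, not any particular reactive library): the observable is just a recipe of pure functions, and nothing actually runs until someone subscribes.

```python
class Observable:
    """Toy observable: a source plus a chain of pure functions."""

    def __init__(self, source):
        self.source = source  # zero-argument callable producing an iterable
        self.ops = []         # pure functions, applied in order

    def map(self, fn):
        self.ops.append(fn)
        return self

    def subscribe(self, on_next):
        # Behavior is triggered only here, at subscription time.
        for value in self.source():
            for fn in self.ops:
                value = fn(value)
            on_next(value)

seen = []
obs = Observable(lambda: [1, 2, 3]).map(lambda x: x * 2)
# Nothing has happened yet; the side effect occurs only on subscribe:
obs.subscribe(seen.append)
print(seen)  # [2, 4, 6]
```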

This also seems to tie into a deeper idea from topology. Or at least a pattern; covering spaces are an entirely different subject from observables. But this picture seems to tie into the idea of an observable too well.

Consider the behavior $B$: we lift it to the sandwich of functions we want to use to create the desired behavior. Upon subscription we flatten back to the behavior $B$.

$B$ = {$f_1$, $f_2$, $f_3$}

And that’s all I have. Thanks, and please refine my thinking in the comments!

September 06, 2016

chrisittner (pgmpy)

Feature summary of BN structure learning in python pgm libraries

This is a (possibly already outdated) summary of structure learning capabilities of existing Python libraries for general Bayesian networks.

libpgm.pgmlearner

• Discrete MLE Parameter estimation
• Discrete constraint-based Structure estimation
• Linear Gaussian MLE Parameter estimation
• Linear Gaussian constraint-based Structure estimation

Version 1.1, released 2012, Python 2

bnfinder (also here)

• Discrete & Continuous score-based Structure estimation
• scores: MDL/BIC (default), BDeu, K2
• supports restriction to subset of data set, per node
• supports restrictions of parents set, per node
• allows restricting the search space (max number of parents)
• search method??
• Command line tool

Version 2, 2011-2014?, Python 2

pomegranate

• Discrete MLE Parameter estimation
• Can be used to estimate missing values in incomplete data sets prior to model parametrization

Version 0.4, 2016, Python 2, possibly Python 3

Further relevant libraries include PyMC, BayesPy, and the Python Bayes Network Toolbox. Also check out the bnlearn R package for more functionality.

Concluding Google Summer of Code 2016

Name: Karan Saxena

Project Name: IMS V-ERAS: Improving the Step Recognition Algorithm

Working on: Python, PyTango, Kinect V2 sensor by Microsoft

Student Application: http://www.ims-gsoc.org/#improve-step-recognition

Project Mentors:

Ambar Mehrotra: https://bitbucket.org/mehrotraambar/
Antonio Del Mastro: https://bitbucket.org/aldebran/

Project Work:

Description:
Virtual ERAS (V-ERAS) forms a salient part of the European Mars Analog Station (ERAS) for the Italian Mars Society (IMS).

The immersive VR Simulation of V-ERAS allows users to interact with a simulated Martian environment using Aldebran VSS Motivity, Oculus Rift and Microsoft Kinect.

Motivity is a passive omnidirectional treadmill, so the user’s steps do not map directly to real distances. Therefore, the configuration needs to include an accurate and robust algorithm for estimating the user’s steps, to be reproduced in V-ERAS.

In the V-ERAS station simulation, the data used for the step recognition algorithm with the Microsoft Kinect are the skeletal joints, recognized by means of the Skeletal Tracking implemented in the Microsoft Kinect SDK (1.8).

However, there were two main issues with the previous settings, and the goal of my project was to rectify and overcome them:

1. Enhance the accuracy of the former step recognition algorithm: as of then, it was uncertain whether the inaccuracy was totally or partially due to noisy recognition of the feet skeletal joints, or due to the non-optimal algorithm alone. That needed to be investigated and rectified.
2. Improve the recognition of feet joints: in the former configuration (using Kinect 1.8), the feet joints weren’t recognised precisely.

Deliverables achieved:

• Setting up the Kinect v2.
• Setting up PyTango with the Kinect to integrate with the ERAS environment.
• Capturing the Body Frame (1920x1080) and the Depth Frame (512x424).
• Matching the corresponding Body Frame and Depth Frame by overlaying the corresponding areas.
• Calculating the distance moved (in metres) in real time by a skeleton in the frame.
• Testing the PyKinect2 and Tango setup.
• Optimizing the output by tweaking certain parameters and changing input conditions (e.g. by wearing coloured shoes).

Challenging Bits:

• Setting up PyKinect2 was really difficult; I had to tweak the code a bit and hence supplied a custom library.
• The Depth Frame was out of sync with the Body Frame.
• Adding multi-threading support to simultaneously publish the values to the db/client.

Future Scope:

• Final integration in the production environment.

Acknowledgements:

I would like to thank:
• for the experience provided.
• IMS, for accepting my project proposal.
• My mentor Ambar Mehrotra, for being there to guide and help me.
• Vladkol and Ryan, for their valuable help.
• Antonio Del Mastro, for his kindness and encouragement during this GSoC.

And now my watch has ended!!

September 03, 2016

Nelson Liu (scikit-learn)

My Journey in Open Source / How to Get Started Contributing

I just finished the Google Summer of Code Program, wherein I worked on the Python machine learning package scikit-learn. Since I began working with the project in November 2015, I've occasionally received emails asking how one should get started contributing. In this blog post, I'll describe my journey in open source and give some tips for getting started.

In the beginning

I've always been interested in machine learning and the prospect of drawing reliable conclusions from incomplete data. Broadly, machine learning is the study of creating programs and applications that learn from experience and data.

With this interest in mind, I wanted to learn as much as I could. Early in my learning, I stumbled upon an elementary tutorial on Kaggle using scikit-learn to predict the survival outcome of Titanic passengers.

Despite having no prior knowledge of Python, I was surprised by, and fell in love with, how easy the package made the process. All I had to do was create the estimator, call fit() on my data, and call predict() on new samples! I dove into the API and documentation, and gradually began to learn about the complex mathematics and statistics underlying the deceptively simple API of the various machine learning algorithms implemented. At the same time, I learned more about Python by looking through the scikit-learn examples and writing machine learning applications on my own (though I did eventually comb through a book for a more formal approach). It would not be an overstatement to say that I learned Python through scikit-learn (interestingly, I met several others at SciPy 2016 who also had this experience).
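That fit/predict workflow looks something like the following sketch, using a toy dataset rather than the Titanic data (the choice of estimator here is purely illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data: the label simply mirrors the first feature.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = [0, 0, 1, 1]

clf = DecisionTreeClassifier(random_state=0)  # create the estimator
clf.fit(X_train, y_train)                     # fit it on the data
print(clf.predict([[1, 0]]))                  # predict on a new sample: [1]
```

Every scikit-learn estimator follows this same create/fit/predict pattern, which is what makes the library so approachable for beginners.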

Finding slight errors

A few years later, while running one of the documented scikit-learn examples for visualizing the stock market structure (in release 0.16), I noticed that the code failed to run on my computer. Digging deeper, I saw that the problem was due to a matplotlib function that had been deprecated and was no longer present in my more recent version. I patched up the function myself and saw that the example ran satisfactorily on my machine.

The First Pull Request

At this point in time, I realized that I could contribute my correction back to the library. I had never been involved in an open source community before, so I was naturally quite apprehensive. I painstakingly followed scikit-learn's contributing guide to set up a proper development environment, and made my first pull request to the project. I was quite unfamiliar with git at the time, and messed this up somehow; I promptly opened another one with the correct contents.

Despite all my pre-reading and preparation, Gael Varoquaux and I worked together to make several changes to my simple pull request before it was ready for merging.

Through just my first pull request, I learned an immense amount; my code was initially rejected because it broke compatibility with past versions of matplotlib, which is something I had never even considered! Additionally, I had no experience with continuous integration systems and limited git knowledge. After taking a while to figure out how to "squash my commits" properly, the pull request was merged.

In hindsight, I'm extremely grateful to Gael and Tom for their support and patience; their kindness was a huge factor in making me feel welcome in the project. I was happy that my work had been accepted by the project reviewers, and that I had the ability to contribute back to a project that I used so much. I wanted to get more involved and was curious about what else I could help out with. I began to follow the issue tracker and contributed back to the library whenever I could pitch in.

Participating in Google Summer of Code

In February 2016, I found out about a program called Google Summer of Code while browsing the scikit-learn project wiki and seeing the past proposals that had been submitted for consideration to the project. I thought the program would be a perfect opportunity to work more closely with community members as mentors and contribute a solid body of work.

I expressed my interest in participating on the mailing list, and I was contacted by Raghav a few weeks before the program opened. He proposed a project involving working with the tree module, and I drafted a proposal and was eventually accepted to the program.

I was fortunate to have great mentors in Raghav and Jacob and was able to learn a lot from them through our frequent communication on Google Hangouts and in person (Jacob and I are both at the University of Washington). For technical details about my Google Summer of Code project, check out my previous blog posts on them.

Post GSoC

Since the end of the program, I've been working on reviewing more pull requests. scikit-learn is mostly limited by reviewer bandwidth, as there are far more pull requests than contributors have time to critique. I hope that, with time, I can develop this skill and contribute to the project in this facet as well (while also contributing code, of course!).

Getting started with open source

Starting open source contributions can be a difficult experience; there's so much out there that it's hard to figure out where you can pitch in. I'll try to provide a short guide below on how to get started contributing.

Finding something to work on

This is often the hardest part for new contributors. The best way to get started is to simply jump in! There are a myriad of ways to contribute to an open source project. Obviously, writing code to fix bugs, add new features, or enhance existing ones is useful. However, you don't have to write code to help out! Documentation is a critical part of any open source project, and there's always something to help out with in this department.

If you find an issue that you want to tackle, it's generally good practice to leave a comment saying that you'd like to work on it. This reduces the probability of duplicate patches.

While you're preparing your patch, make sure to read the contributing guidelines of the project. Often this is a file in the root directory of the project; alternatively, it may live in the documentation. Following the project's protocol will reduce the effort spent on corrections, both by the reviewers and by you.

After you've submitted a pull request

Great, so you solved the issue and opened a pull request on the project. At this point, you should wait for reviewers to comment and suggest improvements. Don't be afraid to talk to them and ask them questions; you both have the same goal of improving the project. Address reviewer concerns in a timely manner, and your code will eventually be merged if the improvement is deemed necessary.

Be a good community member

I highly recommend that prospective contributors "Watch" the projects that they are interested in on GitHub. This allows you to easily keep up with the various conversations happening throughout the project and see how you can help out. Don't be afraid to join these conversations and make your voice heard; user input is generally quite welcome in discussions regarding new enhancements, APIs, and future releases. Of course, all rules of normal conversation apply to online forums as well --- treat others as you would like to be treated, and be a good person.

To conclude, I'd like to emphasize that everyone at one time was a beginner. Don't be afraid to ask "stupid" questions, because there's a natural learning curve involved in open source contribution; the most important thing by far is the willingness to try and learn.

Thanks to YenChen Lin and Victor Chen for reading and providing feedback on early drafts of this post.

August 26, 2016

meetshah1995 (MyHDL)

That's all folks

If you recall Bugs Bunny from the title, you are awesome :).

On a more serious note, this is officially the last post of this myhdl-riscv GSoC series :(.

I implemented a RISC-V based processor, Zscale, in myHDL and validated and verified it with unit and assembly tests.

As of today, the entire core (the Zscale processor) is functional except for the assembly test of the core. My co-GSoCer, +Srivatsan Ramesh, and I are working on it and hopefully should get it done soon. My college started on July 19, which slowed down the last few weeks, which mostly involved testing the core.

A more holistic review:

Original Work Plan
My main motivation in the project was to demonstrate the power and flexibility of myHDL by developing a RISC-V based core in myHDL. A few main checkpoints I had in mind were:
* Pure Python decoder for the entire instruction set.
* A complete RISCV core implemented in myHDL.
Work Completed
* Pure Python Decoder (y)
* myHDL based Decoder (y)
* Vscale individual modules (y)
* Vscale assembly modules (y)
* Vscale unit tests (y)
* Vscale assembly tests (n)
Work Remaining
Completion of Vscale assembly tests (remove a bug).
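For a taste of what the pure Python decoder deals with, the fixed fields of an RV32I R-type instruction can be sliced out with shifts and masks. This is a generic illustration of the RV32I encoding, not the project's actual decoder code:

```python
def decode_rtype(instr):
    """Extract the fields of a 32-bit RISC-V R-type instruction."""
    return {
        'opcode': instr & 0x7F,           # bits 0-6
        'rd':     (instr >> 7) & 0x1F,    # bits 7-11, destination register
        'funct3': (instr >> 12) & 0x7,    # bits 12-14
        'rs1':    (instr >> 15) & 0x1F,   # bits 15-19, source register 1
        'rs2':    (instr >> 20) & 0x1F,   # bits 20-24, source register 2
        'funct7': (instr >> 25) & 0x7F,   # bits 25-31
    }

# 0x003100B3 encodes `add x1, x2, x3`
fields = decode_rtype(0x003100B3)
```

The real decoder additionally dispatches on opcode/funct fields to the other instruction formats (I, S, B, U, J).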
My fork (more up to date, as I have a pending PR):
https://github.com/meetshah1995/riscv/commits/dev?author=meetshah1995

Main Repo (Contains all merged code ~5 PRs):
https://github.com/jck/riscv/commits/master?author=meetshah1995

Pull request(s) :

Pure Python RISC-V ISA Decoder Implementation
https://github.com/jck/riscv/pull/1

myHDL based RISCV 32I Decoder Implementation
https://github.com/jck/riscv/pull/2

Decoder module conversion and tests
https://github.com/jck/riscv/pull/3

Individual Vscale modules in myHDL
https://github.com/jck/riscv/pull/4

Assembly and test framework of Vscale in myHDL
https://github.com/jck/riscv/pull/5
Having said that, it gives me immense pleasure to have helped the myHDL & PSF community, and I would be more than glad if myHDL users used the riscv repository for their research and development.

It has been a great 3 months, with a lot of learning and interaction experience gained along the way. Shout out to my mentors +Keerthan JC and +Christopher Felton for guiding my way through this seemingly difficult task.

I will promote myHDL in my university and contribute to the main myhdl repository in the coming months as and when I gather time.

Until next time,
MS.

August 23, 2016

tsirif (Theano)

Google Summer of Code 2016 has come to an end

Three months of coding under the hot summer sun have come to an end. Google Summer of Code 2016 got me involved in the backends of deep learning frameworks and has inspired me to continue contributing to this effort in the future. For the time being, this summer has resulted in 197 commits, with 10798 insertions and 4430 deletions in total, across two repositories of the Theano project, the symbolic computation framework for deep learning in Python.

Summarizing the work during GSoC

Theano/libgpuarray repository

For libgpuarray: 11 pull requests were merged

98 commits in C, Python and Cython code of 7249 insertions and 3739 deletions

1. Wrapped NVIDIA’s NCCL library for multi-GPU collectives into a GPU framework-agnostic frontend.

2. Extended pygpu’s Python interface for GPU Numpy-like ndarrays to include multi-GPU collective operations.

3. Added helper functions in pygpu’s general and GpuArray interface.

4. Fixed and enhanced various aspects of code and documentation.

mila-udem/platoon repository

For Platoon: 1 pull request was merged

99 commits in Python code of 3549 insertions and 691 deletions

1. Extended the worker/controller architecture for synchronous multi-GPU and multi-node/GPU collective operations and exposed an all_reduce interface in the Worker class.

2. Implemented a more sophisticated error-handling and launching mechanism in order to support both single-node and multi-node cases in the same code.

3. Wrapped Worker’s all_reduce interface into a Theano Op in order to integrate seamlessly into a worker process which uses Theano.

4. Implemented distributed stochastic gradient descent algorithms (global dynamics) for data-parallel training procedures: Synchronous sum/average of parameter updates and synchronous variants of EASGD and Downpour.
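To illustrate the global dynamics in point 4 (independently of Platoon's actual API, which is not shown here), the synchronous average rule amounts to an all-reduce that leaves every worker holding the mean of all workers' parameters:

```python
import numpy as np

def synchronous_average(worker_params):
    """All-reduce style averaging: each worker ends up with the mean parameters."""
    mean = np.mean(worker_params, axis=0)
    return [mean.copy() for _ in worker_params]

# Two workers with divergent parameters end up synchronized.
workers = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
synced = synchronous_average(workers)
```

In the real system each worker calls an all_reduce collective over the GPUs/nodes instead of touching the other workers' memory directly.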

Last but not least

Part of this work also includes tests for the new features introduced. Every test passes, with the exception of the functional tests for the multi-node case of Platoon. Although the test code exists (in fact it is the same as in the single-node case; only the configuration changes), I have not yet succeeded in deploying it on a cluster or a working set of MPI hosts. I believe this issue will be solved soon, though. After all, it was expected to cause some trouble, as the design and implementation of multi-node/GPU collective operations in a worker/controller architecture was the most challenging and interesting part.

Till the next adventure, keep on coding
Tsirif, 2016-08-23

August 22, 2016

Avishkar Gupta (ScrapingHub)

Bad Design, Finalising & a Lookback

Hi, since this is the final one, and because I’m somebody who always has their Eureka moments in the nick of time, this is going to be a long one. But I think you’ve had enough of short and sweet from my side.

If you don’t like to read much, I suggest you turn back now.

So, as always, to reiterate: the goal of this project is a sweet and simple one, to move signals away from PyDispatcher in Scrapy. As you know, the approach we chose was to go the Django way and use django.dispatch as a starting point.

For the last couple of weeks, y’all have been hearing about my fabled benchmarking suite. The reason I never shared a link to it in my blog posts was that I could never decide whether the way I was presenting the results was right, but with the deadline coming up, I settled on the perf module that is used in Djangobench. Before we move any further, here’s a link to the benchmark suite and here’s a sample output of those benchmarks. Now that that’s out of the way, the rest of the post is going to concentrate on how we got there, and a problem that I encountered in the final weeks due to over-engineering on my part.

As you can see, at this moment the receiver_no_kwargs benchmark is on average about 1.5 times faster than it previously was. Now, let me tell you about a little something: robust_apply is a nuisance, so the original authors of django.dispatch had the brilliant idea of completely breaking backward compatibility and getting rid of receivers that do not take variable keyword arguments. That, however, was not an option I could follow. So, being an amateur who hadn’t had to deal with breaking compatibility up to this point, I decided not to introduce the same in scrapy.dispatch; rather, I tried to work up some magic inside scrapy.signalmanager, specifically the following “hack”:

if receiver.__repr__() not in self._patched_receivers:
    # wrap the receiver so every call goes through robust_apply
    self._patched_receivers[receiver.__repr__()] = \
        lambda sender, **kw: _robust_apply(receiver, sender, **kw)


Right, so as you can see, the flow here was crawler -> SignalManager -> Dispatch -> back to proxy in SignalManager. The method call overhead that was now introduced into the mix spelt disaster for this benchmark. Disaster. Not only that, sending to receivers that were not proxied this way was 0.5X slower than the raw dispatcher-to-dispatcher benchmark. Owing to this over-engineered mess, I took on the task of writing this part again, with just a couple of weeks of the coding period left to go. Thanks to the constant support I’ve had from my mentor Jakob along the way, I’m relieved to say I was able to accomplish it (which you already knew if you took the time to go through the benchmarks at the start :P).

Another important design decision this week was the one to deprecate scrapy.utils.signal. You’d think that would be trivial since we’re moving to scrapy.dispatch, but the original plan was for the methods in there to serve as pass-through methods between scrapy.signalmanager and scrapy.dispatch. That, however, is no longer the case, and our benchmark for receivers that accept kwargs shows that there is little to no overhead between that and raw signal performance.

So, with unit tests done, benchmarks done, optimizations done, it came down to the documentation. Now, I hadn’t realized that all Python documentation ever is done using reStructuredText. So I used up a couple of days to get the documentation done; however, I’m pleased to tell you that even though I haven’t shared it yet, it too is done.

However, if you did look through the benchmarks, you would have noticed that the connection time of signals to receivers has actually increased instead of decreasing. Well, most of that time is taken up in resolving whether a receiver accepts keyword arguments, and raising a deprecation warning. So even though it reads as being 2.5X slower than before, that time is actually negligible. Plus, in Scrapy, unlike Django, it’s not connecting to signals that’s the problem, it’s sending them multiple times.

Looking back, I would say, at least from where I stand, that the project was a success. I was able to learn a ton of new stuff, and make something cool. At the end of the day, that’s all that matters, right? Also, I got to work with some wonderful people at Scrapinghub, especially my mentor Jakob, who put more thought into code review than I guess I put into writing it :), but then again, I digress. I would like to extend this further, but I’ll be posting the three-page document I’ll be submitting to Google on here too, so I guess you can read about it then. Until then, Rootavish signing off.

Ravi Jain (MyHDL)

The coding weeks are over!

So people, the coding weeks are over. This post is a reference to the work done by me during this period, highlighting the goals achieved and the outstanding work.

The task was to develop a Gigabit Ethernet Media Access Controller (GEMAC) (the MAC sublayer) in accordance with the IEEE 802.3-2005 standard using MyHDL. The aim was to test and help in the development of MyHDL 1.0dev, while also demonstrating its use to other developers.

In brief, work done includes developing Management Block and Core Blocks, i.e., Transmitter and Receiver Engine with Address Filter and Flow Control. Work left includes developing the interfacing blocks, i.e., FIFOs (Rx and Tx) and GMII.

Post mid-term I started implementing the core blocks. Midway, I realised that I would be better off using finite state machines to implement these, which led me to rewrite the whole blocks. Currently, I am looking towards implementing the interfacing blocks: the FIFOs (for which I shall try to use blocks already developed by other developers) and the GMII (which depends on the PHY; I will be using the one that comes on the Zedboard, meaning I will be developing RGMII).
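To show the shape of the FSM approach (the states and next-state logic below are hypothetical illustrations, not the actual GEMAC transmit engine), the core of such a rewrite is a next-state function over an enumerated state type:

```python
from enum import Enum

class TxState(Enum):
    """Illustrative states for a frame transmitter."""
    IDLE = 0
    PREAMBLE = 1
    DATA = 2
    FCS = 3

def tx_next_state(state, start=False, done=False):
    """Illustrative next-state function for a frame transmitter."""
    if state is TxState.IDLE:
        return TxState.PREAMBLE if start else TxState.IDLE
    if state is TxState.PREAMBLE:
        return TxState.DATA                  # preamble sent, stream payload
    if state is TxState.DATA:
        return TxState.FCS if done else TxState.DATA
    return TxState.IDLE                      # FCS emitted, frame complete
```

In MyHDL the same structure appears as an enum signal updated inside a clocked process.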

Tests for each block were developed using pytest. Separate tests were developed for each unique feature to ensure it works. Convertibility tests were also developed to check the validity of the converted Verilog code, which shall be used for hardware testing in the end.

Main Repo : https://github.com/ravijain056/GEMAC/

1. Implemented Modular Base: https://github.com/ravijain056/GEMAC/pull/1
2. Implemented Management Module: https://github.com/ravijain056/GEMAC/pull/4
3. Implemented Transmit Engine: https://github.com/ravijain056/GEMAC/pull/5

My main focus after I am done with this is to make this code approachable by other developers by providing various good examples of using the library.

srivatsan_r (MyHDL)

What did I do?

My final GSoC 2016 post!

What did I do?

HDMI Source/Sink Project (Project Repo Link):

• Created interfaces and transactors to connect the different modules and to provide inputs.
• Created the HDMI transmitter and receiver models.
• Created both the HDMI transmitter and receiver cores.
• Created tests for all the modules, including the models and interfaces.
• Integrated Travis-CI, landscape.io and coveralls.io into the GitHub repo to give an automatic evaluation of the changes made to the project.
• All modules were made convertible to Verilog.

The complete documentation for the project is hosted on Read the Docs.

RISC-V Processor (https://github.com/jck/riscv):

I assisted in developing the RISC-V processor by writing some tests and implementing some modules in MyHDL.

Wrote tests for

Created modules like

What’s pending?

I was assigned the task of making the RISC-V core generate video and transmit it via the HDMI transmitter.

My mentor told me not to do it and to focus on the RISC-V processor core instead, as the core was not complete and required two people working on it in parallel.

GSoC was fun and I gained a lot of knowledge!

Shridhar Mishra (italian mars society)

Conclusion

Conclusion
Name: Shridhar Mishra

Sub-org: Italian Mars Society.

Project:
Integration of a Unity game scene with the existing PyKinect, emulating a moving skeleton based on the movements tracked by the Kinect sensors.

Another year of GSoC is coming to an end, and most of the proposed work has been completed.

List of commits:

https://github.com/mars-planet/mars_city/commits?author=shridharmishra4

Project description:

https://github.com/mars-planet/mars_city/tree/master/servers/unity

Things done:
• Setting up Kinect 2 with the ERAS environment.
• Data extraction from Kinect 2, which includes skeleton coordinates, infrared, and video output from the Kinect camera, using Python.
• Setting up Kinect with Tango Controls and sending skeleton data using a specialised numpy array.
• Unit testing of PyKinect2 and Tango.
• Unit testing of Unity game engine related code.
Challenging bits:
• Integration of the Kinect interface and the architecture-less communication using Tango.
• Selection of data that is supported by Tango and feasible for a constant polling process, with the least possible latency in transmission.
• Setting up Tango on Windows can also be an uphill task :-P. Follow this for hassle-free installation.
Future enhancements:
• Multi-threading has to be improved so that Tango can run simultaneously in a daemon thread efficiently.
• Proper integration of Tango and Unity.
• Test skeleton co-ordinates received at the Linux end.
• Final integration testing and deployment.
References:

Vikram Raigur (MyHDL)

GSoC Final Report

Completed work :

1. Quantizer module.
2. Run Length Encoder Module.
3. Huffman Module.
4. Byte-Stuffer Module.
5. Back-end Main module.
6. Back-end module compatible with front-end.
7. Input Buffer that can store three blocks from the front-end.
8. Completed software prototype for back-end module.

To-Do :

1. Connecting front-end and back-end.
2. Commit new back-end with new interfaces (front-end compatible).

The back-end module which is committed into the main repository needs a couple of changes (adding a counter inside). I have added the counter but have not uploaded it yet (testing is in progress).

I made some changes to the quantizer module at the end, so a bit of code clean-up is needed in the test-bench.

The only main task left now is connecting the front-end and back-end modules.

All the modules are scalar, convertible and modular. I also wrote software prototypes for each module, so that we can check and compare outputs from the test-bench against the software prototype. All the test-benches are convertible.
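The prototype-versus-test-bench comparison works like this (the quantizer function and values below are hypothetical, purely to show the pattern):

```python
def software_quantize(block, table):
    """Hypothetical software prototype of a quantizer: divide each
    coefficient by its quantization step and round."""
    return [round(v / q) for v, q in zip(block, table)]

# In the real test-bench, the same stimulus drives the MyHDL module and
# the hardware output is compared against the prototype's output.
block, table = [16, 33, 50], [8, 16, 25]
expected = software_quantize(block, table)   # what the hardware must produce
```

Any mismatch between the module's output and `expected` then points directly at a hardware bug (or a prototype bug).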

The documentation for the overall project is at this link. Just run the command make html and you will get the HTML version.

The read-the-docs version (online version) of the project documentation is here.

The above documentation describes the work each module does, its interfaces, etc. It also includes the code-coverage results for each module.

In near future, I will try to investigate Dynamic Huffman Encoding in the back-end.

Pranjal Agrawal (MyHDL)

GSoC final summary and development

Hey guys,

This is my final post in the GSoC series for the myhdl version of the Leros tiny processor. Here I will describe the work done, some of the challenges I faced, the work still remaining, and my future plans for this project.

My GSoC project was to redesign the Leros tiny processor in myhdl, convert and synthesize it, test it on hardware, and develop a command bridge assembly for it so it can be interfaced with other rhea designs.
The entire project repository can be found at:

https://github.com/forumulator/pyLeros

The pull requests that divide the project into stages (now merged) are:

4.  The conversion branch: the complete convertible code, including the example programs, assembler, and the generated .rom and .vhd files.
https://github.com/forumulator/pyLeros/pull/4

3.  The python based simulator and increased test coverage
https://github.com/forumulator/pyLeros/pull/3

2.  The complete working and tested pyLeros modules
https://github.com/forumulator/pyLeros/pull/2

1.  Work before the midterms PR (partly the modules and the tests).
https://github.com/forumulator/pyLeros/pull/1

pyLeros Development Summary

In my GSoC proposal, I had outlined 5 major goals for the development of Leros:
1. Writing the tools, like the simulator and assembler.
2. Development of the processor and test suite.
3. Code refactoring and conversion to myhdl.
4. Synthesis and hardware testing with examples.
5. Writing of the UART + command bridge on Leros, the real-world application.

I'm happy to tell you that goals 1, 2, and 3 are all over and done with. The various Python-based modules have been written and well tested (coveralls indicates that test coverage is currently around 88%). I wrote a simulator in Python for the processor, which has also been used for the tests. This is a good idea because processor simulators are much easier to write in software than the processors themselves, and this increases test coverage. Many of the bugs were caught this way. Many challenges were also faced, like hazards, and getting the timing down just right.

That part (points 1 and 2) was done in the first 7 weeks. Unfortunately, it took more time than originally anticipated, because the slightest bug in the description can make the program go haywire. Also, debugging is hard because you actually have to step through the execution, looking at each instruction.

Then I started the code refactoring and conversion phase. This is important because lots of things that can be written in software and run perfectly in simulation either won't convert to VHDL or won't synthesize. For example, the decoder signals, which go from the decoder to the various components and drive the execution, were originally a list of signals. Since this can't be converted, I have used interfaces instead. Other things that gave either no or poor results in VHDL were also changed.

Synthesis and testing

Finally came the synthesis and testing part: further refining the structure of the myhdl code so that the converted VHDL can be synthesized and is semantically correct too (this is important so that resources are not mis-assigned during synthesis). There are actually still some bugs in this; for example, the synthesizer keeps assigning a couple of thousand registers instead of on-board memory for the RAM.

Examples and hardware testing

Next came the examples. Because I felt like I was repeating almost exactly what Martin had done with his assembler in Java, I scrapped my half-written Python assembler and ported over the Java one in a day. Then I had examples that I wrote: some of the common tasks that can be used to test processors, for example sorting algorithms. These were written, assembled, tested in simulation, converted to VHDL, tested in VHDL simulation, and finally tested on the FPGA. And they work!

External Design of the processor

The main design of the processor goes as follows: we write an assembly file and assemble it. While instantiating the processor in our designs, we pass this file as an argument to the main pyleros @myhdl.block, and also connect the 16-bit I/O ports and 1-bit strobes to the appropriate places. Such a design, along with the I/O, also has tests in the test suite. This was done because the processor is supposed to work as a general-purpose peripheral, so the memories have been included and are not exposed outside. This design can be modified as needed.

Git development flow

The git development workflow I followed was something like this: a little of the initial development was done in the master branch. Then I moved development to the branch core. After the mid-terms, a PR was made from the core branch to master, and development continued in dev-exp. dev-stg2 contains the code for simulation and some refactoring. The branch conversion contains the conversion refactoring, tests with the assembled code, and the generated VHDL, along with the synthesis. Each branch was created from the previous one after opening a PR to the one before it. The branches have been merged to master after a successful testing phase, because of time constraints.

The PR's can be viewed here:
https://github.com/forumulator/pyLeros/pulls

Future plans

Unfortunately, one of the goals outlined, creating the command bridge on pyleros, I wasn't able to complete; it has been postponed to after GSoC. However, I have a clear view in mind of what has to be done, and I expect to finish it off in the next couple of weeks. That way, I can also test the processor with existing rhea cores. Beyond that, what remains is writing more examples, trying to increase coverage to 100%, the works. In short, the majority (over 95%) of the project has been done.

And that brings us to the end of this long post. It has been a long and eventful summer with many ups and downs along the way. But after all this, the project is finally done, and I can tell anyone who asks that summer 2016 was a summer well spent!

August 21, 2016

sahmed95 (dipy)

Wrapping up : Google Summer of Code 2016

Software development is challenging, and developing a robust package for use in the real world is a completely different experience than writing code for yourself. Google Summer of Code 2016 was a valuable learning experience which introduced me to Free and Open Source Software. I learnt the importance of writing robust, tested code which is easy for other developers to read and understand. GSoC also gave me an opportunity to interact with experienced mentors who were very patient in answering my doubts and suggesting good coding practices. With thorough code reviews and regular hangout sessions, my mentors went out of their way to help me with my project. The entire process of writing a proposal, getting involved with a new community, developing a package which will be used by a large number of users, testing the code and writing examples for it has given me the confidence to take up any project and work independently. I am sure this is the first of many more projects and contributions that I will get involved with in the FOSS world.

One of the most interesting parts of GSoC for me was getting to know and work with people from different parts of the world. This introduced me to a completely new way of working in a team remotely, and I am now looking to get involved in more such projects.

As someone who had very little previous experience in software development, one of the most difficult tasks was getting selected for the project, but I am very grateful that I got the chance. I already had some experience with mathematical modelling, and as a pre-final-year student in Physics this project has helped me explore the topic thoroughly. I am sure this will be a great help in deciding my final-year thesis, and whenever I develop code from now on, I will have the mindset of an open source developer and try to write code such that it can be used freely and developed further.

chrisittner (pgmpy)

GSoC 2016 Work Product

My GSoC 2016 work can be found here:

In addition, these two Pull Requests are not yet merged:

I wrote an introduction to pgmpy’s new structure learning features here, as a Jupyter Notebook:

SanketDG (coala)

That's it, folks!

So this is it. The end of my Google Summer of Code. An amazing 12 weeks of working on a real project with deadlines and milestones.

Thanks, awesome mentor!

First and foremost, I would like to thank my mentor Mischa Krüger for his constant guidance and support through the tenure of my project.

Thank you for clarifying my trivial issues that were way too trivial. Thank you for clearing my doubts on the design of the classes. Thank you for writing a basic layout for a prototype bear. Thank you for understanding when I was not able to meet certain deadlines. Thank you Mischa for being an awesome mentor.

The Beginning

I was first introduced to coala in HackerEarth IndiaHacks Open Source Hackathon. I wanted to participate in it, so I took a look at the list of projects and saw coala. I jumped on their gitter channel and said hi. Lasse hit me back instantly, introduced me to the project, asked me to choose any newcomer issue, and my first patch got accepted in no time.

As the hackathon came to an end, it was time for organisations to start thinking about Google Summer of Code. By then I had been taking part in regular discussions and code reviews, and Lasse asked me if I’d like to do a GSoC:

I slowly pivoted to choosing language independent documentation extraction as my GSoC project, as I found it to have greater depth than my other choices.

I feel privileged to be contributing to coala. The project itself is awesome in its entirety. I have contributed to my fair share of open source projects and I have never found any other project that is so organized and newcomer-friendly. How coala is awesome deserves a post of its own.

Now to my project. As stated repeatedly in my past posts, my project was to build a language independent documentation extraction and parsing library, and use it to develop bears (static analysis routines).

How it all fits together

Most of the documentation extraction routines had already been written by my mentor. Except for a couple of tiny bugs, they worked pretty well. The documentation extraction API is responsible for extracting the documentation given the language, docstyle and markers, and returning a DocumentationComment object.

The DocumentationComment class defines one documentation comment along with its language, docstyle, markers, indentation and range.

My first task was to write a language independent parsing routine that would extract metadata out of a documentation comment, i.e. description, parameter and return information. This resides inside the DocumentationComment class.

The point of this parsing library is to allow bear developers to manipulate metadata without worrying about destroying the format.

I then had to make sure that I had support for the most popular languages. I used the unofficial coalang specification to define keywords and symbols that are used in different documentation comments. They are being loaded along with the docstyle.

Although I do not use the coalang stuff yet and still pass keywords and symbols manually, it will be used in the future.

Lastly, I had to implement a function to assemble a parsed documentation into a documentation comment.

I separated this functionality into two functions:

• The first function would take in a list of parsed documentation comment metadata and construct a DocumentationComment object from that. The object would contain the assembled documentation comment and its other characteristics. Note that this just assembles the inside of the documentation comment, not accounting for the indentation and markers.

• The second function takes this DocumentationComment object and assembles it into a documentation comment, as it should be, taking account of the indentation and the markers.

Difficulties faced

• The first difficulty I faced was the design of the parsing module itself. With the help of my mentor, I was able to sort that out. We decided on using namedtuples for each piece of metadata:
from collections import namedtuple

Parameter = namedtuple('Parameter', 'name, desc')
ReturnValue = namedtuple('ReturnValue', 'desc')
Description = namedtuple('Description', 'desc')

• If I wanted to make the library completely language independent, most settings would have to be configurable to the end user. Initially I hardcoded the keywords and symbols that I used, but later the coalang specification was used to define the settings. They are yet to be used in the library.

• While trying to use the above mentioned settings, I realized that the settings extraction didn’t work for trailing spaces. Since I had to have settings with trailing whitespace, I had to fix the extraction in the LineParser class.
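To make the namedtuple design concrete, here is a small illustrative sketch; the parsed list and its contents are invented for illustration, and only the three namedtuple definitions come from the post:

```python
from collections import namedtuple

# The metadata containers decided on for the parsing module.
Parameter = namedtuple('Parameter', 'name, desc')
ReturnValue = namedtuple('ReturnValue', 'desc')
Description = namedtuple('Description', 'desc')

# A hypothetical parse result for a small docstring.
parsed = [
    Description(desc='Add two numbers.'),
    Parameter(name='x', desc='the first addend'),
    Parameter(name='y', desc='the second addend'),
    ReturnValue(desc='the sum of x and y'),
]

# Bear code can now work with the metadata without touching the raw format.
params = [p for p in parsed if isinstance(p, Parameter)]
```

Because namedtuples are plain tuples with named fields, they are cheap to compare and to reassemble into text later on.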

What has been done till now

coala

The API still has a long way to go. A lot of things can be added/improved:

• Maybe the use of namedtuples is not that efficient. I think classes subclassed from these namedtuples should be used instead. This would make the API far more flexible than it currently is, while retaining the advantages of using namedtuples.

• A cornercase in assembling #2645

• Range is not being calculated correctly. #2646

• The API is not using the coalang symbols/keywords. #2629

• A lot of things are just assumed from the documentation while parsing. Related: #2143

• Trivial: #2617, #2616

• A lot of documentation related bears can be developed from this API.

It has been an awesome 3 months and an even more awesome 7 months of contributing to coala. That’s it folks!

Other projects

Also, I want to talk about the projects of other students:

• @hypothesist did an awesome job on coala-quickstart. The time saved in using coala-quickstart vs. writing your own .coafile is huge and this will lead to more projects using coala. He has also worked on caching files to speed up coala.

• @tushar-rishav built coala-html! It's a web app for showing your coala results. He has also been working on a new website for coala.

• @mr-karan did some cool documentation for the bears and implemented syntax highlighting in the terminal.

• @Adrianzatreanu worked on the Requirements API.

• @Redridge’s work on External Bears will help you write bears in your favourite programming language.

• @abhsag24 worked on the coalang specification. We can finally integrate language independent bears seamlessly!

• Thanks to @arafsheikh, you can now use coala in Eclipse.

tushar-rishav (coala)

GSoC'16 final report

Alright folks. It's officially an end to the amazing 12 weeks of Google Summer of Code 2016! These 12 weeks have certainly helped me become a better person, both personally and professionally. I've had a chance to interact with and learn from some very interesting and amazing minds at coala. Sadly, I could meet only a few of them, during the EuroPython conference this summer.

Acknowledgement

First and foremost, I would like to express my eternal gratitude to my mentor Attila Tovt, who patiently helped me improve my patches through multiple reviews. I was learning something new with each iteration. The major takeaway for me from his mentorship would be:

There is always a hack to something, then there is a solution to it.

Sadly, I couldn’t meet him during EuroPython conference, but I am hopeful to pay my regards in person someday! :)

Next, I must thank Abdeali for helping and guiding me when I started my journey with coala, and also Lasse Schuirmann. Well, after all, he is coala's BDFL! Wondering what that means? Benevolent Dictator For Life :D

Hmm, on a serious note: if I say I am going to continue contributing to coala or any other FOSS project that I may come across in the future, one of the major reasons and influences would be Lasse! The guy is totally amazing. I don't think anything could describe him better than his own words during EuroPython '16:

Guys! Don’t just be a participant. It’s boring! Be and create a conference!

Impressive isn’t it? That’s Lasse for you! :)

I would also like to thank fellow GSoC students and now my new friends - Adrian, Adhithya, Alex, Karan, Araf, Sanket and Abhay. I look forward to staying in touch with them even after GSoC! :)

Last but far from least, thanks to Max for giving such a wonderful lightning talk at EuroPython '16, for cooking for us with all his love at Bilbao, and for being such an amazing and wonderful person.

The acknowledgement must end with my gratitude to coala, Google and Python Software Foundation for giving me an opportunity to be a GSoC student in the first place.

Work history

This past summer, I've contributed to and maintained the coala-html and coala-website projects. The commits and live demos are available online.

Table 1

  Project: coala-html (demo)
  Commits: 49 commits in total (coala-html commits)
  Status: completed

  Project: coala website (repo; coala-html embedded in the coala-website demo)
  Commits: see Table 2
  Status: almost complete, but the design requires improvement

Table 2

Commits for the coala website repository, listed individually since GitLab doesn't support filtering commits by author yet.

  Commit SHA  Shortlog
  f9242a      runserver.py: Init setup with flask
  51ac6e      bower.json: Use bower to install dependencies
  ac3cdd      beardoc: Display what coala can do interactively
  f90513      bear-doc: Include bear-doc

Although the GSoC period may end, my contributions to FOSS won't! :)

Regards,
Tushar

John Detlefs (MDAnalysis)

My Summer of Code Experience

The summer is over and I am very proud to report that I have accomplished most of the goals I set before starting work with MDAnalysis.

A TL;DR of my GitHub commits to MDAnalysis:

April (before GSoC officially started)

• Fixed a bug in the RMSD calculation code so that it gives the right value when given weights.
• Eliminated a dependency on the BioPython PDB reader and updated the documentation. The ‘primitive’ PDB reader replaced BioPython.
• Added an array indexed at zero for the data returned by hydrogen bond analysis.

May (Official work started on the 23rd.)

• I refactored most of the code in MDAnalysis/analysis/align.py to fit our new analysis API. I improved documentation where I could and wrote a new class called AlignTraj that fits the new API.

June

• I was reading and writing a lot about dimension reduction, and during this time I wrote the Diffusion Maps module MDAnalysis/analysis/diffusionmap.py

July

• I finished diffusion maps, and significantly increased coverage for the RMSD class in the analysis package.
• I went to SciPy!
• Started work on PCA

August

• I fixed some problems in the DCD reader which involved dealing with C code.
• I finished the PCA module
• I started (and am very close to finishing) work on a refactor for RMSD calculation to align with the new API.

Times like this call for reflection and honesty. There were good moments and some bad moments, but for the most part I feel my performance was perfectly satisfactory. If I were to give myself a grade, I'd give myself a B. It may be one of those B's where my raw grade was a B-, but I clearly worked hard and know more now than when I started, so I feel I've earned the bump.

Why not an A, you ask?

There were some weeks over the summer where I really felt like I wasn’t reviewing and iterating on my own work in an incisive fashion. It is easy to write code, commit some changes, and then wait for your mentor to make comments on what you should fix. Around week five or six, I could have done a better job at imagining I was my mentor and predicting his suggestions.

In addition, there were some times in which I self-assigned bug fixes and projects that turned out to be far harder than I anticipated. (One of them was a reader rewind issue that required more understanding of the Reader API than I currently have.) Having respect for the complexity of code and the amount of time it takes to go in and dissect a problem in order to figure out how to fix it is something that I hope to keep improving.

What I did well

• I worked hard on setting a realistic timeline at the beginning of the summer and achieved all but the last item on the list.
• I communicated frequently and did not hesitate to ask questions that came up.
• I helped discover and fix some bugs in old C code, which required some self-teaching.
• I read technical papers and taught myself some dimension reduction algorithms so that I could confidently justify the code I wrote.

Historically I think I’ve overpromised and underdelivered on projects, but in this case I think I did a decent job of delivering on the work I promised and doing more when I could. I never got to go in and try to figure out parallel trajectory analysis, but I am still optimistic that I can find the time to work on this soon.

What have I learned?

• It is often easier and more productive to write crappy code and iterate on it rather than trying to sit at your computer and come up with the perfect code on your first try.

• Weekends are an important break to prevent long term mental fatigue.

• Motivation can come and go in waves; ride the wave. Don't get bummed out or be hard on yourself when you find yourself in a lull.

• IRC, Slack, or Gitter are really great informal avenues for questions. Mailing lists and email are overly formal for the kinds of questions I want to ask as a newbie programmer.

• Don't read into things; people get busy. If someone doesn't respond to an email, you probably haven't done anything wrong.

• The only way to solve hard problems is to break them down into discrete chunks. This is a skill that I have yet to master but am working on improving.

• I approach health in a very rigorous, numbers-oriented way. I have been thinking that if I treated reading and self-improvement the way I treat exercise, I could do a better job at being productive.

• A significant portion of working on large software projects is communicating technical ideas rather than purely writing code. Being able to write prose clearly for all audiences is a skill I am still working to improve.

August 20, 2016

tushar-rishav (coala)

Mutable default arguments in Python

Recently, I came across an interesting feature in Python and thought I should share it with everyone.

Suppose we have a code snippet that looks like:
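The snippet itself did not survive in this copy of the post, so here is a reconstruction pieced together from the explanation further down (same argument names, the "!" append, and the 'some_key' insertion); exact line numbers may differ from the original:

```python
def func(default_immutable_arg="Hello!", default_mutable_arg={}):
    default_immutable_arg += "!"
    print(default_immutable_arg)
    if default_mutable_arg:
        print(default_mutable_arg)
    else:
        default_mutable_arg['some_key'] = 'some_value'

func()  # prints: Hello!!
func()  # prints: Hello!! and then {'some_key': 'some_value'}
```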

Before reading further, please stop and go through the code snippet carefully.

Now, what output do you expect from the above snippet? There is a high probability (unless you are a Python wizard!) that you might expect the output to be

Well, if that’s what you expected, then you are wrong!

The correct output is

Interesting isn’t it? Let’s dig a little deeper and find out why this happens.

Perhaps it's quite well known by now:

Expressions in default arguments are calculated when the function is defined, not when it’s called.

Before I explain further, let’s verify the above statement quickly:
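A quick way to verify it, sketched in the spirit of the post's missing snippet:

```python
import time

def time_func(arg=time.time()):
    print(arg)

time_func()
time.sleep(1)
time_func()  # prints the same timestamp: the default was computed once, at definition time
```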

For me, the output shows the same value printed both times.

Clearly, arg was calculated when time_func was defined, not when it is called; otherwise you would expect arg to be different each time it's executed.

Coming to our func example: when the def statement is executed, a new function object is created, bound to the name func, and stored in the namespace of the module. Within the function object, for each argument with a default value, an object is created that holds that default value.

In the above example, a string object ("Hello!") and an empty dictionary object are created as the defaults for default_immutable_arg and default_mutable_arg respectively.

Now, whenever func is called without arguments, the arguments take their values from these default bound objects, i.e. default_immutable_arg will always be "Hello!" but default_mutable_arg may change. This is because string objects are immutable whereas dictionary objects are mutable. So when we append "!" to default_immutable_arg, a new string object is created and returned, which is then printed on the next line, keeping the default string object's value intact.

This isn't the case with the mutable dictionary object. The first time we execute func without any arguments, default_mutable_arg takes its value from the default dictionary object, which is {} at that point. Hence, the else block is executed. Since dictionary objects are mutable, the else block changes the default dictionary object itself. So on the next execution of the function, when default_mutable_arg reads from the default dictionary object, it receives {'some_key': 'some_value'} and not {}. Interesting, huh? Well, that's the explanation! :)

Solution

Don't use mutable objects as default arguments! Simple! So how do we improve our func? Just use None as the default argument value and check for None inside the function body to determine whether the argument was passed or not.
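A sketch of that fix, using the argument names from the post:

```python
def func(default_immutable_arg="Hello!", default_mutable_arg=None):
    # None is a safe sentinel: a fresh dict is created on every call
    # where the caller did not pass one.
    if default_mutable_arg is None:
        default_mutable_arg = {}
    default_immutable_arg += "!"
    print(default_immutable_arg)
    if default_mutable_arg:
        print(default_mutable_arg)
    else:
        default_mutable_arg['some_key'] = 'some_value'

func()  # prints: Hello!!
func()  # prints: Hello!! again; every call now starts from a fresh dict
```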

Now, you can imagine what would happen if we overlooked this feature when defining class methods. Clearly, all the class instances would share references to the same object, which isn't something you'd want in the first place! :)

I hope that was fun,

Cheers!

Aron Barreira Bordin (ScrapingHub)

Scrapy-Streaming - Support for Spiders in Other Programming Languages

This page describes the Scrapy Streaming project, the summer goals, submitted patches, and a simple overview of the results.

Scrapy is one of the most popular web crawling and web scraping frameworks. It's written in Python and known for its good performance, simplicity, and powerful API. However, it's only possible to write Scrapy spiders using the Python language. The goal of this project is to provide an interface that allows developers to write spiders using any programming language, using JSON objects to make requests, parse web content, extract data, and more. Also, a helper library will be available for Java, JS, and R.

Scrapy Streaming

This project was named Scrapy Streaming, and it's a Scrapy extension that provides a JSON layer for developing spiders using any language.

Development

In this section, you can read an overview of the development progress and the submitted pull request related to each topic.

Documentation

I started the Scrapy Streaming development by defining the project API, writing the Communication Channel spec, and documenting its usage.

I considered it important to start with the project docs because it gave me some time to share my API proposal and discuss it with the Scrapy developers and contributors to get feedback, even before starting with the development.

Travis and Unit Tests

Before coding, I defined the project and source code structure, added Travis, configured Codecov to check test coverage, and implemented some initial tests.

Codecov would help me ensure that everything under development was tested, which makes for a better project.

Scrapy Commands

Scrapy has a command line interface.

Scrapy Streaming adds the streaming command to the CLI. The streaming command allows you to execute scripts/executables as a spider, using the following command:

scrapy streaming my_executable -a arg1 -a arg2 -a arg3,arg4


For example, to run a R spider named sample.R, you can use the following command:

scrapy streaming Rscript -a sample.R


If you are using a Scrapy project, you can also add multiple spiders. To do this, you must add a file named external.json in the project root, similar to the following example:
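The example file itself is missing from this copy of the post; the authoritative schema is in the Scrapy Streaming documentation, but a file along these lines (the field names here are an assumption) maps each spider name to the executable that implements it:

```json
[
    {
        "name": "java_spider",
        "command": "java",
        "args": ["MyJavaSpider"]
    },
    {
        "name": "r_spider",
        "command": "Rscript",
        "args": ["sample.R"]
    }
]
```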

and then run it using the crawl command. For example:

scrapy crawl java_spider


This definition lets you implement multiple spiders using multiple programming languages and organize them in the same project.

Communication Protocol

The Communication Protocol is a JSON API that lets any process communicate with Scrapy, allowing you to develop spiders with any programming language.

This section was a bit challenging because the code had to handle stdin/stdout buffering while coping with user mistakes, system errors, processing performance, Scrapy problems, and spider problems.

This is the core of the project and the most important patch, so I added a lot of unit tests to ensure it works well on both Python 2 and Python 3.

Examples

Something very important in this project is spider examples. Future users may depend on the initial examples to learn and use Scrapy Streaming.

I added examples in Python, R, Node.js, and Java, describing the basic usage of each. Also, the documentation contains a Quickstart section that provides a brief explanation of how to use it in each programming language.

R package

To help the development of spiders using the R language, I implemented an R package to wrap the communication channel and help developers. It contains unit tests, documentation, and examples, and it's developed using the standard R package structure.

Java library

To help the development of spiders using the Java language, I implemented a Java library to wrap the communication channel and help developers. It contains unit tests, documentation, and examples, and it's developed using the standard Java library structure with Maven.

Node.js package

To help the development of spiders using JavaScript (with Node.js), I implemented a Node package to wrap the communication channel and help developers. It contains unit tests, documentation, and examples, and it's developed using the standard Node.js package structure.

Selectors

My initial proposal had a feature called "Selectors". With selectors, you could extract data from the HTML response using CSS/XPath filters in the communication protocol. This part was removed from the proposal because it would add too much complexity to the project.

I implemented a proof of concept, and I'd be happy to complete it after the summer because I consider this a very important feature of Scrapy Streaming.

WIP

Here, I highlight some topics that are not yet done.

• Publish the documentation on an official page (currently I'm using my personal readthedocs page to store the docs).
• Publish Scrapy Streaming to PIP
• Publish helper packages to the following repositories (the code is ready to be published):
• Java -> Maven Central Repository
• R -> CRAN
• Node.js -> NPM
• Some pull requests still require review.

Final Considerations

This summer was very good for me. I was able to develop the project and achieve my goals. Now it's possible, and easy, to implement spiders using your preferred programming language.

I received awesome support, and I'd like to thank my mentors (@eLRuLL and @ptremberth) and another GSoC student (@Preetwinder) for all the help in this project. We've discussed a lot about its usage, best practices, and coding.

I'm a big fan of the Open Source community, so I'll be happy to continue contributing to the Scrapy Streaming repository, and to related projects, after this summer.

Thank you !!,

Aron Bordin.

Scrapy-Streaming - Support for Spiders in Other Programming Languages was originally published by Aron Bordin at GSoC 2016 on August 20, 2016.

Last weeks

During the last weeks I was actively fixing bugs, adding small features, refactoring the existing code and writing docs.

I've refactored my event listener approach to support element.node:addEventListener() and element.node:removeEventListener(). Now you can add more than one event listener for the specified event, which, I think, is great.

Also, I've refactored the HTML element creation process. Now, HTML elements are created in the tab#evaljs method, which is the main method for evaluating JavaScript code across the application. Because of that, you can do something like:

local div = splash:evaljs('document.createElement("div")')
div.node.id = 'myDiv'
div.node.style.background = '#ff00ff'
splash:select('body').node:appendChild(div)


And, finally, documentation. I wrote about 800 lines of documentation and now my PR has almost 4,000 changed lines. I understand how hard it is to review and I appreciate how much time the reviewer has to spend on it.

Final Submission

During this summer I was working on the following pull requests for Splash:

splash:with_timeout()

• Pull Request: scrapinghub/splash#465
• Status: fully implemented and merged.
• Abstract: allows setting a timeout on any Lua function. It lets the wrapped function run for only (sometimes slightly more than) the specified amount of time.
• Blog post: post
• Comments: Implementing this helped me deeply understand how Splash scripts work and, particularly, how the scripts' event loop is implemented.

DOM elements: take 1

• Pull Request: scrapinghub/splash#471
• Status: switched to another API interface, closed.
• Abstract: allows manipulating DOM elements in Lua scripts.
• Blog post: post
• Comments: I created a variant of the API that allows working with DOM elements in the following manner:
local form = splash:select('form')
local input = splash:select('form > input[name="name"]')
local ok, name = assert(form:node_property('value'))
assert(input:field_value('mr. ' .. name))
local ok, submit = assert(form:node_method('submit'))
submit()


As you can see, you have to write quite a lot of code just to get a form value, change it, and then submit the form. That was the main reason to change the API interface.

Bug #1

• Pull Request: scrapinghub/splash#482
• Status: fixed and merged.
• Abstract: In Lua, all methods of all tables of a particular class were bound to one Python object.
• Blog post: post
• Comments: This bug was pretty significant for my further work, as it wouldn't allow working with multiple elements at the same time:
local el1 = splash:select('#el1')
local el2 = splash:select('#el2')

el1:form_values() -- calls el2:form_values()


I've changed the following things:

1. Getters and setters of the table (that are assigned from Python) are now assigned to the table itself rather than living in a closure scope.
2. Changed the __index and __newindex metamethods so that they are assigned to the table itself rather than to its metatable.
3. Made each table's metatable a metatable of itself.

Overall, while fixing this bug I learned many things about how OOP and metamethods work in Lua.

Bug #2

• Pull Request: scrapinghub/splash#487
• Status: succeeded by another fix implementation, still open.
• Abstract: Private methods in Lua are bound to the last wrapped exposed object.
• Comments: The problem was with private methods of exposed objects (objects that are exposed from Python to Lua). They were bound to the last created object. I tried to fix this bug by assigning these private methods to the exposed object (table) itself in the __private method. However, in that case they could be accessed from user scripts. One of my mentors, @immerrr, did a fantastic and mind-blowing job fixing this bug in scrapinghub/splash#495; you should have a look if you are familiar with Lua and its OOP implementation :smile:.

DOM elements: take 2

• Pull Request: scrapinghub/splash#490
• Status: WIP, fixing the last comments on the pull request, open.
• Abstract: Allows manipulating DOM elements using Lua properties and methods.
• Comments: after a discussion with my mentors, we decided to make the API as simple as possible. In comparison with the old API, in the new one you can set node properties just by assigning to the table, and call node methods just by calling Lua methods. You can even assign event handlers as plain Lua functions:

The previous example with the new interface:

local form = splash:select('form')
local input = form.node:querySelector('input[name="name"]')
input.node.value = 'mr. ' .. input.node.value
assert(input.node.parentElement:submit())


And this is how you can assign event handlers (with addEventListener):

local button = splash:select('button')
local tooltip = splash:select('.tooltip')
button.node:addEventListener('click', function(event)
    event:preventDefault()
    tooltip.node.style.left = event.clientX
    tooltip.node.style.top = event.clientY
end)


Pretty impressive, huh? :smile:

Overall

I finished almost everything that I was planning to do. Some of the features I had planned were implemented by other Splash contributors (event system reworking) or were already present in the project (plugin system). However, I also did more than I was supposed to, like splash:with_timeout(), which originally was meant to be only a flag for the splash:go() command, and event listeners for elements.

Final Thoughts

This summer was the most intensive one of my entire life. I was working almost every day of the week and I don't regret it.

GSoC was a unique experience for me for several reasons.

First of all, I've never worked with Python and Lua in such a big project as Splash. During the coding period my Python and Lua skills increased significantly. I want to thank Mikhail and Dennis for helping me with every question I asked them. You guys rock :sunglasses: :muscle:.

The second reason is working on an open source project, particularly writing so much documentation and so many tests. I'd never thought it could take so much time. On the other hand, while writing tests and documentation you come to understand your feature better, along with the things it would be good to add or remove.

Finally, thank you Google for giving students all over the world such an opportunity to code and learn.

August 19, 2016

Final Submission

This program was one of the best learning experiences I've ever had. During the entire GSoC phase I was able to contribute to mainly two repositories of the coala organisation, a sub-org under the Python Software Foundation.

These include commits on the coala repository:

https://github.com/coala-analyzer/coala/commits?author=abhsag24

and the coala-bears repository:

https://github.com/coala-analyzer/coala-bears/commits?author=abhsag24

coala-bears also contains a branch which hasn't been merged into master yet. All my commits on this branch can be found here:

I'm really grateful to my mentor Fabian and the admin of my sub-org, Lasse; I've had the most enriching discussions with these people. Cheers to coala, FOSS, and the entire GSoC experience. I'll definitely look forward to working beside these people after the program as well.

Prayash Mohapatra (Tryton)

Concluding Thoughts

Had a great journey this summer with Tryton. Really loved contributing to open source properly. Contributing these many days has made me happy. Happy for being able to give back to what I use. Thankful that I stumbled upon PyCon’s videos. Going through them is really fun. Hoping to contribute in python next time. :)

As for Tryton, Sergi and Cédric were always there with their quick responses on IRC. Sergi's replies sometimes made me smile instantly :D I wish I could understand the Spanish videos on the Tryton YouTube channel. I am planning to learn Spanish basics now xD After an entire day with code, human connection felt like bliss.

Code Review #27421002 is where my work can be checked. This link will always have a special place in my heart.

ghoshbishakh (dipy)

Google Summer of Code Progress August 19

So we are at the end of this awesome summer, and this post is about the progress in my final weeks of GSoC 2016! The major addition in this period is the development stats visualization page.

GitHub stats visualization

As we had planned, the new Dipy website needed a page to highlight the growing number of developers and their contributions to the Dipy project. We have finally achieved that with a separate Django app that creates visualizations from data pulled from the GitHub API; for drawing some neat graphs I used the Chart.js library.

And hey, it's a separate Django app!

So it can be integrated easily into any other Django project! Simply copy the github_visualization folder into your project and add github_visualization to the INSTALLED_APPS list in settings.py.
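The INSTALLED_APPS change is the standard Django app registration; a minimal sketch, where the surrounding entries are placeholders and only the github_visualization name comes from the post:

```python
# settings.py of the host Django project
INSTALLED_APPS = [
    'django.contrib.staticfiles',   # ...your existing apps...
    'github_visualization',         # the stats visualization app
]
```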

Now you just need to add a couple of lines to the template in which you want to show the visualizations.

Just change the ‘username’ and ‘repository_name’ to point to the GitHub repository you want to see visualizations for.

The work was submitted through pull request #15.

Ranveer Aggarwal (dipy)

To New Beginnings!

Winter is coming!

This blog post marks the end of an amazing summer, the summer of '16.
This summer, I got an opportunity to share knowledge, views and code with some of the most brilliant people I have come across, contributing to an amazing application that has the potential to take neuroscience to the next level.

My Google Summer of Code project with DIPY, under the umbrella of Python Software Foundation is one of the most interesting projects that I have ever worked on, which was reason enough to keep me motivated through all these months.

The Project

Currently, if you have an OpenGL interface, you need to use Qt/GTK or some other UI library to create a window, and focus out of the OpenGL window to do simple UI tasks like filling forms, clicking a button, saving a file, etc. Our idea is to get rid of the external interfaces and have the UI built in.
So all the interaction happens within the 3D world interface.

Currently, no library in Python offers such functionality, which is much needed for scientific visualisation.
So we built this cross-platform minimal interface on top of VTK; it provides a very simple but powerful API and can be tweaked to create futuristic interfaces.

Pre-GSoC

I had heard about DIPY through a friend of mine (who himself is a Google Summer of Code student for DIPY this year) and I was intrigued by what they were doing. Over my years as an undergrad, I had developed an interest in Visual Computing and this was one organisation that was doing some really cool work in the domain. So I headed over to have a look at the list of projects and sent across a mail to my (to-be) mentor, Eleftherios Garyfallidis.

He and Marc had an idea of building a futuristic GUI for DIPY using VTK. I loved the idea and I instantly began working on making it a reality. And over the course of the next 3 months, we came really close to it.

Getting Started

As with any project, the first step is setting things up. This took some time given that I was on OS X and my mentors were on Linux. Setting up VTK on OS X involves a really convoluted procedure, but finally, working together, we got it up and running in a week or two.

VTK is an amazing framework, but there isn't much documentation or many learning resources (for a complete newbie) on the internet. Therefore, I got off to a slow start. Over time, I realised that there is a world beyond documentation and StackOverflow: open mailing lists are one of the best things that happened to Open Source and Software Development in general.

And then there were these situations.

After nearly a month of struggling, we were finally able to get a clickable button working.

The Summer

Once we had a working button ready, the subsequent work went on smoothly. Every time we modelled a new UI element, we found out a better way to do the whole thing, and ended up rewriting older elements. There were times when we completely rewrote elements to add a single small functionality. We let go of a lot of code and came up with simpler and more efficient ways to do things.

Since we were building a programmable UI, we tried to keep each element as generic as possible, exposing as many parameters to the user as we could. To keep things simple, we also gave these parameters the best possible default values, so that the user can simply instantiate a UI element object without compromising the amount of control he/she has over it.

When I say smooth, I don’t mean there weren’t roadblocks. There were several times when I was stuck on a problem for days, but my mentors never shied away from discussing the problems with me, and I could ping them any time to get a solution.

3 months hence, we have a good UI framework in place, built entirely in VTK-Python and it’s built in such a way that it can potentially be plugged into any VTK-based application.

Tech

The project required knowledge of how OpenGL works, a good knowledge of Python, the ability to read through documentation and mailing list archives, and 3D coordinate geometry (all that effort 5 years ago finally paid off!).

My Contributions

A complete list of all my commits to the project is available here.
Here’s the PR [#1111] with all the changes.
Here’s what we did:

Building a Button

Using vtkTexturedActor2D we built a button with functionalities to change icons and add callbacks. This is what we got.

A button overlay

A Text Input Field

We built a textbox using vtkTextActor and added ways to edit the text in the text box. Starting from an editable actor, we ended up with a multi-line text box. We rewrote a lot of code while building this and this is where we ended up with the idea of having a generic UI super class for all the UI elements. This element also introduced ui parameters (to pass between the element and the interactor) which were later deprecated.

A text box

Line Sliding

While building the line slider, we realised the need for multiple elements within one element. This is where the idea of a common set_center method came up. This element also introduced changes in the way we added elements to the renderer. We also introduced a ui_list for each element that carries all the sub-elements in that element. We ended up with this.

The Line Slider
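The compound-element idea can be sketched roughly like this (hypothetical names; the real code differs): each element keeps its sub-elements in a `ui_list`, and `set_center` repositions them recursively, preserving each sub-element's offset from its parent:

```python
class Element:
    """Hypothetical sketch of a compound UI element."""

    def __init__(self, center=(0, 0)):
        self.center = center
        self.ui_list = []  # sub-elements composing this element

    def add(self, sub, offset):
        # Remember where the sub-element sits relative to the parent.
        sub.offset = offset
        self.ui_list.append(sub)

    def set_center(self, center):
        self.center = center
        # Move every sub-element, keeping its offset from the parent.
        for sub in self.ui_list:
            sub.set_center((center[0] + sub.offset[0],
                            center[1] + sub.offset[1]))

slider = Element()
handle = Element()
slider.add(handle, offset=(10, 0))
slider.set_center((50, 50))
print(handle.center)  # (60, 50)
```

Moving the top-level element moves everything beneath it, which is exactly what a slider (track plus handle) needs.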

Circular Slider

Using techniques similar to the above (and a lot of math), we built a circular slider. It underwent many modifications while being added to the panel, because we wanted it to maintain a constant value while being moved around.

The Circular Slider

Moving on, the idea was to use the existing elements in 3D. We used vtkAssembly and vtkFollower; the former took a lot of time to understand (though thanks to my mentors' efforts, it turned out to be not so convoluted). With these, we successfully ported several 2D elements to 3D. We couldn't do 3D sliding, so that is something we will add to future work.

A 2D Panel

A 2D Panel is basically a collection of 2D elements. Built so that the elements keep their positions relative to each other regardless of the panel's size, the panel turned out to be more useful than we thought once we managed to set up a panel of panels.
We also used set_center recursively to move the panel with all its elements around, and to align panels to the left or right of the screen.

A Panel

A Right-Aligned Panel

The time had come to build a file dialog. Using os and glob.glob we built a file menu for displaying files in the current folder and changing directories when clicked.

A File Dialog
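A minimal sketch of the directory-listing logic behind such a file menu (plain Python with `os` and `glob`, as the post mentions; `list_directory` is a hypothetical helper, not the project's API):

```python
import glob
import os

def list_directory(path):
    """Return (dirs, files) in `path`; '..' lets the user navigate up."""
    entries = sorted(glob.glob(os.path.join(path, '*')))
    dirs = ['..'] + [e for e in entries if os.path.isdir(e)]
    files = [e for e in entries if os.path.isfile(e)]
    return dirs, files

# A file dialog would render `dirs` as clickable folders (changing
# directory on click) and `files` as selectable entries.
dirs, files = list_directory('.')
```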

To put all that we had done to test, we built a file save dialog. This used almost everything we had built till now - panels, buttons, text box, etc. Here, for inter-object communication, we introduced optional parent references for each element. In the end, it all worked out well :)

Future Work

Here’s what we want to do in the future:

• Build robust 3D elements
• Convert the prototype elements to futuristic-looking elements
• Add unit tests - we don't have a unit-testing framework right now; we are looking for one

Acknowledgements

While the final certificate will bear my name, countless brilliant minds went into this project.

Both my mentors, Eleftherios Garyfallidis and Marc-Alexandre Côté have stood by my side, all through the summer, with regular (and at times more than that) meetings, reviews and suggestions. I could ping them at any odd time of the day and they would promptly reply to all my doubts. The Story of the Rabbit’s PhD Thesis holds true :D

The people who built VTK and those who built its Python wrapper have done an amazing job. It's an amazing framework, still in active development. I am also indebted to the people who discuss all their doubts online and leave breadcrumbs leading to the right resource.

And lastly, my colleagues who gave me valuable feedback like, “Try red!”.

Thank you all!

udiboy1209 (kivy)

How To Use Tiled Maps In KivEnt

This is a tutorial for using the maps module in kivent to load Tiled maps. It covers the steps required to load and display a map on the kivent canvas, but it doesn’t cover how to make a map in Tiled.

Building and installing the module

Make sure you have kivy and its dependencies properly set up and working using these installation instructions.

Clone the kivent repository to obtain the source. The module is currently in a separate branch ‘tiled_gsoc_2016’ so you can clone that branch only.

git clone -b tiled_gsoc_2016 https://github.com/kivy/kivent.git


You can skip the -b tiled_gsoc_2016 if you want the whole repository.

Install kivent_core first. Assuming you cloned it in a dir named ‘kivent’

$ cd kivent/modules/core
$ python setup.py build_ext install


Then install the maps module similarly

$ cd kivent/modules/maps
$ python setup.py build_ext install


It is best to set up kivy and kivent in a virtual environment. Just make sure you use the correct python for the above commands. The module works best with python3, but it works with python2 too.

Setting up the KV file

We need a basic setup of the gameworld and a gameview where we will add the renderers to be displayed. We also need to add the systems which the tiles depend on, like PositionSystem2D, ColorSystem and MapSystem.

PositionSystem2D is necessary for any map because it is responsible for the tile positions. MapSystem holds the relevant data for the map, so it is necessary too. ColorSystem is required if there are shapes in your map which require coloring. And GameView is the canvas where we will render the map's layers.

This is the basic boilerplate KV necessary for rendering the map.

Setting up the Systems

I will start with the basic game app structure of main.py.

We now need to load the systems required for each layer. We will have to specify parameters for them the same way we do it in KV files. We will make 3 dicts, one each for Renderer, PolyRenderer and AnimationSystem, and pass them to the load_map_systems util function to create 4 layers.

We get back a list of renderers and animators. This list can be added to the gameworld init sequence like so. Renderers and animators require specific states to be set, so we have to add these lists while setting states. Modify the corresponding lines with these.

and

These systems need to be rendered from bottom to top to preserve the layer order. And the gameview camera handles rendering of these systems. So we will set the render order for that camera to match layer index. Add this line to __init__.

Next up, we need to populate our systems with entities, and for that we need a TileMap loaded with tile data. This data will be obtained from the TMX file. The util module has a function for loading TMX files and registering them with the map manager.

setup_tile_map() should be added to init_gameworld() so that it is called after gameworld init.

parse_tmx takes the filename of the TMX, loads it to a TileMap, registers it in the map_manager with name as the filename without the extension, and returns that same name.
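The registered name is just the TMX filename with its extension stripped; in Python terms (a trivial sketch of that derivation, not kivent code, with a made-up filename):

```python
import os

def map_name(tmx_filename):
    """Derive the registration name: basename without the extension."""
    # e.g. 'assets/orctown.tmx' -> 'orctown'
    return os.path.splitext(os.path.basename(tmx_filename))[0]

print(map_name('assets/orctown.tmx'))  # orctown
```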

Creating Entities in the GameWorld

All that is left is to create entities from this TileMap. The function for that is init_entities_from_map. It requires the TileMap instance and a reference to the gameworld's init_entity function. It is used like this:

You can add this to setup_tile_map after parse_tmx is called and we have the TileMap.

This is all we require to load a Tiled map in KivEnt.

Thank you and happy tiling!

jbm950 (PyDy)

GSoC Week 14

This is the final week of GSoC! I have a write-up of the work done on a different page.

This project has been a wonderful learning experience and I have met many wonderful people over the course of this summer. Before the summer began I had never written any test code; now, at its conclusion, I have written benchmark tests and I try to make sure the code I write has complete unit-test coverage. I feel that this is one of my biggest areas of improvement over the past 3 months. I have also learned a considerable amount about the combination of programming and dynamics. Prior to GSoC my experience with dynamics consisted solely of finding the equations of motion by hand. During GSoC I have been exposed to finding the equations of motion programmatically and using the results with an ODE integrator. I have also obtained a more in-depth knowledge of different algorithms for forming the equations of motion.

Another thing that I have enjoyed in this experience is seeing how a large programming project works. I feel that this will make me more employable in the future as well as allow me to help any other new open source community get up and running.

Lastly, one of the big highlights of my summer was my trip to Austin, TX for SciPy 2016. Being a GSoC student, I not only got to attend my first professional conference but was able to help give one of the tutorials. I also got to meet many of the other people who work on sympy. This went a long way toward turning my contributions to sympy from contributions to some abstract organization into working on a project with other people. It was also neat to meet the developers of other open source projects; as with sympy, I now see those projects less as "project name" and more as the people actually working on them.

I have enjoyed my GSoC experience and if I were eligible I would not hesitate to apply again. I would like to thank those who chose me and made this summer of learning and experiences possible.

fiona (MDAnalysis)

Coding and Cats - the end of GSoC

Hello again! It's been a while since my last post and there's a lot I'd have liked to talk about more if I'd found the time, but with the end of the coding period only a handful of days away, let's take a look at what I've managed to achieve throughout GSoC.

My GSoC project has been divided into three main parts; you can find all the code I've been writing (including documentation, examples, and discussion!) by following the links below to the pull request I made for each, and I'll also give a little summary of what's possible with each part and the possible future changes and improvements (in addition to general things like more testing and fixing any bugs that pop up). Only one part has reached the stage of merging with MDAnalysis, but the current versions of the other two are at least working and, with a bit more work post-GSoC, will hopefully follow soon!

This part involved adding an ‘auxiliary’ module to MDAnalysis to allow additional timeseries data from a simulation to be read alongside the trajectory. You can read more about it in this post, or to see more about how it’d work in practice, here’s the documentation.

Make bundle

This part (originally planned as a less-general ‘umbrella class’) involved adding a function to let us group together any related simulations and their various ‘metadata’ and ‘auxiliary data’, to make it easier to keep track of and perform analysis across these simulations. You can read more about it in this post.

Run WHAM

The last part involved writing a function to allow WHAM analysis to be performed in MDAnalysis. I didn’t get around to making a post looking at this part in particular, but you can check out the original discussion on github. I mentioned WHAM back in this post; it’s an algorithm we can use to take data from a set of umbrella sampling simulations and calculate the ‘Potential of Mean Force (PMF)’ profile showing the (one-dimensional) energy landscape of the system we’re simulating.

Rather than write a new implementation, I’ve been writing a wrapper for Alan Grossfield’s implementation. Grossfield’s WHAM is run from the command line and uses input files following a particular format. The idea of my wrapper is to allow WHAM to be run in a python script alongside other analysis, remove the need for specific file-formatting, and offer some additional options such as start and end times or changing energy units.

Umbrella Sampling: bringing it all together

The main personal driving force behind my project was to be able to simplify analysis of Umbrella Sampling simulations in MDAnalysis. So how does the work I’ve done pull together to achieve that? Let’s let our final kitties of GSoC demonstrate:

(Can’t remember why we want Umbrella Sampling simulations and PMF profiles? Go back to this post).

Again, there’s still a little work left before this is fully realised within MDAnalysis – but when I find time post-GSoC I’ll definitely be working to get this done!

And finally…

Despite many moments of frustration I've enjoyed GSoC: I've definitely improved my coding skills and I've (mostly) built something I'm likely to actually use for my own research! A huge thanks to my mentors and everyone else at MDAnalysis for help, ideas and discussion along the way.

I had a lot of fun drawing my little cat friends to help make this blog interesting; hopefully it's been fun and informative for you as well! I'd like to keep posting here (though it's likely to be even less frequent), so keep an eye out if you're interested - at the very least, I'll make a post when 'make bundle' and 'run wham' are finished.

Thanks for following along GSoC with me and for putting up with all my cats, and (hopefully) see you again sometime!

August 18, 2016

Ramana.S (Theano)

Summary of Pull Requests

Hi! In this post I will just list the Pull Requests that were made during GSoC 2016.

(This fortnight's blog post will follow tomorrow; this post is just for the purpose of code submission to Google.)

Cheers,
Ramana

chrisittner (pgmpy)

GSoC Project Status Quo

GSoC 2016 is coming to an end, and I've just sent the last PR necessary to complete the scope of my proposal. It has been an exciting project, and I do feel that I learned a lot. I was able to implement a number of basic BN structure estimation algorithms that I had wanted to study for a long time. Once all the reviewing is done, pgmpy will support basic score-based structure learning with the usual structure scores (BIC, BDeu, K2), both exhaustive search and local heuristic search (hill climbing) with tabu search, edge blacklists and whitelists, and indegree restriction.

It will also support basic constraint-based structure learning with conditional chi2 independence tests. I implemented the PC algorithm and a PDAG completion procedure (under review). Finally, MMHC, a hybrid learning algorithm, is also implemented (under review). At the beginning of the project I also worked on Bayesian parameter estimation for BNs.

While the estimators/ folder is already less empty than before, a lot remains to be done. pgmpy learning does not yet have:

• learning for MarkovModels
• learning for continuous networks
• better support for learning from incomplete data
• Chow-Liu tree-structure BN learning: an efficient algorithm that can find the optimal tree network given data. Sounds great; then we get tree-Bayes classification as well.
• Strong documentation & show cases to get more people interested in pgmpy

Tomorrow, I’ll share another post with my key learnings about BNs!

Raffael_T (PyPy)

It's alive! - The final days of GSOC 2016

I promised asyncio with async-await syntax, so here you have it :)
I fixed all the bugs I could find (quite a few, to be exact, which is normally a good sign), and as a result I could run some programs using asyncio with async-await syntax without any error, with the same results CPython 3.5 would give.
I implemented all tests of cpython for the async-await syntax and added some tests to check if everything works together well if combined in a more complex program. It's all working now.
The GIL problem I described earlier was just an incompatibility of asyncio with the PyPy interpreter (pyinteractive). The error does not occur otherwise.
I have been working on the py3.5-async branch in the PyPy repository lately, and that is also where I did all my checks that everything is working. I also merged all my work into the main py3.5 branch. That branch was really broken, though, because there are some major changes from py3k to py3.5. I fixed some things in there, and the PyPy team managed to fix a lot of it as well. My mentor sometimes took time to get important things working so my workflow wouldn't get interrupted. While I was working on py3.5-async, a lot of fixes were done on py3.5, changing the behaviour a bit and (possibly) breaking some other things. I have not checked everything from my proposal on that branch yet, but with some help I think I managed to get everything working there as well. At least it all looks very promising for now.
Next to that, I also managed to fix some issues I found in the unpack feature of my proposal. There were some special cases which led to errors; for example, if a function looks like this: def f(*args, **kwds), and a call like this: f(5, b=2, *[6, 7]), I got a TypeError saying "expected string, got list object". The problem was that certain elements had a wrong order on the stack, so pulling them normally would not work. I made a fix checking for that case; there are probably better solutions, but it seems to work for now.
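For reference, this is the behaviour CPython 3.5 gives for that call shape, which the fix has to reproduce: the plain positional, the unpacked list and the keyword all end up in the right place:

```python
def f(*args, **kwds):
    return args, kwds

# Positional 5, keyword b=2, and the unpacked *[6, 7]:
args, kwds = f(5, b=2, *[6, 7])
print(args, kwds)  # (5, 6, 7) {'b': 2}
```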

I will probably keep working on py3.5 for a little bit longer, because I would like to do some more things with the work of my proposal. It would be interesting to see a benchmark to compare the speed of my work with cpython. Also there is a lot to be done on py3.5 to work correctly, so I might want to check that as well.

Here is a link to all of my work:
https://bitbucket.org/pypy/pypy/commits/all?search=Raffael+Tfirst

My experience with GSOC 2016
It's really been great to work with professional people on such a huge project. My mind was blown as I witnessed how much I was able to learn in this short time. The PyPy team motivated me to achieve the goal of my proposal and to get even more interested in compiler construction than I already was.
I am glad I took the chance to participate in this year's GSoC. My luck in having such a helpful and productive team on my side is one of the main reasons I enjoyed it so much.

Aakash Rajpal (italian mars society)

Final Week Final Blog!!

Hey everyone! As the project submission deadline nears, I have been working to finish my project. The Heads-Up Display (HUD) is rendering perfectly fine on the Oculus, interacting with the Leap Motion and receiving data from the server. Until now, I was able to integrate the HUD with my own demo model in Blender and simulate it on the Oculus. The final step involved integrating the HUD with one of the model scenes available in the IMS V-ERAS repository.

This was a difficult task: the Italian Mars Society initially simulated models through the Blender Game Engine for the DK1, but as Oculus support for Blender was very limited, they decided to shift to Unity instead. Currently they are working only in Unity, and as I am working under the PSF organization, using Unity was not an option. Also, since a lot has changed from Rift DK1 to DK2, most of the models were unable to render successfully on the Oculus.

I had a little chat with my mentor about this issue, and he asked me to integrate the HUD with the models that rendered fully on the DK2 and forget about the other ones. After that, I found an avatar model and began work to integrate the HUD onto it.

A couple of days later, I was able to render the HUD onto the Oculus with one of the V-ERAS models through the Blender Game Engine, and the result looks good.

Currently, I am trying to make the HUD an addon in Blender Game Engine so that it can be imported into any model/scene and render successfully on the Oculus.

As for the final submission, I have started on the documentation and hopefully I will submit by the end of this week.

Final Post

Preetwinder (ScrapingHub)

GSoC-6

Hello,
This post continues my updates for my work on porting frontera to python2/3 dual support.
This will be my final post. I have achieved my goal satisfactorily: frontera now works under both Python 2 and Python 3 in all the different scenarios, such as with different DBs (Postgres, MySQL, HBase) and with different message bus implementations (Kafka, ZMQ). The latest release of Frontera, 0.6.0, now available on PyPI, contains these changes.

I have also substantially increased the test coverage; now almost all the major components (Workers, Backends, Manager, etc.) are tested.
Here is a link to all of my commits

This period of 3 months has been invaluable to me, and I have learned a lot during it. I am greatly thankful to my mentors Alexander Sibiryakov and Paul Tremberth for working with me and helping me through this period. I would also like to thank my sub-organization Scrapinghub and my organization Python Software Foundation for selecting me. Finally, I would like to thank Google for providing this opportunity to me and others.

GSoC-6 was originally published by preetwinder at preetwinder on August 18, 2016.

Vikram Raigur (MyHDL)

Backend Top Module

The Back-end Top module connects quantizer, RLE, Huffman and Byte-Stuffer Modules.

The back-end has a small FSM sitting inside which makes all the modules run in parallel:
when the Byte-Stuffer sends a ready signal to Huffman, Huffman sends ready to the RLE, and the RLE sends a ready signal to the quantizer.

The back-end has an input buffer attached which takes data from the front-end. The buffer has a size of 3×64, so that the front-end does not have to wait for the back-end to be ready.

Among the back-end modules, the Huffman module takes around 80 clock cycles, so each block has to wait until those cycles finish.

ByteStuffer

The Byte Stuffer module checks for 0xFF bytes and appends a 0x00 after each one. The module is implemented and already merged into the main repo.
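The stuffing rule itself is simple: insert a 0x00 after every 0xFF data byte, so that a decoder never mistakes encoded data for a JPEG marker. A software sketch of the rule (the actual module is MyHDL hardware, not this Python):

```python
def byte_stuff(data):
    """Insert 0x00 after each 0xFF byte (JPEG byte stuffing)."""
    out = bytearray()
    for b in data:
        out.append(b)
        if b == 0xFF:
            out.append(0x00)
    return bytes(out)

print(byte_stuff(b'\x12\xff\x34').hex())  # 12ff0034
```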

Prediction and forecasting

The last challenge I faced during GSoC was to implement Kim prediction and forecasting. At first it appeared to be quite difficult, because I dealt with both mathematical and architectural issues. Here's a short overview of the subject.

Prediction maths

Kalman filter in Statsmodels has three modes of prediction:
• Static prediction. This is prediction of the next observation based on current. This type of prediction is equivalent to usual Kalman filtering routine.
• Dynamic prediction. I still don't get its purpose.
• Forecasting. Mathematically speaking, forecast of the out-of-sample observation is its expectation conditional on known observations.
My goal was to implement two of these types, static prediction and forecasting, in the case of switching regimes, i.e. to construct a Kim prediction and forecasting device.
I didn't have any problems with static prediction, because it's equivalent to Kim filtering. But the forecasting issue is not covered in [1], so I had to come up with my own mathematical routine, calculating the future data expectation and covariance conditional on known observations. Basically it's just a Kim filter, but without the Hamilton filtering step and with the underlying Kalman filters working in prediction mode.
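In Kim & Nelson's notation [1], such a forecast mixes the per-regime Kalman forecasts with the predicted regime probabilities (a sketch of the idea, not the exact implementation):

```latex
E[y_{T+h} \mid Y_T] = \sum_{j} \Pr(S_{T+h} = j \mid Y_T)\,
    E[y_{T+h} \mid S_{T+h} = j,\; Y_T],
\qquad
\Pr(S_{T+h} \mid Y_T) = P^h \Pr(S_T \mid Y_T),
```

where $P$ is the regime transition matrix, so the regime probabilities are simply propagated forward without any Hamilton-filter update.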

Prediction code

My laziness pushed me to write less code and reuse the existing. The idea was to somehow reuse Kalman filter prediction, but when I came up with the correct forecast routine I realized that it wasn't possible. So I had to implement the whole routine myself; it is located in the kim_filter.KimPredictionResults._predict method. Luckily the prediction code is much more compact than the filter's. Also I didn't have to care as much about optimality, since prediction doesn't take part in likelihood optimization.

Some nice pics

Since no test data is available for forecasting, I used IPython notebooks as a sort of visual testing.
I added pictures of static (one-step-ahead) prediction and forecasting to the Lam's model and MS-AR model notebooks; they look sensible and my mentor liked them:
(this is for Lam's model)
(and that's for MS-AR)
The forecast's variance is constant because it quickly reaches a stationary value.

GSoC: Summary

I'm proud to say that I've completed almost all items of my proposal, except for constructing a generic Kim filter with an arbitrary number r of previous states to look up. But this is due to performance problems, which would require more time than GSoC permits to solve. In detail, the implemented pure-Python r=2 case works slowly and needs to be rewritten in Cython.
Anyway, a good advantage of my Kim filter implementation, as mentioned by my mentor, is using logarithms to work with probabilities. It gives a significant improvement in precision, as I conclude from testing.
A broad report on what's completed and what's not can be found here in github comments.
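Adding probabilities in log-space typically relies on the log-sum-exp trick; a generic sketch of why it matters (not the actual statsmodels code):

```python
import math

def log_sum_exp(log_probs):
    """log(sum(exp(x) for x in log_probs)), computed stably."""
    m = max(log_probs)
    return m + math.log(sum(math.exp(x - m) for x in log_probs))

# Two tiny probabilities that underflow to 0.0 in linear space...
a, b = -1000.0, -1001.0
print(math.exp(a) + math.exp(b))   # 0.0 (underflow)
print(log_sum_exp([a, b]))         # about -999.69, still meaningful
```

Filtering recursions multiply and sum many such small probabilities, which is where the precision gain comes from.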

GSoC: My impressions

GSoC has surely increased my level of self-confidence. I've done a lot of nice work, written 10k lines of code (I was expecting much less, to be honest), and met many nice people and students.
I have to admit that GSoC turned out to be easier than I thought. The most difficult and nervous part of GSoC was building a good proposal. I remember that I learned a lot of new material in a very short time; I even had to read books about state space models during vacation, from my smartphone, sitting on a plane or a subway train.
I also started working on my project quite early, at the beginning of May. So I had about 60% of my project completed by the midterm, even though I hadn't yet started working full time, because my school and exams only finished at the end of June.
I worked hard through July, spending days in the local library, but still, I think I never worked 8 hours a day. Eventually, by the beginning of August I had completed almost everything; the only thing left was prediction and forecasting, discussed earlier in this post.
I had dreamed about taking part in GSoC since I was a sophomore, and I'm glad I finally did it. The GSoC code I produced is definitely the best work I've ever done, but I hope to do more in the future.
Thanks for reading my blog! It was created for GSoC needs only, but I think I will continue writing as I get something interesting to tell.

Literature

[1] "State-space Models With Regime Switching" by Chang-Jin Kim and Charles R. Nelson.

Shridhar Mishra (italian mars society)

Update! on 10 July-Tango installation procedure for windows 10.

Tango installation on Windows can be a bit confusing, since the documentation on the website is old and there are a few changes with the new versions of MySQL.

Here's the installation procedure that can be helpful.

1 : Installation of MySQL
In this installation process MySQL 5.7.11 is installed; newer versions can also be installed.
During installation we cannot select a manual destination folder, and MySQL is installed in C:\Program Files (x86)\MySQL by default.
During installation it is mandatory to set at least a 4-character password for root, which wasn't the case for previous versions.
This is against the recommendation from tango-controls, which was specific to older versions of MySQL.

2 : Installation of TANGO

Execute the installer. You should specify the destination folder created before: 'c:\tango'.
After installation you can edit the MySQL password.

3 : Configuration

• 3-1 TANGO-HOST

Define a TANGO_HOST environment variable containing both the name of the host on which you plan to run the TANGO database device server (e.g. myhost) and its associated port (e.g. 20000). With the given example, the TANGO_HOST should contain myhost:20000.

On Windows you can do it simply by editing the properties of your system. Select 'Advanced', 'Environment variables'.

• 3-2 Create environment variables.

Two new environment variables have to be created to run create-db.bat:
• 3-2-1 MYSQL_USER

This should be root.

• 3-2-2 MYSQL_PASSWORD

Fill in the password which was used during the MySQL installation.

• 3-3 MODIFY PATH

Add this to windows path for running sql queries.
C:\Program Files (x86)\MySQL\MySQL Server 5.7\bin

• 3-4 Create the TANGO database tables

Be sure the MySQL server is running; normally it should be.
Execute %TANGO_ROOT%\share\tango\db\create_db.bat.

• 3-5 Start the TANGO database:

Execute %TANGO_ROOT%\bin\start-db.bat -v4.
The console should show
"Ready to accept request" on successful installation.

• 3.6 Start JIVE

Now you can test TANGO with the JIVE tool, from the Tango menu, or by typing the following command on a DOS windows :
%TANGO_ROOT%\bin\start-jive.bat

Ref: http://www.tango-controls.org/resources/howto/how-install-tango-windows/

Aron Barreira Bordin (ScrapingHub)

Scrapy-Streaming [8/9] - Scrapy With Node.js

Hi everyone,

In these weeks, I’ve added support for Node.js Language on Scrapy Streaming. Now, you can develop scrapy spiders easily using the scrapystreaming npm package.

It’s a helper library to help the development process of external spiders in Node.js It allows you to create the scrapy streaming json messages.

Examples

I’ve added a few examples about it, and a quickstart section in the documentation.

PRs

Aron.

Scrapy-Streaming [8/9] - Scrapy With Node.js was originally published by Aron Bordin at GSoC 2016 on August 18, 2016.

August 17, 2016

liscju (Mercurial)

Work Product submission

So GSoC is nearly over; it's time to make a summary of what was done:

• all things mentioned in the proposal were done
• there is support for having large files stored outside of the repository, in a remote location
• the remote location destination can be statically stored in a configuration file
• the remote location destination can be generated dynamically with a hook
• for now there is support for storing large files in a remote location via
• local file system
• http
• https
• it's easy to write support for a new protocol for communicating with the remote destination server
• the solution works for clients with old versions of Mercurial
• a repository with the redirection feature turned on doesn't need to store any large files locally; the only occasion on which a file would be downloaded is when the user does things like update/commit
• nearly all the code is located in a single Python file, which keeps the feature's functionality from mingling with the standard largefiles code
• I was blogging about my experience during GSoC at http://liscjugsoc.blogspot.com/
• I created a user manual for the feature:
• I created technical documentation for the project

All commits can be seen here:

Or all commits in single patch file:

What needs to be done:
• Merge into the Mercurial source code :P

Levi John Wolf (PySAL)

Formally Winding down GSOC

This week, I’ve been really bringing the work on GSOC to a close. Thus, I’ve linked to a notebook where I walk through the various work I’ve done in the project.

What a ride.

mkatsimpris (MyHDL)

Work Product

In the following post the overall work done during GSoC is summarized.

Completed Work:
• Color Space Conversion module with parallel-input and parallel-output interface
• Color Space Conversion module with serial-input and serial-output interface
• 1D-DCT module
• 2D-DCT module
• Zig-Zag Scan module
• Complete front-end part of the encoder
• Block Buffer
• Input Block Buffer

All the …


Riddhish Bhalodia (dipy)

Last Blog!

So we are near the end of Google Summer of Code (GSoC) 2016, and I am very excited that my work will help researchers and students using DIPY. I had a really wonderful experience working with my mentors and DIPY. GSoC has definitely strengthened my concepts, and made me a better programmer and a better researcher than before. I would like this blog to summarize my GSoC work, where I will try to answer the following questions:

1. What are the algorithms/programs that I have implemented in this GSOC?
2. What are the aims that were met?
3. How different are the initial proposal and the final outcome? And so on.

The links to all my code

This is a list of comprehensive links to all the merged as well as ongoing pull requests (PRs) and the commits to those PRs:

PR 1: Local PCA Slow (just the python implementation of [1]) [Soon to be MERGED].
All commits for PR1

PR2: Local PCA fast (with the cython implementation) [Resolving a bug]
All commits for PR2

PR3: Adaptive Denoising (implemented and optimized [2]) [MERGED]
All commits for PR3

PR4: Brain Extraction [Soon to be MERGED]
All commits for PR4

PR5: MNI Template Fetcher [MERGED]
All commits for PR5

Throughout the blog the above mentioned PRs will be referred by their PR numbers.

Let me start by describing my initial proposal

The initial proposal

There were two primary aims of my GSOC proposal:

1. Local PCA and its Optimization

I had to implement the method described in [1] in Python for DIPY, optimize the algorithm in Cython to achieve a higher speedup, and test it on several different datasets and MRI scans.

2. Brain Extraction Method and its Implementation

We had to come up with a plausible idea for a brain extraction method suitable for DIPY and implement it.

The Final Output

I am pleased to say that we have managed to meet most of the goals set by the GSOC proposal and more. Breaking it into pieces:

1. Python implementation of Local PCA [1] (DONE) [PR1]
2. Cython implementation of Local PCA [1] (DONE except for the bug described in next section) [PR2]
3. Tests and Documentation for the Local PCA [1] (DONE) [PR1, PR2]
4. The local PCA algorithm was tested for several different datasets like:
• Single shell Stanford HARDI Data
• Multi shell Sherbrooke Data
• Multi-shell Human Connectome data
• General Electric (GE) diffusion MRI data

5. Nlmeans (non-local means) block-wise averaging and its optimization [2,3] (DONE) [PR3] (Adaptive denoising PR)
6. Tests and documentation of nlmeans block and adaptive soft coefficient matching [1] (DONE) [PR3] (Adaptive denoising PR). Tests were done for T1 as well as diffusion data, and the method showed significant improvement in both cases.

7. Brain extraction method designed (DONE) [PR4]
8. Brain extraction implemented in Python (DONE) [PR4]
9. Brain extraction experiments for different datasets (some remaining):
• The experiments were done using the IBSR MRI dataset, which also includes manual segmentations. It was chosen because it is easily accessible and the extraction result can be compared with the manually extracted one, giving us a metric to gauge the efficiency of the algorithm.
• The metric we used to compare manual and automatic extraction is the Jaccard index, and we observed a mean Jaccard index of 0.84 for three subjects from the IBSR data.
• The experiment was done using the T1 (MNI) template with T1 input data. To test how a change in modality affects the extraction, we also tried it with the T1 template and input data of a different modality (like B0), and it worked out really well.
10. Brain extraction speedup (DONE) [PR4]
11. Tests and documentation of brain extraction (almost DONE) [PR4]

Note: As there are a lot of results, you can look at each one of them on the blog ( riddhishbgsoc2016.wordpress.com ), where the post titles are easy to follow.

Here is the list of all my blog posts for GSOC 2016; it gives a more comprehensive explanation of each milestone:

Immediate Next Steps

A] Fix the fast bug

I have successfully implemented the fast local PCA in Cython [PR2], but upon extensive testing we found a strange bug. The fast Cython implementation takes less time than the Python-only version in most cases, but on certain systems (some laptops) it does not! This system-specific inconsistency is something we have to resolve fast, possibly within a couple of weeks.

B] More experimentation and testing of brain extraction

We need to experiment with the adaptive denoising algorithm when used in combination with brain extraction and see how that affects performance. We also want to test the brain extraction under several different combinations of input and template modalities, to see how robust our implementation is.

C] Improve the brain extraction tutorial, and implement one more data fetcher for a smooth understanding of the tutorial.

D] Make the Cython implementation of local PCA even faster (which may come under the umbrella of fixing the bug described above).

E] Merge all the PRs.

Future Directions to Follow

A] OpenMP-based multithreading for both adaptive denoising and local PCA, which can yield an even faster implementation of the local PCA.

B] Short publications for local PCA and brain extraction. The projects led to some exciting directions and we believe that we can have some kind of research output from them. I will keep working with the DIPY team on these projects and will keep this blog updated on whatever new updates we have in this regard.

References

[1] Diffusion Weighted Image Denoising Using Overcomplete Local PCA.
Manjón JV, Coupé P, Concha L, Buades A, Collins DL, et al. PLoS ONE 8(9): e73021, 2013.

[2] Adaptive Multiresolution Non-Local Means Filter for 3D MR Image Denoising.
Pierrick Coupé, José Manjón, Montserrat Robles, Louis Collins.
IET Image Processing, Institution of Engineering and Technology, 2011.

August 16, 2016

chrisittner (pgmpy)

PC constraint-based BN learning algorithm

For the past while I have been working on basic constraint-based BN learning. This required a method to perform conditional independence tests on the data set. Surprisingly, such tests for conditional independence are not part of scipy.stats or other statistics libraries.

To test if X _|_ Y | Zs, one has to manually construct the frequencies one would expect if the variables were conditionally independent, namely $$P(X,Y,Zs)=P(X|Zs)\cdot P(Y|Zs)\cdot P(Zs)$$ and compare them with the observed frequencies, using e.g. a $$\chi^2$$ deviance statistic (provided by scipy.stats). Expected frequencies can be computed as $$\frac{P(X, Zs)\cdot P(Y, Zs)}{P(Zs)}$$, so one can start with a joint state_count/frequency table, marginalize out $$X$$, $$Y$$, and both, and compute the expected distribution from the margins.
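The margin-based construction above can be sketched in a few lines of numpy/pandas. This is an illustrative toy version (the function name and shape are mine, not pgmpy's API): for each configuration of Zs, it compares the observed X-Y contingency table with the expected counts N(x, Zs) * N(y, Zs) / N(Zs) and sums the chi-square contributions.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def chi_square_ci_test(data, X, Y, Zs):
    """Toy chi^2 test of X _|_ Y | Zs on a DataFrame of discrete columns."""
    stat, dof = 0.0, 0
    groups = data.groupby(Zs) if Zs else [(None, data)]
    for _, grp in groups:
        observed = pd.crosstab(grp[X], grp[Y]).values  # N(x, y) within this Zs-stratum
        total = observed.sum()
        if total == 0:
            continue
        # expected counts under independence: N(x, Zs) * N(y, Zs) / N(Zs)
        expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / total
        mask = expected > 0
        stat += ((observed - expected) ** 2 / np.where(mask, expected, 1))[mask].sum()
        dof += (observed.shape[0] - 1) * (observed.shape[1] - 1)
    p_value = 1 - chi2.cdf(stat, max(dof, 1))
    return stat, p_value
```

A high p-value means the conditional independence hypothesis cannot be rejected at the chosen significance level.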

Once such a testing method is in place, the PC algorithm can be used to infer a partially directed acyclic graph (PDAG) structure to capture the dependencies in the data, in polynomial time. Finally, the PDAG can be fully oriented and completed to a Bayesian network. The implementation looks like this:

Methods of the ConstraintBasedEstimator class:

test_conditional_independence(self, X, Y, Zs=[])

Chi-square conditional independence test (PGM book, 18.2.2.3, page 789)

build_skeleton(nodes, independencies)

Build undirected graph from independencies/1st part of PC algorithm (PGM book, 3.4.2.1, page 85, like Algorithm 3.3)

skeleton_to_pdag(skel, seperating_sets)

Orients compelled edges of skeleton/2nd part of PC (Neapolitan, Learning Bayesian Networks, Section 10.1.2, page 550, Algorithm 10.2)

pdag_to_dag(pdag)

Method to (faithfully) orient the remaining edges to obtain BayesianModel (Implemented as described here on page 454 last paragraph (in text)).

Finally three methods that combine the above parts for convenient access:

estimate(self, p_value=0.01)

-> returns BayesianModel estimate for data

estimate_skeleton(self, p_value=0.01)

-> returns UndirectedGraph estimate for data

estimate_from_independencies(nodes, independencies)

-> static, takes set of independencies and estimates BayesianModel.

Examples:

import pandas as pd
import numpy as np
from pgmpy.base import DirectedGraph
from pgmpy.estimators import ConstraintBasedEstimator
from pgmpy.independencies import Independencies

data = pd.DataFrame(np.random.randint(0, 5, size=(2500, 3)), columns=list('XYZ'))
data['sum'] = data.sum(axis=1)
print(data)

# estimate BN structure:
c = ConstraintBasedEstimator(data)
model = c.estimate()
print("Resulting network: ", model.edges())


Output:

      X  Y  Z  sum
0     1  3  4    8
1     3  3  0    6
2     4  4  1    9
...  .. .. ..  ...
2497  0  4  2    6
2498  0  3  1    4
2499  2  1  3    6

[2500 rows x 4 columns]

Resulting network: [('Z', 'sum'), ('X', 'sum'), ('Y', 'sum')]


Using parts of the algorithm manually:

# some (in)dependence tests:
data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
data['E'] = data['A'] + data['B'] + data['C']
c = ConstraintBasedEstimator(data)

print("\n P-value for hypothesis test that A, C are dependent: ",
      c.test_conditional_independence('A', 'C'))
print("P-value for hypothesis test that A, B are dependent, given D: ",
      c.test_conditional_independence('A', 'B', 'D'))
print("P-value for hypothesis test that A, B are dependent, given D and E: ",
      c.test_conditional_independence('A', 'B', ['D', 'E']))

# build skeleton from list of independencies:
ind = Independencies(['B', 'C'], ['A', ['B', 'C'], 'D'])
ind = ind.closure()
skel, sep_sets = ConstraintBasedEstimator.build_skeleton("ABCD", ind)
print("Some skeleton: ", skel.edges())

# build PDAG from skeleton (+ sep_sets):
data = pd.DataFrame(np.random.randint(0, 4, size=(5000, 3)), columns=list('ABD'))
data['C'] = data['A'] - data['B']
data['D'] += data['A']
c = ConstraintBasedEstimator(data)
pdag = c.skeleton_to_pdag(*c.estimate_skeleton())
print("Some PDAG: ", pdag.edges())  # edges: A->C, B->C, A--D (not directed)

# complete PDAG to DAG:
pdag1 = DirectedGraph([('A', 'B'), ('C', 'B'), ('C', 'D'), ('D', 'C'), ('D', 'A'), ('A', 'D')])
print("PDAG: ", pdag1.edges())
dag1 = ConstraintBasedEstimator.pdag_to_dag(pdag1)
print("DAG:  ", dag1.edges())


Output:

P-value for hypothesis test that A, C are dependent:  0.995509460079
P-value for hypothesis test that A, B are dependent, given D:  0.998918522413
P-value for hypothesis test that A, B are dependent, given D and E:  0.0
Some skeleton:  [('A', 'D'), ('C', 'D'), ('B', 'D')]
Some PDAG:  [('A', 'C'), ('A', 'D'), ('D', 'A'), ('B', 'C')]
PDAG:  [('A', 'D'), ('A', 'B'), ('C', 'D'), ('C', 'B'), ('D', 'A'), ('D', 'C')]
DAG:   [('A', 'B'), ('C', 'B'), ('D', 'A'), ('D', 'C')]


TaylorOshan (PySAL)

Wrapping Up

In the last week of the GSOC work, the focus is on wrapping everything up, which means finalizing code and documentation, providing useful examples for educational use, and reflecting on the entire project to provide a plan on how to continue to grow the project beyond GSOC 2016.

The finalized code and documentation will be reflected in the project itself, whereas the educational materials take the form of a jupyter notebook that demonstrates various features of the project on a real-life dataset (NYC Citi Bike share trips). The notebook can be found here. Other experiments and prototyping notebooks can be found in this directory.

In order to systematically reflect on the progress made throughout the project, I will now review the primary features that were developed, linking back to pertinent blog posts where possible.

API Design and The SpInt Framework

The primary API consists of four user-exposed classes: Gravity, Production, Attraction, and Doubly, which all inherit from a base class called BaseGravity. All of these classes can be found in the gravity script. The user classes accept the appropriate inputs for each of the four types of gravity-based spatial interaction model: basic gravity model, production-constrained (origin-constrained), attraction-constrained (destination-constrained), and doubly constrained. The BaseGravity class does most of the heavy lifting in terms of preparing the appropriate design matrix. For now, BaseGravity inherits from CountModel, which is designed to be a flexible generalized linear model class that can accommodate several types of count models (i.e., Poisson, negative binomial, etc.) and several different types of parameter estimation (i.e., iteratively weighted least squares, gradient optimization, etc.). In reality, CountModel currently supports only Poisson GLMs (based on a customized implementation of the statsmodels GLM) and iteratively weighted least squares estimation, which will be discussed further later in this review. In addition, the user may have a continuous dependent variable, say trade flows in dollars between countries, and therefore might want to use a non-count model, like Gaussian ordinary least squares. Hence, it may make more sense in the future to move away from CountModel and just have the BaseGravity class do the necessary dispatching to the appropriate probability model/estimation techniques.

Related blog post(s): post one; post two

Sparse Compatibility

Because the constrained varieties of gravity models (Production, Attraction, Doubly) require either N or 2N categorical variables (fixed effects), where N is the number of locations that flows may move between, a very large sparse design matrix is necessary for any non-trivial dataset. Therefore, a large amount of effort was put into efficiently building the sparse design matrix, specifically the sparse categorical variables. After much testing and benchmarking, a function was developed that can construct the sparse categorical-variable portion of the design matrix relatively quickly. This function is particularly fast if the set of locations, N, is indexed using integers, though it is still efficient if unique locations are labeled using string identifiers. The sparse design matrix allows more efficient model calibration. For example, a spatial system akin to all of the counties in the US (~3k locations or ~9 million observations) requires less than a minute to calibrate a production-constrained model (3k fixed effects) and about two minutes for a doubly-constrained model (6k fixed effects) on my notebook. It was decided to use normal arrays for the basic Gravity model, since it has no fixed effects by default, its matrices are dense, and sparse matrices therefore become inefficient as the number of observations grows.
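The sparse categorical (fixed-effects) construction can be sketched with scipy.sparse. This is an illustrative stand-in, not the actual SpInt function, but it shows why building the matrix is cheap for both integer and string location labels: each row gets exactly one nonzero entry.

```python
import numpy as np
from scipy import sparse

def sparse_categorical(labels):
    """Build an n x N sparse one-hot fixed-effects matrix from
    location labels (integers or strings)."""
    labels = np.asarray(labels)
    # map each label to an integer code; works for ints and strings alike
    uniq, codes = np.unique(labels, return_inverse=True)
    n = len(labels)
    # exactly one nonzero per row: a 1 in the column of that row's location
    return sparse.csr_matrix((np.ones(n), (np.arange(n), codes)),
                             shape=(n, len(uniq)))

X = sparse_categorical(['nyc', 'la', 'nyc', 'chi'])  # 4 flows, 3 locations
```

Because the matrix is stored in CSR form, memory use scales with the number of observations rather than with n x N.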

Related blog post(s): post three; post four; post five

Testing and Accounting for Overdispersion

Since the number of potential origin-destination flow observations grows exponentially as the number of locations increases, and because there are often no trips occurring between many locations, spatial interaction datasets can often be overdispersed. That is, there is more variation in the data than would be expected given the mean of the observations. Therefore, several well-known tests for overdispersion (Cameron & Trivedi, 2012) in the context of count (i.e., Poisson) models were implemented. In addition, a QuasiPoisson family was added that can be activated using the Quasi=True parameterization. If it is decided to accommodate probability models other than Poisson, as previously discussed, then the Quasi flag would be replaced with a family parameter that could be set to Gaussian, Poisson, or QuasiPoisson. The purpose of the QuasiPoisson formulation is to calibrate a Poisson GLM that allows the variance to differ from the mean, equality of which is a key assumption of the default Poisson model.
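One common overdispersion check in the Cameron & Trivedi family is an auxiliary regression: regress ((y - mu)^2 - y) / mu on the fitted means with no intercept, and test whether the slope is significantly positive. The helper below is my own rough sketch of that idea, not the SpInt implementation.

```python
import numpy as np

def dispersion_test(y, mu):
    """Auxiliary-regression overdispersion test for a Poisson fit.
    y: observed counts, mu: fitted Poisson means.
    Returns (alpha, t): a positive, significant t suggests overdispersion."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # z_i = ((y_i - mu_i)^2 - y_i) / mu_i has expectation 0 under equidispersion
    z = ((y - mu) ** 2 - y) / mu
    # OLS slope of z on mu with no intercept, plus its t-statistic
    alpha = np.sum(z * mu) / np.sum(mu ** 2)
    resid = z - alpha * mu
    se = np.sqrt(np.sum(resid ** 2) / (len(y) - 1) / np.sum(mu ** 2))
    return alpha, alpha / se
```

For genuinely Poisson data the t-statistic hovers around zero; for data with variance well above the mean (e.g. negative binomial counts) it becomes large and positive.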

Related blog post(s): post six

Exploratory Spatial Data Analysis for Spatial Interaction

To explore the spatial clustering of raw spatial interaction data, an implementation of Moran's I spatial autocorrelation statistic for vectors was completed. Then several experiments were carried out to test the different randomization techniques that could be used for hypothesis testing of the computed statistic. This analytic is good for exploring your data before you calibrate a model, to see if there are spatial associations above and beyond what you might expect from otherwise random data. More will be said about the potential to expand this statistic at the end of this review in the 'Unexpected Discoveries' section.

Related blog post(s): post seven

Investigating Non-Stationarity in Spatial Interaction

A method called local() was added to the Gravity, Production, and Attraction classes that allows the models to be calibrated separately for each location, so that a set of parameter estimates and associated diagnostics is acquired for individual subsets of the data. These results can then be mapped, either using python or other conventional GIS software, in order to explore how relationships change over space.

Related blog post(s): post seven

Spatial Weights for Spatial Interaction

In order to carry out the vector-based spatial autocorrelation statistics, as well as various types of spatial autoregressive model specifications, it is necessary to define spatial associations between flows using a spatial weight matrix. To this end, three types of spatial weights were implemented, which can be found in the spintW script in the weights module. The first is an origin-destination contiguity-based weight that encodes two flows as neighbors if they share either an origin or a destination. The second weight is based on a 4-dimensional distance (origin x, origin y, destination x, destination y), where the strength of the association decays as distance increases. Finally, network-based weights use different types of adjacency between flows represented as an abstract or physical network.
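The origin-destination contiguity rule can be sketched as a tiny brute-force helper (hypothetical, not the spintW API): two flows are neighbors if they share an origin or a destination.

```python
def od_contiguity_neighbors(flows):
    """flows: dict mapping a flow id to its (origin, destination) pair.
    Returns a dict mapping each flow id to the ids of its OD-contiguous
    neighbors, i.e. flows sharing an origin or a destination."""
    neighbors = {f: [] for f in flows}
    for a, (oa, da) in flows.items():
        for b, (ob, db) in flows.items():
            # a flow is never its own neighbor
            if a != b and (oa == ob or da == db):
                neighbors[a].append(b)
    return neighbors

flows = {'f1': ('A', 'B'), 'f2': ('A', 'C'), 'f3': ('D', 'E')}
# f1 and f2 share origin 'A'; f3 shares nothing with the others
```

A production implementation would build this as a sparse matrix rather than nested loops, but the neighbor definition is the same.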

As part of this work I also had the opportunity to contribute some speed-ups to the DistanceBand class in the Distance script of the weights module, so that it avoids a slow loop and can leverage both dense and sparse matrices. In the case of the 4-dimensional distance-based spatial weight, the resulting weight matrix is not sparse, so the existing code could become quite slow. Now it is possible to set the boolean parameter build_sp=False, which will be more efficient when the distance-weighted entries of the weight matrix are increasingly non-zero.

Related blog post(s): post eight

Accounting for Spatial Association in Spatial Interaction Models

It has recently been proposed that, due to spatial autocorrelation in spatial interaction data, it is necessary to account for the spatial association, otherwise the estimated parameters could be biased. The proposed solution was a variation of the classic spatial autoregressive (i.e., spatial lag) model for flows, which can estimate an additional parameter to capture spatial autocorrelation in the origins, in the destinations, and/or in a combination of the origins and destinations (LeSage and Pace, 2008). Unfortunately, no code was released to support this model specification, so I attempted to implement this new model. I was not able to completely replicate it, but I was able to extend the existing pysal ML_Lag model to estimate all three autocorrelation parameters, rather than just one. I have also attempted to re-derive the appropriate variance-covariance matrix, though this will take some more work before it is completed. More on this in the 'Moving Forward' section found below.

Related blog post(s): post nine

Assessing Model Fit for Spatial Interaction Models

Several metrics were added for assessing model fit. These include McFadden's pseudo r-squared (based on likelihood ratios), the adjusted McFadden's pseudo r-squared to account for model complexity, D-squared or percent deviance (based on the deviance ratio) and its adjusted counterpart, the standardized root mean square error (SRMSE), and the Sorensen similarity index (SSI). The D-squared statistics and the pseudo r-squared statistics are properties of the GLM class, while the SRMSE and SSI metrics have been added as properties of the BaseGravity class. However, the functions to compute the SSI and SRMSE are stored in the utils script, since they may also be useful for deterministic non-gravity-type models that could be implemented in the future.
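As a concrete example, the SRMSE is just the root mean square error of the flows standardized by the mean observed flow. This is my own sketch of the formula, not the utils implementation:

```python
import numpy as np

def srmse(observed, predicted):
    """Standardized root mean square error: RMSE divided by the
    mean observed flow, so the value is scale-free."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((observed - predicted) ** 2)) / observed.mean()
```

A perfect fit gives 0; larger values indicate worse fit relative to the average flow magnitude.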

Related blog post(s): post ten

Unexpected Discoveries

While implementing the vector-based spatial autocorrelation statistic, it was noticed that one of the randomization techniques is not exactly random, depending on the hypothesis that one would like to test. In effect, when you produce many permutations and compare your test statistic to the resulting distribution of values, you find that you always reject your statistic. Therefore, there is additional work to be done here to further define the different possible hypotheses and the appropriate randomization techniques.

Leftovers and Moving Forward

Future work will consist of completing the spatial interaction SAR model specification. It will also include adding gradient-based optimization of likelihood functions, rather than solely iteratively weighted least squares. This will allow the implementation of other model extensions such as the zero-inflated Poisson model. I was not able to implement these features because I was running short on time and decided to work on the SAR model, which turned out to be more complicated than I originally expected. Finally, future work will also incorporate deterministic models, spatial effects such as competing destinations and eigenvector spatial filters, and neural network spatial interaction models.

GSoC week 12 roundup

@cfelton wrote:

This is the last roundup, as posted in previous GSoC roundups the GSoC program has outlined that all students have their final code committed by 20-Aug. If you have not committed your final code make sure to do so in the next couple days and prepare your final blog post that will be used in your evaluation submission.

IMPORTANT NOTE TO MENTORS
All mentors need to provide a summary of the student's final evaluation to me (@cfelton) via email by 22-Aug. The assigned mentors were never corrected in the GSoC system, so I will need to complete all the final evaluations again. Because of schedule conflicts and PSF requirements I will be completing all the evaluations by the 23rd, so please provide the final review as soon as possible.

** Student final project submission **
Students, make sure to publish your final blog post; it should review what you completed and what is outstanding. I should be able to easily understand what is working in the projects, what is missing, and what doesn't work. This detailed final post is required for a passing evaluation.

The GSoC work product submission guidelines outline what the final post should have. Take the time required to generate the final blog post that you will link in your submission, it should have:

1. Description of the work completed.
2. Any outstanding work if not completed.
3. Link to the main repository.
4. Links to the PRs created during the project.

Review the submission guidelines page in detail.

The idea of GSoC isn't that students churn out code -- it's important that the code be potentially useful to the hosting Open Source project!

Also make sure the README on the project repositories is complete, it should give an overview of the project and instructions for a user to get started: install, run tests, the core interfaces, and basic functional description.

Student week12 summary (last blog, commits, PR):

jpegenc:
health 87%, coverage 97%
@mkatsimpris: 12-Aug, >5, Y
@Vikram9866: 07-Aug, >5, Y

riscv:
health 96%, coverage 51%
@meetsha1995: 11-Aug, >5, Y
@srivatsan: 14-Aug, >5, N

gemac:
health 93%, coverage 92%
@ravijain056, 02-Aug, >5, N

Links to the student blogs and repositories:

Merkourious, @mkatsimpris: gsoc blog, github repo
Vikram, @Vikram9866: gsoc blog, github repo
Meet, @meetshah1995: gsoc blog, github repo
Srivatsan, @srivatsan-ramesh: gsoc blog, github repo
Ravi @ravijain056: gsoc blog, github repo
Pranjal, @forumulator: gsoc blog, github repo

Posts: 10

Participants: 6

Thanks

Before I start on the work summary I would like to thank:

• Lasse for introducing me to coala and helping me with (almost) any problem I had during this GSoC.
• Udayan for being a cool guy with good humor (if I say so myself) as well as a good mentor.
• Mischa for helping me with functional python, decorators and excellent reviewing.
• Fellow GSoC students that I met at Europython, Adrian, Adhityaa and Tushar for the awesome community bonding.
• Max because he has dreadlocks. Also the meaningful life teachings and the cooking.

Work Summary

I have several pull/merge requests across the different coala repos which I will list here with links so you can check them for yourself.

Pull/Merge Request Description Status
coala/2407 Modify the JSON spec used by the decorator Merged
coala/2452 Migrate some libs to coala-utils Merged
coala/2460 Bump version for coala-utils Merged
coala/2583 Add external bear tutorial Merged
coala-bear-management/3 Extend tool to support external bears Merged
coala-utils/5 Refactor from coala-decorators to coala-utils Merged
coala-utils/7 Migrate StringConverter from coala core Merged
coala-utils/15 Modification for backwards compatibility Merged
coala-utils/19 Revert changes in yield_once as a fix Merged
coala-utils/20 Add open_files context manager Merged
coala-bears/617 Extend tool to support conda packaging Pending

Goals

The most important part of the project was to be able to write bears with other languages. I can proudly say that it is now possible to write such an "external" bear.

Some other achieved goals are:

• Bear creation tool
• External bear proof of concept with tutorial

Some work left to do:

• Merge packaging tool extension
• Add Diff handling to external bears

Now that we have come to an end I can say that the toughest challenge by far was the code merging process since coala has a very strict reviewing workflow.

Wrap Up

So that is it for GSoC 2016. It was an awesome experience in which I learnt a lot of stuff (not only programming related) and I met a lot of cool people. I would definitely recommend at least trying to join the program. Worst case scenario, you will have contributed to an open source community, whose importance I explained in the very first post of my blog.

That's it from me. Feel free to PM me on the coala gitter channel with any questions related to the project (and beyond).

Alex

udiboy1209 (kivy)

One Hell Of A Summer!

I always wondered how large-scale projects and organisations got formed and reached the stage they are at now, where tens or even hundreds of people maintain them and constantly contribute to make them better. I wondered what it would feel like to see a project build up from the first line of code! Google Summer of Code allowed me to experience that! It truly has been an overwhelming and exciting summer of code!

I got to see the very beginning of the maps module: not just the first line of code, but the initial concept, the plan for coding it and everything. I know it is just a small part of KivEnt, but it has certainly been huge for me compared to the projects I have done before!

I will try to describe my project from the very beginning in this post.

Initial Idea

KivEnt is a game engine, and one thing every game engine needs is a module for displaying tile maps. Tile maps make it simple to design game levels and fields. For example, without a map module, a Pokémon game developer would have to place each grass, sand and water tile individually, and decide where each cliff edge goes so that the cliffs look elevated in 3D. You are already thinking of ways to automate all this, aren't you? Yeah, just store a 2D array recording which tile to render at each position, and build that 2D array separately. Why don't we store the array in a file in some standardised format, so we can create that file externally? Layering would just require multiple 2D arrays!

These are such fundamental requirements of games that there are a lot of tools out there just for creating that external file I mentioned above. A very famous editor is the Tiled Map Editor. Tiled stores its files in a fixed format known as TMX, so to display a map created in Tiled in KivEnt, we need a module to take all that data from the TMX file and render it correctly on the KivEnt canvas. That is essentially what the map module does. But we can't just display those tiles and forget about them: we need to be able to access and modify every tile, so the module also provides a good API for accessing the data.

Fundamental Requirements

KivEnt runs on the entity-component architecture, which in the simplest sense means that each object in the game is an entity that has a data component for each system controlling its properties. So each tile of the map has to be an entity and have some component which relates it to a map. Hence we have the MapSystem, which controls the position of each tile on the map (row, col). The first requirement therefore was to set up such a system and the corresponding component. Also, each tile could have additional properties like animation and hence be part of other systems.

Next, we need a way to efficiently store all the data about which place on the map has which tile. Contiguous allocation is the best way, because the tile at row m and column n is at element m * map_width + n in the array! We do such array allocation at the Cython level to have greater control over the memory used. But to access this data in code, we also need Python APIs for all of it. I built wrapper API classes for both a single tile and the whole tilemap, which was the second requirement.
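The row-major indexing described above is simple enough to sketch directly (these are illustrative helpers, not the module's actual Cython code):

```python
def tile_index(row, col, map_width):
    """Row-major index of the tile at (row, col) in a flat array."""
    return row * map_width + col

def tile_coords(index, map_width):
    """Inverse mapping: recover (row, col) from the flat index."""
    return divmod(index, map_width)

# in a 4-wide map, the tile at row 2, col 3 lives at element 11
assert tile_index(2, 3, 4) == 11
assert tile_coords(11, 4) == (2, 3)
```

Keeping the tiles in one contiguous block this way is what makes the Cython-level allocation cheap and cache-friendly.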

• PR #141: This is the PR where I built this first minimum viable prototype of the module.
• PR #142: Added animation to tiles in this PR.
• PR #143: Fixed a batching bug in the AnimationSystem which prevented textures from two different sources being added as animation frames together.

I then rendered this kinda trippy coloured-tiles map. There were no automatic entity creation features yet, so I had to set the tile in each place randomly, but it's a good test case.

Now that we have a setup which can store tile map data and easily render it as entities in the game, we just need to get the data from the TMX file into the TileMap object. For that we first need something which can parse the XML in the TMX file. I decided to use python-tmx, which is simple and pretty lightweight. It wraps the XML data in python data objects. There were some fields which were not being read by the module because it wasn't up to date with the latest TMX format, so I submitted patches implementing the missing bits.

• Patch #2: For hexsidelength parameter for hexagonal tiles.

Basic TMX parser

There are a lot of steps involved in loading TMX files. For a basic idea, each tile would require a texture, a model and possibly an animation to display it. This previous blog post covers the details of the parser.

• PR #149: Basic TMX parser implementation for orthogonal tiles.

The next step was to support other tile formats like hexagonal and isometric.

Hexagonal and Isometric tiles

Hexagonal tiles have a form of arrangement called the staggered arrangement. We can have the same kind of arrangement for isometric tiles to get an isometric staggered map.

This is the way tiles in a staggered arrangement would look:

o   o   o   o
o o o o o o o o
o o o o o o o o
o o o o o o o o
o   o   o   o


The above arrangement has stagger index as even and stagger axis as x axis. This just means that every even indexed tile along the x-axis will be shifted along the y-axis by 1.

If the stagger index was odd and the stagger axis was y axis, this would be the outcome.

  o o o o o o o
o o o o o o o
o o o o o o o
o o o o o o o
o o o o o o o
o o o o o o o


Now consider if these tiles were hexagonal or isometric in shape: with the correct spacing and positioning, we could create a map using the staggered arrangement. All we have to add to the code is how to get the position of a tile from (i, j). Similar logic applies to the isometric arrangement.
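A simplified sketch of the stagger logic (my own approximation, not KivEnt's actual formula, and it ignores the hex/isometric-specific spacing): tiles with the chosen parity along the stagger axis get shifted by half a tile on the other axis.

```python
def staggered_position(i, j, tile_w, tile_h,
                       stagger_axis='x', stagger_index='even'):
    """Pixel position of tile (i, j) in a staggered layout.
    Tiles whose index along the stagger axis matches the chosen
    parity are offset by half a tile on the perpendicular axis."""
    parity = 0 if stagger_index == 'even' else 1
    if stagger_axis == 'x':
        x = i * tile_w
        y = j * tile_h + (tile_h / 2 if i % 2 == parity else 0)
    else:
        x = i * tile_w + (tile_w / 2 if j % 2 == parity else 0)
        y = j * tile_h
    return x, y
```

A real hexagonal layout would also shrink the spacing along the stagger axis (Tiled's hexsidelength parameter controls this), but the parity-based offset is the core idea.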

Here are some examples:

PR #153 is where I added this feature. Figuring out the formula for the position from (i,j) was really interesting! I will probably describe it in another blog post.

Shapes and other Objects

Maps may have something other than just tiles drawn on them, like circles, polygons and images. Tiled has a way to store all that data, and so our module has to be able to display it. For drawing shapes we use KivEnt’s VertexModel.

VertexModel is a set of vertices and a list of indices which indicate what triangles to draw to create the required shape. So essentially all polygons can be represented as multiple adjacent triangles using these vertices and indices. We can also display ellipses using a high number of triangles. So there is a util function which takes the vertex data of the shapes in the TMX file and converts it to suitable data for the VertexModel. To display the shape, all we have to do is create an entity with this model as its render data. Images are trivial too, because they are rendered separately like any other image, with a rectangular model and a texture.
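For convex polygons, the triangle list can be built with a simple fan triangulation. Here is an illustrative helper (not KivEnt's actual util function) showing the vertices-plus-indices idea:

```python
def polygon_to_model(vertices):
    """Convert a convex polygon's vertex list into (vertices, indices)
    suitable for a triangle-based vertex model, using a fan rooted at
    vertex 0: triangles (0, 1, 2), (0, 2, 3), ..."""
    indices = []
    for k in range(1, len(vertices) - 1):
        # each triangle shares vertex 0 with the fan
        indices.extend([0, k, k + 1])
    return vertices, indices

verts = [(0, 0), (1, 0), (1, 1), (0, 1)]  # a unit square
_, idx = polygon_to_model(verts)
# two triangles: [0, 1, 2, 0, 2, 3]
```

Concave polygons need a proper triangulation (e.g. ear clipping), but the vertices-plus-indices representation stays the same.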

This is how the objects look on screen:

I added this feature in PR #154.

The End

This was the entire work I did as part of GSoC, ending with PR #156 for documentation and a bit of code cleanup. In the community bonding period I had worked on the AnimationSystem in PR #131, which was required by the maps module.

This has really been the most exciting project I have ever done! I thank Google Summer of Code, Kivy, and Jacob Kovak, my mentor, for giving me this great opportunity and helping me complete it!

Check out the tutorial to use the maps module with KivEnt here

August 15, 2016

Kuldeep Singh (kivy)

Before End-Term Evaluation

GSoC is about to end and I am very excited😀. Everything went well, except for a week or two in between after my last blog post, when I got occupied with some placement work at my college, The LNM Institute of Information Technology (an awesome place).

Ok, so this week I have been working on documenting my PRs and making everything mergeable. There are some pull requests which I don't think will be merged soon, so I will work on them even after GSoC.

It was a fun and awesome experience learning and interacting with the kivy community.

Nelson Liu (scikit-learn)

scikit-learn GSoC Summary, Lessons Learned, and Future Work

This summer, I was quite fortunate to work on the scikit-learn project with my mentors Jacob Schreiber and Raghav RV as part of the Google Summer of Code Program. I worked on various features for the tree module, and I'd like to take a moment now that the program is over to summarize what I've done, recall the many things learned over a plethora of successes and failures, and talk about future work to be done in the space.

Listing of Work Completed

So, what did I even do during the summer? A summary listing of my contributions is below.

Throughout the summer, I...

• committed 999 lines of code
• wrote 7 Google Summer of Code-related blog posts
• helped new contributors successfully make their first contributions at the Scipy 2016 scikit-learn sprint
• participated in reviews of 67 contributed pull requests
• had 14 (including the major two mentioned above) of my pull requests in total merged

all between April 22nd and August 10th, when this post was first written.

Technical Skills Developed

Cython

Since much of the tree module is written in Cython, it was naturally important for me to acquire a good understanding of the language, how it works, and its various idiosyncrasies. I had never used any C-flavored languages before Cython, so it took a bit of time to get used to the slightly different style of development and to concepts like manual handling of pointers, freeing memory, the lack of IndexErrors when dealing with arrays of pointers, etc. Looking back, I'm happy that I got the chance to work with the language, and I think it'll prove useful both for future contributions and for my own projects and research.

Decision Trees

To effectively contribute to the tree module, I had to know the ins and outs of decision trees. This includes both the theoretical underpinnings and scikit-learn's own implementation. There were a few sources that were quite helpful for developing this knowledge, mainly Leo Breiman's Classification and Regression Trees [1] for the theoretical side and Chapter 5 of Gilles Louppe's PhD thesis Understanding Random Forests [2] for learning more about scikit-learn's implementation.

Lessons Learned

While I learned about useful tools like Cython and about machine learning in general, I also picked up some more general lessons that I will be sure to carry into the future.

Never assume something is easy

When writing my proposal, my mentors and I assumed that implementing MAE would be a straightforward affair. Indeed, writing the first version was quite simple. However, making it efficient required much more. As a result, we spent far more time on the MAE criterion than initially planned, which limited the amount of time I had to work on other things originally outlined in the proposal.
Thus, it's important to be flexible and be willing to change your work priorities when obstacles come up.
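For context, the MAE impurity itself is easy to state: the constant prediction minimizing mean absolute error is the median, so a node's impurity is the mean absolute deviation from it. A naive sketch like the one below (my illustration, not scikit-learn's Cython code) is the "simple first version"; maintaining medians incrementally over every candidate split is what makes an efficient implementation hard:

```python
import numpy as np

def mae_impurity(y):
    # MAE impurity of a node: mean absolute deviation from the median,
    # since the median is the constant prediction minimizing MAE.
    y = np.asarray(y, dtype=float)
    return float(np.mean(np.abs(y - np.median(y))))
```

Recomputing this from scratch for every split point is O(n log n) per split, which is why the straightforward version is slow on large nodes.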

Always ask for help if you need it

I was lucky to have easy access to my mentors through a Google Hangouts conversation, and thus we frequently talked about the various algorithmic tricks we were thinking about for MAE and other aspects of the code. Additionally, we had several video calls to discuss anything that was hard to convey by internet message. My progress was catalyzed immensely by their constant availability, and I'm extremely happy that I asked for help whenever I needed it.

Future Work to be Done

There is a plethora of future work to be done on the tree module, which has grown organically and become a bit overgrown. There's a high start-up cost to contributing to this module for the first time, as it's quite complex (the code is spread across multiple files) and a substantial amount of it is written in Cython, which takes some getting used to. There are two main goals for contributions to make in the future (which I've begun to work on already):

• Optimize MAE for large datasets, as it's still quite slow
• Add post-pruning to decision trees

Participating in Google Summer of Code has brought me closer to the group of amazing developers and people that is the scikit-learn community. I look forward to working on the issues above and contributing to scikit-learn with code and reviews for years to come, and I couldn't be more satisfied with the experience.

References
----------------

[1] Breiman, Leo. Classification and Regression Trees. Belmont, CA: Wadsworth International Group, 1984. Print.

[2] Louppe, Gilles. "Understanding random forests: From theory to practice." arXiv preprint arXiv:1407.7502 (2014).

kaichogami (mne-python)

GSoC Final Report

As per the requirements of GSoC, this article consists of a brief description of my work: what was done, what could not be done, what is left to do, the future plans of the project, and links to the work that was merged as well as to patches that were not.
My project involved refactoring the decoding modules to comply with scikit-learn. Each heading contains a link to its respective pull request. All commits are referenced at the end of the article. I have created a jupyter notebook detailing my changes.

Project Work

Xdawn Refactoring

I started my work by refactoring the xdawn module. Xdawn now works with numpy arrays, takes fewer parameters in __init__, and is therefore faster than before. The lightweight implementation also makes it possible to pipeline it with other preprocessing steps. I wrote a simple script to compare the running time of the original Xdawn and the refactored version, named XdawnTransformer, on a data matrix of shape (288, 59, 61).

Time taken to initialize and fit Xdawn
0.0634717941284
Time taken to initialize and fit XdawnTransformer
0.0294799804688
Time taken to transform Xdawn
0.0195329189301
Time taken to transform XdawnTransformer
0.00475907325745



There was around a 53% and 75% decrease in the running time of fit and transform, respectively.

Unsupervised Spatial Filter

Initially written by my mentor, this class applies scikit-learn decomposition algorithms (mainly variations of ICA and PCA) to epochs data. EEG/MEG feature dimensionality is high, which increases the complexity and decreases the efficiency of operations. PCA and ICA are common approaches to reduce the dimensionality of such matrices.
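The underlying idea can be sketched as follows (a simplified stand-in for illustration, not MNE's actual UnsupervisedSpatialFilter code): treat every time point of every epoch as a sample with channels as features, fit the decomposition there, and fold the result back into 3D:

```python
import numpy as np

def apply_unsupervised_filter(estimator, X):
    # X has shape (n_epochs, n_channels, n_times); any estimator with
    # fit/transform over (n_samples, n_features) works, e.g. sklearn's PCA.
    n_epochs = X.shape[0]
    X2d = np.hstack(X).T                        # (n_epochs * n_times, n_channels)
    Xt = estimator.fit(X2d).transform(X2d).T    # (n_components, n_epochs * n_times)
    return np.array(np.split(Xt, n_epochs, axis=1))  # back to 3D
```

Passing `sklearn.decomposition.PCA(n_components=k)` as the estimator would reduce the channel dimension of every epoch from n_channels to k.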

Vectorizer

The scikit-learn API follows a convention of working with 2D arrays. The decoding modules use internal MNE functions, which work on data arrays of higher dimensions, so to make them compatible the data needs to be reshaped into a 2D array. Vectorizer does just that: placed in the step before a scikit-learn transformer or estimator, it converts higher-dimensional data into 2D.
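A toy version of the idea (hypothetical code for illustration, not MNE's Vectorizer) simply remembers the trailing feature shape and flattens it:

```python
import numpy as np

class ToyVectorizer:
    """Flatten (n_samples, d1, d2, ...) data to 2D for scikit-learn steps."""

    def fit(self, X, y=None):
        self.features_shape_ = X.shape[1:]  # remembered for inverse_transform
        return self

    def transform(self, X):
        return X.reshape(len(X), -1)

    def inverse_transform(self, X):
        return X.reshape((len(X),) + self.features_shape_)
```

The inverse step is what lets later pipeline stages that need 3D epochs data recover the original layout.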

Ongoing Work

Temporal Filter

A minor refactor of the existing FilterEstimator class, which applies zero-phase low-pass, high-pass, band-pass or band-stop filters to epochs data. The new class, called TemporalFilter, does not take an info parameter and works with sfreq directly, keeping it as light as possible. A few other internal changes include changing the default parameters and using functions from filter.py instead of writing checks.

Scoring method in SearchLight

The SearchLight class, written by Jean, uses the default scoring method of the estimator passed to its constructor. However, in some use cases the scoring method needs to be changed. Also, being able to change the scorer is convenient when evaluating the score with scikit-learn's cross_val_score. I am currently working on resolving this issue.

During the start of GSoC, the discussion with my mentors was mainly about how the new decoding API should look. Being a newcomer, I was confused about what really was required. However, my doubts were clarified, thanks to the patience of my mentors.

VectorizerMixin

Initially we decided to go with the idea of a mixin class that would internally convert all output data to 2D and convert the input back to 3D, since we decided the decoding classes would only accept and return 2D arrays (MNE functions work with 3D data). This class could also be placed at the beginning of the pipeline so that the epochs data is converted to 2D. However, this idea was discarded as it would have involved a lot of refactoring.

Location of new classes

Initially we decided to keep all the work done during GSoC in a separate file called gsoc.py; however, Alexander was against it.

Work Left and Future

The decoding modules need a lot of work. Jean has nicely organised what is done and what is still left here. I plan to stick around and work on the remaining modules that still need rework.

Finally, I am extremely grateful for the chance that GSoC provided. I learnt a lot about code refactoring, API design and writing code beautifully (my first PR and subsequent PRs clearly show the improvement). I am honored to have worked with such a good community, one that was nice to a newcomer and responded to any queries that I had.
Lastly, I thank my mentors for being extremely helpful; this project was only possible because of them.

Utkarsh (pgmpy)

MCMC: Hamiltonian Monte Carlo and No-U-Turn Sampler

The random-walk behavior of many Markov Chain Monte Carlo (MCMC) algorithms makes convergence of the Markov chain to the target distribution inefficient, resulting in slow mixing. In this post we look at two MCMC algorithms that propose future states in the Markov chain using Hamiltonian dynamics rather than a probability distribution. This allows the Markov chain to explore the target distribution much more efficiently, resulting in faster convergence.

Hamiltonian Dynamics

Before we move our discussion about Hamiltonian Monte Carlo any further, we need to become familiar with the concept of Hamiltonian dynamics. Hamiltonian dynamics are used to describe how objects move throughout a system. They are defined in terms of an object's location x and its momentum p (equivalent to the object's mass times its velocity) at some time t. For each location of the object there is an associated potential energy U(x), and for each momentum an associated kinetic energy K(p). The total energy of the system is constant and is called the Hamiltonian H(x, p), defined as the sum of the potential and kinetic energies:

H(x, p) = U(x) + K(p)

The partial derivatives of the Hamiltonian determine how position and momentum change over time t, according to Hamiltonian's equations:

dx_i/dt = ∂H/∂p_i = ∂K(p)/∂p_i

dp_i/dt = -∂H/∂x_i = -∂U(x)/∂x_i

The above equations operate on a d-dimensional position vector x and a d-dimensional momentum vector p, for i = 1, 2, ..., d.

Thus, if we can evaluate ∂U(x)/∂x_i and ∂K(p)/∂p_i, and have a set of initial conditions, i.e. an initial position x0 and an initial momentum p0 at time t = 0, then we can predict the location and momentum of the object at any future time t = T by simulating the dynamics for a duration T.

Discretizing Hamiltonian’s Equations

The Hamiltonian's equations describe an object's motion with regard to time, which is a continuous variable. To simulate dynamics on a computer, Hamiltonian's equations must be numerically approximated by discretizing time. This is done by splitting the time interval T into small intervals of size ε (the stepsize).

Euler’s Method

The best-known way to approximate the solution of a system of differential equations is Euler's method. For Hamiltonian's equations, this method performs the following updates for each component of position and momentum (indexed by i = 1, ..., d):

p_i(t + ε) = p_i(t) - ε ∂U(x(t))/∂x_i

x_i(t + ε) = x_i(t) + ε ∂K(p(t))/∂p_i

Even better results can be obtained if we use the updated value of the momentum in the position update:

x_i(t + ε) = x_i(t) + ε ∂K(p(t + ε))/∂p_i

This method is called the Modified Euler's method.

Leapfrog Method

Unlike Euler's method, where we take full steps to update position and momentum, in the leapfrog method we take half steps to update the momentum value:

p_i(t + ε/2) = p_i(t) - (ε/2) ∂U(x(t))/∂x_i

x_i(t + ε) = x_i(t) + ε ∂K(p(t + ε/2))/∂p_i

p_i(t + ε) = p_i(t + ε/2) - (ε/2) ∂U(x(t + ε))/∂x_i

The leapfrog method yields even better results than the Modified Euler method.
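In code, a single leapfrog iteration (a generic sketch assuming K(p) = p²/2m, so ∂K/∂p = p/m) looks like:

```python
def leapfrog(x, p, epsilon, dU, m=1.0):
    # Half step for momentum, full step for position, half step for momentum.
    p = p - (epsilon / 2.0) * dU(x)
    x = x + epsilon * p / m
    p = p - (epsilon / 2.0) * dU(x)
    return x, p
```

Because consecutive iterations merge the trailing and leading half steps, the momentum effectively "leaps" over the position updates, which is where the method gets its name.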

Example: Simulating Hamiltonian dynamics of a simple pendulum

Imagine a bob of mass m attached to a string of length l whose one end is fixed at the origin. The equilibrium position of the pendulum is at x = 0. Now, keeping the string stretched, we move the bob some horizontal distance x. The corresponding change in potential energy is given by

U(x) = m g Δh,

where Δh is the change in height and g is the gravitational acceleration of earth.

Using simple trigonometry one can derive the relationship between Δh and x:

Δh = l (1 - cos(arcsin(x / l)))

The kinetic energy of the bob can be written in terms of its momentum p as

K(p) = p² / (2 m)

Further, the partial derivatives of the potential and kinetic energy can be written as:

∂U/∂x = m g x / √(l² - x²)   and   ∂K/∂p = p / m

Using these equations, we can now simulate the dynamics of a simple pendulum with the leapfrog method in Python.

from __future__ import division
import matplotlib.pyplot as plt
import numpy as np

epsilon = 0.025  # Stepsize
num_steps = 98   # Number of steps to simulate dynamics
m = 1    # Unit mass
l = 1.5  # Length of string
g = 9.8  # Gravity of earth

def K(p):
    # Kinetic energy
    return 0.5 * (p**2) / m

def U(x):
    # Potential energy m * g * delta_h
    delta_h = l * (1 - np.cos(np.arcsin(x / l)))
    return m * g * delta_h

def dU(x):
    # Gradient of potential energy: m * g * x / sqrt(l^2 - x^2)
    return (m * g * x) / np.sqrt(l**2 - x**2)

x0 = 0.4  # Initial position
p0 = 0    # Initial momentum
plt.ion(); plt.figure(figsize=(14, 10))
# Take first half step for momentum
pStep = p0 - (epsilon / 2) * dU(x0)
# Take first full step for position
xStep = x0 + epsilon * pStep
# Take full steps
for step in range(num_steps):
    # Update momentum and position
    pStep = pStep - epsilon * dU(xStep)
    xStep = xStep + epsilon * (pStep / m)
    # Display the pendulum
    plt.subplot(121); plt.cla(); plt.hold(True)
    theta = np.arcsin(xStep / l)
    y_coord = l * np.cos(theta)
    x = np.linspace(0, xStep, 1000)
    y = np.tan(0.5 * np.pi - theta) * x
    plt.plot(0, 0, 'k+', markersize=10)
    plt.plot(x, y, c='black')
    plt.plot(x[-1], y[-1], 'bo', markersize=8)
    plt.xlim([-1, 1]); plt.ylim([2, -1]); plt.hold(False)
    plt.title("Simple Pendulum")
    # Display the energies
    plt.subplot(222); plt.cla(); plt.hold(True)
    potential_energy = U(xStep)
    kinetic_energy = K(pStep)
    plt.bar(0.2, potential_energy, color='r')
    plt.bar(0.2, kinetic_energy, color='k', bottom=potential_energy)
    plt.bar(1.5, kinetic_energy + potential_energy, color='b')
    plt.xlim([0, 2.5]); plt.xticks([0.6, 1.8], ('U+K', 'H'))
    plt.ylim([0, 0.8]); plt.title("Energy"); plt.hold(False)
    # Display the phase space
    plt.subplot(224); plt.cla()
    plt.plot(xStep, pStep, 'ko', markersize=8)
    plt.xlim([-1.2, 1.2]); plt.ylim([-1.2, 1.2])
    plt.xlabel('position'); plt.ylabel('momentum')
    plt.title("Phase Space")
    plt.pause(0.005)
# The last half step for momentum
pStep = pStep - (epsilon / 2) * dU(xStep)


The sub-plot in the upper right half of the output demonstrates the trade-off between the potential and kinetic energy described by Hamiltonian dynamics. The red portion of the first bar plot represents potential energy and the black portion kinetic energy. The second bar plot represents the Hamiltonian. We can see that at the equilibrium position the potential energy is zero and the kinetic energy is at its maximum, and vice-versa at the extreme positions. The lower right sub-plot shows the phase space, i.e. how momentum and position vary. We can see that the phase space maps out an ellipse without deviating from its path. With Euler's method the particle doesn't fully trace an ellipse; instead it slowly diverges towards infinity (look here for further detail).

We can also see that the value of the Hamiltonian is not exactly constant but oscillates slightly. This energy drift is due to the approximations used to discretize time. One can clearly see that the values of position and momentum are not completely random, but follow a deterministic, roughly circular trajectory. If we use the leapfrog method to propose future states, then we can avoid the random-walk behavior we saw in the Metropolis-Hastings algorithm.

Hamiltonian and Probability: Canonical Distributions

Now, having some understanding of what the Hamiltonian is and how to simulate Hamiltonian dynamics, we need to understand how to use these dynamics for MCMC. We need to develop some relation between a probability distribution and the Hamiltonian, so that we can use Hamiltonian dynamics to explore the distribution. To relate H(x, p) to the target distribution P(x) we use a concept from statistical mechanics known as the canonical distribution. For any energy function E(θ), defined over a set of variables θ, we can find a corresponding distribution

P(θ) = (1 / Z) exp(-E(θ) / T),

where Z is a normalizing constant called the partition function and T is the temperature of the system. For our use case we will consider T = 1.

Since the Hamiltonian is an energy function for the joint state of "position" x and "momentum" p, we can define a joint distribution for them as follows:

P(x, p) = (1 / Z) exp(-H(x, p))

Since H(x, p) = U(x) + K(p), we can write the above equation as

P(x, p) = (1 / Z) exp(-U(x)) exp(-K(p))

Furthermore, we can associate a probability distribution with each of the potential and kinetic energies (P(x) with the potential energy and P(p) with the kinetic energy). Thus we can write the above equation as:

P(x, p) = (1 / Z') P(x) P(p),

where Z' is a new normalizing constant. Since the joint distribution factorizes over x and p, we can conclude that x and p are independent. Because of this independence we can choose any distribution from which to sample the momentum variable. A common choice is a zero-mean, unit-variance normal distribution (look at the previous post). The target distribution of interest, from which we actually want to sample, is associated with the potential energy:

U(x) = -log P(x)

Thus, if we can calculate -log P(x) and its gradient, then we are in business and can use Hamiltonian dynamics to generate samples.

Hamiltonian Monte Carlo

In Hamiltonian Monte Carlo (HMC) we start from an initial state (x0, p0) and simulate Hamiltonian dynamics for a short time using the leapfrog method. We then use the position and momentum variables at the end of the simulation as our proposed state (x*, p*). The proposed state is accepted using an update rule analogous to the Metropolis acceptance criterion.

Let's look at the HMC algorithm.

Given an initial state x(0), a stepsize ε, a number of steps L, a log density function log P(x), and a number of samples to be drawn M:

1. Set m = 0
2. Repeat until m = M:

• Set m ← m + 1

• Sample a new initial momentum p0 ~ N(0, I)

• Set x ← x(m-1), x' ← x, p' ← p0

• Repeat for L steps:

• Set x', p' ← Leapfrog(x', p', ε)

• Calculate the acceptance probability α = min(1, exp(U(x) - U(x') + K(p0) - K(p')))

• Draw a random number u ~ Uniform(0, 1)

• If u ≤ α then x(m) ← x', else x(m) ← x(m-1)

Leapfrog is a function that runs a single iteration of the leapfrog method.

In practice, instead of explicitly giving the number of steps L, we sometimes use the trajectory length λ, the product of the number of steps and the stepsize: λ = L ε.
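To make the algorithm concrete, here is a toy, self-contained HMC sketch for a differentiable log density (my illustration, not pgmpy's implementation), using a standard normal momentum so that K(p) = p·p/2 and dU/dx = -grad log P(x):

```python
import numpy as np

def hmc(log_p, grad_log_p, x0, epsilon=0.2, n_steps=10, n_samples=500, seed=0):
    # Toy HMC sampler following the algorithm above.
    rng = np.random.RandomState(seed)
    x = np.asarray(x0, dtype=float)
    out = []
    for _ in range(n_samples):
        p0 = rng.standard_normal(x.shape)
        xn, p = x.copy(), p0.copy()
        # Leapfrog trajectory: p update uses +grad log P since dU = -grad log P.
        p += 0.5 * epsilon * grad_log_p(xn)
        for step in range(n_steps):
            xn += epsilon * p
            if step != n_steps - 1:
                p += epsilon * grad_log_p(xn)
        p += 0.5 * epsilon * grad_log_p(xn)
        # Metropolis acceptance with H = -log P(x) + p.p/2
        h_old = -log_p(x) + 0.5 * np.dot(p0, p0)
        h_new = -log_p(xn) + 0.5 * np.dot(p, p)
        if rng.uniform() < np.exp(h_old - h_new):
            x = xn
        out.append(x.copy())
    return np.array(out)
```

Running it on a 1-D standard normal (log_p = -x²/2, grad = -x) produces samples whose mean and variance quickly approach 0 and 1.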

Let's use this HMC algorithm to draw samples from the same multivariate normal distribution we used in the previous post:

P(x, y) = N(μ, Σ), where

μ = [0, 0]

and

Σ = [[1, 0.97], [0.97, 1]]

I’m going to use HMC implementation from pgmpy, which I have implemented myself.

Here is the Python code for that:

from pgmpy.inference.continuous import HamiltonianMC as HMC, LeapFrog, GradLogPDFGaussian
from pgmpy.factors import JointGaussianDistribution
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(77777)
# Defining a multivariate distribution model
mean = np.array([0, 0])
covariance = np.array([[1, 0.97], [0.97, 1]])
model = JointGaussianDistribution(['x', 'y'], mean, covariance)

# Creating a HMC sampling instance
sampler = HMC(model=model, grad_log_pdf=GradLogPDFGaussian, simulate_dynamics=LeapFrog)
# Drawing samples
samples = sampler.sample(initial_pos=np.array([7, 0]), num_samples=1000,
                         trajectory_length=10, stepsize=0.25)
plt.figure(); plt.hold(True)
plt.scatter(samples['x'], samples['y'], label='HMC samples', color='k')
plt.plot(samples['x'][0:100], samples['y'][0:100], 'r-', label='First 100 samples')
plt.legend(); plt.hold(False)
plt.show()


If one compares these results to what we saw in the previous post for the Metropolis-Hastings algorithm, it is clear that HMC converges towards the target distribution a lot faster. On careful inspection we can also see that the graph looks a lot denser than that of Metropolis-Hastings, which means that most of our samples were accepted (a high acceptance rate).

Though the performance of HMC might seem better, it critically depends on the trajectory length and stepsize. A poor choice of these can lead to a high rejection rate or too high a computation time. One can see the results for oneself by changing both of the parameters in the above example. For example, when I just changed the stepsize from 0.25 to 0.5 in the above example, nearly all samples were rejected. Though the stepsize parameter of the HMC implementation is optional, I do not suggest relying on it.

In pgmpy we have implemented another variant of HMC in which we adapt the stepsize during the course of sampling, thus completely eliminating the need to specify a stepsize (though it still requires the trajectory length to be specified by the user). This variant of HMC is known as Hamiltonian Monte Carlo with dual averaging. In pgmpy we have also provided an implementation of the Modified Euler method for simulating Hamiltonian dynamics. (By default both algorithms use leapfrog. It is not recommended to use the Modified Euler or Euler methods, because their trajectories are not elliptical and thus they show poor performance in comparison to the leapfrog method.) Here is a code snippet showing how we can use the HMCda algorithm in pgmpy.

# Using the JointGaussianDistribution model and imports from the above example
from pgmpy.inference.continuous import HamiltonianMCda as HMCda, ModifiedEuler
# Using Modified Euler instead of the default LeapFrog to simulate dynamics
sampler_da = HMCda(model=model, grad_log_pdf=GradLogPDFGaussian,
                   simulate_dynamics=ModifiedEuler)
samples = sampler_da.sample(initial_pos=np.array([7, 0]), num_adapt=10, num_samples=10,
                            trajectory_length=10)
print(samples)


Both of these algorithms (HMC and HMCda) require some hand-tuning from the user, which can be time consuming, especially for high-dimensional, complex models. The No-U-Turn Sampler (NUTS) is an extension of HMC that eliminates the need to specify the trajectory length, but requires the user to specify the stepsize. With the dual averaging algorithm, NUTS can run without any hand-tuning at all, and the samples generated are at least as good as those from finely hand-tuned HMC.

NUTS removes the need for the number-of-steps parameter by using a metric to evaluate whether we have run the leapfrog algorithm for long enough, that is, whether running the simulation for more steps would no longer increase the distance between the proposed value x' and the initial value x.

At a high level, NUTS uses the leapfrog method to trace out a path forward and backward in fictitious time, first running forwards or backwards 1 step, then forwards or backwards 2 steps, then forwards or backwards 4 steps, etc. This doubling process builds a balanced binary tree whose leaf nodes correspond to position-momentum states. The doubling process is halted when the subtrajectory from the leftmost to the rightmost nodes of any balanced subtree of the overall binary tree starts to double back on itself (i.e., the fictional particle starts to make a "U-Turn"). At this point NUTS stops the simulation and samples from among the set of points computed during the simulation, taking care to preserve detailed balance.

The API (in pgmpy) for NUTS and NUTS with dual averaging is quite similar to that of HMC. Here is an example:

from pgmpy.inference.continuous import (NoUTurnSampler as NUTS, GradLogPDFGaussian,
                                        NoUTurnSamplerDA as NUTSda)
from pgmpy.factors import JointGaussianDistribution
import numpy as np
import matplotlib.pyplot as plt
# Creating model
mean = np.array([0, 0, 0])
covariance = np.array([[6, 0.7, 0.2], [0.7, 3, 0.9], [0.2, 0.9, 1]])
model = JointGaussianDistribution(['x', 'y', 'z'], mean, covariance)
# Creating a sampling instance for NUTS
sampler = NUTS(model=model, grad_log_pdf=GradLogPDFGaussian)
samples = sampler.sample(initial_pos=np.array([1, 1, 1]), num_samples=1000, stepsize=0.4)
# Plotting trace of samples
labels = plt.plot(samples)
plt.legend(labels, model.variables)
plt.title("Trace plot of NUTS samples")
plt.show()

# Creating a sampling instance of NUTSda
sampler_da = NUTSda(model=model, grad_log_pdf=GradLogPDFGaussian)
samples = sampler_da.sample(initial_pos=np.array([0, 1, 0]), num_adapt=1000, num_samples=1000)
# Plotting trace of samples
labels = plt.plot(samples)
plt.legend(labels, model.variables)
plt.title("Trace plot of NUTSda samples")
plt.show()


The samples returned by all four algorithms are of two types, depending on the installation available. If the working environment has an installation of pandas, a pandas.DataFrame object is returned; otherwise a numpy.recarray object is returned. As of now pgmpy has pandas as a strict dependency, so the samples returned will always be a DataFrame object, but in the near future pandas will no longer be a strict dependency.

Apart from the sample method, all four implementations have another method named generate_sample, each iteration of which yields a sample as a simple numpy.array object. This method is useful if one wants to work on a single sample at a time. The API of generate_sample is identical to that of sample.

# Using the above sampling instance of NUTSda
gen_samples = sampler_da.generate_sample(initial_pos=np.array([0, 1, 0]),
                                         num_adapt=10, num_samples=10)
samples = np.array([sample for sample in gen_samples])
print(samples)


pgmpy also provides base class structures so that user-defined methods can be plugged in. Let's look at an example of how we can do that. In this example the distribution we are going to sample from is the logistic distribution with location μ = 5 and scale s = 2. The probability density of the logistic distribution is given by:

f(x) = exp(-(x - μ) / s) / (s (1 + exp(-(x - μ) / s))²)

Thus the log of this probability density function (the negative of the potential energy function) can be written as:

log f(x) = -(x - μ) / s - log(s) - 2 log(1 + exp(-(x - μ) / s))

And the gradient of the log density:

d(log f(x))/dx = -1/s + (2/s) exp(-(x - μ) / s) / (1 + exp(-(x - μ) / s))

import numpy as np
from pgmpy.factors import ContinuousFactor
from pgmpy.inference.continuous import NoUTurnSamplerDA as NUTSda, BaseGradLogPDF
import matplotlib.pyplot as plt

# Creating a logistic distribution with mu = 5, s = 2
def logistic_pdf(x):
    power = - (x - 5.0) / 2.0
    return np.exp(power) / (2 * (1 + np.exp(power))**2)

# Calculating log of the logistic pdf
def log_logistic(x):
    power = - (x - 5.0) / 2.0
    return power - np.log(2.0) - 2 * np.log(1 + np.exp(power))

# Calculating gradient of the log of the logistic pdf
def grad_log_logistic(x):
    power = - (x - 5.0) / 2.0
    return - 0.5 - (2 / (1 + np.exp(power))) * np.exp(power) * (-0.5)

# Creating a logistic model
logistic_model = ContinuousFactor(['x'], logistic_pdf)

# Creating a class using the base class for gradient log and log probability
# density function (class name and sampler settings reconstructed here to
# keep the example runnable)
class GradLogLogistic(BaseGradLogPDF):
    def __init__(self, variable_assignments, model):
        BaseGradLogPDF.__init__(self, variable_assignments, model)
        self.grad_log, self.log_pdf = (grad_log_logistic(self.variable_assignments),
                                       log_logistic(self.variable_assignments))

# Generating samples using NUTSda
sampler = NUTSda(model=logistic_model, grad_log_pdf=GradLogLogistic)
samples = sampler.sample(initial_pos=np.array([0.0]), num_adapt=10000,
                         num_samples=10000)

x = np.linspace(-30, 30, 10000)
y = [logistic_pdf(i) for i in x]
plt.figure()
plt.hold(1)
plt.plot(x, y, label='real logistic pdf')
plt.hist(samples.values, normed=True, histtype='step', bins=100, label='Samples NUTSda')
plt.legend()
plt.hold(0)
plt.show()


Ending Note

In this blog post we saw how, by avoiding random-walk behavior, we can explore a target distribution efficiently using powerful algorithms like Hamiltonian Monte Carlo and the No-U-Turn Sampler. In my next blog post I hope to show a not-so-common yet interesting application of MCMC which I came across recently.

August 14, 2016

GSoC '16: Final Report

GSoC 2016 was one of the best things I've had the opportunity to participate in. I've learned so much, had a lot of fun with the community the whole time, got to work on something that I really like and care about, got the once-in-a-lifetime opportunity to visit Europe, and still got paid in the end. And none of this would have been possible without the support and help from the coala community as a whole. Especially Lasse, who was my mentor for the program, from whom I've learned so, so much. And Abdeali, who introduced me to coala in the first place and helped me get settled in the community. It honestly wouldn't have been possible without any of them, and I really mean it. Seriously, thank you :)

List of commits I've made over the summer

The last three months have been action packed. Check 'em out for yourself:

coala-quickstart

Commit SHA Commit
b8d8349 Add tests directory for testing
df99516 py.test: Execute doctests for all modules
3d01aed Create coala-quickstart executable
28a33f9 Add coala bear logo with welcome message
759e445 generation: Add validator to ensure path is valid
111d984 generation: Identify most used languages
4ace132 generation: Ask about file globs
8f7fe23 generation: Identify relevant bears and show help
839fa19 FileGlobs: Simplify questions
7c98e48 Settings: Generate sections for each language
b28e20c Settings: Write to coafile
69a5d2f Generate coafile with basic settings
60bee9a Extract files to ignore from .gitignore
62978ad Change requirements
36c8486 Enable coverage report
d78e85e Bears: Change language used in tests
4a8819e setup.py: Add myself to the list of maintainers
54f21c6 gitignore: Ignore .egg-info directories
6a7b63a Bears: Use only important bears for each language

coala

Commit SHA Commit
45bfec9 Processing: Reuse file dicts loaded to memory
ef287a4 ConsoleInteraction: Sort questions by bear
7d57784 Caching: Make caching default
1732813 Processing: Switch log message to debug
01890c2 CachingUtilitiesTest: Use Section
868c926 README: Update it
f79f53e Constants: Add strings to binary answers
2d7ee93 LICENSE: Remove boilerplate stuff
da6c3eb Replace listdir with scandir
ad3ec72 coalaCITest: Remove unused imports
91c109d Add option to run coala only on changed files
5a6870c coala: Add class to collect only changed files
622a3e5 Add caching utilities
e1b3594 Tagging: Remove Tagging

coala-utils

Commit SHA Commit
27ee83c Update version
64b0e0b Question: Validate the answer
1046c29 VERSION: Bump version
bd1e8fa setup.cfg: Enable coverage report
79fee96 Question: Use input instead of prompt toolkit
cfd81c1 coala_utils: Move ContextManagers from coalib
c5a4526 Add MANIFEST
f019962 Change VERSION
9db2898 Add map between file extension to language name
a52a309 coala_utils: Add Question module

That's a +2633 / -471 change! I honestly didn't know it'd be that big. Anyway, those were the technical stats. On to the showcase!

Stuff I worked on

My primary GSoC proposal: coala-quickstart

coala-quickstart

And here's the coafile that's generated:

Pretty neat stuff, huh? :)

Anyway, that was my whole project in a nutshell. I worked on other stuff too during the coding period. Here are some of the results:

Caching in coala

This is another thing I'm proud of: caching in coala. Remember how you had to lint all your files every time even if you changed just one line? No more. With caching, coala will only collect those files that have changed since the last run. This produces a terrific improvement in speed:

Trial 1 Trial 2 Trial 3 Average
Without caching 9.841 9.594 9.516 9.650
With caching 3.374 3.341 3.358 3.358

That's almost a 3x improvement in speed!

Initially, caching was an experimental feature since we didn't want to break stuff! And this can break a lot of stuff. But fortunately, everything went perfectly smoothly and caching was made default.

The coala README page got a complete overhaul. I placed special emphasis on simplicity and design; and to be honest, I'm quite happy with the outcome.

Other miscellaneous stuff

I worked on other tiny things during the coding phase:

• #2585: This was a small bugfix (to my annoyance, introduced by me). This also led to a performance improvement.
• #2322: scandir is a new Python 3.5 feature that is faster than the traditional listdir, which is used to get a directory's contents.
• e1b3594: I removed Tagging with this commit. It was unused.
• #11, #14: A generic tool to ask the user a question and return the answer in a formatted manner. This is now used in several packages across coala.

There were other tiny changes, but you can find them in the commit list.

Conclusion

It's really been a blast, right from the start to the finish. Thanks to everyone who has helped me in any way. Thanks to Google for sponsoring such an awesome program. Thanks to the PSF for providing coala with an opportunity at GSoC. I honestly can't see how this would have been possible without any of you.

To everyone else, I really recommend contributing to open-source. It doesn't have to be coala. It doesn't even need to be a big project. Just find a project you like: it can even be a silly project that doesn't do anything useful. The whole point is to get started. GSoC is one way to easily do that. There is such a wide variety of organizations and projects, I'm pretty sure at least one project will be to your liking. And you're always welcome at coala. Just drop by and say hello at our Gitter channel.

liscju (Mercurial)

Coding Period XI - XII Week

This week I was deciding how to propagate authorization information about the redirection server from the main repo server to clients. From my investigation and from talks with developers on the #glyph channel, it seems the best decision is to make sure that the CA of the redirection server's certificate is either:
1) a well-known CA, or
2) the same CA as the one used to sign the main repo server's certificate.

From a talk with my mentor, we decided that the best thing I can do right now is to test the solution and prepare the feature to be production-ready. The redirection feature now has all the functionality we planned, so this seems reasonable.

srivatsan_r (MyHDL)

12 Weeks and Counting….!

Well, unofficially this is my 14th coding week of GSoC, since I started coding two weeks early. It has been a very good experience working with MyHDL. It was fun and challenging during both my first project and the second one.

I learnt a lot of new things, like how an open source project is packaged and distributed, how a project should be structured, why tests are important and how continuous integration tools help.

I was wondering whether I would have gained the same level of knowledge had I contributed to some other open source organisation, and I realised that it would not have been the case. I was having a chat with a friend who was working with another sub-org under the Python Software Foundation. I asked him how many lines of code he wrote during GSoC (though this may not be an exact measure of the work done, it still gives an approximation), and he said "around 500!". I was like "Just 500?", because I wrote around 3000 lines for my first project alone. Just then, I realised that I have contributed a lot to MyHDL and it has given me a lot of learning experience in return.

Most importantly, I should thank my mentors (Mr. Eldon Nelson and Mr. Christopher Felton); they both were very supportive. Eldon was very motivating and always kept a check on how my project was progressing, and Chris helped me with debugging errors and clearing my doubts. I bet I would not have got such wonderful mentors in any other GSoC organisation.

Coming back to my GSoC update: my college started two weeks ago, so progress got a little delayed, and we are left with the final overall core test. College has started for both my partner and me, and we have assignments from the college side to complete.

In my first project, I have a PR which is not yet merged. I'm waiting for my mentor to give me the green signal to merge it.

August 13, 2016

chrisittner (pgmpy)

HillClimbEstimator done

pgmpy now has a basic hill climb BN structure estimator.

Usage:

import pandas as pd
import numpy as np
from pgmpy.estimators import HillClimbSearch, BicScore

# create data sample with 9 random variables:
data = pd.DataFrame(np.random.randint(0, 5, size=(5000, 9)), columns=list('ABCDEFGHI'))
data['J'] = data['A'] * data['B']

est = HillClimbSearch(data, scoring_method=BicScore(data))
best_model = est.estimate()

print(sorted(best_model.nodes()))
print(sorted(best_model.edges()))


Output:

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
[('A', 'J'), ('B', 'J')]


Preetwinder (ScrapingHub)

GSoC-5

Hello,
This post continues my updates on my work porting frontera to Python 2/3 dual support.
Only 3 days remain till the beginning of the final clean-up period. The testing work is entirely finished, although there are some other modules I might add tests for later. The porting work is also almost done. I have a PR which ports the final components (workers, middlewares) and will most probably be merged on Monday. The only remaining modules after that are the message bus modules. I'll be making a PR with these components on Monday, and it should be merged within the next day or two. I plan to spend the next week making changes to the documentation, and testing various deployment configurations to weed out any problems that might have persisted. If things go according to plan, we should be able to make a release with Python 3 support soon after the work period ends (23 August).

GSoC-5 was originally published by preetwinder at preetwinder on August 13, 2016.

It has come to an end..

In this blog post I will write a short summary presenting my GSoC progress so far, including some proof of what I have worked on.

Currently, most of my project's aims have been achieved. However, there's still a little work to be done. I loved working on my project and really had fun, and I learnt a lot of cool stuff!

If I had to approximate, I'd say I have achieved around 80% of what I proposed to do this summer.

Why have I not finished it? Because I wanted to do too much?

No. This is not the reason. The real reason for not finishing is that I've spent a lot of time designing and thinking ahead about the tools and how everything was going to work, because I wanted everything to be well thought out and actually usable. Instead of blindly going with the first option for uploading or installing, I took some time and discussed with everyone around coala how this was going to be done and why. This thinking took a lot of time.

Another reason for still not having everything done is that the review process takes a while. Here at coala, we do not merge things just to have something partially usable. We review it. Hard. Until every line of code is used at its maximum efficiency, makes sense and is optimized.

What I have already done with success

I have populated 90% of the bears with a REQUIREMENTS metadata attribute. This REQUIREMENTS attribute is a tuple that contains instances of PackageRequirement classes, which hold package names, versions and package managers. I have also created metadata attributes specific to each bear and filled them in for all the bears.

I have created a tool that uploads ALL bears correctly to PyPI, taking data from the bears themselves, including from the metadata attributes I’ve written to them.

I have created a tool that installs bears while interacting with the user, giving them the option to install ALL bears, SOME bears or none. This tool will also install the REQUIREMENTS, gathered from the attribute present in most bears, using an installation_method() that each PackageRequirement instance has, specific to that manager.

I have made the bears discoverable by coala using entry points, as coala gathers bears by searching for installed PyPI packages that have the 'coalabears' entry point.
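A sketch of this discovery mechanism using pkg_resources ('coalabears' is the entry point group named above; the helper name is mine, not coala's actual code):

```python
# Scan every installed package that exposes the 'coalabears' entry point
# group; this is how coala can find installed bear packages.
import pkg_resources

def discover_bear_packages(group="coalabears"):
    """Map entry point names to their loaded bear modules."""
    return {ep.name: ep.load() for ep in pkg_resources.iter_entry_points(group)}

print(discover_bear_packages())   # empty unless bear packages are installed
```

A bear package opts in simply by declaring the entry point group in its setup.py metadata.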

What is left to do

However, with all the work spent, there’s still some things that I’d love to do next!

Firstly, I'd love to make some cool packages of existing bears that would be shown to the user, such as a Web Development bears package which would include JavaScript, CSS and HTML bears.

Also, I will make some cool improvements and enhancements to the installation tool, some of which I started working on, and some of which will be shown here https://gitlab.com/coala/bear_installation_tool/issues . Some of these enhancements include:

• Changing the output given by PyPI to a cooler output
• Showing all bears that failed installation at the end, as a list
• Fixing a bug in which coala does not correctly find all installed bear packages

kaichogami (mne-python)

Work After Mid-Term

Hello all!
I completely forgot about posting about my progress after the mid-term. I hope I make up for that with this post.
Until the mid-term, I was working on refactoring Xdawn, which was huge and definitely complicated for a beginner like me. However, with patience and huge help from my mentors, I made some valid contributions!

Moving forward, I started working on a class called Vectorizer, which makes scikit-learn pipelines compatible with MNE transformers, which typically work with 3D matrices.
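The concept can be sketched as a tiny transformer mimicking scikit-learn's API (a sketch of the idea only, not MNE's actual Vectorizer):

```python
# Flatten 3D MNE-style data (n_epochs, n_channels, n_times) into the
# 2D (n_samples, n_features) shape scikit-learn estimators expect.
import numpy as np

class Vectorizer:
    def fit(self, X, y=None):
        self.features_shape_ = X.shape[1:]   # remember per-sample shape
        return self

    def transform(self, X):
        return X.reshape(len(X), -1)         # flatten trailing dimensions

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    def inverse_transform(self, X):
        return X.reshape((len(X),) + self.features_shape_)

X = np.random.rand(10, 4, 25)   # 10 epochs, 4 channels, 25 time points
print(Vectorizer().fit_transform(X).shape)   # (10, 100)
```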
Next I worked on a general class (the PR shows closed, but it was merged while rebasing) which applies reduction/decomposition algorithms to MNE data using scikit-learn transformers. This work was started by my mentor Jean, and I improved and extended its functionality.
I am currently working on refactoring FilterEstimator, which I will cover in the next blog post. Thanks for reading.
Have a nice day!

tushar-rishav (coala)

Pipes in Linux

In this blog, I'd like to talk about pipes, an interesting feature of Unix/Linux operating systems (available in other systems too). Having talked about pipes in brief, I will do an implementation in Python. So let's get started!

Pipe

Brief History

Pipes are the eldest IPC tools. They were introduced by Douglas McIlroy after he noticed that, much of the time, the output of one process was being processed as the input to another. Ken Thompson later added pipes to the UNIX operating system.

In simple terms, a pipe is a method of connecting the standard output of one process to the standard input of another. A quick example:
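A minimal version of such a command, listing the current directory and sorting the result:

```shell
# Pipe the directory listing produced by ls into sort:
ls | sort
```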

In the above shell command (reading from left to right), we pass the output of the ls command as input to the sort command. The combined output is the current directory's contents in sorted order. The vertical bar ( | ) is the pipe character. It acts as a method of one-way (or half-duplex) communication between processes.

Pipes are basically of two types: anonymous pipes and named pipes (FIFOs).

The one we just saw was an anonymous pipe, or half-duplex pipe.

Pipe creation

When a process creates a pipe, the kernel sets up two file descriptors (read and write) for use by the pipe. A pipe initially connects a process to itself, and any data travelling through the pipe moves through the kernel. Under Linux, pipes are actually represented internally by a valid inode which resides within the kernel itself, not within the bounds of any physical file system. You might ask: what's the point of a pipe if it connects a process to itself? Is the process just going to communicate with itself? The answer is no. Pipes are useful when we fork a child process: a child process inherits any open file descriptors from its parent, allowing us to set up multiprocess communication (in this case between child and parent). Since both processes have access to the file descriptors, a pipeline is set up.

One important thing to note is that, since the pipe resides within the confines of the kernel, any process that is not in the ancestry of the pipe's creator has no way of addressing it. This is not the case with named pipes (FIFOs), which we will discuss next.

Named Pipes

Unlike anonymous pipes, a named pipe exists in the file system. After input-output has been performed by the sharing processes, the pipe still exists in the file system independently of the processes, and can be used for communication between other processes.
We can create a named pipe using either the mkfifo or mknod shell commands (Python has a built-in method which we will see during the implementation).
Just like a regular file, we can set file permissions on a named pipe. Check out mode in man mkfifo.
A quick example of named pipes that we might have often come across.
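One form this example likely takes (bash process substitution; the exact command is an assumption, since the original snippet was embedded):

```shell
# The shell replaces <(command) with the path of a temporary named pipe;
# here wc reads the listing produced by ls -li through that pipe.
wc -l <(ls -li)
```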

Here, the output from ls -li is redirected to a temporary named pipe, which the shell creates, names and later deletes. Another fun example is to create a very basic shared terminal. Let's try it out.
The idea is to create a named pipe and then use two separate cat processes to read/write data from/to the named pipe.

• Creating a named pipe (from console):
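A sketch of the creation step (the pipe name follows the text; -F makes ls append a '|' marker to pipes):

```shell
mkfifo named_pipe_file_name     # create the FIFO
ls -lF named_pipe_file_name     # -F appends a '|' marker to pipes
# output looks something like:
# prw-r--r-- 1 user user 0 Aug 13 12:00 named_pipe_file_name|
```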

If you observe closely, the output from ls -l looks like:

You may have noticed an additional | character shown next to named_pipe_file_name, and the file permissions start with p. This is a Linux clue that named_pipe_file_name is a pipe.

• Using it:
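The two cat processes can be sketched like this (demo_pipe is a hypothetical name; the single-script form backgrounds the reader so the snippet is self-contained):

```shell
# In one terminal, read from the pipe (this blocks until a writer connects):
#     cat named_pipe_file_name
# In a second terminal, write into it:
#     echo "hello through the pipe" > named_pipe_file_name
#
# The same rendezvous in a single script, backgrounding the reader:
mkfifo demo_pipe
cat demo_pipe &          # reader waits for a writer
echo "hello through the pipe" > demo_pipe
wait                     # reader prints the message and exits at EOF
rm demo_pipe
```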

You might notice that after the first command the execution appears to block. This happens because the other end of the pipe is not yet connected: the kernel suspends the first process until the second process opens the pipe.

I hope that was a simple enough usage of named pipes and helped you understand them.

Now (for fun), let's implement pipes in Python. Since the code is in Python, I need not explain every line; it should be readable. :)
The basic idea is to create two processes (parent and child) and let the parent read the data written by the child process.

Implementation in Python 2.7
• Anonymous pipe
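A sketch of the anonymous-pipe version (the original snippet was embedded; this is an assumed minimal equivalent using os.pipe and os.fork):

```python
import os

# Parent reads the data written by the child through an anonymous pipe.
read_end, write_end = os.pipe()   # kernel sets up two file descriptors
pid = os.fork()

if pid == 0:                      # child process: use the write end
    os.close(read_end)
    os.write(write_end, b"hello from the child\n")
    os.close(write_end)
    os._exit(0)
else:                             # parent process: use the read end
    os.close(write_end)
    message = os.read(read_end, 1024)
    os.close(read_end)
    os.waitpid(pid, 0)
    print(message.decode())
```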

• Named pipe
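A sketch of the named-pipe version, using Python's built-in os.mkfifo (again an assumed minimal equivalent of the embedded snippet):

```python
import os
import tempfile

# os.mkfifo is Python's built-in way to create a named pipe on disk.
fifo_path = os.path.join(tempfile.mkdtemp(), "named_pipe_file_name")
os.mkfifo(fifo_path)

pid = os.fork()
if pid == 0:                      # child: open the FIFO for writing
    with open(fifo_path, "w") as fifo:
        fifo.write("hello through the FIFO\n")
    os._exit(0)
else:                             # parent: open for reading; this blocks
    with open(fifo_path) as fifo: # until the child connects the other end
        message = fifo.read()
    os.waitpid(pid, 0)
    os.remove(fifo_path)
    print(message)
```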

The implementation is almost the same, except that instead of file descriptors we have access to the named pipe's file name, which we use to perform I/O operations.

Phew! That was fun!

Cheers!

GSoC Experience

Well, it's been quite a while since the last blog. This blog and the following ones are going to be about my overall experience and the stuff that I have learnt in the past 11 weeks while contributing to coala-analyzer as a Google Summer of Code developer under the Python Software Foundation. The list is long, hence I won't fit it all in a single post. :)

EuroPython’16 Experience

Recently, I attended the EuroPython conference at Bilbao, Spain, where I had a chance to meet a few cool fellow coalaians (@sils1297, @sims1253, @Udayan12167, @justuswilhelm, @Redridge, @Adrianzatreanu and @hypothesist) and over a thousand Pythonistas | Pythoneers who shared their love of and experience with Python through talks, lightning talks and training sessions. Sadly, I couldn't meet Attila Tovt, my amazing GSoC mentor.

Being my first PyCon ever, I was a little nervous but curious. Spending a day with the community made me feel comfortable, and seeing the energy that people shared, I was overwhelmed! In the following days, mornings started with a keynote speaker, followed by over a dozen talks throughout the day on various exciting topics like descriptors in Python, effective code review, AsyncIO, algorithmic trading with Python, deep learning with TensorFlow, Gilectomy: overcoming the GIL in the CPython implementation by Larry Hastings, and many more.

Finally, the exploration ended with the workshop I conducted: a guide to making a real contribution to an open source project, for novices. It was a learning experience and definitely memorable for me! It was also my first time in Europe, and I was excited. People are friendly and the place is truly beautiful! :)

GSoC’16 Experience

Well, it has been a totally amazing learning experience this summer. I could effectively learn the best practices of a collaborative programmer and (probably) became one! Credit to my mentor @Uran198, who patiently and solicitously reviewed my PRs. I never really bothered to follow practices like atomic changes or TDD aiming for maximum coverage and good code quality until I started contributing. Honestly, following such practices seemed bloated and sometimes annoying at first, but once I got the hang of them, they became a habit. I think the crucial things I have learnt during the GSoC period are how to write code that is maintainable (docstrings, effective and atomic commits), testable (efficient code design, writing unittests and impressive coverage) and follows the standards. Having learnt these skills, I look forward to sharing them with my friends and the community.
The coming blogs would cover these practices in details. :)

GSoC’16 status

11 weeks are over, with a week remaining before submissions. The coala-html project is ready with this PR. I shall keep working to improve this project even after my GSoC period is over. Apart from coala-html, the coala website is almost ready, with minor design stuff remaining. Soon enough I will submit it for review. :)

Cheers!

Sheikh Araf (coala)

[GSoC16] Week 12 update

It is the final week of GSoC and my project is almost complete. I'm still working on the coafile editor, but I haven't been able to devote much time to it since my summer vacation ended.

Nevertheless, I've made some progress. Instead of implementing the GUI of the editor in one go, I'll first implement support for the coafile format. This will mostly include some syntax highlighting and content assist. The next step would then be to use this text editor inside a graphical editor.

This will definitely require more than one week to implement, so I'll work on the project after the official deadline too. And of course the idea is to keep working on the plug-in, and maintaining it.

This is probably my last update on GSoC. It’s been really awesome and I’m thinking of writing another post about my overall experience with GSoC.

mr-karan (coala)

coala GSoC 2016 Summary

This blog post is about my work during the GSoC coding period (May 23 - Aug 15). Doing GSoC has been one of the most amazing experiences of my life. Thanks to the PSF and coala for giving me this opportunity to work on an amazing open source project. Also a big thanks to Google for running GSoC and cultivating the culture of contributing to open source. I have learnt so many things over the short span of three months that will definitely help me grow as a developer.

Work Summary

Issue PR Description Status
#1925 #2569 Syntax Highlighting Merged
#154 #2 coala_bears_create Merged
- #2 Bear Docs Website Merged
#31 #443 GoErrCheckBear Merged
#400 #415 VerilogLintBear Merged
#573 #581 WriteGoodLintBear Merged
#588 #589 HappinessLintBear Merged
#642 #643 package.json bug Merged
#646 #667 VultureBear Merged
#2574 #2590 ASCIINEMA_URL attribute Merged
#658 #658 Add ASCIINEMA urls Merged
#662 #663 ASCIINEMA urls fix Merged
#2309 #2310 Warning Message for wrong linter Closed / Not Happening
#611 #675 RustLintBear Open
#601 #633 MyPyBear Open
#629 #632 SpellCheckBear Open
#596 #602 HTTPoliceLintBear Open

List Of Commits

coala

Commits SHA Shortlog
4e20e9a ConsoleInteraction: Add syntax highlighting
14ece91 ConsoleInteraction: Add comment for line number
b610ae4 coalib/bears/Bear: Add ASCIINEMA_URL attribute

coala-bears

Commits SHA Shortlog
6bd715c bears/python: Add MyPyBear
1ea2076 bears/python: Add VultureBear
73374c3 bears: Fix ASCIINEMA_URLS
0587654 bears: Add ASCIINEMA_URL
67f9de3 requirements: Update coala version
80aa542 package.json: Add name and version
1ba1bfd codecov.yml: Fix it
9c85fe3 bears/naturallanguage: Add WriteGoodLintBear
0e057fc bears/js: Add HappinessLintBear
77279d3 bears/go: Add GoErrCheckBear
72e5d87 bears/verilog: Add VerilogLintBear

coala-bear-management

Commits SHA Shortlog
516b50c Add files required for PyPi
28d1064 Add .gitlab-ci.yml
c252dd7 Add setup.cfg
a5e4a3d coala_bears_create: Add main application
8c87bfe Add requirements
fa8fe8d Add README.rst
624b247 gitignore: Remove useless entries

website

Commits SHA Shortlog
d5e70d2 Add LangCtrl and language page view
b7f53c6 Add DetailCtrl and detail page view
8175882 Add BearCtrl and bear page views
b113e2a Add main angular app and homepage
66ee579 Add external stylesheets and images
68cc4b0 data: Mock JSON output
e0d47be bower: Add Vendor Dependencies

Some fun stats with git log

Repository Insertions(+) Deletions(-)
coala 456 429
coala-bears 1012 66
website 543 0
coala-bears-create 400 0

That’s a +2411/-495 change over the span of three months :smile:

A coala-bears template generator

coala-bears-create is a tool which lets you create bears easily, by asking you questions and filling in a standard config file. The generated files can then be quickly completed with additional details to get your bear up and running in no time. The CLI is implemented with the help of Python Prompt Toolkit, which helped me plug in features like a dropdown list, a status bar and a prompt method for asking the user questions.

The templates are present in the scaffolding-templates directory. The user is asked to enter the directory where they want to create the bear, and after they finish answering all the questions, the data is taken from the scaffolding templates, the answers are filled into the templates, and a new folder is created for the user with the Bear and BearTest files.

Advantage/Motivation: Previously, creating a bear meant you needed to type or copy standard boilerplate, like certain imports or linter variables which are shared across all bears. Using this application, you just need to specify the values and you have a bear ready almost instantly, without any extra effort.

Syntax Highlighting in coala

Presently coala displays results like this:

The task of this project was to implement syntax highlighting for the affected code, which brings some visual enhancement for the user. This task wasn't trivial, as we didn't have information about the language of the file being analyzed, and because we needed dual highlighting (foreground & background colors). The aim was to highlight the part within the sourcerange and provide syntax highlighting for the rest, since highlighting text and providing colors on top of it doesn't work well. I also had to figure out a way to still print spaces and tabs as unicode markers, like in the previous version.

I began to search for a library for syntax highlighting, as the #1 rule in software development is to use what already exists. I finally found the Pygments library to be ideal for this task.

I used get_lexer_for_filename and fed sourcerange.file to it to get the appropriate lexer based on the language of the file. If no suitable lexer is found, an exception is raised, which I catch, making TextLexer (the plain-text lexer) the default. I used the highlight method from the Pygments library to get a str object with ANSI escape sequences wrapped around my string, and TerminalTrueColorFormatter to print the results in a terminal. There was another formatter, TerminalFormatter, but after some hair pulling (read: debugging) I realized I couldn't use it for background highlighting.

Until now I was able to colorize the results, but one important missing piece was that there were no space/tab markers in the strings. In the old source code, a custom function iterated over the string and, if a space or tab was found, replaced it with a unicode character that appears as a bullet mark in the terminal for spaces and double right arrows for tabs. I couldn't do this because the str object I had was now wrapped with ANSI escape sequences, and I spent almost half a day searching for a workaround. Not satisfied with any method I tried, I searched the Pygments docs in the hope of finding something. And voila, I was delighted to see that there's a filter, VisibleWhitespaceFilter, which can be added to the lexer. It worked as expected, and this also let me replace an entire function from the original source code with an elegant one-line solution. I asked the maintainers of the original code and they were fine with this new change. I also highlighted result.message using the same method. I changed the tests so that they work with Pygments highlighting and refactored the code a bit, reducing the duplication of long highlight methods.
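The approach can be sketched as follows (the helper name highlighted is mine; the Pygments classes are real, and VisibleWhitespaceFilter is the filter's spelling in Pygments):

```python
from pygments import highlight
from pygments.filters import VisibleWhitespaceFilter
from pygments.formatters import TerminalTrueColorFormatter
from pygments.lexers import TextLexer, get_lexer_for_filename
from pygments.util import ClassNotFound

def highlighted(code, filename):
    """Colorize `code`, falling back to plain text for unknown file types."""
    try:
        lexer = get_lexer_for_filename(filename)
    except ClassNotFound:
        lexer = TextLexer()               # plain-text fallback
    # Render spaces/tabs as visible markers instead of a hand-rolled loop.
    lexer.add_filter(VisibleWhitespaceFilter(spaces=True, tabs=True))
    return highlight(code, lexer, TerminalTrueColorFormatter())

print(highlighted("def answer():\n\treturn 42\n", "example.py"))
```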

Bear Documentation Website

The task was to create a website for all coala-bears documentation. I was initially clueless about which stack to choose, but after some discussion with @tushar-rishav and @sils1297, I decided to go with the awesome AngularJS because of its highly customizable filters. I used coala --show-bears json to get a JSON output of all the details/configs of the bears present. I parsed this data and used Materialize to display them in a list of cards, where the user can Read More about a bear on the next page or see some of the important details in the card-reveal option.

You can take a look at the website here.

This site will be deployed alongside the coala website, which is Tushar's project.

I enjoy creating new bears for coala, as I religiously use them in my side projects. During my GSoC period, I created several bears. Some of them are completed and merged, while some are still open, and I will continue to work on them after my GSoC is over.

A list of them is available on top of the post.

What I couldn’t do

I have been able to complete most of my proposal tasks. I faced some difficulties as the project progressed, but nonetheless most of the crucial work has been completed and merged. However, I couldn't complete Navigation of Results and Embedded Source Code Linting, which were mentioned in my project. I hope to continue working on them after my project is over, as the philosophy of GSoC is to imbibe the culture of contributing continuously to open source, not to be done with it in 3 months.

Navigation of Results was the idea that the user could go back and forth through results, but the current architecture in which coala presents results made this very difficult to achieve, and I couldn't come up with a good clean approach for it.

The problem with Embedded Source Code Linting was that there was no good approach for finding out what language is currently being analyzed. I did open an issue for a related task, in which the user would be presented with an error message on trying to use a wrong linter, but the status of the PR was changed to not happening/won't fix. This is because there is no single accurate approach for detecting the language; even MIME data fails most of the time. Due to these obstacles, these tasks weren't done, but I am sure we can come up with better alternatives by setting a more achievable target.

Other cool stuff I did

I wanted to add asciinema URLs to the bear documentation website, but there wasn't a single place where I could access them, as they had all been tweeted out. I wrote a script to grab all the tweets and filter out the ones which had an asciinema URL in them.

For fun, I also wrote a script which counts how many times you have been mentioned by @coala-analyzer. They are both located in the Twitter Scripts repository, in my branch.

Credits

A BIG shout-out to the tools which helped me achieve my tasks

Also thanks to my mentor ManoranjanP, co-mentor Lasse, Mischa, AbdealiJK and the rest of the amazing coala community for helping me throughout the project and providing amazing and helpful reviews on all my PRs. I plan to stay in touch with the coala community after GSoC ends, so that we can kick some ass again.

Happy Coding!

Distributed Results Class

Since the last update I have modified what the DistributedModel returns. Previously, it just returned the parameters, but I've changed things so that it now returns a results class instance, similar to the fit methods of other model classes. I've also made it possible for the user to change the results class used. The reasoning for this is that the sort of results we expect can change based on the methods used by the DistributedModel. For instance, since we default to a debiasing approach that uses the elastic_net code, we would expect a RegularizedResults instance as the default. This is what is set up, but I have also implemented a bare-bones DistributedResults class. Currently this doesn't have much, but it does allow for some flexibility if we want to add more down the line. To change the results class, you simply give an additional argument to DistributedModel: results_class=<ResultsClass>.

Significance Testing

Besides the work on distributed estimation, I have also put together a PR for the LASSO significance testing that was mentioned in the GSoC proposal. It is fairly straightforward. Currently, all I've added is the covariance test specified in http://statweb.stanford.edu/~tibs/ftp/covtest.pdf. This is built somewhat in parallel to the other contrasts used in statsmodels, but given that the setup is somewhat different I think this is important for now. I'd like to eventually go back and unify things somewhat, but that is beyond the scope of the GSoC work and may be something that carries over into the fall. At this point I just need to add the tests and I think it will be ready to go.

August 12, 2016

Karan_Saxena (Italian Mars Society)

Pre-final post

Sorry for this (very) late post.

My code was making progress, so I was trying to fit in as many optimisations as possible before putting up this post. Guess it's high time I did it.

More details in the final blog post :P

So my code is (finally!!) working. The steps are being tracked and I am now able to get all the coordinates of the feet.

Only the publishing of the output on Tango is left.

Onwards and upwards!!

aleks_ (Statsmodels)

Bugs again?!

Implementing tests for Granger-causality and instantaneous causality for Vector Error Correction Models (VECM) (compare chapters 3.6 and 7.6 in [1]) resulted in hours of bug-searching.

The search for bugs in my Granger-causality test started to come to an end when I realised that my results and those of the reference software JMulTi differed by a constant factor, namely T / (T - K*p - num_of_deterministic_terms). With this finding I could spot the source of the differing results: the covariance matrix of the residuals. While I was using the estimator with (T - K*p - num_of_deterministic_terms) in its denominator, JMulTi seems to use T as the denominator. Again, it took me some time to understand the reason behind this choice ... and I found it: as described in chapter 5.2 in [1], when it comes to parameter constraints in Vector AutoRegressive (VAR) models, it makes sense to use T as the denominator in the calculation of the mentioned covariance matrix estimator. And since there are such constraints under H0 of the Granger-causality test, it makes sense to use T as the denominator here, too. With this tiny adaptation of the code (i.e. dividing the matrix by the factor mentioned above) my results suddenly equaled those of JMulTi.
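In symbols (with K variables, lag order p, n_d deterministic terms and residuals u_t), the two estimators differ only in the denominator, which is exactly the constant factor observed:

```latex
\hat{\Sigma}_u = \frac{1}{T - Kp - n_d} \sum_{t=1}^{T} \hat{u}_t \hat{u}_t^{\top},
\qquad
\tilde{\Sigma}_u = \frac{1}{T} \sum_{t=1}^{T} \hat{u}_t \hat{u}_t^{\top},
\qquad
\hat{\Sigma}_u = \frac{T}{T - Kp - n_d}\,\tilde{\Sigma}_u
```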

The test for instantaneous causality also led to time-consuming bug-searching. However, I didn't really find a bug in my code. While desperately changing arbitrary values, I found that the corresponding test case passed when I based my test on a VAR(p+1) model. However, unlike the test for Granger causality, the test for instantaneous causality should be based not on a VAR(p+1) but rather on a VAR(p) model, according to [1]. I am curious whether this is a bug in JMulTi. To find out more I will try this test with other datasets too.

With that, thanks for reading and let's get back to work ... or have a good lunch first : )

[1] Lütkepohl, H. (2005): "New Introduction to Multiple Time Series Analysis"

mkatsimpris (MyHDL)

Documentation and Coverage Completed

The coverage, documentation, and synthesis results are in the PR. I am waiting for Chris to review them and tell me what to change.

meetshah1995 (MyHDL)

Verify -> Validate -> Vscale

As you can probably guess from the post title, the last few weeks were mostly about verifying and validating the vscale modules.

I developed unit tests for each module, and the ongoing work is to create unified tests for the entire assembly of modules. With the successful (*fingers crossed*) implementation of these tests, and some awesome documentation, the riscv module will finally come alive to be fully used by the MyHDL community :).

See you next week,
MS.

Utkarsh (pgmpy)

Markov Chain Monte Carlo: Metropolis-Hastings Algorithm

As discussed in my previous post, we can use a Markov chain to sample from some target probability distribution $p(x)$. To do so, it is necessary to design the transition operator $T(x \to x')$ of the Markov chain such that the stationary distribution $\pi(x)$ of the chain matches the target distribution. The Metropolis-Hastings sampling algorithm allows us to build such Markov chains.

Detailed Balance

To understand how Metropolis-Hastings enables us to construct such chains, we need to understand reversibility in Markov chains. In my previous post I briefly described reversibility as,

if the probability of a transition is the same as the probability of the reverse transition, then the chain is reversible.

Mathematically we can write this as:

$$\pi(x)\, T(x \to x') = \pi(x')\, T(x' \to x)$$

This equation is called detailed balance.

Now if the transition operator $T$ is regular (a Markov chain is regular if there exists some number $k$ such that, for every pair of states $x$ and $x'$, the probability of getting from $x$ to $x'$ in exactly $k$ steps is greater than 0) and it satisfies the detailed balance equation relative to $\pi$, then $\pi$ is the unique stationary distribution of $T$ (for a proof refer here).
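This statement can be checked numerically on a toy example. The following sketch (the 3-state chain and its distribution are made up for illustration) builds a transition matrix that satisfies detailed balance with respect to a chosen $\pi$ and confirms that $\pi$ is then stationary:

```python
import numpy as np

# Target (stationary) distribution for a hypothetical 3-state chain.
pi = np.array([0.2, 0.3, 0.5])

# Symmetric proposal over the 3 states.
S = np.full((3, 3), 1.0 / 3.0)

# Metropolis-style construction: move i -> j is accepted with probability
# min(1, pi[j] / pi[i]); rejected mass stays on the diagonal.
T = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            T[i, j] = S[i, j] * min(1.0, pi[j] / pi[i])
    T[i, i] = 1.0 - T[i].sum()

# Detailed balance: pi_i T_ij == pi_j T_ji for all i, j.
assert np.allclose(pi[:, None] * T, (pi[:, None] * T).T)

# Hence pi is stationary: pi T == pi.
assert np.allclose(pi @ T, pi)
```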

Metropolis-Hastings Algorithm

Let $\pi(x)$ be the desired stationary distribution, which matches the target probability distribution $p(x)$. Let $x$ and $x'$ be any two states belonging to the state space of the Markov chain. Now using the detailed balance equation

$$\pi(x)\, T(x \to x') = \pi(x')\, T(x' \to x)$$

which can be re-written as:

$$\frac{T(x \to x')}{T(x' \to x)} = \frac{\pi(x')}{\pi(x)}$$

Now, we will separate the transition into two sub-steps (I’ll explain why in a moment): the proposal and the acceptance-rejection. The proposal distribution $Q(x' \mid x)$ is the probability of proposing a state $x'$ given $x$, and the acceptance probability $A(x' \mid x)$ is the conditional probability of accepting the proposed state $x'$. The transition probability can be written as the product of both:

$$T(x \to x') = Q(x' \mid x)\, A(x' \mid x)$$

Using this relation we can re-write the previous equation as:

$$\frac{A(x' \mid x)}{A(x \mid x')} = \frac{\pi(x')\, Q(x \mid x')}{\pi(x)\, Q(x' \mid x)}$$

Now since $A(x' \mid x)$ lies in $[0, 1]$, and we want to maximize the acceptance of the new proposed state, we choose the acceptance probability as

$$A(x' \mid x) = \min\left(1,\ \frac{\pi(x')\, Q(x \mid x')}{\pi(x)\, Q(x' \mid x)}\right)$$

Now the acceptance probability is the probability of accepting the new proposed state, so whenever it equals $1$ we accept the new proposed state. But what about the case when the acceptance probability lies in $[0, 1)$, i.e. is less than $1$? In such cases we draw a random sample from $\mathrm{Uniform}(0, 1)$, and if the acceptance probability is higher than this number we accept the new state, otherwise we reject it. In some places this criterion is called the Metropolis acceptance criterion.

In a nutshell, we can write the Metropolis-Hastings algorithm as the following procedure:

1. Initialisation: Pick an initial state $x$ at random

2. Randomly pick a new proposed state $x'$ according to $Q(x' \mid x)$

3. Accept the state according to the Metropolis acceptance criterion. If the state is accepted, set the current state to $x'$; otherwise keep it at $x$. Yield this current state as a sample

4. Go to step 2 until the required number of samples has been generated.
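The four steps above can be sketched generically as follows. This is a rough illustration, not pgmpy's implementation: it assumes a symmetric proposal (so the $Q$-ratio cancels) and an un-normalized target density, and all names are mine:

```python
import math
import random

# Generic Metropolis sketch: symmetric proposal, un-normalized target.
def metropolis_hastings(target, proposal, x0, num_samples, rng=random.Random(0)):
    samples, x = [], x0
    for _ in range(num_samples):
        x_new = proposal(x, rng)                  # step 2: propose x'
        a = min(1.0, target(x_new) / target(x))   # acceptance probability
        if rng.random() < a:                      # step 3: accept/reject
            x = x_new
        samples.append(x)                         # yield current state
    return samples

# Example: sample a standard normal, up to its normalizing constant.
target = lambda x: math.exp(-0.5 * x * x)
proposal = lambda x, rng: x + rng.uniform(-1.0, 1.0)  # symmetric random walk
samples = metropolis_hastings(target, proposal, 0.0, 20000)
mean = sum(samples) / len(samples)  # should be near 0
```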

There are a few attractive properties of the Metropolis-Hastings algorithm which may not be visible at first sight.

• First, the use of a proposal distribution for sampling. The advantage of using a proposal distribution is that it allows us to indirectly sample from the target distribution when it is too complex to sample from directly.

• Secondly, our target distribution doesn’t need to be normalized. We can use an un-normalized target distribution and our samples will be just as good as in the normalized case. If you look carefully at the calculation of the acceptance probability, we only use a ratio of target densities, so the normalizing constant cancels out. Computing the normalizing constant is itself difficult (it requires numerical integration).
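A quick numerical check of this cancellation, using the beta prime density that appears later in the post (pure stdlib; the two sample points are arbitrary):

```python
import math

# Un-normalized beta prime density (the 1/B(a, b) constant is dropped).
f = lambda x, a, b: x**(a - 1) * (1 + x)**(-a - b)

# Normalized density, with B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b).
def f_norm(x, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return f(x, a, b) / B

a, b = 5, 3
x, x_prime = 0.7, 1.4  # arbitrary current and proposed states

# The acceptance probability uses only a ratio of target densities,
# so both ratios agree regardless of the normalizing constant:
assert abs(f(x_prime, a, b) / f(x, a, b)
           - f_norm(x_prime, a, b) / f_norm(x, a, b)) < 1e-12
```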

Now the reason to split the transition probability should be clear: it allows us to take advantage of the proposal distribution.

Enough of this theory; let’s now use this algorithm to draw samples from the beta prime distribution.

The probability density function of the beta prime distribution is defined as:

$$f(x; a, b) = \frac{x^{a-1}(1+x)^{-a-b}}{B(a, b)}$$

where $B(a, b)$ is the Beta function. We will ignore this normalizing constant.

Since the beta prime distribution is defined for $x > 0$, we will choose our proposal distribution to be the exponential distribution

$$Q(x' \mid x) = \lambda e^{-\lambda x'},$$

where the parameter $\lambda$ controls the rate of the distribution.

We will define our proposal distribution so that its rate $\lambda$ is the previous sample value $x$.

import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt

# Un-normalized beta prime density
beta_prime = lambda x, a, b: x**(a-1)*(1+x)**(-a-b)

# Exponential proposal density Q(value | scale) with rate `scale`
q = lambda x, scale: scale*np.exp(-scale*x)

def mcmc_beta_prime(num_samples, a, b, warm_up):
    np.random.seed(12345)
    samples = []
    x = np.random.exponential(1)  # The initial state x
    for i in range(num_samples):
        samples.append(x)
        x_prime = np.random.exponential(1/x)  # The new proposed state x'
        # Hastings correction factor Q(x | x') / Q(x' | x)
        factor = q(x, x_prime)/q(x_prime, x)

        # The acceptance probability
        A = min(1, factor * beta_prime(x_prime, a, b) / beta_prime(x, a, b))

        # Accepting or rejecting based on the Metropolis acceptance criterion
        u = np.random.uniform(0, 1)
        if u < A:
            x = x_prime
    return samples[warm_up:]  # Discard samples from the initial warm-up period

# Plots the actual beta prime density against the sampled histogram
def plot_beta_prime_and_samples(a, b):
    plt.figure()
    x = np.linspace(0, 100, 10000)
    y = [ss.betaprime.pdf(x_i, a, b) for x_i in x]
    plt.plot(x, y, label='Real distribution: a='+str(a)+',b='+str(b))
    plt.hist(mcmc_beta_prime(100000, a, b, 1000), normed=True, histtype='step',
             bins=100, label="Simulated MCMC")
    plt.xlim([0, 5])
    plt.ylim([0, 2])
    plt.legend()
    plt.show()
    plt.close()

plot_beta_prime_and_samples(5, 3)


As we can see, our sampled beta prime values closely resemble the beta prime distribution.

The Metropolis-Hastings algorithm is a Markov chain Monte Carlo algorithm that can be used to draw samples from both discrete and continuous probability distributions of all kinds, as long as we can compute a function f that is proportional to the density of the target distribution. But one disadvantage of the Metropolis-Hastings algorithm is its poor convergence rate. Let’s look at an example to understand what I mean by “poor convergence”. In this example we will draw samples from a 2D multivariate normal distribution.

The multivariate normal distribution is written as $\mathcal{N}(\mu, \Sigma)$, where $\mu$ is the mean vector and $\Sigma$ is the covariance matrix.

The probability density at any point $x$ is given by:

$$p(x) = \frac{1}{Z} \exp\!\left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1} (x-\mu)\right)$$

where $Z = \sqrt{(2\pi)^{k} \lvert\Sigma\rvert}$ is the normalizing constant.

Our target distribution will have

mean $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$

and covariance $\Sigma = \begin{pmatrix} 1 & 0.97 \\ 0.97 & 1 \end{pmatrix}$

Our proposal distribution will be a multivariate normal distribution centred at the previous state with unit covariance, i.e.

$$Q(x' \mid x) = \mathcal{N}(x, I),$$

where $I$ is the $2 \times 2$ identity matrix.

import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt

# Defining the target probability density
def p(x):
    sigma = np.array([[1, 0.97], [0.97, 1]])  # Covariance matrix
    return ss.multivariate_normal.pdf(x, cov=sigma)

# Defining the proposal density
def q(x_prime, x):
    return ss.multivariate_normal.pdf(x_prime, mean=x)

samples = np.zeros((1000, 2))
np.random.seed(12345)
x = np.array([7, 0])
for i in range(1000):
    samples[i] = x
    x_prime = np.random.multivariate_normal(mean=x, cov=np.eye(2), size=1).flatten()
    acceptance_prob = min(1, (p(x_prime) * q(x, x_prime)) / (p(x) * q(x_prime, x)))
    u = np.random.uniform(0, 1)
    if u <= acceptance_prob:
        x = x_prime

plt.figure()
plt.scatter(samples[:, 0], samples[:, 1], label='MCMC samples', color='k')
plt.plot(samples[0:100, 0], samples[0:100, 1], 'r-', label='First 100 samples')
plt.legend()
plt.show()


In the plot we can see that the Metropolis-Hastings algorithm takes time to converge towards the target distribution (slow mixing). Like Metropolis-Hastings, many MCMC algorithms suffer from this slow mixing. Slow mixing happens because of a number of factors: the random-walk nature of the Markov chain, the tendency to get stuck at a particular sample, and sampling only from a single region of high probability density. In my next post we will look at some more advanced MCMC techniques, namely Hybrid Monte Carlo (Hamiltonian Monte Carlo / HMC) and the No-U-Turn Sampler (NUTS), which let us explore the target distribution more efficiently.

In the examples of the next post I will use my own implementations of HMC and NUTS (which I implemented in pgmpy), so they will require a recent installation of pgmpy in the working environment. For installation instructions you can look here.

jbm950 (PyDy)

GSoC Week 13

This week I spent a lot of time working on FeatherstonesMethod and its component parts. I started off by moving a bunch of spatial vector functions from another PR I have to the featherstone PR, and used some of those functions to calculate the spatial inertia of Body objects. The next thing I worked on was completely rewriting the internals of the joint code. The joints now consist of 4 reference frames and points (one set at each of the bodies' mass centers and one set per body at the joint location).

After this I ran some basic code that used these new features and kept making changes until the code was able to run without producing errors. I used the same method of work with FeatherstonesMethod, and now it too is able to run without producing errors. Now that the code runs, it is time to make sure the output is correct, which is much more involved than the previous step. To begin, I solved for the spatial inertia by hand and used this calculation to create test code for Body.spatial_inertia. As expected, the code was initially completely incorrect, but it now passes the test. I have since been working on tests for the joint code. Since this code is completely new to the sympy repository, it takes a lot more planning than the body test did. I also need to solve the kinematics for the joints by hand so that I have a baseline for the test code. This is where I currently am in the process.

Also this week I addressed review comments on SymbolicSystem and moved that PR closer to being mergeable. One of the current hang-ups is trying to force Sphinx to autodocument the __init__ method. I think the best solution currently is to move the relevant code back to the main docstring of the class and not worry about trying to document the __init__ method.

While working on rewriting the joint code I came across a bug in frame.py and have created a docstring with a fix to this along with a test to make sure the fix works.

Lastly I reviewed a PR that adds a docstring to a method that did not yet have a docstring. The PR had some information in it that was incorrect and after some research I was able to make some suggestions for its implementation.

Future Directions

Next week is the last full week of GSoC and my main priority is getting the final evaluation information correctly finished so that the work can be processed correctly. My next goal is to make sure SymbolicSystem gets merged into SymPy. This is not entirely in my hands, however, as I will have to wait for feedback, so while waiting I will be pulling different parts off FeatherstonesMethod for separate PR’s at the recommendation of my advisor. I hope to include these separate PR’s in my final evaluation.

PR’s and Issues

• (Open) [WIP] Added system.py to physics/mechanics PR #11431
• (Open) [WIP] FeatherstonesMethod PR #11415
• (Open) Added docstring to jordan_cell method PR #10356

August 11, 2016

Moving the Panel

The panel that we began working on last week has finally been completed. Here are the major changes:

1. Every UI element now has a set_center function. It does exactly what it says: it sets the center of the UI element to wherever we want.
2. We now store the relative positions of the elements within a panel, so that when the panel moves around, the elements can be re-allotted centers in accordance with the panel’s new center.
3. Major changes in how sliders work, mainly to facilitate the movement of individual slider elements when the slider as a whole is moved.

Using the above, we can now move the panel around.
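The relative-positioning idea above can be sketched in a few lines of Python. This is only an illustration of the mechanism; apart from set_center, all class and method names here are hypothetical and differ from the actual project code:

```python
# The panel stores each element's offset relative to its own center, so
# moving the panel just re-derives every element's absolute center.

class Element:
    def __init__(self):
        self.center = (0.0, 0.0)

    def set_center(self, x, y):
        self.center = (x, y)

class Panel:
    def __init__(self, center):
        self.center = center
        self.elements = []  # list of (element, relative offset) pairs

    def add(self, element, offset):
        self.elements.append((element, offset))
        self._reposition()

    def set_center(self, x, y):
        self.center = (x, y)
        self._reposition()

    def _reposition(self):
        cx, cy = self.center
        for element, (dx, dy) in self.elements:
            element.set_center(cx + dx, cy + dy)

panel = Panel((100.0, 100.0))
slider = Element()
panel.add(slider, (10.0, -20.0))
panel.set_center(50.0, 50.0)  # the slider follows the panel
# slider.center is now (60.0, 30.0)
```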

Moving the Panel Around

Aligning the Panel

Now, the panel can be left-aligned or right-aligned to the window. Left-alignment means that its position with respect to the left window boundary will remain constant. Similarly for right-alignment. This was done using the set_center in the above step and window modification events.

A Right-Aligned Panel

What’s next

GSoC ends in less than a couple of weeks. In this final sprint the following needs to be done:

• A file dialog needs to be built. For now, we’ll be saving and opening files.
• Refactoring for existing code.
• Making PRs into master.

Shridhar Mishra (italian mars society)

Final week

Things done:

• Integrated Tango and PyKinect2.
• Can get skeleton data along with RGB and other information like depth.
• Skeleton coordinates are available in numpy format, which is compatible with PyTango.
• Client-side code done.
• IPC between C# and Python in place.

Things to do:

• Fix a bug related to an incompatible data type for Push_change_event on the Tango server.
• Test the system for transmission.

mr-karan (coala)

• The coding period is about to end and this is the final week to clean up all work and prepare documentation. I have completed most of my tasks and also got the coala-bears docs website merged into the master repo. The website underwent a major overhaul since last week and a lot of changes have been implemented. A few of the changes include:
• Linking to any page works without the need for Get Started button.
• Improved design of cards with more info present on card-reveal option.
• Implemented fuzzy search filter
• A small text added which shows the filter activated
• Add an option to reset the filter
• Changed color scheme & improved UI a bit

You can have a look here https://youtu.be/lcZUh-US8TU

• For the docs, I needed all available bears’ Asciinema URLs, which were present on @coala, so I hacked up a quick script to grab the URLs from tweets and created a new ASCIINEMA_URL attribute in coala. Here’s the PR, and for coala-bears: here

• I have also refactored the Syntax Highlighting PR based on the reviews I got. There were a few formatting issues and the PR is likely to get approved soon-ish. Hope it makes it to the coala 0.8 release on time. :smile:

• I also created VultureBear, which performs dead code analysis on your Python code. Check it out here:

• I am also working on RustLintBear, which will be the first bear in coala to support the Rust language.

• Thanks to Lasse, I am now the maintainer of coala engagement-related tasks.

I’ll be wrapping up my work this week and will be submitting a document which will have links to all my commits during the GSoC period.

Happy Coding!

August 10, 2016

mike1808 (ScrapingHub)

GSOC 2016 #5: Creating bridges

Last week I started working on a killer feature for Splash. It will allow you to write Lua scripts using almost the same Element (Node, HTMLElement) API as in JavaScript, plus some additional helpful methods.

For example, suppose you want to save a screenshot of an image once it has loaded. Here is the script for it:

function main(splash)
    assert(splash:go(splash.args.url))
    assert(splash:wait(1))

    local shots = {}

    local element = splash:select('#myImage') -- selecting the element by its CSS selector
    element.onload = function(event)          -- attaching the event listener
        event:preventDefault()
        table.insert(shots, element:png())    -- taking a screenshot of the element
    end

    return shots
end


The Element API is still in development and may change.

JS <-> PyQt <-> Python <-> Lua

Let’s see how the communication between JS and Lua is implemented. Imagine that we are going to execute the following Lua code:

element:click()


Lua

splash is a table which has a metatable whose prototype is Splash. In Lua this means that splash is an instance of the Splash class. The click method is wrapped in several Lua functions. After executing those functions, we eventually call the click Python method. This is possible because of the Lupa runtime for Lua, which allows injecting Python methods into Lua code.

Python

click is a method of the _ExposedElement Python class, which contains all the methods and properties that can be accessed from Lua. It binds Python functions to Lua functions.

Let’s return to our click method. It does the following when it’s called:

• it calls private_node_method, passing the "click" string, which means that we want to call the click method of our JavaScript DOM element;
• private_node_method is another method of _ExposedElement, and it calls the node_method method of the self.element object, which is an instance of the HTMLElement class;
• HTMLElement is a class which has an API for communicating with the JavaScript HTMLElement;
• HTMLElement#node_method calls the PyQt method evaluateJavaScript() with the following JS code:
window[elements_storage][element_id]["click"]()

• where:
• elements_storage is our elements storage, a PyQt object which allows us to save DOM elements for further access;
• element_id is a unique ID which allows us to identify our element object;
• "click" is the name of the method we want to call.

The elements storage is added to the JS window object using the addToJavaScriptWindowObject method of PyQt.

So, our Python self.element is connected to the JS node using the element_id.

PyQt

PyQt gives us a WebKit runtime environment in our Python application. Using addToJavaScriptWindowObject we can add instances of QObject to the JS window object, which in turn allows us to call Python methods from JS.

JS

In JS our node can be accessed through window[storage_name][element_id] object.

This flow is fine for one direction: from Lua to JS. But what if we want to call a Lua function from JS? That can happen when we assign an event handler for some event. In our first example we assigned an event handler for the load event.

JS -> Lua

Let’s examine this code:

element.onload = function(event)
    event:preventDefault()
    table.insert(shots, element:png())
end


We assign an event handler for the load event of our element. How does it work?

1. When the onload property of element is assigned, the __newindex metamethod of element is called.
2. This metamethod checks whether the requested property has the 'on' prefix. If it does, we call the private method set_event_handler of element.
3. In its turn, set_event_handler calls the Python method private_set_event_handler of _ExposedElement, passing the event name for which we want to assign a handler and a reference to the handler function itself.
4. The crazy part starts here. We wrap our Lua function in a Lua coroutine, which will allow us to execute it when the event is fired.
5. We pass that coroutine to the set_event_handler method of the HTMLElement Python class.
6. It saves that coroutine in another storage, called the event handlers storage, and returns its ID.
7. Using the PyQt evaluateJavaScript() method we execute the following JS code:
window[elements_storage][element_id].onload = function(event) {
    window[event_handlers_storage].run(
        event_handler_id
    )
}


You may wonder: what does window[event_handlers_storage].run do, and what is window[events_storage]?

• window[event_handlers_storage].run
• calls the run method of our event handlers storage (which was injected in the same way as the elements storage);
• using the specified event_handler_id, that method calls the saved coroutine;
• that coroutine calls the Lua function that was assigned to the onload property of our element Lua table;
• window[events_storage]
• is another storage, but this time for events;
• its main reason for existing is calling a few methods of our event (preventDefault, stopPropagation, etc.).

As you can see, in order to access our DOM element we have to go through all the layers until we reach JS.
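The two storages described above can be modeled in plain Python to make the flow concrete. The real storages are QObject instances exposed to the page via addToJavaScriptWindowObject; the class, names, and IDs below are hypothetical illustrations:

```python
import itertools

# A plain-Python model of an ID-keyed storage like the elements storage
# or the event handlers storage.
class Storage:
    def __init__(self):
        self._items = {}
        self._ids = itertools.count()

    def add(self, item):
        item_id = str(next(self._ids))
        self._items[item_id] = item
        return item_id  # the ID later used to look the item up

    def get(self, item_id):
        return self._items[item_id]

elements = Storage()  # element_id -> DOM element
handlers = Storage()  # event_handler_id -> Lua coroutine (here: a callable)

element_id = elements.add("<img id='myImage'>")
fired = []
handler_id = handlers.add(lambda event: fired.append(event))

# What window[event_handlers_storage].run(event_handler_id) amounts to:
handlers.get(handler_id)("load")
assert fired == ["load"]
```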

Over the following days I will finish writing tests and documentation for the newly created Element object. I will also try to refactor those classes and methods to make the Lua <-> JS path simpler.

August 08, 2016

Yen (scikit-learn)

Workaround to use fused types class attributes

In some of my previous blog posts , we’ve seen Cython fused types’ ability to dramatically reduce both memory usage and code duplication.

However, of course there are still some deficiencies in Cython fused types. In this blog post, we are going to see one of the biggest inconveniences of fused types and how to address it with a hacky workaround.

Fused Types Limitation

When you visit the official page on Cython fused types, you can easily find the following warning:

Note Fused types are not currently supported as attributes of extension types. Only variables and function/method arguments can be declared with fused types.

It means that if you have a class written in Cython such as the following:

cdef class Printer:
    cdef float num

    def __init__(self):
        self.num = 0
        print self.num


you are not allowed to use fused types to make the class more generic. For example, suppose we change the type of the attribute num from float to floating in the above code snippet:

Note: cython.floating can either refer to float or double.

from cython cimport floating

cdef class Printer:
    cdef floating num

    def __init__(self):
        self.num = 0
        print self.num


It will result in the error below, since fused types can’t be used as extension type attributes:

Fused types not allowed here.


Intuitive Solution

Based on my previous experience with Cython fused types and a suggestion from my mentor Joel, the intuitive approach is to declare the attribute we want to be a fused type as void*, and then typecast it in every function where it is accessed.

To be more concrete, let’s look at the code:

from cython cimport floating

cdef class Printer:

    # We wish num to be a fused type, so declare it as void*
    cdef void *num

    def __init__(self):
        cdef float num = float(5)
        self.num = &num

        # Typecast it when we want to access its value
        cdef floating *num_ptr = <floating*>self.num
        print num_ptr[0]


However, the above code will again result in an error, due to an unwritten rule of fused types:

Fused types can only be used in a function when at least one of its arguments is declared with that fused type.

This rule exists because Cython fused types work by generating multiple C functions, with each function's name encoding the actual type it refers to. If the fused type does not appear in the function's signature, the generated functions would all share the same name, which causes an error.

Workaround

Based on the above unwritten rule, here’s the final workaround we can adopt:

from cython cimport floating

cdef class Printer:

    # We wish num to be a fused type, so declare it as void*
    cdef void *num

    cdef bint is_float

    # Dummy attributes, used only as fused type arguments
    cdef float float_sample
    cdef double double_sample

    def __init__(self):
        cdef float num = float(5)
        self.num = &num

        if type(num) == float:
            self.is_float = True
        else:
            self.is_float = False

        if self.is_float:
            self._print(self.float_sample)
        else:
            self._print(self.double_sample)

    # Underlying function
    def _print(self, floating sample):
        # Typecast it when we want to access its value
        cdef floating *num_ptr = <floating*>self.num
        print num_ptr[0]


As you can see, we also have to modify the functions through which the attribute is accessed: keep the original function signature as a wrapper, and introduce a fused type argument in the underlying implementation function.

That’s it. Although it looks really hacky, it works! I hope Cython adds this functionality soon.

Summary

Please leave any thoughts you have after reading; let’s push Cython’s limits together!

SanketDG (coala)

Things that needed fixing.

This week I am going to talk about the things that I have been working on, mainly on PR #2423. This is the second last week of GSoC and I fixed a lot of quirks that I was facing for the first two weeks.

The first one deals with opinionated documentation styles. Most projects follow this style of documentation:

:param x:           blablabla
:param muchtoolong: blablabla


While this format is used in most projects, it requires a lot of maintenance (and patience). One extra line or a few extra words and you have to literally “re-design” the entire documentation comment. But there is another, life-saving style.

:param x:
blablabla
:param muchtoolong:
blablabla


When I was writing the parsing algorithm for extracting documentation metadata, I had completely forgotten about this style, and thus when I took the algorithm for a test drive, it indeed failed. The bug and the solution were both simple. The algorithm expects a space after the metadata symbols, which wouldn’t ideally happen in the second style. Thus, removing the space clearly solves the problem and parses everything correctly. This affects parsing of the first style in a small way, where we now have to account for an extra space.

In the future, I am hoping to improve this: the current string-searching approach is not the most efficient, so I am thinking of slowly transitioning to regexes.

Another tiny bug that I found was in the documentation extraction algorithms that were already implemented by my mentor @Makman2.

To talk about this, I need to explain what a documentation marker means in my project. It’s basically a 3-element tuple of strings that defines how a documentation comment starts, continues and ends.

So for a python docstring it would look something like ('"""', '', '"""'). For javadoc, it would look like ('/**', ' *', ' */').
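To make the marker idea concrete, here is a hypothetical sketch of marker-based extraction (my own simplification, not coala's actual implementation):

```python
def extract_documentation(source, marker):
    # marker is a (start, each-line, end) triple, e.g. ('/**', ' *', ' */').
    # A hypothetical sketch of marker-based extraction, not coala's real code.
    start, each_line, end = marker
    docs, current = [], None
    for line in source.splitlines():
        stripped = line.strip()
        if current is None:
            if stripped.startswith(start):
                current = []
        elif stripped.startswith(end.strip()):
            docs.append("\n".join(current))
            current = None
        else:
            # Drop the per-line marker (e.g. the leading '*' in javadoc).
            if each_line and stripped.startswith(each_line.strip()):
                stripped = stripped[len(each_line.strip()):].lstrip()
            current.append(stripped)
    return docs
```

Note how an empty middle marker (the python docstring case) means continuation lines are taken verbatim, which is exactly where the empty-line bug described below can bite.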

Now the bug was that for documentation comments identified with no middle marker, i.e. marker[1] == '', it was completely ignoring lines that only contained a \n, i.e. empty lines. This would lead to wrong parsing. The solution (for now) was a simple if-statement to insert a newline whenever an empty line was found.

Also, I fixed escaping. Although I am still not sure that the solution is bulletproof and would work for all cases, it’s good enough. Also, it turns out that I had been doing the setting extraction wrong, getting the escaped value (and not the unescaped one).

Also, I removed some code! As a developer, it feels great to remove more and more lines of code that you don’t need. First, I removed a relatively useless exception handling in the parsing algorithm.

Second, I moved one of the functions that loaded a file and returned its lines as a list. It was being used by the testing classes in three separate files, so it was moved to a new file TestUtils.py, from where it is now imported. REFACTOR EVERYTHING!!!

Lastly, now DocumentationComment requires a DocstyleDefinition object, instead of language and docstyle (which I always thought was redundant). This kind of falls in refactoring, and thus more removing!

Coming to things that were added, I finalized the design of the assembling functions, with the help of my mentor. So we decided on having two functions. One constructor-like function that would just arrange the documentation text from the parsed documentation metadata. So it wouldn’t be responsible for the markers and indentation. It returns a DocumentationComment object that contains all the required things for the final assembling. This function could sometimes act like a constructor, where it takes parsed metadata and spits out a readymade DocumentationComment object ready for use.

The final assembling function just assembles the documentation taking into account the markers and indentation. It returns the assembled documentation comment as a string that can be added/updated in files. While developing this, I actually found out that my algorithm for doing this was totally buggy and would not work for a lot of corner cases, so I am in the process of working them out.

Also, on a side note, I figured out the metadata settings. This is important because implementing some variable functionality as settings gives the user the freedom to define what they want to parse. Right now the concept is in its infancy; for example, the settings for a conventional python docstring would look like:

param_start = :param\ # here's a space
param_end = :
return_sep = :return:


That’s all for this blog post, I guess. I am almost done with the work in the core repo. I can finally start developing some cool bears!


mkatsimpris (MyHDL)

Documentation

Today I started writing the documentation for all of my modules and the complete frontend part, using Sphinx. As the backend is not ready yet, I will fill the time with this task.

Riddhish Bhalodia (dipy)

Brain Extraction Walkthrough!

Over the last week and the coming few weeks I will be working on polishing all three of my PR’s, i.e. adaptive denoising, local PCA denoising and robust brain extraction.

I have already described the tutorials of the local PCA and adaptive denoising in one of the previous blogposts (here), so in this one I will focus on explaining the brain extraction tutorial, and then describe what is left to do and the new exciting directions that are the real output of this Google Summer of Code project.

Brain Extraction Walkthrough!

The brain extraction which we developed takes help from template data (a T1 image with skull and its corresponding brain mask). So let us first load the related modules.

We need the affine information as the algorithm for the brain extraction performs image registration as one of its major steps (here).

Now we apply the brain_extraction function, which takes the input data, the template data and the template mask, along with their affine information. There are five other parameters which can be passed to the function.

The same_modality parameter takes the boolean value True if the input and template are of the same modality and False if they are not. When it is False, the only relevant parameters are patch_radius and threshold; the rest are used only when the modalities are the same.

The patch_radius and block_radius are the inputs for the block-wise local averaging which is used after the registration step in the brain extraction. The parameter value, which defaults to 1, governs the weighting, and the threshold value governs the eroded boundary coefficient of the extracted mask. For more info on how these parameters work, please look at the fast_patch_averaging function in dipy.segment.

First we look at the input and template with the same modality (both are T1 images)

The data we have used for this experiment is the IBSR database, which has manually segmented brain masks as well. This is good because we can compare the output of our brain extraction with their manual mask (from the above figure we can see that the algorithm does a pretty good job).

So to compare the two masks we use Jaccard’s Measure as follows
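The comparison snippet was rendered as an image in the original post; with numpy, the Jaccard index of two binary masks can be computed along these lines (my own sketch, not dipy's code, and the function name is made up):

```python
import numpy as np

def jaccard_index(mask1, mask2):
    # Jaccard index |A intersect B| / |A union B| of two binary masks.
    # 1.0 means the masks are identical, 0.0 means they are disjoint.
    mask1 = np.asarray(mask1, dtype=bool)
    mask2 = np.asarray(mask2, dtype=bool)
    intersection = np.logical_and(mask1, mask2).sum()
    union = np.logical_or(mask1, mask2).sum()
    return intersection / union
```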

For the above image we get a Jaccard index of 0.8428, which means the result is very close to the manually extracted mask.

Now we look at how the brain extraction behaves when we choose two images of different modalities. In this case our template is of T1 modality and the input image is of B0 modality.

This is the whole brain extraction tutorial. To give an idea of how fast this algorithm works, I have listed the runtimes along with the data sizes:

[A] For same_modality = True

Input T1 volume :  (256, 256, 128)
Template volume : (193, 229, 193)
Time taken : 521.42 seconds

[B] For same_modality = False

Input B0 volume : (128, 128, 60)
Template volume : (193, 229, 193)
Time taken : 43.98 seconds

This concludes this blog. In the coming week I will put up one of my last GSoC blogs, which will summarize the project and point to the new directions that have emerged from these 3 months.

Thank You

kaichogami (mne-python)

GSoC Summary

GSoC 2016 is almost coming to an end. I got a very nice opportunity to connect and work with extremely knowledgeable, experienced and helpful people. Over the three months I learnt about brain signals and machine learning, got experience designing an API in collaboration with a community, talked to prominent researchers and developers, and lastly left at least a tiny contribution to the big open source world.
My project involved making the decoding module of MNE compatible with scikit-learn, mainly enabling its classes to be pipelined and evaluated with cross_val_score. I started coding after discussing the API with Jean and Dennis. My first task involved refactoring the Xdawn algorithm. It was done successfully, though I took more time than it should have needed. Following that, I implemented two other classes, one of which was an improvement on Jean’s work.

There is still a lot of work to do with decoding, which I will follow up on after GSoC. Everyone in the community is extremely helpful and forgiving to a newcomer like myself. All my mentors responded to my queries and guided me without any delay. I am especially thankful to Jean, who never showed any hesitation in helping me.

August 07, 2016

GSoC '16: Update

Hello again!

Big advancements and changes for this update.

I have almost got my whole project merged! It is in the very last stages with one or two tiny changes to make and then it's done!

There have been a few changes design-wise:

• The number of questions has been reduced to just one: this is the ultimate quickstart setup. You just need to give the project directory now and the coafile will be automatically generated. No interaction from the user at all!
Basically, the question asking the user for files to match now defaults to everything, and the files to ignore are automatically identified from the gitignore file. Pretty neat huh?

• No more complicated section globs. Instead of having an unnecessarily long section, we're now generating concise globs that virtually do the same thing.

• Settings filling: instead of leaving the mandatory settings to be asked for at runtime, we're now prompting the user for the values at coafile generation itself. This is more logical.
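The .gitignore-based ignore detection mentioned above can be sketched roughly like this (a hypothetical simplification, not the actual coala-quickstart code):

```python
def ignore_globs_from_gitignore(lines):
    # Roughly translate .gitignore entries into coafile-style ignore globs.
    # A hypothetical sketch, not the actual coala-quickstart implementation.
    globs = []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith('#'):
            continue  # skip blank lines and comments
        if entry.endswith('/'):
            globs.append(entry + '**')   # directory -> everything below it
        else:
            globs.append('**/' + entry)  # file pattern anywhere in the tree
    return globs
```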

Here's the coafile generated when I ran coala-quickstart on coala-quickstart's project directory:

[default]
bears = LineLengthBear, LineCountBear, SpaceConsistencyBear, InvalidLinkBear, KeywordBear, FilenameBear
files = **.py, **.yml, **.rst, **.c, **.js
ignore = .git/**, **/build/**, **/htmlcov/**, htmlcov/**, **/src/**
max_lines_per_file = 1000
use_spaces = True
cs_keywords, ci_keywords =

[python]
bears = CPDBear, PyCommentedCodeBear, RadonBear, PyUnusedCodeBear, PEP8Bear, PyImportSortBear, PyDocStyleBear, PyLintBear
files = **.py
language = python

[yaml]
bears = YAMLLintBear
files = **.yml

[restructuredtext]
bears = reSTLintBear
files = **.rst

[c]
bears = GNUIndentBear, ClangASTPrintBear, CPPCheckBear, CSecurityBear, ClangBear, ClangComplexityBear
files = **.c

[javascript]
bears = CPDBear, ESLintBear, JSComplexityBear, JSHintBear
files = **.js
language = python


I really like this: this was how I envisioned the coafile to look like originally and it's panning out even better.

I'm now in the last week of my project. I'm expecting the PR to be merged today and then I'll be focussing on the prototype I have for guessing each bear's params. I'll make an update post again next week.

Till then,

GSoC '16: Final Report

GSoC 2016 was one of the best things I've had the opportunity to participate in. I've learned so much, had a lot of fun with the community the whole time, got to work on something that I really like and care about, got the once-in-a-lifetime opportunity to visit Europe, and still get paid in the end. And none of this would have been possible without the support and help from the coala community as a whole. Especially Lasse, who was my mentor for the program, from whom I've learned so, so much. And Abdeali, who introduced me to coala in the first place and helped me get settled in the community. It honestly wouldn't have been possible without any of them, and I really mean it. Seriously, thank you :)

List of commits I've made over the summer

The last three months have been action packed. Check 'em out for yourself:

coala-quickstart

Commit SHA Commit
b8d8349 Add tests directory for testing
df99516 py.test: Execute doctests for all modules
3d01aed Create coala-quickstart executable
28a33f9 Add coala bear logo with welcome message
759e445 generation: Add validator to ensure path is valid
111d984 generation: Identify most used languages
4ace132 generation: Ask about file globs
8f7fe23 generation: Identify relevant bears and show help
839fa19 FileGlobs: Simplify questions
7c98e48 Settings: Generate sections for each language
b28e20c Settings: Write to coafile
69a5d2f Generate coafile with basic settings
60bee9a Extract files to ignore from .gitignore
62978ad Change requirements
36c8486 Enable coverage report
d78e85e Bears: Change language used in tests
4a8819e setup.py: Add myself to the list of maintainers
54f21c6 gitignore: Ignore .egg-info directories
6a7b63a Bears: Use only important bears for each language

coala

Commit SHA Commit
45bfec9 Processing: Reuse file dicts loaded to memory
ef287a4 ConsoleInteraction: Sort questions by bear
7d57784 Caching: Make caching default
1732813 Processing: Switch log message to debug
01890c2 CachingUtilitiesTest: Use Section
868c926 README: Update it
f79f53e Constants: Add strings to binary answers
2d7ee93 LICENSE: Remove boilerplate stuff
da6c3eb Replace listdir with scandir
ad3ec72 coalaCITest: Remove unused imports
91c109d Add option to run coala only on changed files
5a6870c coala: Add class to collect only changed files
622a3e5 Add caching utilities
e1b3594 Tagging: Remove Tagging

coala-utils

Commit SHA Commit
27ee83c Update version
64b0e0b Question: Validate the answer
1046c29 VERSION: Bump version
bd1e8fa setup.cfg: Enable coverage report
79fee96 Question: Use input instead of prompt toolkit
cfd81c1 coala_utils: Move ContextManagers from coalib
c5a4526 Add MANIFEST
f019962 Change VERSION
9db2898 Add map between file extension to language name
a52a309 coala_utils: Add Question module

That's a +2633 / -471 change! I honestly didn't know it'd be that big. Anyway, those were the technical stats. On to the showcase!

Stuff I worked on

My primary GSoC proposal: coala-quickstart

coala-quickstart

And here's the coafile that's generated:

Pretty neat stuff, huh? :)

Anyway, that was my whole project in a nutshell. I worked on other stuff too during the coding period. Here are some of the results:

Caching in coala

This is another thing I'm proud of: caching in coala. Remember how you had to lint all your files every time even if you changed just one line? No more. With caching, coala will only collect those files that have changed since the last run. This produces a terrific improvement in speed:

                 Trial 1  Trial 2  Trial 3  Average
Without caching    9.841    9.594    9.516    9.650
With caching       3.374    3.341    3.358    3.358

That's almost a 3x improvement in speed!

Initially, caching was an experimental feature since we didn't want to break stuff! And this can break a lot of stuff. But fortunately, everything went perfectly smoothly and caching was made default.

The coala README page got a complete overhaul. I placed a special emphasis on simplicity and the design; and to be honest, I'm quite happy with the outcome.

Other miscellaneous stuff

I worked on other tiny things during the coding phase:

• #2585: This was a small bugfix (to my annoyance, introduced by me). This also led to a performance improvement.
• #2322: scandir is a new Python 3.5 feature that is faster than the traditional listdir, which is used to get a directory's contents.
• e1b3594: I removed Tagging with this commit. It was unused.
• #11, #14: A generic tool to ask the user a question and return the answer in a formatted manner. This is now used in several packages across coala.

There were other tiny changes, but you can find them in the commit list.

Conclusion

It's really been a blast, right from the start to the finish. Thanks to everyone who has helped me in any way. Thanks to Google for sponsoring such an awesome program. Thanks to the PSF for providing coala with an opportunity at GSoC. I honestly can't see how this would have been possible without any of you.

To everyone else, I really recommend contributing to open-source. It doesn't have to be coala. It doesn't even need to be a big project. Just find a project you like: it can even be a silly project that doesn't do anything useful. The whole point is to get started. GSoC is one way to easily do that. There is such a wide variety of organizations and projects, I'm pretty sure at least one project will be to your liking. And you're always welcome at coala. Just drop by and say hello at our Gitter channel.

mkatsimpris (MyHDL)

Week 11

This week I completed the convertible tests for the frontend part and for the new color converter. Vikram made a PR for the backend part, so in the next days we can integrate it with my part and complete the encoder. However, the backend still lacks complete test coverage against a software prototype. In the days remaining until 15 August, which is the end of the coding period, I will try to finish the

Vikram Raigur (MyHDL)

Huffman Module

I was stuck a bit while implementing the Huffman module initially. I was thinking about building the Huffman tables on the fly, but soon realised that is a very difficult task.

Finally I changed my plan to build the tables using the JPEG Standards.

I added all the huffman encoded values in a csv file and then built rom tables using the csv file.

The Huffman module has a small state machine inside it which produces a unique serial code from the given parallel, Huffman-encoded code.

We generally concatenate the variable-length integer (run-length encoded) and the variable-length code (Huffman encoded) and store them in a FIFO. In some cases, when the input to the run-length encoder does not contain any zeros, it is wasteful to compress it with the Huffman encoder because doing so does not save any space.
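The serial-code step amounts to bit-packing: concatenating (value, bit-length) pairs and emitting fixed-width words. A software sketch of the idea (the real module is MyHDL hardware, so this is only an illustration with invented names):

```python
def pack_codes(codes, word_size=8):
    # Concatenate (value, bit_length) variable-length codes and emit
    # fixed-width words, padding the tail with one-bits.
    # A software sketch of the serializer idea, not the actual MyHDL module.
    bits = ''.join(format(value, '0{}b'.format(length)) for value, length in codes)
    bits += '1' * (-len(bits) % word_size)  # pad the final word
    return [int(bits[i:i + word_size], 2) for i in range(0, len(bits), word_size)]
```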

Finally, the Huffman encoder was merged into the main repo.

Quantizer module

The Quantizer module:

This module uses a divider at its core. The quantizer ROM is built using standard JPEG values.

Right now, the quantizer ROM has fixed values. In the future we plan to make the quantizer ROM programmable by the user.

The Quantizer module has been added to the main repo.

August 06, 2016

Avishkar Gupta (ScrapingHub)

Code Review, Optimizations and Formal Benchmarking

Hi,

Firstly, sorry for the extremely late blog post this time around, however I was waiting on my mentor’s comments before I gave another status report because I wanted his take on where we are progress wise and the status of the pull request. So without further ado, let’s get into it.

The majority of the last two weeks was spent writing unit tests and cleaning the code wherever possible, removing any outdated constructs and pushing code for review. I also formalized the benchmarking suite using the djangobench code as a starting point, as mentioned previously; there were some finishing touches left last time which were completed in this code cycle.

The test coverage of the patch is complete and we have 100% diff coverage, all seems well there.

Having finished with the benchmarking, I started looking into documentation, since the refactor means a re-write of the Signal API documentation is in order, even though we still have full backward compatibility support. Also, now that the code review comments are in, I'll be working on those issues and getting them sorted out at the earliest. We also agreed that the benchmarks would not make sense as part of scrapy bench, since that would require keeping the old dependencies as part of the project, which makes no sense as they are no longer required. The best solution we came up with is to keep the benchmarks elsewhere and just include a link to them in the PR itself, so we have a history of them maintained.

All in all, we’re happy with how the project has turned out, and we’ll probably be seeing the PR merged into the mainline sometime in the future.

I’ll update this post as soon as I can think of more stuff I want to write :)

tushar-rishav (coala)

Python f-strings

Hey there! How are you doing? :)

For the past couple of days I’ve been attending the EuroPython conference at Bilbao, Spain, and it has been an incredible experience so far! There are over a dozen amazing talks with something new to share every day, and the super fun lightning talks at the end of the day. If for some reason you weren’t able to attend the conference, you can see the talks live on the EuroPython YouTube channel.

In this blog I would like to talk briefly about PEP498 - Literal String Interpolation in Python. Python supports multiple ways to format text strings (%-formatting, str.format and Templates). Each of these is useful in some ways, but they all lack in other aspects. For example, even the simplest use of the format style is verbose.

Clearly there is redundancy: place is repeated multiple times. Similarly, %-formatting is limited in the types (int, str, double) it can handle.

f-strings are proposed in PEP498. f-strings are basically literal strings with an ‘f’ or ‘F’ prefix. They embed expressions in braces that are evaluated at runtime. Let’s see some simple examples:
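The examples were rendered as images in the original post; they were along these lines (my own reconstruction):

```python
place = 'world'

# %-formatting and str.format are more verbose:
print('hello %s' % place)
print('hello {place}'.format(place=place))

# A PEP 498 f-string embeds the expression directly:
print(f'hello {place}')

# Arbitrary expressions are evaluated at runtime:
print(f'{place.upper()} has {len(place)} letters')
```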

I think that’s simpler and better than other string formatting options. If this feature interests you and you want to learn more about it then I recommend checking out the PEP498 documentation.

Cheers!

ghoshbishakh (dipy)

Google Summer of Code Progress August 7

Yay! We have dynamically generated gallery and tutorials page now!

Progress so far

The major changes are in the gallery and in the new tutorials page.

Instead of showing the manually entered images from the admin panel, the gallery now fetches all images from all the tutorials in the latest documentation.

This is actually done by scraping the tutorial pages from the JSON docs.

Although the docs are now built in JSON format, the body is still represented as an HTML string. As a result, there was no way around parsing the HTML, and the best HTML parsing library I know of is Beautiful Soup.

And all the extracted images are displayed in the honeycomb gallery.
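For illustration, the same extraction idea can be sketched with Python's stdlib html.parser (the site itself uses Beautiful Soup; the class and function names here are hypothetical):

```python
from html.parser import HTMLParser

class ImageExtractor(HTMLParser):
    # Collect the src of every <img> tag in an HTML fragment.
    # A stdlib sketch of the idea; the site actually uses Beautiful Soup.
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.images.extend(v for k, v in attrs if k == 'src')

def extract_images(html_body):
    parser = ImageExtractor()
    parser.feed(html_body)
    return parser.images
```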

Tutorials Page

Although each version of the documentation has its own list of tutorials, we wanted a dedicated page containing the tutorials with thumbnails and descriptions, grouped into several sections. So, similarly to the gallery page, I parsed the tutorials index page, went into each tutorial, and fetched the thumbnails and descriptions. This list of tutorials is then displayed as an expandable list of groups.

What next?

The github statistics visualizations page is one major task. Another major task is to somehow make the automatically generated gallery and tutorials pages editable, so that we can change the thumbnails or descriptions. Also, the coding period ends in about 2 weeks, so documenting the code and merging all pull requests is a priority.

Dynamic Factor Model

There is one more item in my proposal which I haven't yet mentioned in my reports, although I was working on it before the refactoring phase and the TVP model implementation. This is the Markov switching dynamic factor model; we will use the following specification:
y, as usual, is an observation process, and (1) is the observation equation. f is a factor, evolving according to (2), the factor transition equation, which is a VAR model with a Markov-switching intercept. The observation error is a VAR model too, as (3) states.
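The three equations were rendered as images in the original post; based on the description above, the intended specification is presumably along these lines (my reconstruction, with notation guessed):

```latex
% Reconstruction of the MS-DFM specification described in the text;
% the original equations were images, so this is an informed guess.
\begin{align}
y_t &= H f_t + e_t && \text{(1) observation equation} \\
f_t &= \mu_{S_t} + \sum_{i=1}^{p} \Phi_i f_{t-i} + v_t
    && \text{(2) factor transition with switching intercept } \mu_{S_t} \\
e_t &= \sum_{j=1}^{q} \Psi_j e_{t-j} + w_t
    && \text{(3) VAR observation error}
\end{align}
```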
Statsmodels already has a non-switching DFM realization in the statespace.dynamic_factor.py file, which has almost the same specification but without a Markov-switching intercept term in the factor transition equation, so the first challenge was to extend the DynamicFactor class and add an analogous, but non-switching, intercept. This parameter is required because it is used for switching intercept initialization in Maximum Likelihood Estimation. The regime_switching/switching_dynamic_factor.py and regime_switching/tests/test_switching_dynamic_factor.py files contain my experiments with MS-DFM, which were unsuccessful due to reasons discussed in the next section.

Insurmountable obstacles

Non-switching DynamicFactor class is a big piece of code itself, and since a lot of its functionality is shared with switching model, the only right solution is to extend it by SwitchingDynamicFactor class. The problem is that this class wasn't supposed to be extended, so it was quite tricky until I realised that it's a bad idea. For example, I have to substitute DynamicFactor's KalmanSmoother instance by an ugly descendant of KimSmoother with some interface changes to achieve compatibility with non-switching model. After a series of similar sophisticated manipulations I came up with a thought that it's impossible to construct a SwitchingDynamicFactor class without changes in the parent class. However, in my experience there are not so many changes needed.
Another problem concerns testing data. I use this Gauss code sample from the Kim and Nelson book. This is the only code I know of to test MS-DFM against. The disappointment is that this testing model is incompatible with the one presented above: its observation equation contains lagged factor terms, while ours uses only the current factor. I also tried some tricks, the main one being to group the lagged factors into one vector factor. After several errors and considerations, I figured out that this is a bad idea, because the transition noise covariance matrix becomes singular. The only solution I see now is to extend the DFM and MS-DFM models so that they can handle lagged factors in the observation equation, but this is a time-consuming challenge.

What's next?

The thing I'm working on right now is adding generic forecasting to the Kim filter, which is the last important feature to be added. I spent a couple of days just thinking about how to implement this, but now I'm finally writing the code. Forecasting should be a very visual thing, so I will add it to the existing notebooks, which is also a kind of testing.

Literature

[1] "State-space Models With Regime Switching" by Chang-Jin Kim and Charles R. Nelson.

Aakash Rajpal (italian mars society)

Work Continues!

Hey all, as the final submission deadline nears, I have been working to finalize my project and fix the few bugs that keep showing up and annoying me. The project is about 80% done according to me :p. I am now able to render the HUD dynamically on the Oculus with my own demo model made in the Blender Game Engine. The only work left is to integrate my HUD with the V-ERAS models for Blender and render the final scene on the Oculus DK2. I have been going through the models available in the IMS V-ERAS repository to find one suitable for rendering the HUD on.

My college has also started, so these final days will be very hectic. Hope for the best!

Nelson Liu (scikit-learn)

(GSoC Week 10) scikit-learn PR #6954: Adding pre-pruning to decision trees

The scikit-learn pull request I opened to add impurity-based pre-pruning to DecisionTrees and the classes that use them (e.g. the RandomForest, ExtraTrees, and GradientBoosting ensemble regressors and classifiers) was merged a week ago, so I figure that this would be an appropriate place to talk about what this actually does and provide an example of it in action.

Decision Tree Node Impurity - A Recap

Note: if you're familiar with what the "impurity" of a node in a decision tree is, feel free to skip this

In decision tree-based classification and regression methods, the goal is to iteratively split to minimize the "impurity" of the partitioned dataset (see my week 2 blog post for more details about this). The definition of node impurity varies based on the method used to calculate it, but in rough terms it measures how "pure" a leaf node is. If a leaf node contains samples that all belong to one class (for classification) or have the same real-valued output (for regression), it is "pure" and thus has an impurity of 0.
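As a concrete illustration, the Gini criterion (one of the impurity measures scikit-learn supports) computes 1 minus the sum of squared class proportions; here is a small, simplified sketch, not scikit-learn's actual implementation:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions.

    Returns 0.0 for a pure node (all samples share one class) and
    approaches 1.0 as the class mix becomes more even and varied.
    """
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

print(gini_impurity(['a', 'a', 'a', 'a']))  # 0.0 -> pure node
print(gini_impurity(['a', 'a', 'b', 'b']))  # 0.5 -> maximally mixed two-class node
```

A node whose labels are evenly split between two classes has Gini impurity 0.5, while a pure node scores 0, matching the definition above.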

The ultimate goal of decision tree-based models is to split the tree such that each leaf node corresponds to the prediction of a single class, even if there is only one sample in that class. However, this can lead to the tree radically overfitting the data; it will grow in a manner such that it will create a leaf node for every sample if necessary. Overfitting is when the decision tree continues to grow and reduce the training set error, but at the expense of the test set error. In other words, it can basically memorize the samples of the training set, and may lose the ability to generalize well to new datasets. One method for avoiding overfitting in decision trees is pre-pruning.

In pre-pruning, you stop the decision tree growth before it perfectly fits the training data; this is because (as outlined in the previous paragraph) fitting the training data perfectly often leads to overfitting.

In the scikit-learn tree module, there are a variety of checks performed during tree growth to decide whether a node should be split or whether it should be declared a leaf node.

• Is the current depth of the tree greater than the user-set max_depth parameter? max_depth is the maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. (depth >= max_depth)

• Or is the number of samples in this current node less than the value of min_samples_split? min_samples_split is the minimum number of samples required to split an internal node. (n_node_samples < min_samples_split)

• Or is the number of samples in this current node less than 2 * min_samples_leaf? min_samples_leaf is the minimum number of samples required to be in a leaf node. (n_node_samples < 2 * min_samples_leaf)

• Or is the total weight of all of the samples in the node less than min_weight_leaf? min_weight_leaf defines the minimum weight required to be in a leaf node. (weighted_n_node_samples < min_weight_leaf)

• Or lastly, is the impurity of the node equal to 0?
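Taken together, these checks can be sketched as a single predicate. This is a simplified Python illustration whose parameter names mirror the list above, not scikit-learn's actual Cython implementation:

```python
def should_split(depth, n_node_samples, weighted_n_node_samples, impurity,
                 max_depth, min_samples_split, min_samples_leaf,
                 min_weight_leaf):
    """Return True if the node may be split further, False if it must
    become a leaf, mirroring the stopping criteria listed above."""
    if max_depth is not None and depth >= max_depth:
        return False  # tree is already at its maximum depth
    if n_node_samples < min_samples_split:
        return False  # too few samples to split an internal node
    if n_node_samples < 2 * min_samples_leaf:
        return False  # a split could not give both children enough samples
    if weighted_n_node_samples < min_weight_leaf:
        return False  # total sample weight in the node is too small
    if impurity == 0.0:
        return False  # node is already pure
    return True
```

A node is declared a leaf as soon as any one of these conditions fires; otherwise the tree builder keeps splitting.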

By changing the value of each of these constructor parameters, it's possible to achieve a pseudo-prepruning effect. For example, setting the value of min_samples_leaf can require that each leaf has more than one sample, thus ensuring that the tree cannot perfectly (over)fit the training dataset by creating a bunch of small branches exclusively for one sample each. In reality, what this is actually doing is simply telling the tree that each leaf doesn't HAVE to have an impurity of 0.

Enter min_impurity_split

My contribution was to create a new constructor parameter, min_impurity_split; this value defines the minimum impurity that the node must have in order to not be a leaf. For example, if the user-defined value of min_impurity_split is 0.5, a node with an impurity of 0.7 would be further split on but nodes with impurities of 0.5 and 0.2 would be declared leaves and receive no further splitting. In this manner, it's possible to control the grain to which the decision tree fits the data, allowing for coarser fits if desired.

This is great and all, but does it improve my estimator's performance?

min_impurity_split helps to control over-fitting and as a result can improve your estimator's performance if it is overfitting on the dataset. I ran a few benchmarks in which I plotted the number of nodes in trees fit with a variety of parameters, and their performance on a held-out test set on the Boston housing prices regression task. The code to generate these plots can be found here.

Note: in the chart on the bottom, the y-axis represents the Mean Squared Error (MSE) -- thus, lower error is better.

These graphs show some interesting things that I want to point out. If you look at the first chart, you'll see that the tree grown with default parameters is massive, with over 700 nodes. Large trees like this overfit quite easily, since they suggest that the tree may be creating leaf nodes for individual samples, which is generally detrimental to generalization. On the other hand, the tree grown with min_impurity_split = 3 is much more modest, with ~150 nodes. If you examine the graph, you'll see that as the value of min_impurity_split decreases, the number of nodes increases, which makes sense.

Looking at the second chart, you will see that setting various values of min_impurity_split and the other parameters generally serves to improve the performance of the estimator by reducing MSE. These parameters all limit tree growth in some way. There are some notable exceptions though --- for example, note that the tree grown with max_depth = 3 has a relatively high MSE. This is because, referencing the first chart, it is tiny! As a result, this tree may have been underfit. Thus, it's important to maintain a balance in tree size. Trees with too many nodes overfit easily, but you don't want to prune so much that you render the tree incapable of fitting at all! Let's take a look at some plots to see how tuning the value of min_impurity_split affects tree size, training set accuracy, and test set accuracy.

These images confirm our previous intuitions. As min_impurity_split increases, it's easy to see that the number of nodes in the tree decreases. Similarly, the training set mean squared error increases, since the tree is no longer memorizing samples. The test set error is more unpredictable, however; as a result, you should always try different values of min_impurity_split to find the one that works best for your task.

Per usual, thanks goes out to my mentors Raghav RV and Jacob Schreiber for taking a look at the PR and reviewing my code.

You're awesome for reading this! Feel free to follow me on GitHub if you want to track the progress of my Summer of Code project, or subscribe to blog updates via email.

August 05, 2016

tsirif (Theano)

Multi-GPU/Node interface in Platoon

Over the last few weeks I was working on a new interface in Platoon which supports collective operations on Theano's GPU shared variables across multiple GPUs on multiple hosts. This will enable Platoon to train your Theano models using multiple GPUs even if they do not reside in the same host.

Usage

In order to use it, a worker file needs to be provided. A worker file defines the training process of a single set of model parameters in a parallel and distributed manner. Optionally, in case you want to extend the distributed computation capabilities of the training process, you are encouraged to provide a controller file which extends the default one (the platoon.channel.controller module) in this framework. The user must invoke the platoon-launcher script in order to start training with the new interface.

Platoon is configured through the command-line arguments of this launcher and, in their absence (or if needed), through environment variables or Platoon configuration files. Please read platoonrc.conf in the package's root directory to learn about every way Platoon can be configured.

If single-node mode is explicitly specified through command-line arguments, the specified devices will be used in the GPU communicator world in the order they are parsed. The same applies to lists of devices found in Platoon environment variables or configuration files.

e.g. usage:

• platoon-launcher lstm -D cuda0 cuda3 (explicit config)
• platoon-launcher lstm (config with envs/files - may be multi-node)

If multi-node mode is explicitly specified through command-line arguments, extra configuration through the appropriate per-host environment variables or files is needed to describe which devices will be used on each host. Host names are given the same way as in MPI's mpiexec.

e.g. usage:

• platoon-launcher lstm -H lisa0 lisa1 (gpus on lisa0 and gpus on lisa1)

Please note that this launcher sets up the new worker interface (the old one is still usable, but not in multi-node configurations). The new worker interface currently supports only CUDA devices. NVIDIA's NCCL collectives library and pygpu are required for multi-GPU, while mpi4py is required in addition for multi-node.

API description and how it works

I will now describe how the new API works and its usage in training code.

Platoon uses a controller/worker architecture to organize multiple hosts, each owning multiple GPUs. A controller process is spawned on each host; it is responsible for organizing its worker processes and communicating with the controller processes on other hosts. In addition, each host runs as many worker processes as there are devices participating in the computation, with each worker process responsible for a single computation device. By this I mean that a worker process contains Theano code which acts on a single device and uses a Worker instance to exploit multi-GPU/node computation.

By default, someone who wishes to write code for training a model with Platoon must write the code which will run in the worker processes. Theano functions are created as usual and executed on a single Theano device. This device is configured for each worker process by the THEANO_FLAGS="device=<...>" environment variable, which is set by the launching procedure. Alongside the single-GPU computation there are multi-GPU/node computations, triggered by calls to Platoon's interface. While developing training code, the user must have access to the corresponding pygpu.gpuarray.GpuArray(s) which will be used as arguments to Platoon's new interface.

import os
from platoon.worker import Worker
from pygpu.gpuarray import asarray
import numpy as np

# Instantiate a worker
worker = Worker(control_port=5567)
# Get GPU context from worker
gpuctx = worker.gpuctx
# How many workers are there across all hosts
total_nw = int(os.environ['PLATOON_TEST_WORKERS_NUM'])

# Create Theano shared variables for input and output
inp = np.arange(32, dtype='float64')
sinp = asarray(inp, context=gpuctx)
out = np.empty_like(inp)
sout = asarray(out, context=gpuctx)

# Call to interface
worker.all_reduce(sinp, '+', sout)

expected = total_nw * inp
actual = np.asarray(sout)
assert np.allclose(expected, actual)


Minimal example code for a worker process

When a call to worker.all_reduce is made, the internal pygpu.gpuarray.GpuArrays are fetched and used as arguments to the corresponding AllReduce collective operation in a local pygpu GPU communicator world. This GPU comm world is local in the sense that it is composed only of a single host's GPUs, in order to effectively utilize NVIDIA's optimized NCCL framework. So we expect concurrent NCCL operations on each host. When the pygpu collective has finished and we have a multi-node training procedure, a single worker out of the workers on each host copies the result from its GPU to a memory buffer on the host. This memory buffer is shared (by means of POSIX IPC) among all worker processes on a host and their controller process. That worker then requests its controller to execute the corresponding MPI collective operation with the other controller processes in an inter-node MPI communicator world. The result of this operation is received in the same shared buffer. When the MPI operation has finished, all workers concurrently write the result back from the shared buffer to the destination GpuArray on their GPUs.

# Execute collective operation in local NCCL communicator world
res = self._local_comm.all_reduce(src, op, dest)

if dest is not None:
    res = dest
res.sync()

if self._multinode:
    # Create new shared buffer which corresponds to result GpuArray buffer
    res_array = self.shared(res)

    self.lock()
    first = self.send_req("platoon-am_i_first")
    if first:
        # Copy from GpuArray to shared memory buffer
        res.sync()
        # Request from controller to perform the same collective operation
        # in MPI communicator world using shared memory buffer
        self.send_req("platoon-all_reduce",
                      info={'shmem': self._shmem_names[res.size * res.itemsize],
                            'dtype': str(res.dtype),
                            'op': op})
    self.unlock()

    # Concurrently copy from shared memory back to result GpuArray
    # after Controller has finished global collective operation
    res.write(res_array)
    res.sync()

if dest is None:
    return res


Simplified code from Worker class demonstrating program flow

Right now, I am thoroughly testing this new interface. I am interested to see how the system behaves if an unexpected error occurs; I expect processes to shut down as cleanly as possible. As a next step, I would like to include modules in Platoon which allow creating training and validation procedures with ease through ready-to-use, configurable classes of training parts. This way Platoon will also provide a high-level gallery of reusable training algorithms for multi-GPU/node systems.

Till then, keep on coding
Tsirif, 08-06-2016

My EuroPython Experience

What a blast! I had a lot of fun at EuroPython, and it wasn't just the conference.

To start off, it was exciting to meet the guys: Lasse, Max, Tushar, Udayan, Adrian, Alex and Justus. Previously my only interaction with them had been through Gitter. We had a lot of fun (more on that later): every day after the conference, we would all go over to the Airbnb and do our own sprints, which I enjoyed from start to finish.

And the conference itself was one of the best experiences I've ever had: I learned so much about Python: iterables, meta-classes, performance optimizations, parallel computing and much, much more.

But my favorites were the Lightning Talks. A Lightning Talk session is an hour-long event where several speakers get five minutes on stage to talk about virtually anything they want. Lasse got two opportunities on stage and Max gave a talk as well. And then on the last day, the whole team got to present a video, which we made the night before. It is one of the most hilarious things I've seen :D

I also had the opportunity to co-conduct a 3-hour workshop on making a contribution to open source with Tushar. It was an interesting experience and I never fully understood the amount of effort that needs to go into a talk/workshop till then.

And on the last two days, we had sprints. I just juggled with several small issues and PRs (and of course, my GSoC project). It was different talking in person with everybody instead of Gitter (although we did use Gitter when the person was over 3 feet away :P). And we got a lot of stuff done (and I got a bar of chocolate from Lasse!).

Anyway, that was my EuroPython experience. I went on a tour to France, Belgium and Poland after that for a week: Europe is truly beautiful. Hope I can make it next year as well :)

:wq

Prayash Mohapatra (Tryton)

Nearing Completion

Well yes, I recently made my (hopefully) final commit to the repo. With everything going as per plan, I am really happy to strike out the items on my Workflowy to-do list.

The last two weeks' work enables one to navigate and select predefined exports with mouse and key presses. CSV is being properly generated for the selected records. I also submitted a PR adding custom quote-character support to PapaParse's unparse method.

I removed the open/save option from the export dialog, as the generated export file is now sent to the browser to handle.

liscju (Mercurial)

Coding Period IX - X Week

In the last two weeks I have done some minor style fixes, but also added support for connecting to the redirection server through HTTPS. It was not hard, because opening a connection via Mercurial's urlopener already supports the HTTPS protocol. I also added HTTPS support in the example server using ssl.wrap_socket on the BaseHTTPServer.HTTPServer sockets. There is one problem with this solution: clients connect to the redirection server by themselves, so they need the redirection server's certificate in order to trust it, or they must trust the redirection server by default. Probably the best solution would be for the main repository server to propagate authorization information to clients, but this needs to be carefully designed because the solution supports a few protocols.

The second thing I did was to add a command for converting a largefiles repository into one that uses the redirection feature. This was not hard either, because Mercurial already has a module for converting repositories revision by revision; my goal was to detect which files in each revision are large files, send them to the redirection server, and make sure they are not put in the local store or cache.

Breaking Lines

As the GSoC period comes to an end I have only two major tasks left: one is making a LineBreakBear, which suggests line breaks when the user has lines of code longer than the max_line_length setting, and the other is adding indents based on keywords in the IndentationBear.

This week I was able to devise a simple algorithm to suggest line breaks. If you've followed my blogs you'd know that there's something I like to call an encapsulator :P, a fancy name for different (but not all) types of brackets. The algorithm is as follows:

1. Get all occurrences of lines which exceed max_line_length.
2. Check whether these lines have an encapsulator which starts before the limit.
3. Find the last encapsulator opened in this line before the limit.
4. Suggest a line break at that point, with the new line indented in accordance with the indentation_width setting.

Now this algorithm is really simple and does not consider border cases such as hanging-indents.
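The four steps above can be sketched in plain Python. This is a toy illustration of the idea, not the actual LineBreakBear code (which works on coala's file handling); the bracket matching here is deliberately naive and ignores strings and comments:

```python
def suggest_line_break(line, max_line_length, indentation_width=4):
    """Suggest where to break an overlong line.

    Returns (column, new_indent) for a break right after the last
    "encapsulator" (bracket) opened before the length limit, or None
    if the line fits or no bracket is still open before the limit.
    """
    # Step 1: only lines exceeding the limit need a break
    if len(line) <= max_line_length:
        return None
    openers = '([{'
    closers = ')]}'
    stack = []
    # Steps 2-3: track brackets opened (and not yet closed) before the limit
    for col, char in enumerate(line[:max_line_length]):
        if char in openers:
            stack.append(col)
        elif char in closers and stack:
            stack.pop()
    if not stack:
        return None
    last_open = stack[-1]
    # Step 4: break right after the bracket; indent the continuation line
    current_indent = len(line) - len(line.lstrip())
    return last_open + 1, current_indent + indentation_width

print(suggest_line_break("result = some_function(argument_one, argument_two)", 30))
# -> (23, 4): break after the "(" and indent the next line by 4 spaces
```

Hanging indents and other corner cases, as noted below, are exactly what this naive version does not handle.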

Hopefully by the next blog post I'll have completed my project. I'll have lots to share about my experience this year!

aleks_ (Statsmodels)

if remaining_time < 20 * seconds_per_day: get_coding()

2 and a half weeks left - and quite a few tasks too. However, I am confident that I will do a good job in this remaining time, also because last week showed that some ticks on the to-do list may be achieved quickly. I hope that I can tick off the next item (Granger causality) this week. The item following it (impulse-response analysis) will be more straightforward, as it will mainly be about reusing existing code from the VAR framework.
With that, let's get coding! : )

Utkarsh (pgmpy)

Introduction to Markov Chains

Markov Chains are an integral component of Markov Chain Monte Carlo (MCMC) techniques. In MCMC, a Markov Chain is used to sample from some target distribution. This post tries to develop basic intuition about what a Markov Chain is and how we can use it to sample from a distribution.

In layman's terms we can define a Markov Chain as a collection of random variables having the property that, given the present, the future is conditionally independent of the past. This may not make sense to you right now, but it will be the core of the discussion when we discuss MCMC algorithms.

Let us now take a formal (mathematical) look at the definition of a Markov Chain and some of its properties. A Markov Chain is a stochastic process that undergoes transitions from one state to another on a given set of states, called the state space of the Markov Chain.

I used the term stochastic process, which is a random process that evolves with time. We can perceive it as the probabilistic counterpart of a deterministic process: instead of evolving in a single predetermined way, the process has multiple directions in which it can evolve, i.e. some kind of indeterminacy in its future. One example of a stochastic process is Brownian motion.

A Markov Chain is characterised by the following three elements:

• A state space S, which is the set of values (states) the chain is allowed to take.

• A transition model T(x, x'), which specifies for each pair of states (x, x') the probability of going from x to x'.

• An initial state distribution P(X_0), which defines the probability of being in any one of the possible states at the initial iteration t = 0.

We can define the distribution over subsequent time steps t = 1, 2, 3, … using the chain dynamics as

P(X_{t+1} = x') = Σ_x P(X_t = x) T(x, x')

I earlier described a property of Markov chains, which was

Given the present, the future is conditionally independent of the past

This property is called the Markov Property or memoryless property of a Markov chain, and is mathematically described as:

P(X_{t+1} | X_t, X_{t-1}, …, X_0) = P(X_{t+1} | X_t)

There are other two properties of interest which we can usually find in most of the real life application of Markov Chains:

• Stationarity: Let a sequence of random elements (X_1, X_2, …) of some set be a stochastic process; the process is stationary if for every positive integer k the distribution of the k-tuple (X_{n+1}, …, X_{n+k}) does not depend on n. Thus a Markov Chain is stationary if it is a stationary stochastic process. Stationarity in Markov Chains implies stationary transition probabilities, which in turn gives rise to an equilibrium distribution. Not all Markov Chains have an equilibrium distribution, but all Markov Chains used in MCMC do.

• Reversibility: A Markov Chain is reversible if the probability of a transition x → x' equals the probability of the reverse transition x' → x when weighted by the stationary distribution π, i.e. π(x) T(x, x') = π(x') T(x', x) (the detailed balance condition). Reversibility of a Markov Chain implies stationarity.
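As a quick numerical illustration of detailed balance, consider a hypothetical 2-state chain (this example is mine, not from the references). We compute its stationary distribution as the left eigenvector of the transition matrix for eigenvalue 1 and check π(x) T(x, y) = π(y) T(y, x):

```python
import numpy as np

# A hypothetical 2-state transition matrix (rows sum to 1)
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution: left eigenvector of T for eigenvalue 1
eigvals, eigvecs = np.linalg.eig(T.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()  # normalise so the probabilities sum to 1

# Detailed balance: pi(x) * T(x, y) == pi(y) * T(y, x) for all pairs
D = pi[:, None] * T
reversible = np.allclose(D, D.T)
print(pi)          # [0.666..., 0.333...]
print(reversible)  # True -> this chain is reversible
```

Every two-state chain with a positive stationary distribution satisfies detailed balance; larger chains generally do not, which is why reversibility is a special property.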

Finite State Space Markov Chain

If the state space of a Markov Chain takes on a finite number of distinct values, the transition operator can be defined using a square matrix T.

The entry T[i][j] represents the transition probability of moving from state i to state j.

Let's first use an example Markov chain to understand these terms. I'll use a Markov chain to simulate the Gambler's Ruin problem. In this problem, suppose that there are two players P1 and P2 playing poker. Initially both of them have $2 with them. In each round the winner gets a dollar and the loser loses one, and the game continues until one of them loses all his money. Consider that the probability of P1 winning a round is 0.49. Our task is to estimate the probability of P1 winning the complete game. Here is how our Markov chain will look:

The state space of the Markov Chain is {0, 1, 2, 3, 4} (the amount of money P1 has). As the state space is finite, we can write the transition model in the form of a matrix:

transition = [[1, 0, 0, 0, 0],
[0.51, 0, 0.49, 0, 0],
[0, 0.51, 0, 0.49, 0],
[0, 0, 0.51, 0, 0.49],
[0, 0, 0, 0, 1]]


The initial money with P1 is $2, so we can take the start state as the vector start = [0, 0, 1, 0, 0]. Now with this characterisation we will simulate our Markov Chain and try to reach the stationary distribution, which will give us the probability of winning.

import numpy as np
import matplotlib.pyplot as plt
iterations = 30  # Simulate chain for 30 iterations
initial_state = np.array([[0, 0, 1, 0, 0]])
transition_model = np.array([[1, 0, 0, 0, 0],
                             [0.51, 0, 0.49, 0, 0],
                             [0, 0.51, 0, 0.49, 0],
                             [0, 0, 0.51, 0, 0.49],
                             [0, 0, 0, 0, 1]])
transitions = np.zeros((iterations, 5))
transitions[0] = initial_state
for i in range(1, iterations):
    transitions[i] = np.dot(transitions[i-1], transition_model)
labels = [0, 0, 0, 0, 0, 0]
plt.figure()
plt.hold(True)
plt.plot(transitions)
labels[0], = plt.plot(range(iterations), transitions[:,0], color='r')
labels[1], = plt.plot(range(iterations), transitions[:,1], color='b')
labels[2], = plt.plot(range(iterations), transitions[:,2], color='g')
labels[3], = plt.plot(range(iterations), transitions[:,3], color='m')
labels[4], = plt.plot(range(iterations), transitions[:,4], color='c')
labels[5], = plt.plot([20, 20], [0, 1.2], color='k', linestyle='dashed')
plt.legend(labels, ['money=0','money=1','money=2','money=3', 'money=4', 'burn-in'])
plt.hold(False)
#plt.show()
print("Probability of winning the complete game for P1 is", transitions[iterations - 1][4])


The output of the above code sample is: Probability of winning the complete game for P1 is 0.479978863078, which is a good approximation of the exact result 0.48 (see the link for the calculation of the exact result). In the trace plot of the Markov chain one can see that there were fluctuations at the start, but after some time the chain reached an equilibrium/stationary distribution, as the probabilities change very little in subsequent iterations. Mathematically, a distribution π is a stationary distribution if it satisfies the following property:

π(x') = Σ_x π(x) T(x, x'), or in matrix form, π = π T

Using the above property we can see that our chain has approximately reached stationary distribution as following condition returns True.

np.allclose(transitions[-1], np.dot(transitions[-1], transition_model), atol=1e-04)


The initial period of about 20 iterations (here) is called the burn-in period of the Markov Chain (see the dotted line in the plot) and is defined as the number of iterations it takes the chain to move from the initial conditions to the stationary distribution. I find burn-in to be a misleading term, so I'll call it the warm-up period. The burn-in term was used by early authors of MCMC, who were from a physics background, and has been used ever since :/ .

One interesting thing about stationary Markov chains is that it is not necessary to iterate sequentially to predict a future state. One can predict a future state by raising the transition operator to the N-th power, where N is the iteration at which we want to predict, and then multiplying it by the initial distribution. For example, to predict the probabilities after 24 iterations we could simply compute start · T^24.
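In code, the matrix-power shortcut for the gambler's-ruin chain above can be checked against plain sequential iteration (a small sketch using numpy's matrix_power):

```python
import numpy as np

# Gambler's-ruin transition matrix from the example above
T = np.array([[1, 0, 0, 0, 0],
              [0.51, 0, 0.49, 0, 0],
              [0, 0.51, 0, 0.49, 0],
              [0, 0, 0.51, 0, 0.49],
              [0, 0, 0, 0, 1]])
start = np.array([0.0, 0.0, 1.0, 0.0, 0.0])

# Sequential iteration: apply the transition model 24 times
state = start.copy()
for _ in range(24):
    state = state.dot(T)

# Direct prediction: raise T to the 24th power and multiply once
state_direct = start.dot(np.linalg.matrix_power(T, 24))

print(np.allclose(state, state_direct))  # True -> both approaches agree
```

Both routes give the same distribution over states at iteration 24; the matrix power simply bundles the 24 multiplications into one.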

Let's look at a more interesting application of a stationary Markov chain. Here we will create our own naive page-ranking algorithm using a Markov Chain. For computing the transition probability from page x to page y (for all pairs x, y) we use a configuration parameter alpha and two factors which depend on the number of pages that page x links to and on whether page x has a link to page y. Here is Python code for the same:

import matplotlib.pyplot as plt
import numpy as np

alpha = 0.77  # Configuration parameter
iterations = 20
num_world_wide_web_pages = 4.0
# Consider a world wide web that has 4 web pages only
# Mapping between a page and the number of its outgoing links
links_to_page = {0: 3, 1: 3, 2: 1, 3: 2}
# Page 1 is linked to every other page but not to itself
# Page 2 is linked to every other page but not to itself
# Page 3 is only linked to page 4
# Page 4 is linked to 1 and 3 and is not linked to 2 and itself
page_links = {0: [1, 2, 3], 1: [0, 2, 3], 2: [3], 3: [0, 2]}

# Returns transition probability of x -> y
def transition_prob(x, y):
    constant_val = (1.0 - alpha) / num_world_wide_web_pages
    if y in page_links[x]:
        return alpha / links_to_page[x] + constant_val
    else:
        return constant_val

transition_probs = np.zeros((4, 4))
for x in range(4):
    for y in range(4):
        transition_probs[x][y] = transition_prob(x, y)

transitions = np.zeros((iterations, 4))
transitions[0] = np.array([1, 0, 0, 0])  # Starting markov chain from page 1, initial distribution

for i in range(1, iterations):
    transitions[i] = np.dot(transitions[i-1], transition_probs)

labels = [0, 0, 0, 0, 0]
plt.figure()
plt.hold(True)
labels[0], = plt.plot(range(iterations), transitions[:,0], color='b')
labels[1], = plt.plot(range(iterations), transitions[:,1], color='r')
labels[2], = plt.plot(range(iterations), transitions[:,2], color='g')
labels[3], = plt.plot(range(iterations), transitions[:,3], color='k')
labels[4], = plt.plot([10, 10], [0, 1], color='y', linestyle='dashed')
plt.legend(labels, ['page 1', 'page 2', 'page 3', 'page 4', 'burn-in'])
plt.hold(False)
plt.show()


Our algorithm will rank pages in order Page 4, Page 3, Page 1, Page 2 :o .

Continuous State-Space Markov Chains

A Markov chain can also have a continuous state space that lives in the real numbers. In this case we cannot represent the transition operator as a matrix; instead we represent it as a continuous function on the real numbers. Like finite state-space Markov chains, continuous state-space Markov chains also have a warm-up period and a stationary distribution, but here the stationary distribution is also over a continuous set of variables.

Let's look at an example of how to use a continuous state-space Markov chain to sample from a continuous distribution. Here our transition operator will be a normal distribution with mean equal to half the distance between zero and the previous state, and unit variance. We will throw away a certain number of states generated at the start, as they fall in the warm-up period; the subsequent states that our chain reaches in the stationary distribution will be our samples. We can also run multiple chains simultaneously to draw samples more densely.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(71717)
warm_up = 100
n_chains = 3

transition_function = lambda x, n_chains: np.random.normal(0.5*x, 1, n_chains)
n_iterations = 1000
x = np.zeros((n_iterations, n_chains))
x[0] = np.random.randn(n_chains)

for it in range(1, n_iterations):
    x[it] = transition_function(x[it-1], n_chains)

plt.figure()
plt.subplot(222)
plt.plot(x[0:200])
plt.hold(True)
minn = min(x.flatten())
maxx = max(x.flatten())
l = plt.plot([warm_up, warm_up],[minn, maxx], color='k', lw=3)
plt.legend(l, ['Warm-up'])
plt.title('Trace plot of first 200 samples')
plt.hold(False)
plt.subplot(224)
plt.plot(x)
plt.hold(True)
l = plt.plot([warm_up, warm_up],[minn, maxx], color='k', lw=3)
plt.legend(l, ['Warm-up'], loc='lower right')
plt.title("Trace plot of entire chain")
plt.hold(False)
samples = x[warm_up+1:,:].flatten()
plt.subplot(121)
plt.hist(samples, 100)
plt.legend(["Markov chain samples"])
mu = round(np.mean(samples), 2)
var = round(np.var(samples), 2)
plt.title("mean={}, variance={}".format(mu, var))
plt.show()


Ending Note

In the above examples we deduced the stationary distribution based on observation and gut feeling :P . However, in order to use Markov chains to sample from a specific target distribution, we have to design the transition operator such that the resulting chain reaches a stationary distribution that matches the target distribution. This is where MCMC methods come to the rescue.

jbm950 (PyDy)

GSoC Week 12

This week my main work was on Featherstone's articulated body algorithm. I started by prototyping what I thought the algorithm might look like in Python code (the algorithm was pulled from chapter 7 of his book). With the passes prototyped, it was apparent that I would need a full description of the kinematic tree, so I prototyped building the kinematic tree from a single "base" body. I then went on to see what it would look like if the kinematic tree were built during the first pass of the articulated body algorithm, and decided that keeping the two separate would result in cleaner code.

With the three passes prototyped and the kinematic tree built, I started digging into Featherstone's book to better determine the definition of each of the variables in the algorithm. While doing this I ended up reading a second source where Featherstone describes the articulated body algorithm, and it was helpful in furthering my understanding as it is a condensed summary. I then compared the written version of the algorithm in his book and this article with the two MATLAB versions he has posted online and the Python version he provides a link to online. This helped me see that some terms he includes in his book he doesn't include in his code. It also helped me see what code for the algorithm might look like.

After working on the mock-up of the passes and trying to better understand them, I switched focus to the joint code that needs to be finished so that it can be used in my implementation of the articulated body algorithm. This has led to some confusion about the design decisions that were made in the past when putting together the joint code, and this is the current stage I am sitting at as I await feedback on some of my questions.

This week I also looked over a couple of documentation PR’s. One was a simple matter of fixing some indentation and seems mostly ready to merge, but the second turned some docstrings into raw strings so they could add LaTeX math code. I don’t know what the general stance is on the latter, but I’m of the opinion that docstrings should be human readable, since people may actually look through the code for them or hope that help(function) provides something useful. In this case the LaTeX math code is cluttered and would be better off in .rst files, where people are only going to be reading the rendered version. On that PR I am awaiting a response from someone from SymPy to see if this is indeed preferred.

Future Directions

Hopefully I’ll receive some feedback about the joints and Featherstone’s method so I can keep moving forward with these. In the meantime there are a few other bits of code the algorithm uses that are not directly related to my questions and that I will need to complete. If I finish these tasks before receiving feedback, I will move forward with changing the joint code as I think would be best.

PR’s and Issues

• (Open) [WIP] Added system.py to physics/mechanics PR #11431
• (Open) Intendation fixes – sympy/concrete/summations.py PR #11473
• (Open) Adjustments to Legendre, Jacobi symbols docstrings PR #11474
• (Open) [WIP] FeatherstonesMethod PR #11415

August 04, 2016

Raffael_T (PyPy)

Only bug fixes left!

All changes of CPython have been implemented in PyPy, so all that's left to do now is fixing some bugs. Some minor changes had to be made, because not everything in CPython has to be implemented in PyPy. For example, in CPython slots are used to check the existence of a function and then call it. The type of an object holds the information about valid functions, stored as elements inside structs. Here's an example of how the __await__ function gets called in CPython:
ot = Py_TYPE(o);
if (ot->tp_as_async != NULL) {
    getter = ot->tp_as_async->am_await;
}
if (getter != NULL) {
    PyObject *res = (*getter)(o);
    if (res != NULL) { … }
    return res;
}
PyErr_Format(PyExc_TypeError,
             "object %.100s can't be used in 'await' expression",
             ot->tp_name);
return NULL;

This 'getter' points to the am_await slot in typeobject.c. There, a lookup is done with '__await__' as parameter. If it exists, it gets called; an error is raised otherwise.
In PyPy all of this is much simpler. Practically, I just replace the getter with the lookup for __await__. All I want to do is call the method __await__ if it exists, so that's all there is to it. My code now looks like this:
w_await = space.lookup(self, "__await__")
if w_await is None: …
res = space.get_and_call_function(w_await, self)
if res is not None: …
return res
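The pattern both snippets implement can be mimicked in plain Python with a type-level attribute lookup. This is an analogy for illustration (the class and function names are invented), not PyPy internals:

```python
class Awaitable:
    def __await__(self):
        return iter(())

def get_awaiter(obj):
    # Look up __await__ on the *type* of the object, much like CPython's
    # am_await slot and PyPy's space.lookup do, then call it if present.
    getter = getattr(type(obj), '__await__', None)
    if getter is None:
        raise TypeError("object %.100s can't be used in 'await' expression"
                        % type(obj).__name__)
    return getter(obj)

awaiter = get_awaiter(Awaitable())   # works
# get_awaiter(42) would raise TypeError, mirroring the error path above
```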

I also fixed the _set_sentinel problem I wrote about in the last post. All dependency problems of other (not yet covered) Python features have been fixed. I can already execute simple programs, but as soon as it gets a little more complex and uses certain asyncio features, I get an error about the GIL (global interpreter lock):
“Fatal RPython error: a thread is trying to wait for the GIL, but the GIL was not initialized”
First I have to read some descriptions of the GIL, because I am not sure where this problem could come up in my case.
There are also many minor bugs at the moment. I already fixed one of the bigger ones, which didn't allow async or await to be used as variable names. I also just learned that something I implemented does not work with RPython, which I wasn't aware of. My mentor is helping me out with that.
I also have to write more tests, because they are the safest and fastest way to check for errors. There are a few things I didn't test enough, so I need to catch up on writing tests a bit.

Things are not going forward as fast as I would love them to, because I often come across completely new things which I need to study first (like the GIL in this case, or the memoryview objects from the last blog entry). But there really shouldn't be much left to do now until everything works, so I am pretty optimistic about the time I have left. If I strive to complete this task soon, I am positive my proposal will be successful.

Ranveer Aggarwal (dipy)

Working on a 2D Panel

Last week, we built a basic 3D orbital menu with most of the existing 2D elements successfully ported to 3D. Sliding in 3D is still a bit of a problem, and we are exploring ways to do it. For the time being 3D sliding has been pushed back to later.

The next component of the project is building a panel. A panel is a collection of (for now) 2D elements. It is analogous to a window.

Implementation

We used a 2D rectangle to give the panel a background. The panel has an add_element function that takes in a 2D UI element and its relative position as arguments. If the size of the panel is 200x200 pixels and if I specify the relative position of the 2D UI element as (0.4, 0.4), then the position of the element inside the panel (with the panel’s lower left corner as the origin) will be (0.4*200, 0.4*200). Applying appropriate transformations will get the element to the position where we want it to be.
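The relative-positioning rule described above can be sketched as a tiny helper. The function and argument names here are hypothetical, not dipy's actual panel API:

```python
# Sketch: map a relative (0-1, 0-1) position inside a panel to absolute
# pixel coordinates, taking the panel's lower-left corner as the origin.
def element_position(panel_origin, panel_size, relative_position):
    px, py = panel_origin
    w, h = panel_size
    rx, ry = relative_position
    return (px + rx * w, py + ry * h)

# A 200x200 panel whose lower-left corner sits at (50, 50):
pos = element_position((50, 50), (200, 200), (0.4, 0.4))
# → (130.0, 130.0): the element lands 0.4*200 = 80 pixels from each panel edge
```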

The Result

Here’s how the panel currently looks:

The 2D Panel

What Next

Next, certain issues with the panel are to be fixed and certain enhancements are to be made; for example, we want to click and drag the panel around. Also, we want the panel to be left-aligned or right-aligned.

srivatsan_r (MyHDL)

Nearing completion!

The RISC-V project is almost over; just some tests have to be completed. This week I worked on the controller module that generates control signals after decoding the instructions. I then made the mul_div module.

It was easy to make these modules as I had the reference code in Verilog for the V-scale processor.

While writing the test for the mul_div module I faced some problems, as I was not getting the expected output. I couldn’t figure out why that happened. So, now my project partner is debugging the code.

I am now writing tests for the controller module and trying to figure out how exactly the controller works by giving some inputs and checking the output.

Hopefully we will complete this project within a week.

Anish Shah (Core Python)

GSoC'16: Week 9 and 10

Just a few weeks are left for GSoC to end. I am working on completing my patches and creating a blog post for final submission.

Reviews

I submitted 6 patches as a part of my GSoC work.

1. Add a GitHub PR field on issue’s page

Status: Complete

I cleaned up the logic and the patch is much simpler than the one I submitted first!

Status: Complete

3. GitHub webhooks

Status: Almost Complete

I updated the patch according to the reviews suggested by my mentor. The initial idea is completed. I added just one new thing that still needs to be reviewed - if issue is not referenced in PR title/body, then a new issue is created on b.p.o.

4. PR status on issue’s page

Status: Almost Complete

I have updated the patch. It just needs one final review. :)

Status: In Progress

I need some inputs from the PSF community on how many GitHub comments we need to add on bpo. You can find the e-mail thread here

6. Convert patches to PR

Status: In Progress

Work Product Submission

For my final submission, I’m creating one big blogpost with all the documents of the work that I have done till now. :) It’s still a work in progress. You can find it here.

Thank you for reading this blogpost. This is it for today. See you again. :)

August 03, 2016

sahmed95 (dipy)

IVIM documentation

Intravoxel incoherent motion¶

The intravoxel incoherent motion (IVIM) model describes diffusion and perfusion in the signal acquired with a diffusion MRI sequence that contains multiple low b-values. The IVIM model can be understood as an adaptation of the work of Stejskal and Tanner [Stejskal65] in biological tissue, and was proposed by Le Bihan [LeBihan84]. The model assumes two compartments: a slow moving compartment, where particles diffuse in a Brownian fashion as a consequence of thermal energy, and a fast moving compartment (the vascular compartment), where blood moves as a consequence of a pressure gradient. In the first compartment, the diffusion coefficient is $$\mathbf{D}$$ while in the second compartment, a pseudo diffusion term $$\mathbf{D^*}$$ is introduced that describes the displacement of the blood elements in an assumed randomly laid out vascular network, at the macroscopic level. According to [LeBihan84], $$\mathbf{D^*}$$ is greater than $$\mathbf{D}$$.
The IVIM model expresses the MRI signal as follows:
$S(b)=S_0(fe^{-bD^*}+(1-f)e^{-bD})$
where $$\mathbf{b}$$ is the diffusion gradient weighting value (which is dependent on the measurement parameters), $$\mathbf{S_{0}}$$ is the signal in the absence of diffusion gradient sensitization, $$\mathbf{f}$$ is the perfusion fraction, $$\mathbf{D}$$ is the diffusion coefficient and $$\mathbf{D^*}$$ is the pseudo-diffusion constant, due to vascular contributions.
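The signal equation above is straightforward to evaluate directly. Here is a small sketch with made-up parameter values (these numbers are for illustration only, not taken from the dataset used below):

```python
import numpy as np

# Bi-exponential IVIM signal: S(b) = S0 * (f*exp(-b*D_star) + (1-f)*exp(-b*D))
def ivim_signal(b, S0, f, D_star, D):
    return S0 * (f * np.exp(-b * D_star) + (1 - f) * np.exp(-b * D))

bvals = np.array([0., 10., 50., 100., 500., 1000.])
signal = ivim_signal(bvals, S0=1.0, f=0.1, D_star=0.01, D=0.001)
# At b = 0 the signal equals S0; the fast (pseudo-diffusion) component,
# governed by D*, decays away first as b increases.
```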
In the following example we show how to fit the IVIM model to a diffusion-weighted dataset and visualize the diffusion and pseudo diffusion coefficients. First, we import all relevant modules:
import matplotlib.pyplot as plt
from dipy.reconst.ivim import IvimModel
from dipy.data.fetcher import read_ivim
We get an IVIM dataset using Dipy’s data fetcher read_ivim. This dataset was acquired with 21 b-values in 3 different directions. Volumes corresponding to different directions were registered to each other, and averaged across directions. Thus, this dataset has 4 dimensions, with the length of the last dimension corresponding to the number of b-values. In order to use this model, the data should contain signals measured at a b-value of 0.
img, gtab = read_ivim()
The variable img contains a nibabel NIfTI image object (with the data) and gtab contains a GradientTable object (information about the gradients e.g. b-values and b-vectors). We get the data from img using get_data.
data = img.get_data()
print('data.shape (%d, %d, %d, %d)' % data.shape)
The data has 54 slices, with 256-by-256 voxels in each slice. The fourth dimension corresponds to the b-values in the gtab. Let us visualize the data by taking a slice midway (z=27) at $$\mathbf{b} = 0$$.
z = 27
b = 20
plt.imshow(data[:, :, z, b].T, origin='lower', cmap='gray',
           interpolation='nearest')
plt.axhline(y=100)
plt.axvline(x=170)
plt.savefig("ivim_data_slice.png")
plt.close()
Heat map of a slice of data
The region around the intersection of the cross-hairs in the figure contains cerebral spinal fluid (CSF), so it should have a very high $$\mathbf{f}$$ and $$\mathbf{D^*}$$; the area between the right and left is white matter, which should be lower; and the region on the right is gray matter and CSF. That should give us some contrast to see the values varying across the regions.
x1, x2 = 160, 180
y1, y2 = 90, 110
data_slice = data[x1:x2, y1:y2, z, :]
plt.imshow(data[x1:x2, y1:y2, z, b].T, origin='lower',
           cmap="gray", interpolation='nearest')
plt.savefig("CSF_slice.png")
plt.close()
Heat map of the CSF slice selected.
Now that we have prepared the datasets we can go forward with the IVIM fit. Instead of fitting the entire volume, we focus on a small section of the slice as selected above, to fit the IVIM model. First, we instantiate the IVIM model, using a two-stage approach: first, a tensor is fit to the data, and then the initial guesses for the parameters $$\mathbf{S_{0}}$$ and $$\mathbf{D}$$ obtained from this tensor by _estimate_S0_D are used as the starting point for the non-linear fit of IVIM parameters using Scipy’s leastsq or least_squares function, depending on which Scipy version you are using. All initializations for the model, such as split_b, are passed while creating the IvimModel. If you are using Scipy 0.17, you can also set bounds by passing bounds=([0., 0., 0., 0.], [np.inf, 1., 1., 1.]) while initializing the IvimModel. It is recommended that you upgrade to Scipy 0.17, since the fitting results might at times return values which do not make sense physically (for example, a negative $$\mathbf{f}$$).
ivimmodel = IvimModel(gtab)
To fit the model, call the fit method and pass the data for fitting.
ivimfit = ivimmodel.fit(data_slice)
The fit method creates an IvimFit object which contains the parameters of the model obtained after fitting. These are accessible through the model_params attribute of the IvimFit object. The parameters are arranged as a 4D array, corresponding to the spatial dimensions of the data, with the last dimension (of length 4) corresponding to the model parameters in the following order: $$\mathbf{S_{0}, f, D^*, D}$$.
ivimparams = ivimfit.model_params
print("ivimparams.shape : {}".format(ivimparams.shape))
As we see, we have a 20x20 slice at the height z = 27. Thus we have 400 voxels. We will now plot the parameters obtained from the fit for a voxel, and also various maps for the entire slice. This will give us an idea about the diffusion and perfusion in that section. Let (i, j) denote the coordinates of the voxel. We have already fixed the z component as 27, and hence we will get a slice which is 27 units above.
i, j = 10, 10
estimated_params = ivimfit.model_params[i, j, :]
print(estimated_params)
Next, we plot the results relative to the model fit. For this we will use the predict method of the IvimFit object to get the estimated signal.
estimated_signal = ivimfit.predict(gtab)[i, j, :]
plt.scatter(gtab.bvals, data_slice[i, j, :],
            color="green", label="Actual signal")
plt.plot(gtab.bvals, estimated_signal, color="red", label="Estimated Signal")
plt.xlabel("bvalues")
plt.ylabel("Signals")
S0_est, f_est, D_star_est, D_est = estimated_params
text_fit = """Estimated \n S0={:06.3f} f={:06.4f}\n            D*={:06.5f} D={:06.5f}""".format(S0_est, f_est, D_star_est, D_est)
plt.text(0.65, 0.50, text_fit, horizontalalignment='center',
         verticalalignment='center', transform=plt.gca().transAxes)
plt.legend(loc='upper right')
plt.savefig("ivim_voxel_plot.png")
plt.close()
Plot of the signal from one voxel.
Now we can map the perfusion and diffusion maps for the slice. We will plot a heatmap showing the values using a colormap. It will be useful to define a plotting function for the heatmap here, since we will use it to plot all the IVIM parameters. We will need to specify the lower and upper limits for our data. For example, the perfusion fractions should be in the range (0, 1). Similarly, the diffusion and pseudo-diffusion constants are much smaller than 1. We pass an argument called variable to our plotting function which gives the label for the plot.
def plot_map(raw_data, variable, limits, filename):
    lower, upper = limits
    plt.title('Map for {}'.format(variable))
    plt.imshow(raw_data.T, origin='lower', clim=(lower, upper),
               cmap="gray", interpolation='nearest')
    plt.colorbar()
    plt.savefig(filename)
    plt.close()
Let us get the various plots so that we can visualize them on one page:
plot_map(ivimfit.S0_predicted, "Predicted S0", (0, 10000), "predicted_S0.png")
plot_map(data_slice[:, :, 0], "Measured S0", (0, 10000), "measured_S0.png")
plot_map(ivimfit.perfusion_fraction, "f", (0, 1), "perfusion_fraction.png")
plot_map(ivimfit.D_star, "D*", (0, 0.01), "perfusion_coeff.png")
plot_map(ivimfit.D, "D", (0, 0.001), "diffusion_coeff.png")
Heatmap of S0 predicted from the fit
Heatmap of measured signal at bvalue = 0.
Heatmap of perfusion fraction values predicted from the fit
Heatmap of perfusion coefficients predicted from the fit.
Heatmap of diffusion coefficients predicted from the fit
References:
 [Stejskal65] Stejskal, E. O.; Tanner, J. E. (1 January 1965). “Spin Diffusion Measurements: Spin Echoes in the Presence of a Time-Dependent Field Gradient”. The Journal of Chemical Physics 42 (1): 288. Bibcode: 1965JChPh..42..288S. doi:10.1063/1.1695690.
 [LeBihan84] (1, 2) Le Bihan, Denis, et al. “Separation of diffusion and perfusion in intravoxel incoherent motion MR imaging.” Radiology 168.2 (1988): 497-505.
Example source code
You can download the full source code of this example. This same script is also included in the dipy source distribution under the doc/examples/ directory.

mr-karan (coala)

So this week I was/am busy making the coala-bears website. I had initially decided to make a simple Jekyll static website which would display data from an external YAML file. But after talking to the community, Lasse, Tushar and I settled on the AngularJS framework, mainly because we love the filters Angular provides. I didn't have much experience with Angular, so over the weekend I picked it up and managed to get a fairly decent website with basic functionality ready in 2-3 days. I also discovered ngrok, through which I demoed the website to the community. I will deploy it somewhere else permanently once I have finished working on the feedback I got from the community. There are some more things that need to be pushed to make it at least v1.0, so we can have it merged and released.

I am excited that there is around a week or 10 days to submit our work. Looking back, 3 months, well time really flew by, but I learnt a lot thanks to the wonderful coala community. More on these farewell lines, but a bit later.

Now is the time for action, not words

• Work on Syntax Highlighting PR since it’s not merged yet.
• Complete coala-bears website.

Happy Coding!

Yashu Seth (pgmpy)

Linear Gaussian

Hello, once again. With my project in its last stages, I am wondering how much I will miss this awesome summer. But anyways, I’ll spare this post from pouring out nostalgic feelings and keep them to myself till the final post :-) .

Today I’ll be describing linear Gaussian CPDs and the linear Gaussian Bayesian network.

Linear Gaussian CPD

A linear Gaussian conditional probability distribution is defined over a continuous variable. All the parents of this variable are also continuous. The mean of this variable is linearly dependent on the means of its parent variables, and the variance is independent of the parents.

For example,

P(Y | x1, x2, x3) = N(β1*x1_mu + β2*x2_mu + β3*x3_mu + β0; σ^2)

For its representation, pgmpy will have a class named LinearGaussianCPD in the module pgmpy.factors.continuous. To instantiate an object of this class, one needs to provide a variable name, the value of the beta_0 term, the variance, a list of the parent variable names and a list of the coefficient values of the linear equation (beta_vector); the list of parent variable names and the beta_vector are optional and default to None. Let me share some of the API with you to give a better picture.


Parameters
----------

variable: any hashable python object
The variable whose CPD is defined.

beta_0: int, float
Represents the constant term in the linear equation.

variance: int, float
The variance of the variable defined.

evidence: iterable of any hashable python objects
An iterable of the parents of the variable. None
if there are no parents.

beta_vector: iterable of int or float
An iterable representing the coefficient vector of the linear equation.

Examples
--------

# For P(Y| X1, X2, X3) = N(-2x1 + 3x2 + 7x3 + 0.2; 9.6)

>>> cpd = LinearGaussianCPD('Y', 0.2, 9.6, ['X1', 'X2', 'X3'], [-2, 3, 7])
>>> cpd.variable
'Y'
>>> cpd.variance
9.6
>>> cpd.evidence
['X1', 'X2', 'X3']
>>> cpd.beta_vector
[-2, 3, 7]
>>> cpd.beta_0
0.2



Linear Gaussian Bayesian Network

A Gaussian Bayesian network is defined as a network all of whose variables are continuous, and all of whose CPDs are linear Gaussian. These networks are of particular interest as they are an alternate form of representation of the joint Gaussian distribution.

These networks are implemented as the LinearGaussianBayesianNetwork class in the module pgmpy.models.continuous. This class is a subclass of the BayesianModel class in pgmpy.models and will inherit most of the methods from it. It will have a special method, to_joint_gaussian, that returns an equivalent JointGaussianDistribution object for the model. Let me share the API of this method.


>>> from pgmpy.models import LinearGaussianBayesianNetwork
>>> from pgmpy.factors import LinearGaussianCPD
>>> model = LinearGaussianBayesianNetwork([('x1', 'x2'), ('x2', 'x3')])
>>> cpd1 = LinearGaussianCPD('x1', 1, 4)
>>> cpd2 = LinearGaussianCPD('x2', -5, 4, ['x1'], [0.5])
>>> cpd3 = LinearGaussianCPD('x3', 4, 3, ['x2'], [-1])
>>> jgd = model.to_joint_gaussian()
>>> jgd.variables
['x1', 'x2', 'x3']
>>> jgd.mean
array([[ 1. ],
[-4.5],
[ 8.5]])
>>> jgd.covariance
array([[ 4.,  2., -2.],
[ 2.,  5., -5.],
[-2., -5.,  8.]])
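The mean vector and covariance matrix in the example above follow from the standard recursive construction of the joint Gaussian from linear Gaussian CPDs: process variables in topological order, propagating means and covariances through the linear coefficients. A sketch of that recursion (an illustration only, not pgmpy's actual implementation):

```python
import numpy as np

# Each CPD is (variable, beta_0, variance, parents, beta_vector),
# listed in topological order, matching the example above.
cpds = [('x1', 1, 4, [], []),
        ('x2', -5, 4, ['x1'], [0.5]),
        ('x3', 4, 3, ['x2'], [-1])]

idx = {c[0]: i for i, c in enumerate(cpds)}
n = len(cpds)
mean = np.zeros(n)
cov = np.zeros((n, n))

for var, beta_0, variance, parents, betas in cpds:
    i = idx[var]
    p = [idx[u] for u in parents]
    b = np.asarray(betas, dtype=float)
    # E[X_i] = beta_0 + sum_k beta_k * E[parent_k]
    mean[i] = beta_0 + b @ mean[p]
    # Cov(X_i, X_j) = sum_k beta_k * Cov(parent_k, X_j) for earlier X_j
    for j in range(i):
        cov[i, j] = cov[j, i] = b @ cov[np.ix_(p, [j])].ravel()
    # Var(X_i) = sigma_i^2 + b^T * Cov(parents) * b
    cov[i, i] = variance + b @ cov[np.ix_(p, p)] @ b
# mean → [1, -4.5, 8.5] and cov matches jgd.covariance above
```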



For more details, you can refer my ongoing PR, #709.

I hope I kept things simple and interesting. Goodbye. I will be back soon with another post.

August 01, 2016

Ravi Jain (MyHDL)

Finite State Machines

Well, it's been a tough couple of weeks. Proceedings at my university have caused me to slow down a bit, but I have been making progress. I was working on the RxEngine block when things got too complex, and I decided to take a step back and use finite state machines (FSMs) to develop much simpler and more readable code. As it turns out, I ended up rewriting the TxEngine block from scratch as well. In the midst of all this, the file system in my local repo got too clumsy, as I had multiple versions of the RxEngine implementation, and I planned to wait for the final revision of the blocks to avoid problems with commits later on while rebasing. I shall push the latest code in a day or two for review.

While implementing the TxEngine block using FSMs, I added the underrun functionality which was missing from the previous implementation. I also did a rough implementation of the flow control block, which accepts requests from the client to send pause control frames and triggers the TxEngine for the same.
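The appeal of the FSM approach is that the control logic collapses into a small transition table. A minimal sketch in plain Python (the states and events here are invented for illustration; this is not the actual MyHDL TxEngine code):

```python
# A transition-table FSM: (current_state, event) -> next_state.
TRANSITIONS = {
    ('IDLE', 'start'): 'PREAMBLE',
    ('PREAMBLE', 'done'): 'DATA',
    ('DATA', 'underrun'): 'ERROR',
    ('DATA', 'done'): 'IDLE',
    ('ERROR', 'reset'): 'IDLE',
}

def step(state, event):
    # Stay in the current state for events with no transition defined.
    return TRANSITIONS.get((state, event), state)

state = 'IDLE'
for event in ['start', 'done', 'underrun', 'reset']:
    state = step(state, event)
# → state == 'IDLE': the underrun path ran through ERROR and recovered on reset
```

In MyHDL the same table would typically live inside a clocked process with the state held in a Signal, but the readability benefit is the same.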

I also had a discussion about how to provide clocks to sub-blocks and how to handle resets with Josy, one of the mentors, who suggested providing clocks to sub-blocks directly in the top block as opposed to relaying them through the sub-blocks. A good reason I can think of to support this is that, if your system is a bit big and complex, relaying clocks might cause problems in simulation. I shall discuss this in more detail in upcoming blogs.

Sheikh Araf (coala)

[GSoC16] Week 10 update

I’m at the final stage of my GSoC project, which is writing a coafile editor for the Eclipse plug-in I’ve been working on. After reading a lot of articles and tutorials I’ve finally settled on how to go about writing the editor.

To implement the editor I’ll be extending the FormEditor class. This approach is helpful because it lets you add one or more FormPages as well as a StructuredTextEditor to view the raw text file.

Next I’ll be using the Eclipse SWT to implement the GUI of the FormPage. It will look something like this:

After implementing this if I have some time left I’ll also work on the bear creation wizard for the plug-in.

July 31, 2016

GSoC week 10 roundup

@cfelton wrote:

The slump for most students has continued, all students need to
be making daily commits, weekly PRs, and weekly blogs. If you
(student) are not following through with these you are in jeopardy
of failing.

Student week 10 summary (last blog, commits, PR):

jpegenc:
health 88%, coverage 95%
@mkatsimpris: 28-Jul, >5, Y
@Vikram9866: 25-Jun, >5, Y

riscv:
health 96%, coverage 91%
@meetsha1995: 14-Jul, >5, N
@srivatsan: 21-Jul, >5, Y

gemac:
health 87%, coverage 89%
@ravijain056, 22-Jul, 0, N

pyleros:
health missing, coverage 70%
@formulator, 26-Jun, 0, N

@meetsha1995 and @sriramesh4: there has been low activity on

@Ravi_Jain: there has been no publicly available progress,

if you agree or disagree with the fail evaluation.

Links to the student blogs and repositories:

Merkourious, @mkatsimpris: gsoc blog, github repo
Vikram, @Vikram9866: gsoc blog, github repo
Meet, @meetshah1995: gsoc blog, github repo
Srivatsan, @srivatsan-ramesh: gsoc blog, github repo
Ravi @ravijain056: gsoc blog, github repo
Pranjal, @forumulator: gsoc blog, github repo

Posts: 5

Participants: 4

meetshah1995 (MyHDL)

Let's build a processor !

We finally chose Zscale (by Berkeley Architecture Research) as the core for our project, as it had a Verilog implementation and was a simple core using the RISC-V ISA, meeting all our specifications.

The Zscale core, like any other processor, consists of the standard modules found in a processor:

• Controller
• ALU (Arithmetic and Logic Unit)
• Register File
• CSR File
• Pipeline stages
• Immediate Generators
• Muxes
Currently we have ported all the modules to MyHDL with tests (with the exception of one module), and we are now assembling them to form the core.

Zscale is just like any other processor implementation; the reason such a complex processor can be described in hardware so compactly is the beautifully designed ISA.

As I was converting the core to MyHDL, I realised the placement of each and every bit in the ISA was ingeniously planned, which made it easy to write logic for the processor.

To conclude, as we near the completion of this processor, the entire MyHDL community will have a RISC-V processor which supports RV32I (the crux of the ISA).

See you next week,
MS

July 30, 2016

Utkarsh (pgmpy)

Monte Carlo Methods

Monte Carlo methods are a class of methods or algorithms in which we try to approximate numerical results using repeated random sampling. Let us look at a couple of examples to develop some intuition about Monte Carlo methods.

The first example is about the famous Monty Hall problem. For those who don’t know the Monty Hall problem, here is the statement:

“Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?”

There are also certain standard assumptions associated with it:

• The host must always open a door that was not picked by the contestant.

• The host must always open a door to reveal a goat and never the car.

• The host must always offer the chance to switch between the originally chosen door and the remaining closed door.

Now let’s try to find the solution of the above problem using a Monte Carlo method. To find the solution using Monte Carlo methods, we need to simulate the procedure (as mentioned in the statement) and calculate probabilities based on the outcomes of these experiments. I don’t know about you, but I’m too lazy to simulate this experiment manually :P, so I wrote a Python script which does it on my behalf ;).

import numpy as np

# counts the number of times we succeeded on switching
successes_on_switch = 0
prior_probs = np.ones(3)/3
door = ['d1', 'd2', 'd3']
# since the doors are symmetrical we can run the simulation assuming we always
# select door 1 (without loss of generality), so the host can open only door 2 or door 3
# running the simulation 1000000 times
for car_door in np.random.choice(door, size=1000000, p=prior_probs):
    if car_door == 'd1':
        # we chose door 1 and the car is behind door 1, so switching never wins
        successes_on_switch += 0.0
    elif car_door == 'd2':
        # we chose door 1 and the car is behind door 2; the host can open only
        # door 3, so switching wins
        successes_on_switch += 1.0
    elif car_door == 'd3':
        # we chose door 1 and the car is behind door 3; the host can open only
        # door 2, so switching wins
        successes_on_switch += 1.0
success_prob_on_switch = successes_on_switch/1000000.0
print('probability of success on switching after host has opened a door is:', success_prob_on_switch)


After I ran the script I got the output: probability of success on switching after host has opened a door is: 0.666325. You might get a different output (because of randomness) but it will be approximately the same. The actual solution you get by solving the conditional probabilities is 2/3, which is approximately 0.6666. As is evident, the result is quite a good approximation of the actual value.
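The exact 2/3 can also be checked with a short Bayes' rule computation (a sanity check I'm adding for illustration, not part of the original script):

```python
from fractions import Fraction

# We pick door 1 and the host opens door 3. The host's behaviour gives:
# P(host opens d3 | car behind d1) = 1/2, | d2) = 1, | d3) = 0
prior = Fraction(1, 3)
likelihoods = {'d1': Fraction(1, 2), 'd2': Fraction(1), 'd3': Fraction(0)}

evidence = sum(prior * l for l in likelihoods.values())
# Switching wins exactly when the car is behind door 2:
p_switch_wins = prior * likelihoods['d2'] / evidence
print(p_switch_wins)  # 2/3
```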

The next example is about approximating the value of π.

The method is simple: we choose a unit square and the circle inscribed in it; their areas are in the ratio π/4. Now we generate some random points and count the number of points inside the circle and the total number of points. The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiplying the result by 4 estimates π.

Here is Python code which performs the above simulation:

import numpy as np
x = np.random.rand(7000000) # Taking 7000000 random points in between [0, 1), for x-coordinate
y = np.random.rand(7000000) # Taking 7000000 random points in between [0, 1), for y-coordinate
points_in_circle = (np.square(x) + np.square(y) <=1).sum() # points which lie in the circle x^2 + y^2 =1 (circle centred at origin with unit radius)
pi  = 4 * points_in_circle / 7000000.0
print(u"The approximate value of π is: ", pi)


The output I got: The approximate value of π is: 3.14158742857, which is approximately equal to the value of π, 3.14159.

If you observe both of the above examples, there is a nice overlapping structure to these solutions:

• First define the input type and its domain for the problem

• Generate random numbers from the defined input domain

• Apply deterministic operation over these numbers to get the required result

Though the above examples are simple to solve, Monte Carlo methods are useful for obtaining numerical solutions to problems which are too complicated to be solved analytically. The most popular class of Monte Carlo methods is Monte Carlo approximation of integrals, a.k.a. “Monte Carlo integration”.

Suppose that we are trying to estimate the integral of a function $f(x)$ over some domain $D$:

$I = \int_D f(x)\,dx$

Sometimes these integrals can be solved analytically, but when a closed form solution does not exist, numerical integration methods can be applied. However, numerical methods quickly become intractable with even a small number of dimensions, which are quite common in statistics. Monte Carlo integration allows us to calculate an estimate of the value of the integral $I$.

Assume that we have a probability density function (PDF) $p(x)$ defined over the domain $D$. Then we can re-write the above integral as:

$I = \int_D \frac{f(x)}{p(x)}\, p(x)\,dx$

The above integral is equal to $E\left[\frac{f(x)}{p(x)}\right]$, the expected value of $\frac{f(x)}{p(x)}$ with respect to a random variable distributed according to $p(x)$.

This equality is true for any PDF on $D$, as long as $p(x) \neq 0$ whenever $f(x) \neq 0$. We know that we can estimate the value of an expectation by generating a number of random samples according to the distribution of the random variable and finding their average. As more samples are taken, this value is sure to converge to the expected value.

In this way we can estimate the value of $I$ by generating a number of random samples according to $p$, computing $f/p$ for each sample, and finding the average of these values. This process as described is what we call Monte Carlo integration.

One might be worried about points where $p(x) = 0$, but the probability of generating a sample at such a point is 0, so none of our samples will cause the problem.

We can write the above procedure as the following simple steps. If the integral is of the format

$I = \int_D f(x)\,dx$, where $D$ is the domain:

• First find the volume of the domain, i.e. $V = \int_D dx$
• Choose $p(x)$ as the uniform distribution over $D$, and draw $N$ samples $x_1, x_2, \ldots, x_N$

• Now we can approximate $I$ as:

$I \approx \frac{V}{N} \sum_{i=1}^{N} f(x_i)$

Let's use the above method and try approximating the integral of $e^{x^2/2}$ over $(0, 1)$.

Let us define $p(x)$ as the uniform distribution between 0 and 1, i.e. $U(0, 1)$.

The volume is:

$V = \int_0^1 dx = 1$

We will now draw N independent samples from this distribution and find the average of $f$ at those samples, which will be our Monte Carlo approximation for $I$.

Here is a python code:

import numpy as np
N = 1000000  # Number of Samples we want to draw
x = np.random.rand(N)  # Drawing N samples from p(x)
Expectation = np.sum(np.exp(x*x / 2)) / N   # Taking average of those samples
print("The Monte Carlo approximation of integration of e^{x^2/2} for limits (0, 1) is:", Expectation)


The output is: The Monte Carlo approximation of integration of e^{x^2/2} for limits (0, 1) is: 1.19477498217. The actual value of the integral, which I calculated using WolframAlpha, is 1.19496.

I got a closer estimate when I increased the sample size to 100 million: 1.1949555144469735.
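The error of the estimator shrinks like 1/√N, which can be seen by rerunning the estimator at growing sample sizes (this is my sketch, not code from the post):

```python
import numpy as np

# Rough convergence check for the e^(x^2/2) example: each 100x increase
# in samples buys roughly 10x accuracy.
rng = np.random.default_rng(42)
true_value = 1.19496  # value quoted from WolframAlpha above

estimates = {}
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    x = rng.random(n)
    estimates[n] = np.mean(np.exp(x * x / 2))  # V = 1 on the domain (0, 1)
    print(n, estimates[n], abs(estimates[n] - true_value))
```

The printed absolute errors should drop by roughly an order of magnitude per row, though individual runs fluctuate.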

Let us now try approximating the expected value of a truncated normal distribution. The truncated normal distribution is the probability distribution of a normally distributed random variable whose value is bounded below or above (or both).

The probability density function for the truncated normal distribution is defined as:

f(x; μ, σ, a, b) = φ((x − μ)/σ) / (σ · (Φ((b − μ)/σ) − Φ((a − μ)/σ))) for a ≤ x ≤ b (and 0 otherwise),

where φ is the probability density function of the standard normal distribution and Φ is the cumulative distribution function of the standard normal distribution.
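The formula can be checked numerically against scipy (a sketch of mine; note that scipy's truncnorm takes the bounds in standardized units rather than on the x scale):

```python
import numpy as np
from scipy.stats import norm, truncnorm

def truncated_normal_pdf(x, mu, sigma, a, b):
    """pdf of Normal(mu, sigma) truncated to [a, b], with a, b on the x scale."""
    xi = (x - mu) / sigma
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    dens = norm.pdf(xi) / (sigma * (norm.cdf(beta) - norm.cdf(alpha)))
    return np.where((x >= a) & (x <= b), dens, 0.0)

# scipy's truncnorm expects the *standardized* bounds alpha and beta, so to
# truncate Normal(3, 1) to [2, 7] we pass a=-1 and b=4:
x = np.linspace(2.5, 6.5, 9)
mine = truncated_normal_pdf(x, mu=3.0, sigma=1.0, a=2.0, b=7.0)
ref = truncnorm.pdf(x, a=-1.0, b=4.0, loc=3.0, scale=1.0)
```

The two arrays agree elementwise, confirming the formula above.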

Now we can approximate the expected value of the Truncated normal distribution.

We will define f as the truncnorm pdf with loc μ = 3 and scale σ = 1, and bounds a = 2 and b = 7. One caveat: scipy specifies a and b in units of scale relative to loc, so f(x; 3, 1, 2, 7) below actually has support (μ + aσ, μ + bσ) = (5, 10).

The expected value is given by

E[X] = ∫ x f(x) dx,

and the volume of our sampling range (2, 7) is V = 7 − 2 = 5.

Now we will draw N independent samples from uniform(2, 7). Samples below 5 contribute zero (the pdf vanishes there), and the density above 7, four standard deviations from the mean, is negligible, so this range captures essentially all of the mass.

So we can approximate the expected value as

E[X] ≈ (5/N) · Σᵢ xᵢ f(xᵢ)

Here is the python code for the above procedure:

import scipy.stats
import numpy as np

N = 1000000  # 1 million sample size
x = 5 * np.random.rand(N) + 2  # sampling over uniform(2, 7)
pdf_vals = scipy.stats.truncnorm.pdf(x, a=2, b=7, loc=3, scale=1)  # f(x; 3, 1, 2, 7)

monte_carlo_expectation = 5 * np.sum(pdf_vals * x) / N
actual_expectation = scipy.stats.truncnorm.mean(a=2, b=7, loc=3, scale=1)

print("The monte carlo expectation is {} and the actual expectation of Truncated normal distribution f(x; 3, 1, 2, 7) is {}".format(
    monte_carlo_expectation, actual_expectation))


The output of the above code sample which I got was: The monte carlo expectation is 5.365583152790689 and the actual expectation of Truncated normal distribution f(x; 3, 1, 2, 7) is 5.373215532554829.

Wrapping Up

In the above examples it was easy to sample from the probability distribution directly. However, in most practical problems the distributions we want to sample from are far more complex. In my upcoming posts I'll cover Markov Chain Monte Carlo, the Metropolis-Hastings algorithm, Hamiltonian Monte Carlo and the No-U-Turn Sampler, which cleverly allow us to sample from sophisticated distributions.

Recap

The past week I attended EuroPython 2016. It was the first time I was attending a tech conference, so I was impressed with stuff other people might find usual or normal. I also managed to finish the packaging script for the coala-bears. Since both of those announcements come with long stories, I will try to summarize them in their corresponding sections. Let's start with...

Europython 2016

The conference was held in Bilbao, Spain this year (where it was also held last year). I was excited to finally meet the coala team. Also, I had never been to Spain before, so adding all that up, it was a promising trip for me.

On the first day I found out 2 things about Bilbao while we were struggling to find our accommodation: it is a really beautiful city, and very few people actually speak English. The latter didn't matter much since we were spending most of our day at the conference venue, but it made ordering food really funny when we were eating out in the evening.

The EuroPython schedule was as follows: keynotes from 9 to 10 a.m. usually, then workshops and/or talks until lunch, and after that more workshops and talks; it was awesome. I have to admit I couldn't be there for "every" keynote because we were hacking on coala almost every night. The cover picture was taken by our mighty leader Lasse, and it features part of the coala team after lunch.

I attended both workshops and talks, but often I had to compromise because there were 5 talk tracks and 2 workshop tracks to choose from. Being a beginner with Python myself, I learned about a lot of technologies and how to use them, like TensorFlow and Theano for machine learning.

All in all, it was a great experience with lots of learning, tasty food, "mostly" interesting conversations and fun group activities (real community bonding).

Packaging

Enough with the fun, let's talk a bit about work. I have explained in the past a bit about the need for a packaging tool for bears. The first issue that I encountered was not as trivial as it may sound: choosing the package format. Initially I wanted to go with PyPI, but my project focuses on enabling users to write bears in other languages, which means that code other than Python has to be packaged, so I decided (with some help from the mentors) to go with conda.

The second issue was that I shouldn't write the tool from scratch, in order to avoid code duplication. For that matter I had to extend an already developed tool (for the second time this GSoC) by another coalanian. That tool handled packaging and uploading to PyPI for every bear in the existing coala-bears repo; that way we keep bear independence in case someone wants to install them separately. I could reuse some methods, for example the file generation from templates, which was fortunate for me.

Now I can proudly announce that you can create a conda package for any bear just by pointing the tool to the bear directory. It will create all the necessary files for you, and it will try to fetch the repository URL from your .git/config file (if that's not possible, it will just prompt you for the URL).

Wrap Up

I am now entering the last milestone of my GSoC, in which I will create templates for other languages with code stubs already in them (functions for creating Result objects, reading input, sending output). I am thinking I will initially try to make one for a compiled language (C++) and one for an interpreted language (JavaScript outside the browser, with Node).
After that I am going to write tutorials on how to use all the tools I developed (extended) and on how to write a bear in a different language. Cya

Preetwinder (ScrapingHub)

GSoC-4

Hello,
This post continues my updates on my work porting frontera to Python 2/3 dual support.
This blog post got delayed, sorry for that. My work on tests continues. The last few days have mostly been focused on getting the existing tests to run on Python 3 on Travis. I also made many new changes to the existing PY3 PR after feedback from my mentors, link - https://github.com/scrapinghub/frontera/pull/168 Once this PR is merged, only three tests will fail on Python 3, and frontera should run in single process mode successfully. Hopefully it will be merged in the next few days. After that, the remaining work is some new tests (mainly for HBase and the workers), which I am already working on and which shouldn't take more than two or three days, and the final job of making changes to HBase, the workers, ZMQ, and the encoders/decoders to work in Python 3. The challenge of this was significantly reduced by the recent decision by my mentors to convert URLs to an ASCII representation, eliminating the need to worry about storing encoding information. So it shouldn't take long for me to cover this. I want to spend the final week on finalizing the release and making changes to the documentation.

GSoC-4 was originally published by preetwinder at preetwinder on July 29, 2016.

TaylorOshan (PySAL)

Flow Associations and Spatial Autoregressive models

In the last few weeks I had the opportunity to attend the 2016 SciPy conference with several of my mentors and contributors to the PySAL project. In this time I also completed the three types of spatial weights for flows: network-based weights, proximity-based weights using contiguity of both origins and destinations, and lastly, distance-based weights using a 4-dimensional distance (origin x, origin y, destination x, destination y). These three types of weights can be used within the vector-based Moran's I that was coded in previous weeks to explore spatial autocorrelation, as well as within a spatial autoregressive (lag) model. In the process of building the distance-based weights, I was also able to contribute some speed-ups to the general DistanceBand class, which have been incorporated into the library. Specifically, the DistanceBand class now avoids looping during construction, and there is a build_sp boolean parameter that, when set to False, will provide speed-ups if one is using a relatively large threshold (or no threshold) such that the distance matrix is more dense than sparse.

More recently, work has focused on developing a version of the spatial lag model where there is a spatial lag for the origin, destination and origin-destination spatial relationships. It looks like it will be possible to extend the existing ml_lag.py script to estimate parameters, though the proper covariance matrix will be more involved. During last week's meeting, my mentors and I discussed several approaches to developing code to carry out the estimation of the covariance matrix, which is what I will continue to work on before pivoting to the final phase of the project, where I will clean up the code and finish documentation.

jbm950 (PyDy)

GSoC Week 11

Somehow I think I was off by a week: I think last week's blog post covers weeks 9 and 10 and this week's covers week 11. This week I created a full draft of all components of the SymbolicSystem class, which will take the place of the equations of motion generator "base class" that was discussed in my project proposal. I began by creating all of the docstrings for the class, followed by the test code. With the documentation and test code written, it was a simple matter to finish off the code for the class itself. Lastly I added documentation in two places in SymPy: one place contains the autogenerated documentation from the docstrings, and in the other I adapted an example from PyDy to show how to use the new class.

After working on SymbolicSystem I decided to try to finish off an old PR of mine regarding the init_printing code that Jason and I had discussed at SciPy. The idea was to build separate dictionaries to pass to the different printers in IPython, based on the parameters that the specific printers take, finding this information using inspect.getargs(). The problem arose when trying to implement this solution: each separate printer has only an expr argument and a **settings argument, and the different possible parameters are processed internally by the printer. This means there would not be an elegant way to build dictionaries for each printer.

The next thing I worked on this week was looking into Jain's version of the order(N) method, as suggested last week. When I started looking over his book, however, I found that it uses a rather different notation than Featherstone's and has some additional terms. With the summer coming to an end, and since I am already familiar with Featherstone's version of the method, I have decided to move forward with it. To that end I reread the first part of chapter 7 of Featherstone's book, where he discusses the articulated body method.

I reviewed two PR’s this week. This work was rather quick as they were simply documentation additions. I verified that the method docstrings matched what the methods actually do and that the module docstrings included the different functions present in the files. Determining that they were correct, I gave the +1 to merge, and they have both since been merged.

Future Directions

The plan for next week is to focus entirely on the order(N) articulated body method of forming the equations of motion. I plan on writing the three passes of the method as if I already have all of the information and methods needed to make it work. I expect this to be the best way to determine what additional code I will need, in addition to finding my weak points in how well I understand the method. Once I have a skeleton of how the algorithm is supposed to work, I will stop working directly on the algorithm itself and start working on the peripheral code, such as the joint and body code or the spatial vector processing methods.

PR’s and Issues

• (Open) [WIP] Added system.py to physics/mechanics PR #11431
• (Merged) Added docstrings to delta and mid property methods PR #11432
• (Merged) Added top-level docstring for singularities.py PR #11440

July 28, 2016

Ranveer Aggarwal (dipy)

Last week, we began building an orbital menu. There is now some progress, with the basic elements working well with it.
We now have text, a button and cubes working directly with the orbital menu, very similar to the 2D GUI.

Using the ideas from Marc’s example, the Assembly-Follower combination was integrated into our existing code. The assembly has the object, an orbit (a disk) and of course, the parts. We allotted these cubes positions at angles of 360/(number of parts) degrees to the X-axis. This is how it looks:

The possibilities are endless! :D

ToDo

• Sliders currently don’t work because sliding in 3D is a bit complex.
• There is a need to explore better methods to allot elements to the orbit.

Shridhar Mishra (italian mars society)

Ironpython and Unity

Using IronPython with the Unity game engine.

We already know that we can use Python to make .NET internal calls.
Now we can use the same approach to start a console that accepts a scripting language inside the Unity engine.
To do this we have to include certain DLL files, which must be present in Assets > Plugins:
• IronPython.dll
• IronPython.Modules.dll
• Microsoft.Scripting.Core.dll
• Microsoft.Scripting.dll
• Microsoft.Scripting.Debugging.dll
• Microsoft.Scripting.ExtensionAttribute.dll
• Microsoft.Dynamic.dll

Once the plugins are in place, initiate the C# side:

PythonEngine engine = new PythonEngine();
engine.LoadAssembly(Assembly.GetAssembly(typeof(GameObject)));
engine.ExecuteFile("Test.py");

where Test.py is the Python script.

Initiate the Python side:

from UnityEngine import *

Debug.Log("Hello world from IronPython!")


Aron Barreira Bordin (ScrapingHub)

Scrapy-Streaming [6/7] - Scrapy With Java

Hi,

In these weeks, I’ve implemented the Java library for developing Scrapy spiders. Now you can develop Scrapy spiders easily using the scrapystreaming lib.

It’s a helper library to ease the development of external spiders in Java. It allows you to create the Scrapy Streaming JSON messages using Java objects and methods.

Examples

I’ve added a few examples about it, and a quickstart section in the documentation.

PRs

Aron.

Scrapy-Streaming [6/7] - Scrapy With Java was originally published by Aron Bordin at GSoC 2016 on July 27, 2016.

fiona (MDAnalysis)

The private life of cats

Welcome back to another fun round of Python-wrangling (a.k.a. what I’ve learnt to do with Python during GSoC)! Today I’ll be talking about ‘private’ variables and the property attribute.

If you’re not familiar with Python but want to follow along, you could have a look back at the brief notes I made back here, or go check out a (proper) tutorial.

‘Private’ and ‘public’ in Python

Let’s say we have a House class. For many instances of House – including (of course) ours – it’ll have a cat attribute. Within House, we can interact with the cat as self.cat, so we can define the necessary methods for looking after them (feed, clean, worship, feed again…). But the cat is also accessible from outside the house (as house_instance.cat): this means anyone could come along and look at or, heaven forbid, swap our cat with something else – even something non-feline! A house’s cat is ‘public’, where we might prefer they be ‘private’.

Many programming languages have ways of declaring variables as ‘private’ (more or less, which things can’t be used outside of the current bit of code). In Python, ‘private’ variables don’t strictly exist. Instead, we follow a convention: if we name something beginning with an underscore (say, _cat), we’re indicating that object shouldn’t be considered public and so shouldn’t be used outside of e.g. the class it’s an attribute of. People can still technically access and change our cat through our_house._cat, but we’re politely asking them not to.

(The use of underscores isn’t just for when we don’t want people to mess with attributes, it can also be for things which don’t really serve a direct purpose to the user – say, a method that performs an intermediate step, and so isn’t useful by itself. By including the leading underscore, we can indicate to a user that they need not worry about this bit of the code, and it simplifies documentation. There are also a couple other uses for leading and trailing underscores in Python – you could see more about that in here.)

But what if we decide we do want people to see our cat? (He is, after all, the very best cat.) What we can do is make a property attribute, cat, that returns the value of _cat when we want its value but will throw an error if we try to set it. We can do this using @property, the property decorator.
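A minimal sketch of such a House (the names come from the discussion above, but the class body is mine):

```python
class House:
    def __init__(self, cat):
        self._cat = cat      # 'private' by convention: the leading underscore

    @property
    def cat(self):
        """A read-only view of our (very best) cat."""
        return self._cat

our_house = House("Whiskers")
print(our_house.cat)         # the getter runs and returns "Whiskers"
# our_house.cat = "a fish"   # would raise AttributeError: no setter defined
```

With only the getter defined, reading works but assignment is refused.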

A tangent on decorators

There may arise situations when using Python where we might want to ‘modify’ a bunch of functions/methods in the same way. For example, say we’ve programmed our Python friend to take care of several cat-related chores at certain times; but as any cat owner knows, as soon as you do something for your cat, you’re probably going to need to do it again pretty soon.

We want to modify all the cat-caring actions so instead of performing them just once at the specified time, there’s a five minute delay then the action is performed a second time. To do this, we’ll use a decorator.
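Such a do-it-again decorator could look like this (my own sketch; the five minute sleep is stubbed out as a comment so the example runs instantly):

```python
import functools

def do_twice(chore):
    """Wrap a cat chore so it runs, waits, then runs a second time."""
    @functools.wraps(chore)
    def wrapper(*args, **kwargs):
        chore(*args, **kwargs)
        # a real version would time.sleep(300) here for the five minute delay
        return chore(*args, **kwargs)
    return wrapper

@do_twice
def feed_cat(bowl):
    bowl.append("food")
    return len(bowl)

bowl = []
feed_cat(bowl)   # the bowl ends up with two servings
```

The @do_twice line is just syntactic sugar for feed_cat = do_twice(feed_cat).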

property is a built-in decorator in Python that we can apply to a ‘getter’ method (and optionally, additional ‘setter’ and ‘deleter’ methods – more on these below) to get a property attribute which, instead of having a stored value like other attributes, will return the result of the ‘getter’ method when its value is asked for. Above, we’re using a simple ‘getter’ that just returns the value of another attribute, but we could have something that, say, returns something different depending on the values of other attributes, or calculates a new value - see some of the examples below.

Conditional setting

Back to our House/cat situation: we’ve prevented anyone being able to come along and swap our cat for a fish, but what if we did want to let people change him, so long as we can control what he’s changed to? Obviously, we still want a cat, and let’s say we’ll also only accept the change if the proposed new cat is cuter than our current one. Having used @property to create a cat property, we can now add a ‘setter’ for cat using @cat.setter. Now, rather than throwing an error when we try house_instance.cat = new_value, this ‘setter’ method will be run; depending on new_value, there may or may not be any change.

We could similarly use @cat.deleter to specify a ‘deleter’ to run when we use del(cat) - for example, when we move house we can use this to both remove the _cat attribute and perform clean-up tasks like removing all the cat hair or uninstalling the cat door.
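Putting the setter and deleter together (again my own sketch; the cuteness score is an invented stand-in for the ‘conditional setting’ check):

```python
class House:
    def __init__(self, cat, cuteness):
        self._cat = cat
        self._cuteness = cuteness

    @property
    def cat(self):
        return self._cat

    @cat.setter
    def cat(self, candidate):
        name, cuteness = candidate
        if cuteness > self._cuteness:   # only swap for a strictly cuter cat
            self._cat = name
            self._cuteness = cuteness

    @cat.deleter
    def cat(self):
        del self._cat                   # moving out: remove the cat...
        self._cuteness = 0              # ...and do the related clean-up

our_house = House("Tom", cuteness=7)
our_house.cat = ("Goldie the fish", 1)  # silently rejected: not cute enough
our_house.cat = ("Fluffy", 9)           # accepted
```

Assigning through the property runs the ‘setter’, so the change only happens when the condition holds.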

Some practical examples

On the more practical side of things, here are some snippets showing how I’ve been using getters, setters and ‘private’ attributes in various ways for my GSoC project, namely the ‘add_auxiliary’ part. Again, you can go see the full version on GitHub.

Some brief background - the main interface when reading auxiliary data is the auxiliary reader class (there will be different reader classes for different auxiliary formats, but they all inherit from AuxReader, where all the common bits are defined); the reader has an auxstep attribute, an ‘auxiliary step’ which stores the data and time for each step (again, there are different auxiliary step classes for different formats, with AuxStep holding the common bits).

And that’s a brief rundown of private vs. public and properties in Python - I hope you found that fun and informative!

See you next time!

July 27, 2016

Back Home

Ah, what a wonderful trip it has been. I just came back from EuroPython 2016 in Bilbao, which ended 3 days ago. And I loved it. Not only was I with wonderful people whom I finally got the chance to meet in person, but the conference itself was amazing!

So the trip was cool?

Yes. It was really good. The conference was full of amazing talks, sprints, and not to forget, lightning talks! The only things that disappointed me were the workshops. Some of them were too hard for me to follow, and some of those I did attend were pretty bad; it usually took us half of the time just to set up their tools and stuff, with everyone having dependency issues. Lots of time wasted.

Hehe, I probably coded as much in 7 days during the conference as I did in 14 at home. Productivity at the late night coding sessions was quite high, since I had people near me to help with any question. So instead of working a day on something and reworking it the next day, I could take the best approach and do it on the first day. Cool, huh?

GSoC progress

I can safely say that the project is overall almost done, with only the last grunt-work things yet to be done. The upload tool works really well, and it’s merged. The installation tool is quite close and does what it should. What’s left? Filling in all the requirements. And this is the worst part: I now have to fill in the correct requirements for each bear. Once that’s done, I see my project finished in no time.

mr-karan (coala)

This has been a good week. I have finally been able to get some success with my Syntax Highlighting project. Before we dive into the inner details of how it unfolded, I want to put this quote here.

“Measuring programming progress by lines of code is like measuring aircraft building progress by weight.” - Bill Gates

I had been working on this task on and off for almost a month now. I started by diving deep into the code-base, which is like finding a needle in a haystack. Once I got to the place where I needed to make the change, the process that followed wasn’t easy for me. I had to understand what different functions did in order to incorporate syntax highlighting. I am using the Pygments library for this task. I really like the library and it has helped me a lot in making this process much simpler. After experimenting a lot with Pygments I got the required code in shape, ready to be plugged in. I was able to get syntax highlighting on my terminal, but I wouldn’t yet call it a moment of joy, as I had to do something about adding bullet marks on spaces/tabs like in the previous version. I somehow stumbled upon VisibleWhitespaceFilter, which was exactly what I was looking for. Since it made my work easy, I decided to implement something extra, and I added background highlighting for result.message by overriding Style from pygments.style.
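A rough sketch (my own minimal example, not the actual coala PR) of the Pygments pieces mentioned above — a lexer, the whitespace-marking filter, and a terminal formatter:

```python
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import TerminalFormatter
from pygments.filters import VisibleWhitespaceFilter

lexer = PythonLexer()
# render spaces and tabs as visible marks, like the bullet marks described above
lexer.add_filter(VisibleWhitespaceFilter(spaces=True, tabs=True))

colored = highlight("def answer():\n\treturn 42\n", lexer, TerminalFormatter())
print(colored)
```

Printing the result in a terminal shows the snippet with ANSI colors and visible whitespace marks.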

You can see the whole thing in action here.

The PR is currently in review, but I am glad that it finally worked and my GSoC project is nearing completion. Oh, and coming back to the quote at the beginning: well, I always believe in not reinventing the wheel, and above I have shown how you can implement some cool stuff in as few LOC as possible.

• Get this PR accepted.
• Complete some more bears
• Get started on the coala-bears website

Happy Coding!

July 26, 2016

Upendra Kumar (Core Python)

Hello everyone. Needed Feedback for pip_tkinter.

Hello my fellow GSoCers. Hope you all are having a very good time and regularly checking your blog feeds on terri.toybox.ca/python-soc/.

I just need feedback on my tkinter-based pip GUI application in order to improve it further. Your feedback can be very valuable and helpful in letting me know what people may expect from this project. Let me tell you about it:

We have made a preliminary version of a GUI for pip. This project is intended to provide a GUI version of “pip” (target audience: beginners in Python, or people who are not familiar with the command line).

How to install pip_tkinter?

• Clone the repo from https://github.com/upendra-k14/pip_gui.git, i.e. run: git clone https://github.com/upendra-k14/pip_gui.git
• Navigate to the // folder
• git checkout dump_code
• From inside the pip_gui folder run: python3 -m pip_tkinter

Please post as many issues and suggestions as you like here: https://github.com/upendra-k14/pip_gui/issues

The project idea is discussed in these issues on the Python Bug Tracker:

2. Issue #27051 : Create PIP GUI

The GitHub repo of the project : https://github.com/upendra-k14/pip_gui/tree/dump_code

Pulkit Goyal (Mercurial)

Iterating Dictionaries

Dictionaries, also known as hash tables, are one of the basic data structures we use in programming and hence a built-in data type in Python. A dictionary is defined as a set of key-value pairs. The Python library functions for reading dictionary items have undergone changes in implementation with newer versions of the language.
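A quick illustration of the change (my own example): in Python 2, d.items() built a full list and d.iteritems() returned a lazy iterator, while in Python 3 iteritems() is gone and items() returns a lightweight view.

```python
d = {"a": 1, "b": 2}

# this idiom works the same on both Python 2 and Python 3
for key, value in d.items():
    print(key, value)

# dict views compose naturally with comprehensions
squares = {k: v * v for k, v in d.items()}
```

Code that relied on Python 2's iteritems()/iterkeys()/itervalues() has to fall back to the plain items()/keys()/values() names when ported.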

Karan_Saxena (italian mars society)

This period has been phenomenal.

1) I am finally able to process Full HD RGBA frames in OpenCV.
2) Using PykinectTk, the estimation of user movements is underway.

Onwards and Upwards!!

Leland Bybee (Statsmodels)

Examples

At this point a version of the distributed estimation code is complete, and it is worth spending some time detailing how it can be used through some examples. To use the distributed estimation code you need to initialize a DistributedModel instance by providing a generator to produce exog and endog for each machine. Additionally, the estimation method as well as the join method need to be provided, along with any corresponding arguments. The following example shows how this works for OLS using the debiasing procedure:

import numpy as np
from statsmodels.base.distributed_estimation import DistributedModel

def _exog_gen(exog, partitions):
    """partitions exog data"""

    n_exog = exog.shape[0]
    n_part = np.ceil(n_exog / partitions)

    ii = 0
    while ii < n_exog:
        jj = int(min(ii + n_part, n_exog))
        yield exog[ii:jj, :]
        ii += int(n_part)

def _endog_gen(endog, partitions):
    """partitions endog data"""

    n_endog = endog.shape[0]
    n_part = np.ceil(n_endog / partitions)

    ii = 0
    while ii < n_endog:
        jj = int(min(ii + n_part, n_endog))
        yield endog[ii:jj]
        ii += int(n_part)

debiased_mod = DistributedModel(zip(_endog_gen(y, m), _exog_gen(X, m)), m,
                                model_class=OLS,
                                estimation_method=_est_debiased,
                                join_method=_join_debiased)


Note that this is actually the default for DistributedModel, so

debiased_mod = DistributedModel(zip(_endog_gen(y, m), _exog_gen(X, m)), m)


would give the same thing. Then to fit the model you just call

debiased_params = debiased_mod.fit(fit_kwds={"alpha": 0.2})


fit_kwds needs to be specified (in this case for the regularization procedure) because we don’t want to constrain the fit procedures that are allowed. To get an idea of what a slightly more complicated DistributedModel might look like, consider the set up for logistic regression:

debiased_mod = DistributedModel(zip(_endog_gen(y, m), _exog_gen(X, m)), m,
                                model_class=GLM,
                                init_kwds={"family": Binomial()})


At this point it is probably worth noting that in the examples above everything is going to be run sequentially: each partition is handled in sequence. The use case here is instances where a data set is too large to fit into memory. However, we also have support for truly distributed estimation that includes parallel computing. Currently, this is all handled through joblib. Which distributed estimation method is used is controlled by the parallel_method argument to fit:

joblib_params = mod.fit(parallel_method="joblib", fit_kwds={"alpha": 0.2})


To explicitly use the sequential approach, set parallel_method="sequential". One nice thing about using joblib is that it allows for some flexibility in the backend used. This means that if you have a computing cluster that is set up with something like distributed, you can use that as well. For an example with distributed:

from joblib.parallel import parallel_backend, register_parallel_backend
from distributed.joblib import DistributedBackend

register_parallel_backend('distributed', DistributedBackend)
backend = parallel_backend('distributed')

joblib_params = mod.fit(parallel_method="joblib",
                        parallel_backend=backend,
                        fit_kwds={"alpha": 0.2})


To wrap up, I wanted to include a couple of plots that focus on the debiasing procedure, to show how it can perform compared to a naive averaging approach and against the global lasso estimate. There are two plots: the first shows the performance in L2 for different values of N with a fixed m (number of machines) and fixed p (number of variables); the second shows the same thing but for a fixed N and variable m. When N is fixed it is 1000, when m is fixed it is 5, and when p is fixed (always) it is 100. It is also worth noting that thresholding was done on the debiased parameters, as recommended by the source paper; this also gives an example of a case where join_kwds are used:

debiased_mod = DistributedModel(zip(_endog_gen(y, m), _exog_gen(X, m)), m,
                                join_kwds={"threshold": 0.1})
debiased_params = debiased_mod.fit(fit_kwds={"alpha": 0.2})


In both cases the results shown by the plots make sense. Both the naive and debiased procedures improve as N increases, but the debiased one converges faster to the global estimate. Similarly, both deteriorate as m increases, but the naive procedure consistently does worse.

mike1808 (ScrapingHub)

GSOC 2016 #4: Sneaky bug

In this blog post I want to tell you about a bug in Splash which went undiscovered for almost a year.

First meeting

The past two weeks I was working on the HTMLElement class, which makes working with HTML DOM elements easier. I thought I had almost finished it, but when I started to write tests a strange thing happened. In one test I selected several DOM elements using splash:select and asserted their HTML node types. I had 5 different elements: p, input, div, span and button.

function main(splash)
    assert(splash:go(splash.args.url))
    assert(splash:wait(0.5))

    local p = splash:select('p')
    local input = splash:select('input')
    local div = splash:select('div')
    local span = splash:select('span')
    local button = splash:select('button')

    return {
        p = p:node_property('nodeName'):lower(),
        input = input:node_property('nodeName'):lower(),
        div = div:node_property('nodeName'):lower(),
        span = span:node_property('nodeName'):lower(),
        button = button:node_property('nodeName'):lower(),
    }
end


The weird thing was that the actual types returned by Splash were p, button, button, button, button.

As you can see, the test failed. Also, only the first element had the correct type; the other ones had the type of the last element. To test that, I tried swapping some splash:select calls. The result was the same: the first value was correct and the other ones had the type of the last splash:select.

Investigation

After some thought I assumed that the issue was in some method becoming the same (static) for all instances of HTMLElement or _ExposedElement. I examined both classes but didn’t find any strange initialization which overrides the class methods. To confirm my thoughts, I logged every splash:select and element:node_property call to see the instance on which these methods are called. It turned out that only the first and the last instances of _ExposedElement were used. So the issue was in the function that calls these methods.

Where are those functions called from? From Lua. For a moment I thought that our Lua runner (lupa) was broken (it does have some unfixed bugs), but that idea was thrown away quickly. I wondered if this bug was in our Lua wrapper code, in which case it must show itself somehow. At that moment the only other thing that could go wrong was the return value of splash:call_later, because it creates an instance of _ExposedTimer, which is the only class that can be created as many times as you want (on the contrary, the Splash, Response, Request and Extras classes are created once during the Lua script execution). I initialized several timers and wrote a simple test to check whether my assumption about the bug was right. And it was confirmed: the bug is in our Lua wrappers, because I got the same issue with the instances of _ExposedTimer.

I started examining methods of wraputils.lua and noticed several strange things:

1. Metamethods are initialized on the prototype table after each setup_property_access call.

2. In those metamethods for getters/setters we use self, but the other properties are retrieved from and assigned to cls.

So what was happening? Why did the first splash:select element work correctly while the later ones (except the last) did not? The answer is now pretty obvious. During the first splash:select call the metamethods for Element had not been set yet, so they were not invoked and everything worked as it should. After that first call, however, the metamethods are in place, so on every subsequent call they fire when we assign methods to the new instance of Element, and the __newindex metamethod writes those methods onto the shared Element prototype. So when executing span:node_property('nodeName'), the __index metamethod actually resolves to Element:node_property, i.e. whatever the last splash:select wrote there.

Solution

Once you understand why this happens, the solution comes to mind quickly: assign the getters/setters to self and call rawget and rawset on self, which is what my PR does.
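Lua aside, the same class of bug is easy to reproduce in Python (a hedged analogue of my own, not Splash's actual code): if attribute writes get redirected to the class/prototype instead of the instance, every instance silently shares the last value written, which is exactly what the Element metamethods were doing.

```python
class Broken:
    # Buggy: writes go to the class (the "cls"/prototype in the Lua wrapper),
    # so every instance ends up sharing the last value written.
    def __setattr__(self, name, value):
        setattr(type(self), name, value)

class Fixed:
    # Correct: write straight into the instance, the Python equivalent of
    # calling rawset on `self` instead of on the prototype table.
    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)

a, b = Broken(), Broken()
a.tag = 'p'
b.tag = 'button'                 # clobbers a.tag too
broken_tags = (a.tag, b.tag)     # ('button', 'button')

c, d = Fixed(), Fixed()
c.tag = 'p'
d.tag = 'button'
fixed_tags = (c.tag, d.tag)      # ('p', 'button')
```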

Conclusion

It was a very interesting bug. While working on it I learned a lot about how OOP and metamethods work in Lua. I hope I'll meet this kind of challenging task again in my future work on Splash.

Time Varying Parameters

Let's consider the following process (the equations were images in the original post; this is the standard Kim and Nelson formulation):

    y_t = x_t' * beta_t + e_t        (1)
    beta_t = beta_{t-1} + v_t        (2)

Here y_t is an observed process, x_t is an exogenous vector, and beta_t is a vector of so-called time-varying parameters, which evolve as a random walk, as equation (2) states. e_t and v_t are white noise terms.
This model is known as the Time-Varying-Parameter (TVP) model, and it was a part of my proposal. As you can see, it is non-switching, but it is used to obtain good starting parameters for switching-model likelihood optimization.
The TVP and MS-TVP models turned out to be the easiest and most pleasant items of my proposal. Thanks to their simplicity I didn't have any difficulties implementing and debugging them, and during MLE their parameters converged nicely to the expected values.
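To make the filtering behind these models concrete, here is a minimal pure-NumPy sketch of a Kalman filter for the TVP model above. This is my own illustrative code under a roughly diffuse prior, not the statsmodels implementation; the actual tvp.py extends the MLEModel class instead.

```python
import numpy as np

def tvp_kalman_filter(y, X, sigma2_e, sigma2_v):
    """Kalman filter for y_t = x_t' beta_t + e_t, beta_t = beta_{t-1} + v_t.
    Returns the filtered coefficient paths and the Gaussian log-likelihood."""
    n, k = X.shape
    beta = np.zeros(k)            # filtered state E[beta_t | data up to t]
    P = np.eye(k) * 1e6           # large prior covariance ("diffuse-ish")
    Q = np.eye(k) * sigma2_v      # random-walk innovation covariance
    betas = np.empty((n, k))
    loglike = 0.0
    for t in range(n):
        # prediction: beta_{t|t-1} = beta_{t-1|t-1}, covariance grows by Q
        P_pred = P + Q
        x = X[t]
        err = y[t] - x @ beta                  # one-step-ahead forecast error
        f = x @ P_pred @ x + sigma2_e          # forecast error variance
        K = P_pred @ x / f                     # Kalman gain
        beta = beta + K * err                  # measurement update
        P = P_pred - np.outer(K, x) @ P_pred
        loglike += -0.5 * (np.log(2 * np.pi * f) + err ** 2 / f)
        betas[t] = beta
    return betas, loglike
```

The returned coefficient paths are exactly what a plot_coefficient-style method would draw against time.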

TVP: Implementation and testing

The TVP model was implemented in the upper-level statespace module (tvp.py file), rather than in regime_switching. The implementation is a concise extension of the MLEModel class. I used Kim and Nelson's (1989) model of changing conditional variance or uncertainty in U.S. monetary growth ([1], chapter 3.4) as a functional test and for an iPython notebook demonstration.
A special thing about TVP is that its MLE results class (TVPResults) has a plot_coefficient method, which draws a nice plot of the time-varying parameters as they change over time.

Heteroskedastic disturbances

Adding heteroskedastic disturbances to observation equation (1) makes the model regime-switching: the disturbance variance becomes regime-dependent, sigma^2_{S_t}, where S_t is a Markov regime process.

MS-TVP: Implementation and testing

The TVP model with heteroskedastic disturbances is implemented in the switching_tvp.py file of the regime_switching module. It is as concise and elegant as its non-switching analog. I'm going to implement coefficient plotting soon.
I used Kim's (1993) time-varying-parameter model with heteroskedastic disturbances for U.S. monetary growth uncertainty to perform functional testing. One nice thing about MS-TVP is that it finds a near-correct likelihood maximum from a non-switching start: as you can see in the tests.test_switching_tvp.TestKim1993_MLEFitNonswitchingFirst class, I use a 0.05% relative tolerance.

What's next?

The remaining part of the summer will be about improving and polishing the existing models. Right now I am working on adding heteroskedastic disturbances to transition equation (2). As I noted above, I also have to add coefficient plotting for the switching model. Other goals are an MS-TVP notebook demonstration and overall improvement of the MS-AR model.

Literature

[1] "State-space Models With Regime Switching" by Chang-Jin Kim and Charles R. Nelson.

Riddhish Bhalodia (dipy)

Brain Extraction Explained!

As promised I will outline the algorithm we are following for the brain extraction using a template, which is actually a combination of elements taken from [1] and [2].

Step 1

Read the input data, the input affine information, the template data, the template mask, and the template affine information.

Step 2

We perform registration of the template data onto the input data; this involves two sub-steps.

(2.a) Affine Registration

Perform the affine registration of the template onto the input and obtain the transformation matrix, which will be used in the next step.

(2.b) Non-Linear Registration (Diffeomorphic Registration)

Using the above affine transformation matrix as the pre-alignment, we perform diffeomorphic registration of the template onto the input.

These two steps get most of the job done! (This is also the approach followed in [2].)

Step 3

We use the transformed template and the input data in a non-local patch-similarity method to assign labels to the input data; this part is taken from [1].
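As a rough illustration of the idea in Step 3 (my own simplified NumPy sketch, not the code from [1] or from the actual branch), each input voxel can be labeled by comparing its patch to nearby patches of the registered template and weighting the template mask values by patch similarity:

```python
import numpy as np

def patch_label_fusion(input_img, template_img, template_mask,
                       patch_radius=1, search_radius=1, h=0.1):
    """Label every voxel of input_img by comparing its local patch with
    nearby patches of the (already registered) template, weighting the
    template mask values by an exp(-SSD / h) patch similarity."""
    p, s = patch_radius, search_radius
    pad = p + s
    inp = np.pad(np.asarray(input_img, float), pad, mode='edge')
    tmp = np.pad(np.asarray(template_img, float), pad, mode='edge')
    msk = np.pad(np.asarray(template_mask, float), pad, mode='edge')
    out = np.zeros(np.asarray(input_img).shape)
    for idx in np.ndindex(*out.shape):
        c = tuple(i + pad for i in idx)
        ref = inp[tuple(slice(ci - p, ci + p + 1) for ci in c)]
        num = den = 0.0
        # scan the search window around the corresponding template voxel
        for off in np.ndindex(*(2 * s + 1,) * out.ndim):
            d = tuple(ci + oi - s for ci, oi in zip(c, off))
            cand = tmp[tuple(slice(di - p, di + p + 1) for di in d)]
            w = np.exp(-np.mean((ref - cand) ** 2) / h)
            num += w * msk[d]
            den += w
        out[idx] = num / den       # similarity-weighted vote of labels
    return out > 0.5               # boolean brain mask
```

The real implementation is of course vectorized and 3D; this pure-Python version is only meant to show the weighting scheme.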

This is it! The branch for brain extraction is here

Experiments and Results

I am currently experimenting with the NITRC IBSR data, which comes with a manual brain extraction. This will help me validate the correctness of the algorithm.

Next Up…

• Functional tests for the brain extraction process
• More datasets, even the harder ones
• Refining the code
• Better measure for validation

References

[1] “BEaST: Brain extraction based on nonlocal segmentation technique”
Simon Fristed Eskildsen, Pierrick Coupé, Vladimir Fonov, José V. Manjón, Kelvin K. Leung, Nicolas Guizard, Shafik N. Wassef, Lasse Riis Østergaard and D. Louis Collins. NeuroImage, Volume 59, Issue 3, pp. 2362–2373.
http://dx.doi.org/10.1016/j.neuroimage.2011.09.012

[2] “Optimized Brain Extraction for Pathological Brains (OptiBET)”
Evan S. Lutkenhoff, Matthew Rosenberg, Jeffrey Chiang, Kunyu Zhang, John D. Pickard, Adrian M. Owen, Martin M. Monti. PLOS ONE, December 16, 2014.
http://dx.doi.org/10.1371/journal.pone.0115551

Aakash Rajpal (italian mars society)

Oculus working yaay

Hey everyone, Sorry for this late post.

I was busy setting up the Oculus. It has been a pain, but in the end a sweet one :p. A week earlier I was down, thinking of even quitting the program. I had my code ready to run, but it just wouldn't show up on the Oculus. I was lost, but somewhere inside I knew I could do it. So I got up one last time, sat through the day tweaking my code, tweaking the Blender Game Engine, changing the configuration for the Oculus, and at last: Bazzingaa.

Thank God, I said to myself, and eventually my code was running on the Oculus :p.

Here is a link to the DEMO VIDEO GSOC

Levi John Wolf (PySAL)

A Post-SciPy Chicago Update

After a bit of a whirlwind, going to SciPy and then relocating to Chicago for a bit, I figure I’ve collected enough thoughts to update on my summer of code project, as well as some of the discussion we’ve had in the library recently.

I’ve actually seen a lot of feedback on quite a bit of my postings since my post on handling burnout as a graduate student. But, I’ve been forgetting to tag posts so that they’d show up in the GSOC aggregator! Bummer!

The Great Divide

Right before SciPy, a contributor suggested that it might be a reasonable idea to split the library up into independent packages. Ostensibly motivated by this conversation on Twitter, the suggestion highlighted a few issues (I think) with how PySAL operates on a normative level, on a procedural level, and in our code. This is an interesting suggestion, and I think it has a few very strong benefits.

Lower Maintenance Surface

Chief among the benefits is that minimizing the maintenance burden makes academic developers much more productive. This is something I'm actually baffled by in our current library. I understand that technical debt is hard to overcome and that some parts of the library might not exist had we started now rather than five years ago. But it's so much easier to swap in ecosystem-standard packages than to continue maintaining code that few people understand. This is all the more true when you recognize that our library does, in many places, exhibit effective use of duck typing. The barrier to us using something like pygeoif or shapely as a computational geometry core is primarily mental, and converting the library to drop or wrap unnecessary code in cg, weights, and core would take less than a week of full-time work. And it'd strongly lower the maintenance footprint of the library, which I think is a central benefit of the split-package suggestion.

Plus, splitting the library into many, more loosely-coupled packages seems like a stroke towards the R-style ecosystem, which is exactly what the linked Twitter thread suggests. But I think that R actually has some comfy structural incentives for the drivers of its ecosystem to do what they do. Since an academic can make a barely-maintained package that does some unique statistical operation and get a Journal of Statistical Software article out of it, the academic-heavy ecosystem in R is angled towards this kind of development. And indeed, with a very small maintenance surface, these tiny packages get shipped, placed on a CV, and occasionally updated. Thus, the social incentives align to generate a particular technical structure, something I think Hadley overstates in that brief conversation as a product of object-oriented programming. While OO isn't a perfect abstraction, I'm kind of done with blaming OO for everything I don't like, and I think the claim that OO encourages monolithic packages is, on its face, not a necessary conclusion. It comes down to defining efficient interfaces between classes and exposing a consistent, formal API. I don't really think it matters whether that API is populated or driven using functions & immutable data or objects & bound methods. Closures & objects are two sides of the same coin, really. Mostly, though, thinking that the social & technical differences in R and Python package development can be explained through quick recourse to OO vs. FP (when I bet the majority of academic package developers don't even deeply understand OO or FP) is flippant at best. I really think more of it is the structure of academic rewards, and the predominance of academics in the R ecosystem.

But that’s an aside. More generally, fragmenting the library would make it easier for new contributors to derive academic credit from their contributions.

Cleaner Dependency Logic

I think many of the library developers also feel limited by the strict adherence to a minimal set of dependencies, namely scipy and numpy. By splitting the package up into separate modules with potentially different dependency requirements, we legitimate contributors who want to provide new stuff with flashy new packages.

To be clear, I think the way we do this right now is somewhat frustrating. If a contribution uses only SciPy & NumPy and is sufficiently integrated into the rest of the library, it gets merged into “core” pysal. If it uses “extra” libraries but is still relevant to the project, we merge it into a module, contrib. This catch-all module contains some totally complete code from younger contributors, like the spint module for spatial interaction models or my handler module for formula-based spatial regression interfaces, as well as code from long-standing contributors, like the viz module. But it also contains incomplete remnants of prior projects, put in contrib to make sure they weren't forgotten. And, to make matters worse, none of the stuff in contrib is exercised by our continuous integration framework. So even if an author writes test suites, they aren't run routinely, meaning that the compatibility clock is ticking every time code is committed to the module. Since it isn't unit tested, and documentation & quality standards aren't the same as for code in core, it's often easier to rewrite from scratch when something breaks. Thus, fragmenting the package would “liberate” the packages in contrib that meet core's quality standards but have extra dependencies.

But why is this necessary?

Of course, we can do much of what fragmentation provides technologically using soft dependencies. At the module level, it’s actually incredibly easy. But, I have also built tooling to do this at the class/function level, and it works great. So, this particular idea about having multiple packages doesn’t solve what I think is fundamentally a social/human problem.
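For readers unfamiliar with the pattern, a module-level soft dependency is just a guarded import. Here is a generic sketch (my own illustration, not PySAL's tooling; `not_a_real_package` is a placeholder name for any optional dependency):

```python
# Module-level soft-dependency pattern: import if available, and fail
# loudly only when the optional feature is actually used.
try:
    import not_a_real_package as optional_pkg  # placeholder optional dep
    HAS_OPTIONAL = True
except ImportError:
    optional_pkg = None
    HAS_OPTIONAL = False

def fancy_feature(data):
    """A routine that needs the optional package to run."""
    if not HAS_OPTIONAL:
        raise ImportError(
            "install not_a_real_package to use fancy_feature")
    return optional_pkg.process(data)
```

The rest of the module imports and works normally; only fancy_feature is gated on the extra dependency.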

The rules we’ve built around contribution do not actively support using the best tools for the job. Indeed, the social structure of two-tiered contribution, where the second tier has incredibly heterogeneous quality, intent, and no support for coverage/continuous integration testing, inhibits code reuse and magnifies not-invented-here syndrome intensely. We can’t exploit great packages like cytools, have largely avoided merging code that leverages improved computational runtimes (using numba & cython), and haven’t really (until my GSOC) programmed around pandas as a valid interaction method to the library.

Most of the barriers to this are, as I mentioned above, mental and social, not technical. Our code can be well-architected, even though we've implemented special structures to do things that are more commonly (and sometimes more efficiently) solved in other packages or using other techniques.

And, there’s some freaking cool stuff going on involving PySAL. Namely, the thing that’s been animating me is its use in Carto’s Crankshaft, which integrates some PySAL tooling into a PL/Python plugin for Postgres. They’ll be exposing our API (or a subset of it) to users through this wrapper, and that feels super cool! So, we’ve got good things going for our library. But, I think that continued progress needs to address these primarily social concerns, because the code, technologically, I think is more sound than one could expect from full-time academic authors.

July 24, 2016

shrox (Tryton)

Refactoring

Right now I am working on refactoring the code that I have written so far.

I need to simplify functions, make them easier to understand, and make sure that my code conforms to the standards of the Tryton and relatorio codebases.

mr-karan (coala)

For the last two weeks I have been busy making some more bears for coala. I found:

• write-good, which helps in writing good English documentation and checks text files for common English mistakes. I really liked this tool, so I thought to wrap it in a linter bear and implemented WriteGoodLintBear.

• happiness, which, as the name suggests, lints JS files for common syntax and semantic errors and conforms to a style that is well defined in their docs, and is actually the style I like for JavaScript files. It's a fork of standard, another style guide, but happiness makes a few changes I prefer, so I wrapped it and implemented HappinessLintBear.

• httpolice, which is a linter for HTTP requests and responses. It can be used on a HAR file: if you go to the Developer Tools of your browser, head over to the Network tab, and right-click to save a request as a HAR file, this tool can then lint that file. We didn't have anything of this kind in coala-bears yet, so I thought to wrap it and implement HTTPoliceLintBear. There have been some issues with the lxml dependencies on AppVeyor, and I'm figuring out how to solve them so that the tests pass and this PR also gets merged.
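The general shape of a linter wrapper like these bears is simple: run the executable and parse its output with a regex. Here is a generic, self-contained sketch of that idea (illustrative names and a made-up "line:column: message" format; this is not coala's actual Linter API):

```python
import re
import subprocess

# Assumed output format "line:column: message", one finding per line.
OUTPUT_REGEX = re.compile(r'(?P<line>\d+):(?P<col>\d+):\s*(?P<message>.+)')

def parse_lint_output(text):
    """Turn raw linter output into (line, column, message) tuples."""
    results = []
    for raw in text.splitlines():
        m = OUTPUT_REGEX.match(raw.strip())
        if m:
            results.append((int(m.group('line')),
                            int(m.group('col')),
                            m.group('message')))
    return results

def run_linter(executable, filename):
    """Invoke the external linter binary and parse what it prints."""
    proc = subprocess.run([executable, filename],
                          capture_output=True, text=True)
    return parse_lint_output(proc.stdout)
```

A real bear adds settings handling and result objects on top, but the run-then-parse core is the same.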

Future Work

• Syntax Highlighting: There is now some clarity on how to implement this. Until now I have used the highlight class from Pygments, but I'm planning to instead make a new class, ConsoleText, which will help set specific attributes on certain parts of the text, like separating the bullet marks from the string and so on. I plan to work on this extensively so I can complete the task by the end of this week.

• coala-bears website: I'll be starting off with a prototype of the website along with some basic functionality, like filtering the bears on parameters other than the language they support.

Happy Coding!

July 23, 2016

Yen (scikit-learn)

Interactive Cython with IPython, no compilation anymore!

Debugging Cython is sometimes very annoying because, unfortunately, there aren't many blog posts or tutorials about Cython on the Internet. We often need to learn it in a trial-and-error manner. To make things even worse, unlike Python, Cython code needs to be compiled every time we make a change, which makes the debugging process even more tedious.

What if we could try out Cython in an IPython notebook, an interactive environment, without an explicit compilation step?

Let’s see how this can be done.

Ipython Notebooks

NOTE: Feel free to skip this section if you are already familiar with it.

First, let's see what an IPython notebook is.

An IPython notebook is a powerful interactive shell, which lets you write and execute Python code in your web browser. Therefore, it is very convenient to tweak your code and execute it in bits and pieces with IPython. Besides that, it also has great support for interactive data visualization and use of GUI toolkits. For these reasons, IPython notebooks are widely used in scientific computing.

For more installation details and tutorials, please see this site.

Cython Problem

Traditionally, we rely on the distutils module to compile Cython code, which gives us full control over every step of the process. The main drawback of this approach is that it requires a separate compilation step. This is definitely a disadvantage, since one of Python's strengths is its interactive interpreter, which allows us to play around with code and test how something works before committing it to a source file.

Well, don’t worry, IPython notebook is here to save us!

%%cython Magic

IPython integrates Cython flawlessly via some convenient commands that let us use Cython interactively from a live IPython session. These IPython-specific commands are called magic commands, and they start with either a single (%) or double (%%) percent sign. They provide functionality beyond what the plain Python interpreter supplies. IPython has several magic commands for dynamic compilation of Cython code; see here for more details.

Before we can use these magic Cython commands, we first need to tell IPython to load them. We do that with the %load_ext magic command, either from the IPython interactive interpreter or in an IPython notebook cell:

In [1]: %load_ext Cython


There will be no output if %load_ext is successful, and IPython will issue an error message if it cannot find the Cython-related magics.

Great! Now we can use Cython from IPython via the %%cython magic command:

In [2]: %%cython
cdef int add(int x, int y):
    return x + y


The %%cython magic command allows us to write a block of Cython code directly in the IPython interpreter. After exiting the block with two returns, IPython will take the Cython code we defined, paste it into a uniquely named Cython source file, and compile it into an extension module. If compilation is successful, IPython will import everything from that module to make the function we defined available in the IPython interactive namespace. The compilation pipeline is still in effect, but it is all done for us automatically. We can now call the function we just defined:

In [3]: add(1, 2)


Cool! Now IPython will print the result of your function, i.e., 3, under this block of code.

Generated C code

Sometimes it is good practice to inspect the generated C source files to check the sanity of our program. The generated source files are located in the $IPYTHONDIR/cython directory (~/.ipython/cython on an OS X or *nix system). The module names are not easily readable because they are formed from the md5 hash of the Cython source code, but all the contents are there.

Summary

I really wish I had known these convenient tips for debugging Cython code when I first learned Cython; they can save tons of your time and spare your fingers a lot of effort :) Let's run Cython code without overhead!

Raffael_T (PyPy)

Progress async and await

It's been some time, but I have made quite some progress on the new async feature of Python 3.5! There is still a bit to be done, though, and the end of this year's Google Summer of Code is pretty close already. Whether I can finish in time will mostly be a matter of luck, since I don't know how much I will still have to do in order for asyncio to work. The module depends on many new features from Python 3.3 up to 3.5 that have not been implemented in PyPy yet.

Does async and await work already?

Not quite. PyPy now accepts async and await though, and checks pretty much all places where they are allowed and where they are not. In other words, the parser is complete and has been tested. The code generator is complete as well, so the right opcodes get executed in all cases. The new bytecode instructions I need to handle are GET_YIELD_FROM_ITER, GET_AWAITABLE, GET_AITER, GET_ANEXT, BEFORE_ASYNC_WITH and SETUP_ASYNC_WITH. These opcodes do not work with regular generators, but with coroutine objects. Those are based on generators, however they do not implement __iter__ and __next__ and can therefore not be iterated over. Also, generators and generator-based coroutines (@asyncio.coroutine in asyncio) cannot yield from coroutines. [1]

I started implementing the opcodes, but I can only finish them after asyncio is working, as I need to test them constantly and can only do that with asyncio, because I am unsure what values normally lie on the stack. The same is true for some functions of coroutine objects. Coroutine objects are working, but they are missing a few functions needed for the async/await syntax feature. These two things are all that remains, though; everything else is tested and should therefore work.

What else has been done?

Only implementing async and await would have been too easy, I guess. With it comes a problem I already mentioned: the missing dependencies from Python 3.3 up to 3.5. The module sre (which offers support for regular expressions) was missing a macro named MAXGROUPS (from Python 3.3), and the magic number standing for the number of constants had to be updated as well. The memoryview objects also got an update from Python 3.3 that is needed for an import: a function called "cast", which converts memoryview objects to any other predefined format. I just finished implementing this as well; now I am at the point where it says inside threading.py:

_set_sentinel = _thread._set_sentinel
AttributeError: 'module' object has no attribute '_set_sentinel'

What to do next?

My next goal is to get asyncio working and the new opcodes implemented. Hopefully I can write about success in my next blog post, because I am sure I will need some time to test everything afterwards.

A developer tip for executing asyncio in pyinteractive (--withmod): I only write this as a hint because it is easily skipped in the PyPy doc, or at least that happened to me (the PyPy team has already thought about a solution for that though :) ). Asyncio needs some modules in order to work which are not loaded by default in pyinteractive. If someone stumbles across the problem where PyPy cannot find these modules, --withmod does the trick [2]. For now, --withmod-thread and --withmod-select are required.

[1] https://www.python.org/dev/peps/pep-0492/
[2] http://doc.pypy.org/en/latest/getting-started-dev.html#pyinteractive-py-options

Update (23.07.): asyncio can be imported and works! Well, that went better than expected :) For now only the @asyncio.coroutine way of creating coroutines works, so for example the following code would work:

import asyncio

@asyncio.coroutine
def my_coroutine(seconds_to_sleep=3):
    print('my_coroutine sleeping for: {0} seconds'.format(seconds_to_sleep))
    yield from asyncio.sleep(seconds_to_sleep)

loop = asyncio.get_event_loop()
loop.run_until_complete(
    asyncio.gather(my_coroutine())
)
loop.close()

(from http://www.giantflyingsaucer.com/blog/?p=5557)

And to illustrate the goal of my project, here is an example of what I want to work properly:

import asyncio

async def coro(name, lock):
    print('coro {}: waiting for lock'.format(name))
    async with lock:
        print('coro {}: holding the lock'.format(name))
        await asyncio.sleep(1)
        print('coro {}: releasing the lock'.format(name))

loop = asyncio.get_event_loop()
lock = asyncio.Lock()
coros = asyncio.gather(coro(1, lock), coro(2, lock))
try:
    loop.run_until_complete(coros)
finally:
    loop.close()

(from https://docs.python.org/3/whatsnew/3.5.html#whatsnew-pep-492)

The async keyword replaces @asyncio.coroutine, and await is written instead of yield from. "async with" and "async for" are additional features, allowing execution to be suspended in the "enter" and "exit" methods of an asynchronous context manager, and allowing iteration over asynchronous iterators, respectively.

Ramana.S (Theano)

Second Month Blog

Hello there,

The work on the GraphToGPU optimizer was finally merged into the master of Theano, giving the bleeding edge an approx 2-3x speedup. Well, that is a huge thing. The compilation time for the graph in FAST_COMPILE mode still had one small bottleneck, which came from local_cut_gpu_transfers.
The nodes introduced into the graphs followed the patterns host_from_gpu(gpu_from_host(host_from_gpu(Variable))) and gpu_from_host(host_from_gpu(gpu_from_host(Variable))). This caused the slowdown of local_cut_gpu_transfers, and when we investigated where these patterns were created, they were found to come from one of the AbstractConv2d optimizers. We (Fred and I) spent some time filtering out these patterns, but we finally concluded that this speedup wouldn't help as much as the effort required, and dropped the idea for now.

There was also some work done on caching Op instances from the base Op class, so that constructing an Op with the same parameters doesn't recreate an instance that already exists. I tried to implement the caching in the Op class using a singleton, and verified that instances with the same parameters are not recreated. But there are a few problems that require some higher-level refactoring. Currently the __call__ method of an Op is implemented in PureOp, which, when making the call to make_node, does not identify and pass all the parameters correctly. This parameter-passing issue would hopefully be resolved if all the Ops in Theano supported __props__, which would let me access the _props_dict and pass the parameters directly instead of using the generalized, unconventional way via *args and **kwargs. Currently, most of the Ops in the old backend do not have __props__ implemented, so they cannot make use of the _props_dict. There are a few roadblocks to this: instances of Elemwise require a dict as a parameter, which is of unhashable type, so __props__ could not be implemented for them. Early this week, work will begin on making that parameter a hashable type, paving the way for both of these PRs to get merged. Once they are merged, there should be at least a 0.5x speedup in optimization time.

Finally, work has begun on implementing a CGT-style optimizer.
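Returning to the Op-caching idea above, the singleton approach can be sketched in plain Python roughly like this (illustrative names of my own, not Theano's actual code). Note that the cache key requires hashable parameters, which is exactly the problem with Elemwise's dict parameter:

```python
class CachedOpMeta(type):
    """Metaclass that returns an existing instance when a class is
    constructed twice with equal (hashable) parameters."""
    _cache = {}

    def __call__(cls, *args, **kwargs):
        # kwargs values must be hashable for this key to work --
        # a dict parameter (as in Elemwise) would break it.
        key = (cls, args, tuple(sorted(kwargs.items())))
        if key not in CachedOpMeta._cache:
            CachedOpMeta._cache[key] = super().__call__(*args, **kwargs)
        return CachedOpMeta._cache[key]

class AddN(metaclass=CachedOpMeta):
    """Toy 'Op' parametrized by a constant n."""
    def __init__(self, n):
        self.n = n

    def __call__(self, x):
        return x + self.n
```

With this, `AddN(3) is AddN(3)` holds, while `AddN(3)` and `AddN(4)` remain distinct objects.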
This new optimizer works in topological-sort order. In Theano, it is being implemented as a local optimizer aimed at replacing the canonicalize phase. Currently Theano optimizes each node only "once". The main advantage of the new optimizer is that it can optimize a node more than once, by trying all the applicable optimizations on the node until none of them apply: it applies an optimization to a node, then again tries all the optimizations on the newer (modified) node, and so on. There is one drawback to this approach: after two optimizations have been applied, the node being replaced no longer has the fgraph attribute, so optimizations that require this attribute cannot be tried. An example of the new optimizer at work:

Current Theano master: x ** 4 = T.sqr(x ** 2)
This branch: x ** 4 = T.sqr(T.sqr(x))

The drawback of this branch is that we won't be able to do this type of speedup from x ** 8 onwards. When profiled with the SBRNN network, the initial version of the draft seems to give an approx 20-second speedup. Isn't that a good start? :D

That's it for now folks! :)

July 22, 2016

Avishkar Gupta (ScrapingHub)

Formalising the Benchmark Suite, Some More Unit Tests and Backward Compatibility Changes

In the past two weeks I focused my efforts on finalizing the benchmarking suite and improving test coverage. From what Codecov says, we're 83% of the way there regarding test coverage. As far as the performance of the new signals is concerned, the testing shows that the new signal API always takes less than half the time required by the old signal API, both for signal connection and for the actual sending of the signal. This is attributed mostly to the fact that running a combo of the getAllReceivers and liveReceivers functions every time was previously the bottleneck of the process.
As it currently stands, we're not using the caching mechanism of the library, i.e. we always have use_caching set to false, because receivers which connect not to a specific sender but to all senders require me to find a suitable key for them that can be weakref'd to make the entry in the WeakKeyDictionary. But enough about that, back to benchmarking. Djangobench, the Django benchmarking library, does not currently benchmark signals; that is still on the TODO list of that project. They did, however, provide some excellent modules that I used to write the Scrapy benchmarking suite for signals. I would leave a link to it here, but I'm currently discussing with my mentor where to include these benchmarks, as including them in the repo would require that we keep PyDispatcher as a dependency, since it is needed to perform a raw apples-to-apples comparison of the signal code. In this post I'm also sharing results that I got using Robert Kern's line_profiler module.

As for the compatibility changes this cycle, I added support for the old-style Scrapy signals, which were just standard Python objects. Similarly to how I implemented backward compatibility for receivers without keyword arguments, I proxied the signals through the signal manager to implement backward compatibility for the objects. With that, the new signals can be safely integrated into Scrapy with no worries about breaking legacy code. In the coming weeks, I plan to finish test coverage, maybe add some signal benchmarks to scrapy bench, and work on documentation.

Nelson Liu (scikit-learn)

(GSoC Week 8) MAE PR #6667 Reflection: 15x speedup from beginning to end

If you've been following this blog, you'll notice that I've been talking a lot about the weighted median problem, as it is intricately related to optimizing the mean absolute error (MAE) impurity criterion.
The scikit-learn pull request I was working on to add the aforementioned criterion to the DecisionTreeRegressor, PR #6667, has received approval from several reviewers for merging. Now that the work for this PR is complete, I figure it's an apt time to present a narrative of the many iterations it took to converge to our current solution for the problem.

Iteration One: Naive Sorting

The Criterion object that is the superclass of MAE has a variety of responsibilities during decision tree construction, primarily evaluating the impurity of the current node and evaluating the impurity of all possible children to find the best next split. In the first iteration, every time we wanted to calculate the impurity of a set of samples (either a node or a possible child), we would sort the set and extract the median from it. After implementing this, I ran some benchmarks to see how fast it was compared to the Mean Squared Error (MSE) criterion currently implemented in the library. I used both the classic Boston housing price dataset and a larger synthetic dataset with 1000 samples of 100 features each. Training was done on 0.75 of the total dataset, and the other 0.25 was used as a held-out test set for evaluation.

Boston Housing Dataset Benchmarks: Iter. 1

MSE time: 105 function calls in 0.004 seconds
MAE time: 105 function calls in 0.175 seconds
Mean Squared Error of Tree Trained w/ MSE Criterion: 32.257480315
Mean Squared Error of Tree Trained w/ MAE Criterion: 29.117480315
Mean Absolute Error of Tree Trained w/ MSE Criterion: 3.50551181102
Mean Absolute Error of Tree Trained w/ MAE Criterion: 3.36220472441

Synthetic Dataset Benchmarks: Iter. 1

MSE time: 105 function calls in 0.089 seconds
MAE time: 105 function calls in 15.419 seconds
Mean Squared Error of Tree Trained w/ MSE Criterion: 0.702881265958
Mean Squared Error of Tree Trained w/ MAE Criterion: 0.66665916831
Mean Absolute Error of Tree Trained w/ MSE Criterion: 0.650976429446
Mean Absolute Error of Tree Trained w/ MAE Criterion: 0.657671579992

This sounds reasonable enough, but we quickly discovered after looking at the numbers that it was intractable; while sorting is quite fast in general, sorting in the process of finding the children was completely unrealistic. For a sample set of size n, we would divide it into n-1 partitions of left and right child and sort each one, on every node. The larger dataset made MSE take 22.25x more time, but it made MAE take 88.11x (!) more time. This result was obviously unacceptable, so we began thinking about how to optimize; this led us to our second development iteration.

Iteration 2: MinHeap to Calculate Weighted Median

In iteration two, we implemented the algorithm I discussed in my week 6 blog post. With this method, we did away with the time spent sorting every sample set for every node and possible child, and instead "saved" sorts, using a modified bubblesort to insert and remove elements from the left and right child heaps efficiently. This algorithm had a substantial impact on the code; rerunning the earlier benchmarks yielded the following results (MSE times varied only by run-to-run noise, and accuracy is unchanged and thus omitted):

Boston Housing Dataset Benchmarks: Iter. 2

MSE time: 105 function calls in 0.004s (was: 0.004s)
MAE time: 105 function calls in 0.276s (was: 0.175s)

Synthetic Dataset Benchmarks: Iter. 2

MSE time: 105 function calls in 0.065s (was: 0.089s)
MAE time: 105 function calls in 5.952s (was: 15.419s)

After this iteration, MAE is still considerably slower than MSE, but it's a definite improvement over naive sorting (especially on the large dataset). I found it interesting that the new method is actually a little slower than the naive method on the relatively small Boston dataset (0.276s vs 0.175s, respectively). My mentors and I hypothesized that this might be due to the cost of creating the WeightedMedianCalculators (the objects that handle the new median calculation), though their efficiency in calculation is supported by the speedup from 15.419s to 5.952s on the larger randomly generated dataset. 5.952 seconds on a dataset with 1000 samples is still slow, though, so we kept going.

Iteration 3: Pre-allocation of Objects

We suspected that there could be a high cost associated with spinning up the objects used to calculate the weighted median. This matters because the majority of the tree code in scikit-learn is written in Cython and runs without the Python GIL (global interpreter lock), which disallows the use of Python objects and functions. The GIL is a mutex that prevents multiple native threads from executing Python bytecode at once, so running without it makes our code a lot faster. However, because our WeightedMedianCalculators are Python objects, we unfortunately need to reacquire the GIL to instantiate them. We predicted that this could be a major source of the bottleneck. As a result, I implemented a reset function in the objects to clear them back to their state at construction, which can be executed without the GIL.
When we run the C-level constructor (it is run at every node, as opposed to the Python constructor, which is run only once), we check whether the WeightedMedianCalculators have been created; if they have not been, we reacquire the GIL and create them. If they have, we simply reset them. This allowed us to reacquire the GIL only once throughout the algorithm, which, as predicted, led to substantial speedups. Running the benchmarks again gave:

Boston Housing Dataset Benchmarks: Iter. 3

MSE time: 105 function calls in 0.009s (was: 0.004s, 0.004s)
MAE time: 105 function calls in 0.038s (was: 0.276s, 0.175s)

Synthetic Dataset Benchmarks: Iter. 3

MSE time: 105 function calls in 0.065s (was: 0.065s, 0.089s)
MAE time: 105 function calls in 0.978s (was: 5.952s, 15.419s)

Based on the speed improvement from these changes, it's reasonable to conclude that a large amount of time was being spent re-acquiring the GIL. With this approach we cut the time spent reacquiring the GIL down to a single acquisition, but ideally we'd like to do it zero times. This led us to our fourth iteration.

Iteration 4: Never Re-acquire the GIL

Constructing the WeightedMedianCalculators requires two pieces of information: n_outputs (the number of outputs to predict) and n_node_samples (the number of samples in this node). We need to create a WeightedMedianCalculator for each output, and the internal size of each should be equal to n_node_samples. We first considered whether we could allocate the WeightedMedianCalculators at the Splitter level (the splitter is in charge of finding the best splits, and uses the Criterion to do so). In splitter.pyx, however, the __cinit__ function (the Python-level constructor) only exposes the value of n_node_samples, and we lack the value of n_outputs.
The opposite is true in criterion.pyx, where the __cinit__ function only sees the value of n_outputs and does not get n_node_samples until C-level init time, hence why we were previously constructing the WeightedMedianHeaps in the init function and could not do it entirely in __cinit__. If we could do it all in __cinit__, we would not have to reacquire the GIL, because __cinit__ operates at the Python level in the first place. So we simply modified the __cinit__ of the Criterion objects to expose the value of n_node_samples, allowing us to allocate the objects at the Python level without ever explicitly reacquiring the GIL. We reran the benchmarks and saw minor improvements in the results:

Boston Housing Dataset Benchmarks: Iter. 4

MSE time: 105 function calls in 0.003s (was: 0.009s, 0.004s, 0.004s)
MAE time: 105 function calls in 0.032s (was: 0.038s, 0.276s, 0.175s)

Synthetic Dataset Benchmarks: Iter. 4

MSE time: 105 function calls in 0.065s (was: 0.065s, 0.065s, 0.089s)
MAE time: 105 function calls in 0.961s (was: 0.978s, 5.952s, 15.419s)

Conclusion

After these four iterations, we managed to get a respectable 15x speed improvement. There's still a lot of work to be done, especially with regard to speed on larger datasets; however, as my mentor Jacob commented, "Perfect is the enemy of good", and those enhancements will come in future (very near future) pull requests. If you have any questions, comments, or suggestions, you're welcome to leave a comment below. Thanks to my mentors Raghav RV and Jacob Schreiber for their input on this problem; we've run through several solutions together, and they are always quick to point out errors and suggest improvements. You're awesome for reading this! Feel free to follow me on GitHub if you want to track the progress of my Summer of Code project, or subscribe to blog updates via email.

aleks_ (Statsmodels)

Bugs, where art thou?
The last few weeks were all about searching for bugs. The two main bugs (both related to parameter estimation) showed up in two cases:

• When including seasonal terms and a constant deterministic term in the vector error correction model (VECM), the estimate of the constant term differed from the one produced by the reference software JMulTi, which was written by Lütkepohl. Interestingly, my results did equal those printed in the reference book (also written by Lütkepohl), so I believe JMulTi got a corresponding update between the release of the book and the release of the software - which would also mean that the author of the reference book made the same mistake as I ;) Basically, the error was the result of a wrong construction of the matrix holding the seasonal dummy variables. Instead of the following pattern (assuming four seasons, e.g. quarterly data):

[[1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, ..., 1, 0, 0, 0],
 [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ..., 0, 1, 0, 0],
 [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, ..., 0, 0, 1, 0]]

1 / (number of seasons) had to be subtracted from each element of the matrix above. This isn't described in the book, and it is not the way I learned to define dummy variables in my lecture on regression analysis. And because the estimates of the seasonal parameters themselves were actually correct (the wrong matrix only had side effects on the constant terms), I kept searching for the bug in a lot of different places...

• A small deviation in certain parameters occurred when deterministic linear trends were assumed to be present. Thanks to an R package for VECM, called tsDyn, I could see that my results exactly matched those produced by tsDyn when specifying linear trends to be inside the cointegration relation in the R package. On the other hand, the tsDyn output equaled that of JMulTi when tsDyn did not treat the linear trend as part of the cointegration relation.
After I had seen that, reimplementing things to produce the JMulTi output in Python was easy. But here, too, I had searched a long time for bugs beforehand. Now I am happy that the code works, even though the bugs have thrown me behind the time schedule. I expect that coding will now continue much more smoothly. The only thing that worries me is the impression the bug hunting made on my supervisors. Being unable to push code while searching for bugs may have looked as if I wasn't doing anything, though I spent hours and hours reading my code and checking it against Lütkepohl's book. So while I was working much more than 40 hours per week in the last few weeks (once even without a single day off), it may have looked completely different. What counts now is that I will continue to give my best in the remaining weeks of GSoC, even if I don't get a passing grade from my supervisors. After all, it's not about the money, it's about being proud of the end product and about knowing that one has given his very best : )

Upendra Kumar (Core Python)

Creating documentation with Sphinx

This week I worked on creating documentation with Sphinx and on a web crawler for my new feature of giving users the option of installing packages from PythonLibs. Sphinx is really a great tool for creating docs. In just a few steps I could create docs, as compared to building a whole website based on Django or using static webpages on GitHub. We can create docs in a few steps:

1. mkdir docs
2. Go to the docs directory and run sphinx-quickstart
3. In docs/source/conf.py, set: sys.path.insert(0, os.path.abspath('../..'))
4. Now run sphinx-apidoc -f -o source/ ../mypackage/
5. Our directory structure should look like this:

myproject/
|-- README
|-- setup.py
|-- myvirtualenv/
|-- mypackage/
|   |-- __init__.py
|   `-- mymodule.py
`-- docs/
    |-- Makefile
    |-- build/
    `-- source/

6.
Finally, run the following command to create HTML files from the .rst files:

make html

We can also tweak the Makefile to configure the settings to our preference.

ghoshbishakh (dipy)

Google Summer of Code Progress July 22

It has been about 3 weeks since the midterm evaluations. The dipy website is gradually heading towards completion!

Progress so far

The documentation has been completely integrated with the website, and it is synced automatically from the github repository where the docs are hosted. The honeycomb gallery on the home page has been replaced with a carousel of images with content overlays, which will allow us to display important announcements at the top. The news feed now has sharing options for Facebook, Google Plus and Twitter. Google Analytics has been integrated for monitoring traffic. There are many performance optimizations, like introducing a layer of cache and enabling the GZip middleware. Now the Google PageSpeed score is even higher than that of the older dipy website. All pages of the website have meta tags for search engine optimization. And of course there have been lots of bug fixes, and the website scales a lot better on mobile devices. The current pull request is #13. You can visit the site under development at http://dipy.herokuapp.com/

Documentation Integration

The documentation is now generated and uploaded to the dipy_web repository using a script. Previously an HTML version of the docs was built, but this script builds JSON docs, which lets us integrate them within the Django templates very easily. Then, using the GitHub API, the list of documentation versions is synced with the Django website models. Admins with the proper permissions can select which documentation versions to display on the website. The selected versions are displayed in the navbar dropdown menu. This is done by passing the selected docs in the context in a context processor.
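A Django context processor of this sort is just a function that receives the request and returns extra context for every template. A minimal sketch (the names here are hypothetical, not the actual dipy_web code):

```python
# Hypothetical sketch of a context processor like the one described above.
# In a real project it would live in its own module and be registered in
# settings.py under TEMPLATES[0]['OPTIONS']['context_processors'].

def _displayed_doc_versions():
    # Stand-in for a Django ORM query such as
    # DocumentationLink.objects.filter(displayed=True)
    return ["0.11", "0.10"]

def selected_documentations(request):
    """Expose the admin-selected documentation versions to every template,
    e.g. for the navbar dropdown menu."""
    return {"selected_docs": _displayed_doc_versions()}

print(selected_documentations(None))  # {'selected_docs': ['0.11', '0.10']}
```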
Now when a user requests a documentation page, the doc in JSON format is retrieved from GitHub and parsed, and the URLs in the docs are processed so that they work properly within the Django site. Then the docs are rendered in a template.

Cache

Processing the JSON documentation every time a page is requested is an overhead. Also, on the home page, the social network feeds were fetched on every request, which is not required. So a cache is used to reduce the overhead. In Django, adding a cache is really simple: all we need to do is set up the cache settings and add some decorators to the views. For now we are using a local memory cache, but in production it will be replaced with memcached. We keep the documentation view and the main page view in cache for 30 minutes. But this creates a problem: when we change some section, news item or publication in the admin panel, the changes are not reflected in the views, and we would have to wait up to 30 minutes to see them. To solve this, the cache is cleared whenever changes are made to the sections, news, etc.

Search Engine Optimizations

One of the most important steps for SEO is adding proper meta tags to every page of the website. These include the Open Graph tags and the Twitter Card tags, so that when a page is shared on a social network it is properly rendered with the correct title, description, thumbnail, etc. The django-meta app provides a very useful template that can be included to render the meta tags properly, provided a meta object is passed in the context. Ideally every page should have its own unique meta tags, but there must be a fallback so that if no meta attributes are specified, some default values are used. So in order to generate the meta objects we have this function: And in settings.py we can specify some default values:

Google Analytics

Adding Google Analytics is very simple.
All we need to do is put a code snippet in every template, or just in the base template that all other templates extend. But to make it easier to customize, I have kept it as a context processor that takes the tracking ID from settings.py and generates the code snippet in the templates.

What's next

We have to add more documentation versions (the older ones) and add a hover button on the documentation pages to hop from one documentation version to another, just like the Django documentation. We have to design a gallery page that will contain images, videos and tutorials. I am currently working on a GitHub data visualization page for visualizing dipy contributors and activity in the dipy repository. Will be back with more updates soon! :)

liscju (Mercurial)

Coding Period - VII-VIII Week

The main thing I managed to do in the last two weeks was to make new clients (with the redirection feature) put files in the redirection location themselves, instead of pushing them to the main repository. To obtain the redirection destination, the new client asks the main repo server for this information and then communicates directly with the redirection server. This behaviour is correct because it guarantees that a push transaction only succeeds when the client has successfully put all large files in the redirection destination. Otherwise the transaction fails, so the main repo server never ends up with a revision whose large files are missing from the redirection destination.

The second thing I did was to add and tweak the test cases for the redirection module. Next, I researched what functionality the redirection server should have. There was some discussion about whether the server should be thin or rich in functionality, but the general conclusion is that it should be thin: it should support only getting files and pushing files.
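Since largefiles are content-addressed by their hash, even such a thin server can validate what it stores. A hedged sketch of the idea (not Mercurial's actual code):

```python
import hashlib

def verify_largefile(expected_hash, data):
    """Check that a pushed blob matches its content-addressed name.

    Mercurial's largefiles extension names files by the SHA-1 of their
    content, so verification reduces to a single hash comparison.
    """
    return hashlib.sha1(data).hexdigest() == expected_hash

blob = b"some large binary content"
name = hashlib.sha1(blob).hexdigest()
print(verify_largefile(name, blob))         # True
print(verify_largefile(name, blob + b"x"))  # False
```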
The one thing we demand from the redirection server is that it check whether a pushed large file has the proper hash, because that is the only way to be sure that subsequent clients will download files with the proper content.

The last thing I managed to do was to stop the main server from saving files temporarily when old clients push them, before sending them on to the redirection server. Until now those files were saved because when an old client pushes files, the main repo server doesn't know the size of each file, and as a result it doesn't know how to set Content-Length in the request to the redirection server. This was overcome by using chunked transfer encoding. This feature of the HTTP 1.1 protocol enables sending files chunk by chunk, knowing only the size of each single chunk being sent. You can read more about this on Wikipedia: https://en.wikipedia.org/wiki/Chunked_transfer_encoding

Abhay Raizada (coala)

week full of refactor

My project has grown a lot now; we are officially going to support C, C++, Python 3, JavaScript, CSS and Java with our generic algorithms, though they'll still be experimental owing to the nature of the bears. The past two weeks were heavily focused on refactoring the algorithms of the AnnotationBear and IndentationBear. The IndentationBear received only small fixes, while the AnnotationBear had to undergo a change in its algorithm; the new and improved algorithm also adds the ability to distinguish between single-line and multi-line strings, whereas earlier there were just strings. The IndentationBear is almost complete, barring basic things like:

• It still messes up your doc strings / multi-line strings.
• Still no support for keyword indents.

The next weeks' efforts will go into introducing various indentation styles into the bear and fixing these issues, before we move on to the LineBreakBear and the FormatCodeBear.
Prayash Mohapatra (Tryton)

Few methods left

Well yes, according to my Trello board, I am just a couple of methods away from completely porting the Import/Export feature from Python (GTK) to JavaScript (sao). The journey now feels rewarding, especially since I just learnt that GNU Health uses Tryton as its framework too. There has been no problem as such in the last two weeks. I made the predefined exports usable: they can be created, saved and removed. I can now get the records selected in the tab and fetch the relevant data from the 'export_data' RPC call. I have grown confident in making RPC calls in general, and I'm feeling comfortable around promises. Now I smile at the times when the folks at my college club would use a promise for every concurrency issue, and I would be staring at them poker-faced. I will soon move on to writing the tests for the feature, something I am eagerly waiting for. Have a nice weekend.

Ravi Jain (MyHDL)

Started Receive Engine!

It's been a long time since my last post (2 weeks, phew)! Sorry for the slump. Anyway, during this period I successfully merged the Transmit Engine after my mentor's review. I later realised that I missed adding the client-underrun functionality used to corrupt the current frame's transmission. I shall make sure to add that in the next merge. Next I started looking at GMII, which partly stalled my work because I was unable to clearly understand what I have to do for it. So I decided to move on and complete the Receive Engine with the address filter first. So far I have finished receiving the destination address from the data stream and filtering it using the address table, matching entries against the frame's destination address. If there is a match, the receiver starts forwarding the stream to the client side; otherwise it just ignores the frame. Next I look forward to adding error-check functionality to be able to assert a good/bad frame at the end of the transmission.
jbm950 (PyDy)

GSoC Week 8 & 9

Last week I did not end up writing a blog post, so I am combining that week's post with this week's. Last week I attended the SciPy 2016 conference and was able to meet my mentor, and many other contributors to SymPy, in person. I was also able to help out with the PyDy tutorial. During my time at the conference (and this current week) I was able to flesh out the remaining details on the different portions of the project. I have updated PR #353 to reflect the API decisions for SymbolicSystem (previously eombase.EOM). In line with trying to put the finishing touches on implementation details before diving into code, Jason and I met with someone who has actually implemented the algorithm in the past to help us with details surrounding Featherstone's method. He also pointed me to a different description of the same algorithm that may be easier to implement. This week I also worked on rewriting the docstrings in physics/mechanics/body.py because I found the current docstrings somewhat confusing. I also reviewed one of Jason's PRs, in which he reduces the amount of work that *method.rhs() has to do when inverting the mass matrix by pulling out the kinematical information before the inversion takes place.

Future Directions

With the past two weeks focused on planning the different parts of the project, I will start implementing those parts next week. I will first work on finishing the SymbolicSystem object and then move towards implementing the OrderNMethod. This work should be very straightforward with all the work that has been put into planning the APIs.
PR's and Issues

• (Merged) Speeds up the linear system solve in KanesMethod.rhs() PR #10965
• (Open) Docstring cleanup of physics/mechanics/body.py PR #11416
• (Open) [WIP] Created a basis on which to discuss EOM class PR #353

July 21, 2016

Ranveer Aggarwal (dipy)

Going 3D: An Orbital Menu

The next UI element is a menu with items that circle around a 3D object while still facing the camera. Basically, we need a menu that follows a 3D object: if the 3D object is moved, the menu should follow it, while still facing the camera. For now, we'd like the elements of the menu to be arranged in a circle.

Understanding the Follower Menu

A vtkFollower inherits from vtkActor and takes the renderer's active camera as an attribute. And that is it! That's all that's required to get an element to follow the camera. This is what a vtkFollower looks like:

# mapper = ...
followActor = vtk.vtkFollower()
followActor.SetMapper(mapper)
followActor.SetCamera(renderer.GetActiveCamera())

Building an Assembly

Now, if we keep adding actors like this, we'll run into a problem: all of them will face the camera, but since their origins are separate, they'll move differently. They will move with respect to the world coordinate system's origin, and not the object's. The solution? A vtkAssembly. In a vtkAssembly, all objects move together. An assembly is a kind of aggregate actor, with properties similar to those of a vtkActor. Marc-Alex had worked on a vtkAssembly-vtkFollower combination, and that gave me a real good head start. Here's how his menu looked.

A Basic Follower Menu

Now the task is to integrate this menu with our existing DIPY framework and modify the previously created elements to work with it.

Anish Shah (Core Python)

GSoC'16: Week 7 and 8

Makefile

Last week, I told you guys about adding a Fedora docker image. After that, I added an OS argument to the Makefile. Now developers can choose between Ubuntu and Fedora while building the docker image.
I have created two separate Dockerfiles for Ubuntu and Fedora. You can find these changes in the same pull request here.

Reviews

I spent the other part of the last two weeks updating the patches according to the reviews from my mentor Maciej Szulik and other PSF members - @berker.peksag and @r.david.murray.

Add GitHub PR field on issue page

I renamed the schema fields to shorter names - github_pullrequest_url to just pull_request - because in the future we might want to move to a different provider. Likewise on the HTML side, I renamed a few names to shorter ones. You can check out the patch here. In the first patch, after extracting the pull request id, I used to do a lookup in the DB to check whether the pull request exists. @r.david.murray suggested not to do a lookup and just do a mechanical translation. The updated patch is here. Thank you for reading this blogpost. This is it for today. See you again. :)

July 20, 2016

Yen (scikit-learn)

How to set up 32-bit scikit-learn on Mac without additional installation

Sometimes you may want to know how scikit-learn behaves when it's running on 32-bit Python. This blog post tries to give the simplest solution.

Step by Step

Below I'll go through the procedure step by step:

I. Type the following command and make sure it outputs 2147483647 (i.e. 2**31 - 1, the maxint of a 32-bit Python):

arch -32 /System/Library/Frameworks/Python.framework/Versions/2.7/bin/python -c "import sys; print sys.maxint"

II. Modify line 5 of the Makefile in the root directory of scikit-learn to:

PYTHON ?= arch -32 /System/Library/Frameworks/Python.framework/Versions/2.7/bin/python

and modify line 11 to:

BITS := $(shell $(PYTHON) -c 'import struct; print(8 * struct.calcsize("P"))')


III. Type

sudo make


in the root directory of scikit-learn and you are good to go!

Verification

You can verify that the 32-bit version of scikit-learn was built successfully by typing:

arch -32 /System/Library/Frameworks/Python.framework/Versions/2.7/bin/python


to enter 32-bit Python shell.

After that, type:

import sklearn


to check if sklearn can now run on 32-bit Python.
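You can also confirm the bitness directly with the same struct check the Makefile uses: a C pointer ("P" in the struct module) is 4 bytes on a 32-bit build and 8 bytes on a 64-bit one.

```python
import struct

# Size of a C pointer in bits: 32 on a 32-bit Python, 64 on a 64-bit one.
bits = 8 * struct.calcsize("P")
print(bits)
```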

Hope this helps!

Scipy 2016!

Last week I went to Austin, TX for SciPy 2016. I wasn't sure what to expect. How would people communicate? Would I fit in? What talks would interest me? Fortunately the conference was a huge success. I came away a far more confident and motivated programmer than when I went in.

So what were the highlights of my experience at Scipy?

On a personal level, I got to meet some of my coworkers, the members of the Beckstein Lab. Dr. Oliver Beckstein, David Dotson, and Sean Seyler are brilliant physicists and programmers who I have been working with on MDAnalysis and datreant. It was surreal to meet the people you have been working with over the internet for 3 months and get an idea of how they communicate and what they enjoy outside of work. It was the modern day equivalent of meeting penpals for the first time. I especially appreciated that David Dotson and Sean Seyler, both approximately four years my senior, provided invaluable advice to a recent graduate. If you’re reading this, thanks guys.

The most valuable moments were the conversations I had in informal settings. There is a huge diversity in career trajectories among those attending Scipy, everyone has career advice and technical knowledge to impart upon a young graduate as long as you are willing to ask. I had excellent conversations with people from Clover Health, Apple data scientists, Andreas Klockner (Keynote Speaker), Brian Van de Ven (Bokeh Dev), Ana Ruvalcaba at Jupyter, the list goes on…

Fascinating, Troubling, and Unexpected Insights

• Scipy doubled in size in the last year!
• So many free shirts (and stickers), don’t even bother coming with more than one shirt, also nobody wears professional attire.
• Overheard some troubling comments made by men at Scipy, e.g. “Well, all the women are getting the jobs I’m applying for…” (said in a hallway group, this is not appropriate even if it was a joke)
• The amount of beer involved in social events is kind of nuts; this probably comes with the territory of professional programming.
• There are a lot of apologists for rude people, someone can be extremely nonverbally dismissive and when you bring it up to other people they will defend him (yes, always him) saying something to the effect of ‘he has been really busy recently’. Oliver Beckstein is a shining example of someone who is very busy and makes a conscious effort to always be thoughtful and kind.
• Open source does not always imply open contribution, some companies represented at Scipy maintain open source projects while making the barriers to contribution prohibitively high.
• A lot of people at Scipy apologize for their job (half-seriously) if they aren’t someone super-special like a matplotlib core developer or the inventor of Python. Your jobs are awesome people!
• It is really hot in Austin.
• git pull is just git fetch + git merge.
• A lot of women in computing have joined and left male dominated organizations not because people are necessarily mean, but because they’ve been asked out too much or harassed in a similar fashion. Stay professional folks.
• Cows turn inedible corn into edible steak.
• As a young professional you have to work harder and take every moment more seriously than those older than you in order to get ahead.
• Breakfast tacos are delicious.
• Being able to get out of your comfort zone is a professional asset.
• Slow down, take a breath, read things over, don’t make simple mistakes.
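The git insight above ("git pull is just git fetch + git merge") can be verified in a throwaway repository. This self-contained sketch creates an upstream repo, clones it, and replays a pull by hand (merging FETCH_HEAD, which is essentially what pull does):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# An "upstream" repository with one commit.
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"

# Clone it, then add a second commit upstream.
git clone -q upstream clone
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "second"

# The manual equivalent of `git pull`:
cd clone
git fetch -q origin
git merge -q FETCH_HEAD

git rev-list --count HEAD   # both commits are now present
```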

July 19, 2016

Sheikh Araf (coala)

[GSoC16] Week 8 update

Time flies and it’s been an astonishingly quick 8 weeks. I’ve finished work for the first coala Eclipse release, and the plug-in will be released with coala 0.8 in the next few days.

Most of the coala team is at EuroPython so the development speed has slowed down. Nevertheless there are software development sprints this weekend and coala will be participating too.

We also plan on having a mini-conference of our own, and will have lightning talks from GSoC students and other coala community members.

As I'm nearing the end of my GSoC project, I've started reading up on some material in order to get started with implementing the coafile editor. Currently I'm planning to extend either the AbstractTextEditor class or the EditorPart class, or something similar.

Cheers.

Kuldeep Singh (kivy)

After Mid-Term Evaluation

Hello guys!

It’s been a month since I wrote a blog. I passed the Mid-Term and my mentors wrote a nice review for me. After my last blog post I have worked on a couple of features. Have a look at my PRs (Pull Requests) on the project Plyer.

I am quite happy with my work and hope to do more in the future.

Visit my previous blog here.

GSoC week 8 roundup

@cfelton wrote:

There has been a little bit of a slump after the midterms
hopefully this will not continue throughout the rest of
the program

The end will approach quickly.

• August 15th: coding period ends.
• August 15-20th: students submit final code and evaluations.
• August 23-25th: mentors submit final evaluations.

Overall the progress being made is satisfactory. I am looking
forward to the next stage of the projects now that the majority of the
implementation is complete: analysis of the designs, clean-up,
and documentation.

One topic I want to stress: a program like GSoC is very different
from much of the work that is completed by an undergrad student.
This effort is the student's exposition for this period of time
(which isn't insignificant). Meaning, the goal isn't to simply
show us you can get something working; you are publishing your
work to the public. Users should easily be able to use the cores
developed and the subblocks within the cores. Developers,
reviewers, and contributors should feel comfortable reading the
code. The code should feel clean [1]. You (the students) are
publishing something into the public domain that carries your
name; take great pride in your work: design, code, documentation, etc.

As well as the readability stated above the code should be
analyzed for performance, efficiency, resource usage, etc.
This information should be summarized in the blogs and final
documentation.

Student week8 summary (last blog, commits, PR):

jpegenc:
health 88%, coverage 97%
@mkatsimpris: 10-Jul, >5, Y
@Vikram9866: 25-Jun, >5, Y

riscv:
health 96%, coverage 91%
@meetsha1995: 14-Jul, 1, N

hdmi:
health 94%, coverage 90%
@srivatsan: 02-Jul, 0, N