Deep learning is becoming my favorite topic for personal projects. Its proliferation in the last few years has produced a vast array of applications and made it much more accessible. The one aspect of it I don't love is its impact on my AWS bill. GPU instances can get costly …
Blogging O'Reilly's Artificial Intelligence Conference
Last Monday and Tuesday, I attended O'Reilly's first-ever Artificial Intelligence Conference, courtesy of Women in Machine Learning & Data Science (WiMLDS). I tweeted about many of the talks I saw, but the full recap is on their blog.
Many thanks to WiMLDS for this opportunity. I will be attending O'Reilly …
Prediction of Zika Outbreaks
I did it! I presented my final project for Metis's Data Science Bootcamp just over a week ago. Before the dust completely settles on this wonderful experience, I'd like to share a few of my projects with you in the coming days.
The first of these projects involved predicting …
git commit -m 'Update life'
You may have noticed some changes around here and on my Twitter account. I've relocated to my beloved New York City and am currently attending an immersive data science bootcamp at Metis. Through mid-September, I'll be honing my existing skills in Python, statistics, and analysis, and developing an entirely new set in topics like machine learning, JavaScript, SQL, and Hadoop.
An integral part of Metis's bootcamp is the completion of five self-designed data science projects that both emphasize and develop data science and machine learning skills. These projects use real-world data, impose tight deadlines, and simulate many of the challenges encountered by data scientists. I started my third project last Monday and look forward to sharing a few of them here and on GitHub in the coming weeks.
pdLSR: Pandas-aware least squares regression
I have a new Python project I would like to share with the community. Actually, this project isn't so new. I developed an initial version about two years before completing my postdoctoral research, and it has undergone various revisions over the past three years. Having finally made time to give it the clean-up it needed, I am excited to share it on GitHub.
Overview
`pdLSR` is a library for performing least squares minimization. It attempts to seamlessly incorporate this task into a Pandas-focused workflow. Input data are expected in dataframes, and multiple regressions can be performed using functionality similar to Pandas `groupby`. Results are returned as grouped dataframes and include best-fit parameters, statistics, residuals, and more. The results can be easily visualized using `seaborn`.
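To picture the tail end of that workflow, here is a minimal sketch of plotting grouped fit results with `seaborn`; the `results` dataframe below is invented for illustration and is not actual `pdLSR` output.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical grouped fit results: one row per group,
# with a best-fit parameter and its uncertainty.
results = pd.DataFrame({
    'sample':   ['a', 'b', 'c'],
    'rate':     [0.31, 0.48, 0.55],
    'rate_err': [0.02, 0.03, 0.04],
})

# Compare the fitted parameter across groups, with error bars
# drawn from the parameter uncertainties.
ax = sns.barplot(data=results, x='sample', y='rate', color='steelblue')
ax.errorbar(x=range(len(results)), y=results['rate'],
            yerr=results['rate_err'], fmt='none', ecolor='black')
plt.show()
```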
`pdLSR` currently utilizes `lmfit`, a flexible and powerful library for least squares minimization, which in turn makes use of `scipy.optimize.leastsq`. I began using `lmfit` because it is one of the few libraries that supports non-linear least squares regression, which is commonly used in the natural sciences. I also like the flexibility it offers for testing different modeling scenarios and the variety of assessment statistics it provides. However, I found myself writing many `for` loops to perform regressions on groups of data and aggregate the resulting output. Simplifying this task was my inspiration for writing `pdLSR`.
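To make that motivation concrete, here is a minimal sketch of the kind of `for` loop described above: fitting a toy exponential decay to each group of a dataframe with `lmfit` and collecting the statistics by hand. The data, model, and column names are invented for illustration.

```python
import numpy as np
import pandas as pd
import lmfit

# Synthetic tidy data: two samples, each a noisy exponential decay
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 25)
data = pd.concat([
    pd.DataFrame({'sample': s, 'x': x,
                  'y': amp * np.exp(-rate * x) + rng.normal(0, 0.02, x.size)})
    for s, (amp, rate) in {'a': (1.0, 0.3), 'b': (0.8, 0.5)}.items()
])

def exp_decay(x, amp, rate):
    """Toy model: amp * exp(-rate * x)."""
    return amp * np.exp(-rate * x)

model = lmfit.Model(exp_decay)

rows = []
for name, group in data.groupby('sample'):
    params = model.make_params(amp=1.0, rate=0.1)
    fit = model.fit(group['y'].values, params, x=group['x'].values)
    rows.append({'sample': name,
                 **fit.best_values,     # best-fit parameter values
                 'redchi': fit.redchi,  # reduced chi-square
                 'aic': fit.aic,
                 'bic': fit.bic})

results = pd.DataFrame(rows).set_index('sample')
print(results)
```

Wrapping this group-fit-aggregate pattern, so the loop and bookkeeping don't have to be rewritten for every analysis, is exactly the task `pdLSR` aims to simplify.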
`pdLSR` is related to libraries such as `statsmodels` and `scikit-learn` that provide linear regression functions that operate on dataframes. However, these libraries don't directly support grouping operations on dataframes.
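As a rough illustration of that limitation, a dataframe-based fit with `statsmodels` still requires wiring up the grouping and aggregation yourself; the toy data and column names below are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy tidy dataframe: two samples with different linear trends
rng = np.random.default_rng(1)
x = np.arange(10.0)
df = pd.concat([
    pd.DataFrame({'sample': s, 'x': x,
                  'y': slope * x + rng.normal(0, 0.5, x.size)})
    for s, slope in [('a', 2.0), ('b', -1.0)]
])

def fit_group(group):
    """Ordinary least squares on one group; returns tidy coefficients."""
    res = smf.ols('y ~ x', data=group).fit()
    out = res.params.rename({'Intercept': 'intercept', 'x': 'slope'})
    out['rsquared'] = res.rsquared
    return out

# The per-group looping and result collection must be done by hand
coefs = df.groupby('sample')[['x', 'y']].apply(fit_group)
print(coefs)
```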
The aggregation of minimization output parameters performed by `pdLSR` has many similarities to the R library `broom`, written by David Robinson, with whom I had an excellent conversation about our two libraries. `broom` is more general in its ability to accept input from many minimizers, and I think expanding `pdLSR` in this fashion, for compatibility with `statsmodels` and `scikit-learn` for example, could be useful in the future.