Patricia Hoffman PhD


Join me on twitter:   Patricia Hoffman Twitter   Join me on Facebook Iam Gamer http://www.facebook.com/megamer

I just finished technical editing of the book Machine Learning in Action:  http://manning.com/pharrington/

The next class at the University of California Santa Cruz Extension starts in January:

2612 Introduction to Machine Learning and Data Mining

Sign up here:   Machine Learning / Data Mining Survey Course.  This covers the material in Machine Learning 101 and 102 below.

Dates: Jan 29, 2013 - Apr 2, 2013
Location: 2505 Augustine Drive, Santa Clara, CA

Some Data For First Lesson: myDataSet.txt, filter60HzNotchblanks.txt, filter50HzNotch.txt, matrixletters.csv, ProstateCancerDataESL.csv, myfirstdata.txt

Students' Comments:

"I really enjoyed the class and would recommend anyone to do it. It was a perfect blend of Theory of Data Mining and practical exercises in language R. Exercises were extremely helpful, to not only understand the concepts but how to apply them to real problems! It was one of the most fun classes I have attended in a long time. The best part is that what I have learned, I am able to directly apply in my work right away! Thanks Patricia for putting together such an excellent course!" - Sourabh Satish, Distinguished Engineer, Symantec.


"Thanks a lot for your course. I really enjoy it! I wish you could teach a second course on machine learning/data mining or on their applications."
 - Bo Wu

Dr Hoffman,

Thanks for teaching the class.
I REALLY, REALLY learned a lot from this class, in fact its the best class I have taken at UCSC. Besides the great class lectures, and
discussion, the homework, and project helped me a lot. The extra work you gave us, made me revise the subject matter once I got home, even
though it did seem like extra work, but that really paid back.

Thanks for teaching this subject to us.

I will stay in touch.
Best Regards
Navindra Yadav


Here is a brief description of past courses and web links:

The First Sequence:
Beginning Applied Machine Learning 101    Supervised learning                                                                                                  
Beginning Applied Machine Learning 102    Unsupervised Learning and Fault Detection                                                 
Text: "Introduction to Data Mining", by Pang-Ning Tan, Michael Steinbach and Vipin Kumar

The Second Sequence:
Modern Applied Machine Learning 201      Advanced Regression Techniques, Generalized Linear Models, and Generalized Additive Models  
Modern Applied Machine Learning 202      Collaborative Filtering, Bayesian Belief Networks, and Advanced Trees

Text:  "The Elements of Statistical Learning - Data Mining, Inference, and Prediction"  by Trevor Hastie, Robert Tibshirani, and Jerome Friedman


Big Data with MapReduce       Adaptation and execution of machine learning algorithms in the map reduce framework

Apache BigTop:  http://apachebigtop.pbworks.com/w/page/48434924/FrontPage

Machine Learning - Natural Language Text Documents  Statistical algorithms for accomplishing machine learning tasks on texts

 

Following are the details for these courses:


Machine Learning 101 and 102

Machine Learning automatically recognizes complex, previously unknown, novel, and useful patterns and information in all types of data. Data-driven algorithms are the wave of the future and their results improve as the amount of data increases. Machine Learning algorithms are used in search engines, image analysis, multimedia database retrieval, bioinformatics, industrial automation, speech recognition, and many other fields. These are survey courses covering the concepts and principles of a large variety of data mining methods. The courses equip students with a working knowledge of these techniques and prepare them to apply machine learning to real problems.  At the end of this sequence the students will collaborate on a set of projects of their choosing.

Machine Learning 101 covers supervised techniques including various types of linear regression, decision trees, k-nearest neighbors, Naive Bayes, Support Vector Machines and ensemble methods. Machine Learning 102 addresses unsupervised techniques such as k-means, expectation maximization, anomaly detection, hierarchical clustering, and density based clustering.

These courses require a moderate level of computer programming proficiency, along with an elementary level background in probability, statistics, linear algebra, and calculus. These are hands-on courses using the statistical language R for class examples and homework assignments. No prior knowledge of R is assumed, and some of the basics of the open source R language are covered.


Machine Learning 201 and 202

These courses cover topics in greater depth than Machine Learning 101 and 102.  After finishing this series, participants are able to read the current literature and apply what they have read to their own work.  At the end of this sequence, students present interesting machine learning projects using a wide variety of data sources.

Machine Learning 201 begins with ordinary least squares regression and extends this basic tool in a number of directions.  Various regularization approaches are covered, including ridge regression, lasso regression, least angle regression and elastic net.  Logistic regression, including coding categorical inputs and outputs, is discussed. Feature space expansions along with subset selection (both forward and backward step-wise) are detailed.  These techniques naturally lead to generalizations of linear regression, known as the "generalized linear model" and the "generalized additive model".
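As a rough illustration of how two of these penalties behave, here is a minimal Python sketch for the simplest possible case: a single feature and no intercept. It is not course material. The function names and the toy data are assumptions made for this example, and the closed forms only hold in this one-feature setting.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, stopping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def ridge_coefficient(xs, ys, penalty):
    """Minimize 0.5 * sum((y - b*x)^2) + 0.5 * penalty * b^2 for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def lasso_coefficient(xs, ys, penalty):
    """Minimize 0.5 * sum((y - b*x)^2) + penalty * |b| for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, penalty) / sxx


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]  # toy data, hypothetical
    ys = [2.0, 4.0, 6.0]
    for penalty in (0.0, 14.0, 28.0):
        print(penalty, ridge_coefficient(xs, ys, penalty), lasso_coefficient(xs, ys, penalty))
```

In this setting the ridge penalty shrinks the coefficient smoothly but never makes it exactly zero, while the lasso penalty sets it to exactly zero once the penalty is large enough.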

Machine Learning 202 covers collaborative filtering including singular value decomposition and recommendation engines.  A section on Bayesian belief networks including expectation maximization and factor analysis is delivered.  Time is spent delving into decision trees including gradient boosting.  The Friedman, Hastie, and Tibshirani paper, "Regularization Paths for Generalized Linear Models via Coordinate Descent" is an example of the papers presented and discussed.  Support vector machines including details on kernel methods along with stochastic gradient descent are covered.    


Machine Learning - Big Data

Participants learn to adapt and execute machine learning algorithms in the map reduce framework.  Participants finishing the class are able to author their own machine learning algorithms for map reduce and to run them on Amazon Web Services.  Amazon provided AWS credits for class participants. 

Participants learn to use Python code to author mappers and reducers for “hadoop-streaming”.  For most of the class, “mrjob”, an open-source framework developed at Yelp, is used.  Employing mrjob enables class members to program mappers and reducers in Python.  The mrjob framework then submits the mappers and reducers to run locally without using Hadoop, on Amazon Web Services, or on a private Hadoop cluster.  This simplifies the programming tasks.
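To give a feel for the mapper and reducer pattern, here is a minimal word-count sketch in plain Python. It does not use mrjob or Hadoop. The function names and the run_local helper are assumptions made for this illustration, and the sort step only stands in for the shuffle that a real cluster performs between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)


def run_local(lines):
    """Simulate the map, shuffle/sort, and reduce phases in one process."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))  # stands in for the framework's shuffle and sort
    results = {}
    for word, group in groupby(mapped, key=itemgetter(0)):
        for key, total in reducer(word, (count for _, count in group)):
            results[key] = total
    return results


if __name__ == "__main__":
    print(run_local(["big data big ideas", "data mining"]))
```

The mapper and reducer never see the whole data set at once, which is what lets a framework spread them over many machines.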

Topics included in this course are k-means with canopy clustering and an implementation of expectation maximization in the map-reduce paradigm.  Generalized Linear Models with regularized regression are covered.  Recommender systems along with singular value decomposition are discussed.  Frequent Item Set Implementations are also provided.
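As one hedged example of how such an algorithm fits the paradigm, a single k-means iteration can be phrased as a map step (assign each point to its nearest centroid) followed by a reduce step (average the points assigned to each centroid). The Python sketch below runs in one process and leaves out canopy clustering. All names in it are assumptions for illustration, not the class code.

```python
from collections import defaultdict


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_mapper(point, centroids):
    """Emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(point, centroids[i]))
    yield nearest, point


def kmeans_reducer(index, points):
    """Emit (centroid index, mean of the points assigned to it)."""
    points = list(points)
    dims = len(points[0])
    mean = tuple(sum(p[d] for p in points) / len(points) for d in range(dims))
    yield index, mean


def kmeans_step(points, centroids):
    """Run one map and reduce pass and return the updated centroids."""
    groups = defaultdict(list)  # stands in for the shuffle by key
    for point in points:
        for index, value in kmeans_mapper(point, centroids):
            groups[index].append(value)
    updated = list(centroids)  # a centroid with no assigned points stays put
    for index, assigned in groups.items():
        for key, mean in kmeans_reducer(index, assigned):
            updated[key] = mean
    return updated


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans_step(data, [(1.0, 1.0), (9.0, 9.0)]))
```

Repeating kmeans_step until the centroids stop moving gives the usual k-means loop; on a cluster each repetition would be one map-reduce job.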


Machine Learning - Natural Language Text Documents

Machine Learning applied to natural language text documents will be covered, including the use of statistical algorithms for accomplishing machine learning tasks on texts, rather than more traditional rule-based semantics, parsing, etc.  The class starts with an introduction to basic text manipulations and continues with comparisons of statistical techniques to semantic approaches and definitions of problems in text mining.  It then covers various algorithms for dealing with standard text mining problems, such as indexing, automatic classification (e.g. spam filtering), part of speech identification, topic modeling, and sentiment extraction.

         
Web Sites from previous classes:
I am teaching a machine learning class at the Hacker Dojo on Wed and Thursday evenings.  Check out the   Class Web Page  


Here's what past class members say about the class:

Stephen OConnell just posted a comment for HackerDojo/Microsoft Beginning Data Mining and Machine Learning 101.

"I have taken two classes in machine learning taught by Tricia Hoffman at the Dojo.  She is very knowledgeable in the area and does a great job sharing/teaching the methods with hands-on examples.  The group homework is a great way to get engaged with the information taught in class and exchange information with fellow machine learners.  From these classes I have been able to implement these techniques in my work at a large financial firm.  I highly recommend this class."


--------------------------------------------------------------------------------------------------

Cloud Stock Presentation Materials - It was great to see everyone so excited about the cloud and LOTS of DATA!

Link to MyDataSet

hoffman.tricia@gmail.com  (email)


Example Class: Beginning Linear Regression
Check out the Description Page on our Class Web Site

Publications

A unified view on the rotational symmetry of equilibria of nematic polymers, dipolar nematic polymers and polymers in higher dimensional space, Communications in Mathematical Sciences 6, 949-974 

Patricia Hoffman PhD pdf   

A Distributed Architecture for the C3 I (Command, Control, Communications, and Intelligence) Collection Management Expert System pdf


Web Sites

Patricia Hoffman LinkedIn

Patricia Hoffman Twitter

Data-Mi.ning 

LinkedIn Group

AnalyticBridge Group

Computer Languages

Machine Learning

Distributed Processing


Check out the great topics from the last Data Mining Camp (October 15, 2011)

http://www.sfbayacm.org/bootcamp/forums/

http://www.djcline.com/2011/10/20/oct-15-2011-acm-data-mining-camp/

Software as a Platform Panel Discussion  October 15, 2011

The panel will discuss the benefits of moving to a software platform distributed over a large number of processors. The risks involved in the move, along with advice on avoiding the pitfalls, will be covered. What are the costs involved with moving to one of these platforms? The popular platforms, along with where the market is likely to go, will be addressed. What software developer backgrounds are companies looking to hire? Do companies have in-house programs to develop this talent?

Suggested Questions to Cover
- Is big getting smaller? (i.e. is Moore's law allowing hardware to catch up with big data)
- Do we all need new shoes? (should we throw all legacy code onto a bonfire?)
- If not, how can we inter-operate?
- How valuable is it to "own" your own infrastructure all the way down to the bits?
- What kind of data mining analytics are the panelists doing?
- Are there any surprising applications that aren't what people usually list when they describe data mining?

Moderator
Dr. Patricia Hoffman, Research Scientist

Panelists

Ted Dunning has been involved with a number of startups, including MusicMatch, and Veoh Networks with the latest being MapR Technologies where he is Chief Application Architect. He is also a PMC member for the Apache Zookeeper and Mahout projects. Opinionated about software and data-mining and passionate about open source, he is an active participant of Hadoop and related communities and loves helping projects get going with new technologies.

Bryan Duxbury leads Rapleaf's Analysis Team, which is responsible for maintaining and analyzing a database of over 200 billion people data records. He is also the project chair of Apache Thrift.

Erik Andrejko is a software engineer currently working on large scale statistical climatology and related agronomic models at Weatherbill. In the past he has built systems at scale for collaborative filtering, search, rare event modeling and various other related problems.

Jay Kreps is a Principal Engineer and Engineering Manager at LinkedIn. One of LinkedIn's software platforms is called Voldemort (http://project-voldemort.com/). Voldemort is the open source data store used extensively at LinkedIn for online queries, built to overcome the inherent scalability limitations of a relational database.

Jimmy Retzlaff is a senior software engineer at Yelp, working on ad targeting. Jimmy regularly gives talks about mrjob, Yelp's open source library for doing MapReduce in Python. Before Yelp, Jimmy worked on the Amazon Kindle for nearly 5 years and also developed a system for generating interactive geographic visualizations of mutual fund sales activity.

 

Bayesian Techniques Panel Discussion  October 15, 2011

Bayesian Techniques are used to model uncertainty, for inference (to explain data), Decision Making (Decision Theory), and Risk Reduction (Predicting Future).  A huge advantage of Bayesian Techniques is the ability to use all relevant information and to unify various methods in a probabilistic framework.  This panel will discuss the types of problems that are ideally suited for Bayesian Techniques.  The current research and new developments will be addressed.  The panel will provide guidance for managers considering using these techniques to solve their problems. 

Questions to be addressed include:
How can Bayesian Techniques improve the solutions to problems?  How do Bayesian Techniques compare with other Data Mining methods?  What types of problems are Bayesian Networks ideally suited to solving?  What about Bayesian nonparametric models?  What recommendations should be followed when implementing a Bayesian Technique?  In practice, how is it possible to quantify uncertainty?

Moderator
Dr. Patricia Hoffman, Research Scientist

Panelists

John Mark Agosta, PhD, previously was Chief Scientist at Impermium, a real time web service for spam elimination on social networks. Before that, Research Scientist at Intel Labs, working on opinion mining on the web and distributed detection of computer viruses;  Edify Corporation (automating customer interaction using statistical natural language and automated workflow), Knowledge Industries - Diagnostic Bayes Nets, and SRI.  John Mark did his thesis work at Stanford, on Bayes networks models for visual recognition.

Lionel Jouffe, PhD, cofounder and CEO of France-based Bayesia S.A.S. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999 and it has emerged as the leading software package for knowledge discovery, data mining and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia’s strategic partnership with Procter & Gamble, which has deployed BayesiaLab globally since 2007.

David Draper, PhD, Professor at the University of California at Santa Cruz.  Past President of the International Society for Bayesian Analysis.  Authored more than 100 articles in Journals of American Statistical Association, Royal Statistical Society, Bayesian Analysis, and both the New England Journal of Medicine, and the Journal of American Medical Association.  His seminal article has been cited more than 800 times.

 

 

Expert Panel Discussion  November 13, 2010

The panel was just as lively as last year’s discussion. Follow the full video with all the Q/A to see what was interesting to the experts and students in attendance.

Moderator
Dr. Patricia Hoffman, Research Scientist

Panelists
and their signature question

Dr. Neel Sundaresan, Sr. Director and Head, eBay Research Labs at eBay
How does data mining apply to social and incentive networks?

Mr. Dean Abbott, Chief Scientist and Co-Founder at SmarterRemarketer, LLC
What tools are available to the data miner today?

Dr. Mike Bowles, Research Scientist and Start-up Executive
How has the field of stock market prediction changed over the past two years?

Dr. Hans Dolfing, Pattern Recognition Manager at Apple
What are the current challenges in speech and handwriting recognition?

Dr. Susan Holmes, Stanford Professor, Statistics Department
What recommendations do you have for people starting out in data mining today?

Dr. Omid Madani, Senior Computer Scientist at SRI International
How has data mining changed as the scale of data has increased so dramatically in the last few years?

Dr. Lionel Jouffe, President/CEO at BAYESIA
Can you give us a brief example where a Bayesian network did a great job in diagnostics? [poor audio, the answer to this question not included in video]

The ACM Data Mining Camp Expert Panel Video  November 13, 2010

http://www.djcline.com/2010/11/17/nov-13-2010-acm-data-mining-camp/

 http://www.lecturemaker.com/2011/09/acm-data-mining-expert-panel/#video


ACM Data Mining Camp 2010 – Expert Panel
Moderated by Dr. Patricia Hoffman
Video Link
ACM Data Mining Camp: Expert Panel Discussion with Q&A
Moderated by Dr. Patricia Hoffman

The expert panel includes: Dr. Ted Dunning, Mr. Joseph B. Rickert, Dr. Giovanni Seni,
Dr. Michael Walker, Dr. Hugh Williams, Dr. Mike Bowles, and Mr. Greg Makowski
Video Link