I just finished technical editing of the book Machine Learning in Action: http://manning.com/pharrington/
I also edited the book Scala for Machine Learning, published by Packt: https://www.packtpub.com/big-data-and-business-intelligence/scala-machine-learning
University of California Santa Cruz Extension Courses
2612 Introduction to Machine Learning and Data Mining
Sign up here: Machine Learning / Data Mining Survey Course
This covers the material in Machine Learning 101 described below.
Dates: Fall and Winter Quarters
Location: 2505 Augustine Drive
Santa Clara, CA
Some Data For First Lesson: myDataSet.txt, filter60HzNotchblanks.txt, filter50HzNotch.txt, matrixletters.csv, ProstateCancerDataESL.csv
30164 Machine Learning and Data Mining: Clustering Methods
Sign up here: Machine Learning / Data Mining Clustering
This covers the material in Machine Learning 102 described below.
Dates: Spring Quarter
Location: 2505 Augustine Drive
Santa Clara, CA
International Technological University Courses
CS 933 Machine Learning (3 credit units)
Sign up here: http://itu.edu/
This covers the material in Machine Learning 101 described below.
Dates: Fall, Spring, Summer Trimesters
Location: Online Courses
CS 920 Programming Paradigms (3 credit units)
Sign up here: http://itu.edu
CS 920, taught by Patricia Hoffman, PhD, is an extensive course on the R statistical language. (The course number and title are due to change.)
It can be taken by students who have never coded in any language before. It starts from the very beginning and continues through advanced concepts. It will give you a great jump start for my machine learning and data mining courses. In fact the last assignment for this course is similar to the first assignment of my Machine Learning courses.
Dates: Spring Trimester
Location: 355 W. San Fernando St.
San Jose, CA 95113
"I really enjoyed the class and would recommend anyone to do it. It was a perfect blend of Theory of Data Mining and practical exercises in language R. Exercises were extremely helpful, to not only understand the concepts but how to apply them to real problems! It was one of the most fun classes I have attended in a long time. The best part is that what I have learned, I am able to directly apply in my work right away! Thanks Patricia for putting together such an excellent course!" - Sourabh Satish, Distinguished Engineer, Symantec
- Bo Wu
Thanks for teaching the class.
I REALLY, REALLY learned a lot from this class, in fact it's the best class I have taken at UCSC. Besides the great class lectures and discussion, the homework and project helped me a lot. The extra work you gave us made me revise the subject matter once I got home; even though it did seem like extra work, it really paid back.
Thanks for teaching this subject to us.
I will stay in touch.
Navindra Yadav Google
I would like to thank you for offering such an excellent course. I have learned a lot from your course.
Your course has given me a jump start on R and a solid foundation in machine learning. You have covered a lot of breadth, which has helped me understand the broad array of techniques for machine learning. Your example programs were an excellent source of learning as well.
Thank you again.
Here is a brief description of past courses and web links:
Text: "The Elements of Statistical Learning - Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Big Data with MapReduce: Adaptation and execution of machine learning algorithms in the MapReduce framework
Apache BigTop: http://apachebigtop.pbworks.com/w/page/48434924/FrontPage
Machine Learning - Natural Language Text Documents: Statistical algorithms for accomplishing machine learning tasks on texts
Following are the details for these courses:
Machine Learning 101 and 102
Machine Learning automatically recognizes complex, previously unknown, novel, and useful patterns and information in all types of data. Data-driven algorithms are the wave of the future, and their results improve as the amount of data increases. Machine Learning algorithms are used in search engines, image analysis, multimedia database retrieval, bioinformatics, industrial automation, speech recognition, and many other fields. These are survey courses covering the concepts and principles of a large variety of data mining methods. The courses equip students with a working knowledge of these techniques and prepare them to apply machine learning to real problems. At the end of this sequence the students will collaborate on a set of projects of their choosing.
Machine Learning 101 covers supervised techniques including various types of linear regression, decision trees, k-nearest neighbors, Naive Bayes, Support Vector Machines and ensemble methods. Machine Learning 102 addresses unsupervised techniques such as k-means, expectation maximization, anomaly detection, hierarchical clustering, and density based clustering.
These courses require a moderate level of computer programming proficiency, along with an elementary background in probability, statistics, linear algebra, and calculus. These are hands-on courses using the statistical language R for class examples and homework assignments. No prior knowledge of R is assumed, and some of the basics of the open source R language are covered.
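As a small illustration of one of the supervised techniques listed above, here is a minimal k-nearest neighbors sketch. It is written in Python with the standard library only (the courses themselves use R), and the function name `knn_predict`, the data layout, and the toy points are assumptions made for this example, not course material.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs, where each point is a tuple
    of numbers with the same length as `query`.
    """
    # Sort the training points by Euclidean distance to the query point
    # and keep the k closest ones.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count the labels among those neighbors and return the most common one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups of points.
    toy = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
           ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
    print(knn_predict(toy, (0.5, 0.5)))
    print(knn_predict(toy, (5.5, 5.5)))
```

Sorting every training point is fine for a toy data set; for larger data a spatial index or a partial sort would be the more usual choice.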
Machine Learning 201 and 202
These courses cover topics in greater depth than Machine Learning 101 and 102. After finishing this series, participants are able to read the current literature and apply what they have read to their own work. At the end of this sequence, students present interesting machine learning projects using a wide variety of data sources.
Machine Learning 201 begins with ordinary least squares regression and extends this basic tool in a number of directions. Various regularization approaches are covered including ridge regression, lasso regression, least angle regression and elastic net. Logistic regression including coding categorical inputs and outputs is discussed. Feature space expansions along with subset selection (both forward and backward step-wise) are detailed. These techniques naturally lead to generalizations of linear regression, known as the "generalized linear model" and the "generalized additive model".
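The difference between the ridge and lasso penalties can be seen in the simplest possible setting: one feature and no intercept. The sketch below is in Python with the standard library only. The function names, the toy data, and the scaling of the objectives (half the residual sum of squares, plus `0.5 * lam * w**2` for ridge or `lam * abs(w)` for lasso) are assumptions chosen for this illustration.

```python
def _sums(xs, ys):
    """Return (sum of x*y, sum of x*x) for paired observations."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy, sxx


def ols_1d(xs, ys):
    """Ordinary least squares slope for y ~ w * x (no intercept)."""
    sxy, sxx = _sums(xs, ys)
    return sxy / sxx


def ridge_1d(xs, ys, lam):
    """Minimize 0.5 * RSS + 0.5 * lam * w**2: the slope shrinks smoothly."""
    sxy, sxx = _sums(xs, ys)
    return sxy / (sxx + lam)


def soft_threshold(z, lam):
    """Move z toward zero by lam, returning exactly 0.0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_1d(xs, ys, lam):
    """Minimize 0.5 * RSS + lam * abs(w): the slope can become exactly zero."""
    sxy, sxx = _sums(xs, ys)
    return soft_threshold(sxy, lam) / sxx


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # hypothetical data, true slope 2
    for lam in (0.0, 14.0, 30.0):
        print(lam, ridge_1d(xs, ys, lam), lasso_1d(xs, ys, lam))
```

With a large enough penalty the lasso slope is exactly zero while the ridge slope is only shrunk, which is why the lasso is also used for variable selection.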
Machine Learning 202 covers collaborative filtering including singular value decomposition and recommendation engines. A section on Bayesian belief networks including expectation maximization and factor analysis is delivered. Time is spent delving into decision trees including gradient boosting. The Friedman, Hastie, and Tibshirani paper, "Regularization Paths for Generalized Linear Models via Coordinate Descent" is an example of the papers presented and discussed. Support vector machines including details on kernel methods along with stochastic gradient descent are covered.
Machine Learning - Big Data
Participants learn to adapt and execute machine learning algorithms in the MapReduce framework. Participants finishing the class are able to author their own machine learning algorithms for MapReduce and to run them on Amazon Web Services. Amazon provided AWS credits for class participants. Participants learn to use Python code to author mappers and reducers for “hadoop-streaming”. For most of the class, “mrjob”, an open-source framework developed at Yelp, is used. Employing mrjob enables class members to program mappers and reducers in Python. The mrjob framework then submits the job to run locally without Hadoop, on Amazon Web Services, or on a private Hadoop cluster. This simplifies the programming tasks.
Topics included in this course are k-means with canopy clustering and an implementation of expectation maximization in the MapReduce paradigm. Generalized linear models with regularized regression are covered. Recommender systems along with singular value decomposition are discussed. Frequent item set implementations are also provided.
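To show what "authoring a mapper and a reducer" means, here is a minimal word-count sketch in plain Python. It deliberately does not use mrjob or Hadoop: the `run_local` helper is a hypothetical stand-in for the map, shuffle/sort, and reduce steps that a framework would perform, and all names here are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit a (word, 1) pair for every word on one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Combine all the counts that were emitted for a single word."""
    yield word, sum(counts)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    # Map phase: apply the mapper to every input line.
    mapped = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort phase: bring identical keys next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: call the reducer once per distinct key.
    results = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        for out_key, out_value in reducer(key, (value for _, value in group)):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    print(run_local(["big data", "Big clusters of data"]))
```

Because the mapper sees one line at a time and the reducer sees one key at a time, the same two functions could in principle be run on many machines at once, which is the point of the paradigm.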
Machine Learning - Natural Language Text Documents
Machine Learning applied to natural language text documents will be covered, including the use of statistical algorithms for accomplishing machine learning tasks on texts, as opposed to more traditional rule-based semantics, parsing, etc. The class starts with an introduction to basic text manipulations and continues with comparisons of statistical techniques to semantic approaches and a definition of the problems in text mining. It then covers various algorithms for dealing with standard text mining problems, such as indexing, automatic classification (e.g. spam filtering), part of speech identification, topic modeling, and sentiment extraction.
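As one example of a statistical (rather than rule-based) approach to automatic classification, the sketch below trains a tiny multinomial Naive Bayes text classifier with add-one smoothing. It is plain Python using only the standard library. The function names, the whitespace tokenizer, and the four training messages are assumptions invented for this illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train(docs):
    """Count documents per label and words per label.

    `docs` is a list of (text, label) pairs.
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log posterior score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Hypothetical training messages.
    model = train([
        ("win cash prize now", "spam"),
        ("cheap prize win win", "spam"),
        ("meeting agenda for monday", "ham"),
        ("lunch meeting on monday", "ham"),
    ])
    print(predict(model, "win a cash prize"))
    print(predict(model, "monday lunch meeting"))
```

Working with sums of log probabilities instead of products keeps the scores numerically stable when messages are long.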
Cloud Stock Presentation Materials - It was great to see everyone so excited about the cloud and LOTS of DATA!
Link to MyDataSet
A unified view on the rotational symmetry of equilibria of nematic polymers, dipolar nematic polymers and polymers in higher dimensional space, Communications in Mathematical Sciences 6, 949-974. Patricia Hoffman, PhD. pdf
A Distributed Architecture for the C3 I (Command, Control, Communications, and Intelligence) Collection Management Expert System pdf
Software as a Platform Panel Discussion October 15, 2011
The panel will discuss the benefits of moving to a software platform distributed over a large number of processors. The risks involved in the move will be covered, along with advice on avoiding the pitfalls. What are the costs involved with moving to one of these platforms? The popular platforms, along with where the market is likely to go, will be addressed. What software developer backgrounds are companies looking to hire? Do companies have in-house programs to develop this talent?
Suggested Questions to Cover
- Is big getting smaller? (i.e. is Moore's law allowing hardware to catch up with big data)
- Do we all need new shoes? (should we throw all legacy code onto a bonfire?)
- If not, how can we inter-operate?
- How valuable is it to "own" your own infrastructure all the way down to the bits?
- What kind of data mining analytics are the panelists doing?
- Are there any surprising applications that aren't what people usually list when they describe data mining?
Dr. Patricia Hoffman, Research Scientist
Ted Dunning has been involved with a number of startups, including MusicMatch, and Veoh Networks with the latest being MapR Technologies where he is Chief Application Architect. He is also a PMC member for the Apache Zookeeper and Mahout projects. Opinionated about software and data-mining and passionate about open source, he is an active participant of Hadoop and related communities and loves helping projects get going with new technologies.
Bryan Duxbury leads Rapleaf's Analysis Team, which is responsible for maintaining and analyzing a database of over 200 billion people data records. He is also the project chair of Apache Thrift.
Erik Andrejko is a software engineer currently working on large scale statistical climatology and related agronomic models at Weatherbill. In the past he has built systems at scale for collaborative filtering, search, rare event modeling and various other related problems.
Jay Kreps is a Principal Engineer and Engineering Manager at LinkedIn. One of LinkedIn's software platforms is called Voldemort: http://project-voldemort.com/ Voldemort is the open source data store used extensively at LinkedIn for online queries, built to overcome the inherent scalability limitations of a relational database.
Jimmy Retzlaff is a senior software engineer at Yelp, working on ad targeting. Jimmy regularly gives talks about mrjob, Yelp's open source library for doing MapReduce in Python. Before Yelp, Jimmy worked on the Amazon Kindle for nearly 5 years and also developed a system for generating interactive geographic visualizations of mutual fund sales activity.
Bayesian Techniques Panel Discussion October 15, 2011
Bayesian Techniques are used to model uncertainty, for inference (explaining data), decision making (decision theory), and risk reduction (predicting the future). A huge advantage of Bayesian Techniques is the ability to use all relevant information and to unify various methods in a probabilistic framework. This panel will discuss the types of problems that are ideally suited for Bayesian Techniques. Current research and new developments will be addressed. The panel will provide guidance for managers considering using these techniques to solve their problems.
Questions to be addressed include:
How can Bayesian Techniques improve the solutions to problems? How do Bayesian Techniques compare with other Data Mining methods? What types of problems are Bayesian Networks ideally suited to solving? What about Bayesian nonparametric models? What recommendations should be followed when implementing a Bayesian Technique? In practice, how is it possible to quantify uncertainty?
Dr. Patricia Hoffman, Research Scientist
Panelists
John Mark Agosta, PhD, previously was Chief Scientist at Impermium, a real time web service for spam elimination on social networks. Before that, Research Scientist at Intel Labs, working on opinion mining on the web and distributed detection of computer viruses; Edify Corporation (automating customer interaction using statistical natural language and automated workflow), Knowledge Industries - Diagnostic Bayes Nets, and SRI. John Mark did his thesis work at Stanford, on Bayes networks models for visual recognition.
Lionel Jouffe, PhD, cofounder and CEO of France-based Bayesia S.A.S. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999 and it has emerged as the leading software package for knowledge discovery, data mining and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia’s strategic partnership with Procter & Gamble, which has deployed BayesiaLab globally since 2007.
David Draper, PhD, Professor at the University of California at Santa Cruz. Past President of the International Society for Bayesian Analysis. Authored more than 100 articles in journals of the American Statistical Association and the Royal Statistical Society, Bayesian Analysis, and both the New England Journal of Medicine and the Journal of the American Medical Association. His seminal article has been cited more than 800 times.
Expert Panel Discussion November 13, 2010
The panel was just as lively as last year’s discussion. Follow the full video with all the Q/A to see what was interesting to the experts and students in attendance.
Dr. Patricia Hoffman, Research Scientist
and their signature question
Dr. Neel Sundaresan, Sr. Director and Head, eBay Research Labs at eBay
How does data mining apply to social and incentive networks?
Mr. Dean Abbott, Chief Scientist and Co-Founder at SmarterRemarketer, LLC
What tools are available to the data miner today?
Dr. Mike Bowles, Research Scientist and Start-up Executive
How has the field of stock market prediction changed over the past two years?
Dr. Hans Dolfing, Pattern Recognition Manager at Apple
What are the current challenges in Speech and handwriting recognition?
Dr. Susan Holmes, Stanford Professor, Statistics Department
What recommendations do you have for people starting out in data mining today?
Dr. Omid Madani, Senior Computer Scientist at SRI International
How has data mining changed as the scale of data has increased so dramatically in the last few years?
Dr. Lionel Jouffe, President/CEO at BAYESIA
Can you give us a brief example where a Bayesian network did a great job in diagnostics? [poor audio; the answer to this question is not included in the video]