poikniok's blog

By poikniok, history, 9 years ago, In English

I was wondering if any people here had experience with Kaggle? I am just now considering starting out there and trying to learn about machine learning, and wanted to know what opinions people here have. Are the contests and problems as interesting and informative as on algorithm sites like here?

  • Vote: I like it
  • +53
  • Vote: I do not like it

»
9 years ago, # |
Rev. 2   Vote: I like it +8 Vote: I do not like it

Kaggle requires very different skills compared to Codeforces! If you want to start with machine learning competitions without any prior experience, I would recommend starting with this course:

https://www.edx.org/course/analytics-edge-mitx-15-071x-0

It has a huge number of examples showing how to apply standard ML algorithms to many different datasets, and it even uses Kaggle as part of the grading. It doesn't show the details of how to implement the algorithms, but those are easy to google if you are interested; most of it is very straightforward compared to CF algorithms.

  • »
    »
    9 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    This might be a stupid question, but... is C++ or Java viable for Kaggle instead of R? My guess is probably not, because there's a lot of missing library functionality, right?

    • »
      »
      »
      9 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      You really want to use a scripting language, since you will want to play around with the data. You also need ways to visualize the data or you will stumble around in the dark, so libraries for making plots and reports are a must.

      So you don't need to use R; you could use languages like Python or Matlab instead. But Java and C++ are not really viable. There are machine learning libraries for them, but the focus here isn't on algorithms; it is on how well you can familiarize yourself with the data.

      • »
        »
        »
        »
        9 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Or rather, it depends on the competition. Some are more focused on performance, and you can technically do everything you need in languages like Java/C++. But it is very nice to have a scripting language to quickly try out different models; you can then reimplement the chosen one in a lower-level language for better performance if you really need to.

      • »
        »
        »
        »
        9 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        I'm not really familiar with this subject, so forgive me if this is an easy question, but besides the available libraries, what is the main difference between scripting and non-scripting languages that makes playing with data easier in scripting ones?

        • »
          »
          »
          »
          »
          9 years ago, # ^ |
            Vote: I like it +16 Vote: I do not like it

          It might be hard to understand if you don't understand the problem structure. In essence, the problems are tables with hundreds to millions of rows, with one variable you are meant to predict and many other variables that you should use to predict it. The variables can be numbers, text, images, etc. Usually you have many different kinds mixed together in the same problem, because that is what you get in real life.

          So firstly, to understand the actual problem you want to take a look at the data. Since scripting languages have a REPL, you can do it in real time as you write your code. This process is a lot smoother than compiling and rerunning the whole thing at each step, especially since a lot of ML algorithms take a long time to run.

          Secondly, statically typed languages don't handle mixed data tables well. Either you have to build the mixed types yourself (getting a poor man's dynamic typing), or you have to define every struct by hand and alter your algorithms to work with the new structs every time.
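As a rough illustration of the table structure described above (not part of the original comment), here is a minimal Python sketch using only the standard library. The dataset, the column names, and the helpers `load_rows` and `column_types` are all made up for the example; a real competition would more likely use a dataframe library. It shows how a dynamically typed language can hold numbers, text, and missing values in the same table without declaring a struct first.

```python
import csv
import io
from collections import Counter

# Hypothetical toy dataset: 'survived' is the target, the rest are features.
RAW = """age,city,comment,survived
34,Oslo,likes hiking,1
,Paris,,0
51,Oslo,frequent flyer,1
27,Rome,first trip,0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, converting numeric fields when possible."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, value in record.items():
            if value == "":
                row[name] = None      # missing value
            else:
                try:
                    row[name] = float(value)
                except ValueError:
                    row[name] = value  # keep as text
        rows.append(row)
    return rows


def column_types(rows):
    """Count which Python types occur in each column."""
    return {
        name: Counter(type(row[name]).__name__ for row in rows)
        for name in rows[0]
    }


rows = load_rows(RAW)
summary = column_types(rows)
print(summary)
```

In a REPL you would typically run `load_rows` once and then poke at `rows` and `summary` interactively, which is the workflow the comment describes.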

          • »
            »
            »
            »
            »
            »
            9 years ago, # ^ |
              Vote: I like it +8 Vote: I do not like it

            Well, REPL is not exclusive just to scripting languages.

        • »
          »
          »
          »
          »
          9 years ago, # ^ |
            Vote: I like it +22 Vote: I do not like it

          IMO, the main advantage of scripting languages for data science is interactivity: usually you have no idea what you're doing, in the sense that you can't tell what the approach is going to be just by looking at the problem statement. So an interactive shell where you can try random ideas helps a lot (for Python, it is the IPython notebook; I guess there's something for R too). You don't need to reload/reprocess the data and recompute features; you just try random ideas, visualize, try something else, and so on.

          Another advantage of scripting languages is that it is easy to hack around: you produce really crappy code, but it doesn't matter, since 99% of it will be thrown away, and you can clean up the remaining 1% later.
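A small sketch of the "load once, try many ideas" workflow described above (an illustration, not the commenter's code). In a notebook the loaded table simply stays in memory between cells; the script below imitates that with `functools.lru_cache`. The data, the `load_data` and `mean_abs_error` helpers, and the three candidate "ideas" are all hypothetical.

```python
import functools

LOAD_CALLS = 0  # counts how often the slow step actually runs


@functools.lru_cache(maxsize=None)
def load_data():
    """Stand-in for a slow load/preprocess step (hypothetical data)."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    # (feature, target) pairs following target = 2 * feature + 1
    return tuple((x, 2 * x + 1) for x in range(100))


def mean_abs_error(predict):
    """Score a candidate predictor against the cached data."""
    data = load_data()
    return sum(abs(predict(x) - y) for x, y in data) / len(data)


# Three throwaway ideas, tried one after another without reloading anything.
ideas = {
    "constant": lambda x: 100,
    "identity": lambda x: x,
    "linear": lambda x: 2 * x + 1,
}

scores = {name: mean_abs_error(f) for name, f in ideas.items()}
best = min(scores, key=scores.get)
print(scores, best, LOAD_CALLS)
```

The expensive step runs once however many ideas are scored, which is the main thing an interactive session gives you for free.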

          • »
            »
            »
            »
            »
            »
            9 years ago, # ^ |
              Vote: I like it +23 Vote: I do not like it

            "IMO, the main advantage of scripting languages for data science is interactivity: usually you have no idea what you're doing, in the sense that you can't tell what the approach is going to be just by looking at the problem statement. So an interactive shell where you can try random ideas helps a lot (for Python, it is the IPython notebook; I guess there's something for R too). You don't need to reload/reprocess the data and recompute features; you just try random ideas, visualize, try something else, and so on."

            The difference is usually not that big. For small datasets, loading/preprocessing time is near-instant. For big datasets that don't fit in memory, you're screwed anyway. So we're left with mid-sized datasets, and with those the biggest bottleneck is running your models, so it doesn't matter anyway. FYI, I'm speaking from an ML perspective (building a good predictive model), not a data science one. For DS, I'd agree that scripting languages might have a bigger advantage here.

            "Another advantage of scripting languages is that it is easy to hack around: you produce really crappy code, but it doesn't matter, since 99% of it will be thrown away, and you can clean up the remaining 1% later."

            C++ is not that far off. It's definitely a matter of experience. I went from C++ to Python (for data science/ML stuff), and then went back from Python to C++, because I'm much more productive (in the long term) with C++. In my case, most of Python's advantages come from having a better built-in library and slightly better syntax. But you can "patch" your C++ with a few macros and/or add some functions to offset that. Since you're only writing one-time-use scripts, readability for other people is not that important.

            The biggest difference between C++/Java and Python/R is the lack of libraries for the former, especially when it comes to "cutting-edge" stuff. That currently mostly boils down to overhyped NNs, but it's still a big area where C++/Java is (and will be) behind. Most of the libraries for Python/R are already written in C++, so the speed of the libraries is usually not an issue. The only place where speed matters is when you want to do some complex data preprocessing/feature extraction (which actually might happen quite often, depending on the type of data you're working with).
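To make the speed point above concrete, here is a minimal sketch (an illustration added for this thread, not a benchmark). It compares a hand-written Python loop with the built-in `sum`, which in CPython is implemented in C. The function `python_loop_sum` and the input size are arbitrary choices for the example, and the timings depend on the machine, so none are quoted here.

```python
import timeit

values = list(range(200_000))


def python_loop_sum(xs):
    """Sum the values with an explicit interpreted loop."""
    total = 0
    for x in xs:
        total += x
    return total


# Both approaches must agree on the answer.
loop_result = python_loop_sum(values)
builtin_result = sum(values)

# Time each approach over a few repetitions.
loop_time = timeit.timeit(lambda: python_loop_sum(values), number=5)
builtin_time = timeit.timeit(lambda: sum(values), number=5)

print(f"loop: {loop_time:.4f}s, builtin: {builtin_time:.4f}s")
```

Custom preprocessing or feature extraction tends to look like the explicit loop, which is where a compiled language can still pay off.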

            The biggest difference is the libraries. You can look at Python/R as tools where you have everything in one place, while it takes a lot of time to properly set up C++/Java (and it's still going to lack a few things).

            The whole issue of using C++/Java vs Python/R for ML/DS is actually quite complex. Most of the early DS people were non-programmers. This, coupled with the hype for Python as a first language (everyone hates C++, even me, and I've used it for something like the last 15 years), resulted in Python simply having more stuff for DS-related things. The quality of those libraries is another topic to rant about, but at least they are usually easy to use.

            TL;DR: It's not that simple to tell what is better.