Converting Dataframe Columns to Vectors in Spark









I am new to Spark and am trying to use some of the MLlib functions for a school project. All of the documentation for doing analytics with MLlib seems to operate on vectors, and I was wondering whether I could run the analysis directly against a DataFrame instead of building vectors in Spark.



For example, the Scala documentation for PCA shows:

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(3).fit(df)

The full example is here: https://spark.apache.org/docs/latest/ml-features.html#pca



Is there a way to avoid creating these vectors by hand and instead point the algorithm at the DataFrame I have already created? My DataFrame has 50+ columns and 15,000+ rows, so constructing vectors for each column manually isn't really feasible.
Does anyone have any ideas or suggestions? Lastly, for my project I am limited to Spark in Scala; I am not allowed to use PySpark, Java for Spark, or SparkR.
If anything is unclear, please let me know.
Thanks!










      scala apache-spark apache-spark-ml






edited yesterday by desertnaut

asked yesterday by Lucas Newman




1 Answer






What you are looking for is the VectorAssembler transformer, which takes an array of DataFrame columns and produces a single vector column. You can then use an ML Pipeline that chains the assembler with PCA.



          Help docs are here



          1. vector assembler: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler


          2. ml pipeline: https://spark.apache.org/docs/latest/ml-pipeline.html
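To make this concrete, here is a minimal, untested sketch of that pipeline. It assumes a DataFrame `df` whose feature columns are all numeric, with one column named "label" to exclude; both names are placeholders to adapt to your own schema.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{PCA, VectorAssembler}

// Assemble all columns except a hypothetical "label" column into one vector column.
val assembler = new VectorAssembler()
  .setInputCols(df.columns.filterNot(_ == "label"))  // or an explicit Array("col1", "col2", ...)
  .setOutputCol("features")

// Reduce the assembled features to 5 principal components (pick k for your data).
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(5)

// Chain the two stages so fit/transform runs assembly and PCA in one pass.
val pipeline = new Pipeline().setStages(Array(assembler, pca))
val model = pipeline.fit(df)
val result = model.transform(df).select("pcaFeatures")
```

Since the pipeline fits on the DataFrame directly, you never build vectors by hand; the assembler does it for every row.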


If you need more than PCA offers, you can drop down to the low-level RDD API.
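As a sketch of that lower-level route (again untested, and assuming `df` already has a single vector column named "features", e.g. produced by VectorAssembler), you can compute principal components with mllib's RowMatrix:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Convert the DataFrame's ml.linalg vectors to the older mllib.linalg vectors
// that RowMatrix expects.
val rows = df.select("features").rdd
  .map(row => Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.Vector](0)))

// Compute the top 5 principal components as a local Matrix.
val mat = new RowMatrix(rows)
val pc = mat.computePrincipalComponents(5)
```

The result is a local matrix of loadings you can multiply against the RowMatrix (mat.multiply(pc)) to project the data.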






edited 4 hours ago by sramalingam24










answered yesterday by ookboy24



















