Recording, March 5, 2025, 5:03 PM

Throughout this semester, we have invited 14 outstanding speakers from various fields of IT technology, covering devices and software. Today we are honored to welcome a truly distinguished speaker, Professor Chan Woo Kim from the Department of Artificial Intelligence at Korea University. Let me briefly introduce Professor Kim. Professor Kim earned his bachelor's and master's degrees in electrical engineering from Seoul National University in 1998 and 2001. He then pursued his Ph.D. at the Language Technologies Institute of Carnegie Mellon University, which he completed in December 2010. Dr. Kim's impressive career includes serving as a speech scientist at Microsoft from 2011 to 2013, and as a senior software engineer on Google's speech team from 2013 to 2018. In 2018, he joined Samsung Research, where he led advances in speech recognition, natural language understanding, and speech enhancement as an executive vice president. Simply put, he has been at the forefront of developing Bixby. Please join me in giving a warm welcome to Professor Chan Woo Kim.

Thank you very much. It's my great honor to have a chance to talk here. As introduced, my name is Chan Woo Kim, and I'm currently a professor at Korea University. In my case, I actually spent 17 years in industry, including at Microsoft, Google, and Samsung Electronics, and I joined academia only last year, so I have spent one year at Korea University. I'm experiencing a lot of change in my lifestyle; these days I spend a lot of time teaching students, writing proposals, and so on.

Today I will talk about the LLM landscape and its impact. I intentionally didn't include any equations, so this might be quite introductory material. As introduced, I worked for several companies, including Microsoft and Google, where I was primarily involved in voice assistants. When I worked for Google, I worked on Google Home and Google Assistant, and I was the head of the Language and Voice team at Samsung Research, where I was in charge of developing the speech recognition engine, text-to-speech, and natural language understanding. In 2022, we started building our own large language model, and we commercialized it for the Galaxy S24. So I worked on those kinds of systems: Google Assistant, Samsung Bixby, and so on. Last year, there was an IT press article about my profile.

So, topics to be discussed. I'll give a short overview of LLMs and what an LLM means, then discuss LLM training and inferencing. I will briefly touch on the Transformer architecture, BERT, and GPT, and I will compare some well-known LLMs such as ChatGPT from OpenAI, the open-source LLaMA from Meta, and Google Gemini. I will also briefly talk about Samsung Gauss. And as you have probably heard, there is DeepSeek from China; they claim that they spent only 5 or 6 million USD to build a top-performing large language model, so I will discuss the DeepSeek LLM a little bit. At Samsung, we focus on on-device AI, so I will briefly talk about model compression techniques and how we can implement AI functionalities for on-device applications. I will also briefly discuss multimodal large language models, and I think that industry monetization is very important as well. Then I will finish my talk with short conclusions. So let me first start with the concept of a large language model.
Before 2010, when I was a Ph.D. student, for speech recognition we used the hidden Markov model and the Gaussian mixture model, and I believe some of you have probably heard about them. For inferencing, we used Viterbi decoding; Viterbi decoding is a quite prevalent technique in digital communications. But everything changed after 2010, roughly 2012 or 2013, when I was working for Google. Suddenly, in the computer vision area, deep learning became very popular because it showed remarkable performance as the number of layers increased, and the same happened in speech, translation, and natural language processing.

An important development is the encoder-decoder structure. There were two very important papers published in 2014. You probably know Professor Kyunghyun Cho; I think he is a professor at New York University, and he published a paper about the encoder-decoder structure. I'm sure that many of you are familiar with the encoder-decoder concept. You have also probably heard about Ilya Sutskever; he is a co-founder of OpenAI, and he published a very similar paper. The structures are slightly different, but the concepts are basically the same. In the machine translation area, they used a fully neural network structure, and in Professor Kyunghyun Cho's approach, they used an encoder and a decoder. The decoder has a feedback path, so it's called an autoregressive structure. The encoder can construct a kind of abstraction from the input sequence, and the decoder has the capability of generating another sequence. If we connect the encoder and the decoder, then it has the power of converting an input sequence into another sequence.

Using that characteristic, and by using the sequence cross-entropy, we can train the model. I think you have probably heard about it; cross-entropy loss is widely used in machine learning training. Cross-entropy is highly related to KL divergence, so it's basically measuring the distance between two probability distributions. By using the sequence cross-entropy, we can train the encoder-decoder structure to generate the expected output. So they used this kind of structure: encoding is done first, and using the decoder, it can generate the output sequence. There was a big shift in the translation area. As an example, the input sequence is "I am a student" and the output is the French translation, which I don't know how to pronounce. Anyhow, if we use that kind of data, then we can train the model.

In the decoder-only structure, there is no separate encoder. If we use the decoder, we can also use the same parameters to do the encoding as well. In Ilya's paper, he didn't explicitly use an encoder structure but used a decoder-only structure. During the encoding phase, the same parameters were used for the encoding purpose, and in the decoding phase, he used the feedback path that can generate the sequence. He used this kind of structure for sequence-to-sequence mapping. But anyway, those two structures are very similar. You can clearly see that these days large language models are based on the decoder-only structure. One of the authors of the sequence-to-sequence paper is Ilya Sutskever. He co-founded OpenAI, and at OpenAI they started using the self-supervised training approach with a much larger structure, and eventually it started showing remarkable performance. That might be the beginning of the generative model structure.
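Coming back to the sequence cross-entropy mentioned above: here is a minimal numpy sketch I'm adding for illustration (not anything from the talk's slides or actual training code), showing that the per-token cross-entropy against a one-hot target equals the target's entropy plus the KL divergence, and that a sequence loss is just this per-token quantity summed over the decoder's output positions.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i  (only terms with p_i > 0 contribute)."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def entropy(p):
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]))

# Toy vocabulary of 4 tokens; the reference next token is index 1 (one-hot target),
# and q is the model's softmax output for that position.
p = np.array([0.0, 1.0, 0.0, 0.0])
q = np.array([0.1, 0.7, 0.1, 0.1])

print(cross_entropy(p, q))               # ~0.357
print(entropy(p) + kl_divergence(p, q))  # same value: H(p, q) = H(p) + KL(p || q)

# A sequence-level loss is just this per-token loss summed (or averaged) over
# every target position of the decoder output.
```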
Before talking about the large language model in detail, let's define what the input and the output of a large language model might be. As you probably guessed, the input and the output are text. The unit itself is called a token, and a token is a sub-word unit. In human language, the text sequence consists of words, but there is no guarantee that words are optimal for handling these kinds of machine-learning-based systems. So people started using certain algorithms; famous ones include BPE, which stands for byte pair encoding, and another approach is the unigram approach. In large language models, people usually use the BPE approach: in the case of BPE, we first start with a small set of units, which are basically graphemes, and we increase the vocabulary from there. Anyhow, you can just assume that a token is a kind of sub-word unit. In language processing, when we are using a neural network structure, we usually use tokens, and the length of a token is usually a little bit shorter than the length of a word.

As you probably know, there was a big shift in neural network structures. When I worked for Google, we used the feed-forward network structure in 2013 or '14. I remember that in 2015 we used RNN-based structures, and one famous structure was the LSTM. And I remember that in 2017 the Transformer paper was published; the title is "Attention Is All You Need," as you probably know. In 2018, people started using self-supervised training with the encoder-only structure. In the previous slide I mentioned the encoder-decoder structure; if you use only the encoder and you perform self-supervised training, then that's the BERT structure. BERT stands for Bidirectional Encoder Representations from Transformers.

The Transformer is actually a very simple structure. It consists of the attention block and a feed-forward network. In the feed-forward network, there are basically two dense layers with a nonlinear activation between them. In the attention structure, we use three inputs, which are called the query, key, and value. We calculate the correlation between the query and the key, obtain the attention weights, and multiply the attention weights with the values. So the idea of attention is very simple. But because attention doesn't contain any nonlinearity, we rely on the feed-forward network to add it. In the Transformer structure, we use the attention layer in both the encoder and the decoder. When the encoder-decoder structure first appeared, people didn't use the attention approach at all; in 2015, people started using attention between the encoder and the decoder, and in 2017, within the encoder and decoder themselves, they used only attention. That's the start of the Transformer architecture.

For sequence processing, the Transformer shows significantly better results than the LSTM structure. The reason is actually quite obvious: in the case of the LSTM, it doesn't keep the history of all the previous values, but in the Transformer structure you are storing the entire history in memory, and you are calculating the correlation between the query and the key. Because the calculation is a kind of more brute-force approach, it ensures better performance. But of course, the disadvantage is that with the Transformer structure, the computational cost is O(L²).
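As a rough illustration of that query/key/value computation, here is a toy numpy sketch of single-head scaled dot-product attention (my own addition, assuming the standard formulation rather than anything specific shown on the slides); the L × L weight matrix it builds is exactly where the quadratic cost discussed next comes from.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (L, d) matrices. Returns (L, d) outputs and the (L, L) weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # correlation between every query and every key: L x L
    weights = softmax(scores, axis=-1)  # attention weights
    return weights @ V, weights         # weighted sum of the values

L, d = 6, 8                             # sequence length and per-head dimension
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, L, d))

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)               # (6, 8) (6, 6)  <- the L x L matrix is the O(L^2) cost
```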
When the sequence length is represented by capital L, the computational cost is O(L²), and that is probably one reason why NVIDIA earns so much money: we want longer sequences, but if we double the length, the computational cost increases four times.

Next, I'd like to briefly talk about the previous voice assistant structure. The actual voice assistant structure is really complicated; this is just a very simplified diagram. The first stage is speech recognition. Speech recognition is basically a block that converts the acoustic feature sequence into text. Once we obtain text from the acoustic feature sequence, we use the NLU block; NLU stands for natural language understanding. The objective is finding the user's intention. As an example, if the user says, "How's the weather in Seoul this afternoon?" then the user's intention is finding weather information. In voice assistants like Google Assistant or Bixby, we predefine the user intentions; as an example, in the case of Samsung Bixby, there are more than 1,000 predefined intentions. We also need to find some attributes associated with the intention, called slots, which carry additional information like location. In this case, the location slot is "Seoul" and the time slot is "this afternoon." So traditional NLU is basically finding the intention and finding the slot values; that's the role of NLU. After finding the user's intention and the slot values, we need to find the relevant information; those are called actions, so we do action planning. We also generate the response text using NLG, which stands for natural language generation, and after the text is created, using speech synthesis we can generate the waveform.

This is the basic structure, but the actual structure is much more complicated. We have things like domain classification and named entity dictionaries. In the current structure, for example in the case of Samsung Bixby, we have a few hundred capsules. For different domains, like weather and so on, we use different capsules, so the first stage is a domain classifier. Based on the user's input, we first need to identify which capsule should handle it, and once we assign a specific capsule to handle the user's input, each capsule has its own natural language understanding block.

That's the traditional structure, but it has huge disadvantages as well. One disadvantage is that in some cases the user's speech contains multiple intentions. As one example, the user can say, "What's the weather like today, and also I want to see my flight information." In that case, there are two intentions, but the current Bixby cannot handle this properly. It also doesn't keep context information. As an example, if you ask, "How is the weather in Seoul today?" then Bixby might answer in some way. After listening to the answer, you might ask, "How about tomorrow?" Bixby doesn't remember the previous question, so it doesn't have the ability to keep the context. But if you have used ChatGPT, it can handle multiple intentions and it also keeps the history. So a large language model is much more flexible than traditional natural language understanding.

In this slide, I'm showing some recent advances in speech and language processing, and I used different colors to represent them; speech algorithms are represented in yellow.
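To make the intent-and-slot idea concrete, here is a deliberately toy, rule-based stand-in (my own illustration; the real Bixby pipeline uses trained domain classifiers and per-capsule NLU models, and the pattern names below are made up):

```python
# Toy NLU: map a single-intent utterance to (intent, slots) with hand-written rules.
import re

INTENT_PATTERNS = {
    "get_weather": re.compile(r"\bweather\b", re.IGNORECASE),
    "show_flight_info": re.compile(r"\bflight\b", re.IGNORECASE),
}

def parse_utterance(text):
    """Return (intent, slots) for a single-intent utterance."""
    intent = next((name for name, pat in INTENT_PATTERNS.items() if pat.search(text)), "unknown")
    slots = {}
    m = re.search(r"\bin ([A-Z][a-z]+)", text)  # crude location slot
    if m:
        slots["location"] = m.group(1)
    m = re.search(r"\b(today|tomorrow|this afternoon)\b", text, re.IGNORECASE)
    if m:
        slots["time"] = m.group(1)
    return intent, slots

print(parse_utterance("How's the weather in Seoul this afternoon?"))
# ('get_weather', {'location': 'Seoul', 'time': 'this afternoon'})
```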
Some important algorithms in natural language processing are shown as well. The attention-based encoder-decoder approach appeared in 2015. When I worked for Google, there was an intern named William Chan, and I remember that at that time, in 2015, he was applying the attention-based encoder-decoder structure to speech recognition as well. At first, people used the attention-based encoder-decoder structure, but at that time there were still a lot of limitations. The vanilla attention structure doesn't have streaming capability, and for speech recognition, streaming capability is really, really important. If you use an attention-based encoder-decoder, the decoder will not operate until the entire encoding is finished, which means that streaming is impossible. But people resolved those kinds of technical issues one by one.

One important approach is the Transformer approach, which I briefly talked about, and people started using the self-supervised approach in language processing roughly from 2018. Some of you might know the BERT structure; BERT stands for Bidirectional Encoder Representations from Transformers. The original Transformer structure consists of the encoder and the decoder, and in BERT they use only the encoder structure. The training objective is the masked language model. Actually, in the original paper they used two criteria: one is the masked language model, often abbreviated as MLM, and the second is NSP, next sentence prediction. But in later papers like RoBERTa, it turned out that NSP might not be that important, so MLM is actually the most important part of BERT training. In MLM, when you are given a sentence, you randomly mask out some tokens, and the model is trained to predict those masked tokens.

The biggest difference between the conventional approach and the self-supervised approach is that with the conventional approach, the model is trained from scratch to perform a specific task, but with self-supervised training, the model first learns the language itself. The training is not targeted at a specific task; the model is trained to understand the language itself. Once the pre-training is done, then for a specific task we can fine-tune it using a relatively small amount of labeled data. The BERT approach became very popular because it showed much better performance than previous approaches on tasks like question answering and natural language understanding.

At that time, OpenAI also published a paper about the generative pre-trained transformer, and in GPT the idea is actually quite similar. They also use self-supervised training, but instead of the masked language model, they use next-token prediction: the model is trained to predict the future token. That's the GPT structure. BERT is the encoder-only structure and GPT is the decoder-only structure. Google also published a paper about T5, and in that case the self-supervised training approach was applied to the entire encoder-decoder structure. So all of them are closely related; those might be the start of self-supervised training, with BERT for the encoder structure, GPT for the decoder structure, and Google T5 for the encoder-decoder structure.
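Here is a small numpy sketch of the BERT-style masking step described above (my own illustration of the commonly described 80/10/10 recipe, not code from the talk): roughly 15% of positions become prediction targets, and the loss is computed only at those positions.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=None):
    """BERT-style masking: pick ~15% of positions as prediction targets.

    Of the selected positions, 80% are replaced with [MASK], 10% with a random
    token, and 10% are left unchanged; labels are -100 everywhere else so the
    loss is only computed on the selected positions.
    """
    rng = rng or np.random.default_rng()
    token_ids = np.array(token_ids)
    labels = np.full_like(token_ids, -100)

    selected = rng.random(len(token_ids)) < mask_prob
    labels[selected] = token_ids[selected]

    r = rng.random(len(token_ids))
    token_ids[selected & (r < 0.8)] = mask_id                  # 80% -> [MASK]
    random_pos = selected & (r >= 0.8) & (r < 0.9)             # 10% -> random token
    token_ids[random_pos] = rng.integers(0, vocab_size, random_pos.sum())
    return token_ids, labels                                   # remaining 10% unchanged

ids = [12, 57, 301, 8, 99, 4021, 7]
corrupted, labels = mask_tokens(ids, mask_id=103, vocab_size=30000, rng=np.random.default_rng(0))
print(corrupted, labels)  # the model is trained to predict `labels` at the selected positions
```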
Around 2018, I think BERT might have been the most famous structure, because the structure is quite compact and it showed very good performance, especially for natural language understanding and span detection. At that time the model size was less than 1 billion parameters. But in 2020, you probably remember the appearance of GPT-3. In GPT-3 the number of parameters is 175 billion, if I remember correctly. When the number of parameters increased and they started using a large amount of data, the language model showed very interesting characteristics such as in-context learning, and it turned out that it can handle various kinds of natural language processing functionality. In the case of GPT-3, primarily only the pre-training was done. Later they published a paper about InstructGPT. In InstructGPT, they started using instruction following, which is supervised fine-tuning, often abbreviated as SFT, and they started using reinforcement learning as well: based on human feedback, they train a reward model, and using that reward model they can improve the original model to follow the user's preference. From GPT-3.5 or GPT-4, OpenAI no longer reveals the architectural details; they don't publish a paper about the structure, but people are guessing that GPT-4 was initially based on the mixture-of-experts structure and that the number of parameters might be more than 1 trillion in total. They also started using multimodal models, such as GPT-4o, which also handles speech. In the speech recognition area, important structures include the Conformer transducer, and people also started using self-supervised training for speech as well. So these days, self-supervised training on a large amount of data followed by fine-tuning might be the standard norm in building AI blocks.

During training, the objective is predicting the next token. For example, when you have a sentence like "it is unbelievable," then given the input tokens up to a certain point, the objective is predicting the next token, for example "able." During the inference phase, we have the autoregressive structure: the output is fed back as the input, so the model has the capability of generating a sequence, and the sequence will be generated until it produces a special token, the end-of-sentence token. So in the inference phase we have the autoregressive structure.

I briefly mentioned that these days the training procedure starts with pre-training; during the pre-training phase, the objective is predicting the next token. Then we perform supervised fine-tuning, and after that we perform reinforcement learning. Of course, there are variations: these days we also apply DAPT, which stands for domain-adaptive pre-training; TAPT, which means task-adaptive pre-training, where you perform additional pre-training; and MAPT, which means modality-adaptive pre-training. So you might have some additional pre-training stages, and in supervised fine-tuning you do fine-tuning with labeled data.

GPT-3 was released in 2020. At that time, OpenAI said that they used 0.3 trillion tokens to train the model. It was a huge amount of data, but these days companies are using more than 10 trillion tokens for training, so in a short period of time the amount of training data increased a lot. We also do reinforcement learning to improve the model performance. In the future, the market size will increase a lot; LLMs can be deployed in places like call centers. Let me give some examples from when I worked for Samsung.
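Going back to the autoregressive inference loop described above, here is a toy sketch of greedy decoding (my own illustration; `logits_fn` and the dummy model below are made-up stand-ins for a trained decoder-only LLM):

```python
import numpy as np

def greedy_generate(logits_fn, prompt_ids, eos_id, max_new_tokens=20):
    """Feed the output back as input until the end-of-sentence token appears.

    `logits_fn(ids)` stands in for a trained decoder-only model: it takes the
    whole token sequence so far and returns one logit per vocabulary entry for
    the next position.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(logits_fn(ids)))  # greedy: pick the most likely next token
        ids.append(next_id)                       # autoregressive feedback path
        if next_id == eos_id:
            break
    return ids

# Dummy "model": always prefers token (last_id + 1) and reaches EOS (id 5) after id 4.
def dummy_logits(ids, vocab_size=6):
    logits = np.zeros(vocab_size)
    logits[min(ids[-1] + 1, 5)] = 10.0
    return logits

print(greedy_generate(dummy_logits, prompt_ids=[1], eos_id=5))  # [1, 2, 3, 4, 5]
```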
Also, in Korea, many people are spending money to learn foreign languages. When I worked for Samsung, many people used services like English tutoring, having phone conversations with native speakers, and they needed to pay something like 50 dollars per hour. And when I was a student, when students wrote papers there were a lot of grammatical errors, so they sent the paper to a company that corrects all the grammatical mistakes and sometimes improves the sentences themselves. People used a lot of those kinds of services. But after ChatGPT appeared, I heard that a lot of those companies went bankrupt, because people no longer need them. These days, large language models also have the capability of speech recognition and speech synthesis, so those models can be used for language tutoring as well, and I think that in the future people will probably practice with an LLM instead of paying for phone conversations with native speakers. Customer service, content generation, code generation, chatbots, translation: the market will increase a lot. Annually the increase will be roughly 36 percent, which is really a lot. You probably know that the GDP growth rate of Korea is something like 2 percent, but the LLM market will increase by almost 40 percent every year.

Now I'd like to talk about the power of a large language model. In the old days, when we developed Google Assistant or Samsung Bixby, we needed to prepare a large number of task-specific AI models: summarization models, spell correction models, schedule extraction models, topic classification models, named entity models, and so on. I'm just showing you some of those examples; those are all discrete components. But if we use a large language model, then everything can be handled by a single multimodal large language model. With the traditional approach, we need multiple models to build an assistant system, it takes a very long time to develop all those models, and the total cost is high. If we use a large language model, even though at first we need to spend a lot of money to build the model, one single model can do everything.

Then the question might be: why are there so many different LLMs, such as OpenAI's models or Meta's LLaMA? For different domains, the requirements are all different. There are different domains like the health domain; for Samsung we care about device control, because Samsung is a device company. Some customers are more interested in English tutoring, and in some applications like banking, the vocabulary and the questions might be very different from other domains. In some areas, customers are more interested in performance, in other areas cost might be more important, and in still other areas privacy might be more important. As one example, when I worked for Samsung, we also wanted to use a large language model inside the company, but if we use ChatGPT, then there are some concerns about confidential information. That was one of the motivations for developing our own large language model. So for different domains and different types of tasks, each LLM has different kinds of advantages, and that might be the reason why one single LLM cannot dominate all the domains. So now I'm showing a comparison between well-known large language models.
You probably know about GPT-4o; it's a multimodal model, the manufacturer is OpenAI, and its advantage is that its performance is really, really good. OpenAI o1 is the best model in reasoning, and it is also very good in terms of safety. In academia, including my lab, people are using LLaMA, and one reason is that it's open source. OpenAI models are proprietary, so we cannot download them and we cannot easily fine-tune them on our internal data, but in the case of LLaMA, it's open source. Google developed Gemini, and Gemini also shows very good performance; in terms of coding skill it's one of the best. So there are areas where one specific model can be better than the others. Those might be the most famous models these days.

If we compare the performance, we can see that OpenAI is actually leading the race, but the difference is relatively small. We can look at some well-known benchmarks like MMLU, which is a natural language understanding benchmark, and math benchmarks such as GSM8K. For different tasks, the GPT series might usually be better than LLaMA, Gemini, or Claude 3, but the difference is not huge; the other big tech companies are also building very high-performance models. At the university, we usually use the LLaMA structure because the performance is sufficiently good and it's open source.

There is a timeline for it: I think LLaMA 1 was released in February 2023. They initially released four different models, ranging from 7 billion up to 65 billion parameters. At that time I was working for Samsung, and we were quite surprised by the result. The 7 billion and 13 billion models are really small models, but for training they used a lot of data; if I remember correctly, to train those small models they used one trillion tokens, and for the larger models they used up to 1.4 trillion tokens, and their paper showed that the performance kept improving as they increased the amount of training data. Compared to LLaMA, OpenAI's GPT-3 is based on a 175-billion-parameter structure, so the model size is much larger, but they used only 0.3 trillion tokens for training. It shows that the earlier GPT-3 might have been somewhat under-trained; they didn't use a sufficient amount of data. The LLaMA 2 structure was released in July 2023, and LLaMA 3 was released last year; LLaMA 3 also shows good performance. They also released LLaMA 3.2, which includes a very small model, a 1 billion parameter model. In the case of the 1 billion model, you can fine-tune it using a low-cost GPU like an RTX 4090. Unlike the OpenAI models, because the LLaMA structure is open source, there are a lot of variations: LLaMA, Alpaca, and Vicuna. They are all animal names, and they all look very similar, with differences in some details. So there are a lot of variations of the LLaMA structure.

Here I'm showing some well-known GPT structures like GPT-1 and GPT-2. The GPT-1 size is only about 100 million parameters, but at that time 100 million parameters was already considered large. I remember the size of BERT is also similar: if we look at the BERT base model, the number of parameters is roughly 100 million, and the size of a speech recognition model might be roughly 20 million in the case of the Conformer. In GPT-2, the number of parameters increased to 1.5 billion, and in GPT-3, as I mentioned earlier, the size increased to 175 billion, with the data being Common Crawl text.
For GPT-4, they didn't release the number of parameters, but people believe that it might be more than 1 trillion, and they probably use the mixture-of-experts structure, so not all of those parameters are activated at the same time. A more compact model was also released, and last year OpenAI also released the o1 model and the o3 model; with the o1 model, they improved the reasoning capability.

Let me compare the advantages of the different models. OpenAI LLMs have been showing the world's best performance. Google also has a wide range of nearly world-best models; the performance might be slightly behind the OpenAI models, but the difference is not that large, and they have advantages in on-device models. You have probably heard about Gemini Nano, and Samsung also incorporated Gemini as well. Because Google has its own voice assistant, Google Gemini is well integrated with it, and in the Galaxy S25 the default voice assistant was changed from Bixby to Google Gemini. Meta's LLaMA provides a wide range of open-source models, and individual companies can use those open-source models for fine-tuning purposes. Alibaba has also created high-performance large language models, and their performance is especially good for Asian languages. In our lab, we are actually using one of the compact models, because the model size is only around 1.4 billion parameters, but it shows quite good performance.

And you probably know about DeepSeek's high-performing model. They are claiming that they spent only about 6 million dollars to build the model, but of course I think there might be some exaggeration as well. That was quite a shock, because if anybody can build a very high-performing model spending only 6 million dollars, then there is no reason for NVIDIA to sell so many GPUs. They used lower-cost NVIDIA H800 GPUs, but the performance is actually very close to that of the OpenAI o1 model. As a result, NVIDIA's market capitalization dropped by almost 600 billion USD. If we compare the DeepSeek model's performance with the OpenAI models, the difference is very, very small. But I think the 6 million dollar figure is a kind of exaggeration: they only counted the cost of one final training run, but to build the model there must have been a lot of trial and error, and they didn't include data preparation, collection, and so on; they just calculated the one-time training cost. There are also some rumors that they probably used text generated from the OpenAI models. If we use a high-performing large language model, then it's relatively easy to build another high-performing model using techniques like knowledge distillation. When we are doing instruction-following fine-tuning, we usually need high-quality data, and to generate that kind of high-quality data, if we use an existing model, then we can build the new model more easily. So there is some speculation that DeepSeek might have used those OpenAI models, which might be considered a kind of license violation. But anyhow, it's true that DeepSeek's performance is really impressive. Because we don't have much time, briefly: they use the mixture-of-experts structure, and unlike the OpenAI models, the DeepSeek model is open source, and they efficiently used knowledge distillation in the model. And I do believe the speculation to some degree, because these days many companies are using OpenAI model outputs to create fine-tuning data.
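Since knowledge distillation came up here, this is a minimal sketch of the soft-label distillation loss (my own illustration, assuming the standard temperature-scaled KL formulation rather than anything DeepSeek or OpenAI has published):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: KL(teacher || student) on temperature-softened distributions.

    In practice this term is usually combined with the ordinary cross-entropy on hard
    labels; the logits here are per-token vocabulary scores from the two models.
    """
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = np.sum(t * (np.log(t) - np.log(s)), axis=-1)
    return (temperature ** 2) * kl.mean()

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 32000))   # e.g. 4 token positions, 32k-entry vocabulary
student = rng.normal(size=(4, 32000))
print(distillation_loss(student, teacher))
```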
Some organizations are also concerned about the data privacy issue. In the case of the DeepSeek models, it's true that they perform really well, but at the same time, when they say that they only spent about 6 million dollars, I think there is a kind of exaggeration. We should acknowledge that they did very well, but at the same time there are also some concerns, including about data privacy, and there are geopolitical issues as well.

And I think that these days on-device processing is becoming popular. At Samsung, we also care about on-device processing, because Samsung is a device company; unlike Google or Microsoft, we are focused on devices. Samsung manufactures roughly 500 million devices every year, and more than 1 billion smartphones are shipped every year, so I think that in the future on-device AI will become more and more important. The advantages of on-device processing include latency, privacy, server cost, and personalization. For model compression, well-known techniques include quantization and pruning, both unstructured and structured pruning; for large language models, structured pruning is usually the more practical option. Quantization is not trivial, because usually ARM processors are not equipped with 4-bit ALUs. You have probably heard of BitNet from Microsoft; they even released 1-bit and 2-bit versions as well. Knowledge distillation and low-rank approximation are also widely used. Google has also released on-device models, and I think that models with around 10 billion parameters will be commercialized for smartphones as well.

If we use a multimodal model, then it can take text, images, speech, and video as input, and it can generate not only text but also images, audio, and video. I think that in the future it can be used for education, question answering, automatic reporting, and data analysis; there are a lot of applications. I will skip this part because we are almost out of time. I think the future trend of LLMs might be multimodal LLMs, and I think that in the future robots will become more important, so tighter integration with robots will matter. These days vision-language models are popular, and action planning might be a very important application as well.

So this is the conclusion of today's talk. Today I talked about large language models: their current status, what their future might be, that they can be used for on-device AI as well as for commercial applications, and that there might be some US-China competition issues. Thank you very much for your time. Thank you.
