WinUI (WASDK): Making a Robot Smarter with ChatGPT, Camera Gesture Recognition, and TTS

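Foreword

I previously wrote a blog post on classifying hand keypoints with ML.NET, which extracts the hand from an image and classifies its gesture. I have since combined that gesture classification with camera data and integrated it into 電子腦殼 (roughly, "Electronic Brain Shell"), a piece of software I develop.

電子腦殼 is a desktop application project that provides software features for ElectronBot, the desktop robot open-sourced by 稚暉君. It is developed by 綠蔭阿廣, that is, me, and is built on Microsoft's WASDK framework.

電子腦殼 is my practice project for learning WinUI development. By studying several open-source projects I have integrated a number of features into it. For example, a recognized gesture triggers speech-to-text, the transcribed text is sent to ChatGPT, and the reply is played back through text-to-speech, which lets the robot hold a conversation.

This post is a record of that hands-on work, so that you can skip the pitfalls I already hit.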

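The image link below leads to a demo video of the robot. In the conversation I asked ChatGPT to tell me the story of 駱駝祥子 (Camel Xiangzi). The story turned out a bit absurd: the first part was fine, but then it started making things up, such as Xiangzi getting a donkey and finally becoming a merchant.

If you enjoy the video, please give it a like.

Bilibili demo video link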

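Implementation approach

1. Overview of the approach

The overall flow is shown in the figure below. The drawing may not follow any standard notation, but the flow is roughly this:
[Figure: recognition flow diagram]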

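Handle the camera frame event: process each camera frame and match it against the known gestures.
The handler for the gesture recognition result calls the speech-to-text logic.
The transcribed text is sent to the ChatGPT API to obtain an intelligent reply.
The reply text is played through TTS on the robot's speaker, which completes one round of conversation.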
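
The four steps above can be sketched as one small orchestration class. This is only an illustration under stated assumptions: `ConversationPipeline`, `OnGestureAsync`, and the three delegates are hypothetical names that do not come from the 電子腦殼 source, and the real project wires these stages together through its own services.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch of the gesture -> speech-to-text -> chat -> TTS flow.
// None of these names are taken from the actual project.
public sealed class ConversationPipeline
{
    private readonly Func<Task<string>> _listenAsync;        // speech-to-text stage
    private readonly Func<string, Task<string>> _askAsync;   // chat API stage
    private readonly Func<string, Task> _speakAsync;         // TTS playback stage

    public ConversationPipeline(
        Func<Task<string>> listenAsync,
        Func<string, Task<string>> askAsync,
        Func<string, Task> speakAsync)
    {
        _listenAsync = listenAsync ?? throw new ArgumentNullException(nameof(listenAsync));
        _askAsync = askAsync ?? throw new ArgumentNullException(nameof(askAsync));
        _speakAsync = speakAsync ?? throw new ArgumentNullException(nameof(speakAsync));
    }

    // Called with the label the gesture classifier produced for a frame.
    // Returns the reply text, or null when the gesture is not the trigger
    // or nothing was heard.
    public async Task<string?> OnGestureAsync(string gestureLabel, string triggerLabel)
    {
        if (!string.Equals(gestureLabel, triggerLabel, StringComparison.OrdinalIgnoreCase))
        {
            return null;
        }

        string question = await _listenAsync();
        if (string.IsNullOrWhiteSpace(question))
        {
            return null;
        }

        string reply = await _askAsync(question);
        await _speakAsync(reply);
        return reply;
    }
}
```

Passing the stages in as delegates keeps the sketch independent of any camera, microphone, or network access, so each stage can be replaced by a stub while experimenting.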

2. Technologies used

WASDK
MediaPipe offers open source cross-platform, customizable ML solutions for live and streaming media.
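ML.NET: an open-source, cross-platform machine learning framework.

This stack is covered in my earlier article, so I will not go into it here. If you are interested, see that article:

WinUI (WASDK): detecting hand keypoints with MediaPipe and classifying gestures with ML.NET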

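Code walkthrough

1. Project overview

電子腦殼 is a standard MVVM WinUI project. It uses Microsoft's lightweight DI container to manage object lifetimes, and its MVVM layer is the framework from the Community Toolkit, which supports source generation and keeps the view model code short.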

project

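2. Core code walkthrough

Parsing gestures from the live video stream. Using the MediaCapture class in the Windows.Media.Capture namespace and the MediaFrameReader class in the Windows.Media.Capture.Frames namespace, the app creates the objects and registers a frame handler. The handler processes each video frame and passes it to the gesture recognition service. The main code is: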

// Handler subscribed to the frame-captured event
private void Current_SoftwareBitmapFrameCaptured(object? sender, SoftwareBitmapEventArgs e)
{
    if (e.SoftwareBitmap is not null)
    {
        // Normalize the frame to premultiplied BGRA8 before analysis
        if (e.SoftwareBitmap.BitmapPixelFormat != BitmapPixelFormat.Bgra8 ||
              e.SoftwareBitmap.BitmapAlphaMode == BitmapAlphaMode.Straight)
        {
            e.SoftwareBitmap = SoftwareBitmap.Convert(
                e.SoftwareBitmap, BitmapPixelFormat.Bgra8, BitmapAlphaMode.Premultiplied);
        }
        // Resolve the gesture recognition service
        var service = App.GetService<GestureClassificationService>();
        // Run the gesture analysis (fire and forget)
        _ = service.HandPredictResultUnUseQueueAsync(calculator, modelPath, e.SoftwareBitmap);
    }
}

The code involved is in:

MainViewModel

CameraFrameService
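
The frame handler above starts a recognition call for every frame without awaiting it. The post does not show whether the service skips frames while an earlier prediction is still running, so the following is only one possible way to do that, not the project's actual behavior. `FrameGate` and `TryRunAsync` are hypothetical names introduced for this sketch.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: lets at most one piece of work run at a time and
// drops any request that arrives while the previous one is still running.
public sealed class FrameGate
{
    private int _busy; // 0 = idle, 1 = a frame is being processed

    // Runs `work` unless a previous call is still in flight.
    // Returns false when the frame was dropped.
    public async Task<bool> TryRunAsync(Func<Task> work)
    {
        // Atomically switch from idle to busy; bail out if already busy.
        if (Interlocked.CompareExchange(ref _busy, 1, 0) != 0)
        {
            return false;
        }

        try
        {
            await work();
            return true;
        }
        finally
        {
            // Release the gate even if the work throws.
            Volatile.Write(ref _busy, 0);
        }
    }
}
```

In a handler like the one above, the recognition call could be wrapped as `_ = gate.TryRunAsync(() => service.HandPredictResultUnUseQueueAsync(...))`, so that frames arriving during a slow prediction are simply skipped.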

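Speech-to-text. WinUI (WASDK) inherits the modern UI of UWP and also works well with the WinRT APIs. The main object involved here is SpeechRecognizer in the Windows.Media.SpeechRecognition namespace.

Official documentation: Speech interactions; Define custom recognition constraints

Below is part of the speech-to-text code (the original post links to the full code):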

// Create a web-search topic constraint for recognition
var webSearchGrammar = new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario.WebSearch, "webSearch", "sound");
//webSearchGrammar.Probability = SpeechRecognitionConstraintProbability.Min;
speechRecognizer.Constraints.Add(webSearchGrammar);
SpeechRecognitionCompilationResult result = await speechRecognizer.CompileConstraintsAsync();

if (result.Status != SpeechRecognitionResultStatus.Success)
{
    // Disable the recognition buttons.
}
else
{
    // Handle continuous recognition events. Completed fires when various error states occur. ResultGenerated fires when
    // some recognized phrases occur, or the garbage rule is hit.
    // Register the event handlers
    speechRecognizer.ContinuousRecognitionSession.Completed += ContinuousRecognitionSession_Completed;
    speechRecognizer.ContinuousRecognitionSession.ResultGenerated += ContinuousRecognitionSession_ResultGenerated;
}

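After speech-to-text, the ChatGPT API is called to get a reply for the conversation. This is implemented with the ChatGPTSharp wrapper library.

The code is as follows: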

private async void ContinuousRecognitionSession_ResultGenerated(SpeechContinuousRecognitionSession sender, SpeechContinuousRecognitionResultGeneratedEventArgs args)
{
    // The garbage rule will not have a tag associated with it, the other rules will return a string matching the tag provided
    // when generating the grammar.
    var tag = "unknown";

    if (args.Result.Constraint != null && isListening)
    {
        tag = args.Result.Constraint.Tag;

        App.MainWindow.DispatcherQueue.TryEnqueue(() =>
        {
            ToastHelper.SendToast(tag, TimeSpan.FromSeconds(3));
        });


        Debug.WriteLine($"Recognized content---{tag}");
    }

    // Developers may decide to use per-phrase confidence levels in order to tune the behavior of their 
    // grammar based on testing.
    if (args.Result.Confidence == SpeechRecognitionConfidence.Medium ||
        args.Result.Confidence == SpeechRecognitionConfidence.High)
    {
        var result = string.Format("Heard: '{0}', (Tag: '{1}', Confidence: {2})", args.Result.Text, tag, args.Result.Confidence.ToString());


        App.MainWindow.DispatcherQueue.TryEnqueue(() =>
        {
            ToastHelper.SendToast(result, TimeSpan.FromSeconds(3));
        });


        // Voice command: "open Bilibili"
        if (args.Result.Text.ToUpper() == "打開B站")
        {
            await Launcher.LaunchUriAsync(new Uri(@"https://www.bilibili.com/"));
        }
        // Voice command: "act cute"
        else if (args.Result.Text.ToUpper() == "撒個嬌")
        {
            ElectronBotHelper.Instance.ToPlayEmojisRandom();
        }
        else
        {
            try
            {
                // Use the chatbot client factory to create a handler of the configured type, so several chat APIs can be supported
                var chatBotClientFactory = App.GetService<IChatbotClientFactory>();

                var chatBotClientName = (await App.GetService<ILocalSettingsService>()
                     .ReadSettingAsync<ComboxItemModel>(Constants.DefaultChatBotNameKey))?.DataKey;

                if (string.IsNullOrEmpty(chatBotClientName))
                {
                    // Message: "the secrets for the speech provider are not configured"
                    throw new Exception("未配置語音提供程式機密資料");
                }

                var chatBotClient = chatBotClientFactory.CreateChatbotClient(chatBotClientName);
                // Ask the selected implementation for the chat reply
                var resultText = await chatBotClient.AskQuestionResultAsync(args.Result.Text);

                //isListening = false;
                await ReleaseRecognizerAsync();
                // Convert the reply to speech and play it
                await ElectronBotHelper.Instance.MediaPlayerPlaySoundByTTSAsync(resultText, false);
            }
            catch (Exception ex)
            {
                App.MainWindow.DispatcherQueue.TryEnqueue(() =>
                {
                    ToastHelper.SendToast(ex.Message, TimeSpan.FromSeconds(3));
                });

            }
        }
    }
    else
    {
    }
}
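
The handler above obtains a chat client by name from `IChatbotClientFactory`, which is what allows several chat APIs to be supported. The post does not show the factory itself, so the following is a minimal sketch of how such a name-based factory could look. The names `IChatbotClientFactory`, `CreateChatbotClient`, and `AskQuestionResultAsync` appear in the code above; the `IChatbotClient` interface, its `Name` property, `EchoChatbotClient`, and the dictionary-based lookup are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Assumed shape of a chat client; only AskQuestionResultAsync is named in the post.
public interface IChatbotClient
{
    string Name { get; }
    Task<string> AskQuestionResultAsync(string message);
}

public interface IChatbotClientFactory
{
    IChatbotClient CreateChatbotClient(string name);
}

// Hypothetical stand-in client that echoes the question, useful for offline testing.
public sealed class EchoChatbotClient : IChatbotClient
{
    public string Name => "Echo";

    public Task<string> AskQuestionResultAsync(string message) =>
        Task.FromResult("You said: " + message);
}

// Looks clients up by name (case-insensitive) among those registered.
public sealed class ChatbotClientFactory : IChatbotClientFactory
{
    private readonly Dictionary<string, IChatbotClient> _clients =
        new Dictionary<string, IChatbotClient>(StringComparer.OrdinalIgnoreCase);

    public ChatbotClientFactory(IEnumerable<IChatbotClient> clients)
    {
        foreach (var client in clients)
        {
            _clients[client.Name] = client;
        }
    }

    public IChatbotClient CreateChatbotClient(string name) =>
        _clients.TryGetValue(name, out var client)
            ? client
            : throw new ArgumentException($"No chatbot client is registered as '{name}'.", nameof(name));
}
```

With a layout like this, adding another chat API would mean registering one more `IChatbotClient` implementation, while the recognition handler stays unchanged.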

Converting the reply text to speech and playing it. With the SpeechSynthesizer class in the Windows.Media.SpeechSynthesis namespace, the following code converts text into a stream:

using SpeechSynthesizer synthesizer = new();
// Create a stream from the text. This will be played using a media element.
var synthesisStream = await synthesizer.SynthesizeTextToStreamAsync(text);

Then a MediaPlayer object is used to play the speech:


/// <summary>
/// Plays the given text as speech.
/// </summary>
/// <param name="content">The text to convert to speech and play.</param>
/// <returns></returns>
public async Task MediaPlayerPlaySoundByTTSAsync(string content, bool isOpenMediaEnded = true)
{
    _isOpenMediaEnded = isOpenMediaEnded;
    if (!string.IsNullOrWhiteSpace(content))
    {
        try
        {
            var localSettingsService = App.GetService<ILocalSettingsService>();

            var audioModel = await localSettingsService
                .ReadSettingAsync<ComboxItemModel>(Constants.DefaultAudioNameKey);

            var audioDevs = await EbHelper.FindAudioDeviceListAsync();

            if (audioModel != null)
            {
                var audioSelect = audioDevs.FirstOrDefault(c => c.DataValue == audioModel.DataValue) ?? new ComboxItemModel();

                var selectedDevice = (DeviceInformation)audioSelect.Tag!;

                if (selectedDevice != null)
                {
                    mediaPlayer.AudioDevice = selectedDevice;
                }
            }
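            // Get the TTS service instance (the generic type argument is missing in the source text)
            var speechAndTTSService = App.GetService();
            // Convert the text to a stream
            var stream = await speechAndTTSService.TextToSpeechAsync(content);
            // Play the stream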
            mediaPlayer.SetStreamSource(stream);
            mediaPlayer.Play();
            isTTS = true;
        }
        catch (Exception)
        {
        }
    }
}

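That completes one full round of recognition and conversation. The software's UI is shown in the figure below. If you are interested, click the image to open the project's source repository and look at its other features:

電子腦殼 source code link

Personal reflections

I feel that the .NET ecosystem still lags behind somewhat, and ML.NET in particular has too few ready-made libraries. Fewer people take part, and transferring knowledge has a cost: people who know other machine learning frameworks may not know .NET.

So, as a member of the community, I think we need to go out and then come back. Going out means learning other machine learning frameworks first; coming back means applying that knowledge in .NET. The more libraries we build this way, the more the community will thrive.

And I will be able to copy and paste even more of everyone's code.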

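Recommended references, documentation, and projects:

得意黑 (Smiley Sans), the font used in 電子腦殼
Project template: TemplateStudio
Watch-face reference project: 一個番茄鐘 (a Pomodoro timer)
Community toolkit: CommunityToolkit
Control gallery demo: WinUI-Gallery
Image processing library: opencvsharp
Emoji8, an expression recognition sample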
ChatGPTSharp
WASDK documentation
MediaPipe
MediaPipe.NET
ML.NET
hand-gesture-recognition-using-mediapipe
Control DJI Tello drone with Hand gestures

